
Efficient Visibility Approximation for Game AI using Neural Omnidirectional Distance Fields

Zhi Ying [email protected] (0009-0008-8390-3366), Ubisoft La Forge, Shanghai, China; Nicholas Edwards [email protected] (0009-0007-4621-541X), Ubisoft, Montreal, Canada; and Mikhail Kutuzov [email protected] (0009-0004-4049-2924), Ubisoft, Montreal, Canada
Abstract.

Visibility information is critical in game AI applications, but the computational cost of raycasting-based methods poses a challenge for real-time systems. To address this challenge, we propose a novel method that represents a partitioned game scene as neural Omnidirectional Distance Fields (ODFs), allowing scalable and efficient visibility approximation between positions without raycasting. For each position of interest, we map its omnidirectional distance data from the spherical surface onto a UV plane. We then use multi-resolution grids and bilinearly interpolated features to encode directions. This allows us to use a compact multi-layer perceptron (MLP) to reconstruct the high-frequency directional distance data at these positions, ensuring fast inference speed. We demonstrate the effectiveness of our method through offline experiments and in-game evaluation. For in-game evaluation, we conduct a side-by-side comparison with raycasting-based visibility tests in three different scenes. Using a compact MLP (128 neurons and 2 layers), our method achieves an average cold start speedup of 9.35 times and warm start speedup of 4.8 times across these scenes. In addition, unlike the raycasting-based method, whose evaluation time is affected by the characteristics of the scenes, our method’s evaluation time remains constant.

Visibility for Game AI, Line of Sight, Omnidirectional Distance Fields, Neural Implicit Representation, Multi-resolution Grid Encoding
copyright: acmlicensed; journal: PACMCGIT; journal year: 2024; volume: 7; number: 1; article: 21; publication month: 5; doi: 10.1145/3651289; ccs: Computing methodologies – Visibility; Computing methodologies – Shape representations; Computing methodologies – Reconstruction
Figure 1. This work presents an efficient approach to approximate visibility between 3D positions in static game scenes using a neural Omnidirectional Distance Field (ODF) representation. It enables game AI entities, such as NPCs, to obtain visibility information without raycasting. The evaluation process of our method is as follows: (1) From the bird's-eye view of a 2D grid partitioned game scene, the visibility evaluation takes place between the source position $s$ and the target position $t$, where $\hat{u}$ represents the normalized direction from $s$ to $t$. (2) The local ODF representation associated with the partitioned area containing $s$ is inferred; the positional encoding $\gamma$ encodes the position, while the multi-resolution grid encodes the direction. These encoded features are concatenated and fed into an MLP. (3) The MLP returns the distance value originating from $s$ in the direction $\hat{u}$. (4) The predicted visibility result is computed by comparing the inferred distance with the Euclidean distance between $s$ and $t$.

1. Introduction

In video games, visibility or line-of-sight information is crucial for AI entities such as non-player characters (NPCs) to build spatial knowledge about the game world (Bourg and Seemann, 2004). It is often used to make real-time decisions, which can significantly affect the ability of NPCs to appear intelligent. The most basic result of a visibility test is a boolean value describing whether two positions in the game world are visible to each other, i.e., $V(s,\,t) \in \{0, 1\}$. Advanced systems for building complex and believable behaviors for game AI rely heavily on it (Walsh, 2019; Welsh, 2013; McIntosh, 2019; Jack, 2019).

For game AI, the general method of obtaining visibility information is through raycasting (Glassner, 1989). Given a ray, this approach determines the first intersection with a scene object. However, the time required increases with the geometric complexity of the scene. Furthermore, since game AI programs typically run in CPU processes, performing a large number of visibility tests can lead to a computational bottleneck.

To compensate for the high computational cost, game developers often limit the number of visibility tests that are processed in real time. However, this trade-off can have a significant impact on the complexity and believability of NPC behavior. Limiting visibility information can lead to simplistic and unrealistic NPC decision making, limiting the range of actions that NPCs can take in response to their environment. It is therefore important to develop more efficient methods for assessing visibility.

Inspired by recent research on modeling 3D objects and scenes with neural implicit representations (Mildenhall et al., 2020; Chibane et al., 2020; Müller et al., 2022; Liu and Yang, 2023), we propose to represent the surface of the game scene geometry with a continuous distance function that can be optimized by parametric encoding and neural networks to effectively approximate visibility. Due to the varying size of game scenes, we employ a partitioning strategy to divide the entire game scene into smaller areas, and we represent each area as a local ODF. The ODF can be expressed as a scalar function that returns the distance to the closest surface based on a given source position $p$ and direction $\hat{d}$. Formally, it is defined as

(1) $ODF(p,\,\hat{d}) \coloneqq \min(\lVert p - p^{I} \rVert_{2}),\quad p^{I} \in \partial\Omega,$

where $p^{I} = p + \lambda\hat{d},\ \lambda \geq 0$. Using this representation, we can efficiently evaluate whether there are occluders between two positions $s$ and $t$ and compute their visibility:

(2) $V(s,\,t) = \begin{cases} 1, & \text{if } ODF_{i}(s,\,\hat{u}) > \lVert s - t \rVert_{2} \\ 0, & \text{otherwise} \end{cases} \quad \text{where} \quad \hat{u} = \dfrac{t - s}{\lVert t - s \rVert_{2}} \quad \text{and} \quad 1 \leq i \leq N$

Here $N$ is the total number of partitions and $i$ is the partition index associated with the source position $s$. The visibility evaluation process depends on only one of these local ODFs, regardless of the location of the target position. This allows us to represent each area of the game scene with a smaller neural network and achieve fast evaluation; it also allows developers to selectively enable ODF representations in disconnected areas of the game scene. There is a trade-off between computational cost and overall memory usage, but game developers can find their optimal configuration based on specific requirements.
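
To make Eq. (2) concrete, the following minimal Python sketch evaluates visibility between two positions; it assumes a hypothetical callable odf(s, u_hat) that dispatches to the local ODF model of the partition containing s and returns the predicted distance.

```python
import numpy as np

def visibility(s, t, odf, eps=1e-8):
    """Approximate V(s, t) following Eq. (2).

    s, t : (3,) arrays, source and target positions.
    odf  : callable (position, unit_direction) -> predicted distance,
           assumed to use the local ODF model of the partition containing s.
    """
    diff = t - s
    dist = np.linalg.norm(diff)          # Euclidean distance ||s - t||
    if dist < eps:                       # degenerate case: identical positions
        return 1
    u_hat = diff / dist                  # normalized direction from s to t
    # Visible iff the predicted distance to the nearest surface along u_hat
    # exceeds the distance to the target.
    return 1 if odf(s, u_hat) > dist else 0
```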

Standard neural implicit representations designed for view synthesis or 3D shape reconstruction require only low-frequency signals in the direction dimensions. In contrast, our ODF representation requires the capture of higher-frequency details. To address this challenge, for each source position, we map the omnidirectional distance data on the unit spherical surface onto a UV plane. On this UV plane, we employ bilinearly interpolated multi-resolution grid encoding as our direction features, complemented by positional encoding as our position features. By concatenating these features and feeding them into a compact MLP, we can reconstruct ODFs for the game scene while maintaining fast evaluation speeds.

In summary, our contributions are threefold:

  • We provide evidence that the learned neural ODF representation can be used to approximate the visibility between positions in static game scenes.

  • Evaluating visibility using the ODF depends only on the local representation associated with the source position, regardless of the target’s location. We show that by partitioning the game scene into areas and representing each area as an independent ODF, our method can be scaled to game worlds of arbitrary size while maintaining evaluation speed.

  • We show that by mapping omnidirectional distance data from the spherical surface onto a UV plane and using multi-resolution grid encoding, we are able to effectively reconstruct the distance data using a compact MLP for visibility approximation.

2. Background and Related Work

The core of our method is to replace traditional raycasting-based visibility computation with neural representation evaluation for efficiency. In this section, we provide a brief review of traditional visibility for NPC behaviors and neural implicit representations. To our knowledge, our work is the first to integrate these two domains, demonstrating promising results in a real game environment.

2.1. Visibility for NPC Behaviors

In NPC behavior design, visibility refers to knowing whether one position in a game scene can be "seen" by an object at another position. In order to design complex and believable behaviors, various parts of the NPC decision-making logic rely heavily on visibility information. This includes higher-level systems such as NPC visual perception (Walsh, 2019; Welsh, 2013; McIntosh, 2019) and tactical position selection (Jack, 2019), as well as more moment-to-moment logic, such as checking whether an agent should pull the trigger, change stance, or take some other immediate action (McIntosh, 2019). For example, when evaluating NPC movement destinations during combat, visibility information between candidate positions and the player's current position is critical.

While z-buffer scanning and raycasting (Bittner and Wonka, 2003) are common algorithms for visibility tests in graphics, NPC visibility queries require testing pairs of positions scattered throughout the game scene. Therefore, raycasting-based methods are often more efficient. However, as game scenes become more complex, raycasting algorithms face performance challenges. (Smits, 2005) highlights the complexity and considerations required to optimize raycasting. Although more efficient raycasting algorithms (Möller and Trumbore, 2005), acceleration structures (Klimaszewski, 1994; Klosowski et al., 1998; Foley and Sugerman, 2005), and hardware-accelerated libraries (Wald et al., 2014) have been developed, the cost of intensive visibility computation is still significant in many modern video games.

The probe-based irradiance field with visibility representation, as introduced in (McGuire et al., 2017; Majercik et al., 2019), is a promising advancement in real-time global illumination. By storing and constantly updating visibility information in GPU-resident probes, this technique effectively mitigates rendering problems such as light leaks. Accessing low-detail dynamic visibility results from these probes and estimating them by interpolation can be efficient (Majercik et al., 2019). However, the transition to game AI applications presents notable challenges. First, the probes are designed for rendering tasks, so they are often placed around the camera or cover only part of the scene. For game AI applications, however, the probes need to be placed and active over a larger area, which can significantly increase GPU memory usage. Second, game AI applications operate primarily on the CPU, requiring low-throughput visibility tests with minimal response latency. This makes probe-based methods less optimal.

In our approach, we leverage neural representations to approximate visibility and execute on the CPU. Approximation is acceptable because achieving 100% accuracy for each visibility test is often unnecessary in game AI applications. For example, the visibility from a position to an object is often evaluated by sampling multiple visibility tests against the volume of that object, so the sampling errors can be handled by statistical techniques.
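
As an illustration of such statistical handling, a hypothetical object-level visibility score can be estimated by averaging several point-to-point tests against positions sampled from the object's volume. The sketch below assumes a point-to-point visibility(s, t) test such as the one in Eq. (2); the function and threshold are illustrative, not part of our method.

```python
import numpy as np

def object_visibility(s, target_samples, visibility):
    """Fraction of sampled points of the target object visible from s.

    target_samples : (K, 3) array of positions sampled from the object volume.
    visibility     : callable (s, t) -> {0, 1}, e.g. an ODF-based test.
    """
    results = [visibility(s, t) for t in target_samples]
    # A consumer could, for instance, treat a score above 0.5 as "object visible";
    # averaging smooths out individual prediction errors.
    return float(np.mean(results))
```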

2.2. Neural Implicit Representations

Neural implicit representations have emerged as a powerful tool for encoding high-dimensional data as a continuous, implicit function defined by a neural network, enabling various applications such as novel view synthesis (NVS), 3D shape reconstruction, and physical simulation. For example, NVS seeks to render a scene from unseen viewpoints using a dataset of images and camera poses. In particular, Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) learns a representation of the scene using a continuous volumetric function of color density and relies on ray marching to render view-dependent images. In our task, we are more interested in distance fields that can be used to evaluate visibility.

Previous studies have explored the use of neural distance functions primarily for 3D shape reconstruction, such as predicting unsigned distance fields from sparse point clouds (Chibane et al., 2020) or learning consistent distance and density fields for improved localization (Ueda et al., 2022). However, these distance functions only model distance with respect to the surface normal and rely on algorithms such as sphere tracing to reconstruct shape, making them unsuitable for fast visibility evaluation between positions.

Classical implicit neural representations suffer from slow training and inference due to the computational complexity. To improve efficiency, previous research falls broadly into two categories: accelerating volumetric representations and non-volumetric representations.

Accelerating volumetric representations. With the introduction of auxiliary acceleration data structures, NeRFs can efficiently evaluate samples with a shallow MLP or no MLP at all. Plenoctrees (Yu et al., 2021), Plenoxels (Fridovich-Keil and Yu et al., 2022) and multiresolution hash encoding (Müller et al., 2022) use voxel grids or hash table data structures as position features, along with low-degree truncated spherical harmonics (SH) as direction features. A typical setting is SH truncated at degree 2 ($SH_2$). Sparse Neural Radiance Grid (SNeRG) (Hedman et al., 2021) bakes NeRF's continuous neural volumetric scene representation into a discrete data structure at training time, and uses a small MLP with view direction at evaluation time for efficient rendering. These methods still rely on ray marching for evaluation and thus require multiple MLP inferences for a single sample. Another notable effort to accelerate volumetric representations is through point-based radiance fields. 3D Gaussian Splatting (Kerbl et al., 2023) represents the scene using point-based 3D Gaussians and optimizes them using differentiable rendering. This allows the representation evaluation to skip unnecessary computations in empty space, resulting in real-time rendering speed.

Non-volumetric representations. Various works represent the scene using other representations to be more efficient. MobileNeRF (Chen et al., 2023) represents the scene as a triangle mesh textured by learned deep features. This method leverages a classical rasterization pipeline, incorporating a compact MLP implemented as a GLSL fragment shader. The rendering process for each sample involves a single MLP inference. As a result, it achieves interactive frame rates on mobile phones and shows that volumetric representations can be baked onto a proxy surface geometry for efficient rendering. Ray-based implicit 3D shape representations (Feng et al., 2022; Liu and Yang, 2023) are another type of non-volumetric representation. They share a similar concept with our work, modeling distance fields based on rays. However, their ray parameterizations are tailored for 3D shape reconstruction, which limits their application to single objects or small scenes. In contrast, our focus is on rays that originate from selected sparse source positions and can be easily scaled to cover the entire game world. Our approach models the omnidirectional distance originating from these positions, allowing us to efficiently test occluders between them and arbitrary target positions. In particular, previous work has introduced the concept of a neural ODF for 3D reconstruction using recursive inference (Houchens et al., 2022), the computation time of which is impractical for our task.

Although some NeRFs can handle scenes the size of a single room or building, they are generally still limited and do not scale to large environments such as the game world. Researchers in the field have proposed to extend NeRF using architectures that align multiple NeRFs in a distributed manner, as shown in (Tancik et al., 2022; Turki et al., 2022). Our approach follows a similar concept of partitioning the entire scene. However, unlike previous methods that use an end-to-end architecture to dynamically select NeRFs for rendering tasks, we exploit the fact that the evaluation of visibility depends only on the local ODF representation. This insight allows us to partition the game scene through a simple grouping of source positions, paving the way for a scalable and adaptable application.

3. Method

Our method can be seen as a precomputation approach that learns the scene geometry in an offline process and then evaluates the learned representation online. In this section, we outline the collection of training data, introduce the multi-resolution grid encoding used to reconstruct high-frequency signals in direction dimensions, and describe the details of model training and evaluation.

Figure 2. Omnidirectional distance data collection at a source position. (a) Directions are distributed on the spherical surface by a Fibonacci lattice; this visualization shows a total number of directions $P = 201$. (b) Collected omnidirectional distance data at one position of our offline experiment scene, with $P = 40001$.

3.1. Training Data Collection

For positions of interest in the game scene, the input variables of the ODF can be expressed as rays $r = (p_r,\,d_r)$, where $p_r \in \mathbb{R}^3$ represents the position and $d_r \in \mathbb{S}^2$ represents the normalized direction. The ODF distance value associated with each ray can be collected in the game scene through raycasting.

In practice, we can only generate a finite number of rays for data collection. To distribute the rays evenly across the surface of the unit sphere, we employ the Fibonacci lattice (Swinbank and James Purser, 2006; González, 2010) as a basic approximation. Let $N$ be any natural number and let $i \in [-N,\,+N]$ be an integer. The $i$th direction in radians is

(3) $\begin{split} lat_{i} &= \arcsin\left(\frac{2i}{2N+1}\right), \\ lon_{i} &= 2\pi i\phi^{-1}, \end{split}$

where

(4) $\begin{gathered} \phi = \frac{1+\sqrt{5}}{2} = \lim_{n\to\infty}\left(\frac{F_{n+1}}{F_{n}}\right), \\ F_{0} = 0, \quad F_{1} = 1, \quad F_{n} = F_{n-1} + F_{n-2} \quad \text{for } n > 1 \end{gathered}$

The number of directions is

(5) $P = 2N + 1.$
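
A minimal NumPy sketch of this Fibonacci-lattice direction sampling (Eqs. (3)–(5)) follows; the conversion from latitude/longitude to Cartesian unit vectors is an assumption about the raycasting interface, and names are illustrative.

```python
import numpy as np

def fibonacci_directions(N):
    """Generate P = 2N + 1 approximately uniform directions on the unit sphere."""
    phi = (1.0 + np.sqrt(5.0)) / 2.0             # golden ratio, Eq. (4)
    i = np.arange(-N, N + 1)                      # integer i in [-N, +N]
    lat = np.arcsin(2.0 * i / (2 * N + 1))        # Eq. (3), latitude in radians
    lon = 2.0 * np.pi * i / phi                   # Eq. (3), longitude in radians
    # Convert spherical (lat, lon) to Cartesian unit vectors for raycasting.
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.stack([x, y, z], axis=-1)           # shape (2N + 1, 3)
```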

After collecting data at all positions of interest, partitioning can be implemented as a post-processing step. This involves virtually grouping rays based on their position $p_r$, for example, grouping them by the voxels to which $p_r$ belongs.

3.2. Multi-resolution Grid Encoding

Figure 3. The effect of direction encoding on the high-frequency expressiveness of the neural network in ODF reconstruction. As the truncated SH degree used for direction encoding increases, the distance data reconstructed by the trained neural network contains more details.

Standard MLPs are known to have a learning bias toward smooth functions, making them poorly suited for low-dimensional coordinate-based vision and graphics tasks (Rahaman et al., 2019). To overcome this limitation, neural representations employ encoding methods that map inputs to higher dimensions, improving the expressiveness for high-frequency signals. Unlike other representations designed for volumetric rendering or 3D shape reconstruction (Yu et al., 2021; Fridovich-Keil and Yu et al., 2022; Müller et al., 2022), whose common practice is to use low-degree truncated SH to encode directions, e.g. $SH_2$, our neural ODF representation requires capturing high-frequency details for directions. This is demonstrated by an experiment of ODF reconstruction at a single position. In the experiment, we use several truncated SH degrees to encode directions: the greater the truncated SH degree, the more SH coefficients are used, and the more high-frequency details the trained neural network can capture, as shown in Fig. 3.

However, high-degree SH encoding is computationally expensive. The direct computation scales as $\mathcal{O}(N^3)$ (Müller, 2006). Although there are some methods to reduce the computational cost (Suda and Takami, 2002; Dutordoir et al., 2020), we decided to develop a more efficient approach.

Figure 4. Illustration of multi-resolution grid encoding. (a) Spherical view of a two-level resolution grid in the longitude-latitude projection, where the sample direction is represented as "$\mathbf{x}$". (b) The (partial) UV plane view with the same grid and direction. The four nearest texels that store relevant features are highlighted for each grid.

To address the encoding challenge while aiming for fast evaluation, we consider the omnidirectional distance data as a 2D spherical surface, so that it can be mapped onto the UV plane for 2D reconstruction. Inspired by the trade-off between memory and evaluation time made by discretized volumetric representations (Hedman et al., 2021; Müller et al., 2022), we employ a multi-resolution grid data structure to store our parametric direction features. Each grid is rectangular with an approximately $2:1$ ratio between the longitude and latitude axes to match their value ranges. The grid settings used in our offline experiments are a coarsest resolution of $16 \times 8$ and a finest resolution of $256 \times 128$, with the number of levels $L = 16$ and the length of the feature stored in each texel $F = 5$. For each ray direction $d_r$, the direction features are bilinearly interpolated from the four nearest texels on the UV plane and concatenated across the multiple resolutions of the grid. We illustrate a two-level resolution grid example in Fig. 4.
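
The following PyTorch sketch illustrates one possible form of this encoding under a longitude-latitude mapping; the class name, grid initialization, resolution schedule, and the naive handling of the longitude wrap-around are illustrative assumptions rather than our exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class MultiResGridEncoding(nn.Module):
    def __init__(self, levels=16, feat_dim=5, coarse=(8, 16), fine=(128, 256)):
        super().__init__()
        # Per-level learnable feature grids of shape (1, F, H, W); H:W is about 1:2
        # to match the latitude/longitude value ranges.
        self.grids = nn.ParameterList()
        for l in range(levels):
            t = l / max(levels - 1, 1)
            h = int(round(coarse[0] * (fine[0] / coarse[0]) ** t))
            w = int(round(coarse[1] * (fine[1] / coarse[1]) ** t))
            self.grids.append(nn.Parameter(0.01 * torch.randn(1, feat_dim, h, w)))

    def forward(self, d):
        """d: (B, 3) unit directions -> (B, L * F) concatenated features."""
        lon = torch.atan2(d[:, 1], d[:, 0])                 # [-pi, pi]
        lat = torch.asin(d[:, 2].clamp(-1.0, 1.0))          # [-pi/2, pi/2]
        u = lon / math.pi                                    # normalize to [-1, 1]
        v = lat / (math.pi / 2.0)                            # normalize to [-1, 1]
        uv = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)   # grid_sample coordinates
        feats = []
        for g in self.grids:
            # Bilinear interpolation of the four nearest texels at this level.
            f = F_nn.grid_sample(g, uv, mode='bilinear', align_corners=True)
            feats.append(f.view(g.shape[1], -1).t())          # (B, F)
        return torch.cat(feats, dim=-1)                       # (B, L * F)
```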

3.3. Learning and Evaluating ODF

In addition to the multi-resolution grid used for direction encoding, we map the position $p_r$ to a higher-dimensional space using hash encoding (Müller et al., 2022) or positional encoding (PE) (Mildenhall et al., 2020). These features are then concatenated and fed into a compact MLP. To optimize the neural network, we perform direct regression of the ground-truth distance, minimizing the mean squared error with the Adam optimizer. To prevent divergence, we apply weak weight decay regularization (coefficient $10^{-5}$) to the MLP.
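
A hedged PyTorch sketch of how a local ODF model along these lines could be assembled and optimized is shown below: PE for the position, a direction encoder such as the MultiResGridEncoding sketch in Section 3.2, a compact MLP, MSE loss, Adam, and weak weight decay on the MLP. Module and variable names are illustrative assumptions, not our in-engine implementation.

```python
import torch
import torch.nn as nn

def positional_encoding(p, n_freq=6):
    """Map positions to [sin(2^k p), cos(2^k p)] features (Mildenhall et al., 2020)."""
    feats = []
    for k in range(n_freq):
        feats += [torch.sin(2.0 ** k * p), torch.cos(2.0 ** k * p)]
    return torch.cat(feats, dim=-1)            # (B, 3 * 2 * n_freq)

class LocalODF(nn.Module):
    def __init__(self, dir_encoder, n_freq=6, hidden=128, layers=4):
        super().__init__()
        self.dir_encoder = dir_encoder          # e.g. MultiResGridEncoding()
        self.n_freq = n_freq
        in_dim = 3 * 2 * n_freq + sum(g.shape[1] for g in dir_encoder.grids)
        mlp = []
        for i in range(layers):
            mlp += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
        mlp += [nn.Linear(hidden, 1)]
        self.mlp = nn.Sequential(*mlp)

    def forward(self, p, d):
        x = torch.cat([positional_encoding(p, self.n_freq), self.dir_encoder(d)], dim=-1)
        return self.mlp(x).squeeze(-1)          # predicted distance along d from p

# Direct regression of the ground-truth distance with MSE, Adam,
# and weak weight decay applied to the MLP parameters only.
model = LocalODF(MultiResGridEncoding())
optimizer = torch.optim.Adam([
    {'params': model.mlp.parameters(), 'weight_decay': 1e-5},
    {'params': model.dir_encoder.parameters(), 'weight_decay': 0.0},
], lr=1e-3)
# for p_batch, d_batch, dist_batch in loader:   # hypothetical data loader
#     loss = nn.functional.mse_loss(model(p_batch, d_batch), dist_batch)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```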

Directly evaluating performance on randomly split data is not ideal due to ray aliasing (Feng et al., 2022). If we consider moving the origin position $p_r$ of a ray along its direction $\hat{d_r}$, the new ray $r^{\prime} = (p_{r^{\prime}},\,\hat{d_r})$ is an aliased version of $r$. This aliasing effect can also be observed from the ODF perspective, as highlighted by (Houchens et al., 2022) with the equation:

(6) $\nabla_{[\hat{d},\,\hat{0}]}\, ODF(p,\,\hat{d}) = -1,$

we can rewrite it as

(7) $ODF(p + \lambda\hat{d},\,\hat{d}) = ODF(p,\,\hat{d}) - \lambda, \quad \lambda \in [-ODF(p,\,-\hat{d}),\ ODF(p,\,\hat{d})].$

A direct evaluation based on a randomly partitioned training and testing set of these rays could result in overestimated metrics.

To mitigate ray aliasing in performance evaluation, we construct our testing dataset in the game scene separately by simulating visibility tests between candidate NPC movement destination positions and player positions. Each candidate position corresponds to one of the source positions where we collected omnidirectional distance data, while each target position is chosen randomly from the entire game scene. Additionally, we collect the results of raycasting-based visibility tests as the ground-truth labels, denoted as $y$. The testing dataset is represented as:

(8) $\mathcal{D} = \{(x^{(i)},\,y^{(i)})\}_{i=1}^{M}, \quad y \in \{0,1\}, \quad x = (s,\,t), \quad s \in \mathcal{P}, \quad t \in \mathcal{Q},$

where $\mathcal{P}$ represents positions of interest and $\mathcal{Q}$ represents all positions within the game scene.
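
A minimal sketch of how such a testing set could be assembled is given below; the raycast_visible(s, t) ground-truth test and the axis-aligned scene bounds are hypothetical interfaces used only for illustration.

```python
import numpy as np

def build_test_set(source_positions, scene_bounds, raycast_visible, M, rng=None):
    """Simulated visibility tests following Eq. (8): sources from the positions of
    interest, targets sampled from the whole scene, labels from raycasting.

    source_positions : (K, 3) array of positions where ODF data was collected.
    scene_bounds     : (lo, hi) pair of (3,) arrays bounding the game scene.
    raycast_visible  : callable (s, t) -> {0, 1}, ground-truth raycasting test.
    """
    rng = rng or np.random.default_rng(0)
    lo, hi = scene_bounds
    dataset = []
    for _ in range(M):
        s = source_positions[rng.integers(len(source_positions))]
        t = rng.uniform(lo, hi)                             # random target in the scene
        dataset.append(((s, t), raycast_visible(s, t)))     # (x, y) pair
    return dataset
```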

4. Experiments

4.1. Offline Experiment

Figure 5. Offline evaluation environment. (a) Perspective view. (b) Bird's-eye view with partitioning.

To establish an offline environment for convenient performance comparison with baselines, we first collect training and testing data in a game scene using the methods described in Section 3.1 and Section 3.3. The data is exported as binary files for convenient integration with machine learning frameworks such as PyTorch. We then post-process the training data by partitioning the 6227 source positions using a simple 2D grid scheme, as shown in Fig. 5(b). Based on this $8 \times 8$ partitioning scheme, we create 64 independently initialized small models for each method under evaluation. During both the training and testing phases, each sample infers only one of these models, determined by the local partition index associated with the given source position.
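
A minimal sketch of how the partition index for a source position could be computed under this 2D grid scheme follows; the assumption that the horizontal plane is spanned by the first two coordinates, and the row-major indexing, are illustrative choices.

```python
import numpy as np

def partition_index(p, scene_min, scene_max, grid=(8, 8)):
    """Row-major 2D grid partition index for a source position (bird's-eye plane)."""
    p2 = np.asarray(p, dtype=float)[:2]
    lo = np.asarray(scene_min, dtype=float)[:2]
    hi = np.asarray(scene_max, dtype=float)[:2]
    cell = np.floor((p2 - lo) / (hi - lo) * np.asarray(grid))
    cell = np.clip(cell, 0, np.asarray(grid) - 1).astype(int)  # clamp to valid cells
    return cell[0] * grid[1] + cell[1]

# models: list of 64 independently trained local ODF models (one per partition)
# dist = models[partition_index(s, scene_min, scene_max)](s, u_hat)
```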

Table 1. Offline evaluation for encoding methods.
Method | ↑ Acc. | ↑ Prec. | ↑ Recall | ↑ F1 | ↓ Params # (k) | ↓ Time (ms)
No mapping | 0.8721 | 0.9032 | 0.7151 | 0.7982 | 50 | 0.27 ± 0.03
Positional encoding | 0.8913 | 0.9136 | 0.7651 | 0.8328 | 54 | 0.41 ± 0.02
Hash encoding ($SH_2$) | 0.8750 | 0.9054 | 0.7222 | 0.8035 | 2152 | 6.44 ± 0.19
Hash encoding ($SH_{12}$) | 0.9029 | 0.9303 | 0.7843 | 0.8511 | 2172 | 33.54 ± 4.94
FFM ($SH_{12}$) | 0.8975 | 0.9272 | 0.7708 | 0.8418 | 104 | 29.70 ± 0.47
Ours (Hash, Long-Lat) | 0.9079 | 0.9328 | 0.7971 | 0.8596 | 3118 | 12.66 ± 0.55
Ours (Hash, Mercator) | 0.9081 | 0.9336 | 0.7940 | 0.8594 | 3118 | 12.88 ± 0.37
Ours (PE, Long-Lat) | 0.9034 | 0.9167 | 0.7996 | 0.8542 | 1022 | 6.74 ± 0.11
Ours (PE, Mercator) | 0.9031 | 0.9221 | 0.7931 | 0.8528 | 1022 | 7.00 ± 0.16

The evaluation results are presented in Table 1. All these experiments are performed offline with PyTorch. The prediction accuracy is evaluated through classification metrics (Powers, 2008). The inference time is benchmarked on a Xeon W-2135 CPU and is constrained to single-threaded execution without batching, simulating the typical execution of visibility tests in game AI applications.

In the positional encoding (Mildenhall et al., 2020) experiment, we use 6 exponential growth frequencies ($n_{\text{freq}} = 6$) for both position and direction inputs. In the hash encoding (Müller et al., 2022) experiment, we use the setting $(F = 2,\,L = 16)$ with a hash table size of $2^{16}$. For its direction encoding we employ SH truncated at degree 2 and degree 12. In the Fourier Feature Mapping (FFM) (Tancik et al., 2020) experiment, we use a standard deviation $\sigma = 1.0$ for the Gaussian distribution to encode positions and SH truncated at degree 12 to encode directions. In our method, for directions, we experimented with the longitude-latitude projection and the Mercator projection for spherical mapping; the multi-resolution grid setting is described in Section 3.2, which is $(F = 5,\,L = 16)$. For positions, we use hash encoding and PE with $n_{\text{freq}} = 6$. The hidden size of the MLPs is 128 neurons and 4 layers for all methods.

For all methods, there is a trade-off between prediction performance, memory, and computational cost. From the results, our method with PE shows promising metrics in both prediction performance and computational cost, at a reasonable memory cost. With hash encoding, our method achieves slightly better prediction performance but slower speed. We do not see much difference between the two projection methods with the experimented grid settings. Due to its smoothness in reconstructing omnidirectional distance data, hash encoding with SH truncated at degree 2 has a prediction performance similar to the no-mapping baseline. In our experiments, we observed that while positional encoding can yield commendable results with appropriate frequency configurations, its effectiveness is highly sensitive, requiring hyperparameter tuning across different scenes. Hash encoding and FFM with SH truncated at degree 12 are much slower due to their SH computations. Among the different versions of our method, we choose PE with the longitude-latitude projection for further in-game experiments due to its simplicity, faster speed, and lower memory cost.

Figure 6. Visibility test throughput (kilo tests per second) of our method (PE, Long-Lat) for different batch sizes, tested on CPU and GPU using PyTorch.

Certain applications require the use of multi-threaded CPU or GPU batching operations to perform high-throughput visibility testing. Our approach benefits from the integration of parallel and vectorized computation with batching. The scalability of the inference throughput of our method for both CPU and GPU is illustrated in Fig. 6. The throughput depends on the hardware and inference framework used; we present the measurements tested using PyTorch, a Xeon W-2135 CPU, and an RTX A5000 GPU.

4.2. Memory Usage Analysis

Figure 7. Comparison of memory composition between omnidirectional depth maps and our neural ODF representation. (a) Shows omnidirectional depth maps from 6 source positions within a local partition. (b) Shows the neural ODF representation with a 3-level resolution grid for the same partition. The visualization shows the first 3 dimensions of the feature vectors of the grids.

Our method involves compressing omnidirectional depth maps for all source positions within the local partition into a single neural ODF representation. Fig. 7 illustrates a comparison of the memory composition. To analyze the difference in memory usage, we conduct an experiment using the offline environment with the partitioning and settings of our method described in Section 4.1. Each partition in the experiment contained an average of 100 source positions. For PNG DEFLATE lossless compression, we used ZLIB compression level 6.

Table 2. Memory usage per partition.
Representation | Uncompressed | PNG Compressed
Depth Map (256 × 128) | 13.1 MB | 559 KB
Depth Map (512 × 256) | 52.4 MB | 1.3 MB
Ours | 4.95 MB | –

As shown in Table 2, for each partition, our neural representation has a significantly smaller memory size than the uncompressed depth maps, although it remains larger than the PNG-compressed depth maps. Traditional image compression methods have their own drawbacks when applied to game AI applications, such as additional computation time, inconvenience in evaluating individual pixels, and lack of generalization ability. To further optimize the memory size of our neural representation, future considerations may include reducing weight precision to 16-bit float, applying quantization, or using smaller feature vectors. For the scope of this study, our primary focus is on prediction accuracy and inference time in a real game environment, which will be discussed in the following section.

4.3. In-game Evaluation

Figure 8. Three types of scene for in-game evaluation. Environment A represents a compact outdoor environment, Environment B represents a wide outdoor environment, Environment C represents a multi-level indoor environment.
Table 3. In-game evaluation results.
Method | Environment A (809k triangles) | Environment B (1078k triangles) | Environment C (136k triangles)
  | Cold (µs) / Warm (µs) / Acc. | Cold (µs) / Warm (µs) / Acc. | Cold (µs) / Warm (µs) / Acc.
Raycasting | 55.28 / 8.79 / – | 78.79 / 13.46 / – | 62.37 / 10.84 / –
Ours (128 × 4) | 10.0 / 4.5 / 0.950 | 10.0 / 4.5 / 0.897 | 10.0 / 4.5 / 0.990
Ours (128 × 2) | 7.0 / 2.3 / 0.950 | 7.0 / 2.3 / 0.894 | 7.0 / 2.3 / 0.989
Ours (64 × 2) | 4.0 / 1.3 / 0.947 | 4.0 / 1.3 / 0.886 | 4.0 / 1.3 / 0.988
Ours (32 × 2) | 3.0 / 1.1 / 0.940 | 3.0 / 1.1 / 0.881 | 3.0 / 1.1 / 0.985

To evaluate our method in a real game environment, we perform a side-by-side comparison with raycasting using an AAA game engine. The raycasting implementation is based on a commonly used commercial product called Havok Physics. The maximum raycasting distance for both data collection and baseline comparison is set to 100 meters. Three types of scenes are selected as our evaluation environment, as shown in Fig. 8. These scenes are partitioned into 16 m × 16 m × 16 m voxels, and each partition contains a different number of source positions of interest, ranging from 30 to 200. The evaluation results for execution time and visibility prediction accuracy are presented in Table 3. Due to the significant impact of CPU caching on execution time, we collect both cold start and warm start average execution times for all tasks. The execution time and accuracy of our method are measured based on a C++ implementation and executed on a single-threaded CPU process without batching. We use the (PE, Long-Lat) version of our method. In addition, various sizes of MLPs are evaluated, while keeping the encoding settings the same ($n_{\text{freq}} = 6$, $F = 2$, $L = 16$), so the input size of the MLPs is $2 \times 3 \times 6 + 2 \times 16 = 68$.

Table 4. In-game memory usage estimation.
  | Raycasting | Part. # | Ours (128 × 4) | Ours (128 × 2) | Ours (64 × 4) | Ours (32 × 2)
Environment A | 13.3 MB | 56 | 104.7 MB | 97.3 MB | 95.4 MB | 92.3 MB
Environment B | 30.5 MB | 80 | 149.5 MB | 139 MB | 136.2 MB | 131.9 MB
Environment C | 6.9 MB | 30 | 56.1 MB | 52.1 MB | 51.1 MB | 49.5 MB

For each partition, using 32-bit float precision, the memory size of the multi-resolution grid is approximately 1.64 MB, while the memory size of the MLP ranges from 13 KB to 234 KB. The observed memory usage is lower than in the offline experiments due to the use of smaller feature vectors and MLPs for in-game evaluation. To compare memory usage with raycasting, as shown in Table 4, we estimate raycasting memory usage by assessing the memory size of the collision geometry at the level of detail employed for raycasting. For our methods, we estimate memory usage by multiplying the number of partitions for each environment by the model size of each partition.

Due to differing scene characteristics, such as average collision distance and number of triangles, the execution time of raycasting-based visibility tests varies across the three scenes. In contrast, our inference time remains the same across the three scenes because we are using exactly the same model architecture. This is a potentially desirable property for game AI applications, as it makes the execution time more predictable.

Figure 9. Visualizations of visibility prediction in Environment B. (a) The selected partition is visualized as an orange voxel. The source position is visualized as a blue sphere inside this voxel. Visibility target positions that were predicted correctly or incorrectly are visualized as green and red spheres, respectively. (b) Highlighted areas of predictions.

From the experimental results presented in Table 3, we observe that the most challenging scene for our method is Environment B, which contains many small objects in the open area. We hypothesize that the lower accuracy is due to the reconstruction error at object edges. As observed in the visualizations shown in Fig. 9, most of the incorrect predictions are around object edges, corresponding to the viewing direction from the source position. In the future, this problem may be solved by using a hybrid approach that detects object edges and switches between raycasting-based visibility tests and ours. For Environment C, where most of the target positions are not visible, the prediction accuracy of our method is very high (> 98.5%). Under different MLP settings and scenes, our method shows speedups ranging from 1.95× to 26.26×.

5. Discussion and Future Work

In this section, we will discuss the limitations of our method and possible directions for future work.

Direction feature enhancements. In our method, we use a multi-resolution grid for direction encoding, storing learnable, locally interpolated features. While our current techniques for sphere discretization and feature interpolation remain relatively unexplored, there is potential for enhancement and refinement. To make the features more informative, in the future we can explore sphere discretization methods with better mathematical properties, such as HEALPix (Gorski et al., 2005), data structures that take spherical curvature into account, such as the Spherical Feature-Grid (Dou et al., 2023), or better feature interpolation methods that use the geodesic distance.

Precomputation time. As a precomputation method, our approach requires learning the surface geometry of the scene into another representation before it can be evaluated at runtime. A disadvantage of precomputation methods is the additional time required for offline processing. The precomputation time for our method consists of data collection time and model training time. For small areas of the game scene, this process takes several minutes, which is acceptable. However, scaling our method to the entire game world could extend the precomputation time to several hours or even days, depending on the number of positions of interest and directions. To improve the data collection process in the future, instead of relying on raycasting with predefined directions, we can explore the use of panoramic depth frames from the GPU. This improvement would allow us to efficiently collect omnidirectional distance data for positions, taking advantage of the graphics pipeline.

Dynamic scene handling is another limitation of our method. Many game scenarios involve highly dynamic scenes, including destructible walls and buildings, large moving objects such as vehicles, dynamic visual effects such as smoke and fog, etc. However, our current method is only applicable to the static parts of scenes. To overcome this limitation, future improvements could include modeling dynamic objects with temporal variations (Ost et al., 2021; Cao and Johnson, 2023; Luiten et al., 2024) or modeling variations of the game scene.

Extended game applications to replace raycasting. In addition to the visibility test use case presented in this paper, raycasting is widely used in many other aspects of video game development. For example, navigation mesh construction, pathfinding, projectile ballistics, spatial clearance tests, UI logic, and melee combat all rely heavily on raycasting techniques. While these applications may have different requirements in terms of memory, accuracy, and computational cost, by adjusting the grid and MLP settings, our method can serve as an accelerated approximation technique. It would be valuable to explore its potential to replace traditional raycasting in these other application scenarios.

Generalized representation. Spatial reasoning is a major challenge in crafting more sophisticated, believable, and intelligent NPC behavior in games. Existing approaches, such as (Forbus et al., 2002; Dill, 2019), have made progress in this area. Our method provides a potential solution to their computational challenge by learning a neural representation of the spatial environment that is not limited to ray and distance pairs. We can generalize the representation to incorporate other information obtained from the game environment for efficient information retrieval at runtime. We see many potential applications of our approach in this area.

6. Conclusion

For game AI applications, obtaining real-time visibility information is critical but computationally expensive. Our work presents the first method designed to address the computational bottleneck of visibility tests for game AI using a neural representation.

We show that by optimizing neural networks to represent scene geometry as an ODF, we can efficiently predict the visibility between positions. In addition, evaluating visibility using our representation depends only on the local ODF associated with the source position, regardless of the location of the target. This allows scalability to the entire game world without sacrificing model inference time, at the cost of potentially larger overall memory requirements. We show that by using multi-resolution grid encoding for direction features, we can improve the expressiveness of our neural network for ODF reconstruction. Compared to raycasting in a real game environment, we not only achieve a speedup but also provide constant-time inference. Furthermore, as an efficient representation designed for game AI applications, our method can potentially serve as a fast information retrieval solution for complex systems such as spatial reasoning in the future.

Acknowledgements.
We would like to thank Georges Nader, Ludovic Denoyer, Jean-Philippe Barrette-LaPierre, Alexis Rolland, and Yves Jacquier for their valuable discussions and reviews. This work was supported by Ubisoft.

References

  • Bittner and Wonka (2003) Jiří Bittner and Peter Wonka. 2003. Visibility in computer graphics. Environment and Planning B: Planning and Design 30, 5 (2003), 729–755.
  • Bourg and Seemann (2004) David M Bourg and Glenn Seemann. 2004. AI for game developers. O'Reilly Media, Inc.
  • Cao and Johnson (2023) Ang Cao and Justin Johnson. 2023. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 130–141.
  • Chen et al. (2023) Zhiqin Chen, Thomas Funkhouser, Peter Hedman, and Andrea Tagliasacchi. 2023. MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures. In The Conference on Computer Vision and Pattern Recognition (CVPR).
  • Chibane et al. (2020) Julian Chibane, Aymen Mir, and Gerard Pons-Moll. 2020. Neural Unsigned Distance Fields for Implicit Function Learning. In Advances in Neural Information Processing Systems (NeurIPS).
  • Dill (2019) Kevin Dill. 2019. Spatial reasoning for strategic decision making. In Game AI Pro 360. CRC Press, 125–148.
  • Dou et al. (2023) Yishun Dou, Zhong Zheng, Qiaoqiao Jin, Bingbing Ni, Yugang Chen, and Junxiang Ke. 2023. Real-Time Neural BRDF with Spherically Distributed Primitives. arXiv preprint arXiv:2310.08332 (2023).
  • Dutordoir et al. (2020) Vincent Dutordoir, Nicolas Durrande, and James Hensman. 2020. Sparse Gaussian processes with spherical harmonic features. In International Conference on Machine Learning. PMLR, 2793–2802.
  • Feng et al. (2022) Brandon Y Feng, Yinda Zhang, Danhang Tang, Ruofei Du, and Amitabh Varshney. 2022. PRIF: Primary Ray-Based Implicit Function. In European Conference on Computer Vision. Springer, 138–155.
  • Foley and Sugerman (2005) Tim Foley and Jeremy Sugerman. 2005. KD-tree acceleration structures for a GPU raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware. 15–22.
  • Forbus et al. (2002) Kenneth D Forbus, James V Mahoney, and Kevin Dill. 2002. How qualitative spatial reasoning can improve strategy game AIs. IEEE Intelligent Systems 17, 4 (2002), 25–30.
  • Fridovich-Keil and Yu et al. (2022) Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. 2022. Plenoxels: Radiance Fields without Neural Networks. In CVPR.
  • Glassner (1989) Andrew S Glassner. 1989. An introduction to ray tracing. Morgan Kaufmann.
  • González (2010) Álvaro González. 2010. Measurement of areas on a sphere using Fibonacci and latitude–longitude lattices. Mathematical Geosciences 42 (2010), 49–64.
  • Gorski et al. (2005) Krzysztof M Gorski, Eric Hivon, Anthony J Banday, Benjamin D Wandelt, Frode K Hansen, Martin Reinecke, and Matthias Bartelmann. 2005. HEALPix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal 622, 2 (2005), 759.
  • Hedman et al. (2021) Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul Debevec. 2021. Baking Neural Radiance Fields for Real-Time View Synthesis. ICCV (2021).
  • Houchens et al. (2022) Trevor Houchens, Cheng-You Lu, Shivam Duggal, Rao Fu, and Srinath Sridhar. 2022. NeuralODF: Learning Omnidirectional Distance Fields for 3D Shape Representation. arXiv:2206.05837 [cs.CV]
  • Jack (2019) Matthew Jack. 2019. Tactical position selection: An architecture and query language. In Game AI Pro 360. CRC Press, 1–24.
  • Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (July 2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
  • Klimaszewski (1994) Krzysztof S Klimaszewski. 1994. Faster ray tracing using adaptive grids and area sampling. Brigham Young University.
  • Klosowski et al. (1998) James T Klosowski, Martin Held, Joseph SB Mitchell, Henry Sowizral, and Karel Zikan. 1998. Efficient collision detection using bounding volume hierarchies of k-DOPs. IEEE transactions on Visualization and Computer Graphics 4, 1 (1998), 21–36.
  • Liu and Yang (2023) Zhuoman Liu and Bo Yang. 2023. RayDF: Neural Ray-surface Distance Fields with Multi-view Consistency. arXiv:2310.19629 [cs.CV]
  • Luiten et al. (2024) Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. 2024. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. In 3DV.
  • Majercik et al. (2019) Zander Majercik, Jean-Philippe Guertin, Derek Nowrouzezahrai, and Morgan McGuire. 2019. Dynamic Diffuse Global Illumination with Ray-Traced Irradiance Fields. Journal of Computer Graphics Techniques (JCGT) 8, 2 (5 June 2019), 1–30. http://jcgt.org/published/0008/02/01/
  • McGuire et al. (2017) Morgan McGuire, Mike Mara, Derek Nowrouzezahrai, and David Luebke. 2017. Real-time global illumination using precomputed light field probes. In Proceedings of the 21st ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (San Francisco, California) (I3D ’17). Association for Computing Machinery, New York, NY, USA, Article 2, 11 pages. https://doi.org/10.1145/3023368.3023378
  • McIntosh (2019) Travis McIntosh. 2019. Human Enemy AI in The Last of Us. In Game AI Pro 360. CRC Press, 13–24.
  • Mildenhall et al. (2020) Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
  • Möller and Trumbore (2005) Tomas Möller and Ben Trumbore. 2005. Fast, minimum storage ray/triangle intersection. In ACM SIGGRAPH 2005 Courses. 7–es.
  • Müller (2006) Claus Müller. 2006. Spherical harmonics. Vol. 17. Springer.
  • Müller et al. (2022) Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. 2022. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. ACM Trans. Graph. 41, 4, Article 102 (July 2022), 15 pages. https://doi.org/10.1145/3528223.3530127
  • Ost et al. (2021) Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. 2021. Neural Scene Graphs for Dynamic Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2856–2865.
  • Powers (2008) David Powers. 2008. Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation. Mach. Learn. Technol. 2 (01 2008).
  • Rahaman et al. (2019) Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. 2019. On the spectral bias of neural networks. In International Conference on Machine Learning. PMLR, 5301–5310.
  • Smits (2005) Brian Smits. 2005. Efficiency issues for ray tracing. In ACM SIGGRAPH 2005 Courses. 6–es.
  • Suda and Takami (2002) Reiji Suda and Masayasu Takami. 2002. A fast spherical harmonics transform algorithm. Mathematics of computation 71, 238 (2002), 703–715.
  • Swinbank and James Purser (2006) Richard Swinbank and R James Purser. 2006. Fibonacci grids: A novel approach to global modelling. Quarterly Journal of the Royal Meteorological Society: A journal of the atmospheric sciences, applied meteorology and physical oceanography 132, 619 (2006), 1769–1793.
  • Tancik et al. (2022) Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben P. Mildenhall, Pratul Srinivasan, Jonathan T. Barron, and Henrik Kretzschmar. 2022. Block-NeRF: Scalable Large Scene Neural View Synthesis. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8238–8248. https://doi.org/10.1109/CVPR52688.2022.00807
  • Tancik et al. (2020) Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. 2020. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS (2020).
  • Turki et al. (2022) Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. 2022. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12922–12931.
  • Ueda et al. (2022) Itsuki Ueda, Yoshihiro Fukuhara, Hirokatsu Kataoka, Hiroaki Aizawa, Hidehiko Shishido, and Itaru Kitahara. 2022. Neural Density-Distance Fields. In Proceedings of the European Conference on Computer Vision.
  • Wald et al. (2014) Ingo Wald, Sven Woop, Carsten Benthin, Gregory S. Johnson, and Manfred Ernst. 2014. Embree: A Kernel Framework for Efficient CPU Ray Tracing. ACM Trans. Graph. 33, 4, Article 143 (jul 2014), 8 pages. https://doi.org/10.1145/2601097.2601199
  • Walsh (2019) Martin Walsh. 2019. Modeling Perception and Awareness in Tom Clancy’s Splinter Cell Blacklist. In Game AI Pro 360. CRC Press, 73–86.
  • Welsh (2013) Rich Welsh. 2013. Crytek’s Target Tracks Perception System. Game AI Pro: Collected Wisdom of Game AI Professionals 403 (2013), 411.
  • Yu et al. (2021) Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. 2021. Plenoctrees for real-time rendering of neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5752–5761.