
Efficient Visibility Approximation for Game AI using Neural Omnidirectional Distance Fields

Zhi Ying [email protected] (0009-0008-8390-3366), Ubisoft La Forge, Shanghai, China; Nicholas Edwards [email protected] (0009-0007-4621-541X), Ubisoft, Montreal, Canada; and Mikhail Kutuzov [email protected] (0009-0004-4049-2924), Ubisoft, Montreal, Canada
Abstract.

Visibility information is critical in game AI applications, but the computational cost of raycasting-based methods poses a challenge for real-time systems. To address this challenge, we propose a novel method that represents a partitioned game scene as neural Omnidirectional Distance Fields (ODFs), allowing scalable and efficient visibility approximation between positions without raycasting. For each position of interest, we map its omnidirectional distance data from the spherical surface onto a UV plane. We then use multi-resolution grids and bilinearly interpolated features to encode directions. This allows us to use a compact multi-layer perceptron (MLP) to reconstruct the high-frequency directional distance data at these positions, ensuring fast inference speed. We demonstrate the effectiveness of our method through offline experiments and in-game evaluation. For in-game evaluation, we conduct a side-by-side comparison with raycasting-based visibility tests in three different scenes. Using a compact MLP (128 neurons and 2 layers), our method achieves an average cold start speedup of 9.35 times and warm start speedup of 4.8 times across these scenes. In addition, unlike the raycasting-based method, whose evaluation time is affected by the characteristics of the scenes, our method’s evaluation time remains constant.

Visibility for Game AI, Line of Sight, Omnidirectional Distance Fields, Neural Implicit Representation, Multi-resolution Grid Encoding
copyright: acmlicensed; journal: PACMCGIT; journal year: 2024; volume: 7; number: 1; article: 21; publication month: 5; doi: 10.1145/3651289; ccs: Computing methodologies – Visibility; Computing methodologies – Shape representations; Computing methodologies – Reconstruction
Figure 1. This work presents an efficient approach to approximate visibility between 3D positions in static game scenes using a neural Omnidirectional Distance Field (ODF) representation. It enables game AI entities, such as NPCs, to obtain visibility information without raycasting. The evaluation process of our method is as follows: (1) From the bird's-eye view of a 2D grid partitioned game scene, the visibility evaluation takes place between the source position $s$ and the target position $t$, where $\hat{u}$ represents the normalized direction from $s$ to $t$. (2) The local ODF representation associated with the partitioned area containing $s$ is inferred; the positional encoding $\gamma$ encodes the position, while the multi-resolution grid encodes the direction. These encoded features are concatenated and fed into an MLP. (3) The MLP returns the distance value originating from $s$ in the direction $\hat{u}$. (4) The predicted visibility result is computed by comparing the inferred distance with the Euclidean distance between $s$ and $t$.

1. Introduction

In video games, visibility or line-of-sight information is crucial for AI entities such as non-player characters (NPCs) to build spatial knowledge about the game world (Bourg and Seemann, 2004). It is often used to make real-time decisions, which can significantly affect the ability of NPCs to appear intelligent. The most basic result of a visibility test is a boolean value describing whether two positions in the game world are visible to each other, i.e., $V(s,\,t) \in \{0, 1\}$. Advanced systems for building complex and believable behaviors for game AI rely heavily on it (Walsh, 2019; Welsh, 2013; McIntosh, 2019; Jack, 2019).

For game AI, the general method of obtaining visibility information is through raycasting (Glassner, 1989). Given a ray, this approach determines the first intersection with a scene object. However, the time required increases with the geometric complexity of the scene. Furthermore, since game AI programs typically run in CPU processes, performing a large number of visibility tests can lead to a computational bottleneck.

To compensate for the high computational cost, game developers often limit the number of visibility tests that are processed in real time. However, this trade-off can have a significant impact on the complexity and believability of NPC behavior. Limiting visibility information can lead to simplistic and unrealistic NPC decision making, limiting the range of actions that NPCs can take in response to their environment. It is therefore important to develop more efficient methods for assessing visibility.

Inspired by recent research on modeling 3D objects and scenes with neural implicit representations (Mildenhall et al., 2020; Chibane et al., 2020; Müller et al., 2022; Liu and Yang, 2023), we propose to represent the surface of the game scene geometry with a continuous distance function that can be optimized by parametric encoding and neural networks to effectively approximate visibility. Due to the varying size of game scenes, we employ a partitioning strategy to divide the entire game scene into smaller areas, and we represent each area as a local ODF. The ODF can be expressed as a scalar function that returns the distance to the closest surface based on a given source position $p$ and direction $\hat{d}$. Formally, it is defined as

(1) $ODF(p,\,\hat{d}) \coloneqq \min(\lVert p - p^{I} \rVert_{2}),\quad p^{I} \in \partial\Omega,$

where $p^{I} = p + \lambda\hat{d},\ \lambda \geq 0$. Using this representation, we can efficiently evaluate whether there are occluders between two positions $s$ and $t$ and compute their visibility:

(2) $V(s,\,t) = \begin{cases} 1, & \text{if } ODF_{i}(s,\,\hat{u}) > \lVert s - t \rVert_{2} \\ 0, & \text{otherwise} \end{cases} \quad \text{where} \quad \hat{u} = \dfrac{t - s}{\lVert t - s \rVert_{2}} \quad \text{and} \quad 1 \leq i \leq N$

Here $N$ is the total number of partitions and $i$ is the partition index associated with the source position $s$. The visibility evaluation process depends on only one of these local ODFs, regardless of the location of the target position. This allows us to represent each area of the game scene with a smaller neural network and achieve fast evaluation; it also allows developers to selectively enable ODF representations in disconnected areas of the game scene. There is a trade-off between computational cost and overall memory usage, but game developers can find their optimal configuration based on specific requirements.
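
To make Eq. (2) concrete, the following minimal Python sketch evaluates visibility between two positions; it assumes a hypothetical callable odf(s, u_hat) that dispatches to the local ODF model of the partition containing s and returns the predicted distance.

```python
import numpy as np

def visibility(s, t, odf, eps=1e-8):
    """Approximate V(s, t) following Eq. (2).

    s, t : (3,) arrays, source and target positions.
    odf  : callable (position, unit_direction) -> predicted distance,
           assumed to use the local ODF model of the partition containing s.
    """
    diff = t - s
    dist = np.linalg.norm(diff)          # Euclidean distance ||s - t||
    if dist < eps:                       # degenerate case: identical positions
        return 1
    u_hat = diff / dist                  # normalized direction from s to t
    # Visible iff the predicted distance to the nearest surface along u_hat
    # exceeds the distance to the target.
    return 1 if odf(s, u_hat) > dist else 0
```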

Standard neural implicit representations designed for view synthesis or 3D shape reconstruction require only low-frequency signals in the direction dimensions. In contrast, our ODF representation requires the capture of higher-frequency details. To address this challenge, for each source position, we map the omnidirectional distance data on the unit spherical surface onto a UV plane. On this UV plane, we employ bilinearly interpolated multi-resolution grid encoding as our direction features, complemented by positional encoding as our position features. By concatenating these features and feeding them into a compact MLP, we can reconstruct ODFs for the game scene while maintaining fast evaluation speeds.

In summary, our contributions are threefold:

  • We provide evidence that the learned neural ODF representation can be used to approximate the visibility between positions in static game scenes.

  • Evaluating visibility using the ODF depends only on the local representation associated with the source position, regardless of the target’s location. We show that by partitioning the game scene into areas and representing each area as an independent ODF, our method can be scaled to game worlds of arbitrary size while maintaining evaluation speed.

  • We show that by mapping omnidirectional distance data from the spherical surface onto a UV plane and using multi-resolution grid encoding, we are able to effectively reconstruct the distance data using a compact MLP for visibility approximation.

2. Background and Related Work

The core of our method is to replace traditional raycasting-based visibility computation with neural representation evaluation for efficiency. In this section, we provide a brief review of traditional visibility for NPC behaviors and neural implicit representations. To our knowledge, our work is the first to integrate these two domains, demonstrating promising results in a real game environment.

2.1. Visibility for NPC Behaviors

In NPC behavior design, visibility refers to knowing whether one position in a game scene can be "seen" by an object at another position. In order to design complex and believable behaviors, various parts of the NPC decision-making logic rely heavily on visibility information. This includes higher-level systems such as NPC visual perception (Walsh, 2019; Welsh, 2013; McIntosh, 2019) and tactical position selection (Jack, 2019), as well as more moment-to-moment logic, such as checking whether an agent should pull the trigger, change stance, or take some other immediate action (McIntosh, 2019). For example, when evaluating NPC movement destinations during combat, visibility information between candidate positions and the player's current position is critical.

While z-buffer scanning and raycasting (Bittner and Wonka, 2003) are common algorithms for visibility tests in graphics, NPC visibility queries require testing pairs of positions scattered throughout the game scene. Therefore, raycasting-based methods are often more efficient. However, as game scenes become more complex, raycasting algorithms face performance challenges. (Smits, 2005) highlights the complexity and considerations required to optimize raycasting. Although more efficient raycasting algorithms (Möller and Trumbore, 2005), acceleration structures (Klimaszewski, 1994; Klosowski et al., 1998; Foley and Sugerman, 2005), and hardware-accelerated libraries (Wald et al., 2014) have been developed, the cost of intensive visibility computation is still significant in many modern video games.

The probe-based irradiance field with visibility representation, as introduced in (McGuire et al., 2017; Majercik et al., 2019), is a promising advancement in real-time global illumination. By storing and constantly updating visibility information in GPU-resident probes, this technique effectively mitigates rendering problems such as light leaks. Accessing low-detail dynamic visibility results from these probes and estimating them by interpolation can be efficient (Majercik et al., 2019). However, the transition to game AI applications presents notable challenges. First, the probes are designed for rendering tasks, so they are often placed around the camera or cover only part of the scene. For game AI applications, however, the probes need to be placed and active over a larger area, which can significantly increase GPU memory usage. Second, game AI applications operate primarily on the CPU, requiring low-throughput visibility tests with minimal response latency. This makes probe-based methods less optimal.

In our approach, we leverage neural representations to approximate visibility and execute on the CPU. Approximation is acceptable because achieving 100% accuracy for each visibility test is often unnecessary in game AI applications. For example, the visibility from a position to an object is often evaluated by sampling multiple visibility tests against the volume of that object, so the sampling errors can be handled by statistical techniques.
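
As an illustration of such statistical handling, a hypothetical object-level visibility score can be estimated by averaging several point-to-point tests against positions sampled from the object's volume. The sketch below assumes a point-to-point visibility(s, t) test such as the one in Eq. (2); the function and threshold are illustrative, not part of our method.

```python
import numpy as np

def object_visibility(s, target_samples, visibility):
    """Fraction of sampled points of the target object visible from s.

    target_samples : (K, 3) array of positions sampled from the object volume.
    visibility     : callable (s, t) -> {0, 1}, e.g. an ODF-based test.
    """
    results = [visibility(s, t) for t in target_samples]
    # A consumer could, for instance, treat a score above 0.5 as "object visible";
    # averaging smooths out individual prediction errors.
    return float(np.mean(results))
```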

2.2. Neural Implicit Representations

Neural implicit representations have emerged as a powerful tool for encoding high-dimensional data as a continuous, implicit function defined by a neural network, enabling various applications such as novel view synthesis (NVS), 3D shape reconstruction, and physical simulation. For example, NVS seeks to render a scene from unseen viewpoints using a dataset of images and camera poses. In particular, Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) learns a representation of the scene using a continuous volumetric function of color density and relies on ray marching to render view-dependent images. In our task, we are more interested in distance fields that can be used to evaluate visibility.

Previous studies have explored the use of neural distance functions primarily for 3D shape reconstruction, such as predicting unsigned distance fields from sparse point clouds (Chibane et al., 2020) or learning consistent distance and density fields for improved localization (Ueda et al., 2022). However, these distance functions only model distance with respect to the surface normal and rely on algorithms such as sphere tracing to reconstruct shape, making them unsuitable for fast visibility evaluation between positions.

Classical implicit neural representations suffer from slow training and inference due to the computational complexity. To improve efficiency, previous research falls broadly into two categories: accelerating volumetric representations and non-volumetric representations.

Accelerating volumetric representations. With the introduction of auxiliary acceleration data structures, NeRFs can efficiently evaluate samples with a shallow MLP or no MLP at all. Plenoctrees (Yu et al., 2021), Plenoxels (Fridovich-Keil and Yu et al., 2022) and multiresolution hash encoding (Müller et al., 2022) use voxel grids or hash table data structures as position features, along with low-degree truncated spherical harmonics (SH) as direction features. A typical setting is SH truncated at degree 2 ($SH_2$). Sparse Neural Radiance Grid (SNeRG) (Hedman et al., 2021) bakes NeRF's continuous neural volumetric scene representation into a discrete data structure at training time, and uses a small MLP with view direction at evaluation time for efficient rendering. These methods still rely on ray marching for evaluation and thus require multiple MLP inferences for a single sample. Another notable effort to accelerate volumetric representations is through point-based radiance fields. 3D Gaussian Splatting (Kerbl et al., 2023) represents the scene using point-based 3D Gaussians and optimizes them using differentiable rendering. This allows the representation evaluation to skip unnecessary computations in empty space, resulting in real-time rendering speed.

Non-volumetric representations. Various works represent the scene using other representations to be more efficient. MobileNeRF (Chen et al., 2023) represents the scene as a triangle mesh textured by learned deep features. This method leverages a classical rasterization pipeline, incorporating a compact MLP implemented as a GLSL fragment shader. The rendering process for each sample involves a single MLP inference. As a result, it achieves interactive frame rates on mobile phones and shows that volumetric representations can be baked onto a proxy surface geometry for efficient rendering. Ray-based implicit 3D shape representations (Feng et al., 2022; Liu and Yang, 2023) are another type of non-volumetric representation. They share a similar concept with our work, modeling distance fields based on rays. However, their ray parameterizations are tailored for 3D shape reconstruction, which limits their application to single objects or small scenes. In contrast, our focus is on rays that originate from selected sparse source positions and can be easily scaled to cover the entire game world. Our approach models the omnidirectional distance originating from these positions, allowing us to efficiently test occluders between them and arbitrary target positions. In particular, previous work has introduced the concept of a neural ODF for 3D reconstruction using recursive inference (Houchens et al., 2022), the computation time of which is impractical for our task.

Although some NeRFs can handle scenes the size of a single room or building, they are generally still limited and do not scale to large environments such as the game world. Researchers in the field have proposed to extend NeRF using architectures that align multiple NeRFs in a distributed manner, as shown in (Tancik et al., 2022; Turki et al., 2022). Our approach follows a similar concept of partitioning the entire scene. However, unlike previous methods that use an end-to-end architecture to dynamically select NeRFs for rendering tasks, we exploit the fact that the evaluation of visibility depends only on the local ODF representation. This insight allows us to partition the game scene through a simple grouping of source positions, paving the way for a scalable and adaptable application.

3. Method

Our method can be seen as a precomputation approach that learns the scene geometry in an offline process and then evaluates the learned representation online. In this section, we outline the collection of training data, introduce the multi-resolution grid encoding used to reconstruct high-frequency signals in direction dimensions, and describe the details of model training and evaluation.

Figure 2. Omnidirectional distance data collection at a source position. (a) Directions are distributed on the spherical surface by a Fibonacci lattice; this visualization shows a total number of directions $P = 201$. (b) Collected omnidirectional distance data at one position of our offline experiment scene, with $P = 40001$.

3.1. Training Data Collection

For positions of interest in the game scene, the input variables of the ODF can be expressed as rays $r = (p_r,\,d_r)$, where $p_r \in \mathbb{R}^3$ represents the position and $d_r \in \mathbb{S}^2$ represents the normalized direction. The ODF distance value associated with each ray can be collected in the game scene through raycasting.

In practice, we can only generate a finite number of rays for data collection. To distribute the rays evenly across the surface of the unit sphere, we employ the Fibonacci lattice (Swinbank and James Purser, 2006; González, 2010) as a basic approximation. Let $N$ be any natural number and let $i \in [-N,\,+N]$ be an integer. The $i$th direction in radians is

(3) $\begin{split} lat_{i} &= \arcsin\left(\frac{2i}{2N+1}\right), \\ lon_{i} &= 2\pi i\phi^{-1}, \end{split}$

where

(4) $\begin{gathered} \phi = \frac{1+\sqrt{5}}{2} = \lim_{n\to\infty}\left(\frac{F_{n+1}}{F_{n}}\right), \\ F_{0} = 0, \quad F_{1} = 1, \quad F_{n} = F_{n-1} + F_{n-2} \quad \text{for } n > 1 \end{gathered}$

The number of directions is

(5) $P = 2N + 1.$
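
A minimal NumPy sketch of this Fibonacci-lattice direction sampling (Eqs. (3)–(5)) follows; the conversion from latitude/longitude to Cartesian unit vectors is an assumption about the raycasting interface, and names are illustrative.

```python
import numpy as np

def fibonacci_directions(N):
    """Generate P = 2N + 1 approximately uniform directions on the unit sphere."""
    phi = (1.0 + np.sqrt(5.0)) / 2.0             # golden ratio, Eq. (4)
    i = np.arange(-N, N + 1)                      # integer i in [-N, +N]
    lat = np.arcsin(2.0 * i / (2 * N + 1))        # Eq. (3), latitude in radians
    lon = 2.0 * np.pi * i / phi                   # Eq. (3), longitude in radians
    # Convert spherical (lat, lon) to Cartesian unit vectors for raycasting.
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.stack([x, y, z], axis=-1)           # shape (2N + 1, 3)
```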

After collecting data at all positions of interest, partitioning can be implemented as a post-processing step. This involves virtually grouping rays based on their position $p_r$, for example, grouping them by the voxels to which $p_r$ belongs.

3.2. Multi-resolution Grid Encoding

Figure 3. The effect of direction encoding on the high-frequency expressiveness of the neural network in ODF reconstruction. As the truncated SH degree used for direction encoding increases, the distance data reconstructed by the trained neural network contains more details.

Standard MLPs are known to have a learning bias toward smooth functions, making them poorly suited for low-dimensional coordinate-based vision and graphics tasks (Rahaman et al., 2019). To overcome this limitation, neural representations employ encoding methods that map inputs to higher dimensions, improving the expressiveness for high-frequency signals. Unlike other representations designed for volumetric rendering or 3D shape reconstruction (Yu et al., 2021; Fridovich-Keil and Yu et al., 2022; Müller et al., 2022), whose common practice is to use low-degree truncated SH to encode directions, e.g. $SH_2$, our neural ODF representation requires capturing high-frequency details for directions. This is demonstrated by an experiment of ODF reconstruction at a single position. In the experiment, we use several truncated SH degrees to encode directions: the greater the truncated SH degree, the more SH coefficients are used, and the more high-frequency details the trained neural network can capture, as shown in Fig. 3.

However, high-degree SH encoding is computationally expensive. The direct computation scales as $\mathcal{O}(N^3)$ (Müller, 2006). Although there are some methods to reduce the computational cost (Suda and Takami, 2002; Dutordoir et al., 2020), we decided to develop a more efficient approach.

Figure 4. Illustration of multi-resolution grid encoding. (a) Spherical view of a two-level resolution grid in the longitude-latitude projection, where the sample direction is represented as "$\mathbf{x}$". (b) The (partial) UV plane view with the same grid and direction. The four nearest texels that store relevant features are highlighted for each grid.

To address the encoding challenge while aiming for fast evaluation, we consider the omnidirectional distance data as a 2D spherical surface, so that it can be mapped onto the UV plane for 2D reconstruction. Inspired by the trade-off between memory and evaluation time made by discretized volumetric representations (Hedman et al., 2021; Müller et al., 2022), we employ a multi-resolution grid data structure to store our parametric direction features. Each grid is rectangular with an approximately $2:1$ ratio between the longitude and latitude axes to match their value ranges. The grid settings used in our offline experiments are a coarsest resolution of $16 \times 8$ and a finest resolution of $256 \times 128$, with the number of levels $L = 16$ and the length of the feature stored in each texel $F = 5$. For each ray direction $d_r$, the direction features are bilinearly interpolated from the four nearest texels on the UV plane and concatenated across the multiple resolutions of the grid. We illustrate a two-level resolution grid example in Fig. 4.
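
The following PyTorch sketch illustrates one possible form of this encoding under a longitude-latitude mapping; the class name, grid initialization, resolution schedule, and the naive handling of the longitude wrap-around are illustrative assumptions rather than our exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class MultiResGridEncoding(nn.Module):
    def __init__(self, levels=16, feat_dim=5, coarse=(8, 16), fine=(128, 256)):
        super().__init__()
        # Per-level learnable feature grids of shape (1, F, H, W); H:W is about 1:2
        # to match the latitude/longitude value ranges.
        self.grids = nn.ParameterList()
        for l in range(levels):
            t = l / max(levels - 1, 1)
            h = int(round(coarse[0] * (fine[0] / coarse[0]) ** t))
            w = int(round(coarse[1] * (fine[1] / coarse[1]) ** t))
            self.grids.append(nn.Parameter(0.01 * torch.randn(1, feat_dim, h, w)))

    def forward(self, d):
        """d: (B, 3) unit directions -> (B, L * F) concatenated features."""
        lon = torch.atan2(d[:, 1], d[:, 0])                 # [-pi, pi]
        lat = torch.asin(d[:, 2].clamp(-1.0, 1.0))          # [-pi/2, pi/2]
        u = lon / math.pi                                    # normalize to [-1, 1]
        v = lat / (math.pi / 2.0)                            # normalize to [-1, 1]
        uv = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)   # grid_sample coordinates
        feats = []
        for g in self.grids:
            # Bilinear interpolation of the four nearest texels at this level.
            f = F_nn.grid_sample(g, uv, mode='bilinear', align_corners=True)
            feats.append(f.view(g.shape[1], -1).t())          # (B, F)
        return torch.cat(feats, dim=-1)                       # (B, L * F)
```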

3.3. Learning and Evaluating ODF

In addition to the multi-resolution grid used for direction encoding, we map the position $p_r$ to a higher-dimensional space using hash encoding (Müller et al., 2022) or positional encoding (PE) (Mildenhall et al., 2020). These features are then concatenated and fed into a compact MLP. To optimize the neural network, we perform direct regression of the ground-truth distance, minimizing the mean squared error with the Adam optimizer. To prevent divergence, we apply weak weight decay regularization (coefficient $10^{-5}$) to the MLP.
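
A hedged PyTorch sketch of how a local ODF model along these lines could be assembled and optimized is shown below: PE for the position, a direction encoder such as the MultiResGridEncoding sketch in Section 3.2, a compact MLP, MSE loss, Adam, and weak weight decay on the MLP. Module and variable names are illustrative assumptions, not our in-engine implementation.

```python
import torch
import torch.nn as nn

def positional_encoding(p, n_freq=6):
    """Map positions to [sin(2^k p), cos(2^k p)] features (Mildenhall et al., 2020)."""
    feats = []
    for k in range(n_freq):
        feats += [torch.sin(2.0 ** k * p), torch.cos(2.0 ** k * p)]
    return torch.cat(feats, dim=-1)            # (B, 3 * 2 * n_freq)

class LocalODF(nn.Module):
    def __init__(self, dir_encoder, n_freq=6, hidden=128, layers=4):
        super().__init__()
        self.dir_encoder = dir_encoder          # e.g. MultiResGridEncoding()
        self.n_freq = n_freq
        in_dim = 3 * 2 * n_freq + sum(g.shape[1] for g in dir_encoder.grids)
        mlp = []
        for i in range(layers):
            mlp += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
        mlp += [nn.Linear(hidden, 1)]
        self.mlp = nn.Sequential(*mlp)

    def forward(self, p, d):
        x = torch.cat([positional_encoding(p, self.n_freq), self.dir_encoder(d)], dim=-1)
        return self.mlp(x).squeeze(-1)          # predicted distance along d from p

# Direct regression of the ground-truth distance with MSE, Adam,
# and weak weight decay applied to the MLP parameters only.
model = LocalODF(MultiResGridEncoding())
optimizer = torch.optim.Adam([
    {'params': model.mlp.parameters(), 'weight_decay': 1e-5},
    {'params': model.dir_encoder.parameters(), 'weight_decay': 0.0},
], lr=1e-3)
# for p_batch, d_batch, dist_batch in loader:   # hypothetical data loader
#     loss = nn.functional.mse_loss(model(p_batch, d_batch), dist_batch)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```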

Directly evaluating performance on randomly split data is not ideal due to ray aliasing (Feng et al., 2022). If we consider moving the origin position $p_r$ of a ray along its direction $\hat{d_r}$, the new ray $r^{\prime} = (p_{r^{\prime}},\,\hat{d_r})$ is an aliased version of $r$. This aliasing effect can also be observed from the ODF perspective, as highlighted by (Houchens et al., 2022) with the equation:

(6) $\nabla_{[\hat{d},\,\hat{0}]}\, ODF(p,\,\hat{d}) = -1,$

we can rewrite it as

(7) $ODF(p + \lambda\hat{d},\,\hat{d}) = ODF(p,\,\hat{d}) - \lambda, \quad \lambda \in [-ODF(p,\,-\hat{d}),\ ODF(p,\,\hat{d})].$

A direct evaluation based on a randomly partitioned training and testing set of these rays could result in overestimated metrics.

To mitigate ray aliasing in performance evaluation, we construct our testing dataset in the game scene separately by simulating visibility tests between candidate NPC movement destination positions and player positions. Each candidate position corresponds to one of the source positions where we collected omnidirectional distance data, while each target position is chosen randomly from the entire game scene. Additionally, we collect the results of raycasting-based visibility tests as the ground-truth labels, denoted as $y$. The testing dataset is represented as:

(8) $\mathcal{D} = \{(x^{(i)},\,y^{(i)})\}_{i=1}^{M}, \quad y \in \{0,1\}, \quad x = (s,\,t), \quad s \in \mathcal{P}, \quad t \in \mathcal{Q},$

where $\mathcal{P}$ represents positions of interest and $\mathcal{Q}$ represents all positions within the game scene.
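
A minimal sketch of how such a testing set could be assembled is given below; the raycast_visible(s, t) ground-truth test and the axis-aligned scene bounds are hypothetical interfaces used only for illustration.

```python
import numpy as np

def build_test_set(source_positions, scene_bounds, raycast_visible, M, rng=None):
    """Simulated visibility tests following Eq. (8): sources from the positions of
    interest, targets sampled from the whole scene, labels from raycasting.

    source_positions : (K, 3) array of positions where ODF data was collected.
    scene_bounds     : (lo, hi) pair of (3,) arrays bounding the game scene.
    raycast_visible  : callable (s, t) -> {0, 1}, ground-truth raycasting test.
    """
    rng = rng or np.random.default_rng(0)
    lo, hi = scene_bounds
    dataset = []
    for _ in range(M):
        s = source_positions[rng.integers(len(source_positions))]
        t = rng.uniform(lo, hi)                             # random target in the scene
        dataset.append(((s, t), raycast_visible(s, t)))     # (x, y) pair
    return dataset
```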

4. Experiments

4.1. Offline Experiment

Figure 5. Offline evaluation environment. (a) Perspective view. (b) Bird's-eye view with partitioning.

To establish an offline environment for convenient performance comparison with baselines, we first collect training and testing data in a game scene using the methods described in Section 3.1 and Section 3.3. The data is exported as binary files for convenient integration with machine learning frameworks such as PyTorch. We then post-process the training data by partitioning the 6227 source positions using a simple 2D grid scheme, as shown in Fig. 5(b). Based on this $8 \times 8$ partitioning scheme, we create 64 independently initialized small models for each method under evaluation. During both the training and testing phases, each sample infers only one of these models, determined by the local partition index associated with the given source position.
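
A minimal sketch of how the partition index for a source position could be computed under this 2D grid scheme follows; the assumption that the horizontal plane is spanned by the first two coordinates, and the row-major indexing, are illustrative choices.

```python
import numpy as np

def partition_index(p, scene_min, scene_max, grid=(8, 8)):
    """Row-major 2D grid partition index for a source position (bird's-eye plane)."""
    p2 = np.asarray(p, dtype=float)[:2]
    lo = np.asarray(scene_min, dtype=float)[:2]
    hi = np.asarray(scene_max, dtype=float)[:2]
    cell = np.floor((p2 - lo) / (hi - lo) * np.asarray(grid))
    cell = np.clip(cell, 0, np.asarray(grid) - 1).astype(int)  # clamp to valid cells
    return cell[0] * grid[1] + cell[1]

# models: list of 64 independently trained local ODF models (one per partition)
# dist = models[partition_index(s, scene_min, scene_max)](s, u_hat)
```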

Table 1. Offline evaluation for encoding methods.
Method | ↑ Acc. | ↑ Prec. | ↑ Recall | ↑ F1 | ↓ Params # (k) | ↓ Time (ms)
No mapping | 0.8721 | 0.9032 | 0.7151 | 0.7982 | 50 | 0.27 ± 0.03
Positional encoding | 0.8913 | 0.9136 | 0.7651 | 0.8328 | 54 | 0.41 ± 0.02
Hash encoding ($SH_2$) | 0.8750 | 0.9054 | 0.7222 | 0.8035 | 2152 | 6.44 ± 0.19
Hash encoding ($SH_{12}$) | 0.9029 | 0.9303 | 0.7843 | 0.8511 | 2172 | 33.54 ± 4.94
FFM ($SH_{12}$) | 0.8975 | 0.9272 | 0.7708 | 0.8418 | 104 | 29.70 ± 0.47
Ours (Hash, Long-Lat) | 0.9079 | 0.9328 | 0.7971 | 0.8596 | 3118 | 12.66 ± 0.55
Ours (Hash, Mercator) | 0.9081 | 0.9336 | 0.7940 | 0.8594 | 3118 | 12.88 ± 0.37
Ours (PE, Long-Lat) | 0.9034 | 0.9167 | 0.7996 | 0.8542 | 1022 | 6.74 ± 0.11
Ours (PE, Mercator) | 0.9031 | 0.9221 | 0.7931 | 0.8528 | 1022 | 7.00 ± 0.16

The evaluation results are presented in Table 1. All these experiments are performed offline with PyTorch. The prediction accuracy is evaluated through classification metrics (Powers, 2008). The inference time is benchmarked on a Xeon W-2135 CPU and is constrained to single-threaded execution without batching, simulating the typical execution of visibility tests in game AI applications.

In the positional encoding (Mildenhall et al., 2020) experiment, we use 6 exponential growth frequencies ($n_{\text{freq}} = 6$) for both position and direction inputs. In the hash encoding (Müller et al., 2022) experiment, we use the setting $(F = 2,\,L = 16)$ with a hash table size of $2^{16}$. For its direction encoding we employ SH truncated at degree 2 and degree 12. In the Fourier Feature Mapping (FFM) (Tancik et al., 2020) experiment, we use a standard deviation $\sigma = 1.0$ for the Gaussian distribution to encode positions and SH truncated at degree 12 to encode directions. In our method, for directions, we experimented with the longitude-latitude projection and the Mercator projection for spherical mapping; the multi-resolution grid setting is described in Section 3.2, which is $(F = 5,\,L = 16)$. For positions, we use hash encoding and PE with $n_{\text{freq}} = 6$. The hidden size of the MLPs is 128 neurons and 4 layers for all methods.

For all methods, there is a trade-off between prediction performance, memory, and computational cost. From the results, our method with PE shows promising metrics in both prediction performance and computational cost, at a reasonable memory cost. With hash encoding, our method achieves slightly better prediction performance but slower speed. We do not see much difference between the two projection methods with the experimented grid settings. Due to its smoothness in reconstructing omnidirectional distance data, hash encoding with SH truncated at degree 2 has a prediction performance similar to the no-mapping baseline. In our experiments, we observed that while positional encoding can yield commendable results with appropriate frequency configurations, its effectiveness is highly sensitive, requiring hyperparameter tuning across different scenes. Hash encoding and FFM with SH truncated at degree 12 are much slower due to their SH computations. Among the different versions of our method, we choose PE with the longitude-latitude projection for further in-game experiments due to its simplicity, faster speed, and lower memory cost.

Figure 6. Visibility test throughput (kilo tests per second) of our method (PE, Long-Lat) for different batch sizes, tested on CPU and GPU using PyTorch.

Certain applications require the use of multi-threaded CPU or GPU batching operations to perform high-throughput visibility testing. Our approach benefits from the integration of parallel and vectorized computation with batching. The scalability of the inference throughput of our method for both CPU and GPU is illustrated in Fig. 6. The throughput depends on the hardware and inference framework used; we present the measurements tested using PyTorch, a Xeon W-2135 CPU, and an RTX A5000 GPU.

4.2. Memory Usage Analysis

Figure 7. Comparison of memory composition between omnidirectional depth maps and our neural ODF representation. (a) Shows omnidirectional depth maps from 6 source positions within a local partition. (b) Shows the neural ODF representation with a 3-level resolution grid for the same partition. The visualization shows the first 3 dimensions of the feature vectors of the grids.

Our method involves compressing omnidirectional depth maps for all source positions within the local partition into a single neural ODF representation. Fig. 7 illustrates a comparison of the memory composition. To analyze the difference in memory usage, we conduct an experiment using the offline environment with the partitioning and settings of our method described in Section 4.1. Each partition in the experiment contained an average of 100 source positions. For PNG DEFLATE lossless compression, we used ZLIB compression level 6.

Table 2. Memory usage per partition.
Representation | Uncompressed | PNG Compressed
Depth Map (256 × 128) | 13.1 MB | 559 KB
Depth Map (512 × 256) | 52.4 MB | 1.3 MB
Ours | 4.95 MB | –

As shown in Table 2, for each partition, our neural representation has a significantly smaller memory size than the uncompressed depth maps, although it remains larger than the PNG-compressed depth maps. Traditional image compression methods have their own drawbacks when applied to game AI applications, such as additional computation time, inconvenience in evaluating individual pixels, and lack of generalization ability. To further optimize the memory size of our neural representation, future considerations may include reducing weight precision to 16-bit float, applying quantization, or using smaller feature vectors. For the scope of this study, our primary focus is on prediction accuracy and inference time in a real game environment, which will be discussed in the following section.

4.3. In-game Evaluation

Figure 8. Three types of scene for in-game evaluation. Environment A represents a compact outdoor environment, Environment B represents a wide outdoor environment, Environment C represents a multi-level indoor environment.
Table 3. In-game evaluation results.
Method | Environment A (809k triangles) | Environment B (1078k triangles) | Environment C (136k triangles)
  | Cold (µs) / Warm (µs) / Acc. | Cold (µs) / Warm (µs) / Acc. | Cold (µs) / Warm (µs) / Acc.
Raycasting | 55.28 / 8.79 / – | 78.79 / 13.46 / – | 62.37 / 10.84 / –
Ours (128 × 4) | 10.0 / 4.5 / 0.950 | 10.0 / 4.5 / 0.897 | 10.0 / 4.5 / 0.990
Ours (128 × 2) | 7.0 / 2.3 / 0.950 | 7.0 / 2.3 / 0.894 | 7.0 / 2.3 / 0.989
Ours (64 × 2) | 4.0 / 1.3 / 0.947 | 4.0 / 1.3 / 0.886 | 4.0 / 1.3 / 0.988
Ours (32 × 2) | 3.0 / 1.1 / 0.940 | 3.0 / 1.1 / 0.881 | 3.0 / 1.1 / 0.985

To evaluate our method in a real game environment, we perform a side-by-side comparison with raycasting using an AAA game engine. The raycasting implementation is based on a commonly used commercial product called Havok Physics. The maximum raycasting distance for both data collection and baseline comparison is set to 100 meters. Three types of scenes are selected as our evaluation environment, as shown in Fig. 8. These scenes are partitioned into 16 m × 16 m × 16 m voxels, and each partition contains a different number of source positions of interest, ranging from 30 to 200. The evaluation results for execution time and visibility prediction accuracy are presented in Table 3. Due to the significant impact of CPU caching on execution time, we collect both cold start and warm start average execution times for all tasks. The execution time and accuracy of our method are measured based on a C++ implementation and executed on a single-threaded CPU process without batching. We use the (PE, Long-Lat) version of our method. In addition, various sizes of MLPs are evaluated, while keeping the encoding settings the same ($n_{\text{freq}} = 6$, $F = 2$, $L = 16$), so the input size of the MLPs is $2 \times 3 \times 6 + 2 \times 16 = 68$.

Table 4. In-game memory usage estimation.
  | Raycasting | Part. # | Ours (128 × 4) | Ours (128 × 2) | Ours (64 × 4) | Ours (32 × 2)
Environment A | 13.3 MB | 56 | 104.7 MB | 97.3 MB | 95.4 MB | 92.3 MB
Environment B | 30.5 MB | 80 | 149.5 MB | 139 MB | 136.2 MB | 131.9 MB
Environment C | 6.9 MB | 30 | 56.1 MB | 52.1 MB | 51.1 MB | 49.5 MB

For each partition, using 32-bit float precision, the memory size of the multi-resolution grid is approximately 1.64 MB, while the memory size of the MLP ranges from 13 KB to 234 KB. The observed memory usage is lower than in the offline experiments due to the use of smaller feature vectors and MLPs for in-game evaluation. To compare memory usage with raycasting, as shown in Table 4, we estimate raycasting memory usage by assessing the memory size of the collision geometry at the level of detail employed for raycasting. For our methods, we estimate memory usage by multiplying the number of partitions for each environment by the model size of each partition.

Due to differing scene characteristics, such as average collision distance and number of triangles, the execution time of raycasting-based visibility tests varies across the three scenes. In contrast, our inference time remains the same across the three scenes because we are using exactly the same model architecture. This is a potentially desirable property for game AI applications, as it makes the execution time more predictable.

Figure 9. Visualizations of visibility prediction in Environment B. (a) The selected partition is visualized as an orange voxel. The source position is visualized as a blue sphere inside this voxel. Visibility target positions that were predicted correctly or incorrectly are visualized as green and red spheres, respectively. (b) Highlighted areas of predictions.

From the experimental results presented in Table 3, we observe that the most challenging scene for our method is Environment B, which contains many small objects in the open area. We hypothesize that the lower accuracy is due to the reconstruction error at object edges. As observed in the visualizations shown in Fig. 9, most of the incorrect predictions are around object edges, corresponding to the viewing direction from the source position. In the future, this problem may be solved by using a hybrid approach that detects object edges and switches between raycasting-based visibility tests and ours. For Environment C, where most of the target positions are not visible, the prediction accuracy of our method is very high (> 98.5%). Under different MLP settings and scenes, our method shows speedups ranging from 1.95× to 26.26×.

5. Discussion and Future Work

In this section, we will discuss the limitations of our method and possible directions for future work.

Direction feature enhancements. In our method, we use a multi-resolution grid for direction encoding, storing learnable, locally interpolated features. While our current techniques for sphere discretization and feature interpolation remain relatively unexplored, there is potential for enhancement and refinement. To make the features more informative, in the future we can explore sphere discretization methods with better mathematical properties, such as HEALPix (Gorski et al., 2005), data structures that take spherical curvature into account, such as the Spherical Feature-Grid (Dou et al., 2023), or better feature interpolation methods that use the geodesic distance.

Precomputation time. As a precomputation method, our approach requires learning the surface geometry of the scene into another representation before it can be evaluated at runtime. A disadvantage of precomputation methods is the additional time required for offline processing. The precomputation time for our method consists of data collection time and model training time. For small areas of the game scene, this process takes several minutes, which is acceptable. However, scaling our method to the entire game world could extend the precomputation time to several hours or even days, depending on the number of positions of interest and directions. To improve the data collection process in the future, instead of relying on raycasting with predefined directions, we can explore the use of panoramic depth frames from the GPU. This improvement would allow us to efficiently collect omnidirectional distance data for positions, taking advantage of the graphics pipeline.

Dynamic scene handling is another limitation of our method. Many game scenarios involve highly dynamic scenes, including destructible walls and buildings, large moving objects such as vehicles, dynamic visual effects such as smoke and fog, etc. However, our current method is only applicable to the static parts of scenes. To overcome this limitation, future improvements could include modeling dynamic objects with temporal variations (Ost et al., 2021; Cao and Johnson, 2023; Luiten et al., 2024) or modeling variations of the game scene.

Extended game applications to replace raycasting. In addition to the visibility test use case presented in this paper, raycasting is widely used in many other aspects of video game development. For example, navigation mesh construction, pathfinding, projectile ballistics, spatial clearance tests, UI logic, and melee combat all rely heavily on raycasting techniques. While these applications may have different requirements in terms of memory, accuracy, and computational cost, by adjusting the grid and MLP settings, our method can serve as an accelerated approximation technique. It would be valuable to explore its potential to replace traditional raycasting in these other application scenarios.

Generalized representation. Spatial reasoning is a major challenge in crafting more sophisticated, believable, and intelligent NPC behavior in games. Existing approaches, such as (Forbus et al., 2002; Dill, 2019), have made progress in this area. Our method provides a potential solution to their computational challenge by learning a neural representation of the spatial environment that is not limited to ray and distance pairs. We can generalize the representation to incorporate other information obtained from the game environment for efficient information retrieval at runtime. We see many potential applications of our approach in this area.

6. Conclusion

For game AI applications, obtaining real-time visibility information is critical but computationally expensive. Our work presents the first method designed to address the computational bottleneck of visibility tests for game AI using a neural representation.

We show that by optimizing neural networks to represent scene geometry as an ODF, we can efficiently predict the visibility between positions. In addition, evaluating visibility using our representation depends only on the local ODF associated with the source position, regardless of the location of the target. This allows scalability to the entire game world without sacrificing model inference time, at the cost of potentially larger overall memory requirements. We show that by using multi-resolution grid encoding for direction features, we can improve the expressiveness of our neural network for ODF reconstruction. Compared to raycasting in a real game environment, we not only achieve a speedup but also provide constant-time inference. Furthermore, as an efficient representation designed for game AI applications, our method can potentially serve as a fast information retrieval solution for complex systems such as spatial reasoning in the future.

Acknowledgements.
We would like to thank Georges Nader, Ludovic Denoyer, Jean-Philippe Barrette-LaPierre, Alexis Rolland, and Yves Jacquier for their valuable discussions and reviews. This work was supported by Ubisoft.

References

  • Bittner and Wonka (2003) Jiří Bittner and Peter Wonka. 2003. Visibility in computer graphics. Environment and Planning B: Planning and Design 30, 5 (2003), 729–755.
  • Bourg and Seemann (2004) David M Bourg and Glenn Seemann. 2004. AI for game developers. O'Reilly Media, Inc.
  • Cao and Johnson (2023) Ang Cao and Justin Johnson. 2023. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 130–141.
  • Chen et al. (2023) Zhiqin Chen, Thomas Funkhouser, Peter Hedman, and Andrea Tagliasacchi. 2023. MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures. In The Conference on Computer Vision and Pattern Recognition (CVPR).
  • Chibane et al. (2020) Julian Chibane, Aymen Mir, and Gerard Pons-Moll. 2020. Neural Unsigned Distance Fields for Implicit Function Learning. In Advances in Neural Information Processing Systems (NeurIPS).
  • Dill (2019) Kevin Dill. 2019. Spatial reasoning for strategic decision making. In Game AI Pro 360. CRC Press, 125–148.
  • Dou et al. (2023) Yishun Dou, Zhong Zheng, Qiaoqiao Jin, Bingbing Ni, Yugang Chen, and Junxiang Ke. 2023. Real-Time Neural BRDF with Spherically Distributed Primitives. arXiv preprint arXiv:2310.08332 (2023).
  • Dutordoir et al. (2020) Vincent Dutordoir, Nicolas Durrande, and James Hensman. 2020. Sparse Gaussian processes with spherical harmonic features. In International Conference on Machine Learning. PMLR, 2793–2802.
  • Feng et al. (2022) Brandon Y Feng, Yinda Zhang, Danhang Tang, Ruofei Du, and Amitabh Varshney. 2022. PRIF: Primary Ray-Based Implicit Function. In European Conference on Computer Vision. Springer, 138–155.
  • Foley and Sugerman (2005) Tim Foley and Jeremy Sugerman. 2005. KD-tree acceleration structures for a GPU raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware. 15–22.
  • Forbus et al. (2002) Kenneth D Forbus, James V Mahoney, and Kevin Dill. 2002. How qualitative spatial reasoning can improve strategy game AIs. IEEE Intelligent Systems 17, 4 (2002), 25–30.
  • Fridovich-Keil and Yu et al. (2022) Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. 2022. Plenoxels: Radiance Fields without Neural Networks. In CVPR.
  • Glassner (1989) Andrew S Glassner. 1989. An introduction to ray tracing. Morgan Kaufmann.
  • González (2010) Álvaro González. 2010. Measurement of areas on a sphere using Fibonacci and latitude–longitude lattices. Mathematical Geosciences 42 (2010), 49–64.
  • Gorski et al. (2005) Krzysztof M Gorski, Eric Hivon, Anthony J Banday, Benjamin D Wandelt, Frode K Hansen, Martin Reinecke, and Matthias Bartelmann. 2005. HEALPix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal 622, 2 (2005), 759.
  • Hedman et al. (2021) Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul Debevec. 2021. Baking Neural Radiance Fields for Real-Time View Synthesis. ICCV (2021).
  • Houchens et al. (2022) Trevor Houchens, Cheng-You Lu, Shivam Duggal, Rao Fu, and Srinath Sridhar. 2022. NeuralODF: Learning Omnidirectional Distance Fields for 3D Shape Representation. arXiv:2206.05837 [cs.CV]
  • Jack (2019) Matthew Jack. 2019. Tactical position selection: An architecture and query language. In Game AI Pro 360. CRC Press, 1–24.
  • Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (July 2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
  • Klimaszewski (1994) Krzysztof S Klimaszewski. 1994. Faster ray tracing using adaptive grids and area sampling. Brigham Young University.
  • Klosowski et al. (1998) James T Klosowski, Martin Held, Joseph SB Mitchell, Henry Sowizral, and Karel Zikan. 1998. Efficient collision detection using bounding volume hierarchies of k-DOPs. IEEE transactions on Visualization and Computer Graphics 4, 1 (1998), 21–36.
  • Liu and Yang (2023) Zhuoman Liu and Bo Yang. 2023. RayDF: Neural Ray-surface Distance Fields with Multi-view Consistency. arXiv:2310.19629 [cs.CV]
  • Luiten et al. (2024) Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. 2024. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. In 3DV.
  • Majercik et al. (2019) Zander Majercik, Jean-Philippe Guertin, Derek Nowrouzezahrai, and Morgan McGuire. 2019. Dynamic Diffuse Global Illumination with Ray-Traced Irradiance Fields. Journal of Computer Graphics Techniques (JCGT) 8, 2 (5 June 2019), 1–30. http://jcgt.org/published/0008/02/01/
  • McGuire et al. (2017) Morgan McGuire, Mike Mara, Derek Nowrouzezahrai, and David Luebke. 2017. Real-time global illumination using precomputed light field probes. In Proceedings of the 21st ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (San Francisco, California) (I3D ’17). Association for Computing Machinery, New York, NY, USA, Article 2, 11 pages. https://doi.org/10.1145/3023368.3023378
  • McIntosh (2019) Travis McIntosh. 2019. Human Enemy AI in The Last of Us. In Game AI Pro 360. CRC Press, 13–24.
  • Mildenhall et al. (2020) Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
  • Möller and Trumbore (2005) Tomas Möller and Ben Trumbore. 2005. Fast, minimum storage ray/triangle intersection. In ACM SIGGRAPH 2005 Courses. 7–es.
  • Müller (2006) Claus Müller. 2006. Spherical harmonics. Vol. 17. Springer.
  • Müller et al. (2022) Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. 2022. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. ACM Trans. Graph. 41, 4, Article 102 (July 2022), 15 pages. https://doi.org/10.1145/3528223.3530127
  • Ost et al. (2021) Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. 2021. Neural Scene Graphs for Dynamic Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2856–2865.
  • Powers (2008) David Powers. 2008. Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation. Mach. Learn. Technol. 2 (01 2008).
  • Rahaman et al. (2019) Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. 2019. On the spectral bias of neural networks. In International Conference on Machine Learning. PMLR, 5301–5310.
  • Smits (2005) Brian Smits. 2005. Efficiency issues for ray tracing. In ACM SIGGRAPH 2005 Courses. 6–es.
  • Suda and Takami (2002) Reiji Suda and Masayasu Takami. 2002. A fast spherical harmonics transform algorithm. Mathematics of computation 71, 238 (2002), 703–715.
  • Swinbank and James Purser (2006) Richard Swinbank and R James Purser. 2006. Fibonacci grids: A novel approach to global modelling. Quarterly Journal of the Royal Meteorological Society: A journal of the atmospheric sciences, applied meteorology and physical oceanography 132, 619 (2006), 1769–1793.
  • Tancik et al. (2022) Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben P. Mildenhall, Pratul Srinivasan, Jonathan T. Barron, and Henrik Kretzschmar. 2022. Block-NeRF: Scalable Large Scene Neural View Synthesis. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8238–8248. https://doi.org/10.1109/CVPR52688.2022.00807
  • Tancik et al. (2020) Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. 2020. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS (2020).
  • Turki et al. (2022) Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. 2022. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12922–12931.
  • Ueda et al. (2022) Itsuki Ueda, Yoshihiro Fukuhara, Hirokatsu Kataoka, Hiroaki Aizawa, Hidehiko Shishido, and Itaru Kitahara. 2022. Neural Density-Distance Fields. In Proceedings of the European Conference on Computer Vision.
  • Wald et al. (2014) Ingo Wald, Sven Woop, Carsten Benthin, Gregory S. Johnson, and Manfred Ernst. 2014. Embree: A Kernel Framework for Efficient CPU Ray Tracing. ACM Trans. Graph. 33, 4, Article 143 (jul 2014), 8 pages. https://doi.org/10.1145/2601097.2601199
  • Walsh (2019) Martin Walsh. 2019. Modeling Perception and Awareness in Tom Clancy’s Splinter Cell Blacklist. In Game AI Pro 360. CRC Press, 73–86.
  • Welsh (2013) Rich Welsh. 2013. Crytek’s Target Tracks Perception System. Game AI Pro: Collected Wisdom of Game AI Professionals 403 (2013), 411.
  • Yu et al. (2021) Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. 2021. Plenoctrees for real-time rendering of neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5752–5761.