**In Pursuit of Real-Time Ray Tracing**
Teguh Hofstee
---
> "Ray tracing is the future and ever will be"
> -- David Kirk
# Introduction
Ray tracing is one of many techniques commonly used for generating digital images. In contrast to rasterization, the presently dominant technique for real-time rendering, which projects triangles onto a virtual sensor, ray tracing projects rays out from said sensor and bounces them off the objects in our scene. This inversion makes modeling the complex behavior of light comparatively simple, naturally lending itself to challenging problems such as global illumination and shadows. In theory, ray tracing scales better than rasterization as it only needs to access $O(\log n)$ triangles compared to the $O(n)$ of rasterization. In practice, we have extremely efficient ways to cull occluded triangles such as cluster-based LODs [#Haar2015, #Karis2021], which, when combined with an extremely predictable streaming access pattern for the majority of triangles, allow us to render a practically unbounded number of triangles in essentially constant time[^f1].
Unfortunately for ray tracing, its elegant simplicity in expression results in brutal complexity in execution. The bouncing behavior leads to effectively random access to the triangles in the scene. We can't be as aggressive with the culling, or we risk removing objects that would have been hit by ray bounces, eliminating ray tracing's primary benefit over rasterization. This also fits poorly with the wide SIMT architectures common in modern GPUs: each ray could be accessing different portions of memory, leading to heavy scatter/gather behavior, and rays will find their intersections or terminate at different times, leading to high divergence and thus poor utilization.
We'll look at a variety of approaches to different aspects of the problems ray tracing faces from a few different perspectives, ranging from state-of-the-art acceleration structure construction on GPUs to specialized hardware for construction and rendering. For the sake of simplicity, unless explicitly mentioned we're also assuming that our base primitive is the triangle and everything is composed of them.
---
[^f1]: Of course, this shifts work like computation of clusters into precomputation and the exact scaling is hard to compute, but in practice, this works so well that from the perspective of an artist authoring content for games they no longer need to worry about polygon count beyond how much space it takes up on disk.
# Acceleration Structures
A naive ray tracer is horrendously slow -- without an understanding of the scene, each ray needs to test against every triangle to find the closest hit if one exists. Multiplied by the number of samples needed for anti-aliasing and the bounces needed to capture the complex light paths in a scene, rendering even a simple image takes an incredibly large amount of time. We need to do something to reduce the number of triangles we compare against in a scene.
The primary insight acceleration structures for ray tracing make is that if we can query if our ray intersects a certain region in space, we can limit the triangles we test against to only those that fall within that region. There are many approaches including binary space partitioning (BSP) trees, uniform-grids, octrees, and plenty of others, but the predominant data structures in use today are KD-trees and the bounding volume hierarchy (BVH).
Both the KD-tree and the BVH are hierarchical data structures and can be thought of as duals of each other. At each level, a KD-tree picks an axis-aligned splitting plane and places triangles on the side(s) of the plane they overlap with. During traversal, a ray queries the splitting planes it passes through until it intersects a triangle encapsulated completely in the explored region or determines a miss with the scene. A BVH instead considers splitting the triangles into groups and encapsulates them in a bounding volume, often an axis-aligned bounding box (AABB). During traversal, a ray tests whether it intersects the bounding volume and continues testing against volumes until it hits a triangle or misses the scene. In a KD-tree a triangle may overlap multiple splitting planes, while in a BVH the bounding volumes fully contain their triangles but may overlap other volumes.
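To make the BVH side of this concrete, here is a minimal traversal sketch in Python. The `Node` layout and `ray_aabb` helper are illustrative names of my own, not any particular implementation, and the ray direction is assumed to have no zero components so that its reciprocal is finite (real tracers handle the divide-by-zero cases explicitly).

```python
# Minimal sketch of BVH traversal (hypothetical Node layout). A slab
# test prunes subtrees whose AABB the ray misses; only surviving
# leaves contribute candidate triangles.

def ray_aabb(orig, inv_dir, lo, hi):
    """Slab test: does the ray orig + t*dir (t >= 0) hit box [lo, hi]?
    inv_dir holds the precomputed reciprocals of the ray direction."""
    tmin, tmax = 0.0, float("inf")
    for a in range(3):
        t0 = (lo[a] - orig[a]) * inv_dir[a]
        t1 = (hi[a] - orig[a]) * inv_dir[a]
        if t0 > t1:
            t0, t1 = t1, t0
        tmin, tmax = max(tmin, t0), min(tmax, t1)
    return tmin <= tmax

class Node:
    def __init__(self, lo, hi, left=None, right=None, tris=()):
        self.lo, self.hi = lo, hi          # AABB corners
        self.left, self.right = left, right
        self.tris = tris                   # triangle ids (leaves only)

def traverse(node, orig, inv_dir, hits):
    if node is None or not ray_aabb(orig, inv_dir, node.lo, node.hi):
        return
    if node.tris:                          # leaf: candidate triangles
        hits.extend(node.tris)
    else:
        traverse(node.left, orig, inv_dir, hits)
        traverse(node.right, orig, inv_dir, hits)
```

A real tracer would intersect the candidate triangles and track the closest hit; here a large finite reciprocal (e.g. `1e9`) stands in for a near-axis-parallel ray.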
Historically, KD-trees were shown to outperform BVHs in terms of rendering performance, but more recently the BVH has become the de facto standard acceleration structure of choice. While the KD-tree efficiently partitions space, allowing a ray to generally perform fewer queries before finding an intersection, its build times are too slow for practical use in real-time environments. BVHs, on the other hand, have had many high-performance construction algorithms developed in recent years, and in a SIMD environment typically result in less divergence than KD-trees, leading to higher performance. As BVHs are built around objects rather than space, they are also a natural fit in real-time environments where many individual objects are combined to create a scene (instead of a single mesh), as each object's BVH can be independently constructed and later combined to create a BVH of the entire scene.
Modern acceleration structures often feature two levels, commonly referred to as the top-level acceleration structure (TLAS) and bottom-level acceleration structure (BLAS). In the TLAS, instances of the BLAS replace what would typically be primitives, and each BLAS contains primitives for portions of the scene. The type of acceleration structure need not be consistent, and often the BLAS is constructed to take advantage of certain properties of the model, such as if it is unlikely to be seen or if it is animated.
## BVH Construction
When constructing a BVH we want to select a splitting plane that minimizes the cost needed to trace rays, and while estimating this cost is difficult we do have rather good heuristics. Intuitively, a good split results in our bounding boxes containing primitives that are close together while minimizing the wasted volume within and overlap between each of the boxes. A simple choice of splitting plane is the median of the primitives' centroids, but we can do significantly better by realizing that the probability of a random ray hitting a box is proportional to the surface area of said box. This leads to the most popular heuristic used to estimate the cost of a BVH -- the surface area heuristic (SAH).
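Concretely, the SAH estimates the cost of splitting a node $P$ into children $L$ and $R$ as (with illustrative constants $C_t$ for a traversal step and $C_i$ for a primitive intersection):

$$C(P \rightarrow L, R) = C_t + \frac{SA(L)}{SA(P)} N_L C_i + \frac{SA(R)}{SA(P)} N_R C_i$$

where $SA(\cdot)$ denotes surface area and $N_L$, $N_R$ are the primitive counts of each child. The surface-area ratios approximate the conditional probability that a random ray hitting $P$ also hits the corresponding child, so minimizing $C$ favors small, tight boxes around many primitives.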
With a quantifiable cost, we can now consider construction algorithms that minimize it, and the standard[^f2] to beat is Sweep SAH. In a simple BVH, each primitive is referenced only once. As any splitting plane between two primitives will result in the same partitioning, we only have $N-1$ planes to query at each level. A sweeping BVH builder evaluates the cost heuristic at each of these splitting planes and generally works in a top-down greedy manner. When the heuristic is the SAH, this algorithm is commonly referred to as Sweep SAH. While Sweep SAH builds very high-quality BVHs, evaluating all splitting planes is prohibitively expensive, particularly at the top levels of the tree where we have the largest number of primitives.
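A sketch of what one sweep along a single axis looks like, assuming primitives are `(lo, hi)` AABB pairs and using a simplified $SA \times \text{count}$ cost proxy with the traversal/intersection constants dropped (the helper names are my own):

```python
# Sketch of one Sweep SAH pass along an axis: sort primitives by
# centroid, then score all N-1 candidate splits using prefix/suffix
# bounding boxes so the whole pass is O(N log N).

def union(a, b):
    if a is None:                       # the empty box is the identity
        return b
    return (tuple(min(a[0][i], b[0][i]) for i in range(3)),
            tuple(max(a[1][i], b[1][i]) for i in range(3)))

def surface_area(box):
    d = [box[1][i] - box[0][i] for i in range(3)]
    return 2.0 * (d[0] * d[1] + d[1] * d[2] + d[2] * d[0])

def best_sweep_split(prims, axis):
    """Return (cost, index) of the cheapest split of the sorted list."""
    prims = sorted(prims, key=lambda p: 0.5 * (p[0][axis] + p[1][axis]))
    n = len(prims)
    suffix = [None] * (n + 1)           # suffix[i] bounds prims[i:]
    for i in range(n - 1, -1, -1):
        suffix[i] = union(suffix[i + 1], prims[i])
    best, prefix = (float("inf"), None), None
    for i in range(1, n):               # split into [0, i) and [i, n)
        prefix = union(prefix, prims[i - 1])
        cost = (surface_area(prefix) * i
                + surface_area(suffix[i]) * (n - i))
        best = min(best, (cost, i))
    return best
```

A full builder would evaluate all three axes, compare the winning split against the cost of making a leaf, and recurse into both halves.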
One approach to reduce the time needed to construct a BVH is by considering a small number of splitting planes, also known as *binning* [#Wald2007]. Commonly as few as 16 or 32 equidistant splitting planes are chosen. At each step, we iterate over our current primitives and place them into bins according to their centroids while updating the bounding volume of the bin. To compute the cost of each split we now only have to merge a small number of bins, rather than the full number of triangles. This has a relatively small impact on quality (say, on the order of 10%) but allows for significantly faster construction.
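The binned variant can be sketched as follows, again with a simplified $SA \times \text{count}$ cost and helper names of my own; 16 bins along one axis, with primitives assigned by centroid:

```python
# Sketch of binned SAH along one axis [#Wald2007]: primitives land in
# one of K equidistant bins by centroid, and only the K-1 bin
# boundaries are scored as candidate splits.

K = 16

def union(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return (tuple(min(a[0][i], b[0][i]) for i in range(3)),
            tuple(max(a[1][i], b[1][i]) for i in range(3)))

def surface_area(box):
    if box is None:                     # empty bins contribute nothing
        return 0.0
    d = [box[1][i] - box[0][i] for i in range(3)]
    return 2.0 * (d[0] * d[1] + d[1] * d[2] + d[2] * d[0])

def best_binned_split(prims, axis):
    """Return (cost, boundary) for the cheapest of the K-1 boundaries."""
    cents = [0.5 * (p[0][axis] + p[1][axis]) for p in prims]
    cmin, cmax = min(cents), max(cents)
    scale = K / (cmax - cmin) if cmax > cmin else 0.0
    boxes, counts = [None] * K, [0] * K
    for p, c in zip(prims, cents):
        b = min(K - 1, int((c - cmin) * scale))
        boxes[b] = union(boxes[b], p)
        counts[b] += 1
    best = (float("inf"), None)
    for split in range(1, K):           # merge bins on each side
        lbox = rbox = None
        ln = rn = 0
        for i in range(split):
            lbox, ln = union(lbox, boxes[i]), ln + counts[i]
        for i in range(split, K):
            rbox, rn = union(rbox, boxes[i]), rn + counts[i]
        cost = surface_area(lbox) * ln + surface_area(rbox) * rn
        best = min(best, (cost, split))
    return best
```

Merging bins instead of primitives is what makes the evaluation cheap: the inner loops touch at most $K$ boxes regardless of how many primitives the node contains (and a real builder would compute the left/right merges incrementally rather than rescanning per split).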
This style of top-down builder has been efficiently parallelized on CPUs [#Wald2007, #Wald2012] by taking advantage of parallelism across primitives at the top levels, and switching to parallelism across nodes at lower levels once enough independent work has been created.
---
[^f2]: There are many other methods to construct BVHs and Sweep SAH is by no means the highest performing during tracing. As far as a gold standard for achievable performance, the Split BVH (SBVH) [#Stich2009] is often chosen. Briefly, SBVH allows primitives to be split along the splitting plane resulting in duplicate references as well as many of the benefits and drawbacks you would get from KD-trees.
## Parallel GPU Approaches
Existing approaches to constructing high-quality BVHs scale rather well on CPUs, but their parallelism at the top of the tree is limited, making them a poor fit for the massive parallelism requirements of GPUs.
**Linear BVH (LBVH)**
An LBVH is one approach to building a BVH from the bottom up, and it simplifies the problem by choosing the order of the leaf nodes first. The main idea is to interleave the bits of each primitive's quantized 3D centroid to produce a code along a Z-order space-filling curve. Sorting by this code, such as with a parallel radix sort, gives the order we want, and nodes in the BVH can now be thought of as ranges of leaf nodes, with the splits chosen by the most significant differing bit. With a bit of magic, this can be implemented extremely quickly on a GPU [#Karras2012].
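A sketch of the Morton-code computation at the heart of an LBVH builder, using the common 30-bit variant (10 bits per axis) and the standard bit-spreading constants:

```python
# Sketch of the 30-bit Morton (Z-order) code used by LBVH builders:
# quantize each coordinate to 10 bits, then interleave the bits.

def expand_bits(v):
    """Spread the low 10 bits of v so there are two zero bits between
    each original bit (standard bit-twiddling, cf. [#Karras2012])."""
    v = (v * 0x00010001) & 0xFF0000FF
    v = (v * 0x00000101) & 0x0F00F00F
    v = (v * 0x00000011) & 0xC30C30C3
    v = (v * 0x00000005) & 0x49249249
    return v

def morton3d(x, y, z):
    """x, y, z in [0, 1); returns a 30-bit interleaved code."""
    def quantize(c):
        return min(1023, max(0, int(c * 1024.0)))
    return (expand_bits(quantize(x)) << 2) | \
           (expand_bits(quantize(y)) << 1) | \
            expand_bits(quantize(z))
```

Sorting primitives by this code places centroids that are close in 3D near each other in the sorted order, which is exactly the leaf ordering the LBVH builds ranges over.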
The biggest problem with LBVHs is their low quality. Compared to Sweep SAH it's not uncommon to see as much as a 30% decrease in performance. Luckily for us, prior work [#Bittner2013] has established that it's possible to take an existing low-quality BVH and refine it into a reasonably high-quality one. However, approaches utilizing local modifications such as tree rotations [#Kensler2008] get stuck in local optima, while other approaches once again map poorly to the GPU. As there are situations where precomputation of a BVH is not feasible, we wish to find an approach that can utilize the parallelism available on a GPU to quickly construct a high-quality BVH.
**Treelet Restructuring (TRBVH)**
To effectively feed a GPU, we need large amounts of independent work, and one way of partitioning the BVH is into *treelets*, or small connected subgraphs of the BVH. We can start with an LBVH and follow it up with optimizing these treelets to improve the quality of our BVH [#Karras2013]. The key insight is that there are significantly more ways to form a treelet than there are tree rotations, which suggests that optimizing them has a higher potential for creating high-quality BVHs as well as being less likely to get stuck in local optima. Even treelets with just 7 leaves are sufficient for this task. Bottom-up traversal is used to construct the BVH and groups of threads are used to optimize single treelets to combat otherwise low SIMD utilization.
Treelets are chosen by starting from a root node and its two children, then expanding to include the children of whichever treelet leaf has the largest surface area. As this is performed in a bottom-up process, nodes can propagate far up through the BVH but will never get pushed much further down than they started. Optimizing a treelet is then performed using dynamic programming across a group of threads, starting with subsets of two elements in the treelet and growing the subsets with each iteration. As each iteration depends on the results of the prior one and there is only so much work to do for small subsets, utilization will unavoidably be suboptimal. Potential treelets can be skipped during traversal by only considering subtrees with some minimum number of leaves, trading quality for construction time.
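The core of the treelet optimization can be sketched as a dynamic program over subsets of the treelet leaves. This is a serial toy with a simplified cost (the sum of internal-node surface areas, with leaves free); a real builder seeds the leaf costs with the SAH of the existing subtrees and runs the DP cooperatively across a thread group:

```python
# Sketch of the subset DP behind treelet restructuring [#Karras2013]:
# given the AABBs of the n treelet leaves (n = 7 in the paper), find
# the cost of the best binary tree topology over them.

def union(a, b):
    return (tuple(min(a[0][i], b[0][i]) for i in range(3)),
            tuple(max(a[1][i], b[1][i]) for i in range(3)))

def surface_area(box):
    d = [box[1][i] - box[0][i] for i in range(3)]
    return 2.0 * (d[0] * d[1] + d[1] * d[2] + d[2] * d[0])

def optimal_treelet_cost(leaves):
    n = len(leaves)
    full = (1 << n) - 1
    box = {1 << i: leaves[i] for i in range(n)}
    cost = {1 << i: 0.0 for i in range(n)}  # leaves are free here
    for s in range(1, full + 1):            # subsets in increasing order
        if s in box:                        # singletons already done
            continue
        low = s & -s                        # isolate the lowest set bit
        box[s] = union(box[low], box[s ^ low])
        best = float("inf")
        p = (s - 1) & s                     # largest proper subset of s
        while p > (s ^ p):                  # each {p, s^p} split once
            best = min(best, cost[p] + cost[s ^ p])
            p = (p - 1) & s
        cost[s] = best + surface_area(box[s])
    return cost[full]
```

The number of partitions grows exponentially in $n$, which is why the paper stops at 7 leaves: the DP tables still fit in registers/shared memory, yet the search space is far richer than local tree rotations.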
Repeating this process in several rounds increases the scope of potential modifications, also allowing nodes to be pushed further down, and can be done until the BVH cost converges. As a final step, some subtrees are collapsed into a single node to improve SAH cost. Triangle splitting can also be performed as an optional preprocessing stage to improve the quality of the BVH if desired.
**Parallel Locally-Ordered Clustering (PLOC)**
Another way of constructing a BVH from the bottom up is agglomerative clustering. In each iteration, we search through all unconnected objects and group the closest two under a new node until we have a complete tree. While proper agglomerative clustering is costly to compute, approximate methods (AAC) [#Gu2013] have been used to create very efficient parallel CPU builders that generate high-quality trees, but once again these algorithms are not a great fit for GPUs.
The main idea that PLOC [#Meister2017] is built around is that if two nodes agree that the other is its nearest neighbor, they can be merged before the next iteration. Each iteration of nearest-neighbor search, merging, and compaction is referred to as a sweep. To search for the approximate nearest neighbors, we once again take advantage of the Z-order curve to define a radius around the current node to search, similarly to AAC. A GPU implementation can swap buffers between the input and output of each iteration, using parallel prefix sums to determine indices for writes and compacting the clusters. The clusters are not sorted after each sweep, as clusters stay roughly ordered, sorting is costly, and the algorithm is approximate anyways. A final stage of leaf collapsing is also beneficial as it was in TRBVH. In practice, behavior is close to the best case linear complexity, and the search radius gives an effective parameter to trade quality for construction speed.
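A single PLOC sweep can be sketched as follows: serial Python rather than the parallel GPU formulation, with the distance metric, as in the paper, being the surface area of the merged AABB (helper names are my own, and at least two clusters are assumed):

```python
# Sketch of one PLOC sweep [#Meister2017]: clusters are AABBs kept in
# Morton order; each searches `radius` positions to either side for
# the neighbor minimizing the surface area of the merged box, and
# mutual pairs merge before the next iteration.

def union(a, b):
    return (tuple(min(a[0][i], b[0][i]) for i in range(3)),
            tuple(max(a[1][i], b[1][i]) for i in range(3)))

def surface_area(box):
    d = [box[1][i] - box[0][i] for i in range(3)]
    return 2.0 * (d[0] * d[1] + d[1] * d[2] + d[2] * d[0])

def ploc_sweep(clusters, radius):
    n = len(clusters)
    nearest = []
    for i in range(n):
        best = min(
            (j for j in range(max(0, i - radius), min(n, i + radius + 1))
             if j != i),
            key=lambda j: surface_area(union(clusters[i], clusters[j])))
        nearest.append(best)
    merged, out = set(), []
    for i in range(n):
        j = nearest[i]
        if nearest[j] == i and j > i:      # mutual pair: merge once
            out.append(union(clusters[i], clusters[j]))
            merged.update((i, j))
        elif i not in merged:
            out.append(clusters[i])        # survives to the next sweep
    return out
```

Repeating the sweep until one cluster remains yields the tree; the GPU version replaces the output list with prefix-sum compaction into a second buffer and records the parent/child links as it merges.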
# Hardware Construction
## PLOCTree
The main motivation for implementing PLOC in hardware [#Viitanen2018] is that the GPU implementation has extremely high bandwidth requirements, both within and across sweeps. The insight in making the algorithm amenable to streaming data is that we can immediately start merging the $i\text{th}$ element of our input once we've computed the nearest neighbor of $i+R$. Reframed as a streaming algorithm, parallelism can now be achieved through pipelining multiple sweeps at once.
The first stage in our pipeline is a sorting subsystem to convert our unordered primitives to the Z-order stream we want. Next, there are two stages of sweep pipelines, with resources allocated so they empirically perform about the same amount of work (which ends up being very different numbers of sweep iterations). Each sweep pipeline stage shares resources to perform several sweeps over the streamed input and avoid transferring data to memory.
For each sweep, we start by having a sliding-window buffer sized for the $2R+1$ elements we're interested in, and we can think of the center of this window as our current index $i$. For each of the $2R$ remaining elements in the window, we perform a cost evaluation in parallel and feed the costs into a comparator to get our nearest neighbor. We then push this into another sliding-window buffer containing the nearest neighbors for $[i-R, i]$. Everything on our left has finished computing its nearest neighbor, so if our nearest neighbor lies on that side we can check for mutual correspondence and mark it to merge the two AABBs[^f4]. We can reframe the next portion by considering our current element to be $i-R$. At this point, we know everything on both sides of our search radius has computed its nearest neighbor, so we can check if we've been marked as a neighbor. If so, we merge and place in the output if we are the left neighbor, otherwise if there was no mutual correspondence we push the original AABB back into the output to be reprocessed in the next sweep. The output is routed either to a queue between the sweep pipelines or to the top of the current sweep pipeline depending on how many sweeps it was involved in.
Instead of special logic for handling boundary conditions, $2R$ dummy AABBs are padded at the end per sweep performed which results in later stages processing more dummy AABBs than valid ones. One disadvantage of PLOCTree compared to its GPU implementation is that it is unable to perform adaptive collapsing of leaf nodes, so that would still need to be run as a separate postprocessing kernel on the GPU. Another tradeoff is that the search radius here is fixed so we completely lose the ability to generate higher quality trees.
The largest consumer of power is the numerous window buffers in the second set of sweep pipelines. Performing more elements of a sweep in a SIMD parallel fashion would help to amortize the cost, and it may also be worthwhile to examine the effects of decreasing the search radius at later stages in the pipeline to reduce the memory requirements of the window buffers.
---
[^f4]: A simpler way, perhaps, would be to check later from $i-R$ if the neighbor is on the right and mark mutually corresponding neighbors to discard their node. At the very least it saves a minuscule buffer. :)
## RayCore
RayCore [#Nah2014] is a MIMD architecture targeting mobile ray tracing that includes both dedicated traversal and acceleration structure construction units. In this section, we'll focus primarily on the design of its Tree Builder Unit (TBU), which creates KD-trees. The approach RayCore takes to tree construction consists of two phases: binning and sorting. While the data is too large to sort, the TBU operates in a binning mode; once the data within a split is small enough to fit in memory, it is transferred to sorting-based tree building pipelines (TBPs). As a depth-first layout is preferred to save memory, the TBU buffers data on-chip first, reordering the transfer to make efficient use of DRAM burst accesses.
RayCore focuses its TBU solely on building dynamic objects, leaving the tree of static objects to be prebuilt on a CPU. The dynamic tree takes a two-level approach, consisting of dynamic subtrees for each object, and is not incorporated with the static tree. The TBU scales roughly linearly with the number of triangles, and compared to PLOCTree has roughly 4x higher memory traffic on scenes with similar numbers of triangles. Assuming it could match the 1 GHz clock speed of PLOCTree, it's around 20x slower to build an acceleration structure. PLOCTree is about 50% larger, though, at 2.4mm^2 compared to the 1.6mm^2 of the TBU, both on a 28nm process. Still, it seems unlikely for the approach taken here to outperform PLOCTree or other BVH-based approaches. With very large models, it also seems likely that the sorting-based TBPs will be severely underutilized as more time will be spent in the binning phases.
# Hardware Ray Tracing
One way to reframe the process of ray tracing is to turn it into a pipeline, similar to the graphics pipeline often used in rasterization. Before tracing any rays, we first need to construct our acceleration structure(s). This might happen offline, before each frame, or some combination of the two depending on the workload. Second, we create the initial set of primary rays to feed through the pipeline. Next, rays need to traverse the acceleration structure, perform primitive intersection tests, generate secondary rays (which are eventually fed back to the traversal stage), and be shaded, though the boundaries between these stages are not always clear-cut. As an example, modern Nvidia RTX GPUs build acceleration structures on the GPU, then start primary ray generation in shaders running on the SMs, perform dedicated traversal and triangle intersection stages in the RT cores, and return to the SMs to perform (combined) shading and secondary ray generation.
On a CPU or other small SIMD environments, we can trace small numbers, or *packets*, of rays to better utilize the available resources. The choice to defer processing secondary rays until later leads to the idea of *wavefront* ray tracing common in GPU implementations.
## RayCore
As mentioned earlier, RayCore is a MIMD architecture that bases its threads of computation on the tracing of a single ray. In contrast to prior approaches like SaarCOR which contain varying ratios of fixed pipelines for traversal and intersection, RayCore opts to unify both pipelines allowing a core to perform in either mode for the sake of load balancing. Tracing of rays is performed in a pipelined fashion, with stages for traversal and intersection, hit point calculation, and shading. Secondary rays are placed on a stack that takes priority over primary ray generation to avoid deadlock. RayCore also has two acceleration structures, a dynamic tree that is traversed first, and a static tree that is traversed after.
RayCore opts to omit programmability, citing the presence of GPUs already in mobile devices and suggests that a combination of the two would be one method to achieve programmable shading with real-time ray tracing. As such, RayCore is limited in what types of algorithms it can model with ray tracing and only handles Phong shading and a limited variety of sampling distributions. In complex shaders, it is not uncommon for the calculation to take hundreds of cycles, and it's not clear if such a modification would be amenable to the proposed architecture.
RayCore attempts to keep rays on-chip as much as possible and traces individual rays to termination while holding a small stack of secondary rays generated in the process. As such it dynamically accesses scene data, and when rays become incoherent such as in the case of diffuse interreflection rays the memory traffic increases by over an order of magnitude, though performance isn't significantly degraded. At ASIC speeds, the demand for memory traffic is likely so high that it will become a bottleneck. All the evaluated scenes are also rather small in size and likely to stay in cache, so it's unknown how this architecture would behave when that is no longer the case. All that being said, RayCore does manage to only request around 20-33 bytes per ray which is extremely low.
In the case of a cache miss, the ray passes through the pipeline until it is marked active again at the start of the next loop. Effectively this behaves like a cache prefetch for the next loop iteration when the ray can try again. This differs from a typical pipeline stall as other rays in the pipeline can continue to perform useful work while the current ray is marked invalid.
A curious decision RayCore makes is to pre-compute TriAccel [#Wald2004] data for the triangles stored in the acceleration structure to decrease the number of stages in its unified pipeline. Using TriAccel for ray-triangle intersection tests requires approximately half the memory compared to directly using the vertices and results in a more cache-friendly access pattern compared to indexed accesses to vertices. However, this needs to be recomputed for all geometry data each time the acceleration structure is updated, such as in the case of dynamic scenes, and the triangle information is still required during the shading stages. It's unclear if this choice results in a reduction in energy, particularly in comparison to alternative approaches such as cache-blocked treelets (discussed next) or hierarchically compressed BVHs, the latter of which is used in conjunction with visibility buffers in Nanite[^f3] to dramatically reduce the memory needed to be transferred to the GPU.
---
[^f3]: Nanite manages to compress the in-memory representation of its BVH to under 9 bytes per triangle. Other methods such as an unpublished paper by Ingo Wald achieve similar footprints of 5-8 bytes per triangle.
## Treelet streaming
As we trace larger numbers of rays with many bounces, it's common for the rays to become incoherent. This leads to problems of low utilization in SIMD environments so many hardware approaches attempt to skirt this issue of coherence by treating every ray as independent and adopting a MIMD architecture. However, by nature of the rays being incoherent, this increases the size of our working set and gives us random memory access patterns which are far from ideal. So inevitably, at some point, we will run into problems where memory traffic is the primary bottleneck.
Minimizing memory traffic requires our working set to be coherent in the portion of the scene it is accessing. Such behavior makes adopting a SIMD approach attractive, but doing so requires a large number of coherent rays to be available for processing at any given time. One way to do this is by processing rays in large batches of coherent rays [#Pharr1997], but first, those ideas need to be adapted for the modern practice of using BVHs as the primary acceleration structure. Instead of cells in a uniform grid, we can batch rays at (cache-sized) treelets in a BVH. This approach of treelet streaming [#Aila2010] effectively reduces the amount of scene data that is loaded during ray tracing but results in a significant increase in auxiliary traffic for managing ray queues and their associated stacks.
STRaTA [#Kopta2015] builds on these ideas in two key ways -- it attempts to keep rays on-chip in buffers by limiting the maximum number of rays in flight, and it observes that memory bandwidth is an incomplete proxy for performance. Keeping rays on-chip requires their footprint to be minimized, and this is done by keeping parent pointers in the BVH as well as a single bit per level to indicate if the left or right child was traversed first, essentially eliminating the traversal stack. Through accurate modeling of caches and DRAM, it's shown that efficiently scheduling DRAM accesses can simultaneously result in increased bandwidth while also reducing energy consumption by optimizing for the row buffer.
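The parent-pointer-plus-bit-trail idea can be sketched as a serial toy in the spirit of the paper: here we always descend into the left child first (a real implementation orders children by ray direction) and assume a full binary tree, as is typical for BVHs. The `Node` layout and `enter` predicate are hypothetical stand-ins.

```python
# Sketch of stackless BVH backtracking a la STRaTA [#Kopta2015]: the
# per-ray traversal state is just the current node, a depth counter,
# and one bit per level saying whether that level's sibling has been
# handled -- no node stack.

class Node:
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right
        self.parent = None
        for child in (left, right):
            if child is not None:
                child.parent = self

def traverse_stackless(root, enter):
    """Visit nodes passing the `enter` test (a stand-in for the
    ray-AABB check), returning labels in visit order."""
    node, trail, depth = root, 0, 0
    order = []
    while True:
        if enter(node):
            order.append(node.label)
            if node.left is not None:   # internal: descend left first
                node, trail, depth = node.left, trail << 1, depth + 1
                continue
        # backtrack past levels whose sibling is already handled
        while depth > 0 and trail & 1:
            node, trail, depth = node.parent, trail >> 1, depth - 1
        if depth == 0:
            return order
        trail |= 1                      # this level's sibling is next
        node = node.parent.right
```

For a BVH of depth $d$ the ray state is $d$ bits plus a node pointer, which is what makes keeping many rays resident on-chip feasible.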
One source of inefficiency in STRaTA is that treelets might need to be loaded multiple times even if we are only tracing primary rays. Dual streaming [#Shkurko2017] addresses this by switching to a wavefront ray tracer and duplicating rays to the queues of all treelets they intersect. This ensures that rays only ever flow from parent to child and allows the treelets to be loaded a maximum of one time per wavefront with a predictable access pattern. Duplicate ray segments complicate intersection as a segment may intersect multiple primitives and the closest hit is (generally) what we are interested in. This is handled with a global hit record stored in memory which is accessed and updated as segments intersect primitives. Due to the way rays are traversed, the ability to perform early termination of a ray is also mostly lost.
## Mach-RT
The ray queues and the global hit record were once again the primary sources of memory overhead in dual streaming, so moving them on-chip would eliminate the associated memory traffic. Unfortunately, storing all the rays on-chip would require unrealistically large (>100 MB) buffers. Mach-RT instead splits the load across multiple chips and has them all share a large off-chip L3 cache, with each chip processing a portion of the final image. This allows the access behavior to be completely streamed, as is the case for rasterization.
One attempted optimization biases treelet scheduling towards depth-first order instead of breadth-first to promote the benefits of early termination, but the performance difference is negligible. This is attributed to the global order of traversal across a wavefront, whereas early termination is most noticeable when each ray can traverse the closest potential intersections first.
Overall it's shown that multiple processors can share certain off-chip memory resources effectively. Given that a set of 16 chips has a total die area of around 360mm^2 when using scaling factors from 65nm to 7nm, it has a similar total area as a mid-level Nvidia RTX 3070 Ti GPU while having around 5x as many threads. I would be curious to see more evaluation into the applicability of Mach-RT to other computational tasks beyond ray tracing, as well as an examination of how memory resource usage changes when shading becomes more complex.
# Discussion
## Acceleration Structure Choice
It's still not completely clear if/when a KD-tree is better than a BVH. A claim in the literature is that a KD-tree is better for MIMD situations, so one might expect scalar CPU code to mirror that claim, but at least in preliminary experiments modifying PBRTv4 that doesn't seem to be the case. Additionally, new types of acceleration structures such as dual-split trees [#Lin2019] present interesting alternatives. With the flexibility provided by two-level acceleration structures, optimizing the choice of BLAS beyond construction parameters alone to incorporate multiple different types of acceleration structures might be worth considering.
## The Need for Build Performance
The fastest approaches so far are still GPU LBVH for outright speed and PLOCTree for higher quality trees. Both still take a significant amount of time to construct their trees, and even in modern AAA games utilizing DirectX Raytracing you are frequently limited to only a handful of BVH updates/rebuilds per frame, alongside the TLAS build. To realize the potential for fully destructible environments, large numbers of dynamic/animated actors, particle systems, and more, we will still benefit greatly from higher outright build performance. That being said, acceleration structure builders don't exist in a vacuum -- performance saved during construction that is then lost in traversal isn't compelling, but conversely there is a lot of potential for builders whose payoff lies almost entirely in traversal.
Performing builds from scratch is wasteful, as we throw away potentially useful information from the BVH constructed in prior frames. It would be interesting to see future work focus on refining existing BVHs beyond refits, to see if there are scalable techniques that would allow refinement in places where rebuilding is typically required, like particles and rapidly moving objects. Doing so could also eliminate the need for prebuilt BVHs for the keyframes of animated actors, which are used as a base for rebuilds; this would save memory and allow more robust behavior when not all keyframes are known ahead of time, such as with AI-based animation or VR. Deferred (re)builds of acceleration structures, in particular, might be one way to ease the demand on engine developers: assuming sufficient build performance, meshes could be updated when required instead of only picking a handful to rebuild each frame.
As an alternative to rebuilds and refits, it might make sense for the TLAS or BLAS of certain types of dynamic actors to be optimized with something like parallel reinsertion [#Meister2018] as we could take advantage of scene knowledge to potentially perform less work. Going in the other direction, it might be possible to construct a better TLAS through ideas like partial re-braiding [#Benthin2017]. With sufficient build performance, it may even be possible to create ray-specialized acceleration structures [#Hunt2008] according to the behavior of the scene. Whether it's possible or worthwhile to construct hardware for these ideas has not been explored.
As the SAH is not a perfect model for ray tracing cost, and things like endpoint overlap (EPO) are not easily calculated during construction, it's conceivable that future builders might utilize learned models to predict the cost of a BVH. SAH is also somewhat computationally expensive, so finding effective costs that are cheaper for hardware may be beneficial if a large number of them are required to be computed in parallel.
Lastly, building compressed BVHs is largely unexplored in hardware, and all the designs mentioned have been serial pipelines with no way of sharing work across multiple hardware units. While it's possible to have separate units build independent acceleration structures, this doesn't apply to all workloads. In some cases, it's possible to scale up a design to improve certain attributes of its construction, but it would be ideal to have scalable architectures that could both cooperate on a single acceleration structure (such as the TLAS) and work independently on smaller pieces of work (like BLAS rebuilds).
## Future Assumptions
Prior evaluations limit their ray path depth to a fairly small number: RayCore limits depth to 15, Mach-RT to 5. If realistic rendering of scenes with transparent objects is a goal, then a depth of around 10 is likely a minimum requirement. Adding hair or liquids makes a depth of 30 a more reasonable target, and accurate modeling of lenses might need a depth in the hundreds. The scaling of existing approaches with respect to increasing depth complexity is largely unexplored and may motivate different architectures, as the number of rays stuck in deep paths may be too small to run efficiently. In particular, comparing the effects of packet vs. wavefront vs. recursive architectures on deep paths would be informative. Another possible avenue of exploration is considering wider BVHs and performing multiple intersection tests for a single ray, rather than a single intersection test per ray. Approaches such as ReSTIR [#Bitterli2020] have been gaining traction and also tend to trace significantly fewer rays, so examining the effect this style of computation has on existing architectures could be interesting.
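The occupancy problem on deep paths can be made concrete with a small simulation: under a fixed per-bounce continuation probability (Russian roulette plus scene escape, folded into one hypothetical number here), the count of still-active rays decays geometrically with depth, leaving too few coherent rays at depth 20 or 30 to fill wide SIMT units.

```python
# Illustrative sketch: count how many rays survive to each bounce depth
# under a single lumped per-bounce continuation probability, to show why
# occupancy collapses on deep paths. The probability is hypothetical.

import random

def surviving_rays(n_rays, max_depth, p_continue=0.7, seed=1):
    """Return a list where entry d is the number of rays active at depth d."""
    random.seed(seed)
    alive = [0] * (max_depth + 1)
    for _ in range(n_rays):
        depth = 0
        # Each bounce, the ray continues with probability p_continue.
        while depth < max_depth and random.random() < p_continue:
            depth += 1
        # This ray was active at every depth up to its termination depth.
        for d in range(depth + 1):
            alive[d] += 1
    return alive
```

With `p_continue = 0.7`, roughly 0.7^d of the original rays remain at depth d, so a wavefront architecture that regroups survivors, or a recursive one that tolerates divergence, will behave very differently on the tail.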
Similarly, the shading supported or explored is simplistic: RayCore supports Phong shading, and all the papers in the Mach-RT chain evaluate only purely diffuse shading. As such, the cost of shading (and sampling) is essentially ignored, when in practice evaluating a stack of BSDFs combined with importance sampling can easily become a significant cost, even exceeding the cost of ray traversal. Future architectures need to be flexible enough to support complex user-defined queries that pick sample directions for secondary rays, as well as to facilitate efficient shading. Finding efficient ways of implementing these features, or good interfaces for deferring them to a GPU, would be promising, as would evaluating the effect that increased sampling/shading costs have on these architectures. Another situation where this behavior is likely to show up is implementing something like real-time stochastic lightcuts [#Lin2020]. Efficient support for any-hit tracing could also be worthwhile for things such as transparent objects, alpha textures, or shadows.
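Even the simplest importance-sampled query illustrates what such an interface must return: a direction and a PDF, both needed by the integrator. Below is a standard cosine-weighted hemisphere sample for a diffuse BSDF, written in local shading space (z is the surface normal); the tangent-frame transform is omitted and the function name is hypothetical.

```python
# Illustrative sketch: cosine-weighted hemisphere sampling for a diffuse
# BSDF -- the kind of user-defined sampling query a secondary-ray
# interface would have to support. The sample is in local shading space
# (z = surface normal); frame construction is omitted.

import math

def sample_cosine_hemisphere(u1, u2):
    """Map two uniform [0,1) numbers to a direction with pdf = cos(theta)/pi."""
    r = math.sqrt(u1)                  # concentric radius on the disk
    phi = 2.0 * math.pi * u2
    x, y = r * math.cos(phi), r * math.sin(phi)
    z = math.sqrt(max(0.0, 1.0 - u1))  # project the disk up to the hemisphere
    pdf = z / math.pi                  # cos(theta) / pi
    return (x, y, z), pdf
```

A real material stack chains many such samples and evaluations per hit, which is why sampling plus shading can rival traversal in total cost.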
What if we could trace things other than rays and intersect things other than triangles? While procedural geometry is not uncommon in film rendering, it is complex enough to evaluate that supporting it amounts to allowing a fully user-defined primitive intersection. Currently, defining a custom primitive intersection on an RTX GPU bounces execution back and forth between the RT cores and the SMs, which kills performance. To name a couple of possibilities: intersecting curves could be useful for hair rendering, and tracing frustums could be desirable for audio. Finding alternative use cases, or better ways of integrating ray tracing accelerators with existing computing architectures, would make their adoption more appealing; a ray tracing accelerator can no longer compete directly with a GPU, as a GPU is no longer used only for graphics.
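For a sense of what the accelerator has to call back into, here is the simplest non-triangle case, an analytic ray-sphere intersection returning the nearest valid hit distance. The interface (arguments, `t_min`/`t_max` clipping, `None` on miss) is a hypothetical stand-in for whatever contract a custom-primitive hook would define.

```python
# Illustrative sketch of a user-defined primitive intersection of the
# kind an accelerator would have to invoke: an analytic ray-sphere test
# returning the nearest hit distance, or None on a miss. The interface
# is hypothetical.

import math

def intersect_sphere(origin, direction, center, radius,
                     t_min=1e-4, t_max=1e30):
    # Solve |origin + t*direction - center|^2 = radius^2 for t.
    oc = [origin[i] - center[i] for i in range(3)]
    a = sum(d * d for d in direction)
    b = 2.0 * sum(oc[i] * direction[i] for i in range(3))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None  # ray misses the sphere
    sqrt_disc = math.sqrt(disc)
    # Try the nearer root first, then the farther one.
    for t in ((-b - sqrt_disc) / (2.0 * a), (-b + sqrt_disc) / (2.0 * a)):
        if t_min < t < t_max:
            return t
    return None
```

The trouble is not the math, which is trivial, but that every such call is a round trip between fixed-function traversal and programmable cores.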
## Out-of-Core Approaches
With increasing geometric complexity, a BVH for micropolygon-detail scenes is unlikely to fit into memory in the future. Solving this problem will likely involve some combination of highly compressed acceleration structures, streaming construction, deferred construction and loading of geometry, level of detail, and more. While Nanite can stream only the geometry it expects to be visible, ray tracing requires more geometry to be accessible, as contributing rays may bounce off objects out of sight. Attempting to create a system capable of this level of detail under real-time constraints is rife with ideas and lessons to be learned. Architectures capable of creating the compressed BVHs that Nanite uses could also alleviate some of the burden of adapting Nanite to dynamic objects.
## Optimizing for the Worst Case
As a final note, one of the primary reasons ray tracing is desirable is that it makes incredibly difficult challenges like ambient occlusion, shadows, global illumination, and transparency simpler to implement; in short, it makes it easier to author the content we want to see. To truly deliver on that, we need to minimize pathological cases and make the remaining limitations known. For a single example: long, thin triangles, as well as off-axis transformations, are notorious for destroying the quality of a constructed BVH. SBVH [#Stich2009] is a very effective tool against these cases, so adapting it (or something like triangle splitting) to hardware is an idea worth considering. Evaluating the impact a few bad BVHs might have in a two-level acceleration structure could also be interesting.
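The core operation behind SBVH-style spatial splits [#Stich2009] can be sketched compactly: clip one triangle against an axis-aligned plane and take the tight AABB of each half, so a long thin triangle can be referenced by both children without its full bounds bloating either. Plane selection (the expensive, SAH-driven part) is omitted, and the function name is hypothetical.

```python
# Illustrative sketch of the reference split behind SBVH-style spatial
# splits: clip a triangle against an axis-aligned plane and return the
# tight AABB of each side. Plane selection is omitted.

def split_triangle_bounds(tri, axis, pos):
    """Return (left, right) bounds as (lo, hi) tuple pairs; None for an empty side."""
    left_pts, right_pts = [], []
    n = len(tri)
    for i in range(n):
        a, b = tri[i], tri[(i + 1) % n]
        (left_pts if a[axis] <= pos else right_pts).append(a)
        # Edge crosses the plane: the intersection point belongs to both sides.
        if (a[axis] - pos) * (b[axis] - pos) < 0.0:
            t = (pos - a[axis]) / (b[axis] - a[axis])
            p = tuple(a[k] + t * (b[k] - a[k]) for k in range(3))
            left_pts.append(p)
            right_pts.append(p)

    def bounds(pts):
        if not pts:
            return None
        return (tuple(min(p[k] for p in pts) for k in range(3)),
                tuple(max(p[k] for p in pts) for k in range(3)))

    return bounds(left_pts), bounds(right_pts)
```

A hardware adaptation would have to bound the extra references this creates, since splitting increases primitive count and thus memory traffic during the build.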
(#) References:
[#Aila2010]: Timo Aila and Tero Karras. 2010. Architecture considerations for tracing incoherent rays. In Proceedings of the Conference on High Performance Graphics, 113–122.
[#Benthin2017]: Carsten Benthin, Sven Woop, Ingo Wald, and Attila T. Áfra. 2017. Improved two-level BVHs using partial re-braiding. In Proceedings of High Performance Graphics. 1–8.
[#Bitterli2020]: Benedikt Bitterli, Chris Wyman, Matt Pharr, Peter Shirley, Aaron Lefohn, and Wojciech Jarosz. 2020. Spatiotemporal reservoir resampling for real-time ray tracing with dynamic direct lighting. ACM Transactions on Graphics (TOG) 39, 4 (2020), 148–1.
[#Bittner2013]: Jiří Bittner, Michal Hapala, and Vlastimil Havran. 2013. Fast insertion-based optimization of bounding volume hierarchies. In Computer Graphics Forum, Wiley Online Library, 85–100.
[#Gu2013]: Yan Gu, Yong He, Kayvon Fatahalian, and Guy Blelloch. 2013. Efficient BVH construction via approximate agglomerative clustering. In Proceedings of the 5th High-Performance Graphics Conference, 81–88.
[#Haar2015]: Ulrich Haar and Sebastian Aaltonen. 2015. GPU-Driven Rendering Pipelines.
[#Hunt2008]: Warren Hunt and William R Mark. 2008. Ray-specialized acceleration structures for ray tracing. In 2008 IEEE Symposium on Interactive Ray Tracing, IEEE, 3–10.
[#Karis2021]: Brian Karis, Rune Stubbe, and Graham Wihlidal. 2021. Nanite: A Deep Dive.
[#Karras2012]: Tero Karras. 2012. Maximizing parallelism in the construction of BVHs, octrees, and k-d trees. In Proceedings of the Fourth ACM SIGGRAPH/Eurographics conference on High-Performance Graphics, 33–37.
[#Karras2013]: Tero Karras and Timo Aila. 2013. Fast parallel construction of high-quality bounding volume hierarchies. In Proceedings of the 5th High-Performance Graphics Conference, 89–99.
[#Kensler2008]: Andrew Kensler. 2008. Tree rotations for improving bounding volume hierarchies. In 2008 IEEE Symposium on Interactive Ray Tracing, IEEE, 73–76.
[#Kopta2015]: Daniel Kopta, Konstantin Shkurko, Josef Spjut, Erik Brunvand, and Al Davis. 2015. Memory considerations for low energy ray tracing. In Computer Graphics Forum, Wiley Online Library, 47–59.
[#Lin2019]: Daqi Lin, Konstantin Shkurko, Ian Mallett, and Cem Yuksel. 2019. Dual-split trees. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, 1–9.
[#Lin2020]: Daqi Lin and Cem Yuksel. 2020. Real-time stochastic lightcuts. Proceedings of the ACM on Computer Graphics and Interactive Techniques 3, 1 (2020), 1–18.
[#Meister2017]: Daniel Meister and Jiří Bittner. 2017. Parallel locally-ordered clustering for bounding volume hierarchy construction. IEEE Transactions on Visualization and Computer Graphics 24, 3 (2017), 1345–1353.
[#Meister2018]: Daniel Meister and Jiří Bittner. 2018. Parallel reinsertion for bounding volume hierarchy optimization. In Computer Graphics Forum, Wiley Online Library, 463–473.
[#Nah2014]: Jae-Ho Nah, Hyuck-Joo Kwon, Dong-Seok Kim, Cheol-Ho Jeong, Jinhong Park, Tack-Don Han, Dinesh Manocha, and Woo-Chan Park. 2014. RayCore: A ray-tracing hardware architecture for mobile devices. ACM Transactions on Graphics (TOG) 33, 5 (2014), 1–15.
[#Pharr1997]: Matt Pharr, Craig Kolb, Reid Gershbein, and Pat Hanrahan. 1997. Rendering complex scenes with memory-coherent ray tracing. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, 101–108.
[#Shkurko2017]: Konstantin Shkurko, Tim Grant, Daniel Kopta, Ian Mallett, Cem Yuksel, and Erik Brunvand. 2017. Dual streaming for hardware-accelerated ray tracing. In Proceedings of High Performance Graphics. 1–11.
[#Stich2009]: Martin Stich, Heiko Friedrich, and Andreas Dietrich. 2009. Spatial splits in bounding volume hierarchies. In Proceedings of the Conference on High Performance Graphics 2009, 7–13.
[#Viitanen2018]: Timo Viitanen, Matias Koskela, Pekka Jääskeläinen, Aleksi Tervo, and Jarmo Takala. 2018. PLOCTree: A Fast, High-Quality Hardware BVH Builder. Proceedings of the ACM on Computer Graphics and Interactive Techniques 1, 2 (2018), 1–19.
[#Wald2004]: Ingo Wald. 2004. Realtime ray tracing and interactive global illumination. Ausgezeichnete Informatikdissertationen 2004 (2004).
[#Wald2007]: Ingo Wald. 2007. On fast construction of SAH-based bounding volume hierarchies. In 2007 IEEE Symposium on Interactive Ray Tracing, IEEE, 33–40.
[#Wald2012]: Ingo Wald. 2012. Fast construction of SAH BVHs on the Intel many integrated core (MIC) architecture. IEEE Transactions on Visualization and Computer Graphics 18, 1 (2012), 47–57.