I was experimenting with a hybrid algorithm to increase spatial coherency. Conceptually, my scene is subdivided into a grid of 3D cells. One requirement of the algorithm is that each cell has a local tracing range, i.e. the maximum range in which it can perform path tracing. I define this tracing range as the content of the cell itself plus a set of its neighbours (26, 124, 342… neighbours, depending on the range we want to take into account). Now my question is: how do I best organize the acceleration structures for each cell? I do not want a shared acceleration structure that contains my whole scene, because the whole point of the algorithm is to only access the ‘local data’ within the tracing range.
One idea was to have a separate acceleration structure (AS) per cell that only contains the geometry of that cell. I would then take the ASes of all neighbours in the tracing range and test for intersections against each one of them (optimizations are possible, such as skipping the non-facing neighbours, but let’s omit that here), keeping the closest intersection. An advantage of this approach, I believe, is that there is no duplicate geometry in the ASes; each AS represents exactly the geometry of the cell it belongs to. However, we do have to loop over the ASes and test for intersections against each one of them, which negatively impacts performance (correct me if I am wrong here).
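For illustration, the explicit software loop of this first idea could be sketched like this. The types and functions here are hypothetical stand-ins: a toy one-sphere “AS” per cell takes the place of a real per-cell GAS, and `intersect` stands in for a full trace against that cell’s structure.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cmath>
#include <vector>

// Hypothetical per-cell AS; in real OptiX code this would be an
// OptixTraversableHandle plus the geometry the GAS was built over.
struct CellAS { float cx, cy, cz, radius; };

// Toy closest-hit query against one cell's "AS" (a single sphere),
// standing in for a trace against that cell's GAS.
// Returns the hit distance t, or FLT_MAX on a miss.
static float intersect(const CellAS& as, const float o[3], const float d[3])
{
    float oc[3] = { o[0] - as.cx, o[1] - as.cy, o[2] - as.cz };
    float b = oc[0]*d[0] + oc[1]*d[1] + oc[2]*d[2];
    float c = oc[0]*oc[0] + oc[1]*oc[1] + oc[2]*oc[2] - as.radius*as.radius;
    float disc = b*b - c;
    if (disc < 0.0f) return FLT_MAX;
    float t = -b - std::sqrt(disc);
    return (t > 0.0f) ? t : FLT_MAX;
}

// The explicit loop from idea #1: test the ray against the AS of every
// cell in the tracing range and keep the closest intersection.
float traceLocal(const std::vector<CellAS>& neighborhood,
                 const float o[3], const float d[3])
{
    float tClosest = FLT_MAX;
    for (const CellAS& as : neighborhood)
        tClosest = std::min(tClosest, intersect(as, o, d));
    return tClosest;
}
```

The point of the sketch is just that this outer loop runs in software, once per neighbor AS, which is exactly the overhead discussed below.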
Another idea was to create an AS for each group of neighbours in a tracing range. So for a given cell, take its own geometry and the geometry of its neighbours within the tracing range to build the AS. Now we end up with only one AS per cell that we need to trace against, but there is severe redundancy in the geometry that multiple ASes represent. So a disadvantage that I expect here is much greater memory usage.
I hope these two ideas make it a bit clearer what I am trying to achieve.
I think you’re right that building a Geometry AS (GAS) per cell and looping over neighbor GASes explicitly would slow down your perf considerably. However, one thing to remember that may open up some options for you is that you can use Instance ASes (IAS) to group GASes however you want. So you do have the option to build one GAS per cell, avoiding all duplicate geometry, but also build one IAS per cell+neighbors. Then the only things duplicated are the handles to the neighbor GASes, rather than all the geometry. You can then trace a single ray against the IAS for a given cell+neighbors, and OptiX+RTX takes care of looping over all the neighbors much more efficiently than if you loop explicitly. So I think you can combine both approaches you describe and take the best of each. Does that make sense and sound doable? I think the biggest potential for trickiness here is that, depending on what your shading/material system looks like, you might have to be careful with SBT offsets, or you might have to create an SBT for each IAS, which might create some data duplication you weren’t thinking about.
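To make the suggestion a bit more concrete, here is a rough sketch of gathering the instance records for one cell’s IAS. The structs are simplified stand-ins, not the real API: in actual OptiX code these would be `OptixInstance` records holding `OptixTraversableHandle` values, passed to `optixAccelBuild` as an `OPTIX_BUILD_INPUT_TYPE_INSTANCES` build input.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for OptixInstance: an object-to-world transform plus the
// handle of the per-cell GAS this instance references.
struct Instance {
    float transform[12];        // row-major 3x4 transform
    uint64_t traversableHandle; // GAS handle of the referenced cell
};

// Gather the instances for cell (cx,cy,cz) in a W*H*D grid: the cell's
// own GAS plus all neighbor GASes within `range` cells. Since each
// cell's geometry is already in world space, every transform is the
// identity; only the handles differ.
std::vector<Instance> gatherNeighborhood(
    const std::vector<uint64_t>& gasHandles,
    int W, int H, int D, int cx, int cy, int cz, int range)
{
    const float identity[12] = { 1,0,0,0,  0,1,0,0,  0,0,1,0 };
    std::vector<Instance> instances;
    for (int z = cz - range; z <= cz + range; ++z)
    for (int y = cy - range; y <= cy + range; ++y)
    for (int x = cx - range; x <= cx + range; ++x) {
        if (x < 0 || x >= W || y < 0 || y >= H || z < 0 || z >= D)
            continue;                       // clip at the grid border
        Instance inst{};
        for (int i = 0; i < 12; ++i) inst.transform[i] = identity[i];
        inst.traversableHandle = gasHandles[(z * H + y) * W + x];
        instances.push_back(inst);
    }
    return instances;
}
```

This is where the “only handles are duplicated” property shows up: the instance list per cell is tiny (27 records for range 1) even if each referenced GAS is large.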
All that said, I know you said you don’t want a shared acceleration structure, but I’m curious if you really truly need to physically subdivide your scene? What if you kept the AS for the entire scene, and had an auxiliary grid structure for your cells that keeps only tracing range information? Would it still work for you if you could look up the tracing range before tracing a ray, and then set the ray’s tmax parameter accordingly? This would keep a ray from tracing beyond any given cell’s neighbors, without having to explicitly partition the scene. If your tracing range is constant, then of course you don’t need an explicit grid structure either, but you can still read/write other information per cell before & after tracing a ray, if that’s part of the algorithm.
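A minimal host-side sketch of that tmax idea, assuming a uniform grid (names like `cellSize` and `range` are my own placeholders): clamp the ray’s tmax to the distance at which it exits the axis-aligned box of the cell plus its neighbors, then pass that value to `optixTrace`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Compute a tmax that keeps a ray from leaving the cell-plus-neighbors
// region: the exit distance of the ray from the AABB covering the cell
// and `range` neighbor cells in each direction.
float neighborhoodTmax(const float origin[3], const float dir[3],
                       const int cell[3], float cellSize, int range)
{
    float tExit = INFINITY;
    for (int a = 0; a < 3; ++a) {
        // AABB bounds of the neighborhood along axis a.
        float lo = (cell[a] - range)     * cellSize;
        float hi = (cell[a] + range + 1) * cellSize;
        if (dir[a] != 0.0f) {
            float t0 = (lo - origin[a]) / dir[a];
            float t1 = (hi - origin[a]) / dir[a];
            tExit = std::min(tExit, std::max(t0, t1)); // exit t on axis a
        }
    }
    return tExit; // pass as tmax so traversal never leaves the range
}
```

With a constant tracing range you could even skip the grid lookup entirely and derive `cell` from the ray origin.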
The reasons to consider the single shared AS are that you can still control locality of access via the ray origin and the ray tmax parameter. The ray will not iterate through anything in the scene that it can trivially cull, so if you shoot a ray inside a conceptual cell with a range that stops at the cell’s neighbor bounds, traversal will only access approximately the cell & neighbor contents. Each ray might traverse through a couple of large top-level boxes in the hierarchy, but keep in mind that if you partition your scene explicitly into an AS per cell, it will not eliminate traversal through the top-level boxes; it will just move that part of traversal into software, where it might be slower. My suggestion above to use an IAS for the cell+neighbors is an example of this: while it can add more steps in the BVH tree traversal on the hardware side, it replaces a much slower software iteration over the neighbors, and the hardware can do extra BVH culling in parallel and approximately for free from your perspective. So, I recommend re-examining the assumption that you don’t want a single shared AS: you might be able to explore your locality algorithm even with a single AS, and it might even give you more freedom to set a continuous range limit on rays, instead of using a discrete cell size.
Thank you for sharing your insight! The IAS per cell + neighbors approach sounds great. I did not know that was possible, but it sounds like it is indeed the best of both worlds that I described above.
Concerning the approach using only 1 AS for the entire scene: I think what I am currently doing is exactly what you described there, setting the tmax parameter according to the tracing range of that cell. However, I had my concerns in terms of coherency, which is the reason why I asked this question and wanted to transition to the IAS per cell + neighbors approach.
In this project I am experimenting with a memory-coherent approach to path tracing. What I really want to achieve is to make sure that the local data of one cell + its neighbors inside the tracing range fits into cache, ultimately leading to a high cache hit rate. I then want to tune the cell size depending on the GPU’s cache sizes and so on. For small toy scenes this approach will make little to no difference, but my hope is that the method can benefit performance in larger scenes. I am not really aware of the underlying OptiX implementation when it comes to intersection tests, but my concern with one giant AS for the whole scene was that a lot of incoherent memory accesses would still be made, essentially rendering the technique I am trying to implement useless. Do you think that would be the case?
Conceptually I agree with what you said: if I just set the tmax parameter to the tracing range, only some large top-level bounding boxes will be intersected, and the intersection itself should still be found within the local tracing range. But I am not sure what exactly will happen ‘under the hood’. I also believe that for large scenes the size of the AS scales up considerably as well (possibly many levels of the hierarchy need to be traversed before we are in ‘tracing range’), am I right? Given this scenario, do you think the access to those bounding boxes higher up in the hierarchy can lead to incoherent memory accesses? If not, one AS for the whole scene is the easiest to implement and also the most preferable option, I think.
Makes sense to me! Yes, I’d still guess that using tmax will help constrain memory access and keep it a bit more coherent. Under the hood, OptiX does not request memory for anything culled by a ray, and use of tmax will automatically cull anything further away than tmax. So I would expect using tmax to limit the amount of memory requested for a ray during traversal, and to limit the memory requests to geometry and tree nodes that land within the [tmin, tmax] range.
The good news is that it should be pretty easy to test. Nsight Compute can show you stats about total memory request sizes and L1/L2 cache hit rates. If we’re right about reducing tmax helping your algorithm, then the memory stats and any associated speedup should be visible without too much effort.
This may depend entirely on how many rays you launch at a time, and whether a batch of rays starts near each other. If you limit the tmax for all rays, but cast them from every cell in the scene, all going in random directions, then the kernel overall is still going to request approximately everything in the scene, and the access might still be quite incoherent. This is one reason to start looking at a grid data structure that can process batches of rays that are all near each other. Another consideration is whether all the rays in a warp tend to be spatially local. OptiX tiles your thread indices to keep camera rays together, which usually helps with that first path segment. Secondary rays for AO or global illumination can become incoherent quickly, and this might lead to warp-level divergence where trimming tmax or even partitioning the scene might not do that much good.
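As one simple illustration of batching nearby rays together on the host side, you could sort rays by the Morton code of their origin cell, so consecutive rays (and hence warps) touch nearby cells. The `Ray` struct and cell size are placeholders; this is a sketch, not the approach OptiX uses internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz; };  // placeholder: origin only

// Spread the low 10 bits of v so each bit lands at position 3i.
static uint32_t expandBits(uint32_t v)
{
    v &= 0x3ffu;
    v = (v | (v << 16)) & 0x030000ffu;
    v = (v | (v << 8))  & 0x0300f00fu;
    v = (v | (v << 4))  & 0x030c30c3u;
    v = (v | (v << 2))  & 0x09249249u;
    return v;
}

// Interleave three 10-bit cell coordinates into a 30-bit Morton code.
static uint32_t mortonCode(uint32_t x, uint32_t y, uint32_t z)
{
    return (expandBits(z) << 2) | (expandBits(y) << 1) | expandBits(x);
}

// Sort rays so rays originating in the same or adjacent cells become
// contiguous in the launch order.
void sortRaysByCell(std::vector<Ray>& rays, float cellSize)
{
    std::sort(rays.begin(), rays.end(), [&](const Ray& a, const Ray& b) {
        uint32_t ca = mortonCode((uint32_t)(a.ox / cellSize),
                                 (uint32_t)(a.oy / cellSize),
                                 (uint32_t)(a.oz / cellSize));
        uint32_t cb = mortonCode((uint32_t)(b.ox / cellSize),
                                 (uint32_t)(b.oy / cellSize),
                                 (uint32_t)(b.oz / cellSize));
        return ca < cb;
    });
}
```

Sorting like this mainly helps when you regenerate or compact secondary rays between bounces, which is when the original camera-ray tiling has broken down.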
One option in the near future will be to use the upcoming Shader Execution Reordering feature to launch rays sorted by their spatial locality. This doesn’t limit the batch to nearby rays, but it might help make sure that all rays in a warp are very near each other, which might allow the GPU to coalesce some or all of the memory accesses in the warp.
Great! That sounds exactly like what I want to achieve. About profiling with Nsight Compute: I recently started looking into this. Is it correct that I should profile my overall application with Nsight Systems first, and then drill into the places where the overhead occurs with Nsight Compute?