Voxel engines - any possibility to benefit from RTX?

I've been thinking about this for several days and can't fall asleep … I have checked the existing voxel-related posts in this OptiX subforum, but I still want to post.

So the thing is simple: I'd like to build a real-time voxel engine for games that:

  • renders really far: uses LOD, partially loads the world around the player, and streams as the player moves
  • supports voxel rigid bodies; the physics doesn't have to be correct, but they should render efficiently with non-integer positions / rotations. Take windmills, vehicles, and animals (animated voxel animals similar to Fugl) as examples.
  • handles interactive, potentially large-scale voxelized animations (plants, …)
  • is ray traced

I'd like to benefit from RTX. In my few days of playing with OptiX, BVH traversal seems really fast (with RTX). However, what is the best practice for using it with voxels?


So far my idea was:

Use AABBs (custom primitives) for chunks (whose size may vary with view distance), build AS's for those chunks, and implement ray marching (or maybe SVO traversal) in the hit shaders. Hopefully this will reduce the overhead when a ray needs to pass through a big chunk of nothing, similar to the "beam optimization" in the Laine et al. 2010 SVO paper.
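To make this concrete, here is roughly what I imagine the per-chunk intersection program looking like. This is only a sketch: ChunkData, voxelOccupied(), and the fixed 32^3 chunk size are things I made up for illustration, and a real version would clip the ray to the chunk's AABB and step with a 3D DDA instead of a fixed increment.

```cpp
// Sketch only: ChunkData, voxelOccupied(), and the fixed 32^3 chunk size are
// assumptions for illustration, not an existing API.
#include <optix.h>
#include <cuda_runtime.h>

struct ChunkData  // hypothetical per-chunk SBT record payload
{
    float3               origin;     // world-space min corner of the chunk AABB
    float                voxelSize;
    const unsigned char* occupancy;  // 32*32*32 occupancy flags
};

static __forceinline__ __device__ bool voxelOccupied( const ChunkData& c, int x, int y, int z )
{
    return c.occupancy[( z * 32 + y ) * 32 + x] != 0;
}

extern "C" __global__ void __intersection__voxel_chunk()
{
    const ChunkData& chunk = *reinterpret_cast<const ChunkData*>( optixGetSbtDataPointer() );

    const float3 ro = optixGetObjectRayOrigin();
    const float3 rd = optixGetObjectRayDirection();

    // March along the ray interval and report the first occupied voxel.
    // (A real version would first clip [tmin, tmax] to the chunk AABB.)
    float       t    = optixGetRayTmin();
    const float step = 0.5f * chunk.voxelSize;
    while( t < optixGetRayTmax() )
    {
        const int x = (int)floorf( ( ro.x + t * rd.x - chunk.origin.x ) / chunk.voxelSize );
        const int y = (int)floorf( ( ro.y + t * rd.y - chunk.origin.y ) / chunk.voxelSize );
        const int z = (int)floorf( ( ro.z + t * rd.z - chunk.origin.z ) / chunk.voxelSize );
        if( x >= 0 && x < 32 && y >= 0 && y < 32 && z >= 0 && z < 32 && voxelOccupied( chunk, x, y, z ) )
        {
            optixReportIntersection( t, 0 );  // hit kind 0, no extra attributes
            return;
        }
        t += step;
    }
}
```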

This brings up several questions I'd like to ask:

  • What is the best practice for handling AS's with varying primitive counts, where every primitive is static for as long as it exists?
  1. A single GAS with some redundant space allocated, with unused primitives degenerated / hidden;
  2. Multiple nested IAS's, where each leaf GAS corresponds to some chunk of chunks and gets rebuilt on modification. Is an IAS beneficial if the instances don't move? And I somehow feel this basically resembles an octree …
  3. Anything else?
  • Will it be worth the effort? How fast is the "RTX BVH"? Would a pure SVO / DAG simply run faster than this approach? And how would one handle, e.g., a rotating windmill or a floating aircraft with pure SVOs?

Thanks for reading this post; I'd appreciate your comments!

Hi @betairylia, welcome!

Wow, this is a huge question. It’s a very good question, but I have to be clear from the start that I’m not sure we can give you concrete advice yet. We are definitely happy to discuss some of the tradeoffs of various approaches you might take though.

OptiX (and RTX in general) does indeed have very fast ray-BVH traversal. But keep in mind that RTX has been designed primarily for ray tracing surface representations, and not primarily for volumetric data like voxels or for ray marching. This means that the usual tradeoffs could be even larger for you, and that best practices can vary tremendously depending on exactly what you need.

If you want ray marching through sparse volumes, then one option you should at least consider is NanoVDB. https://developer.nvidia.com/blog/accelerating-openvdb-on-gpus-with-nanovdb/

We have done some performance experiments recently using NanoVDB in combination with a hierarchical OptiX BVH, and it has become clear that NanoVDB is pretty good at doing exactly what it is designed for: traversing sparse volumes. With a lot of effort, the OptiX representations can sometimes help accelerate traversal of the sparse volume, but not always. NanoVDB does not use RTX BVH traversal, but it is pretty fast because it's well designed for sparse voxel grids. I'd recommend researching a bit and thinking about how to use each API according to its strengths: OptiX for ray tracing surfaces and animated objects, and NanoVDB for ray marching sparse volumes when it makes sense.

To handle AS’s with varying static primitives, do you mean primitives that exist in the accel structure, but sometimes you want to render them, and sometimes you want them to be invisible? We offer a fixed small number of “visibility masks” - currently 8 - that you can use to build a layering system. Visibility masks are implemented in hardware, so they are very fast. If you need more than 8, then you can use any-hit shaders or custom intersection shaders to dynamically decide to skip intersections, however this approach can be considerably slower. You can also dynamically rebuild your IAS’s to include or exclude children, since IAS’s are often rather small. Updating IAS’s results in faster traversal performance than using shaders, at the expense of limited flexibility and the per-frame rebuild. Some OptiX users are mixing all of these methods to get the advantages of each.
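Just to illustrate the mechanism, here's a sketch (not code from a sample; the "layer" assignment is made up): the mask lives on each OptixInstance, and the per-ray mask passed to optixTrace selects which layers that ray can intersect.

```cpp
#include <optix.h>

// Host side: tag each chunk instance with a "layer" bit when filling in the
// instance array for your IAS build. Only 8 mask bits are available.
OptixInstance makeChunkInstance( OptixTraversableHandle gas,
                                 unsigned int           layerBit,        // 0..7
                                 const float            transform[12] )
{
    OptixInstance inst = {};
    for( int i = 0; i < 12; ++i )
        inst.transform[i] = transform[i];
    inst.instanceId        = 0;
    inst.sbtOffset         = 0;
    inst.visibilityMask    = 1u << layerBit;
    inst.flags             = OPTIX_INSTANCE_FLAG_NONE;
    inst.traversableHandle = gas;
    return inst;
}

// Device side: a ray only tests instances whose mask overlaps the ray's mask,
// so hiding a layer is just a matter of clearing its bit when tracing, e.g.
//
//   optixTrace( params.handle, origin, direction,
//               0.0f, 1e16f, 0.0f,                    // tmin, tmax, time
//               OptixVisibilityMask( visibleLayers ),  // e.g. 0xFF with hidden bits cleared
//               OPTIX_RAY_FLAG_NONE,
//               0, 1, 0,                               // SBT offset, SBT stride, miss index
//               p0, p1 );                              // payload registers
```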

If you want to render animated moving objects, that is something where a hybrid renderer that mixes volume techniques with surface techniques may be applicable. In OptiX, you would handle animated meshes using traversable motion transforms: https://raytracing-docs.nvidia.com/optix7/guide/index.html#acceleration_structures#traversable-objects
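As a rough idea of that setup, here is a sketch (error checking omitted; keyBegin/keyEnd and the helper name are placeholders, and you also need to set usesMotionBlur in your pipeline compile options for motion traversables to take effect):

```cpp
#include <optix.h>
#include <optix_stubs.h>
#include <cuda_runtime.h>
#include <cstring>

// Sketch: wrap a windmill-blade GAS in a matrix motion transform so it rotates
// across the frame's [0,1] time interval. The returned handle is what you
// would reference from an OptixInstance in your IAS.
OptixTraversableHandle makeRotatingBlade( OptixDeviceContext     context,
                                          OptixTraversableHandle bladeGas,
                                          const float            keyBegin[12],  // 3x4 transform at time 0
                                          const float            keyEnd[12] )   // 3x4 transform at time 1
{
    OptixMatrixMotionTransform mt = {};
    mt.child                   = bladeGas;
    mt.motionOptions.numKeys   = 2;
    mt.motionOptions.timeBegin = 0.0f;
    mt.motionOptions.timeEnd   = 1.0f;
    mt.motionOptions.flags     = OPTIX_MOTION_FLAG_NONE;
    memcpy( mt.transform[0], keyBegin, 12 * sizeof( float ) );
    memcpy( mt.transform[1], keyEnd,   12 * sizeof( float ) );

    CUdeviceptr d_mt = 0;
    cudaMalloc( reinterpret_cast<void**>( &d_mt ), sizeof( mt ) );
    cudaMemcpy( reinterpret_cast<void*>( d_mt ), &mt, sizeof( mt ), cudaMemcpyHostToDevice );

    OptixTraversableHandle handle = 0;
    optixConvertPointerToTraversableHandle( context, d_mt,
                                            OPTIX_TRAVERSABLE_TYPE_MATRIX_MOTION_TRANSFORM,
                                            &handle );
    return handle;
}
```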

There are some OptiX users who are mixing NanoVDB with OptiX ray tracing. One way to do this is to set up a scene using an OptiX AS, and inside of it you might have custom primitives that represent volumetric objects. When you hit your custom primitive, you can then launch the ray into the volume object. If the ray misses or exits the volume, you can then continue tracing through the OptiX BVH by re-launching the ray at the volume's entry or exit point (depending on what you need), or if you're using any-hit programs to handle the volume, then the ray will continue on its own.
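In closest-hit pseudocode, the "re-launch at the exit point" idea looks roughly like this; marchChunkVolume() and the Params layout are hypothetical placeholders for whatever volume traversal you plug in (e.g. something built on NanoVDB):

```cpp
#include <optix.h>
#include <cuda_runtime.h>

struct Params { OptixTraversableHandle handle; };  // hypothetical launch params
extern "C" __constant__ Params params;

// Hypothetical helper: returns true (and the hit point) if the ray hits a
// voxel inside this chunk's volume, otherwise false and the exit point.
__device__ bool marchChunkVolume( float3 origin, float3 dir, float tEnter,
                                  float3& hitPos, float3& exitPos );

extern "C" __global__ void __closesthit__chunk()
{
    const float3 org = optixGetWorldRayOrigin();
    const float3 dir = optixGetWorldRayDirection();

    float3 hitPos, exitPos;
    if( marchChunkVolume( org, dir, optixGetRayTmax(), hitPos, exitPos ) )
    {
        // ... shade hitPos and write the result into the payload ...
        return;
    }

    // The ray passed through empty space in this chunk: continue through the
    // rest of the scene by re-tracing from just past the volume's exit point.
    unsigned int p0 = 0, p1 = 0;  // payload registers, contents app-specific
    const float3 newOrg = make_float3( exitPos.x + 1e-3f * dir.x,
                                       exitPos.y + 1e-3f * dir.y,
                                       exitPos.z + 1e-3f * dir.z );
    optixTrace( params.handle, newOrg, dir,
                0.0f, 1e16f, 0.0f,
                OptixVisibilityMask( 0xFF ), OPTIX_RAY_FLAG_NONE,
                0, 1, 0, p0, p1 );
}
```

If you recurse from closest-hit like this, keep an eye on the pipeline's maxTraceDepth; continuing the ray iteratively from the raygen program is the flatter alternative.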

I suspect much of this might be determined more by your game engine requirements than by which technique has the best performance. If you want an asset streaming system to dynamically load and hide nearby partitions of your world, the design of that might add some constraints on what you can do, or lead you more naturally in certain directions.

To answer your specific questions: using a single GAS is probably not recommended for showing and hiding primitives, unless you can take advantage of visibility masks. If you can't use visibility masks, then it's going to mean either any-hit shaders or GAS rebuilds. Nested IAS's are certainly viable, and there's nothing wrong with it feeling similar to an octree… it is similar.
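For reference, rebuilding a leaf IAS over a set of chunk GASes is a fairly small amount of host code. Here is a rough sketch (error handling, scratch-buffer reuse, and compaction are omitted, and buffer lifetime management is up to you):

```cpp
#include <optix.h>
#include <optix_stubs.h>
#include <cuda_runtime.h>
#include <vector>

// Rough sketch of (re)building an IAS over a set of chunk instances.
OptixTraversableHandle buildChunkIAS( OptixDeviceContext                context,
                                      const std::vector<OptixInstance>& instances,
                                      CUdeviceptr&                      d_outputBuffer )  // kept alive by caller
{
    // Upload the instance array
    CUdeviceptr  d_instances = 0;
    const size_t instBytes   = instances.size() * sizeof( OptixInstance );
    cudaMalloc( reinterpret_cast<void**>( &d_instances ), instBytes );
    cudaMemcpy( reinterpret_cast<void*>( d_instances ), instances.data(), instBytes, cudaMemcpyHostToDevice );

    OptixBuildInput input = {};
    input.type                       = OPTIX_BUILD_INPUT_TYPE_INSTANCES;
    input.instanceArray.instances    = d_instances;
    input.instanceArray.numInstances = static_cast<unsigned int>( instances.size() );

    OptixAccelBuildOptions options = {};
    options.buildFlags = OPTIX_BUILD_FLAG_NONE;  // or PREFER_FAST_BUILD for frequent rebuilds
    options.operation  = OPTIX_BUILD_OPERATION_BUILD;

    OptixAccelBufferSizes sizes;
    optixAccelComputeMemoryUsage( context, &options, &input, 1, &sizes );

    CUdeviceptr d_temp = 0;
    cudaMalloc( reinterpret_cast<void**>( &d_temp ),         sizes.tempSizeInBytes );
    cudaMalloc( reinterpret_cast<void**>( &d_outputBuffer ), sizes.outputSizeInBytes );

    OptixTraversableHandle handle = 0;
    optixAccelBuild( context, /*stream=*/0, &options, &input, 1,
                     d_temp, sizes.tempSizeInBytes,
                     d_outputBuffer, sizes.outputSizeInBytes,
                     &handle, /*emittedProperties=*/nullptr, 0 );

    cudaDeviceSynchronize();  // make sure the build is done before freeing inputs
    cudaFree( reinterpret_cast<void*>( d_temp ) );
    cudaFree( reinterpret_cast<void*>( d_instances ) );
    return handle;
}
```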

I know that doesn’t answer any of your questions completely, but I hope that helps a little.


David.


Thank you so much for your reply, @dhart !

First of all, as you pointed out, I understand that RTX is designed for surfaces. However, using the hardware-level acceleration is always an interesting idea xD

And thank you for introducing NanoVDB to me, it looks pretty good! I noticed that NanoVDB uses a linearized (I think?) representation of the VDB tree on the GPU, and it is hard to modify the contents dynamically, so I might not be able to use it out of the box. But at first glance, the VDB structure looks very competitive when it comes to voxels, so I will definitely learn about it. Thanks!

By the way, do you think it is practical to use static NanoVDBs to hold each chunk (say, 32^3 voxels) and rebuild a chunk entirely on modification? Is such a chunk too small to unleash the full power of NanoVDB, or is the rebuild process too slow to fit in a frame (say 5~15 ms)?

You said that "With a lot of effort, the OptiX representations can sometimes help accelerate traversal of the sparse volume, but not always." - so does this mean that the RTX BVH cannot achieve an order-of-magnitude advantage for voxels compared with well-written software code (say, CUDA / NanoVDB)? (For tracking animated objects I think it is still necessary, though.)

After reading your reply I think I'd better do more research before I build the actual thing xD


Regarding AS's: "do you mean primitives that exist in the accel structure, but sometimes you want to render them, and sometimes you want them to be invisible?"

Currently in my mind, only chunk AABBs that are visible (on the surface) will be rendered, i.e. pushed into the AS. Imagine digging a hole in the ground; when you have dug long enough, eventually a new chunk must be pushed into the AS, with no other chunks being removed, so there is one more primitive and the AS needs to be rebuilt. Otherwise, I would have to push all chunks within a certain range of the player, and I'm afraid that would degrade performance even if they are set to "invisible". Perhaps, as you suggested, I will use nested IAS's if I decide to go in this (chunk BVHs) direction.


Again, big thanks for your reply! I will take a look at the existing works you mentioned. I think my final product (if I am lucky enough to make it) will have to be something hybrid, maybe even with my own modified structures. But for now I guess it's better for me to learn more! I'd like to find some works using NanoVDB + OptiX, maybe by googling around =w=

Sorry for the long post and so many questions. I really do appreciate our enlightening discussion!

so does this mean that the RTX BVH cannot achieve an order-of-magnitude advantage for voxels compared with well-written software code (say, CUDA / NanoVDB)?

I was intentionally vague, but this really depends a lot on your data. :) The kind of data where OptiX can be faster is opaque volumetric level-set data, meaning very sparse volumes that look a lot like surfaces. If you have heavy scattering or a highly translucent medium, like clouds or steam or fog or fire or smoke, and/or a lot of mixing of dense areas with sparse areas, then you might be better off with NanoVDB. If you are primarily rendering opaque block surfaces in a blocky world, and you could replace the blocks with cubes made of textured triangles, then OptiX might potentially be much faster, but at a higher memory cost.

I don’t know enough about NanoVDB to answer your question about loading and rebuilding chunks and updating the NanoVDB data structure. I will ask someone who’s used it more than me to contribute here.

For your chunked voxels, it's hard to say what will work without knowing a lot more about the game engine design: how big the world is going to be, how many chunks you might potentially have to traverse in the average and worst cases, and how big the IAS rebuilds might be. But my gut reaction is that nested IASs still sound good to me based on your additional description. I'd say try that first and see how the performance is, and whether you have any bottlenecks, before doing something more complicated. (But, yes, also do more research before implementing anything based on what I talked about… I am not a volume rendering expert, and there are probably good strategies out there that I don't know.)

Sounds like a very fun project, we’ll be here to answer more OptiX questions along the way.


David.


I guess there will mostly be opaque objects, but I don't know whether it is beneficial to convert them into triangles first (considering both rebuild time and space). Right now I'm using that kind of approach with rasterization, but without LOD the furthest I can go is around 1200x256x1200 voxels, which I'd like to extend (that's already ~200M verts, and there are shadow map passes, etc.).

4096x1024x4096 with dynamic streaming following player movement is my current target. With those numbers, there will be 524,288 chunks if they are 32x32x32 in size. Obviously we don't need all of them, though, since we can only see a small part of them at once (occlusion, view frustum, etc.).

So yeah, as you suggested, IASs seem to be a good solution.

Thanks so much for your help!! I will post in this thread again if I run into any problems I'd like to discuss; this is such a warm place to be xD


there will be 524,288 chunks if they are 32x32x32 in size

Okay, so just some napkin arithmetic: if you had only 1 top level IAS for the chunks, and each chunk had 1 bottom level GAS, then how long would it take to rebuild your IAS after you load some new chunks?

On Turing GPUs, if I remember correctly, BVH GAS builds can be roughly in the neighborhood of 100 million primitives per second, especially if the BVH is large. By “roughly” I mean things could vary by 2x or 3x depending on many factors, but I think it’s unlikely to vary by 10x or more. Ampere GPUs will be faster to rebuild due to the higher SM count.

So, assuming 100M prims/sec, a rebuild of 524288 prims might take around half a millisecond. Might round that up to 1 millisecond to be safe, there is some overhead as well. That sounds like it could be reasonable, though I know from experience it’s common for game engines to not have an extra millisecond just sitting around.

BTW I need to check to make sure IAS rebuilds are as fast as GAS rebuilds. All the timing experiments I know of are measuring large GAS rebuilds. Hopefully I’m not off by a factor of 5 or something. ;) I will update if I’m wrong.

If you used another level of IAS, say you had 32^3 uber-chunks, then to add 1 new chunk, you would have two IAS rebuilds each with 1000 elements or fewer (1 uber-chunk rebuild plus 1 chunk rebuild), so the time would be very small. But the rendering traversal cost of a 2-level IAS, or 3-level scene, is a bit higher than the traversal cost of a 2-level scene, so keep that in mind if you expect ray tracing to be the bottleneck.
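One related detail: the pipeline has to be compiled to allow the deeper scene graph. A minimal sketch of the relevant option:

```cpp
#include <optix.h>

OptixPipelineCompileOptions pco = {};
// Two IAS levels (a 3-level scene) require the general setting:
pco.traversableGraphFlags = OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_ANY;
// A single IAS over GASes (a 2-level scene) can instead use the cheaper,
// more restrictive OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING.
```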


David.


Wow, those lovely numbers.
Sounds very reasonable, and I think it aligns well with my numbers! (Ah yeah, I guess 524,288 is 0.5M, so 100M/sec gives ~5ms? 0_0 (but we have multi-IAS xD))

So far I got these numbers on a 2080 Super, building a single GAS with 1x1x1 ~ 4x4x4 AABBs scattered within a 4096x256x4096 region (for traversal I just report an intersection no matter what in the intersection program; I don't know if that makes sense, though):

#AABBs - build time (no alloc, etc.) - traversal time @ 1280x720, 1 spp
524,288 - 3.49 ms - 0.58 ms
1,048,576 - 7.90 ms - 0.66 ms
4,194,304 - 27.63 ms - 0.58 ms (the GAS already occupied around 1.5 GiB of VRAM)

It sounds more solid after hearing from you, but I still need to think about how to do this properly. Maybe I should allocate more space up front to avoid reallocation after adding primitives…?

But the rendering traversal cost of a 2-level IAS, or 3-level scene, is a bit higher than the traversal cost of a 2-level scene, so keep that in mind if you expect ray tracing to be the bottleneck.

Do you know how much "a bit" is? Like 1.1x / 1.2x, or does it depend drastically on the situation?
Thanks!! <3

Ah yeah, I guess 524,288 is 0.5M, so 100M/sec gives ~5ms?

Oops, you are right, I was missing a zero. Good thing you're getting faster builds than what I remembered: it looks like you're seeing ~150M prims/sec on your 2080.

Do note that because of overheads, very small BVH builds won't see the same prims/sec rate as large BVH builds. That means if you use a multi-level IAS and build 1000 IASs that are each 1000 prims, it will not be as fast as building a single IAS of 1M prims. This won't be a problem at all if you build only a small number of IASs in any given frame, but it could become an issue if you need to rebuild many IASs.

Do you know how much "a bit" is? Like 1.1x / 1.2x, or does it depend drastically on the situation?

It depends on the situation, so always measure. The cost has changed (gone down) since the last time I measured, and we are always trying to reduce it more. I’m guessing - speculating - that it would be reasonable to assume 1.2x in your case and then (hopefully) be pleasantly surprised with a lower number when you measure it.


David.

Thanks for all this information and your help! I think I will do some more research and maybe try nested IAS's.
It's great to hear that it's around 1.2x; that sounds reasonable xD