Handling of very large meshes

Hello!
In my current project, I have introduced the OptiX API and it brings significant speedups (up to 1000x in certain tasks!). However, the mesh data I have used for the tests is small compared to the mesh data we use in practice. Normally, our meshes consist of billions of triangles plus textures; in extreme cases the total size exceeds 250 GiB… Worse, the data usually does not compress very well, since it comes from scans of very large rooms and contains few repeatable objects.
So I have some questions regarding that:

  1. Would it be possible to partition those meshes across multiple GPUs? If so, how can this be done with OptiX 7.4 / 7.5?
  2. Are you planning to allow the use of unified or at least virtual memory for the BVHs, or at least for the meshes that are the input for building the GAS/IAS? Right now the maximum BVH size is limited to half of the VRAM at best, since the meshes must first be loaded into VRAM with cudaMalloc + cudaMemcpy, then the GAS must be built, and only then can I release the mesh memory with cudaFree (the vertex data is stored in the BVH anyway); see the sketch at the end of this post. AFAIK, unified memory makes it possible to oversubscribe the VRAM on newer GPUs (>= SM 6.0). Even if it caused a performance drop, it could still turn out to be faster than doing the ray casting on the CPU.
    Thanks in advance for any hints regarding the problems described above…
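
For reference, here is a minimal sketch of the build flow I mean, using the OptiX 7 API (error checking omitted; the function and the triangle-soup input are just illustrative, not our actual code):

```cpp
#include <cuda_runtime.h>
#include <optix.h>

// Build one GAS from a triangle "soup" (no index buffer, so numVertices
// must be a multiple of 3). Error checking omitted for brevity.
OptixTraversableHandle buildGas( OptixDeviceContext context,
                                 const float3*      h_vertices,
                                 unsigned int       numVertices,
                                 CUdeviceptr&       d_gasOut )
{
    const size_t verticesSizeInBytes = numVertices * sizeof( float3 );

    // 1) The input mesh must be resident in VRAM for the build...
    CUdeviceptr d_vertices;
    cudaMalloc( reinterpret_cast<void**>( &d_vertices ), verticesSizeInBytes );
    cudaMemcpy( reinterpret_cast<void*>( d_vertices ), h_vertices,
                verticesSizeInBytes, cudaMemcpyHostToDevice );

    OptixBuildInput input = {};
    input.type = OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
    input.triangleArray.vertexFormat  = OPTIX_VERTEX_FORMAT_FLOAT3;
    input.triangleArray.numVertices   = numVertices;
    input.triangleArray.vertexBuffers = &d_vertices;
    const unsigned int geomFlags      = OPTIX_GEOMETRY_FLAG_NONE;
    input.triangleArray.flags         = &geomFlags;
    input.triangleArray.numSbtRecords = 1;

    OptixAccelBuildOptions options = {};
    options.buildFlags = OPTIX_BUILD_FLAG_NONE;
    options.operation  = OPTIX_BUILD_OPERATION_BUILD;

    OptixAccelBufferSizes sizes;
    optixAccelComputeMemoryUsage( context, &options, &input, 1, &sizes );

    // 2) ...and so are the temp and output buffers: the peak footprint is
    //    vertices + temp + BVH at once, which is why the usable BVH size
    //    ends up well below the total VRAM.
    CUdeviceptr d_temp;
    cudaMalloc( reinterpret_cast<void**>( &d_temp ),   sizes.tempSizeInBytes );
    cudaMalloc( reinterpret_cast<void**>( &d_gasOut ), sizes.outputSizeInBytes );

    OptixTraversableHandle handle;
    optixAccelBuild( context, 0, &options, &input, 1,
                     d_temp, sizes.tempSizeInBytes,
                     d_gasOut, sizes.outputSizeInBytes,
                     &handle, nullptr, 0 );

    // 3) Only after the build can the input mesh and the temp buffer be
    //    freed (cudaFree synchronizes with the device, so this is safe).
    cudaFree( reinterpret_cast<void*>( d_temp ) );
    cudaFree( reinterpret_cast<void*>( d_vertices ) );
    return handle;
}
```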

Hi @piotr.holonowicz,

This is a great question, and a great use case we are interested in addressing. There are some potential ways to handle this today or in the near future. I don’t know of any easy solutions, but here are a few options.

One thing we just announced is the Displacement Micro-Mesh: https://developer.nvidia.com/rtx/ray-tracing/micro-mesh. This is coming relatively soon to OptiX, and it allows loading meshes with many times more triangles than is currently possible. There are some limitations, and it might require translating your meshes into a format that is compatible with the API, but this is something to consider in order to ray trace huge meshes on your GPUs.

You can also partition meshes on your own. Aside from the above mesh format, OptiX doesn’t have anything specific in the API to help you with partitioning; it’s more a matter of designing your own partitioning scheme, running OptiX separately on each partition, and combining the results. You don’t necessarily need multiple GPUs for this, though doing it on a single GPU may be much slower. Depending on the rendering algorithm you’re using, you might be able to load one partition at a time, render it to a G-buffer that stores the ray t value, and check the t values before writing any given pixel. Once you have cycled through all partitions, the final G-buffer will hold the same answer as if you had rendered the whole mesh in one go. (I’m thinking specifically of Chris Hellmuth’s example described here: https://www.render-blog.com/) A sketch of the per-pixel t test follows below. This, as you say, might still be faster than ray casting on the CPU.
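
To illustrate the compositing step, here is a minimal CUDA sketch of that per-pixel t test; all buffer names are hypothetical, and partitionT / partitionColor would be produced by the OptiX launch over the current partition:

```cpp
#include <cuda_runtime.h>

// CUDA kernel (compile with nvcc). Keep, per pixel, the closest hit seen
// across all partitions rendered so far. partitionT holds this partition's
// hit distances (FLT_MAX on miss); accumT and accumColor are the running
// G-buffer. All names are illustrative.
__global__ void compositePartition( int           numPixels,
                                    const float*  partitionT,
                                    const float4* partitionColor,
                                    float*        accumT,
                                    float4*       accumColor )
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i >= numPixels )
        return;

    if( partitionT[i] < accumT[i] )  // this partition has the closest hit so far
    {
        accumT[i]     = partitionT[i];
        accumColor[i] = partitionColor[i];
    }
}
```

The accumT buffer would be cleared to a large sentinel (e.g. FLT_MAX) once before the first partition, so the first real hit always wins.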

Right now the maximum BVH size is limited to half of the VRAM at best, since the meshes must first be loaded into VRAM with cudaMalloc + cudaMemcpy, then the GAS must be built, and only then can I release the mesh memory with cudaFree (the vertex data is stored in the BVH anyway)

While you’re correct that a resident copy of the mesh is needed while the BVH is being built, you do have the option to subdivide your mesh into smaller pieces, build multiple acceleration structures, and then add each piece to an instance acceleration structure (IAS). If you serialize some or all of the BVH builds, you can use most of your memory, and it also allows you to do BVH compaction; see the sketch below. You might be able to get some real mileage this way.
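
A rough sketch of that pattern (error checking omitted; Piece and buildPieceInput are hypothetical stand-ins for however you represent and describe one submesh, and the final IAS build is elided):

```cpp
#include <cstring>
#include <vector>
#include <cuda_runtime.h>
#include <optix.h>

struct Piece;                                           // hypothetical submesh type
OptixBuildInput buildPieceInput( const Piece& piece );  // hypothetical: fills a triangle build input

// Build one compacted GAS per piece, serially, and collect one instance per
// piece for a later IAS build. Error checking omitted.
void buildPieces( OptixDeviceContext context, const std::vector<Piece>& pieces,
                  std::vector<CUdeviceptr>& gasBuffers, std::vector<OptixInstance>& instances )
{
    OptixAccelBuildOptions options = {};
    options.buildFlags = OPTIX_BUILD_FLAG_ALLOW_COMPACTION;
    options.operation  = OPTIX_BUILD_OPERATION_BUILD;

    for( size_t i = 0; i < pieces.size(); ++i )
    {
        OptixBuildInput input = buildPieceInput( pieces[i] );

        OptixAccelBufferSizes sizes;
        optixAccelComputeMemoryUsage( context, &options, &input, 1, &sizes );

        CUdeviceptr d_temp, d_output, d_compactedSize;
        cudaMalloc( reinterpret_cast<void**>( &d_temp ),          sizes.tempSizeInBytes );
        cudaMalloc( reinterpret_cast<void**>( &d_output ),        sizes.outputSizeInBytes );
        cudaMalloc( reinterpret_cast<void**>( &d_compactedSize ), sizeof( size_t ) );

        // Ask the build to report the compacted size.
        OptixAccelEmitDesc emit = {};
        emit.type   = OPTIX_PROPERTY_TYPE_COMPACTED_SIZE;
        emit.result = d_compactedSize;

        OptixTraversableHandle handle;
        optixAccelBuild( context, 0, &options, &input, 1,
                         d_temp, sizes.tempSizeInBytes,
                         d_output, sizes.outputSizeInBytes, &handle, &emit, 1 );

        size_t compactedSize;
        cudaMemcpy( &compactedSize, reinterpret_cast<void*>( d_compactedSize ),
                    sizeof( size_t ), cudaMemcpyDeviceToHost );

        // Copy into a right-sized buffer, then free the oversized one.
        CUdeviceptr d_compacted;
        cudaMalloc( reinterpret_cast<void**>( &d_compacted ), compactedSize );
        optixAccelCompact( context, 0, handle, d_compacted, compactedSize, &handle );

        cudaFree( reinterpret_cast<void*>( d_temp ) );
        cudaFree( reinterpret_cast<void*>( d_output ) );
        cudaFree( reinterpret_cast<void*>( d_compactedSize ) );
        gasBuffers.push_back( d_compacted );

        OptixInstance inst = {};
        const float identity[12] = { 1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0 };
        std::memcpy( inst.transform, identity, sizeof( identity ) );
        inst.instanceId        = static_cast<unsigned int>( i );
        inst.visibilityMask    = 255;
        inst.traversableHandle = handle;
        instances.push_back( inst );
    }
    // The instances are then uploaded and used in a build input of type
    // OPTIX_BUILD_INPUT_TYPE_INSTANCES to build the IAS (omitted here).
}
```

Note that the peak footprint per iteration is one piece’s build input plus the temp and uncompacted output buffers, so the piece size sets your memory high-water mark.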

In OptiX we do have a “demand loading” library that loads texture tiles on demand. We would like to adapt it for use with geometry (e.g. load subregions of your scene only when rays actually enter that space), and other people have tried this kind of thing with some success. It would take some effort, and it depends on whether rays end up touching all of your triangles when you render, or whether there are regions that never get sampled. (It’s common to point out that we don’t have billions of pixels, so billions of triangles must be overkill in some sense if we’re rendering a picture… just one problem: which triangles do we not need?) If you’re interested in trying something like that, I think we could have a longer conversation about how and where to begin.

Do you already use any noise reduction and/or flat surface decimation process for your scanned data?


David.

Hi David,
thank you for the really exhaustive answer:

  1. The micro-mesh idea sounds like one of the solutions we would like to try. I would love to learn more about it and pass that knowledge on to my team.
  2. Currently we are working on partitioning the meshes with our own approach, which is essentially based on the G-buffer idea you described above. The problem is that we need to load the entire large mesh and split it on the CPU first; only then can we feed the OptiX pipeline(s) with the submeshes.
    The thing is that we have two different pipelines. For the first one, we know all the camera poses in advance, so we could simply eliminate the triangles that fall outside the camera frustums before creating the acceleration structures (see the culling sketch at the end of this post). With the second one the problem is more complicated, as we cannot cull that easily.
  3. In my experience, BVH compaction works poorly with scanned meshes.
  4. I have taken a glance at the “demand loading” library… the code looks quite complicated; I need to look at it more thoroughly.
  5. We do some preprocessing for noise reduction and mesh simplification.
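
For item 2, here is roughly the conservative pre-culling we have in mind for the first pipeline (a sketch, not our production code; extracting the six planes from each camera’s view-projection matrix is omitted, and all names are illustrative):

```cpp
#include <array>
#include <vector>

struct Plane  { float nx, ny, nz, d; };  // nx*x + ny*y + nz*z + d >= 0 means "inside"
struct Float3 { float x, y, z; };
using Frustum  = std::array<Plane, 6>;
using Triangle = std::array<Float3, 3>;

static bool outsidePlane( const Plane& p, const Triangle& t )
{
    for( const Float3& v : t )
        if( p.nx * v.x + p.ny * v.y + p.nz * v.z + p.d >= 0.0f )
            return false;  // at least one vertex is on the inside of this plane
    return true;           // all three vertices are behind this plane
}

static bool insideFrustum( const Frustum& f, const Triangle& t )
{
    for( const Plane& p : f )
        if( outsidePlane( p, t ) )
            return false;  // fully outside one plane -> safe to cull
    return true;           // conservative keep (may over-include near corners)
}

// Keep only triangles visible from at least one of the known camera poses.
std::vector<Triangle> cullAgainstFrustums( const std::vector<Triangle>& tris,
                                           const std::vector<Frustum>&  frustums )
{
    std::vector<Triangle> kept;
    for( const Triangle& t : tris )
        for( const Frustum& f : frustums )
            if( insideFrustum( f, t ) )
            {
                kept.push_back( t );
                break;
            }
    return kept;
}
```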

Piotr

There’s a public high-level overview of the new Displaced Micro-Meshes in the Ada architecture whitepaper on this page: NVIDIA Ada Lovelace Architecture (scroll to the bottom and click the “Architecture” whitepaper button). The API and SDK examples for this are coming soon.

Thanks for the extra info. So yes, it’s comparatively easy if you can cull based on camera poses and then load a single mesh, but if you’re running two different pipelines on the same data, then you probably need a different partitioning scheme, and you may also need to introduce a loop over the partitions in order to render one frame. I don’t mean to make it sound easy; this is a difficult, research-level problem. You might investigate some of the recent publications that describe schemes for GPU scene partitioning, such as “GPU Accelerated Path Tracing of Massive Scenes” by Jaroš et al. (https://dl.acm.org/doi/10.1145/3447807), which is good for multi-GPU. For single-GPU “out of core” rendering, perhaps something like “Out-of-core GPU ray tracing of complex scenes” by Garanzha et al. (https://dl.acm.org/doi/10.1145/2037826.2037854), or “Out-of-Core GPU Path Tracing on Large Instanced Scenes via Geometry Streaming” by Jeremy Berchtold.

Those techniques have significant overlap with the idea of using our Demand Loading library for streaming geometry, so that is also an idea that will take time to develop. The main idea with Demand Loading is that once you’ve decided on your partitioning or clustering scheme, the library can help you with managing and fulfilling the load requests. We’re investigating this and hope to be able to publish examples someday, but in the meantime you could think of the above papers as discussing the partitioning schemes, and the existing Demand Loading library as providing a tool for implementing a lazy-loading-and-restart system (as opposed to trying to render a G-buffer like I mentioned earlier); a rough sketch of such a loop follows below.
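
As a very rough sketch of such a loop, assuming your hit programs write the ids of non-resident partitions into a request buffer when rays touch their proxy bounds (every helper below is hypothetical; the real Demand Loading library has its own request-handling API):

```cpp
#include <set>
#include <vector>

// Hypothetical application-specific helpers:
void launchOptixPipeline();                              // trace against resident GASes + proxy AABBs
std::vector<unsigned int> readRequestBuffer();           // partition ids the rays asked for
bool memoryBudgetExceeded();
void evictColdestPartition( std::set<unsigned int>& );
void loadPartitionAndBuildGAS( unsigned int id );
void rebuildIAS( const std::set<unsigned int>& );

std::set<unsigned int> resident;  // partitions currently on the GPU

void renderFrame()
{
    for( ;; )
    {
        // Launch; rays that enter a non-resident region hit its proxy bounds
        // and record a load request instead of shading.
        launchOptixPipeline();

        const std::vector<unsigned int> requests = readRequestBuffer();
        if( requests.empty() )
            break;  // every ray found real geometry: the frame is complete

        for( const unsigned int id : requests )
        {
            if( !resident.insert( id ).second )
                continue;  // already resident
            if( memoryBudgetExceeded() )
                evictColdestPartition( resident );  // stay within the VRAM budget
            loadPartitionAndBuildGAS( id );
        }

        rebuildIAS( resident );  // re-link the resident GASes, then restart
    }
}
```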


David.