Handling of very large meshes

In my current project, I have introduced the OptiX API, and it brings significant speedups (up to 1000× in certain tasks!). However, the mesh data I have used for these tests is small compared to the mesh data we use in practice. Our meshes typically consist of billions of triangles plus textures; in extreme cases the total size exceeds 250 GiB… Worse, the data usually does not compress well, since it comes from scans of very large rooms and does not contain repeated objects.
So I have some questions regarding that:

  1. Would it be possible to partition those meshes over multiple GPUs? If so, how could that be done with OptiX 7.4 or 7.5?
  2. Are you planning to allow the use of unified (or at least virtual) memory for the BVHs? Or at least for the meshes that serve as input for building the GAS/IAS? Right now, the maximum BVH size is limited to half of the VRAM at best, since the meshes first need to be loaded into VRAM via cudaMalloc/cudaMemcpy, then the GAS must be built, and only then can I release the VRAM used for the meshes with cudaFree (the vertex data is then stored in the BVH anyway). AFAIK, with unified memory it is possible to oversubscribe VRAM on newer GPUs (>= SM 6.0). Even if this caused a performance drop, it could still be a faster approach than doing the raycasting on the CPU.
    Thanks in advance for any hints regarding the problems described above…

Hi @piotr.holonowicz,

This is a great question, and a great use case we are interested in addressing. There are some potential ways to handle this today or in the near future. I don’t know if there are any easy solutions, but here are a few options.

One thing we just announced is the Displacement Micro-Mesh. https://developer.nvidia.com/rtx/ray-tracing/micro-mesh This is coming relatively soon to OptiX, and it allows loading meshes with many times more triangles than is currently possible. There are some limitations, and it might require translating your meshes into a format that is compatible with the API, but this is something to consider in order to ray trace huge meshes on your GPUs.

You can also partition meshes on your own. Aside from the above mesh format, OptiX doesn’t have anything specific in the API to help you with partitioning; it’s more a matter of designing your own partitioning scheme and then running OptiX separately on each partition and combining the results. You don’t necessarily even need multiple GPUs, though doing this on a single GPU might be a lot slower than with multiple GPUs. Depending on what rendering algorithm you’re using, you might be able to load one partition at a time, render it to a G-buffer that stores your ray t value, and check the t values before writing any given pixel. Once you have cycled through all partitions, your final G-buffer will hold the same result as if you had rendered the whole mesh in one go. (I’m thinking specifically of Chris Hellmuth’s example described here: https://www.render-blog.com/) This, as you say, might still be faster than raycasting on the CPU.

> Right now, the maximum BVH size is limited to half of the VRAM at best, since the meshes first need to be loaded into VRAM via cudaMalloc/cudaMemcpy, then the GAS must be built, and only then can I release the VRAM used for the meshes with cudaFree (the vertex data is then stored in the BVH anyway)

While you’re correct about needing a resident copy while the BVH is being built, you do have the option to subdivide your mesh into smaller pieces, build multiple acceleration structures, and then add each piece to an instance acceleration structure. If you did that, you would be able to use most of your memory (if you serialize some or all of your BVH builds), and it would allow you to do BVH compaction as well. You might be able to find some real mileage this way.

In OptiX we do have a “demand loading” library that loads texture tiles on demand. We would like to adapt it for use with geometry (e.g. load subregions of your scene only when rays actually enter that space), and other people have tried doing this kind of thing with some success. This would take some effort, and it depends on whether rays end up touching all of your triangles when you render, or if there are regions that don’t end up being sampled. (It’s common to point out that we don’t have billions of pixels, and so billions of triangles must be overkill in some sense, if we’re rendering a picture… just one problem: which triangles do we not need?) If you’re interested in trying something like that, I think we could start a longer conversation about how and where to start.

Do you already use any noise reduction and/or flat surface decimation process for your scanned data?


Hi David,
thank you for the really exhaustive answer:

  1. The micromesh idea sounds like one of the solutions we would like to try. I would love to learn more about that and pass this knowledge to my team.
  2. Currently we are working on partitioning the meshes with our own approach, which is essentially based on the G-buffer idea you described above. The problem is that we first need to load the entire large mesh and split it on the CPU; only then can we feed the OptiX pipeline(s) with the submeshes.
    The thing is that we have two different pipelines. For the first one, we know all the camera poses in advance, so we could simply eliminate the triangles that lie outside the camera frusta before creating the acceleration structures. With the second one the problem is more complicated, as we cannot do that so easily.
  3. From my experience, BVH compaction works poorly on scanned meshes.
  4. I have taken a glance at the “demand loading” library… the code looks quite complicated; I need to look at it more thoroughly.
  5. We do some preprocessing for noise reduction and mesh simplification.


There’s a public high-level advertisement of the new Displaced Micro-Meshes in the Ada architecture whitepaper on this page: NVIDIA Ada Lovelace Architecture. (Scroll to the bottom and click the “Architecture” whitepaper button.) The API and SDK examples for this are coming soon.

Thanks for the extra info. So yes, it’s trivial if you can cull based on camera position and then load a single mesh, but if you’re using two different pipelines on the same data, then you probably need a different partitioning scheme and may also need to introduce a loop over the partitions in order to render one frame. I don’t mean to make it sound easy; this is a difficult research-level problem. You might investigate some of the recent publications that describe schemes for GPU scene partitioning, such as “GPU Accelerated Path Tracing of Massive Scenes” by Jaroš et al. (https://dl.acm.org/doi/10.1145/3447807), which is good for multi-GPU. For single-GPU “out of core” rendering, perhaps something like “Out-of-core GPU ray tracing of complex scenes” by Garanzha et al. (https://dl.acm.org/doi/10.1145/2037826.2037854), or “Out-of-Core GPU Path Tracing on Large Instanced Scenes via Geometry Streaming” by Jeremy Berchtold.

Those techniques have significant overlap with the idea of using our Demand Loading library for streaming geometry, so that is also an idea that will take time to develop. The main idea with Demand Loading is that after you’ve decided what your partitioning or clustering scheme is, the library can help you with managing and fulfilling the load requests. We’re investigating this and hope to be able to publish examples someday, but in the meantime you could think of the above papers as discussing the partitioning schemes, and the existing Demand Loading library as providing a tool for implementing a lazy loading & restart system (as opposed to trying to render a G-buffer like I mentioned earlier).