This is a great question, and a great use case we are interested in addressing. There are some potential ways to handle this today or in the near future. I don’t know if there are any easy solutions, but here are a few options.
One thing we just announced is the Displacement Micro-Mesh. https://developer.nvidia.com/rtx/ray-tracing/micro-mesh This is coming relatively soon to OptiX, and it allows loading meshes with many times more triangles than is currently possible. There are some limitations, and it might require translating your meshes into a format that is compatible with the API, but this is something to consider in order to ray trace huge meshes on your GPUs.
You can also partition meshes on your own. Aside from the above mesh format, OptiX doesn’t have anything specific in the API to help you with partitioning; it’s more a matter of designing your own partitioning scheme, running OptiX separately on each partition, and combining the results. You don’t necessarily even need multiple GPUs, but doing this on a single GPU might be a lot slower than with multiple GPUs. Depending on what rendering algorithm you’re using, you might be able to load one partition at a time, render it to a G-buffer that includes your ray t value, and check the t values before writing any given pixel. Once you’ve cycled through all partitions, your final G-buffer will have the same answer as though you had rendered the whole mesh in one go. (I’m thinking specifically of Chris Hellmuth’s example described here: https://www.render-blog.com/) This, as you say, might still be faster than raycasting on the CPU.
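For that partition-cycling idea, the merge step can be as simple as a per-pixel depth test. Here is a minimal CUDA sketch, assuming your OptiX launch for each partition writes a hit distance and a shaded color per pixel into plain device buffers; the struct and kernel names here are mine, not anything from the OptiX API.

```cpp
#include <cuda_runtime.h>

// Persistent G-buffer texel: closest hit distance found so far plus its shading.
// (Hypothetical layout; use whatever payload your renderer actually needs.)
struct GBufferTexel
{
    float  t;      // closest hit distance so far (initialize to FLT_MAX, or your ray tmax)
    float3 color;  // shaded result for that hit
};

// Merge one partition's per-pixel results into the persistent G-buffer,
// keeping whichever hit is nearer.
__global__ void mergePartition( GBufferTexel* gbuffer,
                                const float*  partT,     // hit t per pixel for this partition
                                const float3* partColor, // shaded color per pixel for this partition
                                int           numPixels )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i >= numPixels )
        return;

    // Only overwrite the pixel if this partition produced a nearer hit.
    if( partT[i] < gbuffer[i].t )
    {
        gbuffer[i].t     = partT[i];
        gbuffer[i].color = partColor[i];
    }
}
```

You would initialize the G-buffer t values to FLT_MAX (or your ray tmax), then for each partition: upload it, build its acceleration structure, launch OptiX, run a merge like this, free the partition’s memory, and move on to the next one.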
Right now, the maximum BVH size is limited to half of the VRAM at best, since the meshes need to be loaded into VRAM first using cudaMalloc -> cudaMemcpy, then the GAS must be built, and only then can I release the VRAM used for the meshes with cudaFree (as the vertex data are then stored by the BVH anyway).
While you’re correct about needing a resident copy while the BVH is being built, you do have the option to subdivide your mesh into smaller pieces, build multiple acceleration structures, and then add each piece to an instance acceleration structure. If you did that, you would be able to use most of your memory (if you serialize some or all of your BVH builds), and it would allow you to do BVH compaction as well. You might get some real mileage this way.
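Here is a minimal sketch of that pattern, assuming OptiX 7.x: build each piece as its own GAS with compaction enabled (serially, so only one piece’s temp buffer and uncompacted output are alive at a time), then reference the resulting handles from a single instance acceleration structure. Error checking is omitted and the helper name is mine, not part of the SDK.

```cpp
#include <optix.h>
#include <optix_stubs.h>
#include <cuda_runtime.h>

// Build one GAS over a triangle soup already resident on the device, compact it,
// and free all scratch memory before returning. (Hypothetical helper, no error checks.)
static OptixTraversableHandle buildCompactedGAS( OptixDeviceContext ctx, CUstream stream,
                                                 CUdeviceptr d_vertices, unsigned int numVertices,
                                                 CUdeviceptr& d_gasOut )
{
    OptixBuildInput input              = {};
    input.type                         = OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
    input.triangleArray.vertexFormat   = OPTIX_VERTEX_FORMAT_FLOAT3;
    input.triangleArray.numVertices    = numVertices;
    input.triangleArray.vertexBuffers  = &d_vertices;
    const unsigned int geomFlags[1]    = { OPTIX_GEOMETRY_FLAG_NONE };
    input.triangleArray.flags          = geomFlags;
    input.triangleArray.numSbtRecords  = 1;

    OptixAccelBuildOptions options = {};
    options.buildFlags = OPTIX_BUILD_FLAG_ALLOW_COMPACTION;
    options.operation  = OPTIX_BUILD_OPERATION_BUILD;

    OptixAccelBufferSizes sizes = {};
    optixAccelComputeMemoryUsage( ctx, &options, &input, 1, &sizes );

    CUdeviceptr d_temp = 0, d_output = 0, d_compactedSize = 0;
    cudaMalloc( reinterpret_cast<void**>( &d_temp ),          sizes.tempSizeInBytes );
    cudaMalloc( reinterpret_cast<void**>( &d_output ),        sizes.outputSizeInBytes );
    cudaMalloc( reinterpret_cast<void**>( &d_compactedSize ), sizeof( size_t ) );

    // Ask the build to report how small the compacted GAS would be.
    OptixAccelEmitDesc emit = {};
    emit.type   = OPTIX_PROPERTY_TYPE_COMPACTED_SIZE;
    emit.result = d_compactedSize;

    OptixTraversableHandle handle = 0;
    optixAccelBuild( ctx, stream, &options, &input, 1,
                     d_temp, sizes.tempSizeInBytes,
                     d_output, sizes.outputSizeInBytes,
                     &handle, &emit, 1 );
    cudaStreamSynchronize( stream );

    size_t compactedSize = 0;
    cudaMemcpy( &compactedSize, reinterpret_cast<void*>( d_compactedSize ),
                sizeof( size_t ), cudaMemcpyDeviceToHost );

    // Copy into a right-sized buffer and drop the oversized build output.
    d_gasOut = d_output;
    if( compactedSize < sizes.outputSizeInBytes )
    {
        cudaMalloc( reinterpret_cast<void**>( &d_gasOut ), compactedSize );
        optixAccelCompact( ctx, stream, handle, d_gasOut, compactedSize, &handle );
        cudaStreamSynchronize( stream );
        cudaFree( reinterpret_cast<void*>( d_output ) );
    }
    cudaFree( reinterpret_cast<void*>( d_temp ) );
    cudaFree( reinterpret_cast<void*>( d_compactedSize ) );
    return handle;
}
```

Each handle this returns would then go into one OptixInstance (identity transform, traversableHandle set to the GAS handle), and that array of instances becomes a single OPTIX_BUILD_INPUT_TYPE_INSTANCES build, so your launch still traces one top-level traversable. The point is that only one piece’s vertex buffer, temp buffer, and uncompacted output are alive at the same time, so peak memory stays close to the compacted total rather than roughly double the mesh size.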
In OptiX we do have a “demand loading” library that loads texture tiles on demand. We would like to adapt it for use with geometry (e.g. load subregions of your scene only when rays actually enter that space), and other people have tried doing this kind of thing with some success. This would take some effort, and it depends on whether rays end up touching all of your triangles when you render, or if there are regions that don’t end up being sampled. (It’s common to point out that we don’t have billions of pixels, and so billions of triangles must be overkill in some sense, if we’re rendering a picture… just one problem: which triangles do we not need?) If you’re interested in trying something like that, I think we could start a longer conversation about how and where to start.
Do you already use any noise reduction and/or flat-surface decimation processing on your scanned data?