Questions regarding optixAccelBuild and "an illegal memory access was encountered"

Hi!

I’m a bit at a loss with a strange issue, so I figured I might get some help here. I’m testing OptiX 7 as part of a larger Linux program written in Python and using PyTorch. I wrote a standalone C++/CUDA library that targets OptiX 7.3, and the Python code calls that library through a pybind11 wrapper. So far, so good.

However, after a bit of testing within the Python application, I started getting “cudaErrorIllegalAddress: an illegal memory access was encountered” errors during the call to optixAccelBuild, but only after it has been called a few times. I wrote a minimal reproduction that simply calls the constructor creating the OptiX pipeline in a loop; as soon as I do anything with PyTorch before these calls, the error appears at the 5th call to the constructor. I ran the code with cuda-memcheck and cuda-gdb, and with CUDA_LAUNCH_BLOCKING=1 and PYTORCH_NO_CUDA_MEMORY_CACHING=1 set, but none of them showed where the problem is happening.

I raised the issue on the PyTorch forums, but I figured I might also get some OptiX-specific feedback here.

Some other pieces of information:

  • I could not reproduce the error directly when using only the C++ library.
  • I tested the code on different machines with various GPUs (GeForce GTX 1660, Tesla T4), different driver versions (470.86, 495.29.05), and different PyTorch versions. The error was consistent across all of them.
  • The memory pointers passed to optixAccelBuild seem to be properly aligned (the documentation says nothing about this constraint, but I found the information on this forum; see the sketch after this list). The code initially used cudaMalloc directly; I also tested with PyTorch’s caching allocator, which calls cudaMalloc and hands out blocks rounded up to 512 bytes.
  • The OptiX 7 code is very similar to Ingo Wald’s example04_firstTriangleMesh, so nothing too complex. The input is just a triangle mesh. I added the missing clean-up steps to make sure there is no memory leak or anything of the sort.
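
To illustrate the alignment point above, here is a minimal sketch (not my actual code; d_temp and d_output stand in for the buffers I pass to optixAccelBuild) asserting the 128-byte alignment, which matches the OPTIX_ACCEL_BUFFER_BYTE_ALIGNMENT constant from optix_types.h:

    // Minimal alignment check for the temp and output buffers passed to
    // optixAccelBuild(); d_temp / d_output are placeholders for my real buffers.
    #include <cassert>
    #include <cstddef>
    #include <cuda.h>    // CUdeviceptr
    #include <optix.h>   // OPTIX_ACCEL_BUFFER_BYTE_ALIGNMENT (== 128)

    static bool isAligned(CUdeviceptr ptr, size_t alignment)
    {
        return ptr % alignment == 0;
    }

    // ... right after allocating the buffers:
    assert(isAligned(d_temp,   OPTIX_ACCEL_BUFFER_BYTE_ALIGNMENT));
    assert(isAligned(d_output, OPTIX_ACCEL_BUFFER_BYTE_ALIGNMENT));
    // cudaMalloc returns pointers aligned to at least 256 bytes, so buffers
    // allocated directly with cudaMalloc satisfy this anyway.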

So I don’t really know what to try next. Are there any hidden requirements for optixAccelBuild that could explain this behavior? Besides pointer alignment, what could trigger illegal memory access errors in this function?

Hi @bch, welcome!

Maybe something to triple check is the timing of memory allocations vs memory copies vs BVH build. If you have anything async or use multiple streams at all, it’s easy to accidentally launch a BVH build out of order, or on a block of geometry that isn’t done getting copied. Include freeing memory in that as well - if the block is accidentally freed before the BVH build is done, that would also result in an illegal memory access.
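
For illustration, a rough sketch of the ordering I mean, assuming everything runs on a single stream and the sizes/pointers are set up the way you already do (the names here are placeholders):

    // Keep upload, build, and free ordered on the same stream.
    cudaMemcpyAsync(reinterpret_cast<void*>(d_vertices), h_vertices,
                    vertexBytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(reinterpret_cast<void*>(d_indices), h_indices,
                    indexBytes, cudaMemcpyHostToDevice, stream);

    // Building on the same stream means the build cannot start before the
    // copies above have finished.
    optixAccelBuild(context, stream, &accelOptions, &buildInput, 1,
                    d_temp, tempBytes, d_output, outputBytes,
                    &handle, nullptr, 0);

    // Do not free the inputs or the temp buffer until the build has completed.
    cudaStreamSynchronize(stream);
    cudaFree(reinterpret_cast<void*>(d_temp));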

Are the BVH builds identical in both cases, both when using PyTorch and when not using PyTorch? Is it possible there could be a bug with indices in your index buffer that point outside of the vertex buffer?
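
A quick host-side sanity check along these lines could rule that out (a sketch; it assumes 32-bit indices, as in the triangle build input):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Returns false if any index refers past the end of the vertex buffer.
    bool indicesInRange(const std::vector<uint32_t>& indices, size_t numVertices)
    {
        for (size_t i = 0; i < indices.size(); ++i)
        {
            if (indices[i] >= numVertices)
            {
                std::fprintf(stderr, "index %zu out of range: %u >= %zu\n",
                             i, static_cast<unsigned>(indices[i]), numVertices);
                return false;
            }
        }
        return true;
    }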

Is this BVH build over triangles, or a custom primitive of some sort? Asking since providing custom primitive bounds has more potential failure points than built-in meshes.


David.

Hi David,

Thanks for the quick answer!

Maybe something to triple check is the timing of memory allocations vs memory copies vs BVH build. If you have anything async or use multiple streams at all, it’s easy to accidentally launch a BVH build out of order, or on a block of geometry that isn’t done getting copied. Include freeing memory in that as well - if the block is accidentally freed before the BVH build is done, that would also result in an illegal memory access.

I use cudaDeviceSynchronize() before the call for this exact reason, and all of the operations use the same CUDA stream. I also tested with CUDA_LAUNCH_BLOCKING=1 to be extra safe.

The pointers passed to optixAccelBuild() are allocated right before the call, with their sizes computed by optixAccelComputeMemoryUsage() (I tried with and without compaction, and with and without OPTIX_BUILD_FLAG_PREFER_FAST_TRACE…).
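
Concretely, that part of the code looks roughly like this (a simplified sketch, without compaction and with error checking stripped; buildInput describes the triangle mesh):

    OptixAccelBuildOptions accelOptions = {};
    accelOptions.buildFlags = OPTIX_BUILD_FLAG_NONE;   // also tried PREFER_FAST_TRACE
    accelOptions.operation  = OPTIX_BUILD_OPERATION_BUILD;

    OptixAccelBufferSizes bufferSizes = {};
    optixAccelComputeMemoryUsage(context, &accelOptions, &buildInput,
                                 1 /* numBuildInputs */, &bufferSizes);

    // Buffers allocated right before the build, with the sizes reported above.
    CUdeviceptr d_temp = 0, d_output = 0;
    cudaMalloc(reinterpret_cast<void**>(&d_temp),   bufferSizes.tempSizeInBytes);
    cudaMalloc(reinterpret_cast<void**>(&d_output), bufferSizes.outputSizeInBytes);

    cudaDeviceSynchronize();   // make sure earlier uploads are done

    OptixTraversableHandle handle = 0;
    optixAccelBuild(context, stream, &accelOptions, &buildInput, 1,
                    d_temp,   bufferSizes.tempSizeInBytes,
                    d_output, bufferSizes.outputSizeInBytes,
                    &handle, nullptr, 0);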

Are the BVH builds identical in both cases, both when using PyTorch and when not using PyTorch? Is it possible there could be a bug with indices in your index buffer that point outside of the vertex buffer?

Absolutely: the input is strictly the same. As for the index buffer, that’s a good idea; I’ll check it. I’m using an external library to load a PLY file and I don’t touch the indices before forwarding them to OptiX, and the mesh passed some validity tests (e.g. in MeshLab).

Is this BVH build over triangles, or a custom primitive of some sort? Asking since providing custom primitive bounds has more potential failure points than built-in meshes.

Good ol’ triangles, and I’m running the tests on a really simple mesh without that many vertices (~30k).

It sounds like you’re checking all the right things; maybe this problem is happening outside of your OptiX code. It’d be worth putting a call to cudaDeviceSynchronize() after your call to optixAccelBuild() to see if that changes anything.
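
Something along these lines, just so the failure gets attributed to the right call (a sketch; it re-uses the variable names from your build sequence):

    OptixResult res = optixAccelBuild(context, stream, &accelOptions, &buildInput, 1,
                                      d_temp,   bufferSizes.tempSizeInBytes,
                                      d_output, bufferSizes.outputSizeInBytes,
                                      &handle, nullptr, 0);
    if (res != OPTIX_SUCCESS)
        std::fprintf(stderr, "optixAccelBuild failed: %d\n", static_cast<int>(res));

    cudaError_t err = cudaDeviceSynchronize();   // force the build to finish here
    if (err == cudaSuccess)
        err = cudaGetLastError();                // pick up any deferred async error
    if (err != cudaSuccess)
        std::fprintf(stderr, "CUDA error after build: %s\n", cudaGetErrorString(err));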

If the same inputs are used in both cases, and you can see the models outside of PyTorch, then I don’t suspect that an index buffer bug is the problem.

Maybe I didn’t understand enough about your setup. Everything works correctly all the way through pipeline creation, BVH build, and render launch the first time? It’s only after using PyTorch before OptiX that crashing starts to occur, and only after several successful runs all the way through?

Does the PyTorch usage involve CUDA launches? What calls do you make to tear down after rendering is done? Are there any differences in either CUDA or OptiX API usage during the post-render cleanup? Are you creating a new CUDA context and/or a new OptiX context every time? Is PyTorch handling some of the startup and teardown? I wonder if it could be closing the CUDA context in a way that leaves it unsuitable for re-use, or if PyTorch could be hanging on to more resources each time, perhaps somewhat similar to this post.
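
To make the question concrete, the per-run pattern I would expect is roughly this (a sketch, not necessarily how your wrapper is structured):

    // Attach to whatever CUDA context is already current on the device
    // (e.g. the primary context PyTorch created) instead of creating a new one.
    cudaFree(0);                              // ensure a CUDA context exists
    optixInit();                              // load the OptiX entry points

    OptixDeviceContext optixContext = nullptr;
    OptixDeviceContextOptions options = {};
    CUcontext cuCtx = 0;                      // 0 == use the current CUDA context
    optixDeviceContextCreate(cuCtx, &options, &optixContext);

    // ... BVH build, pipeline creation, launches ...

    // Teardown: destroy the OptiX objects and free your device buffers, but
    // leave the CUDA context alone, since PyTorch still owns and uses it.
    optixDeviceContextDestroy(optixContext);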

BTW it might help to understand your environment. Are you testing inside a Jupyter notebook, or inside a Docker container, or any other container or wrapper? Or is this command line Python? Does everything work correctly if you restart your render process every time?


David.