Linux Optix 7.4 Debugging problem

I’m running on a Fedora 35 Linux system with driver version 495.29.05, CUDA 11.5 and Optix 7.4

I’m trying to debug my OptiX program, which apparently runs fine when I create meshes with only a few triangles. I generated another mesh which has 122,000 triangles, and when I run my program it crashes. If I run it under cuda-gdb, it stops with this error:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7fffce48ea70

Thread 4 "GPUThread" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 301, block (2619,0,0), thread (64,0,0), device 0, sm 0, warp 4, lane 0]

I have been unable to get any kind of line number information or look at variables, even though when I compile my OptiX kernel to PTX I specify the -g and -lineinfo options, I set compileOptions.optLevel = OPTIX_COMPILE_OPTIMIZATION_LEVEL_0 and compileOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_FULL when calling optixModuleCreateFromPTX, and I set linkOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_FULL when calling optixPipelineCreate.
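
For reference, the option setup I’m describing looks roughly like this (a trimmed-down sketch, not my exact code; the function name and maxTraceDepth value are just illustrative):

```
#include <optix.h>

// Sketch of the debug-related compile/link options described above (OptiX 7.4 API).
static void setDebugCompileOptions( OptixModuleCompileOptions& moduleCompileOptions,
                                    OptixPipelineLinkOptions&  linkOptions )
{
    moduleCompileOptions = {};
    moduleCompileOptions.optLevel   = OPTIX_COMPILE_OPTIMIZATION_LEVEL_0;
    moduleCompileOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_FULL;

    linkOptions = {};
    linkOptions.maxTraceDepth = 2;   // whatever the pipeline actually needs
    linkOptions.debugLevel    = OPTIX_COMPILE_DEBUG_LEVEL_FULL;

    // These are then passed to optixModuleCreateFromPTX() and optixPipelineCreate(),
    // together with PTX compiled with -g and -lineinfo.
}
```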

If I run my program with Compute Sanitizer, specifying no options so it defaults to memory checking, I get several errors complaining about CUDA_ERROR_ILLEGAL_ACCESS in CUDA calls to cuStreamSynchronize and cuEventRecord in the second iteration of my main loop, which completely regenerates all OptiX structures each iteration.

I suspect I’ve done something to seriously clobber device memory, probably with a cudaMemcpy related to copying my vertex or texture objects, but in the absence of any useful debug info to point to the failure, I have no idea what went wrong.

Since this is complaining about a Warp Illegal Address error, I’m pretty sure the error is somewhere inside my OptiX device code. Any suggestions on how I can get debug info?

Hi @drwootton1,

It does sound like you’ve tried all the right things. Some kinds of illegal memory address errors won’t show you actionable line number info even when the symbols are all compiled and plumbed properly, and unfortunately that’s a little more likely on Linux. We are working on improving the debug info – I’ve been saying this on the forum for a while now, but it really is coming soon. In the meantime, maybe I can suggest some alternate strategies, since I know from experience what it’s like trying to debug from vague errors like this.

It does look like your OptiX launch is probably triggering the error, so to find the location when the debugger fails, I usually try to bisect my OptiX application until I can pinpoint the offending memory access. (You can verify the launch is the trigger by synchronizing the device just before launch, checking for errors, and then doing the same after the launch.) Usually I start by disabling parts of my program until it stops crashing, for example excluding some geometry, disabling shader types, turning off features, etc. Your first goal should be to identify whether the invalid access is coming from your own shader code or from OptiX (for example via a bad device pointer build input or something). For example, if you turn off all shading except closest hit, and you use a dummy closest hit that only returns a solid color without doing anything else, then if the program still crashes, the likely cause is a memory access during BVH build or traversal. If the program stops crashing, then it’s more likely the invalid access is happening inside some shader code.
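
To make that verification step concrete, bracketing the launch roughly like this will tell you whether the launch itself is the trigger (a minimal sketch with explicit checks instead of error-check macros; the function name and parameters are just placeholders for whatever your launch wrapper looks like):

```
#include <cstdio>
#include <cuda_runtime.h>
#include <optix.h>

// Sketch: bracket the launch with synchronization and error checks. If the check
// before the launch passes but the one after fails, the launch is the culprit.
static void checkedLaunch( OptixPipeline pipeline, CUstream stream,
                           CUdeviceptr d_params, size_t paramsSize,
                           const OptixShaderBindingTable& sbt,
                           unsigned int width, unsigned int height )
{
    cudaError_t err = cudaDeviceSynchronize();    // flush any earlier async error
    if( err != cudaSuccess )
        std::printf( "error before launch: %s\n", cudaGetErrorString( err ) );

    optixLaunch( pipeline, stream, d_params, paramsSize, &sbt, width, height, 1 );

    err = cudaDeviceSynchronize();                // a failure here points at the launch
    if( err != cudaSuccess )
        std::printf( "error after launch: %s\n", cudaGetErrorString( err ) );
}
```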

If the culprit seems to be shader code, it can also help tremendously to narrow down the launch size & image region that triggers the crash, ideally down to a single pixel. After that I might start by putting an early return in my suspected shader to verify, and then move the return statement further and further down the shader until the crash comes back. It usually doesn’t take long to binary search this way.
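
In practice that can be as simple as the following device-side sketch (the pixel coordinates and program names are placeholders; the idea is just to restrict the launch to one pixel and bisect the suspect shader with an early return):

```
#include <optix.h>

extern "C" __global__ void __raygen__rg()
{
    const uint3 idx = optixGetLaunchIndex();

    // Narrowing: only do work for the single suspect pixel (coordinates are placeholders).
    if( idx.x != 512 || idx.y != 384 )
        return;

    // ... set up the ray and call optixTrace() as usual ...
}

extern "C" __global__ void __closesthit__radiance()
{
    // Bisecting: write a dummy payload and return early, then move this return
    // further down the shader until the crash comes back.
    optixSetPayload_0( __float_as_uint( 1.0f ) );
    return;

    // ... original closest-hit shading code below ...
}
```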

If you’ve clobbered some device memory that OptiX needs, you won’t get any line info there by design. In that case, because it sounds like you’re using streams & events, the first thing I would suggest is investigating whether an async memory write or read is happening out of order by making your app single-stream and synchronous, for example by sprinkling cudaDeviceSynchronize() after every API call. If that fixes your crash, then it would be a matter of examining which buffer dependencies aren’t being properly met. If it still crashes even when all mem copies and launches are synchronized, then it’s probably a data size or indexing problem rather than a bad dependency, and you can inspect & review your buffer sizes and indexing. I would still try to first narrow down exactly which geometry & which buffer it seems to be tripping on.
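
Structurally that can look something like the sketch below, with a synchronize after each stage so an out-of-order access fails right at the call that caused it (the stage functions are hypothetical stand-ins for your own code):

```
#include <cuda_runtime.h>

// Hypothetical per-iteration stages of the app (names are placeholders).
void copyVertexAndTextureData();
void rebuildOptixStructures();
void launchFrame();

// Sketch: one iteration of the rebuild loop made fully synchronous.
void runOneIterationSynchronously()
{
    copyVertexAndTextureData();
    cudaDeviceSynchronize();      // any error from the copies surfaces here

    rebuildOptixStructures();
    cudaDeviceSynchronize();      // any error from the accel builds surfaces here

    launchFrame();
    cudaDeviceSynchronize();      // any error from the launch surfaces here
}
```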

I realize this is not a particularly satisfying answer but I hope that at least helps get you further. If you get completely stuck and you are able and willing to share a minimal reproducer either publicly or privately, we can certainly dive a little deeper and try to debug from our end.


David.

Thanks. I had already tried commenting out code that does array subscripting and adding printfs in the device code.

I finally figured out that I was calling cudaFree to free the buffer containing all my triangle/vertex data just after calling optixAccelBuild. Once I moved that cudaFree call to the end of my outer loop, my program works.
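
In case it helps anyone else, the problem pattern boiled down to something like this (a simplified sketch; the variable names are placeholders for my buffers and error checking is omitted):

```
#include <cuda_runtime.h>
#include <optix.h>

// Sketch of the bug and the fix around the acceleration structure build.
static void buildGas( OptixDeviceContext context, CUstream stream,
                      const OptixAccelBuildOptions& accelOptions,
                      const OptixBuildInput& buildInput,
                      CUdeviceptr d_temp, size_t tempSize,
                      CUdeviceptr d_gasOutput, size_t outputSize,
                      CUdeviceptr d_vertices,
                      OptixTraversableHandle& gasHandle )
{
    optixAccelBuild( context, stream, &accelOptions, &buildInput, 1,
                     d_temp, tempSize, d_gasOutput, outputSize,
                     &gasHandle, nullptr, 0 );

    // Bug: freeing the vertex buffer right here, while the asynchronous build
    // may still be reading it, is what clobbered my data:
    // cudaFree( reinterpret_cast<void*>( d_vertices ) );   // too early

    // Fix: only free once the build is guaranteed to be finished. (I actually
    // deferred the cudaFree to the end of my outer loop, which has the same effect.)
    cudaStreamSynchronize( stream );
    cudaFree( reinterpret_cast<void*>( d_vertices ) );
}
```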

I’m kind of surprised this ever worked at all, but it did, consistently, with small meshes. I found that it would run with in the range of 9,000 triangles total, but if I added one more triangle it stopped working. I guess that affected memory management just enough that my vertex data really got clobbered.

I’m also kind of surprised Compute Sanitizer didn’t catch the invalid memory access, but then I realize no memory checker is 100% correct, and I doubt a GPU has all of the memory protection/invalidation hardware a CPU does.

I’m looking forward to improved tools. I considered moving this all to Windows with the hope of better debug tools, but learning Visual Studio wasn’t appealing.

Ah, really glad you found it! Yes, this is indeed an error you might have uncovered by sprinkling cudaDeviceSynchronize() throughout – by dropping one of those between optixAccelBuild() and cudaFree(). One important thing I failed to mention is that OptiX Validation Mode (https://raytracing-docs.nvidia.com/optix7/guide/index.html#context#validation-mode) will turn all launches into synchronous launches, among other error checks, which might help reveal this kind of memory access issue. That may or may not have helped directly in this case, but for posterity it’s a useful thing to try first when you suspect async problems, or when other debugging methods don’t work, especially because it’s a ~1 line code change.
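
For reference, enabling it looks roughly like this at context creation time (a minimal sketch; the log callback settings are just illustrative):

```
#include <cuda.h>
#include <optix.h>

// Sketch: creating the OptiX device context with validation mode turned on.
static OptixDeviceContext createContextWithValidation( CUcontext cuCtx )
{
    OptixDeviceContextOptions options = {};
    options.logCallbackFunction = nullptr;   // hook up your usual log callback here
    options.logCallbackLevel    = 4;
    options.validationMode      = OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL;

    OptixDeviceContext context = nullptr;
    optixDeviceContextCreate( cuCtx, &options, &context );
    return context;
}
```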

Yeah, async issues with referencing deleted pointers can be really hard to find – even on the CPU! There’s no guarantee the memory manager will mark something off limits and catch it immediately. It makes sense to me that it would take enough time that it fails on larger inputs but not smaller ones.

It’s worth noting that some of the CUDA tools, including Compute Sanitizer and cuda-memcheck, are generally not even expected to work correctly on OptiX programs, because the OptiX ABI is different from the CUDA ABI. This is certainly something we are working towards rectifying, but sorry you lost time trying it.


David.