What are some techniques to debug an intermittent crash in a closesthit program?

I originally posted this on the CUDA board, but was directed to post here in OptiX because the problem is specific to OptiX: the closesthit program is called for every pixel, which makes it even harder to set a breakpoint, etc.

My issue has to do with math at a certain point, so I would really like to get into the device code and look at variables at the time of the crash, if that is possible. I have tried cuda-gdb but can’t get access to any variables. I also can’t store info into a buffer, since it would never make it out before the crash. What are some other techniques to try?

Below is all the information I can get currently with cuda-gdb

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x555557bc2e60

Thread 1 "SensorApp" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 33, grid 296, block (395,6,0), thread (62,0,0), device 0, sm 10, warp 13, lane 30]
0x0000555557bc2f00 in __closesthit__radiance_ptID_0xe1371196ca236d47<<<(832,13,1),(64,1,1)>>> ()

Hi @brian.h.wagener,

cuda-gdb support is currently behind, but it might make a difference to double-check a few things. Are you compiling your closest-hit shader using nvcc? Make sure to use the -G option when compiling, and also specify OptixModuleCompileOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_FULL as well as OptixModuleCompileOptions.optLevel = OPTIX_COMPILE_OPTIMIZATION_LEVEL_0. If any of those were missing, it’s worth trying again with cuda-gdb. You probably won’t get variable inspection, but you might get a line of code report with the crash. If you have any way to run this in Windows at all, that could be worth trying since Windows debugging support is a bit better than cuda-gdb at the moment.

Aside from trying to get the variable inspection to work, it did give you two nearby instruction addresses where the crash potentially occurred, so if you don’t yet know exactly where it’s crashing, with a little work you can use these addresses to locate the crash location in code. You can use cuda-gdb to list your SASS code near that address, and correlate it manually against the SASS listing you get in Nsight Compute, which will then take you to the corresponding code. You might even be able to do some spelunking to find which registers the crashing instruction is using and find a correspondence to the variables in code (Nsight Compute’s “register dependencies” column can help out with this). You might also check $errorpc and/or the disas in cuda-gdb (CUDA-GDB :: CUDA Toolkit Documentation).

Some other low-tech ways I’ve debugged a mysterious crash are:

  • Put a return statement in your closest-hit shader before where you suspect the crash is (or equivalently if you prefer, comment some of the code out). Bisect it by moving the return statement or comment block until you narrow in on the offending code.

  • If you have multiple materials in your scene, narrow down which material is causing the crash by removing various objects or materials from the scene until it no longer crashes, e.g., bisect across closest-hit programs.

  • Bisect across pixels by limiting the launch range: see if you can identify which region of the image is crashing, ideally even a single pixel (and then using breakpoints will be viable!). The focus that cuda-gdb shows above (block (395,6,0), thread (62,0,0), within the launch “<<<(832,13,1),(64,1,1)>>>”) might help guide you to a starting place. It looks like your shader did not crash on the first thread or warp, but perhaps somewhere specific in the view of your scene.

  • printf() can be used to inspect variables, and is useful in combination with something that limits the amount of printf output. Personally I really like to add code that will only invoke printf for a single pixel that I click on; this is super useful even when the debugger works. Do note that CUDA’s printf buffer is limited in size.

  • Think through the possible causes of a crash in your hit shader, come up with a hypothesis as to which ones are the most likely, and then design some tests that can validate each theory. If you are accessing local/global memory, then maybe it’s crashing due to an out-of-bounds memory access. To test that, you could put information somewhere about the array bounds, or just make up some conservative bounds, and add code to clamp all array accesses. If the crash persists, then you’ve ruled out one possible cause. Maybe the crash is an alignment issue somewhere; you could print the pointers and comment out the access to check that kind of thing.
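
As a rough illustration of the last three bullets, here is a minimal sketch of a pixel window, a single-pixel printf gate, and a clamped array index. Everything named here (DebugParams, isDebugPixel, inDebugWindow, clampIndex, debugPrintIndex, DBG_HD) is hypothetical and not part of OptiX or CUDA. The window check is a variant of limiting the launch range: the launch dimensions stay the same and the shader returns early for pixels outside the window.

```cpp
#include <cassert>
#include <cstdio>

// Expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define DBG_HD __host__ __device__
#else
#define DBG_HD
#endif

// Hypothetical debug settings (not an OptiX type). In an OptiX app these
// would typically be extra fields in the launch parameters.
struct DebugParams {
    int pixelX, pixelY;  // the one pixel to print for, e.g. the clicked pixel
    int x0, y0, x1, y1;  // half-open window [x0,x1) x [y0,y1) that keeps shading
};

// True only for the single pixel selected for printf output.
DBG_HD inline bool isDebugPixel(const DebugParams& dbg, int x, int y) {
    return x == dbg.pixelX && y == dbg.pixelY;
}

// True if the pixel lies inside the window currently being bisected.
DBG_HD inline bool inDebugWindow(const DebugParams& dbg, int x, int y) {
    return x >= dbg.x0 && x < dbg.x1 && y >= dbg.y0 && y < dbg.y1;
}

// Clamp an index into [0, count-1]. If the crash disappears with every
// array access clamped, an out-of-bounds access becomes the main suspect.
DBG_HD inline int clampIndex(int i, int count) {
    assert(count > 0);
    if (i < 0) return 0;
    if (i >= count) return count - 1;
    return i;
}

// Print an index and its bound, but only for the selected pixel, so the
// limited printf buffer is not flooded by every thread.
DBG_HD inline void debugPrintIndex(const DebugParams& dbg, int x, int y,
                                   int index, int count) {
    if (isDebugPixel(dbg, x, y)) {
        printf("pixel (%d,%d): index=%d count=%d\n", x, y, index, count);
    }
}
```

In a closest-hit program, x and y would come from optixGetLaunchIndex(), and the shader could return early when inDebugWindow() is false. Halving the window between runs then narrows down the crashing region.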

I hope that helps a little bit, I know it can be frustrating to track down this kind of thing.


@dhart I was able to isolate it to a single line, but I can’t figure out what is wrong. Basically I have an if check:

    if (sbtData.hasNormalMap) …

In this case I have hasNormalMap set to 0 for this model, and if I add some printf’s I do see that it is 0. The problem is that execution somehow gets inside this if statement and crashes, because we don’t have a normal map. I have verified my host code, and hasNormalMap can’t change after startup, so the only thing I can think of is some type of memory corruption. Note that if I just add “&& false” to the if condition, so it can never be true, everything works, so it is definitely this. Any ideas?

Note I am getting sbtData using

    const TriangleMeshSBTData &sbtData = *(const TriangleMeshSBTData *)optixGetSbtDataPointer();


struct TriangleMeshSBTData {
    vec3f color;
    vec3f *vertex;
    vec3f *normal;
    vec2f *texcoord;
    vec3i *index;
    int hasTexture;
    int hasNormalMap;
    int isTerrain;
    int applyFog;
    cudaTextureObject_t texture;
    cudaTextureObject_t normal_map;
};

That certainly looks like you’ve narrowed it down. Was the printf() that you tried put in the device code, in closest-hit (and not on the host side)? Is it possible that the SBT index is wrong for this invocation? (Seems unlikely based on your description, but just in case, it’d be good to be certain that the shader is being called for the correct geometry.)
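
One way to check the SBT index question is to compute on the host which hit group record a given hit should select, and compare that against the pointer printed from optixGetSbtDataPointer() in closest-hit. The sketch below follows the SBT indexing rule described in the OptiX Programming Guide (instance offset, plus GAS SBT index times trace stride, plus trace offset). The struct and function names are hypothetical, and the constant is assumed to mirror OPTIX_SBT_RECORD_HEADER_SIZE (32 bytes).

```cpp
#include <cstdint>

// Assumed to match OPTIX_SBT_RECORD_HEADER_SIZE from the OptiX headers.
constexpr std::uint64_t kSbtRecordHeaderSize = 32;

// Hypothetical bundle of the values that feed the hit record lookup.
struct HitSbtInputs {
    std::uint32_t instanceSbtOffset;  // OptixInstance::sbtOffset (0 without an IAS)
    std::uint32_t gasSbtIndex;        // SBT index of the hit geometry within its GAS
    std::uint32_t traceSbtOffset;     // SBT offset argument passed to optixTrace
    std::uint32_t traceSbtStride;     // SBT stride argument passed to optixTrace
};

// Index of the hit group record that should be selected for this hit.
inline std::uint32_t expectedHitRecordIndex(const HitSbtInputs& in) {
    return in.instanceSbtOffset
         + in.gasSbtIndex * in.traceSbtStride
         + in.traceSbtOffset;
}

// Device address that optixGetSbtDataPointer() should return for that record:
// the data block starts right after the record header.
inline std::uint64_t expectedSbtDataPointer(std::uint64_t hitgroupRecordBase,
                                            std::uint32_t hitgroupRecordStrideInBytes,
                                            std::uint32_t recordIndex) {
    return hitgroupRecordBase
         + static_cast<std::uint64_t>(recordIndex) * hitgroupRecordStrideInBytes
         + kSbtRecordHeaderSize;
}
```

If the pointer printed in the shader does not match the expected address for the mesh that was hit, the shader is reading another record’s data, which could explain a flag that looks correct on the host but not at the point of the crash.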

Maybe the next most likely possibility is that the stack size is too small, which can indeed result in memory corruption. Have you set your stack size? Double check your trace depth & scene depth, and that you’ve accounted for all your shader programs. The optixPathTracer SDK sample shows how to do this. You could also try just doubling or tripling your final stack size that you report to OptiX, and if it fixes the bug it indicates that something was either missing or buggy & incorrect in the stack size calculation.
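
The doubling or tripling experiment can be sketched as below. It assumes the three sizes were already computed elsewhere, for example with optixUtilAccumulateStackSizes and optixUtilComputeStackSizes from optix_stack_size.h as the SDK samples do. PipelineStackSizes and scaleStackSizes are hypothetical names, not OptiX API.

```cpp
#include <cstdint>

// Hypothetical holder for the three sizes that are passed to
// optixPipelineSetStackSize (all in bytes).
struct PipelineStackSizes {
    std::uint32_t directCallableFromTraversal;
    std::uint32_t directCallableFromState;
    std::uint32_t continuation;
};

// Multiply every size by the same factor for a quick experiment.
// No overflow handling: this is a debugging aid, not production code.
inline PipelineStackSizes scaleStackSizes(PipelineStackSizes s,
                                          std::uint32_t factor) {
    s.directCallableFromTraversal *= factor;
    s.directCallableFromState *= factor;
    s.continuation *= factor;
    return s;
}

// Intended use, left as a comment because it needs the OptiX headers:
//   PipelineStackSizes big = scaleStackSizes(computed, 2);
//   optixPipelineSetStackSize(pipeline,
//                             big.directCallableFromTraversal,
//                             big.directCallableFromState,
//                             big.continuation,
//                             maxTraversableGraphDepth);
```

If the crash goes away with the scaled sizes, the original calculation is the thing to audit: trace depth, traversable graph depth, and whether every program group was accumulated.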

Besides corruption and stack size, another possibility is a compiler error having to do with structs or register allocation. You’re doing something extremely common, so I hesitate to jump to this first, but maybe there is some other factor here that’s hard to see. Are you using parallel compilation or payload annotations or any other advanced features of OptiX? Is your scene structure (in terms of IAS, GAS, instances, etc.) fairly simple, or somewhat tricky?

Which OptiX version and driver version are you using? (If for any reason you haven’t tried the very latest driver, that’s always a good thing to try.) If stack size isn’t the issue, is it possible to share a reproducer with us? (You can do so confidentially if you prefer, via DM or the optix help list).