What are some techniques to debug an intermittent crash in a closesthit program?

Hi @brian.h.wagener,

cuda-gdb support is currently behind, but it might make a difference to double-check a few things. Are you compiling your closest-hit shader with nvcc? Make sure to use the -G option when compiling, and also specify OptixModuleCompileOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_FULL as well as OptixModuleCompileOptions.optLevel = OPTIX_COMPILE_OPTIMIZATION_LEVEL_0. If any of those were missing, it’s worth trying again with cuda-gdb. You probably won’t get variable inspection, but you might get a line-of-code report for the crash. If you have any way to run this on Windows at all, that could be worth trying, since Windows debugging support is a bit better than cuda-gdb at the moment.
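In case it helps, here is a minimal sketch of those debug-oriented compile options on the host side, assuming an OptiX 7.x style API (exact field and enum availability may differ a bit by SDK version, and some versions also have a debugLevel on OptixPipelineLinkOptions that you’d want to match):

```cpp
// Minimal sketch (OptiX 7.x style): turn optimization off and debug info on
// for the module containing the closest-hit program.
OptixModuleCompileOptions moduleCompileOptions = {};
moduleCompileOptions.optLevel   = OPTIX_COMPILE_OPTIMIZATION_LEVEL_0;
moduleCompileOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_FULL;

// The input PTX also needs device debug info, e.g. something like:
//   nvcc -ptx -G my_closesthit.cu -o my_closesthit.ptx
```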

Aside from trying to get variable inspection to work, cuda-gdb did give you two nearby instruction addresses where the crash potentially occurred, so if you don’t yet know exactly where it’s crashing, with a little work you can use those addresses to locate the crash in your code. You can use cuda-gdb to list your SASS code near that address and correlate it manually against the SASS listing you get from Nsight Compute, which will then take you to the corresponding source. You might even be able to do some spelunking to find which registers the crashing instruction uses and map them back to the variables in your code (Nsight Compute’s “register dependencies” column can help with this). You might also check $errorpc and/or the disassembly in cuda-gdb (CUDA-GDB :: CUDA Toolkit Documentation).

Some other low-tech ways I’ve debugged a mysterious crash are:

  • Put a return statement in your closest-hit shader before where you suspect the crash is (or equivalently, if you prefer, comment some of the code out). Bisect by moving the return statement or comment block until you narrow in on the offending code (see the first sketch after this list).

  • If you have multiple materials in your scene, narrow down which material is causing the crash by removing various objects or materials from the scene until it no longer crashes, e.g., bisect across closest-hit programs.

  • Bisect across pixels by limiting the launch range and see if you can identify which region of the image is crashing, ideally narrowing it down to a single pixel (and then using breakpoints will be viable!). The thread index that cuda-gdb shows above (“<<<(832,13,1),(64,1,1)>>>”) might help guide you to a starting place; it looks like your shader did not crash on the first thread or warp, but perhaps somewhere specific in the view of your scene.

  • printf() can be used to inspect variables, and is most useful in combination with something that limits the amount of printf output. Personally I really like to add code that will only invoke printf for a single pixel that I click on; this is super useful even when the debugger works (see the first sketch after this list). Do note that CUDA’s printf buffer is limited in size.

  • Think through the possible causes of a crash in your hit shader, come up with a hypothesis as to which ones are the most likely, and then design some tests that can validate each theory. If you are accessing local/global memory, then maybe it’s crashing due to an out-of-bounds memory access. To test that, you could put information somewhere about the array bounds, or just make up some conservative bounds, and add code to clamp all array accesses (see the second sketch after this list). If the crash persists, then you’ve ruled out one possible cause. Maybe the crash is an alignment issue somewhere; you could print the pointers and comment out the access to check that kind of thing.
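To make the bisection and single-pixel printf ideas concrete, here is a rough sketch of what they can look like inside a closest-hit program. Everything named params.debugPixelX/Y is hypothetical; wire the debug pixel through your own launch parameters however you like, and pair it with a reduced launch range on the host if you want to bisect across pixels:

```cpp
#include <optix.h>
#include <cstdio>

// Hypothetical launch parameter struct for this sketch; yours will differ.
struct Params
{
    unsigned int debugPixelX;  // pixel selected for verbose printf output
    unsigned int debugPixelY;
};

extern "C" __constant__ Params params;

extern "C" __global__ void __closesthit__radiance()
{
    const uint3 idx = optixGetLaunchIndex();

    // (1) Bisection: move this early return (or comment blocks out) until the
    //     crash disappears, then narrow in on the offending code.
    // return;

    // (2) Single-pixel printf: only the selected pixel prints, which keeps the
    //     output readable and avoids overflowing CUDA's printf buffer.
    if( idx.x == params.debugPixelX && idx.y == params.debugPixelY )
    {
        printf( "closest hit: prim %u at launch index (%u, %u)\n",
                optixGetPrimitiveIndex(), idx.x, idx.y );
    }

    // ... rest of the shading code ...
}
```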
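And here is an equally rough sketch of the out-of-bounds experiment: clamp every index you don’t fully trust and see whether the crash behavior changes. HitGroupData, normals, and numNormals are made-up names standing in for whatever per-hit-group data your SBT records actually carry:

```cpp
#include <optix.h>

// Hypothetical SBT record data for this sketch.
struct HitGroupData
{
    float3*      normals;
    unsigned int numNormals;
};

static __forceinline__ __device__ unsigned int clampIndex( unsigned int i, unsigned int count )
{
    return i < count ? i : count - 1u;  // conservative clamp; assumes count > 0
}

extern "C" __global__ void __closesthit__radiance()
{
    const HitGroupData* data =
        reinterpret_cast<HitGroupData*>( optixGetSbtDataPointer() );

    // Clamp the index before using it. If the crash persists with the clamp in
    // place, an out-of-bounds read of this array probably isn't the cause; if
    // it disappears, you've found a lead.
    const unsigned int primIdx = clampIndex( optixGetPrimitiveIndex(), data->numNormals );
    const float3 n = data->normals[ primIdx ];

    // ... use n in shading ...
}
```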

I hope that helps a little bit; I know it can be frustrating to track down this kind of thing.


David.