Debugging OptiX code with CUDA 11.6 cuda-gdb locks desktop GUI

I saw the notice about cuda-gdb supporting OptiX code debugging, so I gave it a try on Fedora 35 with driver 510.39.01, CUDA 11.6, and an RTX 3060 GPU.

The debugger seems to work well. I put an assert in a miss program, and when the assert triggered, the debugger showed code reasonably close to the assert.
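
For context, the test was roughly along these lines; this is just a minimal sketch with made-up names (__miss__ms, Params, the payload write), not my actual code:

```cuda
// Minimal sketch of an assert in an OptiX miss program; program and
// parameter names here are made up for illustration.
#include <optix.h>
#include <cassert>

struct Params { unsigned int width, height; };
extern "C" __constant__ Params params;

extern "C" __global__ void __miss__ms()
{
    const uint3 idx = optixGetLaunchIndex();
    // When this fires, cuda-gdb stops and shows source near this line
    // (assuming the module was built with full debug info).
    assert(idx.x < params.width && idx.y < params.height);
    optixSetPayload_0(0u); // e.g. write a black background into the payload
}
```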

One thing I encountered is that when I run my program, my desktop GUI freezes for 5-10 seconds at a time, so I can’t do anything; it unfreezes and runs a bit, then freezes again, repeating until the assert triggers, after which the desktop behaves normally. I also noticed that nvidia-smi -l shows the GPU at 100% load until the assert triggers.

I am using the same GPU to run my program and my Linux KDE desktop GUI, since I have only one GPU in the system.

I also tried issuing the set cuda software_preemption on command, even though the cuda-gdb reference says it isn’t required for the RTX 3060; it had no effect.
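
For reference, this is what I tried; the cuda-gdb manual also documents an environment variable that is supposed to be equivalent (the program name below is a placeholder):

```
# inside the debugger:
(cuda-gdb) set cuda software_preemption on

# or, per the cuda-gdb manual, via the environment before launching:
$ CUDA_DEBUGGER_SOFTWARE_PREEMPTION=1 cuda-gdb ./myprogram
```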

Is using the same GPU for the desktop and the OptiX program a limitation? Do I need to do something like ssh into my Fedora system and run cuda-gdb through the ssh session?

Hi @drwootton1,

I haven’t had a chance to try to reproduce this yet, but wanted to mention a couple of things anyway.

Using a remote debug setup will indeed make the lockup experience less painful, at some cost in convenience. If you have two machines or a spare GPU, another option is to run your display on a different GPU than your debug session.

I remember some short stalls when using cuda-gdb, but perhaps not as long as 5-10 seconds, and nobody on my team recalls seeing stalls that large; I will try to confirm or deny next week. My understanding is that when breaking and stepping, cuda-gdb copies GPU memory to the host so that it can display memory, registers, and instructions for all threads, so it was never particularly fast. The stall time may depend on what your program is doing and how much active memory there is.

One thing you can try is debugging an OptiX SDK sample to see whether the stall is similar. Another is running cuda-gdb on one of the CUDA SDK samples. These tests would at least tell you whether the stall is specific to your program and/or to OptiX programs in general.
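
For example, something along these lines with the optixTriangle sample (the breakpoint name matches the miss program in the sample's device code; this assumes the sample was built with debug flags, and the path is a placeholder):

```
$ cuda-gdb ./bin/optixTriangle
(cuda-gdb) break __miss__ms
(cuda-gdb) run
```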


David.

I tried the unmodified optixTriangle OptiX 7.4 sample and saw the same long pauses.
I then set up a VNC session from a laptop; the local display on the machine with the RTX 3060 still had long hangs where it locked up.

optixTriangle displayed its image in the VNC session, and the VNC session did not lock up. The cuda-gdb session ran quite slowly, and it took a while to reach the breakpoint I set in the miss program. So this does work a little better.

I get the same hang behavior running a CUDA program that does not use OptiX at all. I did not see this behavior when using cuda-gdb to debug OptiX programs before updating to CUDA 11.6.
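
The plain CUDA test was nothing special; roughly the following (names are made up), built with nvcc -G and run under cuda-gdb:

```cuda
// Trivial standalone CUDA test (no OptiX) that shows the same desktop
// stalls when run under cuda-gdb; names are made up for illustration.
#include <cassert>
#include <cstdio>

__global__ void busyKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Enough per-thread work to keep the GPU visibly busy under the debugger.
        int acc = 0;
        for (int k = 0; k < 1 << 12; ++k)
            acc += (i ^ k);
        assert(acc != 1);   // breakpoint/assert target for cuda-gdb
        out[i] = acc;
    }
}

int main()
{
    const int n = 1 << 20;
    int *out = nullptr;
    cudaMalloc(&out, n * sizeof(int));
    busyKernel<<<(n + 255) / 256, 256>>>(out, n);
    cudaDeviceSynchronize();
    cudaFree(out);
    printf("done\n");
    return 0;
}
```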

So this might be due to the new cuda-gdb in CUDA 11.6, the new 510.39.01 driver, or possibly the updated Linux kernel (5.15.16-200.fc35.x86_64).

I also don’t seem to see this consistently with plain CUDA code: I thought I saw it yesterday, then it went away, and it came back today.