Cuda-gdb is slow as molasses

So this is a bit of a rant, but I’ve been waiting for cuda-gdb to execute a ‘next’ command for about 10 minutes now. Don’t worry, it’ll get there eventually, but why on earth is it so slow? Can I do anything about it, and is there anything being done about it?

The same thing happens when hitting a breakpoint in a kernel for the first time. I get a message saying it’s switching focus to kernel blah_blah block(blah,blah) thread(blah,blah) yadda yadda, and I know I can now go do something else, and when I come back in ~10 minutes it will be ready for me. At least until I say ‘next’, at which point I have another 10-minute wait. After that it is fine, no more waiting, at least until I have to restart the program and do it all over again.

This behavior is really killing my edit-compile-debug flow. I’m using CUDA 11.1.1.

Hi there! Thanks for reaching out.

Sorry you are having a poor experience with CUDA-GDB. Are you able to share any more information about the application you are trying to debug? A reproducer would help us determine why the slowdown is occurring.

There can be high initial load time costs upon the launch of the first kernel for certain CUDA applications. Ten minutes, however, is not expected. Does your application contain many (on the order of hundreds of) CUDA kernels? The debugger team has recently been examining some of our load time issues and will be addressing performance bottlenecks in future releases. It might be worthwhile to try a CUDA-GDB from one of our newer CUDA Toolkit releases. There have also been driver improvements that may benefit your use case.

You can read about the latest CUDA-GDB changes here.

Wow, an employee, that’s awesome. The application is something at work, so I can’t share it, unfortunately. We have on the order of tens of kernels, but they are large kernels, with many thousands of lines of template-heavy code and lots of function calls. The slowdowns seem to be related to the number of debugging symbols. We’ve found that if we compile the kernels we aren’t debugging in release mode, it minimizes the amount of time it takes to get through these initial waits that I’ve complained about. (This also benefits the initial JIT compile time as well!)
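For anyone finding this later, the split build we use looks roughly like this. This is just a sketch with placeholder file names, not our actual build, but it shows the idea: only the translation unit you are actively debugging gets full device debug info (`-G`), while everything else builds optimized.

```shell
# Sketch of a mixed debug/release CUDA build (file names are placeholders).

# The kernel under investigation: -G embeds full device debug symbols
# (and disables device optimizations), -g adds host debug info.
nvcc -c -G -g kernel_under_debug.cu -o kernel_under_debug.o

# All other kernels: release optimization, with -lineinfo so backtraces
# still map to source lines without the full debug-symbol cost.
nvcc -c -O3 -lineinfo other_kernels.cu -o other_kernels.o

# Link as usual.
nvcc kernel_under_debug.o other_kernels.o -o app
```

Since cuda-gdb appears to spend its startup time chewing through device debug symbols, shrinking the set of `-G` translation units shrinks that wait proportionally.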

I typically compile in JIT mode using compute_60 as a baseline. We see the issue on Tesla and Volta cards.
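Concretely, "JIT mode with a compute_60 baseline" means embedding only PTX in the binary and letting the driver compile it for the actual GPU at load time. A hypothetical pair of nvcc invocations, one matching that setup and one that avoids the first-launch JIT cost by also shipping SASS:

```shell
# PTX-only (JIT) build: the driver compiles compute_60 PTX for the
# actual GPU at application load time.
nvcc -c -gencode arch=compute_60,code=compute_60 kernels.cu -o kernels.o

# Fat-binary alternative: ship SASS for the cards in use (Pascal sm_60,
# Volta sm_70) plus PTX as a forward-compatibility fallback, trading
# binary size for no JIT step on those GPUs.
nvcc -c -gencode arch=compute_60,code=sm_60 \
        -gencode arch=compute_70,code=sm_70 \
        -gencode arch=compute_70,code=compute_70 \
        kernels.cu -o kernels.o
```

If the load-time cost scales with JIT work, the fat-binary variant may also shorten the initial wait, at the price of longer compiles and a larger binary.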

We have tried CUDA 11.2 and 11.3.1, but they led to performance decreases in the actual runtime, so we have opted to stay at 11.1.1 for the moment. But that’s another issue, which we haven’t investigated any further.

Edit: because of the templates, we probably actually have ~100 kernels, most of them quite large.