Cuda-gdb is slow as molasses

So this is a bit of a rant, but I’ve been waiting for cuda-gdb to execute a ‘next’ command for about 10 minutes now. Don’t worry, it’ll get there eventually, but why on earth is it so slow? Can I do anything about it, and is there anything being done about it?

The same thing happens when hitting a breakpoint in a kernel for the first time. I get a message saying it’s switching focus to kernel blah_blah block(blah,blah) thread(blah,blah) yadda yadda, and I know I can now go do something else; when I come back in ~10 minutes it will be ready for me. At least until I say ‘next’, at which point I have another 10-minute wait. After that it’s fine, no more waiting, at least until I have to restart the program and do it all over again.

This behavior is really killing my edit-compile-debug flow. I’m using CUDA 11.1.1.

Hi there! Thanks for reaching out.

Sorry you are having a poor experience with CUDA-GDB. Are you able to share any more information about the application you are trying to debug? A reproducer would help us determine why the slowdown is occurring.

There can be high initial load-time costs upon the launch of the first kernel for certain CUDA applications. Ten minutes, however, is not expected. Does your application contain many (on the order of hundreds of) CUDA kernels? The debugger team has recently been examining some of our load-time issues and will be addressing performance bottlenecks in future releases. It might be worthwhile to try CUDA-GDB from one of our newer CUDA Toolkit releases. There have also been driver improvements that may benefit your use case.

You can read about the latest CUDA-GDB changes here.

Wow, an employee, that’s awesome. The application is something at work, so I can’t share it, unfortunately. We have on the order of tens of kernels, but they are large kernels, with many thousands of lines of template-heavy code and lots of function calls. The slowdowns seem to be related to the number of debugging symbols. We’ve found that compiling the kernels we aren’t debugging in release mode minimizes the time it takes to get through these initial waits that I’ve complained about. (This also benefits the initial JIT compile time!)

I typically compile in JIT mode using compute_60 as a baseline. We see the issue on Tesla and Volta cards.
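
For illustration, our split between debug and release compilation looks roughly like this; the file names are made up and the flags are just a sketch of our setup, not a recommendation:

    # Kernel under active debugging: full device debug info (-G)
    nvcc -G -gencode arch=compute_60,code=compute_60 -c kernel_debug.cu -o kernel_debug.o

    # Kernels we aren't debugging: release mode, still embedding compute_60 PTX for JIT
    nvcc -O2 -gencode arch=compute_60,code=compute_60 -c kernel_other.cu -o kernel_other.o

Keeping -G off the translation units we aren’t stepping through is what cuts down both the debugger’s initial symbol load and the JIT compile time.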

We have tried CUDA 11.2 and 11.3.1, but they led to performance decreases in the actual runtime, so we have opted to stay at 11.1.1 for the moment. But that’s another issue, which we haven’t investigated any further.

Edit: because of the templates, we probably actually have ~100 kernels, most of them quite large.

cuda-gdb is too slow to be a debugger… Help!

I can’t agree more! Besides, it’s also disorienting when the focus suddenly jumps to another thread’s breakpoint as you continue to the next breakpoint!

It’s a completely awful experience to be stepping through a kernel, only to have the focus suddenly switch to a different configuration entirely for no apparent reason. Another issue is when it skips lines. I. Just. Want. To. Single. Step.

Maybe I should already know why this happens, or what to do when it does and how to prevent it, but I can’t find a clear answer, or any helpful information, about these situations anywhere in the documentation. It’s like wanting help on an essay and being directed to the dictionary.

Thanks for reaching out. Can you provide more insight into the nature of your application? Without a reproducer we can only guess at what is causing the slowdown.

Can you share what is printed to the screen when the focus change occurs while single stepping?

Is your application compiled with -G?

In general, we have found that overall slowness can sometimes be attributed to certain applications calling cudaSetDevice() before every CUDA API call. Today this forces the debugger to halt execution to retrieve a context push/pop event. The performance hit due to this will be resolved in an upcoming release. For now, as a workaround, make sure your application is not calling cudaSetDevice() before every API call.
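
As a rough sketch of what to avoid (the function and buffer names here are hypothetical):

    #include <cuda_runtime.h>

    // Anti-pattern: re-selecting the device before every runtime API call.
    // Each redundant cudaSetDevice() forces the debugger to stop and
    // process a context push/pop event.
    void slow_under_debugger(float* d_buf, const float* h_buf, size_t n) {
        cudaSetDevice(0);
        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaSetDevice(0);  // redundant: device 0 is already current
        cudaMemset(d_buf, 0, n * sizeof(float));
    }

    // Workaround: select the device once (e.g. at startup) and only call
    // cudaSetDevice() again when actually switching devices.
    void debugger_friendly(float* d_buf, const float* h_buf, size_t n) {
        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemset(d_buf, 0, n * sizeof(float));
    }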

When reporting performance issues in CUDA-GDB, it would be very helpful to the debugger team if, at a minimum, the following information is provided (an example session is sketched after the list):

  1. Reach the point in the execution where the slow step/next/continue occurs.
  2. Enable stat collection with set cuda collect_stats on.
  3. Issue the command that is observed to be slow.
  4. Issue maint print cuda_stats and copy/paste the results into the report.
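
For example, a collection session might look like the following; the kernel name is hypothetical and the actual output will vary:

    (cuda-gdb) break my_kernel
    (cuda-gdb) run
    ...                                   # execution reaches the slow point
    (cuda-gdb) set cuda collect_stats on
    (cuda-gdb) next                       # the command observed to be slow
    (cuda-gdb) maint print cuda_stats
    ...                                   # copy/paste this output into the report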