Cuda-gdb is slow as molasses

So this is a bit of a rant, but I’ve been waiting for cuda-gdb to execute a ‘next’ command for about 10 minutes now. Don’t worry, it’ll get there eventually, but why on earth is it so slow? Can I do anything about it, and is there anything being done about it?

The same thing happens when hitting a breakpoint in a kernel for the first time. I get a message saying it’s switching focus to kernel blah_blah block(blah,blah) thread(blah,blah) yadda yadda, and I know I can now go do something else, and when I come back in ~10 minutes it will be ready for me. At least until I say ‘next’, at which point I have another 10-minute wait. After that it is fine, no more waiting, at least until I have to restart the program and do it all over again.

This behavior is really killing my edit-compile-debug flow. I’m using CUDA 11.1.1.

Hi there! Thanks for reaching out.

Sorry you are having a poor experience with CUDA-GDB. Are you able to share any more information about the application you are trying to debug? A reproducer would help us determine why the slowdown is occurring.

There can be high initial load time costs upon the launch of the first kernel for certain CUDA applications. Ten minutes, however, is not expected. Does your application contain many (on the order of hundreds of) CUDA kernels? The debugger team has recently been examining some of our load time issues and will be addressing performance bottlenecks in future releases. It might be worthwhile to try a CUDA-GDB from one of our newer CUDA Toolkit releases. There have also been driver improvements that may benefit your use case.

You can read about the latest CUDA-GDB changes here.

Wow, an employee, that’s awesome. The application is something at work, so I can’t share it, unfortunately. We have on the order of tens of kernels, but they are large kernels, with many thousands of lines of template-heavy code and lots of function calls. The slowdowns seem to be related to the number of debugging symbols. We’ve found that if we compile the kernels we aren’t debugging in release mode, it minimizes the amount of time it takes to get through these initial waits that I’ve complained about. (This also benefits the initial JIT compile time as well!)
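For anyone finding this later, the split build we use looks roughly like this. This is just a sketch with placeholder file names, not our actual build, but it shows the idea: only the translation unit you are actively debugging gets full device debug info (`-G`), while everything else builds optimized.

```shell
# Sketch of a mixed debug/release CUDA build (file names are placeholders).

# The kernel under investigation: -G embeds full device debug symbols
# (and disables device optimizations), -g adds host debug info.
nvcc -c -G -g kernel_under_debug.cu -o kernel_under_debug.o

# All other kernels: release optimization, with -lineinfo so backtraces
# still map to source lines without the full debug-symbol cost.
nvcc -c -O3 -lineinfo other_kernels.cu -o other_kernels.o

# Link as usual.
nvcc kernel_under_debug.o other_kernels.o -o app
```

Since cuda-gdb appears to spend its startup time chewing through device debug symbols, shrinking the set of `-G` translation units shrinks that wait proportionally.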

I typically compile in JIT mode using compute_60 as a baseline. We see the issue on Tesla and Volta cards.
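Concretely, "JIT mode with a compute_60 baseline" means embedding only PTX in the binary and letting the driver compile it for the actual GPU at load time. A hypothetical pair of nvcc invocations, one matching that setup and one that avoids the first-launch JIT cost by also shipping SASS:

```shell
# PTX-only (JIT) build: the driver compiles compute_60 PTX for the
# actual GPU at application load time.
nvcc -c -gencode arch=compute_60,code=compute_60 kernels.cu -o kernels.o

# Fat-binary alternative: ship SASS for the cards in use (Pascal sm_60,
# Volta sm_70) plus PTX as a forward-compatibility fallback, trading
# binary size for no JIT step on those GPUs.
nvcc -c -gencode arch=compute_60,code=sm_60 \
        -gencode arch=compute_70,code=sm_70 \
        -gencode arch=compute_70,code=compute_70 \
        kernels.cu -o kernels.o
```

If the load-time cost scales with JIT work, the fat-binary variant may also shorten the initial wait, at the price of longer compiles and a larger binary.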

We have tried CUDA 11.2 and 11.3.1, but they led to performance decreases in the actual runtime, so we have opted to stay at 11.1.1 for the moment. But that’s another issue, which we haven’t investigated any further.

Edit: because of the templates, we probably actually have ~100 kernels, most of them quite large.