cuda-gdb performance

Recently I have been trying to use the debugger, but I am wondering if the behaviour that I am seeing is correct.

A kernel that normally takes 27 ms to execute now takes up to 86,000 ms. I also noticed that under the debug build, some kernels take significantly longer to execute even when run without gdb.

Is such a major slowdown to be expected while using the debugger?

(cudaThreadSynchronize() is called when timing the kernels, and a Tesla C1060 is used for execution.)
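For reference, the timing setup described above can be sketched roughly like this (a minimal sketch; the kernel, sizes, and launch configuration are placeholders, not the actual test code):

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// placeholder kernel standing in for the real workload
__global__ void myKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

static double wallTimeMs() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main() {
    const int n = 1 << 20;
    float *d = 0;
    cudaMalloc((void **)&d, n * sizeof(float));

    double t0 = wallTimeMs();
    myKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaThreadSynchronize();   // wait for the kernel before stopping the timer
    double t1 = wallTimeMs();
    printf("kernel time: %.3f ms\n", t1 - t0);

    cudaFree(d);
    return 0;
}
```

Without the cudaThreadSynchronize() call, the host timer would stop immediately after the asynchronous launch and report a misleadingly small number.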

I think so. I read in one of the files that comes with it that all variables spill over to local memory when building for the debugger. So you basically only use the registers for storing values when calling a function, and afterwards the return value gets written to global memory again.

Look at ptxvars.cu in your /usr/local/cuda/bin directory, and you’ll get a better idea of how the debugger works.

I too am curious about the debugger performance (it will be difficult to debug real applications if the performance drops get too insane).

I guess the big question is: would an app compiled with -O0 perform about the same as the debug one?

I’m having trouble running my debugged code too. What runs fine with -deviceemu (and in real runs) gives ‘unspecified launch failures’ in the GDB version. It appears that the program just hangs when I run it from within the debugger.

Any thoughts or ideas?

Ben

Also, it seems like dumping registers to shared memory wouldn’t work. Are registers held in global memory for these tests? Where does stuff get shifted to?

How much communication has to occur between the host and the device for meaningful debugging to occur?

Ben

Looking at the PTX, it seems like duplicates of the registers are being kept in local memory, which maps to global memory doesn’t it?

I still don’t see any clues as to how bandwidth heavy the debugger is. Is there anything really going back to the host?

Ben

Yet another update: It appears the binary compiled with -g -G fails sporadically when launched as a regular application (rather than run in the debugger).

So in my case, “./test; ./test; ./test; ./test” gave me Pass, Pass, Fail, Pass (and running it again gave Fail, Fail, Fail, Pass).

The fail error is “unspecified launch failure”.
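To narrow down where the ‘unspecified launch failure’ surfaces, one option is to synchronize and check the error status after each launch (a minimal sketch; `myKernel` and its launch configuration are placeholders, not the actual test code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel standing in for the real one
__global__ void myKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// synchronize so asynchronous launch errors surface, then report them
static void checkLaunch(const char *label) {
    cudaThreadSynchronize();
    cudaError_t e = cudaGetLastError();
    if (e != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(e));
}

int main() {
    const int n = 1 << 20;
    float *d = 0;
    cudaMalloc((void **)&d, n * sizeof(float));

    myKernel<<<(n + 255) / 256, 256>>>(d, n);
    checkLaunch("myKernel");   // prints the error string instead of failing silently

    cudaFree(d);
    return 0;
}
```

Because kernel launches are asynchronous, the failure is often reported by a later API call rather than the launch that actually caused it; checking after every launch pins down which one fails.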

Any ideas? I’m going to go try installing Valgrind now.

Ben

Valgrind isn’t throwing up any huge red flags.

It notes there are some allocations that weren’t freed on exit, but it says nothing about stuff jumping out of bounds.

Ben

Can you give me a repro case?

This is just a wrapper around some code Vasily Volkov posted to this thread: http://forums.nvidia.com/index.php?showtopic=47689

The only size (used for everything) is defined by the ‘N’ at the top of the main function. sgemm.cu is included inline into main.cu.

You’ll need to chop all the '.txt’s off the filenames. The forum didn’t allow me to upload .tgz, .cu, or extension-free files.


Also, do these debugger slowdowns sound right?

I’m seeing the plain launches (./test.opt and ./test.dbg) of the SGEMM slowing down 30-80x. Launching the debug code in cuda-gdb (cuda-gdb ./test.dbg) causes a 2000x slowdown from a base -O3 run.

For the SDK MonteCarlo program, I see a 28x drop by building with -g -G. Launching with the debugger gives a 2400x slowdown.

For the SDK binomialOptions program, I see a 15x drop by building with -g -G. It hadn’t finished running after a few minutes when launched with cuda-gdb.

Ben
sgemm.cu.txt (5.54 KB)
Makefile.txt (437 Bytes)
main.cu.txt (2.28 KB)

Yeah, it’s very slow. A lot (a lot) of stuff gets spilled to global memory after every arithmetic operation, so such slowdowns are not unimaginable.

What would explain the differences in the ./ and cuda-gdb launch timings of the -g -G program?

Ben

The debugger sets certain internal GPU bits that make debugging possible but also carry a performance penalty, on top of what the compiler-level changes alone cost.