cuda-gdb performance

Recently I have been trying to use the debugger, but I am wondering if the behaviour that I am seeing is correct.

A kernel that normally takes 27 ms to execute now takes up to 86,000 ms. I also noticed that under the debug build, some kernels take significantly longer to execute even when run without gdb.

Is such a major slowdown to be expected while using the debugger?

(cudaThreadSynchronize() is called when timing the kernels, and a Tesla C1060 is used for execution.)
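For reference, the timing setup described above can be sketched roughly like this (a minimal sketch; the kernel, sizes, and launch configuration are placeholders, not the actual test code):

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// placeholder kernel standing in for the real workload
__global__ void myKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

static double wallTimeMs() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main() {
    const int n = 1 << 20;
    float *d = 0;
    cudaMalloc((void **)&d, n * sizeof(float));

    double t0 = wallTimeMs();
    myKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaThreadSynchronize();   // wait for the kernel before stopping the timer
    double t1 = wallTimeMs();
    printf("kernel time: %.3f ms\n", t1 - t0);

    cudaFree(d);
    return 0;
}
```

Without the cudaThreadSynchronize() call, the host timer would stop immediately after the asynchronous launch and report a misleadingly small number.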

I think so. I read in one of the files that comes with it that all variables spill over to local memory when building for the debugger. So you basically only use the registers for storing values when calling a function, and afterwards the return value gets written to global memory again.

Look at ptxvars.cu in your /usr/local/cuda/bin directory, and you’ll get a better idea of how the debugger works.

I too am curious about the debugger performance (it will be difficult to debug real applications if the performance drops get too insane).

I guess the big question is: would an app compiled with -O0 perform about the same as the debug one?

I’m having trouble running my debugged code too. What runs fine with -deviceemu (and in real runs) gives ‘unspecified launch failures’ in the GDB version. It appears that the program just hangs when I run it from within the debugger.

Any thoughts or ideas?

Ben

Also, it seems like dumping registers to shared memory wouldn’t work. Are registers held in global memory for these tests? Where does stuff get shifted to?

How much communication has to occur between the host and the device for meaningful debugging to occur?

Ben

Looking at the PTX, it seems like duplicates of the registers are being kept in local memory, which maps to global memory doesn’t it?

I still don’t see any clues as to how bandwidth heavy the debugger is. Is there anything really going back to the host?

Ben

Yet another update: It appears the binary compiled with -g -G fails sporadically when launched as a regular application (rather than run in the debugger).

So in my case, “./test; ./test; ./test; ./test” gave me Pass, Pass, Fail, Pass (and running it again gave Fail, Fail, Fail, Pass).

The fail error is “unspecified launch failure”.
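To narrow down where the ‘unspecified launch failure’ surfaces, one option is to synchronize and check the error status after each launch (a minimal sketch; `myKernel` and its launch configuration are placeholders, not the actual test code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel standing in for the real one
__global__ void myKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// synchronize so asynchronous launch errors surface, then report them
static void checkLaunch(const char *label) {
    cudaThreadSynchronize();
    cudaError_t e = cudaGetLastError();
    if (e != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(e));
}

int main() {
    const int n = 1 << 20;
    float *d = 0;
    cudaMalloc((void **)&d, n * sizeof(float));

    myKernel<<<(n + 255) / 256, 256>>>(d, n);
    checkLaunch("myKernel");   // prints the error string instead of failing silently

    cudaFree(d);
    return 0;
}
```

Because kernel launches are asynchronous, the failure is often reported by a later API call rather than the launch that actually caused it; checking after every launch pins down which one fails.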

Any ideas? I’m going to go try installing Valgrind now.

Ben

Valgrind isn’t throwing up any huge red flags.

It notes there are some allocations that weren’t freed on exit, but it says nothing about stuff jumping out of bounds.

Ben

Can you give me a repro case?

This is just a wrapper around some code Vasily Volkov posted to this thread: http://forums.nvidia.com/index.php?showtopic=47689

The only size (used for everything) is defined by the ‘N’ at the top of the main function. sgemm.cu is included inline into main.cu.

You’ll need to chop all the '.txt’s off the filenames. The forum didn’t allow me to upload .tgz, .cu, or extension-free files.


Also, do these debugger slowdowns sound right?

I’m seeing the plain launches (./test.opt and ./test.dbg) of the SGEMM slowing down 30-80x. Launching the debug code in cuda-gdb (cuda-gdb ./test.dbg) causes a 2000x slowdown from a base -O3 run.

For the SDK MonteCarlo program, I see a 28x drop by building with -g -G. Launching with the debugger gives a 2400x slowdown.

For the SDK binomialOptions program, I see a 15x drop by building with -g -G. It hadn’t finished running after a few minutes when launched with cuda-gdb.

Ben
sgemm.cu.txt (5.54 KB)
Makefile.txt (437 Bytes)
main.cu.txt (2.28 KB)

Yeah, it’s very slow. A lot (a lot) of stuff gets spilled to global memory after every arithmetic operation, so such slowdowns are not unimaginable.

What would explain the differences in the ./ and cuda-gdb launch timings of the -g -G program?

Ben

The debugger sets certain internal GPU bits that make debugging possible but also carry a performance penalty, on top of what the compiler-level changes alone cost.