Recently I have been trying to use the debugger, but I am wondering if the behaviour that I am seeing is correct.
A kernel that normally takes 27 ms to execute now takes up to 86,000 ms. I also noticed that under the debug build, even when run without gdb, some kernels take significantly longer to execute.
Is such a major slowdown to be expected while using the debugger?
(cudaThreadSynchronize() is used when timing the kernels and a Tesla C1060 is used for execution)
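For reference, the timing is done roughly like this (a minimal sketch; `myKernel`, the launch configuration, and the buffer size are placeholders for the real code):

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) {   // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main() {
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    // Warm-up launch, then time the kernel with a host timer.
    // cudaThreadSynchronize() blocks until the kernel finishes,
    // so the elapsed wall-clock time covers the whole execution.
    myKernel<<<4, 256>>>(d_data);
    cudaThreadSynchronize();

    clock_t start = clock();
    myKernel<<<4, 256>>>(d_data);
    cudaThreadSynchronize();
    clock_t stop = clock();

    printf("kernel time: %.3f ms\n",
           1000.0 * (stop - start) / CLOCKS_PER_SEC);
    cudaFree(d_data);
    return 0;
}
```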
I think so. I read in one of the files that comes with it that all variables spill over to local memory when building for the debugger. So registers are basically only used for storing values while a function is executing, and afterwards the values get written back to global memory again.
I too am curious about the debugger performance (it will be difficult to debug real applications with performance drops that get too insane).
I guess the big question is: would an app compiled with -O0 perform about the same as the debug one?
I’m having trouble running my debugged code too. What runs fine with -deviceemu (and in real runs) gives ‘unspecified launch failures’ in the gdb build. The program appears to just hang when I run it from within the debugger.
Also, it seems like dumping registers to shared memory wouldn’t work. Are registers held in global memory for these tests? Where does stuff get shifted to?
How much communication has to occur between the host and the device for meaningful debugging to occur?
Yet another update: It appears the binary compiled with -g -G fails sporadically when launched as a regular application (rather than run in the debugger).
So in my case, “./test; ./test; ./test; ./test” gave me Pass, Pass, Fail, Pass (and running it again gave Fail, Fail, Fail, Pass).
The fail error is “unspecified launch failure”.
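For what it’s worth, this is roughly how the failure shows up; checking the error right after the launch (a sketch — `myKernel` and the sizes are placeholders, not my actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) {   // placeholder kernel
    data[threadIdx.x] *= 2.0f;
}

int main() {
    float *d_data;
    cudaMalloc(&d_data, 256 * sizeof(float));

    myKernel<<<1, 256>>>(d_data);

    // "unspecified launch failure" surfaces as an error returned by
    // the synchronize (or by cudaGetLastError) after the launch.
    cudaError_t err = cudaThreadSynchronize();
    if (err == cudaSuccess)
        err = cudaGetLastError();

    printf("%s\n", err == cudaSuccess ? "Pass"
                                      : cudaGetErrorString(err));
    cudaFree(d_data);
    return 0;
}
```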
Any ideas? I’m going to go try installing Valgrind now.
The only size (used for everything) is defined by the ‘N’ at the top of the main function. sgemm.cu is included inline into main.cu.
You’ll need to chop the ‘.txt’ off each filename. The forum didn’t allow me to upload .tgz, .cu, or extension-free files.
Also, do these debugger slowdowns sound right?
I’m seeing the plain launches of the SGEMM (./test.opt and ./test.dbg) slowing down 30-80x. Launching the debug code in cuda-gdb (cuda-gdb ./test.dbg) causes a 2000x slowdown from a base -O3 run.
For the SDK MonteCarlo program, I see a 28x drop by building with -g -G. Launching with the debugger gives a 2400x slowdown.
For the SDK binomialOptions program, I see a 15x drop by building with -g -G. It hadn’t finished after a few minutes when launched with cuda-gdb.
The debugger sets certain internal GPU bits that make debugging possible but also carry a performance penalty, beyond anything the compiler itself can do in this situation.