question about debugging

Hello, I have a question.

I would like to know what the differences are between Debug, EmuDebug, EmuRelease and Release.


First, you know Debug and Release both run on the GPU, and Debug is not really debugging because you cannot debug on the GPU.
EmuDebug and EmuRelease are emulation versions that run your GPU program on the CPU, so in EmuDebug mode you can debug and find some low-level errors. Unfortunately there are still big differences between EmuDebug and Debug, and it is up to you to find a real way to debug on the GPU, which I think is a real challenge.

Good luck!

You mean when I want to debug I use EmuDebug, and when I want to run my program I use Release. Is that correct?

In that case, what are Debug and EmuRelease for?

Debug mode will “activate” all the macros from cutil.h, such as CUDA_SAFE_CALL etc.

Contrary to what LuYao said, it is possible to debug on the GPU. A debugger, running code on the GPU, has been demonstrated at SuperComputing 07. It will be released in the future.


Well, I’m not sure we can say LuYao was wrong. At the moment it’s not possible to debug on the GPU; I think that’s what he meant.

And yes, we heard about this debugger. I’m going to ask the obvious question, as always after one of your ‘announcements’: can we have a clue about a possible release date (for developers and for the public)? Is it a matter of weeks or months?

A debugger on the GPU is crucial to my projects. Is that true? Do you have any detailed information?
At the moment there is no way to debug on the GPU.
My program can run in EmuDebug mode but not in Debug mode, which brings me a lot of trouble, and I don’t even know where the mistake happens. Some high-level logical errors cannot be found in EmuDebug. That’s TERRIBLE.

We don’t comment on release dates, sorry.

If you can’t run in Debug mode, then there is likely a problem in your code somewhere. What exactly is the issue?

Debug mode enables debugging and asserts on the CPU, but keeps the GPU code running on the GPU. So your kernels still run at nearly full speed and you can debug any problems in the rest of your code. It should be the standard mode you develop in because any crash will give you a back trace, you can step through CPU functions, and CUDA_SAFE_CALL will tell you whenever a CUDA call results in an error. Sure, you can’t debug the kernels themselves in Debug mode, but you can still debug EVERYTHING ELSE.

When it comes to actually debugging a kernel call itself, I have yet to run into a problem that I couldn’t solve by setting breakpoints in kernel calls in EmuDebug. Sure, the emu mode is slow, but then you just run your kernel with a reduced number of blocks or on a smaller problem size.

Release mode compiles your CPU code with full compiler optimizations, removes the information needed for generating back traces and stepping through code, disables asserts, and makes CUDA_SAFE_CALL a no-op (thus ignoring any CUDA errors). It is the mode you should compile in for full performance, but with significantly reduced error checking.

EmuDebug emulates kernel calls on the CPU and compiles your code with no optimizations and with debug symbols for backtraces and stepping through code.

EmuRelease emulates kernel calls on the CPU and compiles your code with full optimizations and no debug symbols. It is perhaps the least useful of all the modes, but if you have a really subtle bug that shows up in Release but not in Debug, and you think it might be related to a kernel (however unlikely), then you can compile in EmuRelease and printf-debug kernel calls.

I’m porting a program to the GPU and I just need to rewrite the kernel.

The main flow of the program is:

  1. read the input file from the hard disk into host memory

  2. transfer the data to device memory

  3. run the kernel

    Each block has one thread processing its own part of the input data. All the threads work separately, so I don’t need to care about syncthreads. After some declarations of parameters, in one cycle a thread reads some input data from device memory into a buffer-in zone in registers and puts results into a buffer-out zone, also in registers. When the buffer-out zone is full, the results are transferred to the general result output array in device memory. The input and output array offsets have been calculated many times and I’m sure they are all correct.

  4. transfer the results back to host memory

  5. write the results into files.

The whole program has been checked many times, and what puzzles me is that I can run it in Emu mode and the results are correct, but when I run on the GPU the whole machine just gets stuck: even the mouse cannot move while the kernel is running. Sometimes it shows CUDA errors, but sometimes it just reboots. Of course the result file is empty.

I’ve also tried many ways to “debug” on the GPU, and I found that when I remove the code that transfers results from registers to device memory, the kernel runs well. And of course it’s not just that one piece: all code that writes to device memory behaves the same. But when I remove all the computation, leaving just the read-and-write device memory part, it runs well too (it fails on writes but succeeds on reads, while all access to registers is just fine). But that’s not the end. In the declaration part, each thread writes some parameter values into device memory because they are too big to fit in registers. The form of this write to device memory is just the same as that of transferring results. So I suspect there must be some logical error in the code path where the write to device memory fails, but I cannot detect it because it works perfectly in Emu mode. That’s painful!

Just a note: if you comment out the global memory write, the dead code optimization will remove all the computations.

How long does your kernel take to run in emulation mode? The results you describe when running the kernel (can’t move the mouse) are normal for a kernel that takes more than 1 second to execute. Does the kernel die and produce CUDA errors after running for 5 seconds? You may be hitting the watchdog timeout. Have you tried a smaller dataset?

You say: “Each block has one thread processing its own part of the input data”, which has me worried, because 1 thread per block is not a very optimal way to use the device when the warp size is 32. How many blocks are you launching? Even an empty kernel with 65,000 blocks takes a while to launch due to scheduling overhead.

I had similar problems when I was working on buffering mechanisms:

[1] When you are waiting for the buff-out zone to be full, are other blocks idling (waiting) for the rest? If yes, try launching another kernel that does this for you.

[2] Try re-checking for conflicts between offsets. Dump them into device memory and re-check them against those calculated on the CPU. The warp size in CPU emulation is 1, so it is highly unlikely that you will catch concurrent-write errors in emulation mode.

[3] Make sure the offsets do not extend beyond the allocated memory.

Hope this helps,



All the threads work separately, and even when I use just one block and one thread the same thing happens.

Thanks for your advice and I will check it again!

Wish you good luck too!