Release mode error debug

Hi guys,

I am running into some release-mode issues. The program runs fine with debug-mode nvcc flags, but hangs when compiled in release mode.

The program somehow hangs at some point and never finishes. It mostly happens at __syncthreads() points.

We have run cuda-memcheck’s memcheck and racecheck tools to check for memory and race-condition issues with the debug build.

But we cannot run the same tests on the release build, since the process never stops.

We suspect it is caused by compiler optimization of the device code.

We would like to know how to compile a ‘release’ build without compiler optimization.

Does something like the ‘-O0’ flag work?


It is not clear what you mean by " ‘release’ code without compiler optimization ". The whole point of a release build is that compiler optimizations are turned on.

The nvcc driver program calls a number of different programs under the hood. In terms of optimizations, the two relevant ones are cicc, which contains the LLVM-based source-to-PTX compiler, and ptxas, which is the PTX-to-SASS (machine code) compiler. I am not aware of a way to control the optimizations performed by cicc. You can control ptxas optimizations via -Xptxas -On on the nvcc command line. The default setting for release builds is -O3.
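As an illustration, the ptxas optimization level can be dialed down on an otherwise normal release build (file and binary names here are just placeholders):

```
# normal release build (ptxas defaults to -O3)
nvcc -o app app.cu

# release build with PTX-to-SASS optimization disabled
nvcc -Xptxas -O0 -o app app.cu
```

Stepping through -Xptxas -O2 and -O1 as well can help bisect whether a particular ptxas optimization level triggers the hang.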

If you have hangs at __syncthreads(), the first thing you would want to check is whether these are in divergent code paths, which gives rise to undefined behavior that can also include hangs. Compound branch conditions in conjunction with C++ short-circuit evaluation rules sometimes make it hard to notice that a __syncthreads() call is in fact part of a divergent flow.
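For illustration (this is a made-up minimal sketch, not code from the thread), a compound condition like the one below can hide the fact that a barrier sits on a divergent path:

```cuda
__global__ void sketch(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Reads like one bounds check, but short-circuit evaluation means
    // the second clause is per-thread data dependent, so threads take
    // different paths into the branch body.
    if (tid < n && in[tid] > 0.0f) {
        // UNDEFINED BEHAVIOR: threads for which the condition is false
        // never reach this barrier, which can hang the block.
        __syncthreads();
        out[tid] = in[tid];
    }
}
```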

Some compiler issue could be the reason as well, but I have not encountered a compiler bug related to synchronization in years, so this seems very unlikely to me.

I am aware that __syncthreads() in divergent code paths can cause issues, but I don’t think that happens in our code.

A strange thing in our case is that the code works in debug mode (-G device-side debug flag), but fails (hangs at some __syncthreads() points) with a normal release build.

What we are trying to find out is: is there a way to turn off both device code optimization and PTX optimization for a release build?

are you struggling with the same code from 2 months ago?

"What we are trying to find out is: is there a way to turn off both device code optimization and PTX optimization for a release build?"

yes, simply use your debug build; a debug build would be a release build with none of the mentioned optimizations

“A strange thing in our case is that the code works in debug mode (-G device-side debug flag), but fails (hangs at some __syncthreads() points) with a normal release build.”

do you know for a fact that it “fails (hangs at some __syncthreads() points)”, or are you guessing?

have the code (release build) write out messages at strategic positions, so that you can track which positions it reaches, and so that you can pinpoint the section of code likely containing the release-build bug

if it points to a kernel (the code enters, but never exits a kernel launch), write an intermediate kernel that dumps key variables from time to time, such that you can form an idea of the likely problem
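a minimal sketch of such an intermediate kernel (kernel and buffer names are hypothetical; the idea is simply to mirror values that normally stay in shared memory or registers out to a global buffer the host can read):

```cuda
// Hypothetical instrumented variant of the original kernel: key values
// are copied to a global debug buffer, one slot per (block, thread).
__global__ void kernel_instrumented(float *debug_buf /*, original args */)
{
    __shared__ float s_val[256];
    int tid = threadIdx.x;

    // ... original computation filling s_val ...
    s_val[tid] = 0.0f; // placeholder for the real computation

    // Dump per-thread state so the host can inspect it after the launch.
    debug_buf[blockIdx.x * blockDim.x + tid] = s_val[tid];
}
```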

Yes. This problem has been torturing me for more than 2 months.

What do you mean by ‘intermediate kernel’? Do you mean writing another kernel that writes the key variables out from device global memory? (We mostly use shared memory or registers to store those key variables.)

Our code works in debug mode, but not release mode. That’s why we cannot use the ‘regular’ debug mode to explore the problem. Instead, we would like something like ‘debug mode with optimization’ to examine whether the problem is caused by compiler optimization.

What we observe is that the program starts and then hangs after the host section has executed and the kernel is launched; of this we are 100% sure. We have also double-checked it using cuda-gdb attached to the running buggy process.

i would simply test whether the kernel actually commences, and subsequently exits, simply by means of output messages at the right places

but, assume you are right in your observation that the kernel is the culprit

i would, as an informal means, use the nvidia x server to note whether the kernel has stalled, or whether the kernel is essentially in an infinite loop (note the stipulated gpu utilization value)
i suppose you could also use the debugger, as you point out, pause during execution, and note the state of blocks
it is possible that the kernel is in an infinite loop, and it is possible that the kernel has diverged (some of the threads inactive, whilst others are still labouring); i suppose there are other possibilities as well

to formally debug a release build buggy kernel, i would consider again writing messages/ counts to track progress through/ within the kernel (it helps to know where in the kernel things go off-track), and occasionally dump/ display strategic variables that i know greatly determine/ guide kernel execution direction
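as a sketch of the counting approach (checkpoint layout and names are made up for illustration): per-block progress counters in global device memory can be bumped at each point of interest and then inspected from the host, or from cuda-gdb while the kernel is stuck:

```cuda
__device__ unsigned int checkpoint[4]; // one slot per tracked position

__global__ void kernel_instrumented(/* original args */)
{
    if (threadIdx.x == 0) atomicAdd(&checkpoint[0], 1); // kernel entry

    // ... phase 1 of the original kernel ...
    __syncthreads();
    if (threadIdx.x == 0) atomicAdd(&checkpoint[1], 1); // barrier 1 passed

    // ... phase 2 ...
    __syncthreads();
    if (threadIdx.x == 0) atomicAdd(&checkpoint[2], 1); // barrier 2 passed
}
```

note that device-side printf output is only flushed when the launch completes or synchronizes, so for a truly hung kernel the counters (read via the debugger) are more reliable than printf messages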

alternatively, just post your complete kernel

Hi Shenjun, If you are a CUDA registered developer, can you try installing the CUDA 7.0 Release Candidate? Cuda-memcheck in CUDA 7.0 has a feature called synccheck that can detect cases where __syncthreads() has been used in an invalid fashion.
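For reference, the tool is selected via cuda-memcheck’s --tool option (the binary name here is a placeholder):

```
cuda-memcheck --tool synccheck ./app
```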

That’s a good hint.

I have installed CUDA 7.0 RC and run the synccheck, which does show some Barrier errors with divergent threads in thread block.

However, it does not tell me which line of code causes the problem, or which __syncthreads() cannot be reached.

Try compiling your code with -lineinfo

I’m not sure it will give any more info, but it might.
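For example (file and binary names are placeholders), rebuilding with line information and rerunning the tool would look something like:

```
nvcc -O3 -lineinfo -o app app.cu
cuda-memcheck --tool synccheck ./app
```

Unlike -G, -lineinfo does not disable optimizations, so it should not perturb the release-mode behavior you are chasing.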

I have tried compiling with the -lineinfo flag.
No extra information is provided. It is still the RC version of CUDA 7.0, though.

i can understand that the line number may be helpful; at the same time, i do not think it is crucial

a) kernels hardly contain that many synchronization calls
b) you can easily chop your kernel from the back (by commenting out), to note the first sync call not reached/ completed
c) if you comprehend what the compiler may/ can/ will do when allowing optimization - a release build - you should readily be able to read/ scrutinize your kernel code and tell which sync calls are ‘vulnerable’ to optimization