It is not clear what you mean by " ‘release’ code without compiler optimization ". The whole point of a release build is that compiler optimizations are turned on.
The nvcc driver program calls a number of different programs under the hood. In terms of optimizations, the two relevant ones are cicc, which contains the LLVM-based source-to-PTX compiler, and ptxas, which is the PTX-to-SASS (machine code) compiler. I am not aware of a way to control the optimizations performed by cicc. You can control ptxas optimizations via -Xptxas -On on the nvcc command line. The default setting for release builds is -O3.
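As a sketch of what that looks like on the command line (the file and output names are placeholders), a release-style build with ptxas optimization dialed down might be invoked as:

```shell
# Normal release build: ptxas runs at its default, -O3.
nvcc -o app app.cu

# Release-style build, but with PTX-to-SASS optimization disabled.
# -Xptxas -O0 forwards -O0 to ptxas; -lineinfo adds source line
# correlation without the full overhead of a -G debug build.
nvcc -o app app.cu -Xptxas -O0 -lineinfo
```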
If you have hangs at __syncthreads(), the first thing you would want to check is whether these are in divergent code paths, which gives rise to undefined behavior that can also include hangs. Compound branch conditions in conjunction with C++ short-circuit evaluation rules sometimes make it hard to notice that a __syncthreads() call is in fact part of divergent flow.
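A minimal illustration of the kind of bug meant here (hypothetical code, not from the poster's application): the barrier sits behind a compound, short-circuited condition, so threads of the same block can disagree about whether they reach it.

```cuda
// Hypothetical helper; note that if a function called from a branch
// condition itself contained a __syncthreads(), short-circuit
// evaluation would make that barrier divergent as well.
__device__ bool readFlag(const int *p, int i)
{
    return p[i] != 0;
}

__global__ void divergentSync(const int *flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // BUG: threads with i >= n (or with a zero flag) skip the barrier
    // entirely. All threads of a block must reach the same
    // __syncthreads(); anything else is undefined behavior and can
    // manifest as a hang, often only in optimized release builds.
    if (i < n && readFlag(flags, i)) {
        __syncthreads();
    }
}
```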
Some compiler issue could be the reason as well, but I have not encountered a compiler bug related to synchronization in years, so this seems very unlikely to me.
Are you struggling with the same code from 2 months ago?
"What we are trying to find out is: is there some way that we can turn off both device code optimization and PTX optimization for a release build?"
Yes, simply use your debug build; a debug build is effectively a release build with none of the mentioned optimizations.
“A strange thing happened in our case: the code works in debug mode (-G device-side debug flag), but fails (hangs at some __syncthreads() points) with a normal release build.”
Do you know for a fact that it “fails (hangs at some __syncthreads() points)”, or are you guessing?
Have the code (release build) write out a message at strategic positions, such that you can track which positions it reaches, and pinpoint the section of code likely to contain the release-build bug.
If it points to a kernel (the code enters, but never exits, a kernel launch), write an intermediate kernel that dumps key variables from time to time, such that you can form an idea of the likely problem.
Yes. This problem has been torturing me for more than 2 months.
What do you mean by ‘intermediate kernel’? Do you mean writing another kernel that writes the key variables out from device global memory? (We mostly use shared memory or registers to store those key variables.)
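Since shared memory and registers are not visible from the host, the usual workaround is to mirror the interesting state out to a global-memory debug buffer from inside the kernel under investigation, then copy and print it on the host side. A hypothetical sketch (names and sizes are placeholders):

```cuda
__global__ void kernelUnderTest(float *dump /* debug buffer, one slot per thread */)
{
    __shared__ float s[256];
    float r = 0.0f;  // placeholder for some register-resident value

    // ... normal kernel work populating s and r ...

    // Debug-only: copy the interesting state out to global memory so
    // the host can inspect it with cudaMemcpy after the kernel returns.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    dump[tid] = s[threadIdx.x] + r;
}
```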
Our code works in debug mode, but not release mode. That’s why we cannot use ‘regular’ debug mode to explore the problem. Instead, we would like something like ‘debug mode with optimization’ to examine whether the problem is caused by compiler optimization.
What we observe is that the program starts, and hangs after the host section has executed and the kernel is launched; of this we are 100% sure. We have also double-checked this by attaching cuda-gdb to a running buggy process.
I would simply test whether the kernel actually commences, and subsequently exits, by means of output messages at the right places.
But assume you are right in your observation that the kernel is the culprit.
As an informal means, I would use the NVIDIA X server settings panel to note whether the kernel has stalled, or whether it is essentially in an infinite loop (watch the reported GPU utilization value).
I suppose you could also use the debugger, as you point out, pause during execution, and note the state of the blocks.
It is possible that the kernel is in an infinite loop, and it is possible that the kernel has diverged (some of the threads inactive, whilst others are still labouring); I suppose there are other possibilities as well.
To formally debug a buggy release-build kernel, I would again consider writing messages/counts to track progress through and within the kernel (it helps to know where in the kernel things go off-track), and occasionally dump/display strategic variables known to strongly determine the kernel's execution path.
Hi Shenjun, if you are a CUDA registered developer, can you try installing the CUDA 7.0 Release Candidate? cuda-memcheck in CUDA 7.0 has a feature called synccheck that can detect cases where __syncthreads() has been used in an invalid fashion.
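For reference, the invocation is straightforward (the application name below is a placeholder):

```shell
# Run the application under the synccheck tool to flag invalid
# (e.g. divergent) __syncthreads() usage.
cuda-memcheck --tool synccheck ./myapp
```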
I can understand that the line number may be helpful; at the same time, I do not think it is crucial:
a) kernels hardly contain that many synchronization calls;
b) you can easily chop your kernel from the back (by commenting out) to note the first sync call not reached/completed;
c) if you understand what the compiler may/can/will do when optimization is allowed (i.e. in a release build), you should readily be able to scrutinize your kernel code and tell which sync calls are ‘vulnerable’ to optimization.