How to debug a program that only bugs in release mode? Debug and emu do not show the problem at all

Why would this happen? This is the same program compiled 3 times: release, debug, emurelease.
Could it be due to rounding? The numbers I can check in the host code do start to be slightly different in the various modes as the program runs.

Ah, the dreaded Heisenbug.

These often happen due to race conditions, but that’s as much as my crystal ball is willing to tell me now.

It can also be due to differing orders of floating point operations, dependent on the precise definition of “slightly different.” However, my crystal ball is also suffering from extreme fogging right now.

Define “bugs”.

Seriously, if you expect help, you are going to have to describe the problem with more precision. We are not playing “20 questions” here…

It seems that over a very narrow range of input numbers, one of the kernels would not launch at all except in debug or emulation mode.
For 256 threads per block:
N = 29000 to about 31000

kernel<<< N/256 , 256 >>> …
And I solved this problem by changing N.

This is still very strange!
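
A side note on the launch line quoted above: `N/256` is integer division, so it truncates whenever N is not a multiple of 256. The host-only sketch below works through that arithmetic for the N values mentioned. The helper names are made up for illustration, and nothing in the thread establishes that this truncation is what triggers the failure.

```cuda
// Host-only arithmetic; no device code is needed for this part.

// Blocks launched by the expression N / threadsPerBlock (rounds down).
inline int blocksTruncated(int n, int threadsPerBlock) {
    return n / threadsPerBlock;
}

// Blocks needed so that every one of the n elements gets a thread (rounds up).
inline int blocksRoundedUp(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Elements that get no thread when the grid size is rounded down.
inline int uncoveredElements(int n, int threadsPerBlock) {
    return n - blocksTruncated(n, threadsPerBlock) * threadsPerBlock;
}

// Threads with an index >= n when the grid size is rounded up.
// These are the threads that need a bounds check inside the kernel.
inline int surplusThreads(int n, int threadsPerBlock) {
    return blocksRoundedUp(n, threadsPerBlock) * threadsPerBlock - n;
}

// N = 29000, 256 threads per block:
//   rounded down: 113 blocks = 28928 threads, 72 elements untouched
//   rounded up:   114 blocks = 29184 threads, 184 threads past the end
```

Either rounding choice is harmless on its own; trouble only starts if the kernel or the host code assumes all N elements were processed, or if surplus threads index past the end of an array.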

I’d put in a cudaThreadSynchronize() between all CUDA operations
and check for error codes following that.

Then when an operation fails, you will know.
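
A minimal sketch of that advice, assuming a reasonably recent toolkit: `cudaDeviceSynchronize()` is used here because it superseded `cudaThreadSynchronize()`, so on a toolkit from the era of this thread you would call the older function instead. The helper names `reportCuda` and `syncAndCheck` are hypothetical, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a message for a failed CUDA call.
// Returns true on success so the caller decides how to react.
inline bool reportCuda(cudaError_t err, const char* label) {
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "CUDA error after %s: %s\n",
                 label, cudaGetErrorString(err));
    return false;
}

// Hypothetical helper: call this after every kernel launch or memcpy
// while hunting the bug. It first picks up launch-time errors (for
// example an invalid configuration), then waits for the device so that
// faults raised while the kernel runs are reported at this point and
// not at some later, unrelated call.
inline bool syncAndCheck(const char* label) {
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    return reportCuda(err, label);
}
```

Usage would look like `kernel<<<grid, block>>>(args); if (!syncAndCheck("kernel")) { /* stop here */ }`, repeated after each of the roughly ten kernels. The synchronization costs performance, so it belongs in a debugging build.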

Error checking, now there is an idea…

Possibly - Something to do with your “register” count…and the number of threads per block you are spawning…

Instead of changing “N”, just change the way you are spawning the kernel as (N/32, 32) and see if it works…
OR
If you are knowledgeable enough, You can work out the register math yourself from the cubin…

None of these suggestions applies to the main problem: the same program compiled with different flags (dbg, emu, none) behaves differently at runtime.
To summarize:

WHEN : only in release mode at some specific input values
WHERE: in one kernel of about 10
WHAT : the kernel does not launch, with the message: unspecified launch failure
WHY : I do not know
WHO : probably me : mistake in variable declaration ?

It is just annoying, because it works with new values and this is good enough right now.

That usually indicates an invalid memory access in your kernel (out of bounds access).

Probably for certain configurations of your kernel the out of bounds access does not occur, while for other configurations it does and hence fails.

eyal

Huzzah! We now know what the problem actually is. Unspecified launch failures are usually out of bounds memory access. I can think of at least three reasons why it might only be appearing in “release” mode :

1. In emulation everything is in the same memory space, and your code is probably silently reading/writing over memory inside the process image, which you won’t see unless you use something like valgrind. The warp size is one, which also eliminates about 99.9% of race conditions that might otherwise happen on the device.

2. In debugging mode compiler optimizations are disabled and registers are spilled to local memory, so potential subtle race conditions in poorly written code aren’t exposed.

3. Your hardware is flaky, and only when the code is running at full speed does it start misbehaving (I had this happen once).

My suggestion is to use emulation + valgrind or gpu-ocelot to run the code and see what happens. I have found ocelot to be flawless at detecting illegal memory access. Alternatively, if you are running Linux, there is the new cuda-memcheck utility, which should do the same thing as Ocelot, but on the device. If I were to guess, I would say you have an indexing or addressing snafu in your kernel code, mostly because you are getting the failures over an apparently narrow range of execution parameters.

EDIT: eyal beat me to most of it.
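
For reference, the usual way to rule out the kind of indexing snafu described above is to round the grid size up and guard every global access with the element count. The original kernel was never posted, so `scaleGuarded` and `launchScale` below are stand-ins invented for this sketch, not a fix for the actual code.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel (assumption: one thread per element of a 1D array).
// The guard keeps threads in the last, partially filled block from
// touching memory past the end of the array.
__global__ void scaleGuarded(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Launch with enough blocks to cover all n elements.
// Returns the launch status so the caller can check it.
cudaError_t launchScale(float* d_data, int n, float factor) {
    if (n <= 0) return cudaSuccess;               // a zero-block grid is invalid
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleGuarded<<<blocks, threadsPerBlock>>>(d_data, n, factor);
    return cudaGetLastError();                    // reports launch-time errors only
}
```

A guard like this only covers the simple `data[i]` pattern. Any index computed from other inputs (offsets, strides, neighbour accesses such as `data[i + 1]`) needs its own check, and that is where a memory checker earns its keep.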