Kernel works in debugging mode, but not in production mode

I’ve been working on a long kernel that is part of an insect dispersal function in a spatially explicit model. To the degree that emulation mode works, it works when I compile it with emu=1. When I compile it with dbg=1, it also works. When I compile with a straight make, I get errors in the output (I’m losing insects when I shouldn’t be). When I examine the ptx file, it reports -O0 for the optimization level when dbg=1 is defined, and -O3 when not. I haven’t been able to alter these optimization levels. Have others experienced this? To me it appears that something is going wrong with the optimization, but then, I’m an entomologist. Any clues as to what I might be looking for? Is my only option to start changing the code at random until it works? If it really is an optimization problem, is it likely that writing a simplified version of the code as an example would still have the problem?

Thanks for your help,

Do you use sin/cos/exp/log/pow in your code? The optimized build will enable -use_fast_math, which uses faster but less precise versions of these functions.
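
To get a feel for how much the fast versions can differ, here is a minimal sketch that evaluates sinf() and the fast intrinsic __sinf() on the same inputs and reports the largest gap. The kernel name compareSin, the helper maxSinDifference, and the input range are made up for illustration; they are not from the original code. It assumes the file is compiled without -use_fast_math, since with that flag both calls would map to the same intrinsic.

```cuda
#include <algorithm>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Evaluate sine two ways: the standard sinf() and the fast intrinsic
// __sinf(), which is what sinf() is replaced with under -use_fast_math.
__global__ void compareSin(const float *x, float *accurate, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);
        fast[i]     = __sinf(x[i]);
    }
}

// Hypothetical helper: returns the largest absolute difference between
// the two results for n inputs, or -1.0f if a CUDA call fails.
float maxSinDifference(int n)
{
    if (n <= 0) return -1.0f;

    std::vector<float> hx(n), hAcc(n), hFast(n);
    for (int i = 0; i < n; ++i) hx[i] = 0.5f * i;   // arbitrary test inputs

    size_t bytes = n * sizeof(float);
    float *dx = NULL, *dAcc = NULL, *dFast = NULL;
    if (cudaMalloc(&dx, bytes) != cudaSuccess ||
        cudaMalloc(&dAcc, bytes) != cudaSuccess ||
        cudaMalloc(&dFast, bytes) != cudaSuccess) {
        cudaFree(dx); cudaFree(dAcc); cudaFree(dFast);
        return -1.0f;
    }

    cudaError_t e0 = cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    compareSin<<<(n + 255) / 256, 256>>>(dx, dAcc, dFast, n);
    // A blocking cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t e1 = cudaMemcpy(hAcc.data(), dAcc, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(hFast.data(), dFast, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dAcc); cudaFree(dFast);
    if (e0 != cudaSuccess || e1 != cudaSuccess || e2 != cudaSuccess) return -1.0f;

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = std::max(worst, std::fabs(hAcc[i] - hFast[i]));
    return worst;
}
```

Calling maxSinDifference(1024) from host code and printing the result shows the size of the discrepancy on your hardware. If your dispersal code compares such values against thresholds or rounds them to integer counts, even a small difference could change the outcome.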

Maybe you can try adding cudaThreadSynchronize() after every kernel call (and possibly after every cudaMemcpy()). You might also want to add extra checks for failures of any operation using cudaGetLastError(). In dbg=1 mode (at least when using the cutil macros) you get implicit synchronization, whereas in dbg=0 mode things run asynchronously. Maybe this is why the problem appears.
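
A rough sketch of that checking pattern follows. The macro CHECK_CUDA, the placeholder kernel disperse, and the function runChecked are hypothetical names standing in for the real code. Note that cudaThreadSynchronize() has since been deprecated in favor of cudaDeviceSynchronize(), so the sketch uses the newer call; on an old toolkit, substitute the older name.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a failed CUDA call with its location and
// make the enclosing function return 1.
#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            return 1;                                              \
        }                                                          \
    } while (0)

// Placeholder kernel standing in for the real dispersal kernel.
__global__ void disperse(int *cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) cells[i] += 1;
}

// Returns 0 on success, 1 on the first CUDA error.
// (For brevity, device memory is not freed on the error path.)
int runChecked(int n)
{
    int *d_cells = NULL;
    CHECK_CUDA(cudaMalloc(&d_cells, n * sizeof(int)));
    CHECK_CUDA(cudaMemset(d_cells, 0, n * sizeof(int)));

    disperse<<<(n + 255) / 256, 256>>>(d_cells, n);
    CHECK_CUDA(cudaGetLastError());       // errors from the launch itself
    CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CHECK_CUDA(cudaFree(d_cells));
    return 0;
}
```

With both checks after every launch, a kernel that fails in the optimized build is reported at the launch that caused it rather than at some later, unrelated call.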


I’d second Christian’s advice. Further, synchronization issues might also occur on the device, not only between host and device.
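
As one illustration of a device-side synchronization issue (only a guess at what might apply here, since the actual kernel was not posted), consider threads in a block that share data through shared memory. Without a __syncthreads() barrier between writing and reading, the result depends on timing, and such races can stay hidden at -O0 and in emulation but show up at -O3. The sketch below sums insect counts per block; the names countInsects and totalInsects and the block size are assumptions for the example.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block; the reduction assumes a power of two

// Sum the insect counts of all cells. Each block reduces its slice in
// shared memory, then adds its partial sum to the global total.
__global__ void countInsects(const int *cells, int n, int *total)
{
    __shared__ int partial[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? cells[i] : 0;
    __syncthreads();   // every slot must be written before any is read

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();   // leaving this out creates a race between steps
    }
    if (tid == 0) atomicAdd(total, partial[0]);
}

// Hypothetical host wrapper: returns the total count, or -1 on a CUDA error.
int totalInsects(const std::vector<int> &cells)
{
    int n = (int)cells.size();
    if (n == 0) return 0;

    int *d_cells = NULL, *d_total = NULL;
    if (cudaMalloc(&d_cells, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_total, sizeof(int)) != cudaSuccess) {
        cudaFree(d_cells);
        return -1;
    }

    int result = -1;
    bool ok =
        cudaMemcpy(d_cells, cells.data(), n * sizeof(int),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(d_total, 0, sizeof(int)) == cudaSuccess;
    if (ok) {
        countInsects<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_cells, n, d_total);
        // The blocking copy waits for the kernel and returns any error it raised.
        if (cudaMemcpy(&result, d_total, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            result = -1;
    }
    cudaFree(d_cells);
    cudaFree(d_total);
    return result;
}
```

Running a count like this before and after the dispersal step gives a quick conservation check: if the totals differ, insects are being lost inside the kernel rather than in the host code.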