Different behavior on omitting --device-debug


I have an application that does image processing for computer vision with CUDA; specifically, it is to be used for drone navigation in a known environment. Several things are bugging me, the first being the most important:

Upon compiling with -G (or --device-debug), the application produces the expected results, although it runs slowly. Conversely, disabling debugging lets the program run in real time, but the result is slightly off and becomes useless over time.

I am compiling with -arch compute_50 -code sm_50; both the latest (compute/sm_62) and the earliest (compute/sm_20) supported alternatives give an "Invalid Texture" error. I am using nvcc from the CUDA 8.0 SDK.

Furthermore, upon compiling with -Xptxas -g, the code runs slowly but also produces the wrong result, implying that this option differs from -G, contrary to what is described in the docs: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#ptxas-options

The program is non-probabilistic and the observations above are made consistently. Does the behavior of -G have any aspects that could possibly produce the above result, that is, that the program runs fine with -G and not without it?

By the way, I have run cuda-memcheck on the program both with and without debugging enabled, both resulting in no detected errors.

(1) cuda-memcheck can find many problems but not all of them. For example, it cannot find all race conditions, and it cannot find accesses that are out of bounds for a particular array, but are within the allocated memory.

(2) Throwing the debugging switch turns off all compiler optimizations. This can change the numerical properties of floating-point computation, for example by turning off the contraction of FADD and FMUL into FMA. This in turn could affect convergence criteria, the number of loop iterations, etc.
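To illustrate the contraction effect in point (2), here is a minimal sketch, not taken from the original application. It evaluates a*b+c once with separately rounded multiply and add (the __fmul_rn and __fadd_rn intrinsics, which are not contracted) and once as a fused multiply-add (__fmaf_rn). The helper name compare_on_device and the input values are assumptions chosen for illustration: the exact product is 1 + 2^-11 + 2^-24, which rounds to 1 + 2^-11 in single precision, so the separately rounded form gives 0 while the fused form gives 2^-24.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Evaluates a*b+c on the device in two ways.
__global__ void compare(float a, float b, float c, float *out)
{
    out[0] = __fadd_rn(__fmul_rn(a, b), c);  // two roundings; never contracted
    out[1] = __fmaf_rn(a, b, c);             // one rounding; always fused
}

// Hypothetical host helper: runs the kernel and copies both results back.
cudaError_t compare_on_device(float a, float b, float c, float result[2])
{
    float *d_out = NULL;
    cudaError_t err = cudaMalloc(&d_out, 2 * sizeof(float));
    if (err != cudaSuccess) return err;

    compare<<<1, 1>>>(a, b, c, d_out);
    err = cudaGetLastError();                // catches launch failures
    if (err == cudaSuccess)
        err = cudaMemcpy(result, d_out, 2 * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err;
}

int main()
{
    float a = 1.0f + 1.0f / 4096.0f;     // 1 + 2^-12 (illustrative input)
    float c = -(1.0f + 1.0f / 2048.0f);  // -(1 + 2^-11)
    float r[2];

    cudaError_t err = compare_on_device(a, a, c, r);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("separate mul+add: %g\n", r[0]);
    printf("fused FMA:        %g\n", r[1]);
    return 0;
}
```

In ordinary code written as a * b + c, whether the fused form is used is up to the compiler. If contraction is the suspect, building the optimized version with nvcc's -fmad=false is one way to test that hypothesis without enabling -G.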

(3) Presumably -G affects the code generation of both the nvvm and ptxas portions of the toolchain, as both nvvm and ptxas are optimizing compilers. So changing the ptxas switch alone is unlikely to result in the same machine code.

The most likely cause of your troubles is a latent bug in the code which is exposed at higher optimization levels. This could be due to invoking undefined behavior as defined by C++, or something CUDA-specific, such as violating the rules on the use of synchronization barriers, or using warp-synchronous programming without proper safeguards. A compiler issue is possible, but unlikely.
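As a rough illustration of the barrier rules mentioned above, here is a minimal sketch of a shared-memory tree reduction. It is not the poster's code; the names block_sum, sum_of_ones and BLOCK are hypothetical. The comments mark the two barriers whose removal, or placement inside the divergent branch, would introduce the kind of latent race that may go unnoticed in a debug build and show up only once optimizations change the instruction schedule.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256  // hypothetical block size; must be a power of two

// Sums BLOCK inputs per thread block with a shared-memory tree reduction.
__global__ void block_sum(const float *in, float *out)
{
    __shared__ float buf[BLOCK];
    unsigned tid = threadIdx.x;

    buf[tid] = in[blockIdx.x * BLOCK + tid];
    __syncthreads();  // every slot is written before any thread reads another's

    for (unsigned stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        // The barrier is outside the `if` so that all threads reach it.
        // Omitting it lets a later step read a partial sum that another
        // warp has not written yet.
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];
}

// Hypothetical host helper: reduces BLOCK ones; returns -1 on a CUDA error.
float sum_of_ones()
{
    float h_in[BLOCK], h_out = -1.0f;
    for (int i = 0; i < BLOCK; ++i) h_in[i] = 1.0f;

    float *d_in = NULL, *d_out = NULL;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1.0f;
    }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    block_sum<<<1, BLOCK>>>(d_in, d_out);
    if (cudaMemcpy(&h_out, d_out, sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        h_out = -1.0f;
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    printf("block sum = %g (expected %d)\n", sum_of_ones(), BLOCK);
    return 0;
}
```

The racecheck tool of cuda-memcheck is designed to flag shared-memory hazards of this kind, so a variant of the kernel with a barrier removed would be a reasonable thing to run it against.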

I would suggest using code instrumentation and the CUDA debugger to get to the bottom of this.

I finally managed to track down the bug: it was indeed a matter of race conditions surfacing once the optimizations kicked in.

I believe the main problem here was my own limited knowledge of the cuda-memcheck debugger, as running cuda-memcheck --tool racecheck gave away the crucial points right away.

Anyway, thank you for your reply!