printf in kernel changes results

We were debugging a kernel with a printf statement. The results were good, so we removed it, and now the results have changed.

However, even if we put the printf inside an if conditional that should always be false, we get the expected results again. We assume it's some sort of race condition? Not sure; any help is welcome. The code has a lot of proprietary stuff in it, so it is hard to share, but we have narrowed the change down for certain to the printf statement not being there. Below is my card and driver information:

Device 0: "GeForce GTX 570"
CUDA Driver Version / Runtime Version 4.10 / 4.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1280 MBytes (1341849600 bytes)
(15) Multiprocessors x (32) CUDA Cores/MP: 480 CUDA Cores
GPU Clock Speed: 1.46 GHz
Memory Clock rate: 1900.00 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 655360 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:

Sounds like a “Heisenbug” of sorts (the moment you go looking for the problem, here with printf, it disappears). For serial code on the CPU I would suspect uninitialized data or out-of-bounds accesses. This also applies to parallel code on the GPU, and in addition there is the possibility of a race condition, as you note. I would suggest looking for a missing __syncthreads(), a __syncthreads() in divergent control flow, or multiple threads writing to the same location. Those are the most common scenarios in my experience. Have you tried running with cuda-memcheck to rule out out-of-bounds accesses?
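To illustrate the "missing __syncthreads()" case, here is a minimal sketch. It is not the poster's code: the kernel name shift_left, the helper run_shift_left, and the block size N are made up for this example. Each thread writes one element to shared memory and then reads its neighbour's element. Without the barrier between the write and the read, a thread in one warp may read a slot that a thread in another warp has not written yet, and anything that perturbs timing or code generation (such as a printf) can hide or expose that.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256  // hypothetical block size for this sketch

// Each thread stores its input in shared memory, then reads the value
// stored by its right-hand neighbour (wrapping around at the end).
__global__ void shift_left(const int *in, int *out)
{
    __shared__ int tile[N];
    int t = threadIdx.x;

    tile[t] = in[t];

    // Remove this barrier and the read below races with the write above:
    // the neighbour may belong to a warp that has not executed its store yet.
    __syncthreads();

    out[t] = tile[(t + 1) % N];
}

// Hypothetical host helper: runs the kernel once and returns the number of
// wrong elements, or -1 if a CUDA call failed.
int run_shift_left()
{
    int h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = i;

    int *d_in = NULL, *d_out = NULL;
    if (cudaMalloc((void **)&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d_out, sizeof(h_out)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    shift_left<<<1, N>>>(d_in, d_out);
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (h_out[i] != (i + 1) % N) ++mismatches;
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_shift_left());
    return 0;
}
```

Note that a racy version of this kernel may still produce correct output on many runs, which is exactly what makes this kind of bug hard to pin down.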

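Race conditions are a common and tricky class of bug, and they can be masked or exposed by printf statements, because adding a printf changes both the generated code and the timing of the threads.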

A lot of bugs that seem to obviously be a driver or compiler bug, such as this one, turn out to actually be a race condition in my experience (and those are usually really hard to find). If all else fails, try adding a lot of __syncthreads() calls, even where you think you don't need them, to narrow down where one might be needed.
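One caveat when sprinkling barriers around: every thread of the block has to reach each __syncthreads(), otherwise the behaviour is undefined. A common way to break that is an early return for out-of-range threads placed before the barrier. The sketch below shows one way to avoid it, by guarding the memory access instead of the barrier. The names block_sums, run_block_sums, BLOCK and the problem size are assumptions for illustration only, and the serial sum by thread 0 is chosen for clarity, not speed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256  // hypothetical block size for this sketch

// Writes the sum of each block's slice of `in` to out[blockIdx.x].
__global__ void block_sums(const int *in, int *out, int n)
{
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    // Do NOT write `if (i >= n) return;` here: in the last block some threads
    // would exit and never reach the barrier below. Guard the load instead.
    tile[t] = (i < n) ? in[i] : 0;

    __syncthreads();  // reached by every thread of the block

    if (t == 0) {
        int sum = 0;
        for (int k = 0; k < BLOCK; ++k) sum += tile[k];
        out[blockIdx.x] = sum;
    }
}

// Hypothetical host helper: sums 1000 ones on the GPU and returns the total,
// or -1 if a CUDA call failed.
int run_block_sums()
{
    const int n = 1000;                            // deliberately not a multiple of BLOCK
    const int blocks = (n + BLOCK - 1) / BLOCK;    // 4 blocks, the last one partly empty

    int h_in[n];
    int h_out[blocks];
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int *d_in = NULL, *d_out = NULL;
    if (cudaMalloc((void **)&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d_out, blocks * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }

    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    block_sums<<<blocks, BLOCK>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out, d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int total = 0;
    for (int b = 0; b < blocks; ++b) total += h_out[b];
    return total;
}

int main()
{
    printf("total: %d\n", run_block_sums());
    return 0;
}
```

If adding a barrier at some point makes the wrong results go away, that points to a missing synchronization between the code before and after that point.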