Code correct when not run in cuda-gdb, but incorrect when run in cuda-gdb

Hello,

I recently moved from CUDA 10.2 to V11.0.167. I am developing a fairly large application, and since I am still building experience with CUDA I often have to fall back on the debugger.

I have recently encountered a particularly vexing behavior of the debugger and cuda-memcheck. My program produces the expected result when not run in the debugger, but when run in the debugger it is as if every thread beyond the first 32 (i.e., beyond the first warp) simply ignores the work it was assigned. I am running 16 blocks of 64 threads.
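For reference, the launch of the kernel shown further down is roughly this (simplified; parray_d and bank_d stand in for my actual device pointers):

// 16 blocks of 64 threads; pointer names are placeholders for my real ones
put_bank_sites<<<16, 64>>>(parray_d, bank_d);
// wait for the kernel to finish before reading results back
cudaDeviceSynchronize();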

Is it possible for a programming mistake, like a race condition, to induce this? I do not have any shared data, but I am using atomicAdd (a sketch of the pattern is just below). That said, I don't think races should arise from atomicAdd itself.

Or should the debugger/memcheck never affect the program's result, no matter what? To be clear, neither tool reports any API or memory errors here.
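The atomicAdd usage follows roughly this pattern (a simplified sketch of the idea, not my exact code, and it assumes next_particle is a plain int counter):

// Simplified sketch, not my exact code: each thread reserves a slot in the
// fission bank by atomically incrementing the shared counter, then writes
// its particle data into that slot.
int slot = atomicAdd(&bank->next_particle, 1);
if (slot < bank->particle_per_generation) {
  // ... write this thread's site into the reserved slot ...
}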

EDIT: I have also tested this with the new Compute Sanitizer. My output is incorrect when running under the sanitizer, but correct when run without it. Moreover, the sanitizer detects zero errors.
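(For reference, I invoke it simply as compute-sanitizer ./my_app, where ./my_app is a placeholder for my actual binary.)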

EDIT 2: Below are the start of my kernel and some debugger output watching the value of “tid”, which I use to index into an array. Somehow it changes for no reason at all on this thread! The kernel begins:

__global__ void put_bank_sites(ParticleArray_d* parray, FissionBank_d* bank) {
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  int n_particles = parray->n_particles_;

  while (tid < n_particles) {
    // Check if slot in particle array is open or we're done
    if (parray->wgt_[tid] != 0) {   
      tid += gridDim.x * blockDim.x;
      continue;
    }
    
    // Exit if the particle array is full
    if (bank->next_particle == bank->particle_per_generation) return;

Yes, I know there is horrible divergence here; this is just a starting point. Anyway, stepping through the lines above, the debugger gives this:

90	 int n_particles = parray->n_particles_;
(cuda-gdb) p tid
$1 = 0
(cuda-gdb) n
92	 while (tid < n_particles) {
(cuda-gdb) p tid
$2 = 0
(cuda-gdb) n
94	   if (parray->wgt_[tid] != 0) {
(cuda-gdb) p tid
$3 = 0
(cuda-gdb) n
100	   if (bank->next_particle == bank->particle_per_generation) return;
(cuda-gdb) p tid
$4 = 32767

Um, what the heck! The value of “tid” has changed from zero to 32767 for no reason whatsoever!! This does not seem to happen if I run without memcheck or the debugger. Is this my fault? What have I done to anger the GPU gods?