Syncthreads before Printing

Hi,
I’m writing a reduction kernel, as follows

__global__ void reduction_divergence(int *input, int *output, int len) {
    unsigned int segmentStart = blockIdx.x * BLK_SIZE * 2;
    //compute data index for each thread
    unsigned int tid  = threadIdx.x;

    __shared__ int smem[BLK_SIZE*2];
  

  1.  smem[tid]            = segmentStart + tid < len ? input[segmentStart + tid] : 0;
  2.  smem[tid+blockDim.x] = segmentStart + blockDim.x + tid < len ? input[segmentStart + blockDim.x + tid] : 0;

    __syncthreads();

    if (tid == 0 && blockIdx.x == 0) {
      for (int i=0; i<BLK_SIZE*2; ++i)
  3.     printf("%d,",smem[i]);
    }

    unsigned int iidx = threadIdx.x * 2;
    for (int stride = 1; stride <= blockDim.x; stride *= 2) {
      if (tid % stride != 0)
        continue;

        smem[iidx] += smem[iidx+stride];
        __syncthreads();
    }

    if (tid == 0) {
      output[blockIdx.x] = smem[iidx];
    }
    return;
}

For simplicity, I’m just trying with 130 elements and my BLOCK_SIZE is 128. I"m trying to load all 130 elements into shared memory before doing reduction.
Simply, since my block size is 128. I first load 128 elements and then I shift each thread index by 128 (line marked as 2). The issue is even after issuing __syncthreads after loading into shared memory, if I print the values before doing the reduction I’m not seeing the correct values.
For example, the print statement should print values from 0 to 129 and followed by zeros, I understand the values can be jumbled, but I’m getting values out of the range that is greater than 129, (instead of 64, I get 129, instead of 66, I get 133 and so on
to be exact, I got the following values:

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,129,65,133,67,137,69,141,71,145,73,149,75,153,77,157,79,161,81,165,83,169,85,173,87,177,89,181,91,185,93,189,95,193,97,197,99,201,101,205,103,209,105,213,107,217,109,221,111,225,113,229,115,233,117,237,119,241,121,245,123,249,125,253,127,257,129,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

If you see in this list, I’m also getting values in 200’s, I’m not sure why this is happening. However, I cross-verified my reduction computation, my cpu and gpu results match. Hence, I think correct values are stored in shared memory eventually, I’m not sure why print statements can’t able to print the correct values even after sync threads.
It would be very much helpful if you could help me with this.
thanks

other threads can race ahead and modify smem contents while thread 0 is spending time doing printing.

Put a __syncthreads() call after the if-statement that has the line 3 printf:

   printf("%d,",smem[i]);
}
__syncthreads();  // add this

Also use of syncthreads in a conditional block is only legal if all threads evaluate the conditional the same way. You are violating that rule, although it occurs after the printf statement:

  if (tid % stride != 0)  // this conditional prevents all threads from participating
    continue;

    smem[iidx] += smem[iidx+stride];
    __syncthreads();  // here
}
1 Like

Ugh. I completely forgot that threads can race and overwrite the values to SMEM. Thank you very much for your kind help.
Also you mentioned “Also use of syncthreads in a conditional block is only legal if all threads evaluate the conditional the same way”, if I can’t issue sync threads after each iteration, how I can be sure before starting the next iteration all my reduced values are correct. I’m not sure how I can write a reduction kernel without divergent control flow.

Update: If I update my reduction loop as below, it is legal right?

    for (int stride = 1; stride <= blockDim.x; stride *= 2) {
      if (tid % stride == 0) {
        smem[iidx] += smem[iidx+stride];
      }
        __syncthreads();
    }

Yes, that’s a typical approach. Let all threads participate in the __syncthreads - leave it outside the conditional area. Writing a reduction kernel is covered in various places. A commonly cited tutorial is this one. There is also unit 5 of this online tutorial series.

sure thing, thanks a lot for your help, I will take a look at the tutorials.