Hi,
I’m writing a reduction kernel, as follows:
#define BLK_SIZE 128   // block size used in my test

__global__ void reduction_divergence(int *input, int *output, int len) {
    // each block reduces a segment of 2 * BLK_SIZE input elements
    unsigned int segmentStart = blockIdx.x * BLK_SIZE * 2;
    // compute data index for each thread
    unsigned int tid = threadIdx.x;
    __shared__ int smem[BLK_SIZE * 2];

    // (1) load the first half of the segment
    smem[tid] = segmentStart + tid < len ? input[segmentStart + tid] : 0;
    // (2) load the second half, shifting each thread's index by blockDim.x
    smem[tid + blockDim.x] = segmentStart + blockDim.x + tid < len ? input[segmentStart + blockDim.x + tid] : 0;
    __syncthreads();

    // (3) print shared memory from thread 0 of block 0 before reducing
    if (tid == 0 && blockIdx.x == 0) {
        for (int i = 0; i < BLK_SIZE * 2; ++i)
            printf("%d,", smem[i]);
    }

    unsigned int iidx = threadIdx.x * 2;
    for (int stride = 1; stride <= blockDim.x; stride *= 2) {
        if (tid % stride != 0)
            continue;
        smem[iidx] += smem[iidx + stride];
        __syncthreads();
    }

    if (tid == 0) {
        output[blockIdx.x] = smem[iidx];
    }
    return;
}
For simplicity, I’m just trying with 130 elements and my BLK_SIZE is 128. I’m trying to load all 130 elements into shared memory before doing the reduction.
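For context, a minimal host-side setup matching these sizes would look roughly like this (just a sketch; the buffer names h_in/h_out/d_in/d_out are illustrative, and with 130 elements and 2 * BLK_SIZE = 256 elements per block it launches a single block):

int len = 130;
int numBlocks = (len + 2 * BLK_SIZE - 1) / (2 * BLK_SIZE);   // = 1 for 130 elements

int *h_in  = (int *)malloc(len * sizeof(int));
int *h_out = (int *)malloc(numBlocks * sizeof(int));
for (int i = 0; i < len; ++i) h_in[i] = i;                   // input holds 0..129

int *d_in, *d_out;
cudaMalloc(&d_in,  len * sizeof(int));
cudaMalloc(&d_out, numBlocks * sizeof(int));
cudaMemcpy(d_in, h_in, len * sizeof(int), cudaMemcpyHostToDevice);

reduction_divergence<<<numBlocks, BLK_SIZE>>>(d_in, d_out, len);
cudaDeviceSynchronize();

cudaMemcpy(h_out, d_out, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);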
Simply put, since my block size is 128, I first load 128 elements and then shift each thread's index by 128 to load the rest (the line marked (2)). The issue is that even after calling __syncthreads() once the loads into shared memory are done, the values I print before doing the reduction are not correct.
For example, the print statement should print the values 0 to 129 followed by zeros. I understand the values could come out jumbled, but I’m getting values outside that range, i.e. greater than 129 (instead of 64 I get 129, instead of 66 I get 133, and so on).
To be exact, I got the following values:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,129,65,133,67,137,69,141,71,145,73,149,75,153,77,157,79,161,81,165,83,169,85,173,87,177,89,181,91,185,93,189,95,193,97,197,99,201,101,205,103,209,105,213,107,217,109,221,111,225,113,229,115,233,117,237,119,241,121,245,123,249,125,253,127,257,129,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
As you can see in this list, I’m also getting values in the 200s, and I’m not sure why this is happening. However, I cross-verified the reduction itself: my CPU and GPU results match. So I think the correct values do end up in shared memory eventually; what I don’t understand is why the print statement can’t show the correct values even after __syncthreads().
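For the cross-check, the CPU reference is just a sequential sum over the same input, something like this sketch (reusing the illustrative h_in/h_out names from above):

// with input 0..129 this gives 8385, which matches the GPU result I read back in h_out[0]
int cpu_sum = 0;
for (int i = 0; i < len; ++i)
    cpu_sum += h_in[i];
printf("CPU: %d, GPU: %d\n", cpu_sum, h_out[0]);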
It would be very helpful if you could help me figure out what’s going on here.
thanks