I’m having a weird issue while using volatile shared memory to accumulate values.
My kernel receives a vector whose values are segmented every 32 elements. I need to apply operations and masks, but to simplify let's assume I'm only interested in adding them, something like:
res0 = 1 + 2 + …, res1 = 32 + 33 + …, and so on.
I was thinking of using a warp-aware kernel, so each lane (threadIdx.x & 31) accumulates values into an array in shared memory.
For a single kernel launch like:
conv4<<<1,512>>>
The expected result is d_out[1]=16
but I’m getting d_out[1]=1
I have no clue what's going on. Can anyone help me, please?
Thank you very much.
__global__ void conv4(unsigned int *d_in, unsigned int *d_out){
    __shared__ unsigned int _s_partial[BLOCK_SIZE];
    __shared__ unsigned int volatile _s_warp[WARP_SIZE];

    unsigned int tx    = threadIdx.x;
    unsigned int start = blockIdx.x * blockDim.x + tx;
    unsigned int wtx   = threadIdx.x & 31;

    _s_warp[tx] = 0;
    _s_partial[tx] = d_in[start];
    __syncthreads();

    _s_warp[wtx] += _s_partial[tx];
    __syncthreads();

    d_out[start] = _s_warp[wtx];
}
Note wtx = (threadIdx.x & 31), and _s_warp has WARP_SIZE (32) positions.
I want to accumulate each lane position across all the warps in the block, so threads 0, 32, 64, … all write to the same position in the _s_warp array.
This may be a stupid question, but I want to understand the reason behind it:
Is there no way to make threads inside the same block write to the same position _s_warp[wtx]? Why is this not possible, even with a volatile array?
The problem is that the threads read from the same location before the earlier writes have landed: thread 0 reads the value, then thread 32 reads the same location before thread 0 has written its result, and likewise thread 64. So they all read 0 and add 1. `volatile` only forces each individual access to go to memory; it does not make the read-modify-write of `+=` indivisible. Atomic functions do exactly that: the memory location is held for the entire read-modify-write, so no other thread can interleave between the read and the write. (Note also that `_s_warp[tx] = 0;` writes out of bounds for tx >= 32, since `_s_warp` has only WARP_SIZE entries; only the first 32 threads should clear it.)
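A minimal sketch of the atomic version, under the assumption that BLOCK_SIZE is 512 and WARP_SIZE is 32 as in the original code (shared-memory `atomicAdd` on `unsigned int` requires compute capability 1.2 or higher):

```cuda
__global__ void conv4(unsigned int *d_in, unsigned int *d_out){
    __shared__ unsigned int _s_partial[BLOCK_SIZE];
    __shared__ unsigned int _s_warp[WARP_SIZE];   // volatile no longer needed

    unsigned int tx    = threadIdx.x;
    unsigned int start = blockIdx.x * blockDim.x + tx;
    unsigned int wtx   = tx & 31;                 // lane index within the warp

    if (tx < WARP_SIZE) _s_warp[tx] = 0;          // only 32 slots to clear
    _s_partial[tx] = d_in[start];
    __syncthreads();

    // atomicAdd makes the read-modify-write on _s_warp[wtx] indivisible,
    // so threads 0, 32, 64, ... can safely accumulate into the same slot.
    atomicAdd(&_s_warp[wtx], _s_partial[tx]);
    __syncthreads();

    d_out[start] = _s_warp[wtx];
}
```

With `conv4<<<1,512>>>` and all inputs set to 1, each of the 16 warps now contributes 1 to every slot, so d_out[1] comes out as 16 instead of 1.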