Shared memory issue

Hello,

I’m having a weird issue while using volatile shared memory to accumulate values.

My kernel receives a vector whose values are segmented every 32 elements. I need to apply operations and masks, but to simplify, let's assume I'm only interested in adding them, something like:

Position: 0,  1,  2, …, 32, 33, 34, …
Value:    1, 32, 64, …,  2, 33, 65, …

Then I need res0 = 1 + 2 + …, res1 = 32 + 33 + …, and so on.
I was thinking of using a warp-aware kernel, so that each lane (threadIdx.x & 31) accumulates its values into an array in shared memory.

For a single kernel launch like:
conv4<<<1,512>>>

The expected result is d_out[1] = 16, but I'm getting d_out[1] = 1.

I have no clue what's going on. Can anyone help me, please?

Thank you very much.

#define BLOCK_SIZE 512
#define WARP_SIZE 32

__global__ void conv4(unsigned int *d_in, unsigned int *d_out){

  __shared__ unsigned int          _s_partial[BLOCK_SIZE];
  __shared__ volatile unsigned int _s_warp[WARP_SIZE];

  unsigned int tx    = threadIdx.x;
  unsigned int start = blockIdx.x * blockDim.x + tx;

  unsigned int wtx = threadIdx.x & 31;   // lane index within the warp

  _s_warp[wtx]   = 0;                    // clear the 32 per-lane accumulators
  _s_partial[tx] = d_in[start];

  __syncthreads();

  _s_warp[wtx] += _s_partial[tx];        // every warp's lane wtx updates the same slot

  __syncthreads();

  d_out[start] = _s_warp[wtx];
}
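
For reference, this is roughly the host side I'm testing with (a sketch; I fill d_in with ones for the test, so with 16 warps of 32 threads every lane should end at 16):

#include <cstdio>

int main(void){
  unsigned int h_in[BLOCK_SIZE], h_out[BLOCK_SIZE];
  for (int i = 0; i < BLOCK_SIZE; ++i) h_in[i] = 1;      // all ones: each lane should sum to 16

  unsigned int *d_in, *d_out;
  cudaMalloc(&d_in,  BLOCK_SIZE * sizeof(unsigned int));
  cudaMalloc(&d_out, BLOCK_SIZE * sizeof(unsigned int));
  cudaMemcpy(d_in, h_in, BLOCK_SIZE * sizeof(unsigned int), cudaMemcpyHostToDevice);

  conv4<<<1, BLOCK_SIZE>>>(d_in, d_out);                 // one block of 512 threads

  cudaMemcpy(h_out, d_out, BLOCK_SIZE * sizeof(unsigned int), cudaMemcpyDeviceToHost);
  printf("d_out[1] = %u\n", h_out[1]);                   // expected 16, observed 1

  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}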

For the code you wrote, you should either use atomic functions or change this part

__syncthreads();

_s_warp[wtx]+=_s_partial[tx];

__syncthreads();
to a loop executed by some of the threads.

__syncthreads();
if (tx < WARP_SIZE)                   // one thread per lane (tx = 0..31)
{
  for (int i = tx; i < blockDim.x; i += 32)
  {
    _s_warp[tx] += _s_partial[i];     // only this thread ever touches _s_warp[tx]
  }
}
__syncthreads();
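
With atomics, the same section would look something like this (a sketch; note that atomicAdd takes a non-volatile pointer, so either drop the volatile qualifier from _s_warp or cast it away):

__syncthreads();

// atomicAdd makes each read-modify-write on _s_warp[wtx] indivisible,
// so updates from different warps can no longer be lost.
// The cast discards the volatile qualifier, which atomicAdd does not accept.
atomicAdd((unsigned int *)&_s_warp[wtx], _s_partial[tx]);

__syncthreads();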

Note that wtx = (threadIdx.x & 31), and _s_warp has WARP_SIZE positions (32).
I want to accumulate per lane across the warps, so threads 0, 32, and 64 all write to the same position in the _s_warp array.

This may be a stupid question, but I want to understand the reason behind it:
is there no way to make threads inside the same block write to the same position _s_warp[wtx]? Why is this not possible, even with a volatile array?

Thank you very much

Hello,

The problem is that the threads read from the same location. Thread 0 reads the value; then thread 32 reads from the same location, but before thread 0 has written its result back. The same happens with thread 64. So they all read 0 and add 1. Atomic functions ensure that the memory location stays locked until the whole read-modify-write is done.
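
To see why volatile does not help here: the single line _s_warp[wtx] += _s_partial[tx] is really three separate operations, something like:

unsigned int tmp = _s_warp[wtx];   // 1) load: threads 0, 32, 64 can all read the same old value
tmp = tmp + _s_partial[tx];        // 2) add in a register
_s_warp[wtx] = tmp;                // 3) store: the last writer wins, the other sums are lost

volatile only forces the load and the store to actually go to shared memory; it does not make the three steps one indivisible operation. That indivisibility is exactly what the atomic functions add.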

I think I get it now, thanks pasoleatis.
I'm going to rethink it.