Hello,

I’m having a weird issue while using volatile shared memory to accumulate values.

My kernel receives a vector whose values are segmented every 32 elements. I need to apply operations and masks, but to simplify, let’s assume I’m only interested in adding them, something like:

Position: 0, 1, 2, …, 32, 33, 34

Value: 1, 32, 64, …, 2, 33, 65

Then I need res0 = 1 + 2 + …, res1 = 32 + 33 + …, and so on.
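To pin down the intended result, here is a minimal host-side sketch of the sum described above. The function name `segmented_sum_reference` and the `stride` parameter are hypothetical, introduced only for illustration; the stride of 32 comes from the description.

```cuda
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: res[k] is the sum of every input element
// whose position modulo `stride` equals k (stride = 32 in the description).
std::vector<unsigned int> segmented_sum_reference(const std::vector<unsigned int>& in,
                                                  std::size_t stride = 32) {
    std::vector<unsigned int> res(stride, 0u);
    for (std::size_t pos = 0; pos < in.size(); ++pos) {
        res[pos % stride] += in[pos];  // positions 0, 32, 64, ... all land in res[0]
    }
    return res;
}
```

A reference like this could be used to check the kernel's output element by element.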

I was thinking of using a warp-aware kernel, so that each lane (threadIdx.x & 31) accumulates its values in an array located in shared memory.

For a single kernel launch like:

conv4<<<1,512>>>

The expected result is d_out[1]=16

but I’m getting d_out[1]=1

I have no clue what’s going on. Can anyone help me, please?

Thank you very much.

```
__global__ void conv4(unsigned int *d_in, unsigned int *d_out) {
    __shared__ unsigned int _s_partial[BLOCK_SIZE];
    __shared__ unsigned int volatile _s_warp[WARP_SIZE];
    unsigned int tx = threadIdx.x;
    unsigned int start = blockIdx.x * blockDim.x + tx;
    unsigned int wtx = threadIdx.x & 31;
    _s_warp[tx] = 0;
    _s_partial[tx] = d_in[start];
    __syncthreads();
    _s_warp[wtx] += _s_partial[tx];
    __syncthreads();
    d_out[start] = _s_warp[wtx];
}
```
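
One possible explanation, offered as a sketch rather than a confirmed diagnosis: `_s_warp[wtx] += ...` is a plain read-modify-write, and `volatile` does not make it atomic. With 512 threads, 16 warps update the same 32 slots, so they can overwrite each other's additions. Separately, `_s_warp[tx] = 0` indexes the array with `tx` up to 511, which is out of bounds if `WARP_SIZE` is 32. The version below uses `atomicAdd` and clears only the first `WARP_SIZE` entries. The names `conv4_atomic` and `run_conv4_atomic` are hypothetical, `WARP_SIZE` is assumed to be 32, and the input is assumed to be all ones, which is consistent with the expected value of 16.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define WARP_SIZE 32  // assumed value; the post does not show the macro definition

// Sketch of a corrected kernel: lanes with the same (threadIdx.x & 31) in
// different warps add into the same shared slot, so the update is made atomic.
__global__ void conv4_atomic(const unsigned int *d_in, unsigned int *d_out) {
    __shared__ unsigned int s_warp[WARP_SIZE];
    unsigned int tx = threadIdx.x;
    unsigned int start = blockIdx.x * blockDim.x + tx;
    unsigned int wtx = tx & (WARP_SIZE - 1);

    // Only the first WARP_SIZE threads clear the array, which stays in bounds.
    if (tx < WARP_SIZE) s_warp[tx] = 0;
    __syncthreads();

    // atomicAdd serializes the colliding read-modify-write operations.
    atomicAdd(&s_warp[wtx], d_in[start]);
    __syncthreads();

    d_out[start] = s_warp[wtx];
}

// Hypothetical host helper: fills the input with ones, launches one block of
// `threads` threads, and returns d_out[1]. Error checking is omitted for brevity.
unsigned int run_conv4_atomic(unsigned int threads) {
    std::vector<unsigned int> h_in(threads, 1u), h_out(threads, 0u);
    unsigned int *d_in = nullptr, *d_out = nullptr;
    size_t bytes = threads * sizeof(unsigned int);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    conv4_atomic<<<1, threads>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out[1];
}
```

Assuming a CUDA-capable device is available, `run_conv4_atomic(512)` should return 16, since each of the 16 warps contributes one to slot 1. Atomics on shared memory are simple but serialize conflicting updates; a per-warp partial sum followed by a reduction across warps is an alternative if contention becomes a bottleneck.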