Shared memory issue

Hello,

I’m having a weird issue while using volatile shared memory to accumulate values.

My kernel receives a vector whose values are segmented every 32 elements. I need to apply operations and masks, but to simplify, let's assume I'm only interested in adding them, something like:

Position: 0,  1,  2, …, 32, 33, 34, …
Value:    1, 32, 64, …,  2, 33, 65, …

Then I need res0 = 1 + 2 + …, res1 = 32 + 33 + …, and so on.
I was thinking of using a warp-aware kernel, so that each lane (threadIdx.x & 31) accumulates its values into an array in shared memory.

For a single kernel launch like:
conv4<<<1,512>>>

The expected result is d_out[1] = 16, but I'm getting d_out[1] = 1.

I have no clue what's going on. Can anyone help me, please?

Thank you very much.

#define BLOCK_SIZE 512
#define WARP_SIZE 32

__global__ void conv4(unsigned int *d_in, unsigned int *d_out){

  __shared__ unsigned int          _s_partial[BLOCK_SIZE];
  __shared__ volatile unsigned int _s_warp[WARP_SIZE];

  unsigned int tx    = threadIdx.x;
  unsigned int start = blockIdx.x * blockDim.x + tx;

  unsigned int wtx = threadIdx.x & 31;   // lane index within the warp

  _s_warp[wtx]   = 0;                    // clear the 32 per-lane accumulators
  _s_partial[tx] = d_in[start];

  __syncthreads();

  _s_warp[wtx] += _s_partial[tx];        // every warp's lane wtx updates the same slot

  __syncthreads();

  d_out[start] = _s_warp[wtx];
}
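
For reference, this is roughly the host side I'm testing with (a sketch; I fill d_in with ones for the test, so with 16 warps of 32 threads every lane should end at 16):

#include <cstdio>

int main(void){
  unsigned int h_in[BLOCK_SIZE], h_out[BLOCK_SIZE];
  for (int i = 0; i < BLOCK_SIZE; ++i) h_in[i] = 1;      // all ones: each lane should sum to 16

  unsigned int *d_in, *d_out;
  cudaMalloc(&d_in,  BLOCK_SIZE * sizeof(unsigned int));
  cudaMalloc(&d_out, BLOCK_SIZE * sizeof(unsigned int));
  cudaMemcpy(d_in, h_in, BLOCK_SIZE * sizeof(unsigned int), cudaMemcpyHostToDevice);

  conv4<<<1, BLOCK_SIZE>>>(d_in, d_out);                 // one block of 512 threads

  cudaMemcpy(h_out, d_out, BLOCK_SIZE * sizeof(unsigned int), cudaMemcpyDeviceToHost);
  printf("d_out[1] = %u\n", h_out[1]);                   // expected 16, observed 1

  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}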

For the code you wrote, you should either use atomic functions or change this part

__syncthreads();

_s_warp[wtx]+=_s_partial[tx];

__syncthreads();
to a loop executed by some of the threads.

__syncthreads();
if (tx < WARP_SIZE)                   // one thread per lane (tx = 0..31)
{
  for (int i = tx; i < blockDim.x; i += 32)
  {
    _s_warp[tx] += _s_partial[i];     // only this thread ever touches _s_warp[tx]
  }
}
__syncthreads();
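
With atomics, the same section would look something like this (a sketch; note that atomicAdd takes a non-volatile pointer, so either drop the volatile qualifier from _s_warp or cast it away):

__syncthreads();

// atomicAdd makes each read-modify-write on _s_warp[wtx] indivisible,
// so updates from different warps can no longer be lost.
// The cast discards the volatile qualifier, which atomicAdd does not accept.
atomicAdd((unsigned int *)&_s_warp[wtx], _s_partial[tx]);

__syncthreads();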

Note that wtx = (threadIdx.x & 31), and _s_warp has WARP_SIZE positions (32).
I want to accumulate per lane across the warps, so threads 0, 32, and 64 all write to the same position in the _s_warp array.

This may be a stupid question, but I want to understand the reason behind it:
is there no way to make threads inside the same block write to the same position _s_warp[wtx]? Why is this not possible, even with a volatile array?

Thank you very much

Hello,

The problem is that the threads read from the same location. Thread 0 reads the value; then thread 32 reads from the same location, but before thread 0 has written its result back. The same happens with thread 64. So they all read 0 and add 1. Atomic functions ensure that the memory location stays locked until the whole read-modify-write is done.
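
To see why volatile does not help here: the single line _s_warp[wtx] += _s_partial[tx] is really three separate operations, something like:

unsigned int tmp = _s_warp[wtx];   // 1) load: threads 0, 32, 64 can all read the same old value
tmp = tmp + _s_partial[tx];        // 2) add in a register
_s_warp[wtx] = tmp;                // 3) store: the last writer wins, the other sums are lost

volatile only forces the load and the store to actually go to shared memory; it does not make the three steps one indivisible operation. That indivisibility is exactly what the atomic functions add.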

I think I get it now, thanks pasoleatis.
I'm going to rethink it.