shared memory writes

quak · December 27, 2007, 5:26pm

hi,

I intend to write with all threads of a wrap to the same variable in shared memory simultaneously:

static __shared__ var;

__global__ void kernelFunc(){

 var = threadIdx.x;

  __syncthreads();

}

Do you know whether the result of this write is guaranteed to be one of the values I wrote ( i.e. one of the threadIds) or is the result undefined ?

And do you know whether this write is serialized (i.e. there are bank conflicts) or does it happen concurrently ?

Thanks, quak

paulius · December 29, 2007, 12:20am

Why would you want to do this? I can’t see how this could be useful in a practical application. Please post if there is a use for this.

Whenever there are smem conflicts, the conflicting accesses are serialized.

Paulius

quak · December 29, 2007, 5:34pm

Example:

I have all threads of a wrap evaluating a condition like a<b with a and b being different for each thread. If any of these conditions are true I want all threads branching to another piece of code. The most efficient way I can see to achieve this is to conditionally write to a variable in shared memory with all threads, syncronize the threads and then branch on this variable:

static __shared__ var = 0;

__global__ void kernelFunc(){

 if( a<b )

      var = 1;

 __syncthreads();

 if( var )

    doThis();

  else

    doThat();

}

Another less abstract example:

static __shared__ var = 1;

__global__ void kernelFunc(){

 while( var ){

   do something...

   if( a<b )

      var = 0;

   __syncthreads();

  }

}

seibert · December 29, 2007, 5:49pm

I think the most efficient way to do this correctly (since the code you show won’t work) will be a parallel reduction algorithm using the logical OR operation:

[url=“http://www.gpgpu.org/sc2007/SC07_CUDA_4_DataParallel_Owens.pdf”]http://www.gpgpu.org/sc2007/SC07_CUDA_4_Da...allel_Owens.pdf[/url] (starting on slide 15)

The only potential problem I can see is that the reduction will initially need as many shared memory locations as you have threads in the block.

paulius · December 30, 2007, 12:02am

Seibert is right, you can’t use simultaneous writes to the same smem location to achieve what you need. Reduction is the way to go. The other alternative, atomic writes to gmem, would be more expensive than reduction in smem.

Paulius

quak · December 30, 2007, 12:34pm

I changed the “template” project in the CUDA sdk to test concurrent writes to the same smem location:

__global__ void

testKernel( unsigned int* g_idata, unsigned int* g_odata) 

{

  // shared memory

  // the size is determined by the host application

  extern  __shared__  unsigned int sdata[];

 // access thread id

  const unsigned int tid = threadIdx.x;

  // access number of threads in this block

  const unsigned int num_threads = blockDim.x;

 for( int i=0; i<num_threads; i++ )

    SDATA(i) = tid;

 __syncthreads();

 // write data to global memory

  g_odata[tid] = SDATA(tid);

}

I also measured the performance of this piece of code by wraping it with a loop.

For comparison I changed the SDATA(i) = tid; line to SDATA(tid) = tid; (i.e. guaranteed bank-conflict free).

The results I got (in release mode):

The g_odata array elements all contained the value ‘31’ i.e. the last thread index.
The performance decreased by a factor of 3.6-3.8 compared to the bank-conflict free version.

The way I interpret this:

The write accsesses to the same smem location are serialized. They are ordered by the thread id, the thread with the smallest id writing first.

I am aware of the fact that this test only has a limited scope and might not take all factors into account and that my conclusion thus might be wrong.

What do you think of it ?

DenisR · December 30, 2007, 2:57pm

As Paulius said, concurrent write accesses to the same location in shared memory are serialized. So yes, you are right. About the ordering: It might be true that within a warp the ordering is like you say, but I would not count on that being always true, also warp scheduling might be more unpredictable.

Topic		Replies	Views
Good programming practice Writing shared & global memory CUDA Programming and Performance	13	7953	July 20, 2007
setting bits in shared memory CUDA Programming and Performance	16	17114	June 6, 2007
Shared memory write performance CUDA Programming and Performance	6	777	April 18, 2017
Deliberate race condition CUDA Programming and Performance	4	25	January 14, 2025
warp synchronization test CUDA Programming and Performance	5	1665	September 2, 2014
Question about shared memory usage How to use as reg and volatile effect CUDA Programming and Performance	11	6897	November 28, 2007
Shared memory write conflicts Looking for a little help... CUDA Programming and Performance	5	4917	September 7, 2007
Output through shared memory CUDA Programming and Performance	6	4144	June 17, 2010
shared memory intra-warp conflicts summing into shared memory, how? CUDA Programming and Performance	2	2793	September 5, 2009
Getting wrong output from CUDA kernel CUDA Programming and Performance	6	8297	April 15, 2011

shared memory writes

Related topics