shared memory writes

hi,

I intend to have all threads of a warp write to the same variable in shared memory simultaneously:

__global__ void kernelFunc(){
  __shared__ int var;   // needs a type; __shared__ variables belong inside the kernel
  var = threadIdx.x;    // every thread of the warp writes to the same location
  __syncthreads();
}

Do you know whether the result of this write is guaranteed to be one of the values I wrote (i.e. one of the thread IDs), or is the result undefined?

And do you know whether this write is serialized (i.e. whether there are bank conflicts), or does it happen concurrently?

Thanks, quak

Why would you want to do this? I can’t see how this could be useful in a practical application. Please post if there is a use for this.

Whenever there are smem conflicts, the conflicting accesses are serialized.

Paulius

Example:

I have all threads of a warp evaluating a condition like a<b, with a and b being different for each thread. If any of these conditions is true, I want all threads to branch to another piece of code. The most efficient way I can see to achieve this is to conditionally write to a variable in shared memory from all threads, synchronize the threads, and then branch on this variable:

__global__ void kernelFunc(){
  __shared__ int var;      // no initializer allowed on __shared__ variables
  if( threadIdx.x == 0 )
    var = 0;               // initialize from a single thread instead
  __syncthreads();
  if( a<b )
    var = 1;
  __syncthreads();
  if( var )
    doThis();
  else
    doThat();
}

Another less abstract example:

__global__ void kernelFunc(){
  __shared__ int var;
  if( threadIdx.x == 0 )
    var = 1;               // again, initialize from a single thread
  __syncthreads();
  while( var ){
    // do something...
    if( a<b )
      var = 0;
    __syncthreads();
  }
}

I think the most efficient way to do this correctly (since the code you show won’t work) will be a parallel reduction algorithm using the logical OR operation:

http://www.gpgpu.org/sc2007/SC07_CUDA_4_DataParallel_Owens.pdf (starting on slide 15)

The only potential problem I can see is that the reduction will initially need as many shared memory locations as you have threads in the block.
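A minimal sketch of such an OR reduction (my own illustration, not taken from the slides; it assumes blockDim.x is a power of two, that the kernel is launched with blockDim.x * sizeof(int) bytes of shared memory, and that a, b, doThis() and doThat() are the per-thread values and placeholders from quak's example):

__global__ void kernelFunc(){
  extern __shared__ int flags[];        // one int per thread in the block
  const unsigned int tid = threadIdx.x;
  flags[tid] = ( a<b ) ? 1 : 0;         // each thread stores its own predicate
  __syncthreads();
  // tree reduction: halve the active range each step, OR-ing pairs
  for( unsigned int s = blockDim.x/2; s > 0; s >>= 1 ){
    if( tid < s )
      flags[tid] |= flags[tid + s];
    __syncthreads();
  }
  if( flags[0] )                        // 1 iff any thread's condition held
    doThis();
  else
    doThat();
}

After the reduction every thread reads flags[0], so the whole block takes the same branch without any conflicting writes.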

Seibert is right: you can't use simultaneous writes to the same smem location to achieve what you need. Reduction is the way to go. The other alternative, atomic writes to gmem, would be more expensive than a reduction in smem.

Paulius
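For completeness, a rough sketch of the atomic alternative Paulius mentions (my assumption of what is meant; atomicOr on a 32-bit word in global memory requires compute capability 1.1 or higher):

__global__ void kernelFunc( int* g_flag ){
  // g_flag is assumed to be zeroed by the host before the launch
  if( a<b )
    atomicOr( g_flag, 1 );   // every "true" thread sets the same flag
}

Each thread whose condition holds issues an atomic read-modify-write on the same global word, which is why this tends to be more expensive than a reduction in smem.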

I changed the “template” project in the CUDA SDK to test concurrent writes to the same smem location:

#define SDATA( index ) sdata[index]   // as in the SDK template (bank checker disabled in release)

__global__ void
testKernel( unsigned int* g_idata, unsigned int* g_odata )
{
  // shared memory; the size is determined by the host application
  extern __shared__ unsigned int sdata[];

  // access thread id
  const unsigned int tid = threadIdx.x;

  // access number of threads in this block
  const unsigned int num_threads = blockDim.x;

  // every thread writes its tid to every location, so each location
  // receives concurrent writes from all threads of the warp
  for( unsigned int i = 0; i < num_threads; i++ )
    SDATA(i) = tid;

  __syncthreads();

  // write data to global memory
  g_odata[tid] = SDATA(tid);
}

I also measured the performance of this piece of code by wrapping it in a loop.

For comparison, I changed the SDATA(i) = tid; line to SDATA(tid) = tid; (i.e. guaranteed bank-conflict-free).
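For reference, a hypothetical host-side sketch of such a timing loop using CUDA events (this is my assumption of a harness, not the actual test code; it assumes d_idata and d_odata have already been allocated with cudaMalloc):

cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );
for( int run = 0; run < 1000; run++ )
  testKernel<<< 1, 32, 32 * sizeof(unsigned int) >>>( d_idata, d_odata );
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );

float ms;
cudaEventElapsedTime( &ms, start, stop );
printf( "average kernel time: %f ms\n", ms / 1000.0f );   // needs <stdio.h>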

The results I got (in release mode):

1. The g_odata array elements all contained the value ‘31’, i.e. the index of the last thread in the warp.

2. The performance decreased by a factor of 3.6-3.8 compared to the bank-conflict-free version.

The way I interpret this:

The write accesses to the same smem location are serialized. They are ordered by thread id, with the thread with the smallest id writing first.

I am aware that this test has limited scope, might not take all factors into account, and that my conclusion might therefore be wrong.

What do you think of it ?

As Paulius said, concurrent write accesses to the same location in shared memory are serialized, so yes, you are right. About the ordering: it might be true that within a warp the ordering is as you describe, but I would not count on that always being the case; warp scheduling might also be less predictable than this test suggests.