I have all threads of a warp evaluating a condition like a&lt;b, with a and b being different for each thread. If any of these conditions is true, I want all threads to branch to another piece of code. The most efficient way I can see to achieve this is to have all threads conditionally write to a variable in shared memory, synchronize the threads, and then branch on this variable:
__global__ void kernelFunc(){
    __shared__ int var;        // __shared__ variables are declared inside the
                               // kernel and cannot have static initializers
    if( threadIdx.x == 0 )
        var = 0;
    __syncthreads();
    if( a<b )
        var = 1;
    __syncthreads();
    if( var )
        doThis();
    else
        doThat();
}
Another less abstract example:
__global__ void kernelFunc(){
    __shared__ int var;
    if( threadIdx.x == 0 )
        var = 1;
    __syncthreads();
    while( var ){
        // do something...
        if( a<b )
            var = 0;
        __syncthreads();
    }
}
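As a side note, for the warp-level case in the question there is also the warp vote intrinsic __any() (available from compute capability 1.2), which evaluates a predicate across all threads of a warp in hardware. A minimal sketch, not from this thread; the array names a, b and the out stand-ins for doThis()/doThat() are assumptions for illustration:

```cuda
// Hedged sketch: branch all threads of a warp on whether any thread's
// condition holds, using the __any() warp vote intrinsic (cc 1.2+).
__global__ void kernelFunc(const int* a, const int* b, int* out)
{
    const unsigned int tid = threadIdx.x;

    // __any() is non-zero in every thread of the warp
    // if a<b holds in at least one thread
    if (__any(a[tid] < b[tid]))
        out[tid] = 1;   // doThis();
    else
        out[tid] = 2;   // doThat();
}
```

Note that __any() only votes within a single warp; for a whole block you still need a reduction or a barrier-based scheme.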
I think the most efficient way to do this correctly (since the code you show won’t work) is a parallel reduction using the logical OR operation:
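A minimal sketch of such a block-wide OR reduction in shared memory; the array names and the branch bodies are placeholders, and blockDim.x is assumed to be a power of two:

```cuda
// Hedged sketch: block-wide logical-OR reduction in shared memory.
// Launch with shared memory size = blockDim.x * sizeof(int).
__global__ void kernelFunc(const int* a, const int* b, int* out)
{
    extern __shared__ int flags[];           // one flag per thread
    const unsigned int tid = threadIdx.x;

    flags[tid] = (a[tid] < b[tid]) ? 1 : 0;  // each thread's predicate
    __syncthreads();

    // tree reduction with the OR operation
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            flags[tid] |= flags[tid + s];
        __syncthreads();
    }

    // flags[0] now holds the OR over the whole block,
    // so every thread takes the same branch
    if (flags[0])
        out[tid] = 1;   // doThis();
    else
        out[tid] = 2;   // doThat();
}
```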
Seibert is right, you can’t use simultaneous writes to the same smem location to achieve what you need. Reduction is the way to go. The other alternative, atomic writes to gmem, would be more expensive than reduction in smem.
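For comparison, the atomic alternative mentioned here could look roughly like this; a sketch assuming a flag in global memory that the host zeroes before the kernel launch, with placeholder names throughout:

```cuda
// Hedged sketch of the (more expensive) atomic alternative: every
// thread whose condition holds ORs a 1 into a global flag.
// g_flag must be zeroed before the kernel launch.
__global__ void kernelFunc(const int* a, const int* b,
                           int* g_flag, int* out)
{
    const unsigned int tid = threadIdx.x;

    if (a[tid] < b[tid])
        atomicOr(g_flag, 1);
    __syncthreads();       // writes before the barrier are visible
                           // to all threads of the block after it

    if (*g_flag)
        out[tid] = 1;      // doThis();
    else
        out[tid] = 2;      // doThat();
}
```

Every qualifying thread hits the same global address, so the atomics serialize against each other in addition to paying global memory latency, which is why the smem reduction wins.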
I changed the “template” project in the CUDA SDK to test concurrent writes to the same smem location:
__global__ void
testKernel( unsigned int* g_idata, unsigned int* g_odata)
{
    // shared memory
    // the size is determined by the host application
    extern __shared__ unsigned int sdata[];

    // access thread id
    const unsigned int tid = threadIdx.x;
    // access number of threads in this block
    const unsigned int num_threads = blockDim.x;

    // every thread writes its id to every smem location
    // (SDATA(i) is the template project's macro for sdata[i])
    for( unsigned int i = 0; i < num_threads; i++ )
        SDATA(i) = tid;
    __syncthreads();

    // write data to global memory
    g_odata[tid] = SDATA(tid);
}
I also measured the performance of this piece of code by wrapping it in a loop.
For comparison I changed the SDATA(i) = tid; line to SDATA(tid) = tid; (i.e. guaranteed bank-conflict free).
The results I got (in release mode):
The g_odata array elements all contained the value ‘31’ i.e. the last thread index.
The performance decreased by a factor of 3.6-3.8 compared to the bank-conflict free version.
The way I interpret this:
The write accesses to the same smem location are serialized. They are ordered by thread id, with the thread with the smallest id writing first.
I am aware that this test has only a limited scope and might not take all factors into account, so my conclusion might be wrong.
As Paulius said, concurrent write accesses to the same location in shared memory are serialized, so yes, you are right. About the ordering: it might be true that within a warp the ordering is as you describe, but I would not count on that always holding, and warp scheduling may make it even less predictable.