I changed the “template” project in the CUDA SDK to test concurrent writes to the same smem location:
// SDATA(index) expands to sdata[index] in the SDK template's release build
#define SDATA(index) sdata[index]

__global__ void testKernel(unsigned int* g_idata, unsigned int* g_odata)
{
    // shared memory; the size is determined by the host application
    extern __shared__ unsigned int sdata[];
    // access thread id
    const unsigned int tid = threadIdx.x;
    // access number of threads in this block
    const unsigned int num_threads = blockDim.x;
    // every thread stores its id to every smem word in turn, so all
    // threads write to the same location in each iteration
    for (unsigned int i = 0; i < num_threads; i++)
        SDATA(i) = tid;
    // read back what survived the conflicting writes
    g_odata[tid] = SDATA(tid);
}
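For reference, the host side looks roughly like this; this is only a sketch, not my exact code, and the buffer names and the 32-thread configuration are illustrative (32 matches the template default and the results below):

// hypothetical host-side setup; names and sizes are illustrative
unsigned int num_threads = 32;  // one warp
unsigned int mem_size    = num_threads * sizeof(unsigned int);

unsigned int *d_idata = 0, *d_odata = 0;
cudaMalloc((void**)&d_idata, mem_size);
cudaMalloc((void**)&d_odata, mem_size);

// the third launch parameter sets the size of the extern __shared__ array
testKernel<<<1, num_threads, mem_size>>>(d_idata, d_odata);
cudaThreadSynchronize();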
I also measured the performance of this piece of code by wrapping it in a loop.
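The loop looks roughly like this (a sketch continuing from the setup above, assuming the loop goes around the kernel launch; the iteration count is just an example, not my original value):

const int NUM_ITER = 1000;  // arbitrary, for averaging

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int n = 0; n < NUM_ITER; n++)
    testKernel<<<1, num_threads, mem_size>>>(d_idata, d_odata);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("%f ms per launch\n", elapsed_ms / NUM_ITER);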
For comparison, I changed the line SDATA(i) = tid; to SDATA(tid) = tid; (i.e. guaranteed bank-conflict-free), as shown side by side below.
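So the only difference between the two versions is the store index:

for (unsigned int i = 0; i < num_threads; i++)
    SDATA(i) = tid;     // all threads store to the same word each iteration
// vs.
for (unsigned int i = 0; i < num_threads; i++)
    SDATA(tid) = tid;   // each thread stores to its own word and bank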
The results I got (in release mode):
The g_odata array elements all contained the value ‘31’, i.e. the index of the last thread.
Performance decreased by a factor of 3.6-3.8 compared to the bank-conflict-free version.
The way I interpret this:
The write accesses to the same smem location are serialized. They are ordered by thread id, with the thread with the smallest id writing first; the thread with the largest id (31) writes last, which is why its value is the one that survives.
I am aware that this test has a limited scope and might not take all factors into account, so my conclusion might be wrong.
What do you think?