Beginner question: thread synchronization with shared memory

Hey guys, a stupid question, but I got stuck once again.

I require a single float in shared memory per block, so I declared

__shared__ float s_BlockResult;

Does this create a float per thread or per block? And if per block, how do I correctly declare a shared variable per block when the number of blocks is unknown until run time?

It declares a shared variable per block which is visible to all threads in the block. You don’t need to do anything special to get one copy of that variable per block, since that is the definition of shared. :)
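For example (a minimal sketch with illustrative names, not code from this thread), every block gets its own independent copy no matter how many blocks you launch at run time:

__global__ void perBlockDemo(int *out)
{
	// One copy per block, visible to every thread in that block.
	__shared__ int s_BlockId;

	if (threadIdx.x == 0)
		s_BlockId = blockIdx.x;       // one thread initialises it

	__syncthreads();                      // make the write visible block-wide

	if (threadIdx.x == 0)
		out[blockIdx.x] = s_BlockId;  // each block reads back its own copy
}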

Do I need to call __syncthreads() after its declaration, since all threads within the block will be adding results to it?

And also, is float addition on shared memory atomic?

It isn’t atomic, and block synchronization won’t help with that sort of problem. There are shared memory atomic functions (they are described in the programming guide), but they serialize memory access and make everything very slow. There are no atomic shared memory floating point arithmetic operations.

If you are doing something like summation, you should use one shared memory entry per thread so each thread maintains a partial sum, then add a synchronization barrier, then have a warp of threads do a parallel reduction to produce the result. Use only one thread to write the result back to shared memory (if this is part of an iteration within the block), or to global memory if it is the final block result.
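A minimal sketch of that pattern (not the exact code from any thread here; TPB is an assumed power-of-two block size):

#define TPB 256  // assumed compile-time block size, a power of two

__global__ void blockSum(const float *in, float *out, int n)
{
	__shared__ float buff[TPB];

	// Each thread accumulates a partial sum in its own slot.
	float sum = 0.0f;
	for (int i = blockIdx.x * TPB + threadIdx.x; i < n; i += gridDim.x * TPB)
		sum += in[i];
	buff[threadIdx.x] = sum;

	// Barrier: every partial sum must be visible before the reduction.
	__syncthreads();

	// Tree reduction in shared memory.
	for (int s = TPB / 2; s > 0; s >>= 1) {
		if (threadIdx.x < s)
			buff[threadIdx.x] += buff[threadIdx.x + s];
		__syncthreads();
	}

	// One thread writes the block result to global memory.
	if (threadIdx.x == 0)
		out[blockIdx.x] = buff[0];
}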

You don’t need __syncthreads() after the declaration (unless the declaration also assigns a value to the variable), as the declaration by itself does not generate any code. You do, however, need __syncthreads() between accesses to the same locations from different threads.
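For instance (a contrived sketch), whenever one thread reads a shared location that another thread wrote, a barrier has to sit between the write and the read:

__global__ void rotateDemo(const float *in, float *out)
{
	__shared__ float tmp[256];  // assumes blockDim.x <= 256

	tmp[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];

	__syncthreads();  // without this, the read below may see a stale value

	// Each thread reads the slot its neighbour wrote.
	out[blockIdx.x * blockDim.x + threadIdx.x] = tmp[(threadIdx.x + 1) % blockDim.x];
}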

No, but there are atomic functions (see Appendix C of the programming guide), including addition, that operate on shared memory.
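For example (a sketch with illustrative names; the integer atomicAdd below needs hardware that supports shared memory atomics, compute capability 1.2 or later):

__global__ void countPositives(const float *in, int *out, int n)
{
	__shared__ int s_Count;

	if (threadIdx.x == 0)
		s_Count = 0;               // one thread initialises the counter
	__syncthreads();

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n && in[i] > 0.0f)
		atomicAdd(&s_Count, 1);    // updates are serialised by the hardware
	__syncthreads();

	if (threadIdx.x == 0)
		out[blockIdx.x] = s_Count; // per-block count of positive entries
}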

Oh, I’m too slow.

Thanks very much avidday :D

Does anyone know of an example of a synchronization barrier?

__syncthreads() is the basic block synchronization barrier. You can see the sort of shared memory block summation I was talking about in this thread, which does a form of dense matrix-vector multiply close to BLAS gemv.

Thanks again avidday, really appreciate your help :D I'll check out that post. Just one last question: what do you mean by parallel reduction?

Ok, I found what you meant by parallel reduction:

	// Use first warp of block to compute parallel reduction on the
	// partial sum in shared memory.
	if (threadIdx.x < 32) {
		#pragma unroll
		for (int i = 32; i < TPB; i += 32) buff[threadIdx.x] += buff[threadIdx.x + i];
	}
	if (threadIdx.x < 16) { buff[threadIdx.x] += buff[threadIdx.x + 16]; }
	if (threadIdx.x < 8)  { buff[threadIdx.x] += buff[threadIdx.x + 8]; }
	if (threadIdx.x < 4)  { buff[threadIdx.x] += buff[threadIdx.x + 4]; }
	if (threadIdx.x < 2)  { buff[threadIdx.x] += buff[threadIdx.x + 2]; }

	// Finalise and write out the results to global memory
	if (threadIdx.x == 0) {
		r[blockIdx.x] = b[blockIdx.x] - buff[0] - buff[1];
	}
}

But what happens if threadIdx.x is 32 or larger?

Nothing. All threads in the block compute their partial results and then threads 0-31 process the whole shared memory array (32 is the warp size, so there is implicit synchronization).

I think I get it now: you keep doing hierarchical reduction with different offsets. Pretty ingenious of you :D

Very elegant and ingenious, I must say :D

Final question: is it possible to declare two shared arrays externally? If so, how is it done?

Or do I declare one external array and partition it into multiple arrays using offsets?

extern __shared__ float shared_data[];

float * shared_values = (float *)shared_data;
int * shared_rowIDs = (int *)&shared_data[maxColLen * blockDim.x];
int * shared_resultOrder = (int *)&shared_data[maxColLen * blockDim.x * 2];

aw man, I wouldn’t introduce a beginner to warp-level programming

Don't worry about it, I'm a quick learner. I understood what he meant and implemented it, and it's working fine :D I just need some guidance.

You can declare as many dynamic shared memory arrays as you like, but they all wind up being set to the starting address of the same single allocation. So your second idea is roughly correct, but be careful about mixing types with potentially different sizes. Something like this:

extern __shared__ unsigned char shared_data[];

size_t offset0 = 0;
size_t offset1 = offset0 + sizeof(float) * maxColLen * blockDim.x;
size_t offset2 = offset1 + sizeof(int) * maxColLen * blockDim.x;

float * shared_values = (float *)&shared_data[offset0];
int * shared_rowIDs = (int *)&shared_data[offset1];
int * shared_resultOrder = (int *)&shared_data[offset2];

might be safer.
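On the host side, the size of that single allocation is the third launch parameter (a sketch, assuming shared_resultOrder also holds maxColLen*blockDim.x ints like the others; the kernel and argument names are illustrative):

size_t shmemBytes = sizeof(float) * maxColLen * threadsPerBlock   // shared_values
                  + sizeof(int)   * maxColLen * threadsPerBlock   // shared_rowIDs
                  + sizeof(int)   * maxColLen * threadsPerBlock;  // shared_resultOrder

myKernel<<<numBlocks, threadsPerBlock, shmemBytes>>>(d_values, d_rowIDs, d_result);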

lol, guess you forgot to paste the code :P

Thanks so much, you have been so helpful :D

No, I pasted it, then changed my mind and changed it. You caught me in mid-edit.