Beginner question: thread synchronization with shared memory

Hey guys, a stupid question, but I got stuck once again.

I require a single float in shared memory per block, so I declared

__shared__ float s_BlockResult;

Does this create a float per thread or per block? And if per block, how do I correctly declare a shared variable per block when the number of blocks is unknown until run time?

It declares a shared variable per block which is visible to all threads in the block. You don’t need to do anything special to get one copy of that variable per block, since that is the definition of shared. :)
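For example (a minimal sketch with illustrative names, not code from this thread), every block gets its own independent copy no matter how many blocks you launch at run time:

__global__ void perBlockDemo(int *out)
{
	// One copy per block, visible to every thread in that block.
	__shared__ int s_BlockId;

	if (threadIdx.x == 0)
		s_BlockId = blockIdx.x;       // one thread initialises it

	__syncthreads();                      // make the write visible block-wide

	if (threadIdx.x == 0)
		out[blockIdx.x] = s_BlockId;  // each block reads back its own copy
}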

Do I need to call __syncthreads() after its declaration, since all threads within the block will be adding results to it?

And also, is float addition on shared memory atomic?

It isn’t atomic, and block synchronization won’t help with that sort of problem. There are shared memory atomic functions (they are described in the programming guide), but they serialize memory access and make everything very slow. There are no atomic shared memory floating point arithmetic operations.

If you are doing something like summation, you should use one shared memory entry per thread so each thread maintains a partial sum, then add a synchronization barrier, then have a warp of threads do a parallel reduction to produce the result. Use only one thread to write the result back to shared memory (if this is part of an iteration within the block), or to global memory if it is the final block result.
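A minimal sketch of that pattern (not the exact code from any thread here; TPB is an assumed power-of-two block size):

#define TPB 256  // assumed compile-time block size, a power of two

__global__ void blockSum(const float *in, float *out, int n)
{
	__shared__ float buff[TPB];

	// Each thread accumulates a partial sum in its own slot.
	float sum = 0.0f;
	for (int i = blockIdx.x * TPB + threadIdx.x; i < n; i += gridDim.x * TPB)
		sum += in[i];
	buff[threadIdx.x] = sum;

	// Barrier: every partial sum must be visible before the reduction.
	__syncthreads();

	// Tree reduction in shared memory.
	for (int s = TPB / 2; s > 0; s >>= 1) {
		if (threadIdx.x < s)
			buff[threadIdx.x] += buff[threadIdx.x + s];
		__syncthreads();
	}

	// One thread writes the block result to global memory.
	if (threadIdx.x == 0)
		out[blockIdx.x] = buff[0];
}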

You don’t need __syncthreads() after the declaration (unless the declaration also assigns a value to the variable), as the declaration by itself does not generate any code. You do, however, need __syncthreads() between accesses to the same locations from different threads.
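For instance (a contrived sketch), whenever one thread reads a shared location that another thread wrote, a barrier has to sit between the write and the read:

__global__ void rotateDemo(const float *in, float *out)
{
	__shared__ float tmp[256];  // assumes blockDim.x <= 256

	tmp[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];

	__syncthreads();  // without this, the read below may see a stale value

	// Each thread reads the slot its neighbour wrote.
	out[blockIdx.x * blockDim.x + threadIdx.x] = tmp[(threadIdx.x + 1) % blockDim.x];
}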

No, but there are atomic functions (see Appendix C of the programming guide), including addition, that operate on shared memory.
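For example (a sketch with illustrative names; the integer atomicAdd below needs hardware that supports shared memory atomics, compute capability 1.2 or later):

__global__ void countPositives(const float *in, int *out, int n)
{
	__shared__ int s_Count;

	if (threadIdx.x == 0)
		s_Count = 0;               // one thread initialises the counter
	__syncthreads();

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n && in[i] > 0.0f)
		atomicAdd(&s_Count, 1);    // updates are serialised by the hardware
	__syncthreads();

	if (threadIdx.x == 0)
		out[blockIdx.x] = s_Count; // per-block count of positive entries
}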

Oh, I’m too slow.

Thanks very much avidday :D

Does anyone know of an example of a synchronization barrier?

__syncthreads() is the basic block synchronization barrier. You can see the sort of shared memory block summation I was talking about in this thread, which does a form of dense matrix-vector multiply close to BLAS gemv.

Thanks again avidday, really appreciate your help :D I'll check out that post. Just one last question: what do you mean by parallel reduction?

Ok, I found what you meant by parallel reduction:

	// Use first warp of block to compute parallel reduction on the
	// partial sum in shared memory.
	if (threadIdx.x < 32) {
		#pragma unroll
		for (int i = 32; i < TPB; i += 32) buff[threadIdx.x] += buff[threadIdx.x + i];
	}
	if (threadIdx.x < 16) { buff[threadIdx.x] += buff[threadIdx.x + 16]; }
	if (threadIdx.x < 8)  { buff[threadIdx.x] += buff[threadIdx.x + 8]; }
	if (threadIdx.x < 4)  { buff[threadIdx.x] += buff[threadIdx.x + 4]; }
	if (threadIdx.x < 2)  { buff[threadIdx.x] += buff[threadIdx.x + 2]; }

	// Finalise and write out the results to global memory
	if (threadIdx.x == 0) {
		r[blockIdx.x] = b[blockIdx.x] - buff[0] - buff[1];
	}
}

But what happens if threadIdx.x is 32 or larger?

Nothing. All threads in the block compute their partial results and then threads 0-31 process the whole shared memory array (32 is the warp size, so there is implicit synchronization).

I think I get it now: you keep doing hierarchical reduction with different offsets. Pretty ingenious of you :D

Very elegant and ingenious, I must say :D

Final question: is it possible to declare two shared arrays externally? If so, how is it done?

Or do I declare one external array and partition it into multiple arrays using offsets?

extern __shared__ float shared_data[];

float * shared_values = (float *)shared_data;
int * shared_rowIDs = (int *)&shared_data[maxColLen * blockDim.x];
int * shared_resultOrder = (int *)&shared_data[maxColLen * blockDim.x * 2];

aw man, I wouldn’t introduce a beginner to warp-level programming

Don't worry about it, I'm a quick learner. I understood what he meant and implemented it, and it's working fine :D I just need some guidance.

You can declare as many dynamic shared memory arrays as you like, but they all wind up being set to the starting address of the same single allocation. So your second idea is roughly correct, but be careful about mixing types with potentially different sizes. Something like this:

extern __shared__ unsigned char shared_data[];

size_t offset0 = 0;
size_t offset1 = offset0 + sizeof(float) * maxColLen * blockDim.x;
size_t offset2 = offset1 + sizeof(int) * maxColLen * blockDim.x;

float * shared_values = (float *)&shared_data[offset0];
int * shared_rowIDs = (int *)&shared_data[offset1];
int * shared_resultOrder = (int *)&shared_data[offset2];

might be safer.
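On the host side, the size of that single allocation is the third launch parameter (a sketch, assuming shared_resultOrder also holds maxColLen*blockDim.x ints like the others; the kernel and argument names are illustrative):

size_t shmemBytes = sizeof(float) * maxColLen * threadsPerBlock   // shared_values
                  + sizeof(int)   * maxColLen * threadsPerBlock   // shared_rowIDs
                  + sizeof(int)   * maxColLen * threadsPerBlock;  // shared_resultOrder

myKernel<<<numBlocks, threadsPerBlock, shmemBytes>>>(d_values, d_rowIDs, d_result);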

lol, guess you forgot to paste the code :P

Thanks so much, you have been so helpful :D

No, I pasted it, then changed my mind and changed it. You caught me in mid-edit.