use of shared memory

I am wondering how to use shared memory here (to get a further speedup).

The problem might be the handling of the indices, but every time I tried, the performance was exactly the same.

Does anyone have a clue how to use shared memory here? I think this should be an easy minimal example.

I am calling a function this way:

iBlocks = ceilf( (float)pNet[y].iNrOfNeurons / pProps->maxThreadsPerBlock );

iThrds 	= ceilf( (float)pNet[y].iNrOfNeurons / (float)iBlocks );

devAdaptWeights <<< iBlocks, iThrds >>>
		( ...,  // pNeuronsX, pWeightsXY, pErrorY
		pNet[y+1].iNrOfNeurons, pNet[y].iNrOfWeights, fLearningRate);

This launches the following kernel:

__global__
void devAdaptWeights(
		float *pNeuronsX, 	// neurons of layer x
		float *pWeightsXY,      // weights connecting X and X+1
		float *pErrorY, 	// errors of layer x+1
		int iSizeY, int iOffset, float fLearningRate) // nr of neurons in layer X+1, and (max) nr of neurons of layer X (filled), i.e. the array offset
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;

	if( x >= iOffset )
		return;

	for(int z = 0; z < iSizeY; z++) {
		pWeightsXY[x*iOffset+z] += fLearningRate*pNeuronsX[x]*pErrorY[z];
	}
}


  1. If pErrorY is small enough to fit into shared memory, then reading it into shared memory before the loop should help. (If it is too large to fit into shared memory all at once, it is still worth staging it through shared memory, but you will have to read a chunk at a time. Try to make sure the reads are contiguous.)

  2. If more than one thread in a block will update the same pWeightsXY[ i ], then it might be worth reading pWeightsXY into shared memory before the loop, updating the shared array inside the loop, and writing it back out after the loop.

  3. The compiler may do this for you automatically, but it could be worthwhile computing

float  factr = fLearningRate*pNeuronsX[x];

before the loop, and then using this variable inside the loop. Not sure; try it and see.

  4. I don't use the following type of kernel call (i.e. passing parts of a structure)

devAdaptWeights <<< iBlocks, iThrds >>> ( ... );

but I expect that it can often lead to half-warp reads and writes not being aligned correctly. Probably less of a problem on newer GPUs.
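To make point 1 concrete, here is a hedged sketch of the kernel with pErrorY staged through shared memory one chunk at a time. TILE is an assumed chunk size and devAdaptWeightsShared is a hypothetical name; the indexing follows the original kernel, but treat this as an untested sketch rather than a drop-in replacement:

```cuda
#define TILE 256  // assumed chunk size; must fit in shared memory

__global__ void devAdaptWeightsShared(
		float *pNeuronsX, 	// neurons of layer x
		float *pWeightsXY,      // weights connecting X and X+1
		float *pErrorY, 	// errors of layer x+1
		int iSizeY, int iOffset, float fLearningRate)
{
	__shared__ float sError[TILE];

	int x = blockIdx.x * blockDim.x + threadIdx.x;
	// Hoisted factor (point 3); 0 for out-of-range threads.
	float factr = (x < iOffset) ? fLearningRate * pNeuronsX[x] : 0.0f;

	// Stage pErrorY one TILE-sized chunk at a time.
	for (int base = 0; base < iSizeY; base += TILE) {
		int chunk = min(TILE, iSizeY - base);

		// Cooperative, contiguous load of this chunk into shared memory.
		for (int i = threadIdx.x; i < chunk; i += blockDim.x)
			sError[i] = pErrorY[base + i];
		__syncthreads();

		if (x < iOffset)
			for (int z = 0; z < chunk; z++)
				pWeightsXY[x*iOffset + base + z] += factr * sError[z];
		__syncthreads();  // don't refill sError while others still read it
	}
}
```

Note that the early return `if (x >= iOffset) return;` from the original kernel has to go: every thread in the block must reach both __syncthreads() calls, so out-of-range threads still help with the cooperative loads and simply skip the update.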

Of course, all this is only worth doing if your code is running too slowly.

I hope this helps,


On the GPU (GTX 260) the code runs approx. 3-4 times faster than on my i7 920.
I am not sure whether shared memory makes sense here, but maybe I have to use another layout of my arrays.
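One layout change might be worth trying: with the current indexing, pWeightsXY[x*iOffset+z], neighbouring threads (consecutive x) touch addresses iOffset floats apart, so the global memory accesses are not coalesced. Storing the weights transposed, so that consecutive x values are adjacent in memory, makes each access coalesced. A hedged sketch, assuming a hypothetical transposed array pWeightsYX laid out as [z][x] instead of [x][z]:

```cuda
__global__ void devAdaptWeightsCoalesced(
		float *pNeuronsX, 	// neurons of layer x
		float *pWeightsYX,      // transposed: weight(z,x) = pWeightsYX[z*iOffset + x]
		float *pErrorY, 	// errors of layer x+1
		int iSizeY, int iOffset, float fLearningRate)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	if (x >= iOffset)
		return;

	float factr = fLearningRate * pNeuronsX[x];

	// Consecutive threads now write consecutive addresses: coalesced.
	for (int z = 0; z < iSizeY; z++)
		pWeightsYX[z*iOffset + x] += factr * pErrorY[z];
}
```

With this layout, every thread in a warp reads the same pErrorY[z] in a given iteration, which the hardware serves as a broadcast, so explicit shared-memory staging may then matter less.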