Shared memory doubt

Suppose I have a kernel whose launch configuration and signature are given below. How should I use shared memory effectively to gain further performance?

Right now, I am accessing the elements using global memory.

func<<<625, 256>>>(float* outptr, float* arguments);

My output array holds 160,000 elements (625 blocks × 256 threads).

In my application each array element is processed by a single thread, but the mapping is random (thread 1 may process element 1, thread 2 may process element 3, and so on). Since I am accessing everything through global memory, this takes a huge amount of time.

How can I keep my elements in shared memory and gain performance?

Kindly advise.

Mathew Potter.

As long as you access global memory only once to read the value and once to write it, I do not think much optimization can be done. When you access an element multiple times, it is advisable to store the value in a local variable in the kernel, so that the variable lives in a register (or in shared memory) instead of being re-read from global memory.
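A minimal sketch of that advice (the kernel and its names are hypothetical, not from the original post):

```cuda
// Hypothetical kernel: each thread reuses one input element several times.
__global__ void ReuseKernel(const float* in, float* out, int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // One global read; the compiler keeps 'v' in a register.
    float v = in[i];

    // Reuse 'v' as often as needed without touching global memory again.
    out[i] = v * v + v;
}
```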

True, but

– If the global memory access is uncoalesced, you can use shared memory to build a sort of coalescing buffer.
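A sketch of such a coalescing buffer, assuming the random mapping stays within a block (the kernel and the `perm` index array are hypothetical): each thread first loads a contiguous element with a fully coalesced access, and only then reads the element it actually needs from shared memory, where random access is cheap.

```cuda
// 'perm' holds, for each output index, a source index within the same
// block (a value in 0..blockDim.x-1). Hypothetical example, not a library API.
__global__ void PermutedReadKernel(const float* in, float* out, const int* perm)
{
    extern __shared__ float tile[];

    unsigned int base = blockIdx.x * blockDim.x;

    // Coalesced load: consecutive threads read consecutive addresses.
    tile[threadIdx.x] = in[base + threadIdx.x];
    __syncthreads();

    // Random access within the block now hits shared memory, not DRAM.
    unsigned int j = perm[base + threadIdx.x];
    out[base + threadIdx.x] = tile[j];
}
```

Launched with the shared-memory size as the third configuration parameter, e.g. `PermutedReadKernel<<<625, 256, 256 * sizeof(float)>>>(d_in, d_out, d_perm);`. If the random mapping crosses block boundaries, a shared-memory staging buffer alone will not help.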

What do you mean by coalescing buffer? I came across this in the NVIDIA Programming Guide, but could not understand much.

Can I build a coalescing buffer in the following kernel?

__global__ void VectorAdditionKernel(const float* pVectorA, const float* pVectorB, float* pVectorC)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    pVectorC[i] = pVectorA[i] + pVectorB[i];
}
Please advise.

Mathew Potter

Coalescing is a very important concept in CUDA - it is essential to getting any real speed. If you’re not familiar with it, check the manual + the forums.

If you cannot coalesce your reads, read using tex1Dfetch. You can bind the device memory block you already have to a texture.
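A sketch of that binding, using the legacy texture reference API of that era (newer CUDA versions replace it with texture objects); the kernel name and variables are hypothetical:

```cuda
// Bind the device buffer you already have to a 1D texture reference and
// fetch through the texture cache instead of plain global loads.
texture<float, 1, cudaReadModeElementType> texA;

__global__ void TexturedAddKernel(float* pVectorC, const float* pVectorB, int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pVectorC[i] = tex1Dfetch(texA, i) + pVectorB[i];
}

// Host side: bind the existing device pointer, no extra copy needed.
// cudaBindTexture(0, texA, d_pVectorA, n * sizeof(float));
// TexturedAddKernel<<<625, 256>>>(d_pVectorC, d_pVectorB, n);
// cudaUnbindTexture(texA);
```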