Correct Use of Shared Memory?

I don’t see any reason why using shared memory will speed up a program if the value is only retrieved from global memory once. Is there any reason to do this?

Here’s a contrived example: I have a kernel which takes each input value, multiplies it by 5, and stores the result in an output array. I am presuming this is the fastest way of achieving this:

__global__ void kernel(float *in, float* out)
{
	unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;
	out[idx] = 5.0f * in[idx];
}

You are absolutely right. In that example, global memory reads should be fully coalesced and there would be no advantage to using shared memory. However, consider an only slightly different variant of the same idea:

__global__ void kernel(float *in, float* out)
{
	unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;
	out[idx] = 5.0f * in[idx+1];
}

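That version can benefit enormously from the use of shared memory, particularly on compute 1.0/1.1 capable devices. The offset of one element means each half-warp's read no longer starts on an aligned boundary, and those devices cannot coalesce misaligned reads, so every thread ends up issuing its own memory transaction.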
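
For illustration, here is a rough sketch of what a shared memory version of the shifted kernel might look like. It is not the only way to do it, and several things are my own assumptions rather than part of the code above: the name kernel_shared, the BLOCK_SIZE constant (the launch is assumed to use exactly that many threads per block), and the extra parameter n, taken to be the number of output elements, with in assumed to hold at least n + 1 floats. Each block does one aligned load of its own range into shared memory, a single thread fetches the one extra element past the end of that range, and the shifted read is then served from shared memory.

```cuda
#define BLOCK_SIZE 256  // assumption: the kernel is launched with blockDim.x == BLOCK_SIZE

// Hypothetical shared memory variant of the shifted kernel.
// Assumptions: `out` holds n floats and `in` holds at least n + 1 floats.
__global__ void kernel_shared(const float *in, float *out, unsigned int n)
{
	// One slot per thread, plus one for the element just past the block's range.
	__shared__ float tile[BLOCK_SIZE + 1];

	unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;

	// Aligned load: thread t reads element (block start + t).
	if (idx <= n)
		tile[threadIdx.x] = in[idx];

	// Thread 0 also fetches the single extra element needed by the last thread.
	if (threadIdx.x == 0) {
		unsigned int halo = blockDim.x * blockIdx.x + blockDim.x;
		if (halo <= n)
			tile[blockDim.x] = in[halo];
	}

	// Every thread reaches this barrier; no thread returns early above.
	__syncthreads();

	// The offset-by-one read now comes from shared memory.
	if (idx < n)
		out[idx] = 5.0f * tile[threadIdx.x + 1];
}
```

A launch under those assumptions would look like kernel_shared<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n), where d_in and d_out are placeholder names for device pointers. The one extra read per block is still an isolated access, but it costs one transaction per block rather than one per thread. How much this gains on any given device is something you would need to measure.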