Shared Memory Access

I’m having some trouble fully understanding how to copy from global memory to shared memory.

Let’s assume I want to do some calculations with an image.

I need the values of neighbouring pixels, so I used the following to load them into shared memory:

    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    __shared__ float px[16][16];

    // row-major indexing: consecutive threadIdx.x values write to
    // consecutive shared-memory locations
    px[threadIdx.y][threadIdx.x] = image[y * width + x];
    __syncthreads();

To do the calculations I need neighbouring pixel values at a distance of 2 from the current pixel. The problem is that when the current thread lies on the border of its block, I need pixels that belong to the neighbouring blocks.

How can these pixel values be fetched quickly? I tried using conditionals like if (threadIdx.x == 0) … to fetch the values from the neighbouring blocks, but that seemed a bit slow to me, and I guess it is also bad for coalescing.

Is there any clever way to do something like this?



Do you need to write back to the image? If not, use a 2D texture. It has an automatic 2D cache that does exactly what you want, and you can get bilinear interpolation for (almost) free if you need it.
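A minimal sketch of that texture path, using the (legacy) texture reference API. The names (tex, out, cuArray, h_image) are placeholders, not from the original post:

    // Texture reference bound to a CUDA array holding the image.
    texture<float, 2, cudaReadModeElementType> tex;

    __global__ void filterTex(float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            // tex2D reads go through the 2D texture cache, so neighbour
            // fetches at distance 2 need no manual apron handling.
            float v = tex2D(tex, x + 0.5f, y + 0.5f);
            out[y * width + x] = v; // replace with the actual computation
        }
    }

    // Host side (sketch): copy the image into a CUDA array and bind it.
    // cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    // cudaMallocArray(&cuArray, &desc, width, height);
    // cudaMemcpyToArray(cuArray, 0, 0, h_image, size, cudaMemcpyHostToDevice);
    // cudaBindTextureToArray(tex, cuArray);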


Yes, I already used a 2D texture, but I read somewhere on this forum that using shared memory can be faster in some cases.
I also read here that it can be beneficial to combine shared memory with texture fetches.

But anyhow, I could still use shared memory for my second kernel.
The first kernel computes some values and stores them in a linear float array that the second kernel uses. The second kernel can't use the first kernel's output as a 2D texture, since it's linear memory, which can't be bound to a 2D texture. And if I understand correctly, only CUDA arrays benefit properly from texture caching.

What I did was copy the output of the first kernel into a CUDA array and bind that to a texture for use by the second kernel. I thought that using shared memory instead could be beneficial in this case.

I don't see what the relation to shared memory is here. You can always use it after the texture fetch if you repeatedly work on the values within one kernel, but you cannot communicate data between kernels through shared memory.


I don't want to communicate between two kernels using shared memory.
I just wanted to use shared memory in my second kernel, since texture fetches aren't cached for 1D textures bound to linear memory.

You just need to expand your shared-memory array to be large enough to hold a block and its "apron" region. For a filter with radius 3 pixels, for example, the array needs to be (16 + 2*3) by (16 + 2*3) instead of 16 by 16. Then each block loads the data it needs as usual. Yes, coalescing can be a problem in this case; see the discussion in the convolution white paper in the SDK.
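A rough sketch of such an apron load, assuming a radius of 2 as in the original question (image, result, width, height are placeholder names, and the border is handled by clamping):

    #define RADIUS 2
    #define TILE   16

    __global__ void filterApron(const float *image, float *result,
                                int width, int height)
    {
        // Tile plus apron: (16 + 2*2) x (16 + 2*2) = 20 x 20.
        __shared__ float px[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;

        // The 16x16 threads cooperatively cover the 20x20 region by
        // striding over it in steps of TILE.
        for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE) {
            for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE) {
                int gx = blockIdx.x * TILE + dx - RADIUS;
                int gy = blockIdx.y * TILE + dy - RADIUS;
                // Clamp to the image border (replicate edge pixels).
                gx = min(max(gx, 0), width  - 1);
                gy = min(max(gy, 0), height - 1);
                px[dy][dx] = image[gy * width + gx];
            }
        }
        __syncthreads();

        if (x < width && y < height) {
            // This thread's centre pixel now sits at
            // px[threadIdx.y + RADIUS][threadIdx.x + RADIUS], with all
            // neighbours up to distance RADIUS available in shared memory.
            float v = px[threadIdx.y + RADIUS][threadIdx.x + RADIUS];
            result[y * width + x] = v; // replace with the actual filter
        }
    }

No conditionals on threadIdx are needed for the apron itself; threads near the tile edges simply loop once more, and the loads from global memory stay mostly coalesced along rows.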

As for your comment about 1D linear textures being uncached: that's incorrect. I may have made that statement in the past on this forum, but I was mistaken.