I’m having some trouble fully understanding how to copy from global memory to shared memory.
Let’s assume I want to do some calculations with an image.
I need the values of neighbouring pixels, so I load them into shared memory like this:
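    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    // BLOCK_W and BLOCK_H stand for the block dimensions as compile-time constants
    __shared__ float px[BLOCK_W][BLOCK_H];
    px[threadIdx.x][threadIdx.y] = image[y * width + x];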
For the calculations I need the neighbouring pixel values at a distance of 2 from the current pixel. The problem is that when the current thread is on the border of its block, those pixels belong to neighbouring blocks.
How can these pixel values be fetched efficiently? I tried using conditionals like if (threadIdx.x == 0) … to fetch the values from the neighbouring blocks, but that seemed a bit slow to me and I guess it is bad for coalescing.
Is there any clever way to do something like this?
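For reference, one pattern commonly used for this kind of neighbourhood access is to make the shared-memory tile larger than the block by the radius on every side (a "halo") and let all threads of the block share the loading through a strided loop, rather than giving the border threads extra conditional loads. The following is only a minimal sketch under several assumptions, not a tuned implementation: the block size is fixed at compile time (BLOCK_W, BLOCK_H), pixels outside the image are clamped to the nearest edge, and the kernel boxSum with its 5x5 sum is a hypothetical placeholder for the real calculation. The tile is indexed as [y][x] so that consecutive threadIdx.x values touch consecutive shared-memory addresses.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK_W = 16;              // assumed block width  (blockDim.x)
constexpr int BLOCK_H = 16;              // assumed block height (blockDim.y)
constexpr int R = 2;                     // neighbourhood distance from the question
constexpr int TILE_W = BLOCK_W + 2 * R;  // tile width including the halo
constexpr int TILE_H = BLOCK_H + 2 * R;  // tile height including the halo

// Hypothetical placeholder calculation: sum of the (2R+1) x (2R+1) neighbourhood.
__global__ void boxSum(const float* image, float* out, int width, int height)
{
    __shared__ float tile[TILE_H][TILE_W];

    // Top-left corner of the tile (halo included) in image coordinates.
    const int originX = blockIdx.x * BLOCK_W - R;
    const int originY = blockIdx.y * BLOCK_H - R;

    // Every thread loads tile elements tid, tid + blockSize, ... so the halo
    // is filled by the whole block instead of only by the border threads.
    const int tid = threadIdx.y * BLOCK_W + threadIdx.x;
    for (int i = tid; i < TILE_W * TILE_H; i += BLOCK_W * BLOCK_H) {
        const int tx = i % TILE_W;
        const int ty = i / TILE_W;
        // Assumed boundary policy: clamp to the nearest pixel inside the image.
        const int gx = min(max(originX + tx, 0), width - 1);
        const int gy = min(max(originY + ty, 0), height - 1);
        tile[ty][tx] = image[gy * width + gx];
    }
    __syncthreads();  // the full tile must be loaded before anyone reads it

    const int x = blockIdx.x * BLOCK_W + threadIdx.x;
    const int y = blockIdx.y * BLOCK_H + threadIdx.y;
    if (x >= width || y >= height) return;  // threads past the image edge only help load

    float sum = 0.0f;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx)
            sum += tile[threadIdx.y + R + dy][threadIdx.x + R + dx];
    out[y * width + x] = sum;
}

// Runs the kernel on a small synthetic image and returns the number of pixels
// that differ from a CPU reference, or -1 if the result could not be copied back.
int checkBoxSum(int width, int height)
{
    std::vector<float> h_in(width * height), h_out(width * height);
    for (int i = 0; i < width * height; ++i) h_in[i] = float(i % 7);

    float *d_in = nullptr, *d_out = nullptr;
    const size_t bytes = h_in.size() * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(BLOCK_W, BLOCK_H);
    const dim3 grid((width + BLOCK_W - 1) / BLOCK_W, (height + BLOCK_H - 1) / BLOCK_H);
    boxSum<<<grid, block>>>(d_in, d_out, width, height);
    const cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float ref = 0.0f;  // same clamped 5x5 sum computed on the CPU
            for (int dy = -R; dy <= R; ++dy)
                for (int dx = -R; dx <= R; ++dx) {
                    const int gx = std::min(std::max(x + dx, 0), width - 1);
                    const int gy = std::min(std::max(y + dy, 0), height - 1);
                    ref += h_in[gy * width + gx];
                }
            if (ref != h_out[y * width + x]) ++mismatches;
        }
    return mismatches;
}
```

Calling checkBoxSum(70, 50) from main exercises blocks that are only partly inside the image, since neither dimension is a multiple of 16. Whether this is actually faster than the conditional version depends on the hardware and the block size, so it would need to be profiled.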