How to improve access to global memory?

I have the kernel below:


__global__ void kernel(unsigned char *img, int *wrkx_1d_gpu)
{
    int hh = threadIdx.x + blockIdx.x * blockDim.x;
    int gg = threadIdx.y + blockIdx.y * blockDim.y;
    int N = 512;
    wrkx_1d_gpu[gg*N+hh] = 36*(img[gg*N+hh+1] - img[gg*N+hh-1]) +
        18*(img[(gg+1)*N+hh+1] + img[(gg-1)*N+hh+1] - img[(gg-1)*N+hh-1] - img[(gg+1)*N+hh-1]) +
        /* a term was truncated in the paste; it ended in ...*N+hh-2]) */
        (img[(gg+1)*N+hh+2] + img[(gg-1)*N+hh+2] - img[(gg+1)*N+hh-2] - img[(gg-1)*N+hh-2]) +
        (img[(gg+2)*N+hh+1] + img[(gg-2)*N+hh+1] - img[(gg+2)*N+hh-1] - img[(gg-2)*N+hh-1]) +
        /* remaining terms truncated in the paste */ 0;
}

I call the kernel as below:

The idea is that every thread calculates one pixel of the output after reading the neighboring pixels of the input.

The neighboring pixels are not stored contiguously in global memory, so each thread reads non-contiguous data, and I think this is why I get low performance. What could I do (or how should I read the data from global memory) to achieve better performance?
Any ideas?

Thanks in advance!

While your hunch about performance is probably correct, it would be much better to simply run the CUDA profiler and let it guide your optimization attempts.
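For example, assuming the compiled binary is called `./app` (a hypothetical name), the command-line profilers can be invoked roughly like this:

```shell
# legacy profiler (older CUDA toolkits): per-kernel times and throughput summary
nvprof ./app

# Nsight Compute (current toolkits): detailed memory workload analysis per kernel
ncu --section MemoryWorkloadAnalysis ./app
```

Either tool will report global memory transactions per request, which directly shows whether the loads are poorly coalesced.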

The code implements some sort of stencil, which strongly suggests using shared memory as an intermediate buffer to (1) improve global memory access patterns and (2) dramatically increase per-thread data access speed while maximizing data re-use. Maybe try a 16x16 pixel buffer for an initial attempt.
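A minimal sketch of such a tiled kernel, assuming a 512x512 image, 16x16 thread blocks, and the 2-pixel halo this stencil needs (`TILE`, `RADIUS`, and `kernel_tiled` are illustrative names, and only the first two stencil terms are shown):

```cuda
#define N      512   // image width/height, as in the original kernel
#define TILE   16    // threads per block in each dimension
#define RADIUS 2     // the stencil reaches 2 pixels in every direction

__global__ void kernel_tiled(const unsigned char *img, int *wrkx_1d_gpu)
{
    // shared tile with a RADIUS-pixel halo on every side
    __shared__ unsigned char tile[TILE + 2*RADIUS][TILE + 2*RADIUS];

    int hh = threadIdx.x + blockIdx.x * blockDim.x;  // column
    int gg = threadIdx.y + blockIdx.y * blockDim.y;  // row

    // Cooperatively load the tile plus halo; each thread loads one or
    // more pixels. Indices are clamped at the image border to keep the
    // sketch simple (the original kernel reads out of bounds there).
    for (int ty = threadIdx.y; ty < TILE + 2*RADIUS; ty += blockDim.y)
        for (int tx = threadIdx.x; tx < TILE + 2*RADIUS; tx += blockDim.x) {
            int y = min(max((int)(blockIdx.y * TILE) + ty - RADIUS, 0), N - 1);
            int x = min(max((int)(blockIdx.x * TILE) + tx - RADIUS, 0), N - 1);
            tile[ty][tx] = img[y * N + x];
        }
    __syncthreads();

    int ly = threadIdx.y + RADIUS;   // this thread's pixel inside the tile
    int lx = threadIdx.x + RADIUS;

    // every neighbour read now hits fast shared memory, not global memory
    wrkx_1d_gpu[gg * N + hh] =
        36 * (tile[ly][lx+1] - tile[ly][lx-1]) +
        18 * (tile[ly+1][lx+1] + tile[ly-1][lx+1]
            - tile[ly-1][lx-1] - tile[ly+1][lx-1]);
        // ... remaining stencil terms rewritten the same way
}
```

Each input pixel is then fetched from global memory roughly once per block instead of up to a dozen times, and the cooperative load is contiguous, so it coalesces well.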

Note that shared memory access is, to first order, optimized for 32-bit accesses, and global memory accesses are inefficient for data sizes smaller than 32 bits, so consider the use of packed types like uchar4 as far as feasible.
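A sketch of the packed-load idea, assuming the image pointer is 4-byte aligned and the row width is a multiple of 4 (`read_packed` is an illustrative name):

```cuda
// Each thread fetches four horizontally adjacent pixels in a single
// 32-bit transaction instead of four separate 8-bit loads.
__global__ void read_packed(const unsigned char *img, int *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // index in uchar4 units
    if (i < n / 4) {
        uchar4 p = reinterpret_cast<const uchar4 *>(img)[i];
        // p.x .. p.w hold the four pixel values; combine as needed,
        // e.g. accumulate them into the per-thread working set
        out[i] = p.x + p.y + p.z + p.w;
    }
}
```

The same trick applies to the cooperative shared-memory load: filling the tile with uchar4 reads quarters the number of global memory transactions.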