Hello,
I have the kernel below:
////////////////////////////////////////////////////////////////
__global__ void kernel(unsigned char * img, int * wrkx_1d_gpu)
{
    int hh = threadIdx.x + blockIdx.x*blockDim.x;   // column index
    int gg = threadIdx.y + blockIdx.y*blockDim.y;   // row index
    if ((hh>=2)&&(hh<510)&&(gg>=2)&&(gg<382))
    {
        int N = 512;   // row stride (image width); the indexing below uses N
        wrkx_1d_gpu[gg*N+hh] = 36*(img[gg*N+hh+1]-img[gg*N+hh-1]) +
            18*(img[(gg+1)*N+hh+1]+img[(gg-1)*N+hh+1]-img[(gg-1)*N+hh-1]-img[(gg+1)*N+hh-1]) +
            12*(img[gg*N+hh+2]-img[gg*N+hh-2]) +
            6*(img[(gg+1)*N+hh+2]+img[(gg-1)*N+hh+2]-img[(gg+1)*N+hh-2]-img[(gg-1)*N+hh-2]) +
            3*(img[(gg+2)*N+hh+1]+img[(gg-2)*N+hh+1]-img[(gg+2)*N+hh-1]-img[(gg-2)*N+hh-1]) +
            1*(img[(gg+2)*N+hh+2]+img[(gg-2)*N+hh+2]-img[(gg-2)*N+hh-2]-img[(gg+2)*N+hh-2]);
    }
}
////////////////////////////////////////////////////////////////
I call the kernel as below:
kernel<<<grid,block>>>(img,wrkx_1d_gpu);
cudaDeviceSynchronize();
////////////////////////////////////////////////////////////////
The idea is that every thread calculates a pixel of the image, after reading the neighboring pixels.
The neighboring pixels are not stored contiguously in global memory (the rows above and below are a full image width apart), so each thread does not read contiguous data, and I think this is why the performance is low. What could I do (or how should I read the data from global memory) to achieve better performance?
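One common approach for a stencil like this is to have each block first copy its pixels plus a 2-pixel border into shared memory, so that the neighbor reads hit shared memory instead of global memory. The following is only a minimal sketch under stated assumptions, not a measured improvement: the image height of 384 is inferred from the `gg<382` bound, the 32x8 block shape is an arbitrary choice, and the names `kernel_tiled`, `stencil`, `diff`, `W`, `H`, `R`, `BX`, `BY` and `TW` are hypothetical. The `main` function compares the result against a CPU version of the same formula.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr int W = 512;          // row stride (N in the kernel above)
constexpr int H = 384;          // ASSUMPTION: inferred from the gg < 382 bound
constexpr int R = 2;            // stencil radius
constexpr int BX = 32, BY = 8;  // ASSUMPTION: block shape chosen for this sketch
constexpr int TW = BX + 2 * R;  // tile width including the border

// Difference between the pixels dx columns right and left, dy rows away from c.
__host__ __device__ inline int diff(const unsigned char* c, int s, int dy, int dx)
{
    return int(c[dy * s + dx]) - int(c[dy * s - dx]);
}

// Same weights as the original kernel; c points at the center pixel, s is the row stride.
__host__ __device__ inline int stencil(const unsigned char* c, int s)
{
    return 36 * diff(c, s, 0, 1) + 18 * (diff(c, s, 1, 1) + diff(c, s, -1, 1))
         + 12 * diff(c, s, 0, 2) +  6 * (diff(c, s, 1, 2) + diff(c, s, -1, 2))
         +  3 * (diff(c, s, 2, 1) + diff(c, s, -2, 1))
         +  1 * (diff(c, s, 2, 2) + diff(c, s, -2, 2));
}

__global__ void kernel_tiled(const unsigned char* img, int* out)
{
    assert(blockDim.x == BX && blockDim.y == BY);  // tile size is fixed at compile time
    __shared__ unsigned char tile[(BY + 2 * R) * TW];

    const int x0 = blockIdx.x * BX - R;  // image coordinates of tile element (0, 0)
    const int y0 = blockIdx.y * BY - R;

    // Cooperative load: consecutive threads in x read consecutive bytes of a row.
    for (int ty = threadIdx.y; ty < BY + 2 * R; ty += BY)
        for (int tx = threadIdx.x; tx < TW; tx += BX) {
            const int x = x0 + tx, y = y0 + ty;
            const bool inside = (x >= 0 && x < W && y >= 0 && y < H);
            tile[ty * TW + tx] = inside ? img[y * W + x] : 0;  // padding is never used below
        }
    __syncthreads();

    const int hh = blockIdx.x * BX + threadIdx.x;
    const int gg = blockIdx.y * BY + threadIdx.y;
    if (hh >= R && hh < W - R && gg >= R && gg < H - R)
        out[gg * W + hh] = stencil(&tile[(threadIdx.y + R) * TW + threadIdx.x + R], TW);
}

int main()
{
    std::vector<unsigned char> h_img(W * H);
    for (auto& p : h_img) p = static_cast<unsigned char>(rand() & 0xFF);

    // CPU reference using the same formula on the interior pixels.
    std::vector<int> ref(W * H, 0), got(W * H, 0);
    for (int gg = R; gg < H - R; ++gg)
        for (int hh = R; hh < W - R; ++hh)
            ref[gg * W + hh] = stencil(&h_img[gg * W + hh], W);

    unsigned char* d_img = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_img, W * H);
    cudaMalloc(&d_out, W * H * sizeof(int));
    cudaMemcpy(d_img, h_img.data(), W * H, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, W * H * sizeof(int));

    dim3 block(BX, BY), grid(W / BX, H / BY);  // 512/32 x 384/8 blocks, no remainder
    kernel_tiled<<<grid, block>>>(d_img, d_out);
    const cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(got.data(), d_out, W * H * sizeof(int), cudaMemcpyDeviceToHost);

    const bool ok = (err == cudaSuccess) && (got == ref);
    printf(ok ? "tiled kernel matches CPU reference\n" : "MISMATCH\n");
    cudaFree(d_img);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

With this layout each block reads every pixel of its tile from global memory once, instead of each thread reading 20 neighbors itself. Whether that is actually faster than the original kernel depends on the GPU and its caches, so it would need to be timed (for example with a profiler) before drawing conclusions.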
Any ideas?
Thanks in advance!