How to improve access to global memory?

Sinay · December 14, 2017, 12:06am

Hello,
I have the kernel below:

////////////////////////////////////////////////////////////////

__global__ void kernel(unsigned char * img, int * wrkx_1d_gpu)
{
    int hh=threadIdx.x+blockIdx.x*blockDim.x;   // pixel column
    int gg=threadIdx.y+blockIdx.y*blockDim.y;   // pixel row
    // skip a 2-pixel border so that every neighbor read stays inside the image
    if((hh>=2)&&(hh<510)&&(gg>=2)&&(gg<382))
    {
        int N=512;   // row stride in pixels (posted as "Nrows", but the indexing below uses N)
        wrkx_1d_gpu[gg*N+hh]=36*(img[gg*N+hh+1]-img[gg*N+hh-1]) +
            18*(img[(gg+1)*N+hh+1]+img[(gg-1)*N+hh+1]-img[(gg-1)*N+hh-1]-img[(gg+1)*N+hh-1]) +
            12*(img[(gg*N+hh+2)]-img[gg*N+hh-2]) +
            6*(img[(gg+1)*N+hh+2]+img[(gg-1)*N+hh+2]-img[(gg+1)*N+hh-2]-img[(gg-1)*N+hh-2]) +
            3*(img[(gg+2)*N+hh+1]+img[(gg-2)*N+hh+1]-img[(gg+2)*N+hh-1]-img[(gg-2)*N+hh-1]) +
            1*(img[(gg+2)*N+hh+2]+img[(gg-2)*N+hh+2]-img[(gg-2)*N+hh-2]-img[(gg+2)*N+hh-2]);
    }
}

////////////////////////////////////////////////////////////////
I call the kernel as below:

kernel<<<grid,block>>>(img,wrkx_1d_gpu);
cudaDeviceSynchronize();
////////////////////////////////////////////////////////////////
The idea is that every thread calculates a pixel of the image, after reading the neighboring pixels.

The neighboring pixels are not stored contiguously in global memory, so I do not read contiguous data from global memory, and I think this is why performance is low. What could I do (or how could I read the data from global memory) to achieve better performance?
Any ideas?

Thanks in advance!

njuffa · December 14, 2017, 12:14am

While your hunch about performance is probably correct, it would be much better to simply run the CUDA profiler to have it guide your optimization attempts.

The code appears to implement some sort of stencil, which strongly suggests using shared memory as an intermediate buffer to (1) improve global memory access patterns and (2) dramatically increase per-thread data access speed and maximize data re-use. Maybe try a 16x16 pixel buffer for an initial attempt.
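
A minimal sketch of the shared-memory idea, not a tuned implementation. It assumes a 16x16 thread block that produces 16x16 output pixels and stages a 20x20 tile (the block plus a 2-pixel halo) in shared memory. The names `kernel_smem`, `run_kernel_smem`, `TILE`, `RADIUS` and `SMEM` are made up for illustration, and the image height of 384 used in the note below is only inferred from the `gg<382` bound in the original kernel.

```cuda
#include <cuda_runtime.h>

#define TILE   16                    // assumed block edge: 16x16 threads, one output pixel each
#define RADIUS 2                     // the stencil reaches 2 pixels in every direction
#define SMEM   (TILE + 2 * RADIUS)   // tile edge including the halo (20)

__global__ void kernel_smem(const unsigned char *img, int *out, int width, int height)
{
    __shared__ unsigned char tile[SMEM][SMEM];

    const int x0 = blockIdx.x * TILE - RADIUS;   // image column of tile[.][0]
    const int y0 = blockIdx.y * TILE - RADIUS;   // image row of tile[0][.]

    // Cooperative load: the 256 threads of the block fill all 400 tile entries.
    for (int i = threadIdx.y * TILE + threadIdx.x; i < SMEM * SMEM; i += TILE * TILE) {
        const int ty = i / SMEM, tx = i % SMEM;
        const int x = x0 + tx, y = y0 + ty;
        const bool inside = (x >= 0) && (x < width) && (y >= 0) && (y < height);
        tile[ty][tx] = inside ? img[y * width + x] : 0;   // out-of-image entries are never used
    }
    __syncthreads();

    const int hh = blockIdx.x * TILE + threadIdx.x;   // pixel column
    const int gg = blockIdx.y * TILE + threadIdx.y;   // pixel row
    // Same 2-pixel border as the original kernel, expressed through width/height.
    if (hh < RADIUS || hh >= width - RADIUS || gg < RADIUS || gg >= height - RADIUS) return;

    const int cx = threadIdx.x + RADIUS, cy = threadIdx.y + RADIUS;
    #define T(dy, dx) ((int)tile[cy + (dy)][cx + (dx)])
    out[gg * width + hh] =
        36 * (T(0, 1) - T(0, -1)) +
        18 * (T(1, 1) + T(-1, 1) - T(-1, -1) - T(1, -1)) +
        12 * (T(0, 2) - T(0, -2)) +
         6 * (T(1, 2) + T(-1, 2) - T(1, -2) - T(-1, -2)) +
         3 * (T(2, 1) + T(-2, 1) - T(2, -1) - T(-2, -1)) +
         1 * (T(2, 2) + T(-2, 2) - T(-2, -2) - T(2, -2));
    #undef T
}

// Hypothetical host helper: copy in, launch with a TILE x TILE block, copy out.
// Error checking is omitted to keep the sketch short.
void run_kernel_smem(const unsigned char *h_img, int *h_out, int width, int height)
{
    const size_t n = (size_t)width * height;
    unsigned char *d_img = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_img, n);
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_img, h_img, n, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(int));   // border pixels stay 0
    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    kernel_smem<<<grid, block>>>(d_img, d_out, width, height);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_img);
    cudaFree(d_out);
}
```

With this layout each input pixel is fetched from global memory once per block (plus the halo overlap between neighboring blocks) instead of up to 20 times, once for every output pixel whose stencil touches it. For the 512x384 case this would be called as `run_kernel_smem(h_img, h_out, 512, 384)`. Whether it actually helps on a given GPU is something the profiler should confirm.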

Note that shared memory is, to first order, optimized for 32-bit accesses, and global memory accesses are inefficient for data sizes smaller than 32 bits, so consider using packed types like uchar4 as far as feasible.
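
To illustrate the packed-type suggestion, here is a rough sketch in which each thread loads pixels as `uchar4` (four bytes per load) and handles four horizontally adjacent pixels. To stay short it computes only the first stencil term, `36*(img[x+1]-img[x-1])`. The names `hdiff_packed` and `run_hdiff_packed` are hypothetical, and the sketch assumes that the width is a multiple of 4 and that the buffers come from `cudaMalloc`, so every row start is suitably aligned for `uchar4` and `int4` access.

```cuda
#include <cuda_runtime.h>

// Each thread processes one group of four adjacent pixels (a "quad") in a row.
__global__ void hdiff_packed(const unsigned char *img, int *out, int width, int height)
{
    const int qx = blockIdx.x * blockDim.x + threadIdx.x;   // quad index within the row
    const int y  = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row
    const int quadsPerRow = width / 4;                      // assumes width % 4 == 0
    // The first and last quad of each row are skipped: they lack a left/right neighbor quad.
    if (qx < 1 || qx >= quadsPerRow - 1 || y >= height) return;

    // 32-bit loads: left neighbor quad, own quad, right neighbor quad.
    const uchar4 *row = reinterpret_cast<const uchar4 *>(img + (size_t)y * width);
    const uchar4 l = row[qx - 1];
    const uchar4 c = row[qx];
    const uchar4 r = row[qx + 1];

    // Pixels 4*qx+0 .. 4*qx+3: right neighbor minus left neighbor, times 36.
    int4 d;
    d.x = 36 * ((int)c.y - (int)l.w);
    d.y = 36 * ((int)c.z - (int)c.x);
    d.z = 36 * ((int)c.w - (int)c.y);
    d.w = 36 * ((int)r.x - (int)c.z);

    // One 128-bit store for the four results.
    reinterpret_cast<int4 *>(out + (size_t)y * width)[qx] = d;
}

// Hypothetical host helper; error checking omitted for brevity.
void run_hdiff_packed(const unsigned char *h_img, int *h_out, int width, int height)
{
    const size_t n = (size_t)width * height;
    unsigned char *d_img = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_img, n);
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_img, h_img, n, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(int));   // skipped pixels stay 0
    dim3 block(32, 8);
    dim3 grid((width / 4 + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    hdiff_packed<<<grid, block>>>(d_img, d_out, width, height);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_img);
    cudaFree(d_out);
}
```

The same packed loads could also be used to fill a shared-memory tile; combining both ideas is left out here to keep the example small.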
