Hello, I'm a CUDA beginner and have been practicing with this exercise: https://agray3.github.io/files/learnCUDApractical.pdf
In the second assignment, a kernel for image reconstruction is to be optimized. My final kernel looks like this:
__global__ void inverseEdgeDetect2D(float *d_output, float *d_input,
                                    float *d_edge)
{
    int col, row;
    int idx, idx_south, idx_north, idx_west, idx_east;
    int numcols = N + 2;

    col = blockIdx.x * blockDim.x + threadIdx.x + 1;
    row = blockIdx.y * blockDim.y + threadIdx.y + 1;

    idx       = row * numcols + col;
    idx_south = (row - 1) * numcols + col;
    idx_north = (row + 1) * numcols + col;
    idx_west  = row * numcols + (col - 1);
    idx_east  = row * numcols + (col + 1);

    d_output[idx] = (d_input[idx_south] + d_input[idx_west] + d_input[idx_north]
                     + d_input[idx_east] - d_edge[idx]) * 0.25;
}
In each iteration of the algorithm, every element of the image matrix reads its four neighbors (a 1-pixel-wide halo is added around the image so that even the edge elements have four neighbors). Each element of d_input in global memory is therefore read up to four times. As far as I understand, the accesses at idx_south and idx_north are not coalesced, so they will not be cached, which means an optimization with shared memory should be achievable. This is my attempt:
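For reference, this is roughly how the kernel is launched (a sketch, not the exact harness from the practical; it assumes N is a multiple of 32 and that input and output are swapped between iterations; NITER is just a placeholder name):

dim3 blockSize(32, 32);
dim3 gridSize(N / 32, N / 32);              // assumes N is a multiple of 32
for (int iter = 0; iter < NITER; iter++) {  // NITER = 1000 in my runs
    inverseEdgeDetect2D<<<gridSize, blockSize>>>(d_output, d_input, d_edge);
    // reuse the result as the input of the next iteration
    // (placeholder for the practical's actual buffer handling)
    float *tmp = d_input; d_input = d_output; d_output = tmp;
}
cudaDeviceSynchronize();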
__global__ void inverseEdgeDetect2D_shmem(float *d_output, float *d_input,
                                          float *d_edge)
{
    int col, row, idx;
    int numcols = N + 2;

    col = blockIdx.x * blockDim.x + threadIdx.x + 1;
    row = blockIdx.y * blockDim.y + threadIdx.y + 1;
    idx = row * numcols + col;

    // 32x32 tile plus a 1-pixel halo on each side
    __shared__ float shmem[34][34];

    // each thread copies its own element of the 32x32 tile into shmem
    shmem[threadIdx.y + 1][threadIdx.x + 1] = d_input[idx];

    // copy the halo into shmem: one warp per edge of the tile
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int i = tid % 32 + 1;
    // global index of the top-left halo element of this block's tile
    int firstBlockEntryIdx = (blockIdx.x * blockDim.x) + (blockIdx.y * blockDim.y) * numcols;

    if (tid < 32)        // warp 0: halo row above the tile
        shmem[0][i] = d_input[firstBlockEntryIdx + i];
    else if (tid < 64)   // warp 1: halo row below the tile
        shmem[33][i] = d_input[firstBlockEntryIdx + 33 * numcols + i];
    else if (tid < 96)   // warp 2: halo column left of the tile
        shmem[i][0] = d_input[firstBlockEntryIdx + i * numcols];
    else if (tid < 128)  // warp 3: halo column right of the tile
        shmem[i][33] = d_input[firstBlockEntryIdx + i * numcols + 33];

    __syncthreads();

    d_output[idx] = (shmem[threadIdx.y + 1][threadIdx.x + 2] + shmem[threadIdx.y + 1][threadIdx.x]
                     + shmem[threadIdx.y + 0][threadIdx.x + 1] + shmem[threadIdx.y + 2][threadIdx.x + 1]
                     - d_edge[idx]) * 0.25;
}
All threads in a 32x32 thread block load their own element into the shared-memory tile shmem; in addition, 4 warps load the 1-pixel-wide halo around the tile, so that the tile's edge elements can be handled correctly. The kernel produces correct results, but it runs slower than the previous one: for 1000 iterations, the time spent on the GPU is 0.231176 s with the old kernel and 0.300773 s with the shared-memory kernel. How is this possible? Is the global memory latency perhaps already being hidden?
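For completeness, the GPU time is measured roughly like this (a sketch using CUDA events; the practical's actual timing code may differ, and gridSize/blockSize are the launch parameters from the sketch above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int iter = 0; iter < 1000; iter++) {
    inverseEdgeDetect2D_shmem<<<gridSize, blockSize>>>(d_output, d_input, d_edge);
    float *tmp = d_input; d_input = d_output; d_output = tmp;  // swap buffers (placeholder)
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);                 // elapsed time in milliseconds
printf("time spent on GPU: %f s\n", ms / 1000.0f);      // e.g. 0.300773 s for the shmem kernel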