Shared memory optimization for algorithm accessing neighbors

Hello, I’m a CUDA beginner and practiced with this exercise
In the second assignment, a kernel for image reconstruction should be optimized. The final kernel looks like this:

__global__ void inverseEdgeDetect2D(float *d_output, float *d_input, \
				    float *d_edge) 
  int col, row; 
  int idx, idx_south, idx_north, idx_west, idx_east; 
  int numcols = N + 2; 

  col = blockIdx.x*blockDim.x + threadIdx.x + 1;
  row = blockIdx.y*blockDim.y + threadIdx.y + 1; 
  idx = row * numcols + col; 
  idx_south = (row - 1) * numcols + col; 
  idx_north = (row + 1) * numcols + col; 
  idx_west = row * numcols + (col - 1); 
  idx_east = row * numcols +(col + 1); 

  d_output[idx] = (d_input[idx_south] + d_input[idx_west] + d_input[idx_north] 
              + d_input[idx_east] - d_edge[idx]) * 0.25; 

For each iteration of the algorithm, each element in the image matrix accesses its four neighbors (an outer, 1 pixel wide halo is added to the image such that the edge elements have 4 neighbors). Therefore, each element in global memory d_input will be accessed up to 4 times. As far as I understand, accesses to elements at idx_south and idx_north are not coalesced, so they will not be cached. Therefore an optimization with shared memory should be achievable. This is my attempt:

__global__ void inverseEdgeDetect2D_shmem(float *d_output, float *d_input, \
				    float *d_edge) 
  int col, row; 
  int idx, idx_south, idx_north, idx_west, idx_east; 
  int numcols = N + 2; 

  col = blockIdx.x*blockDim.x + threadIdx.x + 1; 
  row = blockIdx.y*blockDim.y + threadIdx.y + 1; 

  idx = row * numcols + col; 
  __shared__ float shmem[34][34];
  // copy inner values of 32x32 block into shmem
  shmem[threadIdx.y+1][threadIdx.x+1] = d_input[idx];

  // copy outer values into shmem
  int tid = threadIdx.y * blockDim.x + threadIdx.x;
  int i = tid % 32 + 1;
  int firstBlockEntryIdx = (blockIdx.x * blockDim.x) + (blockIdx.y * blockDim.y) * numcols;
  // one warp for each edge  
  if (tid < 32)
    shmem[0][i] = d_input[firstBlockEntryIdx + i];
  // and so on...

  d_output[idx] = (shmem[threadIdx.y+1][threadIdx.x+2] + shmem[threadIdx.y+1][threadIdx.x] 
                + shmem[threadIdx.y+0][threadIdx.x+1] + shmem[threadIdx.y+2][threadIdx.x+1]
                - d_edge[idx]) * 0.25;

All threads in a 32x32 thread block fill shared memory shmem. Additionally, 4 warps take care of the 1 pixel wide halo in shmem, such that edge elements can be handled correctly. The kernel produces correct results. However, it runs slower than the previous kernel. For 1000 iterations, time spent on GPU is 0.231176s with the old kernel and 0.300773s with the kernel using shmem. How is this possible? Is the global memory latency hidden by any chance?

That’s not correct in 2 ways.

  1. Those accesses are coalesced
  2. Non-coalesced accesses still can hit in the cache, and/or generate a cache load if they don’t hit

A reuse factor of 4 is on the small side. The shared memory framework definitely complicates the kernel code, so if the additional complexity is not offset by sufficient reuse, you may not see a benefit from shared memory caching. Some of this character will also depend on your GPU type.

For performance questions I generally recommend the following:

  • a complete code
  • the compile command used
  • the operating system, CUDA version, and GPU type
  • a description (if not evident) of how timing is being measured.

Do as you wish, of course.

Thank you so much for your quick response.

Can you please elaborate. For those accesses to be coalesced, isn’t it required for the instructions of the threads in a warp to be executed simultaneously? Is this guaranteed when no branch divergence occurs?

Perhaps we disagree on some semantics. You may be referring to what I would call “perfect coalescing”.

Your claim was:

I’m going to assume you are referring to this line:

Let’s focus our attention on this:

Pick a warp. Identify what each thread is doing at that point. Identify the address pattern across the warp. That should lead you to the correct conclusion. Example:

  1. Pick warp 0 in threadblock 0. All the threads in that warp will have threadIdx.x of 0…31 and all will have threadIdx.y of 0.

  2. All threads are using their computed idx_south to read from d_input. There is no conditional activity that affects this.

  3. The address pattern is based on the idx_south calculation:

     col = 0..31 + 1 = 1..32 across the warp (blockIdx.x = 0)
     row = 1 across the warp  (blockIdx.y = 0)
     idx_south = (row -1)*numcols + col
     idx_south = 1..32 across the warp

That means all 32 threads in the warp are reading from adjacent locations in memory, and that is a pattern that I would say “will coalesce nicely”. The memory controller will reduce (or “coalesce”) the transactions required from 32 (worst case) down to some small number, like 2 or 1.

Yes, it might be 2, and this might be our semantic difference. To understand whether it is 2 or 1 requires additional work. It probably will be 2. However if we just focus on that code, it is impossible to determine whether it is 2 or 1. In fact, if we just focus on the kernel code itself, it is impossible to determine whether it is 2 or 1. To make the final determination we must know the alignment of the d_input pointer passed to the kernel.

Yes, coalescing is only relevant for the same instruction (LD, ST) issued warp-wide. I’m not sure why you are singling out branch divergence and also singling out the idx_south and idx_north accesses. I don’t see anything different about those with respect to convergence as compared to the accesses for idx_east and idx_west, so I’m probably not following your question.

1 Like

Thank you, your post has cleared up my misunderstandings.

The question about branch divergence was more of a general question and not related to this specific kernel. But now i got it, so thanks again for your time.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.