Shared memory optimization for algorithm accessing neighbors

user74194 · December 1, 2021, 2:04am

Hello, I’m a CUDA beginner and practiced with this exercise https://agray3.github.io/files/learnCUDApractical.pdf
In the second assignment, a kernel for image reconstruction should be optimized. The final kernel looks like this:

__global__ void inverseEdgeDetect2D(float *d_output, float *d_input, \
				    float *d_edge) 
{ 
  int col, row; 
  int idx, idx_south, idx_north, idx_west, idx_east; 
  int numcols = N + 2; 

  col = blockIdx.x*blockDim.x + threadIdx.x + 1;
  row = blockIdx.y*blockDim.y + threadIdx.y + 1; 
    
  idx = row * numcols + col; 
  idx_south = (row - 1) * numcols + col; 
  idx_north = (row + 1) * numcols + col; 
  idx_west = row * numcols + (col - 1); 
  idx_east = row * numcols +(col + 1); 

  d_output[idx] = (d_input[idx_south] + d_input[idx_west] + d_input[idx_north] 
              + d_input[idx_east] - d_edge[idx]) * 0.25; 
}

For each iteration of the algorithm, each element in the image matrix accesses its four neighbors (an outer, 1 pixel wide halo is added to the image such that the edge elements have 4 neighbors). Therefore, each element in global memory d_input will be accessed up to 4 times. As far as I understand, accesses to elements at idx_south and idx_north are not coalesced, so they will not be cached. Therefore an optimization with shared memory should be achievable. This is my attempt:

__global__ void inverseEdgeDetect2D_shmem(float *d_output, float *d_input, \
				    float *d_edge) 
{ 
  int col, row; 
  int idx, idx_south, idx_north, idx_west, idx_east; 
  int numcols = N + 2; 

  col = blockIdx.x*blockDim.x + threadIdx.x + 1; 
  row = blockIdx.y*blockDim.y + threadIdx.y + 1; 

  idx = row * numcols + col; 
    
  __shared__ float shmem[34][34];
  // copy inner values of 32x32 block into shmem
  shmem[threadIdx.y+1][threadIdx.x+1] = d_input[idx];

  // copy outer values into shmem
  int tid = threadIdx.y * blockDim.x + threadIdx.x;
  int i = tid % 32 + 1;
  int firstBlockEntryIdx = (blockIdx.x * blockDim.x) + (blockIdx.y * blockDim.y) * numcols;
  // one warp for each edge  
  if (tid < 32)
    shmem[0][i] = d_input[firstBlockEntryIdx + i];
  // and so on...

  __syncthreads();
  
  d_output[idx] = (shmem[threadIdx.y+1][threadIdx.x+2] + shmem[threadIdx.y+1][threadIdx.x] 
                + shmem[threadIdx.y+0][threadIdx.x+1] + shmem[threadIdx.y+2][threadIdx.x+1]
                - d_edge[idx]) * 0.25;
}

All threads in a 32x32 thread block fill shared memory shmem. Additionally, 4 warps take care of the 1 pixel wide halo in shmem, such that edge elements can be handled correctly. The kernel produces correct results. However, it runs slower than the previous kernel. For 1000 iterations, time spent on GPU is 0.231176s with the old kernel and 0.300773s with the kernel using shmem. How is this possible? Is the global memory latency hidden by any chance?

Robert_Crovella · December 1, 2021, 2:34am

That’s not correct in 2 ways.

Those accesses are coalesced
Non-coalesced accesses still can hit in the cache, and/or generate a cache load if they don’t hit

A reuse factor of 4 is on the small side. The shared memory framework definitely complicates the kernel code, so if the additional complexity is not offset by sufficient reuse, you may not see a benefit from shared memory caching. Some of this character will also depend on your GPU type.

For performance questions I generally recommend the following:

a complete code
the compile command used
the operating system, CUDA version, and GPU type
a description (if not evident) of how timing is being measured.

Do as you wish, of course.

user74194 · December 1, 2021, 12:13pm

Thank you so much for your quick response.

Can you please elaborate. For those accesses to be coalesced, isn’t it required for the instructions of the threads in a warp to be executed simultaneously? Is this guaranteed when no branch divergence occurs?

Robert_Crovella · December 1, 2021, 2:01pm

Perhaps we disagree on some semantics. You may be referring to what I would call “perfect coalescing”.

Your claim was:

I’m going to assume you are referring to this line:

Let’s focus our attention on this:

Pick a warp. Identify what each thread is doing at that point. Identify the address pattern across the warp. That should lead you to the correct conclusion. Example:

Pick warp 0 in threadblock 0. All the threads in that warp will have threadIdx.x of 0…31 and all will have threadIdx.y of 0.
All threads are using their computed idx_south to read from d_input. There is no conditional activity that affects this.

The address pattern is based on the idx_south calculation:

 col = 0..31 + 1 = 1..32 across the warp (blockIdx.x = 0)
 row = 1 across the warp  (blockIdx.y = 0)
 idx_south = (row -1)*numcols + col
 idx_south = 1..32 across the warp

That means all 32 threads in the warp are reading from adjacent locations in memory, and that is a pattern that I would say “will coalesce nicely”. The memory controller will reduce (or “coalesce”) the transactions required from 32 (worst case) down to some small number, like 2 or 1.

Yes, it might be 2, and this might be our semantic difference. To understand whether it is 2 or 1 requires additional work. It probably will be 2. However if we just focus on that code, it is impossible to determine whether it is 2 or 1. In fact, if we just focus on the kernel code itself, it is impossible to determine whether it is 2 or 1. To make the final determination we must know the alignment of the d_input pointer passed to the kernel.

Yes, coalescing is only relevant for the same instruction (LD, ST) issued warp-wide. I’m not sure why you are singling out branch divergence and also singling out the idx_south and idx_north accesses. I don’t see anything different about those with respect to convergence as compared to the accesses for idx_east and idx_west, so I’m probably not following your question.

user74194 · December 1, 2021, 3:54pm

Thank you, your post has cleared up my misunderstandings.

The question about branch divergence was more of a general question and not related to this specific kernel. But now i got it, so thanks again for your time.

system · December 15, 2021, 3:55pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
coalescing problem CUDA Programming and Performance	4	1064	August 8, 2011
There must be a reason, but I can not figure out:( CUDA Programming and Performance	8	8089	January 3, 2008
Branch divergence, Boundary element exchange Optimization and best practices CUDA Programming and Performance	9	18555	December 13, 2007
Using Shared Memory in CUDA C/C++ Technical Blog	36	1953	October 8, 2020
Please help with __shared__ memory different usage than in samples CUDA Programming and Performance	30	3306	January 10, 2010
Memory access should be coalesced but is not CUDA Programming and Performance	6	1056	May 16, 2019
Coalesced Memory access related doubt CUDA Programming and Performance	13	1968	December 9, 2010
Help Avoiding Un-Coalesced Memory Access CUDA Programming and Performance	9	9174	October 4, 2010
Some Performance Consideration Questions warp divergence, coalescing and shared mem then and now CUDA Programming and Performance	1	2002	March 8, 2012
Why is the performance more? Refering to Dr Dobbs article CUDA Programming and Performance	10	2637	April 23, 2010

Shared memory optimization for algorithm accessing neighbors

Related topics