CUDA kernel 1D access vs 2D access

Hi Guys,

My goal is to implement frame differencing (1920x1080 resolution frames) using CUDA on a TX1. To do so, I have tried two methods of access in the difference kernel: 1D and 2D. However, the time taken by these two kernels is drastically different: the kernel implemented using 1D access takes 20 times less time than the kernel implemented using 2D access. I am trying to understand why that is. Kindly find the code snippets below.

Kernel implemented using 1D access :

__global__ void diff_mats_char_atomics11(
    char *output, char *input1, char *input2,
    unsigned int width,
    unsigned int height, unsigned int pitch)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;   // computed but currently unused

  // one byte per thread across the pitched buffer
  output[index] = abs(input1[index] - input2[index]);

  __syncthreads();   // not required here: the kernel uses no shared memory
}

1D Kernel call :

int blockSize = 256;
int numBlocks = (pitch * height) / blockSize;   // rounds down if blockSize does not divide pitch*height
diff_mats_char_atomics11<<<numBlocks, blockSize>>>(output, input1, input2, width, height, pitch);
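
(Aside: the stride variable in the kernel above is left over from the grid-stride pattern. My understanding is that the grid-stride form with an explicit bounds check would look roughly like this, though it is not the code I timed, and the kernel name is only for illustration:)

__global__ void diff_mats_1d_gridstride(
    char *output, const char *input1, const char *input2,
    unsigned int height, unsigned int pitch)
{
  int stride = blockDim.x * gridDim.x;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < pitch * height;                  // guards the tail when the grid does not cover the buffer exactly
       i += stride)
  {
    output[i] = abs(input1[i] - input2[i]);
  }
}

With this form the block count can be capped (for example at a small multiple of the SM count) instead of being tied to pitch*height/blockSize.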

Kernel implemented using 2D access :

__global__ void diff_mats_char_atomics17(
    char *output, char *input1, char *input2,
    unsigned int width,
    unsigned int height, unsigned int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int nx = blockDim.x * gridDim.x;
    int ny = blockDim.y * gridDim.y;
    int kW = width / nx;                      // columns handled by each thread
    int kH = (height + ny - 1) / ny;          // rows handled by each thread, rounded up

    int offset = x*kW + pitch*y*kH;           // start of this thread's kW x kH tile

    for (int row = 0; row < kH; row++)
    {
        for (int col = 0; col < kW; col++)
        {
            if (offset+col < pitch*height)
            {
                output[offset+col] = abs(input1[offset+col] - input2[offset+col]);
            }
        }
        offset += pitch;                      // advance to the next row of the tile
    }
    __syncthreads();                          // not required here: the kernel uses no shared memory
}

2D Kernel call :

dim3 block(32, 32);
dim3 grid(2, 1);   // only 2 blocks of 32x32 threads (2048 threads) for the whole 1920x1080 frame

diff_mats_char_atomics17<<<grid, block>>>(output, input1, input2, width, height, pitch);

Kindly help me understand why there is such a big difference in the time taken by these two methods. It will help me decide how to implement this operation better and, if necessary, correct the configuration in each of these methods.

Thanks.

Hi,

It depends on the way you access memory.

The output data and the input data are all in texture memory, so different access patterns will result in different cache miss rates.

Thanks.

There are a few obvious differences.

  1. The memory traffic for your 1D kernel will be entirely coalesced:
output[index] = abs(input1[index] -input2[index]) ;
          ^                  ^            ^
          all of these will coalesce nicely

the same cannot be said for your 2D kernel. Whenever you have a multiplier on the indexing component that includes threadIdx.x:

int x = blockIdx.x * blockDim.x + threadIdx.x;
    ...
    int kW = width / nx ;
    ...
    int offset = x*kW + pitch*y*kH ;
                 ^  ^
                 |  multiplier
                indexing component that includes threadIdx.x
      

    ...
           output[offset+col] = abs(input1[offset+col] - input2[offset+col]);

you have probably broken coalescing. Memory coalescing is one of the first optimization objectives of any CUDA programmer, and is covered in basic CUDA optimization presentations. For example, Google “paulius GTC cuda optimization” and take the first hit. A coalesced rewrite of your 2D kernel is sketched after point 2 below.

  2. The total number of threads is different for each kernel. The 1D case looks larger: I assume it is launching more than 2 blocks of 256 threads each. The 2D case is launching 2 blocks of 1024 threads each, so 2048 threads total. If the 1D case launches more than 2048 threads, that may be a benefit. Both TX1 and TX2 contain 2 SMs, which means their maximum instantaneous thread capacity is 4096 threads, and kernels that launch at least this many threads should do best. 2048 threads may be limiting you to 50% occupancy, which has a variety of performance impacts.
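
To make both points concrete, a 2D formulation in which each thread handles exactly one pixel keeps threadIdx.x mapped to consecutive bytes within a row (so accesses coalesce just like the 1D kernel) and, with the grid sized to cover the full frame, launches roughly 2 million threads instead of 2048. A sketch of what I mean (untested, names chosen only for illustration):

__global__ void diff_mats_2d_coalesced(
    char *output, const char *input1, const char *input2,
    unsigned int width, unsigned int height, unsigned int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive bytes in a row
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
    {
        int idx = y * pitch + x;
        output[idx] = abs(input1[idx] - input2[idx]);
    }
}

// launch: one thread per pixel, grid rounded up to cover the 1920x1080 frame
dim3 block(32, 8);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
diff_mats_2d_coalesced<<<grid, block>>>(output, input1, input2, width, height, pitch);

The index arithmetic is the same as in your 1D kernel, just split into x and y, which is why it should coalesce the same way.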

You might also want to learn about how to profile CUDA codes. The profiler can identify:

  • low occupancy
  • poor coalescing
  • in general, the performance limiter(s) for a particular code.

There are various GTC presentations on using the profilers; include GTC in your Google search string.
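
If you just want raw per-kernel timings while you work through the profiler material, CUDA events are the usual way to measure a single kernel; something along these lines (error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
diff_mats_char_atomics11<<<numBlocks, blockSize>>>(output, input1, input2, width, height, pitch);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("1D kernel: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);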

Hi txbob,

Thanks for your response. I will try some of the tips you have given and let you know if I face any issues.

Thanks.