Hi Guys,
My goal is to implement frame difference ( 1920x1080 resolution frames) using CUDA on TX1. To do so, I have tried two methods of access in the difference kernel : 1D and 2D. However, the time taken by both these kernels is drastically different. The kernel implemented using 1D access takes 20 times lesser time than the kernel implemented using 2D access. I am trying to understand why that is so. Kindly find the code snippets below.
Kernel implemented using 1D access :
__global__ void diff_mats_char_atomics11(
char *output,char *input1,char *input2,
unsigned int width,
unsigned int height,unsigned int pitch)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
output[index] = abs(input1[index] -input2[index]) ;
__syncthreads();
}
1D Kernel call :
int blockSize = 256;
int numBlocks = (pitch*height ) / blockSize;
diff_mats_char_atomics11<<<numBlocks, blockSize>>>( output,input1,input2,width,height,pitch);
Kernel implemented using 2D access :
__global__ void diff_mats_char_atomics17(
char *output,char *input1,char *input2,
unsigned int width,
unsigned int height,unsigned int pitch)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int nx = blockDim.x * gridDim.x;
int ny = blockDim.y * gridDim.y;
int kW = width / nx ;
int kH = ceilf(height / ny);
int offset = x*kW + pitch*y*kH ;
for(int row =0; row < kH; row++ )
{
for (int col = 0; col < kW; col++)
{
if(offset+col < pitch*height)
{
output[offset+col] = input1[offset+col];
}
}
offset += pitch;
}
__syncthreads();
}
2D Kernel call :
dim3 block(32, 32);
dim3 grid(2, 1);
diff_mats_char_atomics17<<<grid, block>>>( output,input1,input2,width,height,pitch);
Kindly help me reason out why such a big difference in time taken by both these methods. It will help me decide how to implement this operation better and correct the configuration in each of these methods if necessary.
Thanks.