How to decide block and grid size for kernel code: optimizing an image processing kernel

Currently, in my application I have to calculate the pixel-by-pixel absolute difference between 2 HD images (1920x1080).

Each pixel is 1 byte, and the image data is of type “unsigned char”.

I am using 8800 GTX graphics card.

The application runs on Windows and is built with Visual Studio 2005.

I have tried using 1 grid, 240(1980/8) blocks and 135(1080/8) threads per block. The time taken to calculate this is too high. It goes in seconds. Can anyone suggest a way to optimize this implementation?

We have also tried implementing the same function using the CUBLAS APIs, but CUBLAS operates only on float data, and converting the byte data to float and then transferring the float data to the GPU eats up a lot of time. So we decided to implement our own kernel for this calculation.

A block size of 135 is not optimal. The smallest unit of execution on the device is a warp of 32 threads, so your block size should be a multiple of 32. A probably related issue is that with such a block size your memory reads are not coalesced; that alone can change performance by a factor of 20.

My suggestion for tuning the block size is to write the algorithm in a block-size-independent manner, then benchmark all block sizes from 32 up to the maximum your register usage allows, in multiples of 32, of course. Then choose the fastest block size.

I am really curious whether this will ever have great performance… Assuming I understand the problem correctly, each pixel needs two reads, a subtraction, a predicated sign change, and a write, and that ignores the overhead of transferring the images. Please share your timing results when ready.

If your image's x dimension is 1920 and its y dimension is 1080, and pixels are addressed as x + y*DimX, then I think you should exchange your grid size and block size. It seems as if your blocks walk through the image in the x direction, so the threads within a block step in y, strided by DimX; that prevents coalesced memory access and could explain your execution times.

I don’t think this is true. I believe memory access can still be coalesced even if the number of threads is not a multiple of the warp size. At least I see significant performance gains in my implementation when aiming for coalesced access.

You are correct, seb. The proper padding can still lead to coalesced reads in that situation. Hence why I said “probably” :)

I don’t see the point of thinking of this as an image. You’re just computing the absolute difference of 2073600 unsigned chars. Nothing prevents you from processing elements with different y in a single block; a “standard” block size such as 256 would be quite sufficient.
Also, reading and writing as int may help. According to the documentation, unsigned char reads are not coalesced.

Definitely packing the unsigned chars into 32-bit ints will help. I found that I had to do the same when reading an array of 16-bit shorts to get good performance.

I just did a quick test, and it looks like subtracting two 1920x1080x8-bit images should take about 8 ms, including the memory copies. My test was on a busy system, so your results may be faster. I couldn’t allocate enough page-locked memory to hold the two input images and the output image, which surprised me a little, since that’s only 6 MB.

Note that I have not checked it for correctness, I just wanted a rough estimate of the performance:

#include <stdio.h>
#include <stdlib.h>

#include "cuda.h"
#include "cutil.h"

// Extract byte 'num' (0 = least significant) from a packed 32-bit word.
__device__ unsigned char byte(unsigned int b, int num)
{
  return (b >> (8 * num)) & 0xFF;
}

// Per-byte absolute difference of two packed 32-bit words,
// one word (4 pixels) per thread.
__global__ void diff(unsigned int *a, unsigned int *b, unsigned int *result)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;

  unsigned int a_local = a[index];
  unsigned int b_local = b[index];
  unsigned int result_local = 0;

  result_local |= abs((int) byte(a_local, 3) - (int) byte(b_local, 3));
  result_local <<= 8;
  result_local |= abs((int) byte(a_local, 2) - (int) byte(b_local, 2));
  result_local <<= 8;
  result_local |= abs((int) byte(a_local, 1) - (int) byte(b_local, 1));
  result_local <<= 8;
  result_local |= abs((int) byte(a_local, 0) - (int) byte(b_local, 0));

  result[index] = result_local;
}

int main()
{
  int elements = 1920 * 1080 / 4; // 4 bytes per int
  int malloc_size = sizeof(unsigned int) * elements;

  unsigned int *a_host = (unsigned int *) malloc(malloc_size);
  unsigned int *b_host = (unsigned int *) malloc(malloc_size);
  unsigned int *result_host = (unsigned int *) malloc(malloc_size);

  unsigned int *a_device, *b_device, *result_device;
  CUDA_SAFE_CALL(cudaMalloc((void**) &a_device, malloc_size));
  CUDA_SAFE_CALL(cudaMalloc((void**) &b_device, malloc_size));
  CUDA_SAFE_CALL(cudaMalloc((void**) &result_device, malloc_size));

  dim3 dimBlock(256);
  dim3 dimGrid(elements / dimBlock.x);

  unsigned int timer = 0;
  CUT_SAFE_CALL(cutCreateTimer(&timer));
  CUT_SAFE_CALL(cutStartTimer(timer));

  int iterations = 500;
  for (int i = 0; i < iterations; i++) {
    CUDA_SAFE_CALL(cudaMemcpy(a_device, a_host, malloc_size, cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL(cudaMemcpy(b_device, b_host, malloc_size, cudaMemcpyHostToDevice));
    diff<<<dimGrid, dimBlock>>>(a_device, b_device, result_device);
    CUDA_SAFE_CALL(cudaMemcpy(result_host, result_device, malloc_size, cudaMemcpyDeviceToHost));
  }

  CUDA_SAFE_CALL(cudaThreadSynchronize());
  CUT_SAFE_CALL(cutStopTimer(timer));

  float total_time = cutGetTimerValue(timer);
  printf("time per image: %f milliseconds\n", total_time / iterations);

  return 0;
}
This code uses the cutil library included with the SDK, so you’ll have to link with it.