Filter kernel takes ~2 sec: timing issue or CUDA kernel execution overhead?

Please look at the attached code.

( I ) Attachment details:


  1. Filter_8_Cross_8\win\project\FilterStandalone.sln



  1. - Application containing file I/O and Init function call.

  2. - Application level .cu file from which the call to the kernel is made. Also contains device level allocations.

  3. - contains kernel code.

  4. - contains system Init call. Calls function CUT_DEVICE_INIT() from inside.


  1. common_env.h
  2. Filter.h
  3. VQI_Common.h

( II ) Description:

I have written a kernel that applies a 13x13 filter to an image (the image currently used is 1920x1080, full HD).

cutStartTimer and cutStopTimer are used to profile the kernel code in file “”. cudaThreadSynchronize() is also called after both kernels, and the kernels call __syncthreads() internally where required.

The profiling reports 1992 ms, i.e. nearly 2 seconds, which cannot be right: this much computation should take on the order of 4-5 ms at most on CUDA.
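For comparison, here is a minimal sketch of timing a kernel with CUDA events instead of the cutil timers. It does one untimed warm-up launch first, because one-time startup costs on the first launch can inflate a measurement; whether that is what happened in the attached code is not established in this thread. `dummyKernel` and `timeKernelMs` are hypothetical names standing in for the real filter kernel and its host wrapper, and `cudaDeviceSynchronize()` is the current replacement for the deprecated `cudaThreadSynchronize()`.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the filter kernel; any kernel can be timed this way.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Returns the time of one kernel launch in milliseconds, or -1 on error.
float timeKernelMs(float *d_data, int n)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    // Untimed warm-up launch so one-time startup costs are not measured.
    dummyKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    // Timed launch: the events are recorded in the same stream as the kernel.
    cudaEventRecord(start);
    dummyKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel and 'stop' are done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // A failed launch would otherwise look like a suspiciously fast kernel.
    cudaError_t err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (err == cudaSuccess) ? ms : -1.0f;
}
```

If the timed launch still reports seconds rather than milliseconds after a warm-up, the time is most likely being spent inside the kernel itself rather than in launch overhead.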

I am missing something here. Can anyone please point out the mistake?

Thanks in advance…
Tanmay Anjaria

Any hint would be useful. Please share any suggestions you have.

I’m stuck just because of the timing issue…

Sorry, my initial post was hasty. I see you aren’t going over the entire image but have broken it into a block_width and block_height. Have you verified that you are doing your memory reads in an efficient manner? You don’t seem to be using any sort of shared memory. Also, why do you have two kernels? Why not just combine them into one? The first kernel doesn’t seem to do anything but loop over the block, and the same goes for the second. You also don’t seem to set your grow_right or grow_left variables, which could cause the first kernel to take a really long time.
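To make the shared memory suggestion concrete, here is a rough sketch of a single tiled 13x13 filter kernel. It is not the attached code: the names (`filter13x13Shared`, `runFilter`, `c_filter`), the float pixel format, the 16x16 tile size, and the clamp-to-edge border handling are all assumptions chosen for illustration. Each block loads its tile plus a 6-pixel halo into shared memory once, so the 169 reads per output pixel hit shared memory instead of global memory.

```cuda
#include <cuda_runtime.h>

#define FILTER_SIZE 13
#define RADIUS (FILTER_SIZE / 2)          // 6-pixel halo on each side
#define TILE 16                           // assumed block size: 16x16 threads
#define SHARED_DIM (TILE + 2 * RADIUS)    // 28x28 floats = 3136 bytes

// Filter coefficients in constant memory (assumed float weights).
__constant__ float c_filter[FILTER_SIZE * FILTER_SIZE];

__global__ void filter13x13Shared(const float *in, float *out,
                                  int width, int height)
{
    __shared__ float tile[SHARED_DIM][SHARED_DIM];

    // Top-left corner of this block's tile, including the halo.
    const int tileX = blockIdx.x * TILE - RADIUS;
    const int tileY = blockIdx.y * TILE - RADIUS;

    // Cooperative load: 256 threads fill 784 shared elements in strides.
    for (int idx = threadIdx.y * TILE + threadIdx.x;
         idx < SHARED_DIM * SHARED_DIM; idx += TILE * TILE) {
        int sx = idx % SHARED_DIM;
        int sy = idx / SHARED_DIM;
        // Clamp-to-edge border handling (an assumption, not from the thread).
        int gx = min(max(tileX + sx, 0), width - 1);
        int gy = min(max(tileY + sy, 0), height - 1);
        tile[sy][sx] = in[gy * width + gx];
    }
    __syncthreads();   // every thread reaches this before any early return

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    // Correlation (filter not flipped) over the 13x13 neighbourhood.
    float sum = 0.0f;
    for (int fy = 0; fy < FILTER_SIZE; ++fy)
        for (int fx = 0; fx < FILTER_SIZE; ++fx)
            sum += c_filter[fy * FILTER_SIZE + fx] *
                   tile[threadIdx.y + fy][threadIdx.x + fx];
    out[y * width + x] = sum;
}

// Host wrapper: uploads the filter, launches the kernel, reports any error.
cudaError_t runFilter(const float *d_in, float *d_out,
                      const float *h_filter, int width, int height)
{
    cudaError_t err = cudaMemcpyToSymbol(
        c_filter, h_filter, sizeof(float) * FILTER_SIZE * FILTER_SIZE);
    if (err != cudaSuccess) return err;

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    filter13x13Shared<<<grid, block>>>(d_in, d_out, width, height);

    err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();
}
```

The kernel must be launched with TILE x TILE blocks, since the cooperative load assumes that thread count. For a 1920x1080 image this gives a 120x68 grid.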

Considering that both your kernels don’t really do anything useful, I would say you are likely register heavy and are getting very poor occupancy. You pass a large number of parameters to both your kernels, which are likely using up a lot of registers. Have you checked what the occupancy of your kernels is?
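One way to check this, sketched below, is to ask the runtime for the kernel's register count and its theoretical occupancy. This assumes a toolkit recent enough to provide `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which is newer than the cutil-era toolkit this thread appears to use; on older toolkits, compiling with `nvcc --ptxas-options=-v` prints the register usage instead. `placeholderKernel` and `theoreticalOccupancy` are hypothetical names, and device 0 is assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; substitute the real filter kernel here.
__global__ void placeholderKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;
}

// Returns theoretical occupancy in (0, 1] for the given block size,
// or -1 on error. Queries device 0 (an assumption).
float theoreticalOccupancy(int blockSize)
{
    // Per-kernel resource usage as compiled: registers, static shared memory.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, placeholderKernel) != cudaSuccess)
        return -1.0f;

    // How many blocks of this size can be resident on one multiprocessor.
    int blocksPerSm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSm, placeholderKernel, blockSize, 0) != cudaSuccess)
        return -1.0f;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return -1.0f;

    // Occupancy = resident threads / maximum resident threads per SM.
    float occupancy = (float)(blocksPerSm * blockSize) /
                      (float)prop.maxThreadsPerMultiProcessor;

    printf("registers/thread: %d, static shared memory: %zu bytes\n",
           attr.numRegs, attr.sharedSizeBytes);
    printf("blocks/SM: %d, theoretical occupancy: %.2f\n",
           blocksPerSm, occupancy);
    return occupancy;
}
```

Calling `theoreticalOccupancy(256)` for a few candidate block sizes shows whether register pressure is actually limiting the number of resident blocks.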

Hope that helps.

Thanks for your inputs…

We introduced the shared memory implementation and got far better results than we were getting otherwise…

It really helped… :)

Thanks once again,