Please take a look at the attached code.
( I ) Attachment details:
FilterMain.cu - Application file containing the file I/O and the Init function call.
Filter.cu - Application-level .cu file from which the kernels are launched. Also contains the device-level allocations.
Filter_CUDA.cu - Contains the kernel code.
VQI_Common.cu - Contains the system Init call, which in turn calls CUT_DEVICE_INIT().
( II ) Description:
I have written a kernel that applies a 13x13 filter to an image (the image currently used is 1920x1080, i.e. full HD).
cutStartTimer and cutStopTimer are used in “Filter.cu” to profile the kernel code, and
cudaThreadSynchronize() is called after both kernels. The kernels themselves call __syncthreads() where required.
The profiling reports a time of 1992 ms, i.e. nearly 2 seconds, which cannot be right: this amount of computation should
take on the order of 4-5 ms at most on CUDA.
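For reference, here is a minimal sketch of how the kernel time could be isolated from any one-time startup cost, in case that is part of what the timer is picking up. This is an assumption on my part, not a diagnosis. The sketch uses CUDA events instead of the cutil timers, does one untimed warm-up launch, and then times a second launch. `placeholderKernel`, `filterMacCount` and `timeKernelMs` are hypothetical names made up for illustration; the kernel is a trivial stand-in and is not the filter code from the attachment.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: a trivial stand-in, NOT the filter from the attachment.
__global__ void placeholderKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Multiply-adds needed to apply a k x k filter to every pixel of a width x height image.
long long filterMacCount(int width, int height, int k)
{
    return (long long)width * height * k * k;
}

// Times one launch of the placeholder kernel with CUDA events, after an
// untimed warm-up launch. Returns milliseconds, or -1 if allocation fails.
float timeKernelMs(int n)
{
    float *d_out = NULL;
    if (cudaMalloc((void **)&d_out, n * sizeof(float)) != cudaSuccess) return -1.0f;

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Warm-up launch: any one-time startup cost lands here, outside the timed region.
    placeholderKernel<<<grid, block>>>(d_out, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Timed launch: only the work between the two events is measured.
    cudaEventRecord(start, 0);
    placeholderKernel<<<grid, block>>>(d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the kernel and the stop event have completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}

int main()
{
    const int width = 1920, height = 1080, k = 13;
    printf("multiply-adds per frame: %lld\n", filterMacCount(width, height, k));

    float ms = timeKernelMs(width * height);
    if (ms < 0.0f) { printf("CUDA allocation failed\n"); return 1; }
    printf("placeholder kernel: %.3f ms\n", ms);
    return 0;
}
```

If the warm-up launch is timed as well and turns out much slower than the second launch, that would suggest the 2 seconds is mostly startup overhead rather than the kernel itself.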
I am missing something here. Can anyone please point out the mistake?
Thanks in advance…
Filter_8_Cross_8.zip (936 KB)