Measuring the speed of a CUDA program: is there another way to do this?

Hi everyone!

I've tried the boxFilter sample and always get the same result: 62 to 64 fps!

  • I've removed the line of code that calls the kernel and still get 62 to 64 fps!

  • I've put the line of code that calls the kernel into a loop like this:

for (int i = 0; i < 10; ++i)
{
    // Code for calling the kernel here
}

And the result is still 62 to 64 fps T_T, oh my god!

How do I find out the real speed of my program? It makes me so confused T_T!


If you take a look at the convolutionSeparable example, you'll see it uses a CUTIL timer. You simply call a function right before you launch the kernel, then make another function call after the kernel has finished:

CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutResetTimer(hTimer) );
CUT_SAFE_CALL( cutStartTimer(hTimer) );

// Call your kernel here

CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutStopTimer(hTimer) );
gpuTime = cutGetTimerValue(hTimer);
printf("GPU convolution time : %f msec //%f Mpixels/sec\n", gpuTime, 1e-6 * DATA_W * DATA_H / (gpuTime * 0.001));

This will give you a very precise time showing how long your kernel took to execute.

CUTIL timers are not super-accurate for short kernels. If you're trying to reliably measure anything very fast, you're better off using events in a CUDA stream: record one event before your kernel and one after it, then take the elapsed time between them.
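A minimal sketch of the event-based approach, assuming a hypothetical kernel `scaleKernel` as a stand-in for whatever you want to measure. The buffer size and launch configuration are arbitrary, and error checking is left out to keep it short:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only as something to time.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;               // arbitrary problem size
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Both events go into stream 0, the same stream the kernel runs in.
    cudaEventRecord(start, 0);
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaEventRecord(stop, 0);

    // The launch is asynchronous, so wait until the stop event has been reached.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("Kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Because the events are timestamped on the GPU, the result should not include the host-side overhead that a CPU timer picks up.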

Thank you, I'll try this ^^! But I don't know whether adding "CUDA_SAFE_CALL( cudaThreadSynchronize() );" will help, because my program already measures the time the same way, just without cudaThreadSynchronize!

By the way, the mapping functions are really slow! With the boxFilter example at 1024x768, I get about 32 to 40 fps (without running the kernel)! But if I remove the mapping and unmapping calls, I get 59 to 62 fps T_T!

I work in graphics processing and computer vision, so accurate speed measurements are necessary! I can't show the user 64 fps every time!

I agree with you! My program's kernel takes about 16 ms to complete!

If I increase the loop to 100 iterations, the result is more accurate! But this is not the way my program will run T_T!
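One way around that is to keep the loop in a separate benchmark build and leave the real program alone. A rough sketch, assuming a hypothetical kernel `workKernel` and an arbitrary count of 100 launches. It times the whole batch with CUDA events and divides by the number of launches:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one.
__global__ void workKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 0.5f + 1.0f;
}

int main()
{
    const int n = 1 << 20;        // arbitrary problem size
    const int iterations = 100;   // number of timed launches
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Untimed warm-up launch so one-time startup cost is not measured.
    workKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iterations; ++i)
        workKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait for all timed launches to finish

    float totalMs = 0.0f;
    cudaEventElapsedTime(&totalMs, start, stop);
    printf("Average kernel time: %f ms over %d launches\n",
           totalMs / iterations, iterations);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

The average smooths out timer resolution and launch jitter, while the shipping program still calls the kernel once per frame.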

Thank you!