If you take a look at the convolutionSeparable example, it uses a CUDA timer. You simply call one function right before you launch the kernel, then make another function call after the kernel has finished.
CUTIL timers are not super-accurate for short kernels. You’re better off using events in a CUDA stream and taking the elapsed time between an event before and after your kernel if you’re trying to reliably measure anything very fast.
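To make the event suggestion concrete, here is a minimal sketch of timing a single launch with CUDA events. The kernel `scaleKernel` and the helper `timeKernelWithEvents` are placeholder names made up for illustration; swap in your own kernel and launch configuration.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: scales an array in place.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the GPU time of one launch in milliseconds, or -1.0f on error.
float timeKernelWithEvents(int n)
{
    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess)
        return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);                 // event before the kernel
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaEventRecord(stop, 0);                  // event after the kernel
    cudaEventSynchronize(stop);                // wait until 'stop' is reached

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return (err == cudaSuccess) ? ms : -1.0f;
}
```

Calling `timeKernelWithEvents(1 << 20)` from `main` and printing the result should give the elapsed GPU time between the two events. Because both events are recorded in the same stream as the kernel, the measurement does not depend on when the host thread happens to read its own clock.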
Thank you, I’ll try this ^^! But I don’t know whether adding “CUDA_SAFE_CALL( cudaThreadSynchronize() );” would help, because my program behaves the same with or without “cudaThreadSynchronize”!
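On the synchronize question: kernel launches are asynchronous, so a host-side timer that is stopped right after the launch may only capture the launch overhead. The computed results are the same either way, but the measured time can differ. Below is a rough sketch that measures both cases. It uses `cudaDeviceSynchronize()`, which replaces the now deprecated `cudaThreadSynchronize()`, and `std::chrono` as a stand-in for the CUTIL timer. `busyKernel`, `HostTimings`, and `timeWithHostClock` are hypothetical names chosen for this example.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel that does enough work to be measurable.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

struct HostTimings {
    double launchOnlyMs;  // timer stopped right after the launch
    double withSyncMs;    // timer stopped after synchronizing
    bool ok;              // false if any CUDA call failed
};

HostTimings timeWithHostClock(int n)
{
    HostTimings t = {0.0, 0.0, false};
    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess)
        return t;
    cudaMemset(d_data, 0, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    typedef std::chrono::steady_clock Clock;

    // Warm-up launch so one-time start-up cost is not measured.
    busyKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    // Case 1: no synchronization before stopping the timer.
    Clock::time_point t0 = Clock::now();
    busyKernel<<<blocks, threads>>>(d_data, n);
    Clock::time_point t1 = Clock::now();
    cudaDeviceSynchronize();  // drain the GPU before the next measurement

    // Case 2: synchronize, then stop the timer.
    Clock::time_point t2 = Clock::now();
    busyKernel<<<blocks, threads>>>(d_data, n);
    cudaError_t err = cudaDeviceSynchronize();
    Clock::time_point t3 = Clock::now();

    t.launchOnlyMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
    t.withSyncMs   = std::chrono::duration<double, std::milli>(t3 - t2).count();
    t.ok = (err == cudaSuccess);

    cudaFree(d_data);
    return t;
}
```

If both numbers come out about the same in your own program, something after the kernel (a blocking `cudaMemcpy`, for example) is probably already forcing the host to wait for the GPU.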
By the way, the mapping function is really slow! With the “boxFilter” example at 1024x768, I get about 32 to 40 fps (without running the kernel)! But if I remove the mapping and unmapping calls, I get 59 to 62 fps T_T!
I’m working on graphics processing and computer vision, so accurate timing is necessary! I can’t just show the user 64 fps every time!
I agree with you! My kernel takes about 16 ms to complete!
If I increase the loop to 100 iterations, the result is more accurate! But that is not how my program will actually run T_T!
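One way to reconcile the two is to keep the repeated loop in a separate benchmark helper and leave the real program launching the kernel once per frame. The sketch below averages over `repeats` launches using events. `addOneKernel` and `averageKernelTimeMs` are hypothetical names, and the warm-up launch is an assumption about what you want to exclude from the measurement.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: adds one to every element.
__global__ void addOneKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Returns the mean time per launch in milliseconds, or -1.0f on error.
float averageKernelTimeMs(int n, int repeats)
{
    if (repeats <= 0) return -1.0f;

    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess)
        return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch, not included in the measurement.
    addOneKernel<<<blocks, threads>>>(d_data, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int r = 0; r < repeats; ++r)
        addOneKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float totalMs = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&totalMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return (err == cudaSuccess) ? totalMs / repeats : -1.0f;
}
```

With `repeats` set to 100 this mirrors the loop described above, while `repeats` set to 1 gives the single-launch figure for comparison.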