Trying to do some benchmarking...


I ported a program running on the CPU to one running on the GPU with CUDA, and I want to measure the effective speedup of this transformation. The problem is that I haven't found any function that fairly computes the real execution time (excluding OS functions and other graphics work running on the device) for both the CPU and GPU versions.

I tried using time() for the CPU, but what about the GPU?
Do “cutCreateTimer” and “cutStartTimer” do the same thing? With these functions I get a lot of fluctuation, and worse, the measured time doesn’t increase significantly when I increase the number of threads (even when the thread count far exceeds the capacity of the device).
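One possible explanation for the flat timings: kernel launches are asynchronous, so a host-side timer (like the cutil timers) may only be measuring the launch overhead, not the kernel itself. CUDA events time the work on the device directly. A sketch, where `myKernel` and its launch configuration are placeholders for your own:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel -- substitute your own.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void timeKernel(float *d_data, int n, int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    myKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the kernel has actually finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // device-side elapsed time in ms
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

The cudaEventSynchronize call is the key step: without it (or an equivalent cudaDeviceSynchronize before stopping a host timer), you stop the clock before the GPU is done.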

I also tried the CUDA profiler, with the CPU time computed, but its result is completely different from what “cutCreateTimer”/“cutStartTimer” report.

I’m quite lost in all this benchmarking. Can anyone help? Do you know a good tool for measuring execution time?