Kernel runs faster under the profiler than from cmd?

I have noticed that when I run my code from the command line it takes ~5.5 ms, but when I run it under the CUDA profiler it only takes ~3.25 ms. Has anybody else noticed this? Is the profiler maybe passing some argument that optimizes GPU usage?

I really have no clue why this would be, but I thought I would ask.


Well, how did you measure it?

I used:



with the matching timer stop functions, followed by a printf of the elapsed time in milliseconds. The time is displayed in the cmd window and in a small display box in the profiler.

I think there are 2 possibilities:

  • The profiler has already initialized the card before your code runs
  • The profiler runs your code more than once, and you are seeing the second or third run, by which point the card is already initialized
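One way to check both possibilities is to repeat the timed region several times within a single process and print each run separately. The following is only a minimal sketch: the `scale` kernel, the buffer size, and the run count are hypothetical stand-ins, since the original code is not shown in this thread. It uses CUDA events rather than the poster's timer. If the first run is clearly slower than the later ones, one-time setup cost is probably leaking into the measurement.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the original poster's kernel is not shown.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Times one host-to-device copy, one kernel launch, and one
// device-to-host copy, and returns the elapsed time in milliseconds.
static float timedRun(float *hData, float *dData, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    size_t bytes = n * sizeof(float);

    cudaEventRecord(start, 0);
    cudaMemcpy(dData, hData, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dData, n, 2.0f);
    cudaMemcpy(hData, dData, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the stop event has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 20;  // assumed problem size
    float *hData = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) hData[i] = 1.0f;

    float *dData = NULL;
    // The first CUDA runtime call creates the context, outside the timed region.
    if (cudaMalloc((void **)&dData, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        free(hData);
        return 1;
    }

    // Run 0 may still include one-time costs; later runs show steady state.
    for (int run = 0; run < 5; ++run) {
        printf("run %d: %.3f ms\n", run, timedRun(hData, dData, n));
    }

    cudaFree(dData);
    free(hData);
    return 0;
}
```

Running the same binary from cmd and under the profiler and comparing the per-run numbers should show whether the gap is limited to the first run or persists across all of them.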

My timer does not include the initialization, just the memory transfers and kernel execution.

Could somebody else try running their own code and see whether they get the same trend?