Difference in running time

I am a beginner in GPU computing. Recently, while running a very simple vector-add program, I found that when I call the same CUDA program twice, the first run takes much longer than the second one. Is it because of the start-up time for the GPU hardware? Thanks!

It may be many things:

it may be a race condition - the first run essentially 'sets up' the second run, which then happens to succeed
it may be memory-related
if it were the other way around, it might have been power-related
it may be that the kernel(s) do not even run the second time around, for a number of reasons

Do the standard checks - memcheck and racecheck -
and make sure to do proper error checking on/after the API calls; see the sketch below.
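
A minimal sketch of such error checking (the macro name and structure are my own, not from this thread):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: checks the return value of a CUDA runtime call
// and reports the failing file/line before exiting.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage after a kernel launch:
//   myKernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution
```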

Thanks Jimmy. I have done error checking after every API call, and the kernel must have run both times, since the input vectors are different and I have checked the output vectors, which are correct. Thank you for your answer!

That is indeed another possibility - execution time may very well depend on the (input) data.
That is certainly the case when the data determines the number of iterations or the termination point; a small illustration follows below.
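
As a hedged illustration (this kernel is made up for the example, not taken from the thread), here is the kind of kernel whose run time is driven by the input values:

```
// Hypothetical kernel: each thread loops as many times as its input value says,
// so total execution time depends directly on the data passed in.
__global__ void dataDependent(const int* niters, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = 1.0f;
    for (int k = 0; k < niters[i]; ++k)  // iteration count comes from the data
        x = x * 0.5f + 1.0f;
    out[i] = x;
}
```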

Could well be the just-in-time compilation of PTX code to your target architecture, or other driver initialization overhead.

The former can be prevented by compiling your program with binary code for your target architecture included.
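
A rough sketch of such a build line (the architecture sm_86 and the file name are placeholders for your actual target and source):

```
# Embed binary (SASS) code for the target GPU so no JIT of PTX is needed at first launch.
nvcc -o vector_add vector_add.cu \
     -gencode arch=compute_86,code=sm_86 \
     -gencode arch=compute_86,code=compute_86   # also keep PTX as a forward-compatibility fallback
```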

One thing you did not mention in your post is whether you send the data to GPU RAM again for the second run.

Having said that, if you take a look at the samples provided by NVIDIA, they typically invoke the kernel once, then invoke it again multiple times and take the average of those runs. The first invocation is for “warm-up”.
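
A minimal sketch of that pattern (the problem size and launch configuration here are my own assumptions, not values from the thread):

```
#include <cstdio>
#include <cuda_runtime.h>

// Simple vector-add kernel, as in the original question.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);

    // Warm-up launch: absorbs one-time costs (context creation, PTX JIT, ...).
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    // Time several launches and report the average, as the NVIDIA samples do.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int reps = 100;
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        vecAdd<<<grid, block>>>(a, b, c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average kernel time: %f ms\n", ms / reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```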

Thank you guys! Your suggestions are all so helpful and have helped me learn more about CUDA and GPUs. Since I am rather new to this field, it will still take me some time to thoroughly understand your advice. Thanks!