Is there some sort of access time for the card?

I was testing my code and found different performance depending on whether it is executed for the first time or shortly after another execution.
My code is a function called from MATLAB through a MEX interface; it performs a series of computations on a vector and makes use of shared memory.
If I call the function from MATLAB and measure the execution time, I get 0.10 seconds the first time. If I call it again straight afterwards, I get 0.07 seconds. If I write a script that calls the function twice, one line after the other, I get 0.01 seconds for the second call (and any successive ones). I repeated this test several times, and the results are consistent.

I first thought the problem could be that I have only one card, which both performs the calculations and drives the monitor, as if it had to take its hands off the monitor before it could run my code the first time.

But I tried the same code on a remote machine with no monitor and two cards, and I see the same behavior. It is even more pronounced since the card is much faster (a Quadro 4000 vs. an NVS 290): in the first case I get the usual 0.10 seconds, while in the second it goes down to 0.0002 s!

The first system is Windows XP 32-bit + MATLAB R2010 + NVS 290 (driver v4.0, compute capability 1.1).
The second is Linux SUSE LE 11 64-bit + MATLAB R2008 + Quadro 4000 (driver v4.0, compute capability 2.0).

Can anyone give me a clue as to what is going on?

It could be a number of things. I remember back when I was using Java, this phenomenon would occur; it was due to the HotSpot JIT compiler in the JVM: the second time through, code previously identified as a hot spot would already be compiled to native code, which yielded tangible speedups. Whether that's what is happening here, I'm not sure; but in general, when timing code, you want to do some warm-up runs beforehand (or at least, a professor of mine who worked in HPC seemed to think so).

I guess this didn’t really answer your question, but hopefully it was helpful…

That's the context setup time. The first runtime API call (like cudaMalloc) causes the library to initialize itself, and hence takes a lot of time. Subsequent calls will go fine.
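
For what it's worth, a common trick is to pay that initialization cost explicitly before you start timing. Here is a minimal sketch using the plain CUDA runtime API (no MEX wrapper; the kernel launch is left as a placeholder, and error checking is trimmed for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaEvent_t start, stop;
    float ms = 0.0f;

    // Warm-up: the first runtime API call builds the CUDA context.
    // cudaFree(0) is a common idiom to force this cost up front.
    cudaFree(0);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Now time the actual work; this no longer includes context setup.
    cudaEventRecord(start);
    // ... launch your kernel here ...
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

In the MEX case you could do the same thing by calling the function once as a throwaway warm-up (or issuing a dummy cudaFree(0) in the MEX gateway) before measuring.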