Execution time questions (1st invocation & cudaDeviceMapHost)

The first kernel I launch always takes longer to run (roughly 2x) than subsequent identical launches. I do call cudaThreadSynchronize(), so I believe my timings are accurate. Is this behaviour that others have noticed? Does the driver do some "warm up" configuration on the first kernel invocation?
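In case it helps anyone reproduce this, here is a minimal sketch of the kind of measurement I mean. It is not my actual code: the kernel scaleKernel, the problem size, and the launch configuration are placeholders chosen for illustration. It times the same launch several times with CUDA events so the first run can be compared against the later ones.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): scales each element in place.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;                 // assumed problem size
    const int threads = 256;               // assumed block size
    const int blocks = (n + threads - 1) / threads;

    // Allocating here creates the context before any timing starts,
    // so context creation is not counted in the first measurement.
    float *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    cudaThreadSynchronize();               // cudaDeviceSynchronize() on newer toolkits

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time five identical launches; run 0 is the "first invocation".
    for (int run = 0; run < 5; ++run) {
        cudaEventRecord(start, 0);
        scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);        // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("run %d: %.3f ms\n", run, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Doing one untimed warm-up launch before the loop would be the obvious workaround if only steady-state numbers matter.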

Also, after calling cudaSetDeviceFlags(cudaDeviceMapHost), the execution times of all kernels increase almost 2x. I'm not actually using the features that cudaDeviceMapHost enables; I simply noticed that setting this flag affects kernel runtime. Is this to be expected?
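A rough sketch of how the two cases could be compared follows. Again, this is illustrative rather than my real code: the kernel addOneKernel, the "map" command-line switch, the sizes, and the iteration count are all assumptions. The program is run once without arguments and once with "map", and the average kernel time from each run is compared.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): adds one to each element.
__global__ void addOneKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main(int argc, char **argv)
{
    // Hypothetical switch: pass "map" to set cudaDeviceMapHost.
    bool useMapHost = (argc > 1 && strcmp(argv[1], "map") == 0);

    if (useMapHost) {
        // The flag has to be set before the CUDA context is created.
        cudaError_t err = cudaSetDeviceFlags(cudaDeviceMapHost);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaSetDeviceFlags failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }
    }

    const int n = 1 << 20;                 // assumed problem size
    const int threads = 256;               // assumed block size
    const int blocks = (n + threads - 1) / threads;
    const int iterations = 100;            // assumed number of timed launches

    float *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Untimed warm-up launch so first-invocation cost is excluded.
    addOneKernel<<<blocks, threads>>>(d_data, n);
    cudaThreadSynchronize();               // cudaDeviceSynchronize() on newer toolkits

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iterations; ++i)
        addOneKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaDeviceMapHost %s: %.4f ms per launch (average of %d)\n",
           useMapHost ? "set" : "not set", ms / iterations, iterations);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Averaging over many launches after a warm-up keeps the first-invocation effect from being mixed into the flag comparison.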

I'm using CUDA 2.3 with driver 190.18 on a Fedora 11 x64 box with two GTX 285 cards.

Thank you