Kernel execution times vary on multiple launches.

Hi all,

I am launching a kernel multiple times from within a loop, but strangely the GPU time is different for each launch. I am reading the GPU time from cudaProf, and the times vary by about 12%.

Can anyone explain why this is so?

Thanks.

If it is just the first invocation that is slower than the others, then that could be explained by just-in-time compilation of the PTX to device code on the first invocation.

Apart from that, I would not worry too much about 12%. Variations of that size might, for example, be caused by different scheduling on different invocations. (I think scheduling was deterministic at least pre-Fermi, but Nvidia makes no guarantees about it.)
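To check whether only the first launch is the outlier, it can help to separate it from the rest before looking at the spread. Below is a minimal host-side sketch in C++, assuming you have already collected one GPU time per launch (for example from the cudaProf output, or from cudaEventElapsedTime). The names LaunchStats and summarize_launches are made up for illustration and are not part of any CUDA API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: summary of per-launch GPU times.
// The unit is whatever the input uses (e.g. microseconds).
struct LaunchStats {
    double first;        // first launch, which may include one-time costs such as JIT
    double min_rest;     // fastest of the remaining launches
    double median_rest;  // median of the remaining launches
    double max_rest;     // slowest of the remaining launches
    double spread_pct;   // (max_rest - min_rest) / min_rest * 100
};

// Treats times[0] as a warm-up launch and summarizes the others.
// Assumes all times are positive, so the division below is safe.
LaunchStats summarize_launches(const std::vector<double>& times) {
    assert(times.size() >= 2 && "need the first launch plus at least one more");

    // Copy everything after the first launch and sort it.
    std::vector<double> rest(times.begin() + 1, times.end());
    std::sort(rest.begin(), rest.end());

    const std::size_t n = rest.size();
    const double median = (n % 2 == 1)
        ? rest[n / 2]
        : 0.5 * (rest[n / 2 - 1] + rest[n / 2]);

    const double spread = (rest.back() - rest.front()) / rest.front() * 100.0;
    return {times[0], rest.front(), median, rest.back(), spread};
}
```

If first is much larger than max_rest while spread_pct is small, the variation is probably a one-time startup cost. If spread_pct itself is still around 12%, the run-to-run variation is real, and feeding min_rest or median_rest into a performance model is likely more stable than using any single launch.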

For all intents and purposes, scheduling is not deterministic.

Can it be due to partition camping? I read about it in the transpose example in the SDK. A 12% difference is quite a lot for me, as I am working on a performance model and this difference makes my model go haywire.