Timing kernel in a loop

Hi everybody,
I’ve been timing some kernels in a large project. One of the kernels is called 5 times in a loop.
Does anybody know why the first call always takes longer than the others (only a little, but it is measurable)?


My guess is that it is because of caching.

Caching between the different kernel calls, maybe…
I'll have to investigate!

There is some driver overhead in loading a new kernel binary onto the GPU. This happens automatically the first time you run the kernel, which is why the first execution is slightly longer.
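
A common way to keep that one-time cost out of the measurements is to do an untimed warm-up launch before the timed loop. The sketch below assumes CUDA (the thread does not say which API is used) and uses a hypothetical stand-in kernel named `scale`, since the real kernel is not shown. It times each of the 5 calls with CUDA events after one warm-up call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the kernel from the original project is not shown.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Untimed warm-up launch: absorbs one-time costs such as
    // loading the kernel binary onto the GPU.
    scale<<<blocks, threads>>>(d_data, n, 1.0f);
    cudaDeviceSynchronize();

    // Timed loop: each launch is bracketed by two events.
    for (int run = 0; run < 5; ++run) {
        cudaEventRecord(start);
        scale<<<blocks, threads>>>(d_data, n, 1.0f);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("run %d: %.3f ms\n", run, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

If the first timed run is still slower after a warm-up, the difference probably comes from something other than the initial kernel load, and it would be worth looking at the individual timings more closely.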