Second kernel run is faster than first run


I discovered the following behaviour:
I have a kernel that takes a few hundred MB of input data in global memory. I have a function that sets up this data (cudaMalloc, cudaMemcpy) and then, after the kernel has finished, releases it again (cudaFree).

However, if I call that function a second time (with the same data), both the kernel run and the initialization are much faster than in the first run. Why is that? As I see it, since the data was freed after the first kernel run, there shouldn’t be any noticeable speedup, because the whole cudaMalloc and cudaMemcpy has to be done again.

Note: This only happens if I call my procedure twice in a program run. If I call it only once in the program but start the program twice in a row, both runs are “slow”.

There are a couple of possible explanations. The most significant slowdown on the first kernel invocation usually comes from just-in-time compilation of PTX code to SASS instructions, which happens if the executable contains no binary code for your GPU architecture. Make sure you include binary code for your architecture when compiling (for example with nvcc's -arch or -gencode options).

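Another possible explanation is context initialization: the CUDA runtime creates its context lazily, during the first runtime API call in the process. This also fits your note, since every newly started program has to create its own context and pays that cost again. You can see the context initialization on the timeline in NVVP (the NVIDIA Visual Profiler), which gives you a clear view of why the first function call is slower than the second one.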
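
If you want to check this without a profiler, a rough timing sketch like the one below may help. It is only an illustration under assumptions: the kernel `scale`, the helper `run_once`, the `CHECK` macro and the 256 MiB buffer size are all made up here and stand in for your real code. It times an explicit warm-up call (`cudaFree(0)`, a common idiom to force context creation) and then two identical malloc/copy/kernel/free passes.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical stand-in for the real kernel: doubles every element in place.
__global__ void scale(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

static double ms_since(std::chrono::steady_clock::time_point t0) {
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// One full pass as in the question: allocate, copy, run the kernel, free.
// Returns the elapsed wall-clock time in milliseconds.
static double run_once(const float *host, size_t n) {
    auto t0 = std::chrono::steady_clock::now();
    float *dev = nullptr;
    CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CHECK(cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice));
    unsigned blocks = (unsigned)((n + 255) / 256);
    scale<<<blocks, 256>>>(dev, n);
    CHECK(cudaGetLastError());       // catches launch errors
    CHECK(cudaDeviceSynchronize());  // wait until the kernel has finished
    CHECK(cudaFree(dev));
    return ms_since(t0);
}

int main() {
    const size_t n = (size_t)64 << 20;  // 64 Mi floats = 256 MiB (assumed size)
    std::vector<float> host(n, 1.0f);

    // Warm-up: forces context creation before anything is timed as "work".
    // Comment these three lines out to see the cost move into the first run.
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaFree(0));
    printf("warm-up (context init): %.1f ms\n", ms_since(t0));

    printf("first run:  %.1f ms\n", run_once(host.data(), n));
    printf("second run: %.1f ms\n", run_once(host.data(), n));
    return 0;
}
```

With the warm-up in place, the one-time context cost should show up in the first line of output instead of in the first run. If the first run is still clearly slower than the second, JIT compilation is a likely remaining cause; in that case try building with binary code for your GPU, e.g. `nvcc -arch=sm_XX timing.cu`, where `XX` is your card's compute capability and `timing.cu` is a placeholder file name.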