I discovered the following behaviour:
I have a kernel that takes a few hundred MB of input data in global memory. I have a function that initialises this data (cudaMalloc, cudaMemcpy) and, after the kernel has finished, releases it again (cudaFree).
However, if I call that function a second time (for the same data), both the initialisation and the kernel run are much faster than in the first call. Why is that? In my opinion, since the data was freed after the kernel ran the first time, there shouldn't be any noticeable speedup, because the whole cudaMalloc and cudaMemcpy sequence has to be done again.
Note: This only happens if I call my procedure twice in a program run. If I call it only once in the program but start the program twice in a row, both runs are “slow”.
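For reference, here is a minimal sketch of the pattern. It is not my actual code: the kernel `scale`, the helpers `grid_size` and `run_once`, the 256 MB buffer size and the std::chrono timing are all placeholders assumed for illustration. It calls the procedure twice in one program run and prints the wall-clock time of each call.

```cuda
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): doubles every element in place.
__global__ void scale(float* data, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Number of blocks needed to cover n elements (rounds up).
unsigned grid_size(size_t n, unsigned threads) {
    return static_cast<unsigned>((n + threads - 1) / threads);
}

// One full pass: cudaMalloc, cudaMemcpy, kernel, cudaFree.
// Returns elapsed wall-clock milliseconds, or -1.0 on any CUDA error.
double run_once(const float* host, size_t n) {
    auto t0 = std::chrono::steady_clock::now();

    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1.0;

    cudaError_t err = cudaMemcpy(dev, host, n * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const unsigned threads = 256;
        scale<<<grid_size(n, threads), threads>>>(dev, n);
        err = cudaGetLastError();                              // launch errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize(); // wait for kernel
    }
    cudaFree(dev);

    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return err == cudaSuccess ? ms : -1.0;
}

int main() {
    const size_t n = 64u * 1024u * 1024u;  // 64 Mi floats = 256 MB (assumed size)
    std::vector<float> host(n, 1.0f);

    // Same procedure, same data, called twice within one program run.
    std::printf("first call:  %.2f ms\n", run_once(host.data(), n));
    std::printf("second call: %.2f ms\n", run_once(host.data(), n));
    return 0;
}
```

Timing each call from the host side with a final cudaDeviceSynchronize means the measurement covers allocation, copy, kernel and free together, which is what I am comparing between the first and the second call.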