To my surprise, the first run was significantly slower than the rest. Further, I found that in the first run, a significant amount of time was spent on initializing the memory. In later runs, however, memory initialization took almost no time.
I am certain that I have freed all the memory resources using cudaFree(), so why am I still getting such results?
Are there any CUDA API calls prior to the loop? If not, CUDA context initialization will happen in the first loop iteration, making that iteration take much longer. Try issuing a cudaFree(0) prior to the loop. This will trigger CUDA context creation early, before the loop starts.
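To see how much of the first iteration is really context creation, the cudaFree(0) call can be timed on its own before the loop. This is only a rough sketch: warmUpContextMs is a made-up helper name, and it assumes the runtime API creates the context lazily on the first call that needs it, which may vary between CUDA versions.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: forces CUDA context creation to happen now and returns
// the wall-clock time of the call in milliseconds, or a negative value on error.
double warmUpContextMs() {
    auto t0 = std::chrono::steady_clock::now();
    // Freeing a null pointer performs no deallocation, but it is a runtime
    // API call, so any pending context initialization is done here.
    cudaError_t err = cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree(0) failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling this once before the timed loop and printing the result should show roughly where the extra time of the first iteration went. A second call would be expected to return quickly, since the context already exists.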
In general, for any loop, you would want to determine steady-state performance by ignoring the first few loop iterations, giving various performance-boosting CPU and GPU mechanisms (in particular caches and TLBs) time to warm up.
From a performance perspective, the number of dynamic memory allocation operations should be minimized, and in particular you don’t want repeated allocation and deallocation inside a loop. Allocate memory once and keep re-using it. That applies to both CPU-only and hybrid CPU/GPU code.
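As a minimal sketch of the allocate-once pattern: the device buffer is allocated before the loop, re-used in every iteration, and freed after the loop. scaleKernel and runReusingBuffer are hypothetical names used for illustration only, and the kernel is just a placeholder for real work.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: scales each element in place.
__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launches the kernel numIters times on one device buffer that is allocated
// once before the loop and freed once after it. Assumes n > 0.
cudaError_t runReusingBuffer(const float* hostIn, float* hostOut, int n, int numIters) {
    float* dev = nullptr;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    cudaError_t err = cudaMalloc(&dev, bytes);   // single allocation, outside the loop
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(dev, hostIn, bytes, cudaMemcpyHostToDevice);
    for (int it = 0; err == cudaSuccess && it < numIters; ++it) {
        scaleKernel<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);
        err = cudaGetLastError();                // catches launch errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hostOut, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);                               // single deallocation, after the loop
    return err;
}
```

Moving the cudaMalloc and cudaFree calls inside the for loop would give the allocate-per-iteration pattern that the advice above recommends against.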
Thanks for the answer. In my case, I am writing a paper in which I am supposed to report the efficiency of my algorithm. Therefore, I have to include the allocation and deallocation in the loop to make sure each run of my algorithm is complete. My question is: to accurately reflect the performance of an algorithm, is it better to time it in the steady state, or before any warm-up has taken place?
If the answer is the latter, how do I make sure that after each loop, the CUDA context is destroyed and other performance-boosting mechanisms that you’ve mentioned are cooled down?
As I said, performance should be determined by measuring in steady-state after a warm-up phase. One benchmarking approach is to set e.g. numRuns = 10, and then report the time of the fastest of the ten runs.
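One way the warm-up plus fastest-of-numRuns approach could look, as a sketch rather than a definitive harness: workKernel and fastestRunMs are hypothetical names, the kernel is a stand-in for the code under test, and the number of warm-up launches is an arbitrary choice by the caller. Timing uses CUDA events, so only GPU execution of the kernel is measured.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being benchmarked.
__global__ void workKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 1.0001f + 1.0f;
}

// Returns the fastest of numRuns timed launches in milliseconds, measured
// after numWarmup untimed launches. Returns a negative value on a CUDA error
// or if numRuns is not positive. Assumes n > 0 and dev holds n floats.
float fastestRunMs(float* dev, int n, int numWarmup, int numRuns) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up phase: untimed launches, then wait for them to finish.
    for (int i = 0; i < numWarmup; ++i) workKernel<<<blocks, threads>>>(dev, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    // Timed phase: keep the minimum over numRuns launches.
    float best = FLT_MAX;
    for (int r = 0; ok && r < numRuns; ++r) {
        cudaEventRecord(start);
        workKernel<<<blocks, threads>>>(dev, n);
        cudaEventRecord(stop);
        float ms = 0.0f;
        ok = (cudaEventSynchronize(stop) == cudaSuccess) &&
             (cudaEventElapsedTime(&ms, start, stop) == cudaSuccess);
        if (ok && ms < best) best = ms;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (ok && numRuns > 0) ? best : -1.0f;
}
```

With a device buffer dev of n floats already allocated, a call such as fastestRunMs(dev, n, 3, 10) would correspond to numRuns = 10 with three warm-up launches. Host-side costs such as allocation are not covered by the event timing; measuring those would need a host timer around the whole iteration.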
It makes no sense to include the time for allocation and deallocation unless this reflects the actual usage pattern in the real-life application (and I would argue that an app that follows the pattern in your loop is likely poorly designed).
If you decide to keep the current setup, note that for components other than the kernel itself, performance will be determined by the speed of the host system, in particular single-thread CPU performance and the performance of the host’s system memory.