Why does kernel memory initialization time vary for repeated kernel launches?

fdls2011 · March 11, 2018, 2:56am

I am trying to evaluate the efficiency of my CUDA program by running it repeatedly, using code like the following.

for (i = 0; i < numRuns; i++) {
    tic = clock();

    // memory initialization
    cudaMalloc(...);
    cudaMemcpy(...);
    cudaMemset(...);...

    myKernel(...);  // kernel launch

    // free the memory
    cudaMemcpy(...);
    cudaFree(...);

    toc = clock();
    runningTime = toc - tic;
}

To my surprise, the first run was significantly slower than the rest. Looking closer, I found that the first run spent significant overhead on initializing the memory, whereas in later runs memory initialization took almost no time.

I am certain that I have freed all the memory resources using cudaFree(), so why am I still getting such results?

Thank you!

njuffa · March 11, 2018, 5:46am

Are there any CUDA API calls prior to the loop? If not, CUDA context initialization will happen on the first loop iteration, making that take a lot longer. Try issuing a cudaFree(0) prior to the loop. This will trigger CUDA context creation early.
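To illustrate the suggestion above, here is a minimal sketch that issues cudaFree(0) before any timing starts and then times a few allocation/free pairs. The helper name timedAllocMs and the 1 MiB buffer size are assumptions made for this example, not part of the original program; std::chrono::steady_clock is used in place of clock() so that wall-clock time is measured.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock milliseconds for one cudaMalloc/cudaFree pair.
static double timedAllocMs(size_t bytes) {
    auto t0 = std::chrono::steady_clock::now();
    void* p = nullptr;
    if (cudaMalloc(&p, bytes) == cudaSuccess) {
        cudaFree(p);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // First CUDA runtime call: this is where context creation is paid for.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFree(0) failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("context creation: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    // With the context already created, these runs no longer include that cost.
    for (int i = 0; i < 3; ++i) {
        printf("run %d: alloc+free %.3f ms\n", i, timedAllocMs(1 << 20));
    }
    return 0;
}
```

If the cudaFree(0) call is removed, the context creation cost should show up in run 0 instead.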

In general, for any loop, you would want to determine steady-state performance by ignoring the first few loop iterations, giving various performance-boosting CPU and GPU mechanisms (in particular caches and TLBs) time to warm up.

From a performance perspective, the number of dynamic memory allocation operations should be minimized, and in particular you don’t want repeated allocation and deallocation inside a loop. Allocate memory once and keep re-using it. That applies to both CPU-only and hybrid CPU/GPU code.
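A rough sketch of the allocate-once pattern described above might look as follows. The kernel scaleKernel, the helper runOnce, and the problem size are hypothetical stand-ins for the poster's myKernel and data, which are not shown in the thread.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for myKernel: doubles every element.
__global__ void scaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// One complete run that reuses an existing device buffer:
// upload the input, launch the kernel, download the result.
static void runOnce(float* dev, const std::vector<float>& in, std::vector<float>& out) {
    const int n = static_cast<int>(in.size());
    const size_t bytes = in.size() * sizeof(float);
    cudaMemcpy(dev, in.data(), bytes, cudaMemcpyHostToDevice);
    scaleKernel<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(out.data(), dev, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
}

int main() {
    const int n = 1 << 20;
    const int numRuns = 10;
    std::vector<float> in(n, 1.0f), out(n, 0.0f);

    // Allocate once, outside the loop.
    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    for (int i = 0; i < numRuns; ++i) {
        runOnce(dev, in, out);  // no allocation or deallocation per run
    }

    // Free once, after the loop.
    cudaFree(dev);
    printf("out[0] = %.1f\n", out[0]);
    return 0;
}
```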

fdls2011 · March 11, 2018, 6:04am

Thanks for the answer. In my case, I am writing a paper in which I am supposed to report the efficiency of my algorithm. Therefore, I have to include the allocation and deallocation in the loop to make sure each run of my algorithm is complete. My question is: to accurately reflect the performance of an algorithm, is it better to time it in the steady state, or before warming up?

If the answer is the latter, how do I make sure that after each loop, the CUDA context is destroyed and other performance-boosting mechanisms that you’ve mentioned are cooled down?

Thank you!

njuffa · March 11, 2018, 6:21am

As I said, performance should be determined by measuring in steady-state after a warm-up phase. One benchmarking approach is to set e.g. numRuns = 10, and then report the time of the fastest of the ten runs.
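The benchmarking approach above could be sketched like this, using CUDA events to time the kernel. The kernel scaleKernel, the helper fastestKernelMs, and the choice of three warm-up runs are assumptions made for illustration only.

```cuda
#include <algorithm>
#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for myKernel: doubles every element.
__global__ void scaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs the kernel warmupRuns + numRuns times and returns the fastest
// of the last numRuns timings, in milliseconds.
static float fastestKernelMs(float* dev, int n, int warmupRuns, int numRuns) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float bestMs = FLT_MAX;
    for (int i = 0; i < warmupRuns + numRuns; ++i) {
        cudaEventRecord(start);
        scaleKernel<<<(n + 255) / 256, 256>>>(dev, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (i >= warmupRuns) {  // ignore the warm-up iterations
            bestMs = std::min(bestMs, ms);
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return bestMs;
}

int main() {
    const int n = 1 << 20;
    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dev, 0, n * sizeof(float));

    printf("fastest of 10 runs: %.4f ms\n", fastestKernelMs(dev, n, 3, 10));

    cudaFree(dev);
    return 0;
}
```

Events measure time on the GPU timeline, so this reports kernel time only; timing the copies as well would mean recording the events around them too.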

It makes no sense to include the time for allocation and deallocation unless this reflects the actual usage pattern in the real-life application (and I would argue that an app that follows the pattern in your loop is likely poorly designed).

If you decide to keep the current setup, note that for components other than the kernel itself performance will be determined by the speed of the host system, in particular single-thread CPU performance and performance of the host’s system memory.

fdls2011 · March 11, 2018, 6:30am

Thanks a lot!
