cudaMemcpy is slow the first time used in a loop

bocrew1987 · June 16, 2020, 5:58pm

Hello, I encountered an interesting behaviour while programming matrix/vector operations in a loop. I read several posts here and on StackOverflow, but none answered fully what I am searching for, some referred to cudaMemcpy to take lot of time due to some checking of data integrity in global memory. I use some very basic parallel algorythms, such as:

__global__ void MxV(float vec[], float mat[], float out[) {
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	float sum = 0;
	if (tid < n) {
		for (int i = 0; i < n; i++)
			sum += vec[i] * mat[i + (n * tid)];
		out[tid] = sum;
	}
}
__global__ void VxV(float vec1[), float vec2[], float out[]) {
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	if (tid < n) {
		out[tid] = vec1[tid] * vec2[tid];
	}
}
__global__ void VaddV(float a1, float vec1[], float a2, float vec2[], float out[]) {
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	if (tid < n) {
		out[tid] = a1 * vec1[tid] + a2 * vec2[tid];
	}
}

The loop on image below calls mostly matrix and vector multiplications. The whole loop that I have takes in average 245ms and when run line by line I discovered, that the first call of cudaMemcpy in every loop iteration is taking most of the loop run time (95%). The second identical call of cudaMemcpy takes nearly no time at all.

I am building a small program for iterative solving of linear equations and the speed is super important to me. If possible, I would like to avoid that the most of the time is spent on transfer between cuda memory and RAM, so I tried alternatively to build the loop entirely without any passing of data:

for (int i=0; i<300;i++)
	{
		MxV << <n / GPU_threads + 1, GPU_threads >> > (dev_p, dev_A, dev_Ap);
		VdotV << <n / GPU_threads + 1, GPU_threads >> > (dev_p, dev_Ap, dev_denom);
		divideScalars << <n / GPU_threads + 1, 1 >> > (dev_nom, dev_denom, dev_alfa);
		assignScalars << <n / GPU_threads + 1, GPU_threads >> > (dev_nom, dev_denom);

		VaddV2 << <n / GPU_threads + 1, GPU_threads >> > (1.0, dev_x, dev_alfa, dev_p, dev_x);
		VaddV2 << <n / GPU_threads + 1, GPU_threads >> > (1.0, dev_r, dev_alfa, dev_Ap, dev_r);
		VdotV << <n / GPU_threads + 1, GPU_threads >> > (dev_r, dev_r, dev_nom);
		divideScalars << <n / GPU_threads + 1, 1 >> > (dev_nom, dev_denom, dev_beta);

		VaddV2 << <n / GPU_threads + 1, GPU_threads >> > (-1.0, dev_r, dev_beta, dev_p, dev_p);
	}

What I do not understand is, that the speed of the 2nd loop is only < 1% faster than in the case of the loop where I identified the bottleneck to be the cudaMemcpy. Is there a reason for that? What am I missing?

Thank you for any hints,
Vladimir

njuffa · June 16, 2020, 8:28pm

These observations are likely the result of a suboptimal performance measurement methodology. You don’t show full code for reproducibility.

A basic principle of performance measurements is never to measure on the first time through any piece of code. This is often referred to as the “cold” state, in which hardware and software structures are yet to be initialized. You would want to measure after “warmup”. This is typically achieved by iterating over the code multiple times, discarding performance data from initial iterations. Some hardware mechanisms in particular need to be exercised more than once or for a certain minimum amount of time before “steady state” is achieved. The dynamic clocking ubiquitous in modern CPUs and GPUs often falls into that category.

CUDA has fairly large software state that gets silently initialized on the first call to a CUDA API function. This can be costly, especially when a system contains multiple GPUs and or lots of system memory. The usual trick is to trigger this context creation and the associated cost with a call to cudaFree(0) outside the timed portion of the code.

You would also want to make sure to carefully separate time for kernel execution from the time for API functions. Kernels execute on the GPU asynchronously. If a kernel invocation is followed by a synchronizing API call, the time for kernel execution adds to the apparent time for the synchronizing API call. If you use your own measurement infrastructure, you may need to use judiciously placed calls to cudaDeviceSynchronize() to separate out execution times.

bocrew1987 · June 16, 2020, 9:34pm

Hi and thank you, I admit that I am new to Cuda so some apparent things may be new to me.

Although I did not write it, the loop runs like 300 times and I referred to the overall result. Also, multiple cudaMalloc functions prior the loop warm-up the GPU (takes like 2 seconds).

What I failed to see is the async run time of kernels - when debugging, the run of kernel shows “< 1ms” and I interpreted it wrongly. When I synchronise threads after the most time consuming part (matrix multiplication), I can record the time of the single kernel run:

Now I understand, that the call of API just waited for all the async threads to finish and that is why there was the waiting time I noticed in the first place.

Thank you.

Topic		Replies	Views
cudaMemcpy() Best approach when you need to call it many times? CUDA Programming and Performance	8	25115	March 8, 2010
Why is the Kernel faster when my matrices are not initialized CUDA Programming and Performance	2	738	December 18, 2017
cudaMemcpy execution time CUDA Programming and Performance	5	6823	June 17, 2010
bets way to return a float value sync or assync CUDA Programming and Performance	26	10319	May 7, 2009
cudaMemcpy takes more than 2 seconds, then driver crashed. CUDA Programming and Performance	6	4191	May 2, 2016
Why is this slow CUDA Programming and Performance	7	3731	February 7, 2012
Zeroes and Large Arrays Strange GPU behaviour CUDA Programming and Performance	2	1297	August 10, 2009
cudaMemcpy latency increases when using 1 device with 2 processes CUDA Programming and Performance cuda	7	66	March 3, 2025
Inconsistent CUDA Kernel Execution Times in Sequential Execution CUDA Programming and Performance cuda	6	253	June 11, 2024
CUDA trouble CUDA Programming and Performance	3	979	March 19, 2013

cudaMemcpy is slow the first time used in a loop

Related topics