cudaMemcpy is slow the first time used in a loop

Hello, I encountered an interesting behaviour while programming matrix/vector operations in a loop. I read several posts here and on StackOverflow, but none answered fully what I am searching for, some referred to cudaMemcpy to take lot of time due to some checking of data integrity in global memory. I use some very basic parallel algorythms, such as:

__global__ void MxV(float vec[], float mat[], float out[) {
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	float sum = 0;
	if (tid < n) {
		for (int i = 0; i < n; i++)
			sum += vec[i] * mat[i + (n * tid)];
		out[tid] = sum;
__global__ void VxV(float vec1[), float vec2[], float out[]) {
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	if (tid < n) {
		out[tid] = vec1[tid] * vec2[tid];
__global__ void VaddV(float a1, float vec1[], float a2, float vec2[], float out[]) {
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	if (tid < n) {
		out[tid] = a1 * vec1[tid] + a2 * vec2[tid];

The loop on image below calls mostly matrix and vector multiplications. The whole loop that I have takes in average 245ms and when run line by line I discovered, that the first call of cudaMemcpy in every loop iteration is taking most of the loop run time (95%). The second identical call of cudaMemcpy takes nearly no time at all.

I am building a small program for iterative solving of linear equations and the speed is super important to me. If possible, I would like to avoid that the most of the time is spent on transfer between cuda memory and RAM, so I tried alternatively to build the loop entirely without any passing of data:

for (int i=0; i<300;i++)
		MxV << <n / GPU_threads + 1, GPU_threads >> > (dev_p, dev_A, dev_Ap);
		VdotV << <n / GPU_threads + 1, GPU_threads >> > (dev_p, dev_Ap, dev_denom);
		divideScalars << <n / GPU_threads + 1, 1 >> > (dev_nom, dev_denom, dev_alfa);
		assignScalars << <n / GPU_threads + 1, GPU_threads >> > (dev_nom, dev_denom);

		VaddV2 << <n / GPU_threads + 1, GPU_threads >> > (1.0, dev_x, dev_alfa, dev_p, dev_x);
		VaddV2 << <n / GPU_threads + 1, GPU_threads >> > (1.0, dev_r, dev_alfa, dev_Ap, dev_r);
		VdotV << <n / GPU_threads + 1, GPU_threads >> > (dev_r, dev_r, dev_nom);
		divideScalars << <n / GPU_threads + 1, 1 >> > (dev_nom, dev_denom, dev_beta);

		VaddV2 << <n / GPU_threads + 1, GPU_threads >> > (-1.0, dev_r, dev_beta, dev_p, dev_p);

What I do not understand is, that the speed of the 2nd loop is only < 1% faster than in the case of the loop where I identified the bottleneck to be the cudaMemcpy. Is there a reason for that? What am I missing?

Thank you for any hints,

These observations are likely the result of a suboptimal performance measurement methodology. You don’t show full code for reproducibility.

A basic principle of performance measurements is never to measure on the first time through any piece of code. This is often referred to as the “cold” state, in which hardware and software structures are yet to be initialized. You would want to measure after “warmup”. This is typically achieved by iterating over the code multiple times, discarding performance data from initial iterations. Some hardware mechanisms in particular need to be exercised more than once or for a certain minimum amount of time before “steady state” is achieved. The dynamic clocking ubiquitous in modern CPUs and GPUs often falls into that category.

CUDA has fairly large software state that gets silently initialized on the first call to a CUDA API function. This can be costly, especially when a system contains multiple GPUs and or lots of system memory. The usual trick is to trigger this context creation and the associated cost with a call to cudaFree(0) outside the timed portion of the code.

You would also want to make sure to carefully separate time for kernel execution from the time for API functions. Kernels execute on the GPU asynchronously. If a kernel invocation is followed by a synchronizing API call, the time for kernel execution adds to the apparent time for the synchronizing API call. If you use your own measurement infrastructure, you may need to use judiciously placed calls to cudaDeviceSynchronize() to separate out execution times.

1 Like

Hi and thank you, I admit that I am new to Cuda so some apparent things may be new to me.

Although I did not write it, the loop runs like 300 times and I referred to the overall result. Also, multiple cudaMalloc functions prior the loop warm-up the GPU (takes like 2 seconds).

What I failed to see is the async run time of kernels - when debugging, the run of kernel shows “< 1ms” and I interpreted it wrongly. When I synchronise threads after the most time consuming part (matrix multiplication), I can record the time of the single kernel run:

Now I understand, that the call of API just waited for all the async threads to finish and that is why there was the waiting time I noticed in the first place.

Thank you.