Is there any way to copy data from device to host more efficiently in this case?

hadesmajesty · December 11, 2018, 5:18am

My code sample is below. The kernel seems to run very efficiently on the GPU, but copying the result back to host memory takes much longer. Whether I copy 1 variable of type double or 24 variables of type double, it takes around 88 seconds. Such a long copy time removes any advantage of computing on the GPU. Is there any way to improve this?

__global__ void RunSTH_OnGPU(double* X_d, otherarguments) {
	int bx = blockIdx.x;
	int tx = threadIdx.x;
	int Id_t = blockDim.x * bx + tx;  // global thread index
	if (Id_t < nset) {
		// Do a lot of work here, using dynamic memory allocation.
		X_d[Id_t] = something;
	}
}

GPUbegin = clock();
RunSTH_OnGPU<<<nblocks, nthreads>>>(X_d, otherarguments);
GPUend = clock();
timeSec = (float(GPUend) - float(GPUbegin)) / 1000.;  // It takes 0.000 seconds

int nc = 24;
GPUbegin = clock();
error = cudaMemcpy(X_h, X_d, sizeof(double) * nc, cudaMemcpyDeviceToHost);
GPUend = clock();
timeSec = (float(GPUend) - float(GPUbegin)) / 1000.;  // It takes 88 seconds

saulocpp · December 11, 2018, 10:42am

Is the timer saying that this cudaMemcpy portion took 88 seconds?!?!

What does nvprof say about the time spent by cudaMemcpy operations? NVVP also shows specific calls along the timeline, so you have better visual feedback.

Check the considerations in this thread, it may be of interest:
https://devtalk.nvidia.com/default/topic/1019140/jetson-tx1/zero-copy-memory-vs-unified-memory-cuda-processing/

Also look for Njuffa’s zero copy code around the forum (old thread, I have it at home, but I’m not at home), it provides good information.

tera · December 11, 2018, 11:15am

Ok, I’ll bite.

Your problem is not the time spent in cudaMemcpy(), but the time it takes to execute your kernel.

CUDA kernel launches are asynchronous, so you are only measuring the time it takes to launch the kernel.
Your cudaMemcpy() is slow because it has to wait for the kernel to finish first. No amount of optimisation of the copy operation will speed up your program. Instead, you need to optimise the kernel you are launching to run faster.
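
To make the effect described above concrete, here is a minimal sketch that reproduces the misleading measurement. Everything in it is hypothetical and not taken from the original post: `slow_kernel` is a stand-in that simply busy-waits for a given number of GPU clock ticks, and `NaiveTiming` and `measure_naive` are illustrative names. The host timer around the launch stops while the kernel is most likely still running, so the kernel's run time ends up attributed to the blocking `cudaMemcpy()`.

```cuda
#include <chrono>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for a slow kernel: each thread spins for roughly
// `cycles` GPU clock ticks, then writes one value.
__global__ void slow_kernel(double* out, int n, long long cycles) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        long long start = clock64();
        while (clock64() - start < cycles) { /* busy wait */ }
        out[i] = static_cast<double>(i);
    }
}

struct NaiveTiming {
    double launch_ms;    // host time until the launch call returned
    double memcpy_ms;    // host time spent inside cudaMemcpy()
    cudaError_t status;  // result of the last CUDA call that was checked
};

// Times the launch and the copy the same way the original post does,
// i.e. without synchronizing after the launch.
NaiveTiming measure_naive(int n, long long cycles) {
    using clk = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    NaiveTiming t{0.0, 0.0, cudaSuccess};
    std::vector<double> host(n);
    double* dev = nullptr;

    t.status = cudaMalloc(&dev, n * sizeof(double));
    if (t.status != cudaSuccess) return t;

    auto t0 = clk::now();
    slow_kernel<<<(n + 127) / 128, 128>>>(dev, n, cycles);
    auto t1 = clk::now();  // the launch has returned; the kernel may still be running

    // cudaMemcpy() blocks until the preceding kernel has finished, so this
    // interval contains the kernel's run time plus the actual transfer.
    t.status = cudaMemcpy(host.data(), dev, n * sizeof(double),
                          cudaMemcpyDeviceToHost);
    auto t2 = clk::now();

    t.launch_ms = ms(t1 - t0).count();
    t.memcpy_ms = ms(t2 - t1).count();
    cudaFree(dev);
    return t;
}
```

Calling `measure_naive(24, 200000000LL)` from `main()` and printing both fields should show a small `launch_ms` and a `memcpy_ms` dominated by the spin time, even though only 24 doubles are transferred.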

The CUDA profiler is still the tool to turn to. Run nvvp, take a look at the timeline to see how time is spent in the kernel, not the memcpy, and then let it guide you through the necessary analysis.

hadesmajesty · December 11, 2018, 1:07pm

Yes. I need to add cudaDeviceSynchronize() after the kernel call if I want to measure the time the kernel actually takes.
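
A minimal sketch of that corrected measurement, under some assumptions: `fill_kernel`, `Timing` and `measure_with_sync` are hypothetical names standing in for the real kernel and timing code, and `std::chrono` is used here in place of `clock()`. The key point is that `cudaDeviceSynchronize()` blocks the host until the kernel has finished, so the two intervals are attributed to the right operations.

```cuda
#include <chrono>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholder for the real kernel.
__global__ void fill_kernel(double* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = 2.0 * i;
}

struct Timing {
    double kernel_ms;    // launch plus execution, ended by the synchronize
    double memcpy_ms;    // device-to-host copy only
    cudaError_t status;  // first error encountered, or cudaSuccess
};

// Fills `host` from the device and reports separate kernel and copy times.
Timing measure_with_sync(std::vector<double>& host) {
    using clk = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    Timing t{0.0, 0.0, cudaSuccess};
    int n = static_cast<int>(host.size());
    double* dev = nullptr;

    t.status = cudaMalloc(&dev, n * sizeof(double));
    if (t.status != cudaSuccess) return t;

    auto t0 = clk::now();
    fill_kernel<<<(n + 127) / 128, 128>>>(dev, n);
    cudaError_t launch_err = cudaGetLastError();     // launch errors (e.g. bad configuration)
    cudaError_t sync_err = cudaDeviceSynchronize();  // waits for the kernel; reports execution errors
    auto t1 = clk::now();

    cudaError_t copy_err = cudaMemcpy(host.data(), dev, n * sizeof(double),
                                      cudaMemcpyDeviceToHost);
    auto t2 = clk::now();

    t.kernel_ms = ms(t1 - t0).count();
    t.memcpy_ms = ms(t2 - t1).count();
    if (launch_err != cudaSuccess) t.status = launch_err;
    else if (sync_err != cudaSuccess) t.status = sync_err;
    else t.status = copy_err;
    cudaFree(dev);
    return t;
}
```

With the synchronize in place, a slow kernel shows up in `kernel_ms` rather than in `memcpy_ms`. Checking the return values is worthwhile too, since a kernel that fails at run time is otherwise only reported by a later call such as the copy.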

AndrewGong · December 14, 2018, 3:03am

Please study the materials on cudaMemcpyAsync.
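
For reference, a minimal sketch of what a `cudaMemcpyAsync` copy-back can look like. The names `square_kernel` and `async_copy_back` are hypothetical, and the use of pinned host memory plus a dedicated stream is one common arrangement rather than the only one. As explained earlier in the thread, this does not make a slow kernel any faster: the copy is queued behind the kernel in the same stream and still has to wait for it. What it allows is for the host to do other work until it synchronizes.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel.
__global__ void square_kernel(double* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = static_cast<double>(i) * i;
}

// Runs the kernel and copies n doubles back through a pinned staging buffer.
// On success, result[0..n) holds the kernel output.
cudaError_t async_copy_back(double* result, int n) {
    size_t bytes = n * sizeof(double);
    double* dev = nullptr;
    double* pinned = nullptr;
    cudaStream_t stream;
    cudaError_t err;

    if ((err = cudaMalloc(&dev, bytes)) != cudaSuccess) return err;
    // Page-locked host memory, so the copy can proceed asynchronously.
    if ((err = cudaMallocHost(&pinned, bytes)) != cudaSuccess) {
        cudaFree(dev);
        return err;
    }
    if ((err = cudaStreamCreate(&stream)) != cudaSuccess) {
        cudaFreeHost(pinned);
        cudaFree(dev);
        return err;
    }

    // Kernel and copy are queued in the same stream, so they run in order.
    square_kernel<<<(n + 127) / 128, 128, 0, stream>>>(dev, n);
    err = cudaMemcpyAsync(pinned, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // The host could do unrelated work here while the GPU is busy.

    // The pinned buffer is only safe to read after the stream has drained.
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
    if (err == cudaSuccess) {
        for (int i = 0; i < n; ++i) result[i] = pinned[i];
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return err;
}
```

Any timing around this function still needs a synchronization point before the end timestamp, for the same reason as in the original question.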
