Copying memory from device to Host takes too much time

Hello,

I have some code that I have optimized using CUDA with quite good results. The problem is that, after the optimization, the part that takes the most time is the section where the data is copied from device memory back to the host after the kernel call. These are the relevant lines of code:

clock_t tmpTime = clock();
calcMatches<<<blocksPerGrid, threadsPerBlock>>>(d_bestCorr1, d_bestCorr2, d_matches, points1.size());
cout << "calcMatches = " << clock() - tmpTime << endl;

tmpTime = clock();
int *h_matches = (int *)malloc(points1.size() * sizeof(int));
cout << "malloc = " << clock() - tmpTime << endl;

tmpTime = clock();
cutilSafeCall(cudaMemcpy(h_matches, d_matches, points1.size() * sizeof(int), cudaMemcpyDeviceToHost));
cout << "memCpy = " << clock() - tmpTime << endl;

The output for this section of code is, in an example:

calcMatches = 10000

malloc = 0

memCpy = 30000

The execution time is acceptable for my application, but it would be better if the time spent copying memory were smaller. The value of points1.size() is 746.

Is there a way to optimize these times? Thank you in advance.

For realistic timing results, you should put a cudaThreadSynchronize() call after your kernel call. Kernel launches are asynchronous, so cudaMemcpy has to wait until your kernel has finished all its calculations, and your timing results end up unbalanced.
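As a sketch (reusing the variable names from the original post), the fix looks like this. Without the synchronize, the launch call returns immediately, so only the launch overhead is billed to "calcMatches" and the kernel's whole runtime gets billed to "memCpy":

```cuda
clock_t tmpTime = clock();
calcMatches<<<blocksPerGrid, threadsPerBlock>>>(d_bestCorr1, d_bestCorr2,
                                                d_matches, points1.size());
cudaThreadSynchronize();  // block until the kernel has actually finished
cout << "calcMatches = " << clock() - tmpTime << endl;

tmpTime = clock();
cutilSafeCall(cudaMemcpy(h_matches, d_matches,
                         points1.size() * sizeof(int),
                         cudaMemcpyDeviceToHost));
// Now this measures only the transfer, not kernel + transfer:
cout << "memCpy = " << clock() - tmpTime << endl;
```

The copy of 746 ints (about 3 KB) should then show up as very cheap; most of the original 30000 ticks was the kernel finishing.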

So what is the clock speed of your CPU? Or what is the timing in milliseconds? (To get the time in seconds one needs to do clock() / float(CLOCKS_PER_SEC), if memory serves.)

Sometimes a "warmup" memcpy can help speed up the transfer. Or you could try using zero-copy (programming guide, section 3.2.5).

Thank you, the problem was solved by calling cudaThreadSynchronize(); I didn't know about this function :)
