Copying memory from device to Host takes too much time

Hello,

I have some code that I have optimized using CUDA with quite good results. The problem is that, after the optimization, the part that takes the most time is the section where the data is copied from device memory back to the host after the kernel call. These are the relevant lines of code:

clock_t tmpTime = clock();
calcMatches<<<blocksPerGrid, threadsPerBlock>>>(d_bestCorr1, d_bestCorr2, d_matches, points1.size());
cout << "calcMatches = " << clock() - tmpTime << endl;

tmpTime = clock();
int *h_matches = (int *)malloc(points1.size() * sizeof(int));
cout << "malloc = " << clock() - tmpTime << endl;

tmpTime = clock();
cutilSafeCall(cudaMemcpy(h_matches, d_matches, points1.size() * sizeof(int), cudaMemcpyDeviceToHost));
cout << "memCpy = " << clock() - tmpTime << endl;

The output for this section of code is, in an example:

calcMatches = 10000

malloc = 0

memCpy = 30000

The execution time is acceptable for my application, but it would be better if the time spent copying memory were smaller. The value of points1.size() is 746.

Is there a way to optimize these times? Thank you in advance.

For realistic timing results, you should put a cudaThreadSynchronize() call after your kernel call. Kernel launches are asynchronous, so cudaMemcpy has to wait until your kernel has finished all its calculations, and your timing results end up unbalanced.
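As a sketch (reusing the variable names from the original post), the fix looks like this. Without the synchronize, the launch call returns immediately, so only the launch overhead is billed to "calcMatches" and the kernel's whole runtime gets billed to "memCpy":

```cuda
clock_t tmpTime = clock();
calcMatches<<<blocksPerGrid, threadsPerBlock>>>(d_bestCorr1, d_bestCorr2,
                                                d_matches, points1.size());
cudaThreadSynchronize();  // block until the kernel has actually finished
cout << "calcMatches = " << clock() - tmpTime << endl;

tmpTime = clock();
cutilSafeCall(cudaMemcpy(h_matches, d_matches,
                         points1.size() * sizeof(int),
                         cudaMemcpyDeviceToHost));
// Now this measures only the transfer, not kernel + transfer:
cout << "memCpy = " << clock() - tmpTime << endl;
```

The copy of 746 ints (about 3 KB) should then show up as very cheap; most of the original 30000 ticks was the kernel finishing.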

So what is the clock speed of your CPU? Or what is the timing in milliseconds? (To get the time in seconds one needs to do clock() / float(CLOCKS_PER_SEC), if memory serves.)

Sometimes a "warmup" memcpy can help speed up the transfer. Or you could try using zero-copy (programming guide, section 3.2.5).

Thank you, the problem was solved by calling cudaThreadSynchronize(); I didn't know about this function :)
