cudaMemcpy2D() and a few gray hairs It's very slow

Mr_Nuke · February 9, 2009, 6:51am

I’ve noticed that some cudaMemcpy2D() calls take a significant amount of time to complete. Also copying to the device is about five times faster than copying back to the host.

Here’s the output from a program with memcy2D() timed:

memcpyHTD1 time: 0.487 s batch: 109.375 MB
Bandwidth: 224.735 MB/s

memcpyHTD2 time: 0.373 s batch: 54.688 MB
Bandwidth: 146.572 MB/s

memcpyDTH1 time: 1.876 s batch: 109.375 MB
Bandwidth: 58.294 MB/s

memcpyDTH2 time: 1.854 s batch: 54.688 MB
Bandwidth: 29.498 MB/s

memfree time: 0.003935 seconds
GPU kernel execution time: 1.87628 seconds
Effective performance: 137.494 GFLOP/s

The device arrays are contiguous, while the host array is pitched.
It’s wierd to see that the overall memcpy time is twice as large as the kernel execution time. These results don’t make much sense to me.

Anyone else experiencing this? Is there a reason for seeing such a large time/ small bandwidth

Test platform:
GeForce 9800GT
X58-Core i7 920 (QPI @6.4GHz)
DDR3-1600 triple channel
P6T6WS Revolution
GPU connected through nForce200 bridge @ full PCIE x16 2.0; no display attached

tmurray · February 9, 2009, 7:12am

Is your host memory allocated with cudaMallocHost?

Mr_Nuke · February 9, 2009, 7:25am

No, it’s allocated within a different module. The “X58 PCIE bandwidth” thread showed that the performance gap between paged and non-paged memory is not that large on the Core i7, so I figured I shouldn’t expect exorbitant gains by using page-locked mem. Ill give it a try.

tmurray · February 9, 2009, 7:52am

I’m wondering if what we do in cudaMemcpy2D is non-optimal for Nehalem for whatever reason.

Mr_Nuke · February 9, 2009, 8:01am

I couldn’t tell that, but I did notice the PCI-E bus is choked to death whenever the device to host copy occurs.

Mr_Nuke · February 10, 2009, 2:47pm

Hmm, cudaMemcpy2D() from paged-locked to device causes subsequent CUDA calls to fail with ‘unknown error’, but the call itself generates no error. I am using the exact set of parameters as when copying from paged mem (which works fine).

Will get back with some details later today…

tmurray · February 10, 2009, 5:14pm

What driver are you using?

Mr_Nuke · February 10, 2009, 8:56pm

181.22 for Windows Vista x64

Toolkit and SDK v2.1 x64

Now, here’s an excerpt from the code where the error occurs (see the clarification at the bottom for library explanation):

[codebox]

double time;

Vector3<T> * plCharges;

size = steps*n*fieldLines.GetElemSize();

CUDA_SAFE_CALL(cudaMallocHost((void**) &plCharges, size));

QueryPerformanceCounter(&lstart);

CUDA_SAFE_CALL(cudaMemcpy(plCharges, fieldLines.GetDataPointer(), size, cudaMemcpyHostToHost));

QueryPerformanceCounter(&lend);

time = (double)(lend.QuadPart - lstart.QuadPart) / freq;

printf("\n memcpyHTH1 time: %.3f s\t batch: %.3f MB", time, (double)size/1024/1024);

printf("\n Bandwidth:\t %.3f MB/s\n", (double)size/time/1024/1024);

compSize = (fieldLines.GetElemSize()*2)/3;

size = steps*n*compSize;

CUDA_SAFE_CALL(cudaMalloc((void**) &coalVec.xyInterleaved, size));

QueryPerformanceCounter(&lstart);

/*CUDA_SAFE_CALL(cudaMemcpy2D(coalVec.xyInterleaved, compSize,

	fieldLines.GetDataPointer(), fieldLines.GetElemSize(),

	compSize, steps*n,

	cudaMemcpyHostToDevice));

	*/// Original non-paged to device copy - This does not fail

CUDA_SAFE_CALL(cudaMemcpy2D(coalVec.xyInterleaved, compSize,

	plCharges, fieldLines.GetElemSize(),

	compSize, steps*n,

	cudaMemcpyHostToDevice));

QueryPerformanceCounter(&lend);

time = (double)(lend.QuadPart - lstart.QuadPart) / freq;

printf("\n memcpyHTD1 time: %.3f s\t batch: %.3f MB", time, (double)size/1024/1024);

printf("\n Bandwidth:\t %.3f MB/s\n", (double)size/time/1024/1024);

compSize/=2;

size = steps*n*compSize;

//printf(cudaGetErrorString(cudaGetLastError())); // When commented out, this printf reveals "no error"

// This call will fail with "unknown error", but even if this is placed above memcpy2D, the next cuda*() call will fail with "unknown error"

CUDA_SAFE_CALL(cudaMalloc((void**) &coalVec.z, size));

[/codebox]

I think the names are self-explanatory, but I will provide a brief explanation of what is going on:

Vector3 is a template of form {T x, y, z;};

fieldLines is a smart array of Vector3

‘n’ is the width of the array

‘steps’ is the height of the array

GetDataPointer() returns the memory location of the actual data array

GetElemSize() returns the size in bytes of an element of the array, in this case, the size of a Vector3

T is of type float

EDIT (for Linux and UNIX users): QueryPerformanceCounter() is used to time the memory copy.

Mr_Nuke · February 13, 2009, 3:40am

tmurray, were you able to repro the memcpy error issue?

EDIT - Sometimes, the memcpy2D will fail with “unknown error,” but it may or mai not fail for the same build.

Topic		Replies	Views
cudaMemcpy host->device and device->host speed CUDA Programming and Performance	6	15119	April 29, 2014
About CUDA CUDA Programming and Performance	2	4712	December 3, 2008
Slow memory transfers CUDA Programming and Performance	7	1986	May 23, 2011
cuMemcpy* and error codes CUDA_ERROR_INVALID_VALUE? CUDA Programming and Performance	11	11097	October 28, 2009
bets way to return a float value sync or assync CUDA Programming and Performance	26	10310	May 7, 2009
How much time is cudaMemcpy() use? CUDA Programming and Performance	1	4021	July 30, 2008
2MB size limit cudaMemcpy CUDA Programming and Performance	18	7286	April 18, 2011
Why is there the difference of memory copy speed between cpu>gpu and gpu>cpu CUDA Programming and Performance	3	1273	April 10, 2014
[solved] strange cuda memcopy time CUDA Programming and Performance	5	703	March 26, 2015
Question about CUDA_SAFE_CALL(cudaMemcpy(hostPx, CUDA_SAFE_CALL(cudaMemcpy(hostPx, device CUDA Programming and Performance	6	47469	January 23, 2009

cudaMemcpy2D() and a few gray hairs It's very slow

Related topics