cudaMemcpy2D() and a few gray hairs It's very slow

I’ve noticed that some cudaMemcpy2D() calls take a significant amount of time to complete. Also copying to the device is about five times faster than copying back to the host.

Here’s the output from a program with memcy2D() timed:

memcpyHTD1 time: 0.487 s batch: 109.375 MB
Bandwidth: 224.735 MB/s

memcpyHTD2 time: 0.373 s batch: 54.688 MB
Bandwidth: 146.572 MB/s

memcpyDTH1 time: 1.876 s batch: 109.375 MB
Bandwidth: 58.294 MB/s

memcpyDTH2 time: 1.854 s batch: 54.688 MB
Bandwidth: 29.498 MB/s

memfree time: 0.003935 seconds
GPU kernel execution time: 1.87628 seconds
Effective performance: 137.494 GFLOP/s

The device arrays are contiguous, while the host array is pitched.
It’s wierd to see that the overall memcpy time is twice as large as the kernel execution time. These results don’t make much sense to me.

Anyone else experiencing this? Is there a reason for seeing such a large time/ small bandwidth

Test platform:
GeForce 9800GT
X58-Core i7 920 (QPI @6.4GHz)
DDR3-1600 triple channel
P6T6WS Revolution
GPU connected through nForce200 bridge @ full PCIE x16 2.0; no display attached

Is your host memory allocated with cudaMallocHost?

No, it’s allocated within a different module. The “X58 PCIE bandwidth” thread showed that the performance gap between paged and non-paged memory is not that large on the Core i7, so I figured I shouldn’t expect exorbitant gains by using page-locked mem. Ill give it a try.

I’m wondering if what we do in cudaMemcpy2D is non-optimal for Nehalem for whatever reason.

I couldn’t tell that, but I did notice the PCI-E bus is choked to death whenever the device to host copy occurs.

Hmm, cudaMemcpy2D() from paged-locked to device causes subsequent CUDA calls to fail with ‘unknown error’, but the call itself generates no error. I am using the exact set of parameters as when copying from paged mem (which works fine).

Will get back with some details later today…

What driver are you using?

181.22 for Windows Vista x64

Toolkit and SDK v2.1 x64

Now, here’s an excerpt from the code where the error occurs (see the clarification at the bottom for library explanation):


double time;

Vector3<T> * plCharges;

size = steps*n*fieldLines.GetElemSize();

CUDA_SAFE_CALL(cudaMallocHost((void**) &plCharges, size));


CUDA_SAFE_CALL(cudaMemcpy(plCharges, fieldLines.GetDataPointer(), size, cudaMemcpyHostToHost));


time = (double)(lend.QuadPart - lstart.QuadPart) / freq;

printf("\n memcpyHTH1 time: %.3f s\t batch: %.3f MB", time, (double)size/1024/1024);

printf("\n Bandwidth:\t %.3f MB/s\n", (double)size/time/1024/1024);

compSize = (fieldLines.GetElemSize()*2)/3;

size = steps*n*compSize;

CUDA_SAFE_CALL(cudaMalloc((void**) &coalVec.xyInterleaved, size));


/*CUDA_SAFE_CALL(cudaMemcpy2D(coalVec.xyInterleaved, compSize,

	fieldLines.GetDataPointer(), fieldLines.GetElemSize(),

	compSize, steps*n,


	*/// Original non-paged to device copy - This does not fail

CUDA_SAFE_CALL(cudaMemcpy2D(coalVec.xyInterleaved, compSize,

	plCharges, fieldLines.GetElemSize(),

	compSize, steps*n,



time = (double)(lend.QuadPart - lstart.QuadPart) / freq;

printf("\n memcpyHTD1 time: %.3f s\t batch: %.3f MB", time, (double)size/1024/1024);

printf("\n Bandwidth:\t %.3f MB/s\n", (double)size/time/1024/1024);


size = steps*n*compSize;

//printf(cudaGetErrorString(cudaGetLastError())); // When commented out, this printf reveals "no error"

// This call will fail with "unknown error", but even if this is placed above memcpy2D, the next cuda*() call will fail with "unknown error"

CUDA_SAFE_CALL(cudaMalloc((void**) &coalVec.z, size));


I think the names are self-explanatory, but I will provide a brief explanation of what is going on:

Vector3 is a template of form {T x, y, z;};

fieldLines is a smart array of Vector3

‘n’ is the width of the array

‘steps’ is the height of the array

GetDataPointer() returns the memory location of the actual data array

GetElemSize() returns the size in bytes of an element of the array, in this case, the size of a Vector3

T is of type float

EDIT (for Linux and UNIX users): QueryPerformanceCounter() is used to time the memory copy.

tmurray, were you able to repro the memcpy error issue?

EDIT - Sometimes, the memcpy2D will fail with “unknown error,” but it may or mai not fail for the same build.