cudaMemcpy2D slow


I just got a large performance gain by padding my arrays on the host the same way they are padded on the card and using cudaMemcpy instead of cudaMemcpy2D.

It took me some time to figure out that cudaMemcpy2D is very slow and that it was the cause of my performance problem.

A little warning in the programming guide concerning this would be nice ;-)
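For anyone wanting to try the padding workaround from the post above, here is a minimal host-only sketch of the repacking step. It assumes the pitch has already been obtained from the pitched device allocation (cudaMalloc2D in this thread, or cudaMallocPitch in current CUDA releases). The name pad_rows is hypothetical and not part of CUDA, and the actual transfer is only indicated in a comment.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a CUDA function): repack `height` tightly packed
 * rows of `width_bytes` each into a new zero-filled buffer whose rows start
 * every `pitch` bytes, i.e. the same layout as a pitched device allocation.
 * Returns NULL if pitch < width_bytes or if allocation fails.
 * The caller owns the result and releases it with free(). */
unsigned char *pad_rows(const unsigned char *src, size_t width_bytes,
                        size_t height, size_t pitch)
{
    if (pitch < width_bytes)
        return NULL;

    unsigned char *dst = calloc(height, pitch);
    if (dst == NULL)
        return NULL;

    for (size_t row = 0; row < height; ++row)
        memcpy(dst + row * pitch, src + row * width_bytes, width_bytes);

    /* Assumed follow-up, not executed here: because host and device now
     * share one layout, a single contiguous copy of pitch * height bytes
     * (a plain cudaMemcpy) can replace the cudaMemcpy2D call. */
    return dst;
}
```

The trade-off is extra host memory and one repacking pass on the CPU. Whether that is a net win depends on how often the same data is transferred.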

cudaMemcpy2D should be fast even with different pitches for host and device memory.

Could you elaborate more on your use case?


Thanks for your answer.
I’ll investigate it a bit more and then post the results.

Attached are the cudaMemcpy2D speeds I measured versus the width I passed to cudaMalloc2D.

I only see good performance for width == 64 and width == 128.
speeds.txt (3.87 KB)

Hi, I’m getting the same problem.

A call to:

cudaMemcpy2D(&dst, 12, &src, 8, 8, 14M, cudaMemcpyDeviceToHost);

takes 1.9 s (seconds!!!) to complete. I’m copying from a contiguous device array to a 164 MB strided array on the host. While the copy is running, CPU usage jumps to occupy an entire core.

cudaMemcpy2D(&dst, 12, &src, 4, 4, 14M, cudaMemcpyDeviceToHost);
This call takes the same 1.9 seconds, even though it copies only half as much data as the previous call.
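To make the parameters in the two calls above concrete, here is a host-only sketch of what a pitched 2D copy means logically. It is a model of the semantics only, not of how the CUDA runtime implements cudaMemcpy2D, and copy2d_reference is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only model of a pitched 2D copy. Argument order mirrors
 * the calls in the post: dst, dpitch, src, spitch, width, height (the copy
 * direction is dropped because everything here lives in host memory).
 * Row r of the source starts at src + r * spitch, row r of the destination
 * at dst + r * dpitch, and `width` bytes are copied per row.
 * Returns 0 on success, or -1 if width exceeds either pitch. */
int copy2d_reference(void *dst, size_t dpitch,
                     const void *src, size_t spitch,
                     size_t width, size_t height)
{
    if (width > dpitch || width > spitch)
        return -1;

    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t row = 0; row < height; ++row)
        memcpy(d + row * dpitch, s + row * spitch, width);

    return 0;
}
```

With dpitch = 12 and width = 8, every destination row is an 8-byte payload followed by a 4-byte gap, so the transfer is made up of millions of tiny pieces rather than one contiguous block. Halving the width to 4 halves the bytes but leaves the row count unchanged, which would fit a cost dominated by per-row handling, though nothing in this thread confirms that as the cause.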