cudaMemcpy2D slow

AndreasBuhr · June 11, 2007, 2:16pm

Hi,

I just had a large performance gain by padding arrays on the host in the same way as they are padded on the card and using cudaMemcpy instead of cudaMemcpy2D.
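
A minimal host-only sketch of that padding idea, in plain C with no CUDA calls so it runs anywhere. The helper names (alloc_padded_host, store_row, copy_flat) are hypothetical and not part of the CUDA API. The pitch is assumed to be whatever the pitched device allocation reported, and memcpy stands in for the single flat transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: zeroed host buffer whose rows are `pitch` bytes apart.
   `pitch` is assumed to be the value reported by the device allocation. */
unsigned char *alloc_padded_host(size_t pitch, size_t height) {
    return (unsigned char *)calloc(height, pitch);
}

/* Write `width` payload bytes into row `row`; the remaining
   (pitch - width) bytes of the row stay as padding. */
void store_row(unsigned char *base, size_t pitch, size_t row,
               const void *data, size_t width) {
    assert(width <= pitch);
    memcpy(base + row * pitch, data, width);
}

/* Stand-in for one flat transfer: when both sides share the same pitch,
   the whole image, padding included, moves as pitch * height bytes. */
void copy_flat(void *dst, const void *src, size_t pitch, size_t height) {
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, pitch * height);
}
```

In a real program the memcpy inside copy_flat would become a single cudaMemcpy(dev, host, pitch * height, cudaMemcpyHostToDevice). The price is that the padding bytes are transferred too.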

It took me some time to figure out that cudaMemcpy2D is very slow and that this is the performance problem I have.

A little warning in the programming guide concerning this would be nice ;-)

Cyril_Zeller · June 12, 2007, 2:14pm

cudaMemcpy2D should be fast even with different pitches for host and device memory.

Could you elaborate more on your use case?

Thanks.

AndreasBuhr · June 14, 2007, 8:19am

Thanks for your answer.
I’ll investigate it a bit more and post the results then.

AndreasBuhr · June 19, 2007, 4:11pm

Attached are the cudaMemcpy2D speeds I measured versus the width I pass to cudaMalloc2D.

I only see good performance for width == 64 and width == 128.
speeds.txt (3.87 KB)

Mr_Nuke · January 30, 2009, 9:00am

Hi, I’m getting the same problem.

A call to:

cudaMemcpy2D(&dst, 12, &src, 8, 8, 14M, cudaMemcpyDeviceToHost);

takes 1.9 s (seconds!!!) to complete. I’m copying from a contiguous device array to a 164MB striped array on the host. While the copy is running, CPU usage jumps up to occupy an entire core.

cudaMemcpy2D(&dst, 12, &src, 4, 4, 14M, cudaMemcpyDeviceToHost);
This call takes the same 1.9 seconds, even though I’m copying only half as much data as in the previous call.
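
For reference, here is a host-only sketch in plain C of the memory layout that the pitch and width arguments in the calls above describe. It is a model only: copy2d_model and dst_span_bytes are hypothetical names, the parameter order mirrors cudaMemcpy2D without the copy kind, and nothing here is meant to show how the driver actually performs the transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only model of a pitched 2D copy: `height` rows of `width` bytes,
   source rows `spitch` bytes apart, destination rows `dpitch` bytes apart. */
void copy2d_model(void *dst, size_t dpitch, const void *src, size_t spitch,
                  size_t width, size_t height) {
    assert(width <= dpitch && width <= spitch);
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    for (size_t row = 0; row < height; ++row) {
        /* One small copy per row; bytes between width and dpitch are untouched. */
        memcpy(d + row * dpitch, s + row * spitch, width);
    }
}

/* Bytes the destination must span: full pitch for every row but the last,
   which only needs `width` bytes. */
size_t dst_span_bytes(size_t dpitch, size_t width, size_t height) {
    return height == 0 ? 0 : (height - 1) * dpitch + width;
}
```

With dpitch = 12, spitch = 8 and width = 8, every 8-byte source element lands at the start of a 12-byte host record, and the last 4 bytes of each record are left alone. Both calls above touch the same number of rows, so if the cost were dominated by per-row overhead rather than by bytes moved, equal timings would not be surprising. That is only a guess; nothing in this thread confirms it.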