cudaMemcpy3D performance issues (X-Z face)

Hey all,

I’ve been having performance issues with cudaMemcpy3D, specifically when copying the X-Z face of a large 3-d array into contiguous host memory. Even with wide rows, the copy bandwidth is much lower than expected. Attached is a benchmark I wrote that compares the 3-d copy against an equivalent 2-d copy. For the size tested (a 256^3 array of doubles, 512 KB transfer size), the throughput differs significantly: ~0.6 GB/s for the 3-d copy vs. ~4.5 GB/s for the 2-d copy, on an M2070. This trend roughly holds for other sizes as well. Is there anything I am doing incorrectly, or is something suspect with the 3-d copy? Any help would be greatly appreciated.
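
For context, the 3-d call is set up along these lines (a minimal sketch, not the attached file verbatim; nx/ny/nz and yIdx are placeholder names):

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t nx = 256, ny = 256, nz = 256;
    const size_t yIdx = 0;                        // which X-Z plane to fetch

    // 3-d pitched device allocation: x is contiguous, rows padded to pitch
    cudaPitchedPtr dev;
    cudaExtent vol = make_cudaExtent(nx * sizeof(double), ny, nz);
    cudaMalloc3D(&dev, vol);

    // contiguous host destination for the face
    double* host = (double*)malloc(nx * nz * sizeof(double));

    cudaMemcpy3DParms p = {0};
    p.srcPtr = dev;                               // pitch and ny give the slice stride
    p.srcPos = make_cudaPos(0, yIdx, 0);          // x offset in bytes, y/z in rows/slices
    p.dstPtr = make_cudaPitchedPtr(host, nx * sizeof(double),
                                   nx * sizeof(double), 1);
    p.extent = make_cudaExtent(nx * sizeof(double), 1, nz); // one row per z slice
    p.kind   = cudaMemcpyDeviceToHost;
    cudaMemcpy3D(&p);                             // copies nz non-contiguous rows

    cudaFree(dev.ptr);
    free(host);
    return 0;
}
```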

Thanks,
John
test_3d_copy.cpp (2.65 KB)

What are the cudaStreamSynchronize() calls for?

My application overlaps kernels with copies, so the operations will be associated with a stream. Sorry, I should have made that clear. I will repeat the test with synchronous calls.
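
For reference, the streamed path looks roughly like this (a sketch; `p` is assumed to be the cudaMemcpy3DParms from the face-copy setup, with a page-locked host buffer so the async call can actually overlap):

```cpp
#include <cuda_runtime.h>

// Sketch of the streamed timing loop; `p` is assumed to describe the
// X-Z face copy as set up earlier, with a page-locked host destination.
void timedFaceCopy(const cudaMemcpy3DParms& p, int iters) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int i = 0; i < iters; ++i) {
        cudaMemcpy3DAsync(&p, stream);   // enqueue the face copy
        cudaStreamSynchronize(stream);   // block before the timer is read
    }
    cudaStreamDestroy(stream);
}
```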

Here’s an interesting result: it seems that the stream-sync latency dominated the 3-d copy but made no difference to the 2-d copy, which is quite strange. That made me think it was a warm-up issue, but I added a warm-up pass and the results are about the same (the first few copies suffered in the original, but the many iterations minimized the effect). The updated benchmark is attached. The regression is smaller without streams, but a discrepancy remains.

with synchronous copies (attached):
x, y, z, x-z face bytes, memcpy3d time (s) / throughput (GB/s), memcpy2d time (s) / throughput (GB/s)
256 256 256 524288 0.000181 2.700385 0.000105 4.634919

with stream synchronize:
x, y, z, x-z face bytes, memcpy3d time (s) / throughput (GB/s), memcpy2d time (s) / throughput (GB/s)
256 256 256 524288 0.000794 0.614672 0.000106 4.585089
test_3d_copy2.cpp (2.7 KB)

Hi,

There’s no issue here. You are not copying the same thing, so the difference is not surprising.

For the 3D case, the face you’re copying is not a contiguous area in GPU memory, so the driver will probably have to trigger 256 separate copies, one for each z.

In the 2D copy, you selected a size for which the row width matches the pitch, so the memory area is contiguous and can be transferred in one block. If you change the size so that the pitch exceeds the row width, you fall into the same scheme as the 3D case: as many copies as there are rows.
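
To illustrate (a rough sketch, sizes made up): with 257 doubles per row the driver pads the pitch, so the rows can no longer be fused into one transfer:

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t nx = 257, rows = 256;           // 257 doubles: pitch > width
    const size_t width = nx * sizeof(double);

    size_t pitch;
    double* dev;
    cudaMallocPitch((void**)&dev, &pitch, width, rows); // pitch padded by driver

    double* host;
    cudaMallocHost((void**)&host, width * rows);

    // pitch > width here, so the driver cannot fuse the rows: expect one
    // transfer per row rather than a single block transfer.
    cudaMemcpy2D(host, width, dev, pitch, width, rows, cudaMemcpyDeviceToHost);

    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}
```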

Here are my results:

x, y, z, x-z face bytes, memcpy3d time (s) / throughput (GB/s), memcpy2d time (s) / throughput (GB/s)
256 256 256 524288 0.000148 3.303837 0.000086 5.677735
257 256 256 526336 0.000156 3.144393 0.000150 3.270108

Hello,

Unless I set a parameter incorrectly, both function calls copy the same underlying data: z blocks of x contiguous doubles, with the stride between blocks being the row pitch times the number of rows in y (so the 2-d copy is not contiguous either). The only difference is whether the data is fetched with the 3-d call or the 2-d call, and I get lower performance in the 3-d case (especially when using streams). Also, perhaps it is a result of being on different machines, but I see negligible degradation for a pitched x-dimension (a sketch of the 2-d call follows the numbers):

x, y, z, x-z face bytes, memcpy3d time (s) / throughput (GB/s), memcpy2d time (s) / throughput (GB/s)
256 256 256 524288 0.000169 2.895267 0.000103 4.718463
257 256 256 526336 0.000170 2.875548 0.000105 4.657051
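
For reference, the 2-d call is essentially this (a sketch; devBase/pitch and the dimensions are assumed to match the 3-d setup above):

```cpp
#include <cuda_runtime.h>

// Fetch the X-Z face at row yIdx with cudaMemcpy2D by treating the slice
// stride (row pitch times ny) as the source pitch; devBase is the base
// pointer of the pitched 3-d allocation.
void copyFace2D(char* devBase, size_t pitch,
                size_t nx, size_t ny, size_t nz,
                size_t yIdx, double* host) {
    const char* src = devBase + yIdx * pitch;    // start of the chosen X-Z plane
    cudaMemcpy2D(host, nx * sizeof(double),      // contiguous host rows
                 src, pitch * ny,                // stride between z blocks
                 nx * sizeof(double), nz,        // nz rows of nx doubles
                 cudaMemcpyDeviceToHost);
}
```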

You’re right, I missed the fact that you were multiplying the pitch by y.

The discrepancy we get with a size of 257 is still strange. For reference, I used a Tesla C2050.