I’ve been having performance issues with cudaMemcpy3D, particularly when copying an X-Z face of a large 3-D array into contiguous host memory. Even with wide rows, the copy bandwidth is much lower than expected. Attached is a benchmark I wrote that times this copy on a large array and compares it against a 2-D copy. For the size tested (a 256^3 array of doubles, so one face is 256 x 256 x 8 bytes = 512 KB per transfer), the throughput differs significantly: ~0.6 GB/s for the 3-D copy vs. ~4.5 GB/s for the 2-D copy, on an M2070. The trend roughly holds for other sizes as well. Am I doing something incorrectly, or is something suspect with the 3-D copy? Any help would be greatly appreciated.
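To make the access pattern concrete, here is a minimal host-only sketch of what an X-Z face copy looks like in memory. It assumes the array is stored with x varying fastest, then y, then z, which the post does not state explicitly. `FaceLayout`, `xz_face_layout` and `gather_face` are hypothetical names for illustration and are not taken from the attached benchmark. Under that assumption the face is a set of nz rows of nx contiguous elements, with consecutive rows nx*ny elements apart, which is the same shape a pitched 2-D copy describes (row width, source pitch, row count).

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper type: byte layout of the X-Z face at a fixed y index,
// assuming an nx*ny*nz array stored with x fastest, then y, then z.
struct FaceLayout {
    std::size_t row_bytes;     // contiguous bytes per row (one run along x)
    std::size_t pitch_bytes;   // byte distance between consecutive rows (one step in z)
    std::size_t rows;          // number of rows (extent along z)
    std::size_t offset_bytes;  // byte offset of the first row from the array base
};

FaceLayout xz_face_layout(std::size_t nx, std::size_t ny, std::size_t nz,
                          std::size_t y, std::size_t elem_size) {
    return { nx * elem_size, nx * ny * elem_size, nz, y * nx * elem_size };
}

// Host-side stand-in for the strided copy: gathers the face row by row
// into a contiguous destination buffer of rows * row_bytes bytes.
void gather_face(const double* src, double* dst, const FaceLayout& f) {
    const char* s = reinterpret_cast<const char*>(src) + f.offset_bytes;
    char* d = reinterpret_cast<char*>(dst);
    for (std::size_t r = 0; r < f.rows; ++r) {
        std::memcpy(d + r * f.row_bytes, s + r * f.pitch_bytes, f.row_bytes);
    }
}
```

For the 256^3 case this gives 2048-byte rows, a 512 KB pitch between rows, and 256 rows, i.e. 512 KB in total. On the device side these values would correspond to the `width`, `spitch` and `height` arguments of `cudaMemcpy2D`, with the destination pitch equal to the row width for a contiguous host buffer.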
test_3d_copy.cpp (2.65 KB)