cudaMemcpy3D Performance

I work with 3D volumes and need to read or write “volume slices”, where a slice has thickness >= 1 and is perpendicular to one of the axes. The slice must be transferred to the host. I’ve tried two strategies:

  1. “Slice staging”: Matching slice buffers are allocated on the host and the GPU. A kernel performs a D-D copy from the logical 3D memory into the contiguous slice buffer, and then a simple cudaMemcpy transfers that buffer D-H.

  2. cudaMemcpy3D
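
To make the comparison concrete, here is a minimal sketch of both strategies as I understand them. Everything named here is made up for illustration (gatherSlice, compareStrategies, the 64^3 float volume, the slice origin and extent), and it assumes the volume lives in unpadded linear memory from cudaMalloc with x as the fastest axis. It only checks that the two strategies return the same data; it does not reproduce my timings.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// On error: print a message and return 1 from the enclosing function.
// (Sketch only: device buffers are not freed on these early returns.)
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    std::fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e_)); \
    return 1; } } while (0)

// Strategy 1 kernel: gather a sub-box of the volume into a contiguous buffer.
// One thread per slice element; x is the fastest axis in both layouts.
__global__ void gatherSlice(const float* vol, float* slice,
                            int nx, int ny, int3 org, int3 ext)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ext.x * ext.y * ext.z) return;
    int x = i % ext.x;
    int y = (i / ext.x) % ext.y;
    int z = i / (ext.x * ext.y);
    size_t src = ((size_t)(org.z + z) * ny + (org.y + y)) * nx + (org.x + x);
    slice[i] = vol[src];
}

// Runs both strategies on the same slice; returns 0 if the results match.
int compareStrategies(int3 dim, int3 org, int3 ext)
{
    size_t nVol   = (size_t)dim.x * dim.y * dim.z;
    size_t nSlice = (size_t)ext.x * ext.y * ext.z;
    std::vector<float> hVol(nVol), hStaged(nSlice), hDirect(nSlice);
    for (size_t i = 0; i < nVol; ++i) hVol[i] = (float)i;

    float *dVol = nullptr, *dSlice = nullptr;
    CHECK(cudaMalloc((void**)&dVol, nVol * sizeof(float)));
    CHECK(cudaMalloc((void**)&dSlice, nSlice * sizeof(float)));
    CHECK(cudaMemcpy(dVol, hVol.data(), nVol * sizeof(float),
                     cudaMemcpyHostToDevice));

    // Strategy 1: D-D gather into the staging buffer, then one contiguous D-H copy.
    int threads = 256;
    int blocks  = (int)((nSlice + threads - 1) / threads);
    gatherSlice<<<blocks, threads>>>(dVol, dSlice, dim.x, dim.y, org, ext);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(hStaged.data(), dSlice, nSlice * sizeof(float),
                     cudaMemcpyDeviceToHost));

    // Strategy 2: a single cudaMemcpy3D straight from the volume to the host.
    // With no CUDA array involved, x positions and widths are given in bytes.
    cudaMemcpy3DParms p = {};
    p.srcPtr = make_cudaPitchedPtr(dVol, dim.x * sizeof(float),
                                   dim.x * sizeof(float), dim.y);
    p.srcPos = make_cudaPos(org.x * sizeof(float), org.y, org.z);
    p.dstPtr = make_cudaPitchedPtr(hDirect.data(), ext.x * sizeof(float),
                                   ext.x * sizeof(float), ext.y);
    p.dstPos = make_cudaPos(0, 0, 0);
    p.extent = make_cudaExtent(ext.x * sizeof(float), ext.y, ext.z);
    p.kind   = cudaMemcpyDeviceToHost;
    CHECK(cudaMemcpy3D(&p));

    CHECK(cudaFree(dSlice));
    CHECK(cudaFree(dVol));

    for (size_t i = 0; i < nSlice; ++i)
        if (hStaged[i] != hDirect[i]) return 1;
    return 0;
}

int main()
{
    // Thickness-2 slice perpendicular to x, the fastest memory axis.
    int rc = compareStrategies(make_int3(64, 64, 64),
                               make_int3(10, 0, 0), make_int3(2, 64, 64));
    std::printf("%s\n", rc == 0 ? "strategies agree" : "mismatch or CUDA error");
    return rc;
}
```

To test the other two orientations, change the origin and extent passed to compareStrategies, for example make_int3(0, 0, 10) and make_int3(64, 64, 2) for the slowest axis. For timing, each strategy can be bracketed with cudaEventRecord and measured with cudaEventElapsedTime. The host buffers here are pageable std::vector storage; pinned host memory may change the numbers.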

My simple tests show that approach 1 substantially outperforms cudaMemcpy3D when the slice is perpendicular to the fastest or second-fastest memory axis, but is slightly slower for the slowest memory axis. Can anybody share their experiences? cudaMemcpy3D is 20x slower when applied on the fastest memory axis! (This is on a QFX5600; I will test a C1060 next.)

Thank you,
Dan