Very poor performance with cudaMemcpy3DAsync / cudaMemcpy3D for transfer of strided data (device to ...

So far we have only looked at device to host transfers. Now, to complete this thread, I tested the inverse copy operations (host to device transfers) that are of interest for a halo update of a 3-D device array.

Here is the experiment for the YZ-plane (1-D contiguous host array to YZ-plane of 3-D device array):

$> cat t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_strided_h2d.cu
#include <iostream>
#include <cstring>   // for memset

int main(){
  const size_t nx = 512;
  const size_t ny = 512;
  const size_t nz = 512;
  float *d_dst, *h_src;
  cudaMalloc(&d_dst, nx*ny*nz*sizeof(d_dst[0]));
  cudaHostAlloc(&h_src, ny*nz*sizeof(d_dst[0]), cudaHostAllocDefault);
  cudaMemset(d_dst, 0, nx*ny*nz*sizeof(d_dst[0]));
  memset(h_src, 1, ny*nz*sizeof(d_dst[0]));

  cudaMemcpy3DParms cpy_params = {0};
  // Destination: YZ-plane at x = 0 of the 3-D device array (row pitch = nx*sizeof(float)).
  cpy_params.dstPtr = make_cudaPitchedPtr(d_dst, nx*sizeof(d_dst[0]), nx, ny);
  // Source: contiguous host buffer, viewed as a 1-element-wide, ny-tall, nz-deep volume.
  cpy_params.srcPtr = make_cudaPitchedPtr(h_src, sizeof(d_dst[0]), 1, ny);
  cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  // Extent: width in bytes (a single float), ny rows, nz slices.
  cpy_params.extent = make_cudaExtent(sizeof(d_dst[0]), ny, nz);
  cpy_params.kind   = cudaMemcpyHostToDevice;

  for (int i=0; i<100; i++){
    cudaMemcpy3DAsync(&cpy_params);
  }

  cudaDeviceSynchronize();
}
$> nvcc -arch=sm_60 t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_strided_h2d.cu 
$> nvprof ./a.out
==8015== NVPROF is profiling process 8015, command: ./a.out
==8015== Profiling application: ./a.out
==8015== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  184.77ms       100  1.8477ms  1.8443ms  1.8611ms  [CUDA memcpy HtoD]
                    0.00%     960ns         1     960ns     960ns     960ns  [CUDA memset]
      API calls:   56.63%  247.11ms         1  247.11ms  247.11ms  247.11ms  cudaMalloc
                   42.42%  185.11ms         1  185.11ms  185.11ms  185.11ms  cudaDeviceSynchronize
                    0.34%  1.4770ms         1  1.4770ms  1.4770ms  1.4770ms  cudaHostAlloc
                    0.27%  1.1857ms         1  1.1857ms  1.1857ms  1.1857ms  cuDeviceTotalMem
                    0.18%  775.66us       100  7.7560us  7.3710us  19.080us  cudaMemcpy3DAsync
...

The observed performance is only about 0.57 GB/s (512*512*4 B / (1.847/1e3 s) / 1e9) (see [CUDA memcpy HtoD]; NVVP reports the same throughput).
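For intuition: this YZ-plane transfer amounts to ny*nz = 262144 chunks of only sizeof(float) = 4 contiguous bytes each, with a destination stride of nx*sizeof(float) = 2048 bytes. Expressed as an equivalent 2-D strided copy, reusing d_dst and h_src from the code above (just a sketch for illustration, not something I benchmarked):

  // One 4-byte element per "row"; ny*nz rows in total.
  // dpitch = nx*sizeof(float) (device x-stride), spitch = sizeof(float) (contiguous host source).
  cudaMemcpy2DAsync(d_dst, nx*sizeof(float),
                    h_src, sizeof(float),
                    sizeof(float),       // width in bytes: one float
                    ny*nz,               // number of rows
                    cudaMemcpyHostToDevice);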

Here is the experiment for the XZ-plane (1-D contiguous host array to XZ-plane of 3-D device array):

$ cat t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_strided_xzplane_h2d.cu
#include <iostream>
#include <cstring>   // for memset

int main(){
  const size_t nx = 512;
  const size_t ny = 512;
  const size_t nz = 512;
  float *d_dst, *h_src;
  cudaMalloc(&d_dst, nx*ny*nz*sizeof(d_dst[0]));
  cudaHostAlloc(&h_src, nx*nz*sizeof(d_dst[0]), cudaHostAllocDefault);
  cudaMemset(d_dst, 0, nx*ny*nz*sizeof(d_dst[0]));
  memset(h_src, 1, nx*nz*sizeof(d_dst[0]));

  cudaMemcpy3DParms cpy_params = {0};
  // Destination: XZ-plane at y = 0 of the 3-D device array (row pitch = nx*sizeof(float)).
  cpy_params.dstPtr = make_cudaPitchedPtr(d_dst, nx*sizeof(d_dst[0]), nx, ny);
  // Source: contiguous host buffer, viewed as an nx-wide, 1-row-tall, nz-deep volume.
  cpy_params.srcPtr = make_cudaPitchedPtr(h_src, nx*sizeof(d_dst[0]), nx, 1);
  cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  // Extent: one full x-row (nx*sizeof(float) bytes), 1 row in y, nz slices.
  cpy_params.extent = make_cudaExtent(nx*sizeof(d_dst[0]), 1, nz);
  cpy_params.kind   = cudaMemcpyHostToDevice;

  for (int i=0; i<100; i++){
    cudaMemcpy3DAsync(&cpy_params);
  }

  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_60 t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_strided_xzplane_h2d.cu 
$ nvprof ./a.out
==9413== NVPROF is profiling process 9413, command: ./a.out
==9413== Profiling application: ./a.out
==9413== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  8.7544ms       100  87.543us  87.167us  98.207us  [CUDA memcpy HtoD]
                    0.01%     960ns         1     960ns     960ns     960ns  [CUDA memset]
      API calls:   95.91%  308.96ms         1  308.96ms  308.96ms  308.96ms  cudaMalloc
                    2.81%  9.0473ms         1  9.0473ms  9.0473ms  9.0473ms  cudaDeviceSynchronize
                    0.45%  1.4380ms         1  1.4380ms  1.4380ms  1.4380ms  cudaHostAlloc
                    0.36%  1.1639ms         1  1.1639ms  1.1639ms  1.1639ms  cuDeviceTotalMem
                    0.24%  782.17us       100  7.8210us  7.4170us  19.605us  cudaMemcpy3DAsync
...

We can see that we get a performance of about 12 GB/s (512*512*4 B / (87/1e6 s) / 1e9) (see [CUDA memcpy HtoD]; NVVP reports the same throughput).
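For comparison with the YZ-plane case: each XZ-plane copy transfers nz = 512 chunks of nx*sizeof(float) = 2048 contiguous bytes (one x-row per z-slice), which presumably explains the much better throughput. The equivalent 2-D formulation would be (again just an illustrative sketch, not benchmarked):

  // One contiguous x-row per z-slice; nz rows in total.
  // dpitch = nx*ny*sizeof(float) (device z-slice stride), spitch = nx*sizeof(float) (contiguous host source).
  cudaMemcpy2DAsync(d_dst, nx*ny*sizeof(float),
                    h_src, nx*sizeof(float),
                    nx*sizeof(float),    // width in bytes: one x-row
                    nz,                  // number of rows (= z-slices)
                    cudaMemcpyHostToDevice);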

I omit the experiment for the XY-plane here: being a copy from a contiguous buffer to a contiguous buffer, it will certainly give a performance at least as good as that of the XZ-plane (12 GB/s, see above).
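For completeness, here is a sketch of how the omitted XY-plane experiment would look (copying a contiguous host buffer of nx*ny floats, here the same size as h_src above since nx = ny = nz, to the XY-plane at z = 0; not benchmarked):

  cudaMemcpy3DParms cpy_params = {0};
  cpy_params.dstPtr = make_cudaPitchedPtr(d_dst, nx*sizeof(d_dst[0]), nx, ny);
  cpy_params.srcPtr = make_cudaPitchedPtr(h_src, nx*sizeof(d_dst[0]), nx, ny);
  cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.extent = make_cudaExtent(nx*sizeof(d_dst[0]), ny, 1);  // one full contiguous XY-slice
  cpy_params.kind   = cudaMemcpyHostToDevice;
  cudaMemcpy3DAsync(&cpy_params);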

Conclusion
The results are no surprise: they are fully in agreement with the device-to-host transfers. Again, the copy of an XZ-plane (and certainly also of an XY-plane) achieves very good performance; the copy of a YZ-plane, however, is very slow. A bug report should therefore be opened for the YZ-plane case in both transfer directions (device to host and host to device).

The instructions for filing a bug are linked in a sticky post at the top of this sub-forum: “How to report a bug”.

Thanks @Robert_Crovella. I will file a bug report as described there.

@Robert_Crovella and @njuffa, any final comments on the additional results that I showed here?

Thanks!