Very poor performance with cudaMemcpy3DAsync / cudaMemcpy3D for transfer of strided data (device to host)

Problem
I get very poor performance with cudaMemcpy3DAsync / cudaMemcpy3D when transferring strided data from device to host: ~300 MB/s. This is nearly 50 times slower than when transferring a contiguous chunk of the same size from the same data: ~12 GB/s (close to the expected theoretical value of 15.75 GB/s for the available 16 PCIe 3.0 lanes).

Code
Here is the relevant part of the code that leads to the poor performance (~300 MB/s):

#define nx    512
#define ny    512
#define nz    512

float *hbuf, *dbuf;
cudaHostAlloc(&hbuf, (size_t)nx*ny*nz*sizeof(float), cudaHostAllocDefault);
cudaMalloc(   &dbuf, (size_t)nx*ny*nz*sizeof(float));

cudaMemcpy3DParms cpy_params = {0};
cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
cpy_params.srcPtr = make_cudaPitchedPtr(dbuf, nx*sizeof(float), nx, ny);
cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
cpy_params.dstPtr = make_cudaPitchedPtr(hbuf, nx*sizeof(float), nx, ny);
cpy_params.extent = make_cudaExtent(1*sizeof(float), ny, nz);
cpy_params.kind   = cudaMemcpyDeviceToHost;

cudaMemcpy3DAsync(&cpy_params);

It is enough to modify cpy_params.extent as follows to copy a contiguous chunk of the same size and get good performance (~12 GB/s):

cpy_params.extent = make_cudaExtent(nx*sizeof(float), ny, 1);

Comparison to copying with kernel
If I do the equivalent copies with a kernel (mapping the host buffer to make it accessible to GPU kernels), the strided memory copy is about the same speed as the contiguous memory copy (~15.75 GB/s = expected value for the available 16 PCIe 3.0 lanes).
(Later: these numbers were not correct; an overview of the achieved performance with kernel vs cudaMemcpy3DAsync is given in a table on page 2 of this thread (July 8).)

Question
Am I doing something wrong in the usage of cudaMemcpy3DAsync?

Thanks!!

PS: GPU: NVIDIA® Tesla® P100
PPS: performance data were obtained with NVVP.

It’s not unusual for people to complain about low performance of strided copies over PCIE when using an API to do it, like cudaMemcpy2D or cudaMemcpy3D. I haven’t studied your case in detail, however it appears that you are transferring 4 bytes per sub-transfer. That would be especially bad. My understanding is that the DMA engine(s) have to be set up individually for each of these sub-transfers. For that reason, I would expect a direct-copy kernel to be vastly superior. It might not make much difference in terms of bus utilization, but the elimination of the DMA setup overhead might be substantial.
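For illustration, a direct-copy kernel for this YZ-plane pattern might look like the following. This is only a minimal sketch (not the poster’s actual kernel); it assumes a UVA-capable system so that the pinned host pointer returned by cudaHostAlloc can be dereferenced directly in device code, and the kernel name and launch configuration are made up for the example:

// Sketch: copy the x == 0 YZ-plane of an nx*ny*nz device array directly into a
// contiguous pinned host buffer. Reads from device memory are strided by nx floats;
// writes to the host buffer are contiguous and go straight over PCIe (no staging buffer).
__global__ void copy_yz_plane(const float * __restrict__ src, float * __restrict__ dst,
                              size_t nx, size_t ny, size_t nz)
{
  for (size_t i = threadIdx.x + (size_t)blockDim.x * blockIdx.x; i < ny * nz;
       i += (size_t)gridDim.x * blockDim.x)
    dst[i] = src[i * nx];   // i = z*ny + y, so the source index is (z*ny + y)*nx + 0
}

// Possible launch (grid/block sizes are illustrative):
//   copy_yz_plane<<<320, 512>>>(dbuf, hbuf, nx, ny, nz);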

(Later: subsequent developments in this thread have shown that a direct kernel copy, for a case that duplicates the above defined cudaMemcpy3D operation, is not materially faster than the cudaMemcpy3D case)

As Robert Crovella says, it is expected that strided copies across PCIe see a significant slowdown compared to contiguous copies. This effect increases as the size of the contiguous chunks is reduced. You are invoking what looks like the worst case scenario, or close to it.

In the early days of CUDA, CPU and GPU memory throughput was closer to PCIe throughput, and GPU memories were small. Therefore, the performance impact from strided copies across PCIe was often tolerable and the convenience factor (no additional buffers needed to restructure the data) a significant positive.

With modern systems using fast and large memories, an approach that uses contiguous bulk copies and re-orders data once it is resident inside the GPU memory is often preferable, especially for extreme scenarios involving tiny data chunks like the case at hand.

Thank you very much for your reply @Robert_Crovella. In the following are some replies to your comments:

I haven’t studied your case in detail, however it appears that you are transferring 4 bytes per sub-transfer.

  • I am testing the transfer of a YZ-boundary plane of a 3-D array (for a halo update); so, yes, there is a large stride (nx*4 B = 512*4 B = 2 KB) between each floating point number.

That would be especially bad.

  • Yet, it is what I need… (as do many others; it is a very common pattern, required for a 3-D halo update).

My understanding is that the DMA engine(s) have to be set up individually for each of these sub-transfers. For that reason, I would expect a direct-copy kernel to be vastly superior. It might not make much difference in terms of bus utilization, but the elimination of the DMA setup overhead might be substantial.

  • Does that mean that there is no other solution than to first copy the strided data with a kernel into an intermediate contiguous buffer on the GPU and then transfer that buffer simply with cudaMemcpyAsync (the 1-D version)?
    I would like to avoid that solution, as I have observed that such a kernel cannot truly overlap my computation kernels (they saturate the GPU resources), while the async device-to-host copy can. Note that copying with cudaMemcpy3DAsync to an intermediate buffer on the GPU does not seem to be a solution either, as the documentation states: “IMPORTANT NOTE: Copies with kind == cudaMemcpyDeviceToDevice are asynchronous with respect to the host, but never overlap with kernel execution.” [1]

[1] CUDA Toolkit Documentation 12.3 Update 1

Your immediate design alternatives are

(1) Use cudaMemcpy3DAsync() to copy and reformat data on the fly during PCIe transfer
(2) Use cudaMemcpyAsync() for contiguous bulk copy to intermediate buffer on GPU, then run kernel on the GPU to reformat data
(3) Use host code to reformat data as needed into a host-side buffer, then do a bulk copy to GPU with cudaMemcpyAsync()

(1) is the most convenient option from a programming perspective. (2) is likely the highest-performance option on modern hardware. (3) may not actually be applicable to your scenario; it is a bit hard to tell. You could try all three variants to explore which one best suits your needs. That is a three-way experiment that shouldn’t take more than a few hours.
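As a concrete illustration of option (2) as stated (host-to-device direction), one could bulk-copy a contiguous host buffer to a device staging buffer and then scatter it into the strided YZ-plane positions with a kernel. This is only a sketch; the names scatter_yz_plane, d_stage and stream, as well as the launch configuration, are made up for the example:

// Sketch of option (2): contiguous bulk copy to a device staging buffer, then a
// kernel scatters the data into the strided YZ-plane of the 3-D device array.
__global__ void scatter_yz_plane(const float * __restrict__ stage, float * __restrict__ dst,
                                 size_t nx, size_t ny, size_t nz)
{
  for (size_t i = threadIdx.x + (size_t)blockDim.x * blockIdx.x; i < ny * nz;
       i += (size_t)gridDim.x * blockDim.x)
    dst[i * nx] = stage[i];   // place element (y, z) at x == 0 of the 3-D array
}

// Host side (stream and staging buffer of ny*nz floats assumed):
//   cudaMemcpyAsync(d_stage, hbuf, ny * nz * sizeof(float),
//                   cudaMemcpyHostToDevice, stream);        // contiguous bulk transfer
//   scatter_yz_plane<<<320, 512, 0, stream>>>(d_stage, dbuf, nx, ny, nz);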

Alternatively, you could try to rethink your overall design to avoid this particular copy altogether, for example moving the data structure from which you are extracting data to the GPU in its entirety early on and keeping it resident there. Reducing copies between host and device is an important aspect of high-performance CUDA programming.

The cost of halo exchange can be viewed from the standpoint of latency. Halo exchange has a latency, and the GPU is a latency hiding machine.

If you are doing volumetric (3D) work, then presumably the compute load varies as the cube of the volume side dimension, whereas the cost of halo exchange (however it happens) varies as the square of the side dimension.

Increase the size of your compute volume to the point that the cost of halo exchange is a latency that can be hidden while the compute kernel is proceeding. If the volume is large enough, we can hide an arbitrarily large halo exchange cost, even at low transfer rates, using njuffa’s design alternative 1.
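As a rough sketch of what hiding that latency could look like with design alternative 1, the strided copy can be issued in its own stream so that it proceeds while the compute kernel runs; compute_kernel is a placeholder, and cpy_params is the structure from the first post:

// Sketch: overlap the halo transfer with computation by using separate streams.
cudaStream_t compute_stream, copy_stream;
cudaStreamCreate(&compute_stream);
cudaStreamCreate(&copy_stream);

// compute_kernel<<<grid, block, 0, compute_stream>>>(...);  // bulk of the volume work
cudaMemcpy3DAsync(&cpy_params, copy_stream);                 // slow strided halo copy runs in parallel

cudaStreamSynchronize(copy_stream);      // halo data is now on the host; hand it to MPI
cudaStreamSynchronize(compute_stream);   // interior computation finished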

I’m not aware of any magic that can be applied here. You seem to have a solid grasp of the situation.

This may be of interest:

[url]http://on-demand.gputechconf.com/gtc/2017/presentation/S7133-jiri-kraus-multi-gpu-programming.PDF[/url]

The left/right halo exchange depicted in the above presentation (e.g. slide 15) follows njuffa’s design alternative 2. No magic.

Another treatment is here:

[url]http://on-demand.gputechconf.com/gtc-eu/2017/presentation/23031-jiri-kraus-multi-gpu-programming-models.pdf[/url]

@njuffa:

Your immediate design alternatives are: …

  • The question here is not about copying from host to device, but from device to host (see the topic description above). What you noted in point (2) is the host-to-device analogue of what I noted in my reply to @Robert_Crovella above. Thanks nevertheless.

Alternatively, you could try to rethink your overall design to avoid this particular copy altogether.

  • The data is resident on the GPU (I do not follow any offload approach), and every iteration we need to update the halos of the 3-D arrays with the corresponding data from the neighboring GPUs. This is the minimal communication required by the numerical method used. I do it explicitly via the CPU, as CUDA-aware MPI with RDMA is currently not an option for my particular case.

@Robert_Crovella:

  • Thanks for the links; I will go through and see if I can learn something new there.

@Robert_Crovella and @njuffa:

  • So, to summarize, your answer to my question is that I cannot expect good performance from using cudaMemcpy3DAsync to copy this strided data from device to host. I have one final question:
    Can you explain why the performance is optimal - 15.75 GB/s - when I use instead a kernel to copy the strided data from the device to the host (see in the topic description “Comparison to copying with kernel”)? Does it in this case automatically create an intermediate contiguous buffer on the GPU before transferring the data via PCIe to the host?

Quite simply, I don’t believe that number. You’ve not provided a test case. Extraordinary claims require extraordinary proof. Any claim of data transfer rates (in a particular direction) above ~13 GB/s on an x16 PCIE Gen3 link to an NVIDIA GPU is most likely faulty. The way the NVIDIA GPU uses PCIE precludes such a possibility.

Just do a bit of research.

https://devtalk.nvidia.com/default/topic/1026924/cuda-programming-and-performance/-about-how-to-set-gpu-max-payload-size-please-help-me/

https://devtalk.nvidia.com/default/topic/766754/20-of-the-bandwidth-is-missing/

No, nothing of the sort happens.

(but see below)

PCIe uses packetized transport which incurs overhead, so theoretical transfer rates are not achievable in practice. For large bulk transfers (say, > 16 MB), throughput of 12.4 to 12.9 GB/sec represents the maximum unidirectional transfer rate that is achieved by GPUs when using a PCIe gen 3 x16 link. PCIe provides a full duplex link, so up to about 25 GB/sec can be moved when simultaneous upload and download from/to the GPU is used.

Higher host/device throughput requires a GPU/CPU combination that supports NVlink, which currently applies to PowerPC platforms only. PCIe4 (due sometime in 2020) is expected to roughly double throughput compared to PCIe3.

I wrote a quick device->host bulk copy kernel (non-strided) and ran it on Tesla V100 PCIE (x16, gen3 link), and was able to observe close to 13GB/s:

$ cat t1442.cu
#include <iostream>

template <typename T>
__global__ void copy(const T * __restrict__ src, T * __restrict__ dst, size_t n){

  for (size_t i  = threadIdx.x+blockDim.x*blockIdx.x; i < n; i+=gridDim.x*blockDim.x)
    dst[i] = src[i];
}

int main(){

  const size_t n = 8192000;
  int *d_src, *h_dst;
  cudaMalloc(&d_src, n*sizeof(d_src[0]));
  cudaHostAlloc(&h_dst, n*sizeof(d_src[0]), cudaHostAllocDefault);
  cudaMemset(d_src, 0, n*sizeof(d_src[0]));
  memset(h_dst, 1, n*sizeof(d_src[0]));
  copy<<<320,512>>>(d_src, h_dst, n);
  copy<<<320,512>>>(d_src, h_dst, n);
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1442 t1442.cu
$ nvprof ./t1442
==4042== NVPROF is profiling process 4042, command: ./t1442
==4042== Profiling application: ./t1442
==4042== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.97%  5.1531ms         2  2.5765ms  2.5765ms  2.5766ms  void copy<int>(int const *, int*, unsigned long)
                    0.03%  1.6320us         1  1.6320us  1.6320us  1.6320us  [CUDA memset]
      API calls:   91.27%  310.55ms         1  310.55ms  310.55ms  310.55ms  cudaMalloc
                    3.93%  13.386ms         1  13.386ms  13.386ms  13.386ms  cudaHostAlloc
                    1.60%  5.4420ms         4  1.3605ms  680.00us  3.3753ms  cuDeviceTotalMem
                    1.52%  5.1622ms         1  5.1622ms  5.1622ms  5.1622ms  cudaDeviceSynchronize
...

8192000x4 = 32768000 bytes in 2.5765ms = 12.7GB/s

The GB I am using here is 1,000,000,000 bytes, which I believe is consistent with the usage in bandwidthTest.

I am sorry, I didn’t remember the numbers right: I thought 15.75 GB/s was the practical limit, not the theoretical one. So, obviously, I made some mistake in my kernel copy test code. I will look for it and report back.

I found the error; the performance was then only about 12 GB/s. So I took your nice simple example and added a parameter STRIDE, which I define via a macro (passed as an argument to nvcc). Here are the results that I get with STRIDE=1 (equivalent to your original code) and STRIDE=512 (the same stride as in the cudaMemcpy3DAsync problem from the topic description):

$> cat t1442_float512x512_strided.cu
#include <iostream>

template <typename T>
__global__ void copy(const T * __restrict__ src, T * __restrict__ dst, size_t n){

  for (size_t i  = threadIdx.x+blockDim.x*blockIdx.x; i < n; i+=gridDim.x*blockDim.x)
    dst[i] = src[i*STRIDE];
}


int main(){

  const size_t n = 512*512;
  float *d_src, *h_dst;
  cudaMalloc(&d_src, STRIDE*n*sizeof(d_src[0]));
  cudaHostAlloc(&h_dst, n*sizeof(d_src[0]), cudaHostAllocDefault);
  cudaMemset(d_src, 0, STRIDE*n*sizeof(d_src[0]));
  memset(h_dst, 1, n*sizeof(d_src[0]));
  for (int i=0; i<100; i++){
    copy<<<512,512>>>(d_src, h_dst, n);
  }
  cudaDeviceSynchronize();
}

$> nvcc -DSTRIDE=1 -arch=sm_60 t1442_float512x512_strided.cu 
$> nvprof ./a.out
==32250== NVPROF is profiling process 32250, command: ./a.out
==32250== Profiling application: ./a.out
==32250== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  8.0016ms       100  80.015us  79.936us  80.639us  void copy<float>(float const *, float*, unsigned long)
                    0.01%     992ns         1     992ns     992ns     992ns  [CUDA memset]
...

$> nvcc -DSTRIDE=512 -arch=sm_60 t1442_float512x512_strided.cu 
$> nvprof ./a.out
==32307== NVPROF is profiling process 32307, command: ./a.out
==32307== Profiling application: ./a.out
==32307== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  9.5102ms       100  95.101us  94.079us  101.38us  void copy<float>(float const *, float*, unsigned long)
                    0.01%     992ns         1     992ns     992ns     992ns  [CUDA memset]
...

This gives the following throughputs:

STRIDE=1:   12.85 GB/s (512*512*4 / (81.6/1e6) / 1e9)
STRIDE=512: 11.03 GB/s (512*512*4 / (95.1/1e6) / 1e9)

So if I didn’t do any error here, then I would ask you again: can you explain why the performance is close to optimal when I use a kernel instead of cudaMemcpy3DAsync to copy the strided data from the device to the host?

What you have now shown is not an equivalent comparison, in my view. In your cudaMemcpy3D case, both the source and destination pointers are strided similarly:

cpy_params.srcPtr = make_cudaPitchedPtr(dbuf, nx*sizeof(float), nx, ny);
...
cpy_params.dstPtr = make_cudaPitchedPtr(hbuf, nx*sizeof(float), nx, ny);

In the code you have shown, only the source pointer is strided:

dst[i] = src[i*STRIDE];
               ^^^^^^

This matters a lot. The constrained bandwidth pipe here is PCIE, not main memory. By comparison, main memory has much more available bandwidth and may not present a serious constraint, even for inefficient (strided) access, relative to the throughput that is available on PCIE.

On the other hand, if we stride both accesses equivalently, which I believe is a better comparison to your cudaMemcpy3D case originally shown, then we witness a corresponding drop-off in throughput for the strided kernel case:

$ cat t1443.cu
#include <iostream>

template <typename T>
__global__ void copy(const T * __restrict__ src, T * __restrict__ dst, size_t n){

  for (size_t i  = threadIdx.x+blockDim.x*blockIdx.x; i < n; i+=gridDim.x*blockDim.x)
#ifndef USE_DEST_STRIDE
    dst[i] = src[i*STRIDE];
#else
    dst[i*STRIDE] = src[i*STRIDE];
#endif
}

int main(){

  const size_t n = 512*512;
  float *d_src, *h_dst;
  cudaMalloc(&d_src, STRIDE*n*sizeof(d_src[0]));
  cudaHostAlloc(&h_dst, STRIDE*n*sizeof(d_src[0]), cudaHostAllocDefault);
  cudaMemset(d_src, 0, STRIDE*n*sizeof(d_src[0]));
  memset(h_dst, 1, STRIDE*n*sizeof(d_src[0]));
  for (int i=0; i<100; i++){
    copy<<<512,512>>>(d_src, h_dst, n);
  }
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -DSTRIDE=512 -o t1443 t1443.cu
$ nvprof ./t1443
==23295== NVPROF is profiling process 23295, command: ./t1443
==23295== Profiling application: ./t1443
==23295== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.98%  8.9560ms       100  89.560us  89.250us  97.986us  void copy<float>(float const *, float*, unsigned long)
                    0.02%  1.6320us         1  1.6320us  1.6320us  1.6320us  [CUDA memset]
      API calls:   57.45%  313.18ms         1  313.18ms  313.18ms  313.18ms  cudaMalloc
                   38.78%  211.41ms         1  211.41ms  211.41ms  211.41ms  cudaHostAlloc
                    1.54%  8.4056ms         1  8.4056ms  8.4056ms  8.4056ms  cudaDeviceSynchronize
...
$ nvcc -arch=sm_70 -DSTRIDE=512 -o t1443 t1443.cu -DUSE_DEST_STRIDE
$ nvprof ./t1443
==23338== NVPROF is profiling process 23338, command: ./t1443
==23338== Profiling application: ./t1443
==23338== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  356.95ms       100  3.5695ms  3.5028ms  3.6556ms  void copy<float>(float const *, float*, unsigned long)
                    0.00%  1.6320us         1  1.6320us  1.6320us  1.6320us  [CUDA memset]
      API calls:   40.05%  356.42ms         1  356.42ms  356.42ms  356.42ms  cudaDeviceSynchronize
                   34.67%  308.54ms         1  308.54ms  308.54ms  308.54ms  cudaMalloc
                   23.92%  212.89ms         1  212.89ms  212.89ms  212.89ms  cudaHostAlloc
...
$

It may also be interesting to note that as a result of the destination stride, the unstrided bandwidth (~12GB/s) is reduced to something that is comparable to the bandwidth you reported in the strided cudaMemcpy3D case (~300MB/s), a ~40x reduction.

I should revisit one of the statements I made previously:

I see that we may not be comparing apples to apples. I had previously assumed that your kernel copy (the code of which you had not shown) was intended to duplicate the behavior of the cudaMemcpy3D call (which you had shown). Perhaps that is not the case.

It seems evident that for this kind of strided-load-from-GPU-memory-with-unstrided-sys-memory-write-over-PCIE, there is a “coalescing buffer” activity taking place prior to the write. To me it seems obvious that this should happen, but it may not be evident to everyone what is going on.

The strided read from src operation is uncoalesced. By inspection of the unstrided store to dst, that certainly follows the pattern for a coalesced write (e.g. it would certainly be a coalesced write if it were going to GPU global memory). There is effectively a “coalescing buffer” activity of sorts under the hood to make this happen. Items which are scattered in the cache are pulled together by the memory controller, before being written out (to sys memory, in this case). That sort of behavior is a standard process for the GPU memory controller, it is what coalescing refers to, and there isn’t anything extra being done to facilitate that, AFAIK. It’s the same coalescing-on-write you would get if you were writing instead to GPU DRAM (“global” logical space).

But as I’ve stated previously, I don’t consider this to be an apples-to-apples comparison, so I would not attempt to provide reasons to explain why this is, other than what has been covered already.

So when I said “nothing of the sort happens” I was referring to the strided read and strided write case. That is uncoalesced all the way through, at least based on the above data.

If an unstrided buffer on the host is an option for what you are doing, you might want to investigate the performance of cudaMemcpy3D where the destination (host memory) pointer is not a pitched pointer. Obviously, that data organization in host memory could not match the data organization in device memory. I don’t know that it would be any better, but it may be worth checking.

Thanks @Robert_Crovella! You are absolutely right that the cudaMemcpy3D performance of 300 MB/s that I reported needs to be compared to a kernel copy where both the source and the destination memory accesses are strided. It is nice to see that we then get performance results that are in agreement (about 300 MB/s in both cases).

Now I have come to realize that I didn’t call cudaMemcpy3DAsync with the right parameters for my needs. My aim is to use cudaMemcpy3DAsync to read strided data from the device and write it to a contiguous buffer on the host. So I am actually interested in a case that we have not yet looked at. In other words, I am looking for how to best handle the case for which we still have a question mark in the following performance table (approximate values; stride = 2 KB; data = 512x512x4 bytes):

#                           cudaMemcpy3DAsync    kernel copy
contiguous-to-contiguous:   ~13 GB/s             ~13 GB/s
strided-to-contiguous:      ?                    ~11 GB/s
strided-to-strided:         ~300 MB/s            ~300 MB/s

I have therefore done the following:

  1. I first modified your example in order to reproduce with cudaMemcpy3DAsync the performance of ~13 GB/s for copying contiguous data from the device to the host:
$> cat t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_contiguous.cu
#include <iostream>

int main(){
  const size_t nx = 512;
  const size_t ny = 512;
  const size_t nz = 512;
  float *d_src, *h_dst;
  cudaMalloc(&d_src, nx*ny*nz*sizeof(d_src[0]));
  cudaHostAlloc(&h_dst, ny*nz*sizeof(d_src[0]), cudaHostAllocDefault);
  cudaMemset(d_src, 0, nx*ny*nz*sizeof(d_src[0]));
  memset(h_dst, 1, ny*nz*sizeof(d_src[0]));

  cudaMemcpy3DParms cpy_params = {0};
  cpy_params.srcPtr = make_cudaPitchedPtr(d_src, nx*sizeof(d_src[0]), nx, ny);
  cpy_params.dstPtr = make_cudaPitchedPtr(h_dst, ny*sizeof(d_src[0]), ny, nz);
  cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.extent = make_cudaExtent(nx*sizeof(d_src[0]), ny, 1);
  cpy_params.kind   = cudaMemcpyDeviceToHost;

  for (int i=0; i<100; i++){
    cudaMemcpy3DAsync(&cpy_params);
  }
  
  cudaDeviceSynchronize();
}
$> nvcc -arch=sm_60 t1442_float512x512_cudaMemcpy3DAsync_contiguous_to_contiguous.cu
$> nvprof ./a.out
==23132== NVPROF is profiling process 23132, command: ./a.out
==23132== Profiling application: ./a.out
==23132== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  8.0308ms       100  80.307us  80.159us  88.895us  [CUDA memcpy DtoH]
                    0.01%     992ns         1     992ns     992ns     992ns  [CUDA memset]
      API calls:   97.31%  453.86ms         1  453.86ms  453.86ms  453.86ms  cudaMalloc
                    1.89%  8.7984ms         1  8.7984ms  8.7984ms  8.7984ms  cudaDeviceSynchronize
                    0.32%  1.4823ms         1  1.4823ms  1.4823ms  1.4823ms  cudaHostAlloc
                    0.25%  1.1824ms         1  1.1824ms  1.1824ms  1.1824ms  cuDeviceTotalMem
                    0.14%  636.49us        94  6.7710us     120ns  252.92us  cuDeviceGetAttribute
                    0.07%  334.62us       100  3.3460us  3.0270us  14.167us  cudaMemcpy3DAsync
...

We can see that we get a performance of about 13 GB/s (512*512*4 B / (80/1e6 s) / 1e9) (see [CUDA memcpy DtoH]).

  2. Then, I tried to modify the copy parameters to copy instead from a strided to a contiguous buffer (i.e. to copy a YZ-plane of a 3-D device array to a 1-D host array):
$> cat t1442_float512x512_cudaMemcpy3DAsync_strided_to_contiguous.cu
#include <iostream>

int main(){
  const size_t nx = 512;
  const size_t ny = 512;
  const size_t nz = 512;
  float *d_src, *h_dst;
  cudaMalloc(&d_src, nx*ny*nz*sizeof(d_src[0]));
  cudaHostAlloc(&h_dst, ny*nz*sizeof(d_src[0]), cudaHostAllocDefault);
  cudaMemset(d_src, 0, nx*ny*nz*sizeof(d_src[0]));
  memset(h_dst, 1, ny*nz*sizeof(d_src[0]));

  cudaMemcpy3DParms cpy_params = {0};
  cpy_params.srcPtr = make_cudaPitchedPtr(d_src, nx*sizeof(d_src[0]), nx, ny);
  cpy_params.dstPtr = make_cudaPitchedPtr(h_dst, sizeof(d_src[0]), 1, ny);      // MODIFIED!
  cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.extent = make_cudaExtent(sizeof(d_src[0]), ny, nz);                // MODIFIED!
  cpy_params.kind   = cudaMemcpyDeviceToHost;

  for (int i=0; i<100; i++){
    cudaMemcpy3DAsync(&cpy_params);
  }
  
  cudaDeviceSynchronize();
}
$> nvcc -arch=sm_60 t1442_float512x512_cudaMemcpy3DAsync_strided_to_contiguous.cu
$> nvprof ./a.out
==23234== NVPROF is profiling process 23234, command: ./a.out
==23234== Profiling application: ./a.out
==23234== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  115.31ms       100  1.1531ms  1.1522ms  1.1545ms  [CUDA memcpy DtoH]
                    0.00%     960ns         1     960ns     960ns     960ns  [CUDA memset]
      API calls:   68.99%  266.83ms         1  266.83ms  266.83ms  266.83ms  cudaMalloc
                   29.91%  115.69ms         1  115.69ms  115.69ms  115.69ms  cudaDeviceSynchronize
                    0.37%  1.4407ms         1  1.4407ms  1.4407ms  1.4407ms  cudaHostAlloc
                    0.32%  1.2477ms         1  1.2477ms  1.2477ms  1.2477ms  cuDeviceTotalMem
                    0.21%  799.60us       100  7.9950us  7.3030us  20.194us  cudaMemcpy3DAsync
...

The observed performance is only about 0.9 GB/s (512*512*4 B / (1.153/1e3 s) / 1e9) (see [CUDA memcpy DtoH]). I am, however, not sure whether I set the parameters right, as I don’t have any experience with cudaMemcpy3DAsync. If they were right, then it would be over 10 times slower than the corresponding kernel copy (~11 GB/s, see the table above).

Did I set the parameters right?

That sounds exactly like what I need. Do you mean by that what I tried to do above, or something else?

It seems to me that you have done it correctly, and yes, that is what I had intended. It seems that it is about 3x faster than the pitched->pitched cudaMemcpy3D copy (i.e. where you started this thread, at 300 MB/s), but not as fast as the pitched->unpitched copy kernel.

You believe that I set the parameters correctly, and NVVP also reports a throughput of 0.9 GB/s (the throughput is given directly, not only the kernel execution time). So we can probably trust this result. This brings up two questions:

  1. why is the cudaMemcpy3DAsync over 10 times slower than the corresponding kernel copy (~11 GB/s, see in the table above)?

  2. what can we do to speed it up?

Access patterns and how they interact with the memory hierarchy are important for achieved memory throughput. DMA engines are not as flexible as custom copy kernels in this regard, and they also incur setup overhead. The benefit they provide is that they don’t utilize execution resources and transfer data asynchronously.

Unless cudaMemcpy3DAsync() programs the GPU’s DMA engines in some sub-optimal fashion (something we cannot know as we can’t see the source code) there is nothing that can be done by a CUDA programmer to improve its performance. At most you could file a bug report about low cudaMemcpy3DAsync() performance, and this may result in improved performance in some future version of CUDA, or with future GPUs, depending on whether the performance limitations are inherent in the DMA engine design or their configuration by cudaMemcpy3DAsync().

Thanks @njuffa!

That’s exactly why I want to replace my copy kernels for the halo updates with cudaMemcpy3DAsync.
It is a pity that it doesn’t work well for the YZ plane. By the way, I tried the XZ plane (i.e. nz strided blocks of nx contiguous numbers), which works perfectly well, as one would expect:

$> cat t1442_float512x512_cudaMemcpy3DAsync_strided_to_contiguous_xzplane.cu
#include <iostream>

int main(){
  const size_t nx = 512;
  const size_t ny = 512;
  const size_t nz = 512;
  float *d_src, *h_dst;
  cudaMalloc(&d_src, nx*ny*nz*sizeof(d_src[0]));
  cudaHostAlloc(&h_dst, nx*nz*sizeof(d_src[0]), cudaHostAllocDefault);
  cudaMemset(d_src, 0, nx*ny*nz*sizeof(d_src[0]));
  memset(h_dst, 1, nx*nz*sizeof(d_src[0]));

  cudaMemcpy3DParms cpy_params = {0};
  cpy_params.srcPtr = make_cudaPitchedPtr(d_src, nx*sizeof(d_src[0]), nx, ny);
  cpy_params.dstPtr = make_cudaPitchedPtr(h_dst, nx*sizeof(d_src[0]), nx, 1);
  cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.extent = make_cudaExtent(nx*sizeof(d_src[0]), 1, nz);
  cpy_params.kind   = cudaMemcpyDeviceToHost;

  for (int i=0; i<100; i++){
    cudaMemcpy3DAsync(&cpy_params);
  }
  
  cudaDeviceSynchronize();
}
$> nvcc -arch=sm_60 t1442_float512x512_cudaMemcpy3DAsync_strided_to_contiguous_xzplane.cu 
$> nvprof ./a.out
==6131== NVPROF is profiling process 6131, command: ./a.out
==6131== Profiling application: ./a.out
==6131== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  8.0859ms       100  80.859us  80.703us  89.375us  [CUDA memcpy DtoH]
                    0.01%     992ns         1     992ns     992ns     992ns  [CUDA memset]
      API calls:   95.82%  288.64ms         1  288.64ms  288.64ms  288.64ms  cudaMalloc
                    2.79%  8.4075ms         1  8.4075ms  8.4075ms  8.4075ms  cudaDeviceSynchronize
                    0.49%  1.4693ms         1  1.4693ms  1.4693ms  1.4693ms  cudaHostAlloc
                    0.40%  1.2071ms         1  1.2071ms  1.2071ms  1.2071ms  cuDeviceTotalMem
                    0.26%  784.22us       100  7.8420us  7.2380us  18.554us  cudaMemcpy3DAsync
...

We can see that we get a performance of about 13 GB/s (512*512*4 B / (81/1e6 s) / 1e9) (see [CUDA memcpy DtoH]).

I would therefore draw the following conclusion: cudaMemcpy3DAsync works perfectly well as long as there are sufficiently large contiguous parts, as when copying the XY plane or XZ plane of a large 3-D array (here 512^3), but not in the particular case of a large stride (here 2 KB) between each single floating point number, as is the case when copying a YZ plane. In the latter case we unfortunately observed performance 10 times below that of a custom copy kernel.
I would therefore indeed like to open a bug report with NVIDIA for this, as I consider it a very fundamental issue: this copy is a building block for MPI halo updates of 3-D arrays. I hope very much that this can be improved. Can you point me to where I can open this bug report?