The change of speed when copying data between host and device

Environment: CUDA 11.0, Titan RTX, PCIe 3.0 x16

Here is the code I used for copying:

checkCuda( cudaMallocHost((void**)&h_a, size)); // host pinned
checkCuda( cudaMallocHost((void**)&h_b, size)); // host pinned
checkCuda( cudaMalloc((void**)&d, size) );       // device


checkCuda( cudaEventRecord(startEvent, 0) );
checkCuda( cudaMemcpyAsync(d, h_a, size, cudaMemcpyHostToDevice) );
checkCuda( cudaEventRecord(stopEvent, 0) );
checkCuda( cudaEventSynchronize(stopEvent) );
checkCuda( cudaEventElapsedTime(&time, startEvent, stopEvent) );
//print time

checkCuda( cudaEventRecord(startEvent, 0) );
checkCuda( cudaMemcpyAsync(h_b, d,size, cudaMemcpyDeviceToHost) );
checkCuda( cudaEventRecord(stopEvent, 0) );
checkCuda( cudaEventSynchronize(stopEvent) );
checkCuda( cudaEventElapsedTime(&time, startEvent, stopEvent) );
//print time
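
The //print time step computes bandwidth from the event time, roughly like this (cudaEventElapsedTime reports the time in milliseconds, and size is in bytes):

// bandwidth in GB/s: bytes / (time_ms * 1e6)
printf("Host to Device bandwidth (GB/s): %f\n", size * 1e-6 / time);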

Here is the result I got:


Pinned transfers
Transfer size : 16KB

  //repeat 3 times
  Host to Device bandwidth (GB/s): 0.072402
  Host to Device bandwidth (GB/s): 1.199708
  Host to Device bandwidth (GB/s): 2.923840

  //repeat 3 times
  Device to Host bandwidth (GB/s): 1.453218
  Device to Host bandwidth (GB/s): 3.487723
  Device to Host bandwidth (GB/s): 3.814697



Pinned transfers
Transfer size : 12GB

  //repeat 3 times
  Host to Device bandwidth (GB/s): 10.908256
  Host to Device bandwidth (GB/s): 11.548207
  Host to Device bandwidth (GB/s): 11.547699
  
  //repeat 3 times
  Device to Host bandwidth (GB/s): 9.973671
  Device to Host bandwidth (GB/s): 10.838805
  Device to Host bandwidth (GB/s): 10.790790

I am confused about two things:

  1. Why was the first transfer always slower?
  2. In the 16KB case, the bandwidth calculated this way is very low. Is that mostly due to the launch overhead of cudaMemcpyAsync(), mostly due to some overhead or other mechanism of PCIe, or due to some other reason?

I also wonder whether the 11.5 GB/s I reached is normal for a Titan RTX, given that the ideal speed of PCIe 3.0 x16 is 15.8 GB/s. What is holding it back? Could I also learn a little about what cudaMemcpyAsync() does before the transfer actually starts, i.e. what kinds of “overhead” it may involve? (I am trying to figure out why the speed is low when transferring smaller amounts of data between host and device, so I want to make sure I have checked every possible factor.)

Any help will be appreciated!!!

  1. Many operations on the GPU or in computer science in general run faster the 2nd time. I doubt there is a readily available answer for this.
  2. There are various overheads, at the PCIe transfer level and at other levels of the software stack. A CUDA call typically has an overhead in the 5-50 us range. For large transfers this is negligible; for small transfers it matters (see the back-of-envelope after this list).
  3. Yes, 11.5 GB/s is normal for PCIe 3.0 x16. PCIe transfers have a payload-to-overhead ratio that prevents reaching the full theoretical bandwidth, and there are other contributing factors as well.
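
As a rough back-of-envelope using the 16 KB numbers above: 0.072 GB/s corresponds to 16384 B / 0.072e9 B/s ≈ 226 us for the first copy, and 2.92 GB/s corresponds to about 5.6 us for the third, so for transfers this small nearly all of the measured time is fixed overhead rather than data movement.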

Sorry, I won’t be able to give details about exactly the steps that cudaMemcpyAsync goes through. All of your observations appear normal to me.

PCIe uses packetized transport. Each packet has a packet header. Theoretical maximum throughput would require packets of infinite length, whereas currently GPUs support a maximum packet length of 256 bytes. There is also an overhead for the entire transfer.

I previously determined that the per-transfer overhead for host->device copy with a PCIe gen 3 x16 link is 1.125 microseconds. Therefore (to quote myself from a previous post):

A 16 KB transfer requires 1.25 microseconds to transmit, a 64KB transfer requires 5 microseconds to transmit, and a 256 KB transfer requires 20 microseconds to transmit. So in one second = 1 million microseconds, we achieve 1e6/(1.25+1.125) = 421000 transfers of 16 KB each for a total of 6.9e9 bytes, or 1e6/(5+1.125) = 163200 transfers of 64 KB for a total of 10.7e9 bytes, or 1e6/(20+1.125) = 47300 transfers of 256 KB each for a total of 12.4e9 bytes. As transfer size grows even further, the effective transfer rate will approach 13.1 GB/sec asymptotically.
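The arithmetic above boils down to a simple two-parameter model: effective rate = size / (size / line_rate + overhead). Here is a minimal sketch of that model, using the 1.125 us per-transfer overhead and the 13.1 GB/s asymptotic rate quoted above (the 12 GB entry is only there to show the asymptote; the specific sizes are illustrative):

#include <stdio.h>

int main(void)
{
    const double overhead_s = 1.125e-6;  // per-transfer overhead, seconds
    const double line_rate  = 13.1e9;    // asymptotic H2D rate, bytes/second
    const double sizes[]    = { 16384.0, 65536.0, 262144.0, 12e9 };

    for (int i = 0; i < 4; i++) {
        double t = sizes[i] / line_rate + overhead_s;   // time per transfer
        printf("%12.0f bytes -> %5.1f GB/s effective\n",
               sizes[i], sizes[i] / t * 1e-9);
    }
    return 0;
}

This reproduces the figures in the quoted paragraph: roughly 6.9 GB/s at 16 KB, 10.7 GB/s at 64 KB, 12.4 GB/s at 256 KB, approaching 13.1 GB/s for very large transfers.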

In practical terms, CUDA applications should strive to transfer data between host and device in as large chunks as possible, to minimize the impact of per-transfer overhead.
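As a concrete (hypothetical) illustration of that advice, assuming the same checkCuda helper as in the question and a made-up array of small host-side records: instead of issuing one 16 KB cudaMemcpyAsync per record, each paying the per-transfer overhead, pack the records into a single pinned staging buffer and issue one large copy. NUM_RECORDS, RECORD_BYTES, and records[] are illustrative names, not from the post.

#define NUM_RECORDS  1024
#define RECORD_BYTES 16384   // 16 KB per record (illustrative)

size_t total = (size_t)NUM_RECORDS * RECORD_BYTES;
char *h_staging, *d_buf;
checkCuda( cudaMallocHost((void**)&h_staging, total) );  // pinned host buffer
checkCuda( cudaMalloc((void**)&d_buf, total) );          // device buffer

// Gather the small pieces into one contiguous pinned buffer on the host...
// (records[] is a hypothetical array of NUM_RECORDS host pointers)
for (int i = 0; i < NUM_RECORDS; i++)
    memcpy(h_staging + (size_t)i * RECORD_BYTES, records[i], RECORD_BYTES);

// ...then pay the per-transfer overhead once instead of NUM_RECORDS times.
checkCuda( cudaMemcpyAsync(d_buf, h_staging, total, cudaMemcpyHostToDevice, 0) );
checkCuda( cudaStreamSynchronize(0) );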

Here is an example measurement from my Windows 10 system with Xeon W-2133 and a Quadro RTX 4000, using the CUDA sample app bandwidthTest from CUDA 11.1:

C:\Users\Norbert\My Programs>bandwidthTest --memory=pinned --htod --mode=range --start=262144 --end=5242880 --increment=262144
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro RTX 4000
 Range Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   262144                       11.131
   524288                       11.710
   786432                       11.911
   1048576                      11.973
   1310720                      12.034
   1572864                      12.106
   1835008                      12.125
   2097152                      12.150
   2359296                      12.171
   2621440                      12.187
   2883584                      12.197
   3145728                      12.213
   3407872                      12.220
   3670016                      12.230
   3932160                      12.236
   4194304                      12.232
   4456448                      12.238
   4718592                      12.181
   4980736                      12.249
   5242880                      12.254

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Note that this is still somewhat lower than what we would expect based on my theoretical calculations above.

This paper is dated but still useful for comprehension:

The 128-byte number (max payload) quoted in there was also correct for NVIDIA GPUs until recently.

This paper may also be of interest:

@Robert_Crovella @njuffa Thank you so much! The information you provided helped me understand a loooooot!!! Thanks again! Best Wishes!!!