The change of speed when copying data between host and device

Environment: CUDA 11.0, Titan RTX, PCIe 3.0 x16

Here is the code I used for copying:

checkCuda( cudaMallocHost((void**)&h_a, size)); // host pinned
checkCuda( cudaMallocHost((void**)&h_b, size)); // host pinned
checkCuda( cudaMalloc((void**)&d, size) );       // device


checkCuda( cudaEventRecord(startEvent, 0) );
checkCuda( cudaMemcpyAsync(d, h_a, size, cudaMemcpyHostToDevice) );
checkCuda( cudaEventRecord(stopEvent, 0) );
checkCuda( cudaEventSynchronize(stopEvent) );
checkCuda( cudaEventElapsedTime(&time, startEvent, stopEvent) );
//print time

checkCuda( cudaEventRecord(startEvent, 0) );
checkCuda( cudaMemcpyAsync(h_b, d,size, cudaMemcpyDeviceToHost) );
checkCuda( cudaEventRecord(stopEvent, 0) );
checkCuda( cudaEventSynchronize(stopEvent) );
checkCuda( cudaEventElapsedTime(&time, startEvent, stopEvent) );
//print time
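
The //print time step computes bandwidth from the event time, roughly like this (cudaEventElapsedTime reports the time in milliseconds, and size is in bytes):

// bandwidth in GB/s: bytes / (time_ms * 1e6)
printf("Host to Device bandwidth (GB/s): %f\n", size * 1e-6 / time);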

Here is the result I got:


Pinned transfers
Transfer size : 16KB

  //repeat 3 times
  Host to Device bandwidth (GB/s): 0.072402
  Host to Device bandwidth (GB/s): 1.199708
  Host to Device bandwidth (GB/s): 2.923840

  //repeat 3 times
  Device to Host bandwidth (GB/s): 1.453218
  Device to Host bandwidth (GB/s): 3.487723
  Device to Host bandwidth (GB/s): 3.814697



Pinned transfers
Transfer size : 12GB

  //repeat 3 times
  Host to Device bandwidth (GB/s): 10.908256
  Host to Device bandwidth (GB/s): 11.548207
  Host to Device bandwidth (GB/s): 11.547699
  
  //repeat 3 times
  Device to Host bandwidth (GB/s): 9.973671
  Device to Host bandwidth (GB/s): 10.838805
  Device to Host bandwidth (GB/s): 10.790790

I am confused about two things:

  1. Why was the first transfer always slower?
  2. In the 16KB case, the bandwidth calculated this way is very low. Is that mostly due to the launch overhead of cudaMemcpyAsync(), mostly due to some overhead or other mechanism of PCIe, or due to some other reason?

I also wonder whether the 11.5 GB/s I reached is normal for a Titan RTX, given that the ideal speed of PCIe 3.0 x16 is 15.8 GB/s. What is holding it back? Could I also learn a little about what cudaMemcpyAsync() does before the transfer actually starts, i.e. what kinds of “overhead” it may involve? (I am trying to figure out why the speed is low when transferring smaller amounts of data between host and device, so I want to make sure I have checked every possible factor.)

Any help will be appreciated!!!

  1. Many operations on the GPU or in computer science in general run faster the 2nd time. I doubt there is a readily available answer for this.
  2. There are various overheads, at the PCIe transfer level and at other levels of the software stack. A CUDA call typically has an overhead in the 5-50 us range. For large transfers this is negligible; for small transfers it matters (see the back-of-envelope after this list).
  3. Yes, 11.5 GB/s is normal for PCIe 3.0 x16. PCIe transfers have a payload-to-overhead ratio that prevents reaching the full theoretical bandwidth, and there are other contributing factors as well.
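
As a rough back-of-envelope using the 16 KB numbers above: 0.072 GB/s corresponds to 16384 B / 0.072e9 B/s ≈ 226 us for the first copy, and 2.92 GB/s corresponds to about 5.6 us for the third, so for transfers this small nearly all of the measured time is fixed overhead rather than data movement.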

Sorry, I won’t be able to give details about exactly the steps that cudaMemcpyAsync goes through. All of your observations appear normal to me.

PCIe uses packetized transport. Each packet has a packet header. Theoretical maximum throughput would require packets of infinite length, whereas currently GPUs support a maximum packet length of 256 bytes. There is also an overhead for the entire transfer.

I previously determined that the per-transfer overhead for host->device copy with a PCIe gen 3 x16 link is 1.125 microseconds. Therefore (to quote myself from a previous post):

A 16 KB transfer requires 1.25 microseconds to transmit, a 64KB transfer requires 5 microseconds to transmit, and a 256 KB transfer requires 20 microseconds to transmit. So in one second = 1 million microseconds, we achieve 1e6/(1.25+1.125) = 421000 transfers of 16 KB each for a total of 6.9e9 bytes, or 1e6/(5+1.125) = 163200 transfers of 64 KB for a total of 10.7e9 bytes, or 1e6/(20+1.125) = 47300 transfers of 256 KB each for a total of 12.4e9 bytes. As transfer size grows even further, the effective transfer rate will approach 13.1 GB/sec asymptotically.
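The arithmetic above boils down to a simple two-parameter model: effective rate = size / (size / line_rate + overhead). Here is a minimal sketch of that model, using the 1.125 us per-transfer overhead and the 13.1 GB/s asymptotic rate quoted above (the 12 GB entry is only there to show the asymptote; the specific sizes are illustrative):

#include <stdio.h>

int main(void)
{
    const double overhead_s = 1.125e-6;  // per-transfer overhead, seconds
    const double line_rate  = 13.1e9;    // asymptotic H2D rate, bytes/second
    const double sizes[]    = { 16384.0, 65536.0, 262144.0, 12e9 };

    for (int i = 0; i < 4; i++) {
        double t = sizes[i] / line_rate + overhead_s;   // time per transfer
        printf("%12.0f bytes -> %5.1f GB/s effective\n",
               sizes[i], sizes[i] / t * 1e-9);
    }
    return 0;
}

This reproduces the figures in the quoted paragraph: roughly 6.9 GB/s at 16 KB, 10.7 GB/s at 64 KB, 12.4 GB/s at 256 KB, approaching 13.1 GB/s for very large transfers.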

In practical terms, CUDA applications should strive to transfer data between host and device in as large chunks as possible, to minimize the impact of per-transfer overhead.
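As a concrete (hypothetical) illustration of that advice, assuming the same checkCuda helper as in the question and a made-up array of small host-side records: instead of issuing one 16 KB cudaMemcpyAsync per record, each paying the per-transfer overhead, pack the records into a single pinned staging buffer and issue one large copy. NUM_RECORDS, RECORD_BYTES, and records[] are illustrative names, not from the post.

#define NUM_RECORDS  1024
#define RECORD_BYTES 16384   // 16 KB per record (illustrative)

size_t total = (size_t)NUM_RECORDS * RECORD_BYTES;
char *h_staging, *d_buf;
checkCuda( cudaMallocHost((void**)&h_staging, total) );  // pinned host buffer
checkCuda( cudaMalloc((void**)&d_buf, total) );          // device buffer

// Gather the small pieces into one contiguous pinned buffer on the host...
// (records[] is a hypothetical array of NUM_RECORDS host pointers)
for (int i = 0; i < NUM_RECORDS; i++)
    memcpy(h_staging + (size_t)i * RECORD_BYTES, records[i], RECORD_BYTES);

// ...then pay the per-transfer overhead once instead of NUM_RECORDS times.
checkCuda( cudaMemcpyAsync(d_buf, h_staging, total, cudaMemcpyHostToDevice, 0) );
checkCuda( cudaStreamSynchronize(0) );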

Here is an example measurement from my Windows 10 system with Xeon W-2133 and a Quadro RTX 4000, using the CUDA sample app bandwidthTest from CUDA 11.1:

C:\Users\Norbert\My Programs>bandwidthTest --memory=pinned --htod --mode=range --start=262144 --end=5242880 --increment=262144
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro RTX 4000
 Range Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   262144                       11.131
   524288                       11.710
   786432                       11.911
   1048576                      11.973
   1310720                      12.034
   1572864                      12.106
   1835008                      12.125
   2097152                      12.150
   2359296                      12.171
   2621440                      12.187
   2883584                      12.197
   3145728                      12.213
   3407872                      12.220
   3670016                      12.230
   3932160                      12.236
   4194304                      12.232
   4456448                      12.238
   4718592                      12.181
   4980736                      12.249
   5242880                      12.254

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Note that this is still somewhat lower than what we would expect based on my theoretical calculations above.

This paper is dated but still useful for comprehension:

The 128-byte number (max payload) quoted in there was also correct for NVIDIA GPUs until recently.

This paper may also be of interest:

@Robert_Crovella @njuffa Thank you so much! The information you provided helped me understand a loooooot!!! Thanks again! Best Wishes!!!