Pinned transfers
Transfer size : 16KB
//repeat 3 times
Host to Device bandwidth (GB/s): 0.072402
Host to Device bandwidth (GB/s): 1.199708
Host to Device bandwidth (GB/s): 2.923840
//repeat 3 times
Device to Host bandwidth (GB/s): 1.453218
Device to Host bandwidth (GB/s): 3.487723
Device to Host bandwidth (GB/s): 3.814697
Pinned transfers
Transfer size : 12GB
//repeat 3 times
Host to Device bandwidth (GB/s): 10.908256
Host to Device bandwidth (GB/s): 11.548207
Host to Device bandwidth (GB/s): 11.547699
//repeat 3 times
Device to Host bandwidth (GB/s): 9.973671
Device to Host bandwidth (GB/s): 10.838805
Device to Host bandwidth (GB/s): 10.790790
What confuses me is this:
Why is the first transfer always slower?
In the 16KB case, the bandwidth computed this way is very low. Is that mainly because of the launch overhead of cudaMemcpyAsync(), because of overhead or other mechanisms in PCIe itself, or because of something else?
I also wonder whether the 11.5 GB/s I reached is normal for a Titan RTX, given that the ideal speed of PCIe 3.0 x16 is 15.8 GB/s. What is holding it back? Could you give a rough idea of what cudaMemcpyAsync() does before the transfer actually starts, i.e. what kinds of "overhead" it may involve? (I am trying to figure out why the bandwidth is so low when transferring small amounts of data between host and device, so I want to make sure I have checked every possible factor.)
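A minimal sketch of the kind of timing loop that produces numbers like the above (assuming pinned host memory and cudaEvent timing; this is illustrative only, not the exact code used):

// Minimal sketch: time one host->device cudaMemcpyAsync on pinned memory
// using cudaEvent timing. Size and repeat count are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 16 * 1024;        // e.g. 16 KB per transfer
    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);         // pinned host buffer
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 3; ++i) {          // repeat 3 times, as above
        cudaEventRecord(start);
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Host to Device bandwidth (GB/s): %f\n",
               bytes / (ms * 1e-3) / 1e9);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}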
Many operations on the GPU, and in computer science in general, run faster the second time. I doubt there is a readily available answer for this.
There are various overheads, at the PCIe transfer level and at other levels of the software stack. A CUDA call typically has an overhead in the 5-50 us range. For large transfers this is negligible; for small transfers it matters.
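To make the scale of that effect concrete, here is a toy model of effective bandwidth = size / (fixed overhead + size / peak rate). This is a sketch only; the 10 us overhead and 12 GB/s peak are assumed round numbers, not measured values:

// Toy model: effective bandwidth when each transfer pays a fixed overhead.
// The 10 us overhead and 12 GB/s peak rate are assumed, illustrative values.
#include <cstdio>

int main() {
    const double overhead_s = 10e-6;   // assumed fixed per-call overhead
    const double peak_bps   = 12e9;    // assumed peak wire rate

    const double sizes[] = {16e3, 256e3, 16e6, 12e9};   // bytes
    for (double bytes : sizes) {
        double seconds = overhead_s + bytes / peak_bps;
        printf("%14.0f bytes -> %6.2f GB/s effective\n",
               bytes, bytes / seconds / 1e9);
    }
    return 0;
}

With these assumed numbers, a 16 KB transfer lands around 1.4 GB/s effective while a multi-gigabyte transfer approaches the peak rate, which is the same qualitative pattern as the measurements above.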
Yes, 11.5 GB/s is normal for PCIe 3.0 x16. PCIe transfers have a payload-to-overhead ratio that prevents achieving the full theoretical bandwidth, and other factors contribute as well.
Sorry, I won’t be able to give details about exactly the steps that cudaMemcpyAsync goes through. All of your observations appear normal to me.
PCIe uses packetized transport. Each packet has a packet header. Theoretical maximum throughput would require packets of infinite length, whereas currently GPUs support a maximum packet length of 256 bytes. There is also an overhead for the entire transfer.
I previously determined that the per-transfer overhead for host->device copy with a PCIe gen 3 x16 link is 1.125 microseconds. Therefore (to quote myself from a previous post):
A 16 KB transfer requires 1.25 microseconds to transmit, a 64KB transfer requires 5 microseconds to transmit, and a 256 KB transfer requires 20 microseconds to transmit. So in one second = 1 million microseconds, we achieve 1e6/(1.25+1.125) = 421000 transfers of 16 KB each for a total of 6.9e9 bytes, or 1e6/(5+1.125) = 163200 transfers of 64 KB for a total of 10.7e9 bytes, or 1e6/(20+1.125) = 47300 transfers of 256 KB each for a total of 12.4e9 bytes. As transfer size grows even further, the effective transfer rate will approach 13.1 GB/sec asymptotically.
In practical terms, CUDA applications should strive to transfer data between host and device in as large chunks as possible, to minimize the impact of per-transfer overhead.
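As an illustration of that advice, here is a minimal sketch of batching many small pieces through one pinned staging buffer, so the per-transfer overhead is paid once rather than once per piece (the function name and signature are illustrative, not a specific API):

// Sketch: pack many small host-side pieces into one contiguous pinned
// staging buffer, then issue a single large cudaMemcpyAsync instead of
// one small copy per piece. Names are illustrative.
#include <cstring>
#include <cuda_runtime.h>

void copy_pieces_batched(const void* const* src_pieces,
                         const size_t* piece_bytes, int n_pieces,
                         void* d_dst, void* h_pinned_staging,
                         cudaStream_t stream) {
    size_t offset = 0;
    for (int i = 0; i < n_pieces; ++i) {
        // CPU-side pack into the pinned staging buffer, back to back.
        memcpy(static_cast<char*>(h_pinned_staging) + offset,
               src_pieces[i], piece_bytes[i]);
        offset += piece_bytes[i];
    }
    // One large transfer pays the per-transfer overhead only once.
    cudaMemcpyAsync(d_dst, h_pinned_staging, offset,
                    cudaMemcpyHostToDevice, stream);
}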
Here is an example measurement from my Windows 10 system with a Xeon W-2133 and a Quadro RTX 4000, using the CUDA sample app bandwidthTest from CUDA 11.1:
C:\Users\Norbert\My Programs>bandwidthTest --memory=pinned --htod --mode=range --start=262144 --end=5242880 --increment=262144
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Quadro RTX 4000
Range Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
262144 11.131
524288 11.710
786432 11.911
1048576 11.973
1310720 12.034
1572864 12.106
1835008 12.125
2097152 12.150
2359296 12.171
2621440 12.187
2883584 12.197
3145728 12.213
3407872 12.220
3670016 12.230
3932160 12.236
4194304 12.232
4456448 12.238
4718592 12.181
4980736 12.249
5242880 12.254
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Note that this is still somewhat lower than what we would expect based on my theoretical calculations above.