I experimented with both pinned and pageable host memory, increasing the cudaMemcpy transfer size little by little. Below is the graph of throughput (GB/s) vs. transfer size (KB).
Given that, why is the achievable throughput low when sending small amounts of data (in both the pageable and pinned cases)?
In my experience, transfers to/from pageable memory reliably reach over 6 GB/s when the size is around 1 MB. Below that, the throughput is lower.
→ I know that a PCIe packet carries about (2b/130b) ≈ 1.5% overhead, and that overhead matters more when the payload is small. However, I am not sure whether this is a major factor in the low throughput. If someone knows the correct answer, please let me know, or tell me how I can prove that this is the main factor.
The throughput with pageable memory is only about half the throughput with pinned memory. Why is this happening?
My guess is that the transfer process to/from pageable memory involves:
(1) Pinned memory allocation
(2) copy from pageable memory to pinned memory
(3) copy from pinned memory to device memory
I think this is because the transfer has to go through the steps above. On the other hand, if pinned memory is allocated from the beginning, only step (3) needs to be performed (a code sketch of the two cases is at the end of this post).
In my environment, the host memory bandwidth is about 17 GB/s, which is similar to the PCIe bandwidth (15.8 GB/s). That would account for a performance difference of about 2x.
If my guess is correct, is the time spent in (2) related to the host memory bandwidth?
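Here is that sketch of the two cases; the buffer size is just a placeholder and error checking is omitted, so it is only meant to illustrate the allocation difference, not to serve as my actual benchmark:

```cpp
// Minimal sketch of the two host allocation paths (not my actual benchmark).
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1 << 20;   // 1 MB, placeholder size
    void *d_buf;
    cudaMalloc(&d_buf, bytes);

    // Pageable host memory: cudaMemcpy must first copy the data into a pinned
    // staging buffer owned by the driver, then DMA from there -- steps (1)-(3).
    void *h_pageable = malloc(bytes);
    memset(h_pageable, 0, bytes);
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);

    // Pinned host memory: the buffer is page-locked from the start, so the
    // DMA engine can read it directly -- only step (3).
    void *h_pinned;
    cudaMallocHost(&h_pinned, bytes);
    memset(h_pinned, 0, bytes);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_buf);
    return 0;
}
```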
Above the physical transport layer (your 2/130 math applies to that), PCIe is a packet-based transport. The packet payload size used by GPUs is 128 bytes as I recall, and the packet header comprises 20 bytes. This leads to an efficiency of about 86% relative to the physical layer. Because each transfer incurs a fixed-size overhead, this rate (equivalent to about 12 GB/sec for a PCIe gen3 x16 link) is only achieved for long transfers, on the order of 8 MB or more. Smaller transfers lead to lower transfer rates, as you noted.
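Spelling out the arithmetic behind the ~86% figure:

$$\frac{128}{128 + 20} = \frac{128}{148} \approx 0.865$$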
Transfers to and from the GPU happen by DMA, and DMA transfers require contiguous physical addresses. So transfers involving pageable memory require two hops: (1) a DMA transfer between the GPU and a pinned memory buffer provided by the driver (that buffer is, I think, on the order of a couple of MB), and (2) a copy between that pinned buffer and the pageable user-process memory. The performance of this second step is obviously highly dependent on the performance of the host’s system memory. For hosts with very high-throughput system memory (four or six DDR4-2666 channels), the overhead is small enough that the difference between pageable and pinned host memory can often be neglected. Because of the driver’s limited buffer size, transfers between GPU and CPU in excess of the buffer size have to be broken up into multiple chunks.
A host system memory bandwidth of 17 GB/sec is abysmally low. What kind of system is this, and how was the throughput measured? I can recommend the STREAM benchmark. If the host’s system memory bandwidth is about equal to the PCIe bandwidth, a two-hop transfer to pageable memory would cut the transfer speed in half compared to the use of a pinned memory buffer, as you observed.
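If setting up STREAM is inconvenient, even a crude single-threaded triad-style loop along the following lines gives a rough lower bound. This is just a sketch, not the actual STREAM benchmark, and a single thread may not saturate a multi-channel memory system:

```cpp
// Rough sketch of a STREAM-like "triad" measurement: a[i] = b[i] + s * c[i].
// Compile with optimization (e.g. -O2). Not the actual STREAM benchmark.
#include <chrono>
#include <cstdio>
#include <vector>

int main(void)
{
    const size_t n = 1 << 24;                  // 16M doubles per array (~128 MB each)
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
    const double s = 3.0;

    double best = 1e30;                        // keep the fastest of several runs
    for (int rep = 0; rep < 10; rep++) {
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
        auto t1 = std::chrono::steady_clock::now();
        double sec = std::chrono::duration<double>(t1 - t0).count();
        if (sec < best) best = sec;
    }

    // Triad moves 3 arrays of n doubles per pass (2 reads + 1 write).
    double gbps = 3.0 * n * sizeof(double) / best / 1e9;
    printf("triad bandwidth: ~%.1f GB/s (check value: %f)\n", gbps, a[n - 1]);
    return 0;
}
```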
You wrote: “Because each transfer incurs a fixed-size overhead, this rate (equivalent to about 12 GB/sec for a PCIe gen3 x16 link) is only achieved for long transfers, on the order of 8 MB or more.”
It’s weird.
I have found that the throughput stably reaches 12 GB/s from about 500 KB upward. Is the difference from the 8 MB you mentioned simply due to GPU performance? For reference, I am using a GTX 1050 Ti (112 GB/s memory bandwidth).
To answer your question: my host memory is DDR4 RAM with a 1066 MHz clock speed. Here is how I calculated the bandwidth:
Clock speed * 2 (DDR) * 64 (bitwidth)
= 1066MHz * 2 * 64-bit = 136.448 Gbps = 17.056 GB/s
If you run a test with a pinned buffer to isolate the PCIe transfers, doubling the transfer size at each step, you should see a gradual increase in the effective transfer speed. You would want to run multiple transfers at each size and record the fastest one to eliminate measurement artifacts.
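A minimal sketch of such a sweep might look like this; the maximum size and repeat count are placeholders, and error checking is omitted:

```cpp
// Sketch of the experiment described above: pinned host buffer only, transfer
// size doubled each step, fastest of several runs kept at each size.
#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    const size_t maxBytes = 64 << 20;       // sweep up to 64 MB (placeholder)
    const int reps = 20;                    // runs per size; keep the fastest

    void *d_buf, *h_pinned;
    cudaMalloc(&d_buf, maxBytes);
    cudaMallocHost(&h_pinned, maxBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t bytes = 1 << 10; bytes <= maxBytes; bytes *= 2) {
        float best = FLT_MAX;
        for (int i = 0; i < reps; i++) {
            cudaEventRecord(start);
            cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            if (ms < best) best = ms;
        }
        printf("%8zu KB : %6.2f GB/s\n", bytes / 1024, bytes / (best * 1e6));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```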
You would want to measure your system memory bandwidth instead of computing it. Your system may use more than one channel of DDR4. Any recent system likely has at least two.
I think I understand why not all of the theoretical maximum bandwidth is available. But it is still hard to understand why sending a small transfer has a lower throughput than sending a large one.
Let me take one situation as an example.
I used lspci to query the environment of the VGA controller.
Max_Payload_Size = 256 Byte
Max_Read_Request = 512 Byte
If I want to send less than 256 bytes of data, I guess it will be sent with lower efficiency due to the fixed per-packet overhead (a 20-byte header is a smaller fraction of a 200-byte payload than of a 100-byte payload).
However, in typical CUDA programming situations, the transfer size is not as small as a few bytes. For example, transmitting 16 KB takes 64 packets of 256 bytes, the maximum payload size. Since all 64 packets are full MPS-sized packets, I would expect the efficiency to be at its maximum, so I wonder why this reasoning is wrong.
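To make my arithmetic concrete (assuming the same 20-byte header you mentioned earlier also applies to 256-byte packets):

$$\frac{16384\ \mathrm{B}}{256\ \mathrm{B}} = 64\ \text{packets}, \qquad \frac{256}{256 + 20} \approx 0.928$$

So per-packet payload efficiency alone would predict only about a 7% loss, which seems far too small to explain the throughput difference I observe.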
These are results from running the CUDA sample bandwidthTest. From my past experience, this test averages the execution times of 100 cudaMemcpy runs at the same transfer size.
For pinned memory:
6.9 GB/s at 16 KB,
9.0 GB/s at 32 KB,
10.7 GB/s at 64 KB,
11.8 GB/s at 128 KB,
12.4 GB/s at 256 KB.
After that, it saturates at about 12 to 13 GB/s.
What I want to know is the difference between 6.9 GB/s at 16 KB and 12.4 GB/s at 256 KB.
Incidentally, thanks for your kind previous answer.
That looks like a very good result for a PCIe gen3 x16 link, and the throughput difference based on transfer size looks normal to me. I don’t understand what is unclear about this situation based on the previous explanation.
There is per-packet header overhead that limits the maximum achievable transfer rate relative to the theoretical throughput based purely on the physical limits. When you add to that a fixed overhead incurred for each transfer, the effective transfer rate will be lower for smaller transfers.
Based on two of your data points, I compute the per-transfer overhead to be 1.125 microseconds by solving a system of two equations in two unknowns. A 16 KB transfer requires 1.25 microseconds of pure transmission time, a 64 KB transfer requires 5 microseconds, and a 256 KB transfer requires 20 microseconds.
So in one second (1 million microseconds) we can achieve 1e6/(1.25+1.125) ≈ 421,000 transfers of 16 KB each, for a total of 6.9e9 bytes; or 1e6/(5+1.125) ≈ 163,200 transfers of 64 KB each, for a total of 10.7e9 bytes; or 1e6/(20+1.125) ≈ 47,300 transfers of 256 KB each, for a total of 12.4e9 bytes. As the transfer size grows even further, the effective transfer rate approaches 13.1 GB/sec asymptotically.
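In formula form, with $S$ the transfer size, the model behind these numbers is

$$T(S) = t_0 + \frac{S}{B}, \qquad R(S) = \frac{S}{T(S)} = \frac{S}{t_0 + S/B}$$

with $t_0 \approx 1.125\,\mu\mathrm{s}$ and $B \approx 13.1$ GB/sec here; as $S$ grows, $R(S)$ approaches $B$.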
You wrote: “Based on two of your data points, I compute the per-transfer overhead to be 1.125 microseconds by solving a system of two equations in two unknowns. A 16 KB transfer requires 1.25 microseconds of pure transmission time, a 64 KB transfer requires 5 microseconds, and a 256 KB transfer requires 20 microseconds.”
Dear njuffa,
First of all, I would like to say thank you for the previous answers that helped me a lot.
In this answer, you mentioned a per-transfer overhead of 1.125 microseconds, but how did you derive that?
This value looks like just a “magic number”, yet it matches the actual experimental results very closely.
Similarly to your approach, the authors of this paper derived the transfer latency from Eq. (1) in Section 3.2, but they did not mention how they obtained the “15us” value.
It is very similar to actual experimental results because it is based on actual experimental results!
I assumed that transfer time is the sum of a fixed overhead plus a variable portion growing linearly with the number of bytes transferred. I took throughput measurements at transfer sizes of 16KB, 64KB, and 256 KB. Using two of these results, I set up a system of two simultaneous equations with two unknowns. I solved that system, and found that the fixed overhead per transfer is almost exactly 1.125 microseconds. I used the third measurement to validate my result and found an almost perfect match.
I don’t have the raw data anymore, but you should be able to reproduce my experiments and subsequent computation within the space of about ten minutes.
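A small sketch of that computation, plugging in the three throughput figures quoted earlier in this thread (because those figures are rounded, the result comes out near, but not exactly at, 1.125 microseconds):

```cpp
// Sketch: derive the fixed per-transfer overhead t0 and the asymptotic rate B
// from two (size, throughput) data points, assuming T(S) = t0 + S / B, then
// check the resulting model against a third data point.
#include <cstdio>

int main(void)
{
    // Pinned-memory throughput figures quoted earlier in this thread.
    const double s1 = 16.0 * 1024,  r1 = 6.9e9;   // 16 KB at 6.9 GB/s
    const double s2 = 256.0 * 1024, r2 = 12.4e9;  // 256 KB at 12.4 GB/s
    const double s3 = 64.0 * 1024,  r3 = 10.7e9;  // 64 KB at 10.7 GB/s (validation)

    // T(S) = t0 + S / B and T = S / R for each measurement:
    //   s1 / r1 = t0 + s1 / B
    //   s2 / r2 = t0 + s2 / B
    const double B  = (s2 - s1) / (s2 / r2 - s1 / r1);
    const double t0 = s1 / r1 - s1 / B;

    printf("B  = %.2f GB/s\n", B / 1e9);            // approx. 13.1 GB/s
    printf("t0 = %.3f microseconds\n", t0 * 1e6);   // approx. 1.12 microseconds
    printf("model at 64 KB: %.2f GB/s (measured %.2f GB/s)\n",
           s3 / (t0 + s3 / B) / 1e9, r3 / 1e9);
    return 0;
}
```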