How to understand the speed of data transfers from CPU memory to GPU memory

Transfer size    Host-to-device bandwidth, pageable memory (GB/s)
4B 0.000603
8B 0.001247
16B 0.002364
32B 0.00491
64B 0.009794
128B 0.019905
256B 0.038854
512B 0.076136
1KB 0.151731
2KB 0.291505
4KB 0.593279
8KB 1.04918
16KB 1.692562
32KB 2.343518
64KB 2.952924
128KB 4.924556
256KB 5.556347
512KB 6.191286
1MB 6.409264
2MB 8.388285
4MB 9.868952
8MB 10.933235
16MB 11.437335
32MB 11.390979
64MB 8.666728
128MB 8.33996
256MB 8.350593
512MB 8.331942
1GB 8.322069
Why is the maximum transfer rate reached between 8MB and 32MB?

The drop-off in performance at smaller sizes is due to inefficiency associated with the packetization of data on PCIe, and to the fixed per-transfer overhead, which is amortized over fewer bytes when the transfer is small. In short, small buffer transfers are on average less efficient than larger buffer transfers.
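To make the overhead argument concrete, here is a rough Python sketch that converts a few rows of the table above into time per transfer. It assumes 1 GB/s means 1e9 bytes per second, which the original post does not state, so treat the numbers as approximate.

```python
# (size in bytes, bandwidth in GB/s) pairs copied from the table above.
# Assumption: 1 GB/s = 1e9 bytes/s; the post does not say which convention was used.
samples = [(4, 0.000603), (64, 0.009794), (1024, 0.151731), (4096, 0.593279)]

def transfer_time_us(size_bytes, bandwidth_gb_s):
    """Time for one transfer in microseconds, computed as size / bandwidth."""
    return size_bytes / (bandwidth_gb_s * 1e9) * 1e6

for size, bw in samples:
    print(f"{size:>5} B: {transfer_time_us(size, bw):.2f} us per transfer")
```

Under that assumption, each of these transfers takes roughly 6.5 to 7 microseconds regardless of size, which is what a fixed per-transfer cost would look like: the reported bandwidth simply scales with the number of bytes until the payload becomes large enough to matter.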

The drop-off in performance at larger sizes may be due to the fact that you are doing a (large) pageable transfer. If paging occurs during the transfer, it will slow down the average transfer rate, and paging is more likely to occur at larger buffer sizes. If this is happening on Windows, the WDDM driver model may also be playing a role, as it runs a virtual memory management scheme for GPU memory.

For the most predictable, consistent transfer performance, use pinned buffers. On Windows, the situation may also be improved somewhat by using the TCC driver model instead of WDDM, if that is possible for your GPU.

Even without demand paging taking place, there are some secondary effects that can negatively affect the performance of large transfers between GPU and pageable system memory, where “large” usually means over 4MB or so:

(1) The TLB (a CPU structure that caches virtual to physical address translations) can only cover a limited amount of system memory using standard 4K pages. Typical capacities are 512 or 1024 pages. Any memory transaction touching more pages than can fit in the TLB will incur cost for updating TLB entries (“TLB miss”). While processors typically also provide a small number of large pages (i.e. page sizes > 4K) in the TLB, they are unlikely to come into play here.
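As a quick check on point (1), the following Python sketch computes how much memory a TLB can cover with 4K pages for the two capacities mentioned above. The helper name is made up for illustration.

```python
PAGE_SIZE = 4 * 1024  # standard 4K page, in bytes

def tlb_reach_mb(entries, page_size=PAGE_SIZE):
    """Amount of memory (in MB, 1 MB = 2**20 bytes) covered by a TLB with `entries` pages."""
    return entries * page_size / (1024 * 1024)

for entries in (512, 1024):
    print(f"{entries} entries x 4K pages cover {tlb_reach_mb(entries):.0f} MB")
```

With 512 or 1024 entries the coverage works out to 2 MB or 4 MB, which lines up with "large" meaning over 4MB or so.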

(2) Large transfers from/to pageable system memory are broken up into multiple smaller transfers. The way these transfers work is that the CUDA driver allocates a fixed-sized, pinned DMA buffer at startup. I do not know how large it is, but circumstantial evidence suggests that it could be in the single-digit MB range. Transfers involving pageable system memory consist of a DMA-transfer between GPU and driver-allocated DMA buffer, and a system memory to system memory transfer between the DMA buffer and user application memory.
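The mechanism in point (2) can be sketched in plain Python. This is only an illustration of the control flow, not the driver's actual code: the "device" is an ordinary bytearray, and the staging buffer size is a free parameter because the real size is not known.

```python
def staged_copy(src, staging_size):
    """Copy `src` into a stand-in device buffer through a fixed-size staging buffer.

    Returns the filled stand-in device buffer and the number of chunks used.
    """
    staging = bytearray(staging_size)  # stand-in for the driver's pinned DMA buffer
    device = bytearray(len(src))       # stand-in for GPU memory
    chunks = 0
    for offset in range(0, len(src), staging_size):
        n = min(staging_size, len(src) - offset)
        # Step 1: system memory to system memory copy (user buffer -> staging buffer).
        staging[:n] = src[offset:offset + n]
        # Step 2: in the real driver this would be the DMA transfer over PCIe.
        device[offset:offset + n] = staging[:n]
        chunks += 1
    return device, chunks

payload = bytes(10 * 1024 * 1024)                     # 10 MB of pageable "host" data
device, chunks = staged_copy(payload, 4 * 1024 * 1024)  # hypothetical 4 MB staging buffer
print(chunks)  # number of staged chunks needed for this payload
```

The point to notice is that every byte is copied twice: once by the CPU into the staging buffer, and once by DMA to the GPU.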

There may be additional effects based on CPU and memory affinity settings that apply to NUMA situations (mostly multi-socket platforms), where there is a chunk of system memory associated with each CPU memory controller. For maximum performance you would want to take care to ensure that a given GPU always communicates with the “near” CPU and the “near” system memory. On Linux, you might control this with numactl, for example.

Overall, the efficiency of large transfers between GPU and pageable system memory relies heavily on the efficiency of system memory to system memory transfers, so systems using a larger number of DDR4 channels, and using higher speed grades of DDR4, will typically show higher performance. I would expect the differences to become more pronounced with the advent of PCIe gen 4, which should appear in the first systems in 2018.
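A simple model shows why the host-side copy matters so much. The Python sketch below is an assumption-laden estimate, not a description of what the driver does: it treats a pageable transfer as a host-to-host copy plus a PCIe transfer, either strictly one after the other or perfectly overlapped. The bandwidth figures are invented for illustration and are not measurements.

```python
def effective_bandwidth(pcie_gb_s, memcpy_gb_s, overlapped=False):
    """Estimated end-to-end bandwidth of a two-stage pageable transfer.

    overlapped=False: the stages run back to back, so their times add up.
    overlapped=True:  the stages are perfectly pipelined, so the slower one dominates.
    """
    if overlapped:
        return min(pcie_gb_s, memcpy_gb_s)
    return 1.0 / (1.0 / pcie_gb_s + 1.0 / memcpy_gb_s)

PCIE = 12.0  # hypothetical PCIe rate in GB/s
for memcpy in (10.0, 20.0, 40.0):  # hypothetical host memcpy rates in GB/s
    serial = effective_bandwidth(PCIE, memcpy)
    piped = effective_bandwidth(PCIE, memcpy, overlapped=True)
    print(f"memcpy {memcpy:>4.0f} GB/s: serial {serial:.2f}, overlapped {piped:.2f}")
```

In the serial case the result is always below the PCIe rate and improves noticeably as the host copy gets faster; in the overlapped case the host copy only matters once it becomes slower than PCIe. Either way, a faster interconnect makes the host-side copy a larger share of the total.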

Thanks for your reply. I don’t understand why transfers between the GPU and pageable system memory rely heavily on the efficiency of system memory to system memory transfers.