PCIe bandwidth is asymmetric between host-to-device and device-to-host

Hi,

Is the bandwidth from CPU to GPU usually greater than the bandwidth from GPU to CPU?

$$ BW(\text{CPU} \to \text{GPU}) \gg BW(\text{GPU} \to \text{CPU}) $$

I tested the bandwidth with nvbandwidth and with bandwidthTest from the CUDA samples, and found an asymmetry between the host-to-device and device-to-host bandwidth.

I’m using PCIe Gen 5 and an H100 GPU.

    Product Name                          : NVIDIA H100 PCIe
            PCIe Generation
            SRAM PCIE                     : 0

- bandwidthTest

When I copied about 64 MB of data from the host to the device, the bandwidth was 4.20 GB/s; copying from the device to the host reached only 1.33 GB/s.
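In case it helps to reproduce this outside the samples, below is a minimal standalone sketch of the same kind of measurement: it times repeated cudaMemcpy calls over a 64 MB pinned host buffer with CUDA events. The buffer size and iteration count are just chosen to match the numbers above; this is not the actual bandwidthTest source, and error checking is omitted.

```cpp
// Minimal H2D/D2H bandwidth check (sketch, not bandwidthTest itself).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64ull << 20;  // ~64 MB transfer, matching the test above
    const int iters = 20;              // arbitrary repeat count to average over

    void *hBuf, *dBuf;
    cudaMallocHost(&hBuf, bytes);      // pinned host memory, so copies DMA directly over PCIe
    cudaMalloc(&dBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // Host -> Device
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.2f GB/s\n", iters * bytes / (ms * 1e6));

    // Device -> Host
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(hBuf, dBuf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H: %.2f GB/s\n", iters * bytes / (ms * 1e6));

    cudaFreeHost(hBuf);
    cudaFree(dBuf);
    return 0;
}
```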

- nvbandwidth

nvbandwidth Version: v0.6
Built from Git version: v0.6

CUDA Runtime Version: 12040
CUDA Driver Version: 12040
Driver Version: 550.127.05

Device 0: NVIDIA H100 PCIe (00000000:01:00)

Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
           0
 0      4.18

SUM host_to_device_memcpy_ce 4.18

Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
           0
 0      1.23

As you can see, the device-to-host memory copy reaches only 1.23 GB/s, while host-to-device reaches 4.18 GB/s.
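For concreteness, the ratio between the two directions from the nvbandwidth numbers works out to

$$ \frac{BW(\text{CPU} \to \text{GPU})}{BW(\text{GPU} \to \text{CPU})} = \frac{4.18\ \text{GB/s}}{1.23\ \text{GB/s}} \approx 3.4 $$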

From the two evaluations, I can tell that the data copy from GPU to CPU gets roughly a third of the bandwidth of the data copy from CPU to GPU. Is this conclusion right? I’m not sure why there is such a difference in bandwidth.