PCIe bandwidth is asymmetric between host-to-device and device-to-host

Hi,

Is the bandwidth from CPU to GPU usually greater than the bandwidth from GPU to CPU?

$$ BW(\text{CPU} \to \text{GPU}) \gg BW(\text{GPU} \to \text{CPU}) $$

I tested the bandwidth with nvbandwidth and with bandwidthTest from the CUDA samples, and found an asymmetry between the host-to-device and device-to-host bandwidth.

I’m using PCIe Gen 5 and an H100 GPU.

    Product Name                          : NVIDIA H100 PCIe
            PCIe Generation
            SRAM PCIE                     : 0

- bandwidthTest

When I copied about 64 MB of data from the host to the device, the bandwidth was 4.20 GB/s; copying from the device to the host reached only 1.33 GB/s.
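In case it helps to reproduce this outside the samples, below is a minimal standalone sketch of the same kind of measurement: it times repeated cudaMemcpy calls over a 64 MB pinned host buffer with CUDA events. The buffer size and iteration count are just chosen to match the numbers above; this is not the actual bandwidthTest source, and error checking is omitted.

```cpp
// Minimal H2D/D2H bandwidth check (sketch, not bandwidthTest itself).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64ull << 20;  // ~64 MB transfer, matching the test above
    const int iters = 20;              // arbitrary repeat count to average over

    void *hBuf, *dBuf;
    cudaMallocHost(&hBuf, bytes);      // pinned host memory, so copies DMA directly over PCIe
    cudaMalloc(&dBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // Host -> Device
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.2f GB/s\n", iters * bytes / (ms * 1e6));

    // Device -> Host
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(hBuf, dBuf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H: %.2f GB/s\n", iters * bytes / (ms * 1e6));

    cudaFreeHost(hBuf);
    cudaFree(dBuf);
    return 0;
}
```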

- nvbandwidth

nvbandwidth Version: v0.6
Built from Git version: v0.6

CUDA Runtime Version: 12040
CUDA Driver Version: 12040
Driver Version: 550.127.05

Device 0: NVIDIA H100 PCIe (00000000:01:00)

Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
           0
 0      4.18

SUM host_to_device_memcpy_ce 4.18

Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
           0
 0      1.23

As you can see, the device-to-host memory copy reaches only 1.23 GB/s, while host-to-device reaches 4.18 GB/s.
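For concreteness, the ratio between the two directions from the nvbandwidth numbers works out to

$$ \frac{BW(\text{CPU} \to \text{GPU})}{BW(\text{GPU} \to \text{CPU})} = \frac{4.18\ \text{GB/s}}{1.23\ \text{GB/s}} \approx 3.4 $$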

From the two evaluations, I can tell that the data copy from GPU to CPU gets roughly a third of the bandwidth of the data copy from CPU to GPU. Is this conclusion right? I’m not sure why there is such a difference in bandwidth.