Hi,
Is the bandwidth from CPU to GPU usually greater than the bandwidth from GPU to CPU?
$$ \mathrm{BW}(\mathrm{CPU} \to \mathrm{GPU}) \gg \mathrm{BW}(\mathrm{GPU} \to \mathrm{CPU}) $$
I tested the bandwidth with nvbandwidth and the bandwidthTest CUDA sample, and found an asymmetry between the host-to-device and device-to-host bandwidth.
I’m using PCIe Gen 5 and an H100 GPU.
Product Name : NVIDIA H100 PCIe
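To double-check the negotiated link on my side, I use a minimal NVML query like the sketch below (just standard NVML calls, built with `g++ pcie_link_check.cpp -lnvidia-ml`; it is my own quick check, not something from the CUDA samples):

```cpp
// pcie_link_check.cpp -- print negotiated vs. maximum PCIe link settings via NVML (sketch)
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        std::fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        unsigned int curGen = 0, maxGen = 0, curWidth = 0, maxWidth = 0;
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &curGen);   // currently negotiated generation
        nvmlDeviceGetMaxPcieLinkGeneration(dev, &maxGen);    // maximum the device supports
        nvmlDeviceGetCurrPcieLinkWidth(dev, &curWidth);      // currently negotiated lane width
        nvmlDeviceGetMaxPcieLinkWidth(dev, &maxWidth);
        std::printf("PCIe link: current Gen %u x%u, max Gen %u x%u\n",
                    curGen, curWidth, maxGen, maxWidth);
    }
    nvmlShutdown();
    return 0;
}
```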
- bandwidthTest
Copying about 64 MB from host to device measured 4.20 GB/s, but copying the same data from device to host measured only 1.33 GB/s.
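For reference, this is roughly the kind of measurement I mean. The sketch below is my own minimal probe (the buffer size, iteration count, and the use of a pinned host buffer are my choices; it is not the bandwidthTest source itself):

```cpp
// bw_probe.cu -- minimal H2D/D2H bandwidth probe (sketch, pinned host memory only)
#include <cstdio>
#include <cuda_runtime.h>

// Time `iters` copies of `bytes` in the given direction and return GB/s.
static float timed_copy(void* dst, const void* src, size_t bytes,
                        cudaMemcpyKind kind, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyAsync(dst, src, bytes, kind);   // default stream
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    // GB/s = total bytes / elapsed seconds / 1e9 = bytes*iters / (ms * 1e6)
    return (bytes * (float)iters) / (ms * 1e6f);
}

int main() {
    const size_t bytes = 64ull << 20;   // ~64 MiB, matching the size I tested
    const int iters = 20;

    void* h = nullptr;
    void* d = nullptr;
    cudaMallocHost(&h, bytes);          // pinned host buffer
    cudaMalloc(&d, bytes);

    // Warm-up copy so the first timed transfer is not penalized.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    printf("H2D: %.2f GB/s\n", timed_copy(d, h, bytes, cudaMemcpyHostToDevice, iters));
    printf("D2H: %.2f GB/s\n", timed_copy(h, d, bytes, cudaMemcpyDeviceToHost, iters));

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```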
- nvbandwidth
nvbandwidth Version: v0.6
Built from Git version: v0.6
CUDA Runtime Version: 12040
CUDA Driver Version: 12040
Driver Version: 550.127.05
Device 0: NVIDIA H100 PCIe (00000000:01:00)
Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
           0
 0      4.18

SUM host_to_device_memcpy_ce 4.18

Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
           0
 0      1.23
As you can see, the device-to-host copy reaches only 1.23 GB/s, while the host-to-device copy reaches 4.18 GB/s.
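For reference, the ratios from the two tools come out to

$$ \frac{4.18}{1.23} \approx 3.4, \qquad \frac{4.20}{1.33} \approx 3.2 $$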
So across both evaluations, the copy from GPU to CPU gets roughly a third of the bandwidth of the copy from CPU to GPU, i.e. it is about 3-4 times slower. Is this conclusion right? I’m not sure why there is such a difference in bandwidth.