Our server has 8 RTX 3090 GPUs that cannot peer-access each other, which results in very low P2P bandwidth (~3 GB/s).
Some details of the server are below; please let me know if any other information is needed:
Result of “nvidia-smi topo -m”
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X PIX PIX PIX PXB PXB PXB PXB 0-23,48-71 0
GPU1 PIX X PIX PIX PXB PXB PXB PXB 0-23,48-71 0
GPU2 PIX PIX X PIX PXB PXB PXB PXB 0-23,48-71 0
GPU3 PIX PIX PIX X PXB PXB PXB PXB 0-23,48-71 0
GPU4 PXB PXB PXB PXB X PIX PIX PIX 0-23,48-71 0
GPU5 PXB PXB PXB PXB PIX X PIX PIX 0-23,48-71 0
GPU6 PXB PXB PXB PXB PIX PIX X PIX 0-23,48-71 0
GPU7 PXB PXB PXB PXB PIX PIX PIX X 0-23,48-71 0
Result of “nvidia-smi topo -p2p r”
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X CNS CNS CNS CNS CNS CNS CNS
GPU1 CNS X CNS CNS CNS CNS CNS CNS
GPU2 CNS CNS X CNS CNS CNS CNS CNS
GPU3 CNS CNS CNS X CNS CNS CNS CNS
GPU4 CNS CNS CNS CNS X CNS CNS CNS
GPU5 CNS CNS CNS CNS CNS X CNS CNS
GPU6 CNS CNS CNS CNS CNS CNS X CNS
GPU7 CNS CNS CNS CNS CNS CNS CNS X
CUDA version 11.3, NVIDIA driver version 460.91.03
Server model: ASUS ESC8000 G4
It seems the status is “chipset not supported” (CNS), but since these GPUs are PIX- or PXB-connected and share the same architecture, shouldn't they be able to peer-access each other?
VT-d is disabled, but P2P bandwidth is still very low, and training with 8 GPUs is almost as slow as training with 1 GPU due to the communication overhead.
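For reference, a minimal sketch (the filename and build command are my own assumptions) that queries cudaDeviceCanAccessPeer for every pair of devices should mirror the CNS result above:

// check_p2p.cu - query peer-access capability for every GPU pair.
// Build (assumption): nvcc check_p2p.cu -o check_p2p
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("Found %d CUDA devices\n", n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            // Asks the driver whether device i can map device j's memory for P2P.
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU%d -> GPU%d : peer access %s\n", i, j,
                   canAccess ? "supported" : "NOT supported");
        }
    }
    return 0;
}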
February 5, 2023, 6:31pm
This seems to be a somewhat challenging process. See this thread:
I am using the 4-slot RTX NVLINK bridge along with two RTX 3090 cards. In both Windows and Linux, it seems that it’s not quite working (with CUDA 11.8).
On Ubuntu 20.04, driver 520.61.05, nvidia-smi nvlink seems to indicate that the NVLink connections are present but down. The p2pBandwidthLatencyTest example indicates that peer-to-peer access is working … but the actual P2P bandwidth is so slow (<0.01 GB/s) that the example hangs.
$ nvidia-smi topo -m
GPU0 GPU1 CPU Affinity NUMA Aff…
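If it helps, the link state can also be queried programmatically via NVML; this is just a sketch of the standard nvmlDeviceGetNvLinkState query (the build line is an assumption), not something taken from the thread above:

// nvlink_state.cpp - report per-link NVLink state for each GPU via NVML.
// Build (assumption): g++ nvlink_state.cpp -o nvlink_state -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main() {
    unsigned int count = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t active;
            // Returns NVML_ERROR_NOT_SUPPORTED for links that don't exist on this GPU.
            if (nvmlDeviceGetNvLinkState(dev, link, &active) == NVML_SUCCESS)
                printf("GPU%u link %u: %s\n", i, link,
                       active == NVML_FEATURE_ENABLED ? "up" : "down");
        }
    }
    nvmlShutdown();
    return 0;
}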
Thanks for the pointer. The server is not using NVLink; does the RTX 3090 have to use NVLink to have P2P access?
February 6, 2023, 1:15am
It’s certainly by far the best way if supported.
I was primarily replying to a previous post (since removed) that suggested using it, which is why I offered the thread above.
In light of the apparent NVLink difficulties (and even if you can get it working, it appears to be limited to two cards only), you're stuck with PCIe transfers.
The ASUS ESC8000 G4 only supports PCIe Gen 3 x16 for the GPUs, while the 3090 has a Gen 4 interface, so there's a limitation to start with. I can't offer detailed advice, as I have no direct experience with large multi-GPU setups; perhaps njuffa will respond.
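To put a number on what the PCIe path actually delivers, something along these lines (a rough sketch; device IDs, buffer size, and build command are arbitrary assumptions) times a GPU0 -> GPU1 copy with CUDA events; without peer access the copy is staged through host memory, which would be consistent with the ~3 GB/s reported above:

// p2p_bw.cu - time a GPU0 -> GPU1 device-to-device copy (IDs and size are assumptions).
// Build (assumption): nvcc p2p_bw.cu -o p2p_bw
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ULL << 20;   // 256 MiB test buffer
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up, then a timed copy; with peer access unavailable this is
    // staged through host memory, so it measures the PCIe path in use.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaEventRecord(start);
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return 0;
}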
Depending on the nature of your workload, this thread might be worth checking as well:
I’ve been trying to diagnose some difficult performance-related trouble in a dual-CPU, 10x RTX A4000 system. There seem to be multiple issues that cause lower than expected performance (see my earlier topic:
Multi-GPU contention inside CUDA). After a lot of debugging I’ve identified one of the underlying problems, which is related to memory copies (both H->D, D->H). In general, memory throughput is much lower than expected as the load increases. The workload is a TensorRT model that is driven fr…
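(Not from the linked thread, just a sketch under assumed buffer sizes:) if host-to-device and device-to-host throughput under load is a suspect on your system as well, a basic pinned-memory bandwidth check looks roughly like this:

// hd_bw.cu - pinned-memory H->D and D->H bandwidth on one GPU (size is an assumption).
// Build (assumption): nvcc hd_bw.cu -o hd_bw
#include <cstdio>
#include <cuda_runtime.h>

// Times one blocking cudaMemcpy of the given kind and returns GB/s.
static float timed_copy(void *dst, const void *src, size_t bytes, cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / 1e9f) / (ms / 1e3f);
}

int main() {
    const size_t bytes = 256ULL << 20;   // 256 MiB
    void *host = nullptr, *dev = nullptr;
    cudaMallocHost(&host, bytes);        // pinned host memory
    cudaMalloc(&dev, bytes);

    printf("H->D: %.2f GB/s\n", timed_copy(dev, host, bytes, cudaMemcpyHostToDevice));
    printf("D->H: %.2f GB/s\n", timed_copy(host, dev, bytes, cudaMemcpyDeviceToHost));

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}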