Our server has 8 RTX 3090 GPUs, which are unable to peer-access each other, resulting in very low P2P bandwidth (~3 GB/s).
Some details of the server follow; please let me know if any other information is needed.
Result of “nvidia-smi topo -m”:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X PIX PIX PIX PXB PXB PXB PXB 0-23,48-71 0
GPU1 PIX X PIX PIX PXB PXB PXB PXB 0-23,48-71 0
GPU2 PIX PIX X PIX PXB PXB PXB PXB 0-23,48-71 0
GPU3 PIX PIX PIX X PXB PXB PXB PXB 0-23,48-71 0
GPU4 PXB PXB PXB PXB X PIX PIX PIX 0-23,48-71 0
GPU5 PXB PXB PXB PXB PIX X PIX PIX 0-23,48-71 0
GPU6 PXB PXB PXB PXB PIX PIX X PIX 0-23,48-71 0
GPU7 PXB PXB PXB PXB PIX PIX PIX X 0-23,48-71 0
Result of “nvidia-smi topo -p2p r”:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X CNS CNS CNS CNS CNS CNS CNS
GPU1 CNS X CNS CNS CNS CNS CNS CNS
GPU2 CNS CNS X CNS CNS CNS CNS CNS
GPU3 CNS CNS CNS X CNS CNS CNS CNS
GPU4 CNS CNS CNS CNS X CNS CNS CNS
GPU5 CNS CNS CNS CNS CNS X CNS CNS
GPU6 CNS CNS CNS CNS CNS CNS X CNS
GPU7 CNS CNS CNS CNS CNS CNS CNS X
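
For completeness, the same per-pair status can be queried from the CUDA runtime; below is a minimal sketch of such a check (error handling omitted). Given the CNS matrix above, it should print “not supported” for every pair.

#include <cstdio>
#include <cuda_runtime.h>

// Query peer-access capability for every ordered GPU pair,
// mirroring what "nvidia-smi topo -p2p r" reports.
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            printf("GPU%d -> GPU%d: %s\n", i, j,
                   ok ? "peer access supported" : "peer access not supported");
        }
    return 0;
}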
CUDA version: 11.3; NVIDIA driver version: 460.91.03
Server model: ASUS ESC8000 G4
The status seems to be “chipset not supported” (CNS), but since these GPUs are connected via PIX or PXB and are all the same architecture, I thought they should be able to peer-access each other. Am I missing something?
VT-d is disabled, but P2P bandwidth is still very low, and training with 8 GPUs is almost as slow as training with 1 GPU due to the communication overhead.
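
For reference, inter-GPU bandwidth can be measured by timing cudaMemcpyPeer between two devices; a minimal sketch for GPU0 -> GPU1 is below (the buffer size and iteration count are arbitrary choices, and error handling is omitted). Note that cudaMemcpyPeer works even without peer access, it is just staged through host memory, which would explain the low number.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;  // 256 MiB test buffer (arbitrary size)
    const int iters = 20;               // number of timed copies (arbitrary)

    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // cudaMemcpyPeer works with or without peer access; without it,
    // the copy is staged through host memory.
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.2f GB/s\n",
           (double)bytes * iters / (ms / 1e3) / 1e9);
    return 0;
}

The p2pBandwidthLatencyTest sample in NVIDIA's cuda-samples repository performs a more thorough version of this measurement.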