How to enable P2P access?

Our server has 8 RTX 3090 GPUs, but they are unable to access each other via peer-to-peer (P2P), which results in very low P2P bandwidth (~3 GB/s).
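For anyone who wants to reproduce this kind of measurement, a minimal sketch of a timed peer copy between two GPUs is below. The buffer size, iteration count, and device pair are arbitrary choices for illustration, not the exact benchmark behind the ~3 GB/s figure.

```cpp
// Minimal sketch: time cudaMemcpyPeerAsync between two GPUs and report GB/s.
// If peer access cannot be enabled, the copy silently falls back to staging
// through host memory, which is where low bandwidth numbers come from.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int src = 0, dst = 1;            // arbitrary device pair
    const size_t bytes = 256ull << 20;     // 256 MiB per copy
    const int iters = 20;

    void *srcBuf = nullptr, *dstBuf = nullptr;
    cudaSetDevice(src);
    cudaMalloc(&srcBuf, bytes);
    cudaSetDevice(dst);
    cudaMalloc(&dstBuf, bytes);

    // Try to enable direct peer access in both directions.
    cudaSetDevice(src);
    cudaError_t e1 = cudaDeviceEnablePeerAccess(dst, 0);
    cudaSetDevice(dst);
    cudaError_t e2 = cudaDeviceEnablePeerAccess(src, 0);
    printf("peer access enabled: %s\n",
           (e1 == cudaSuccess && e2 == cudaSuccess)
               ? "yes" : "no (copies are staged through host memory)");

    cudaSetDevice(src);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(dstBuf, dst, srcBuf, src, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (double)bytes * iters / (ms * 1e-3) / 1e9;
    printf("GPU%d -> GPU%d: %.2f GB/s\n", src, dst, gbps);
    return 0;
}
```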

Some details of the server; please let me know if any other information is needed:

Result of “nvidia-smi topo -m”
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X PIX PIX PIX PXB PXB PXB PXB 0-23,48-71 0
GPU1 PIX X PIX PIX PXB PXB PXB PXB 0-23,48-71 0
GPU2 PIX PIX X PIX PXB PXB PXB PXB 0-23,48-71 0
GPU3 PIX PIX PIX X PXB PXB PXB PXB 0-23,48-71 0
GPU4 PXB PXB PXB PXB X PIX PIX PIX 0-23,48-71 0
GPU5 PXB PXB PXB PXB PIX X PIX PIX 0-23,48-71 0
GPU6 PXB PXB PXB PXB PIX PIX X PIX 0-23,48-71 0
GPU7 PXB PXB PXB PXB PIX PIX PIX X 0-23,48-71 0

Result of “nvidia-smi topo -p2p r”
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X CNS CNS CNS CNS CNS CNS CNS
GPU1 CNS X CNS CNS CNS CNS CNS CNS
GPU2 CNS CNS X CNS CNS CNS CNS CNS
GPU3 CNS CNS CNS X CNS CNS CNS CNS
GPU4 CNS CNS CNS CNS X CNS CNS CNS
GPU5 CNS CNS CNS CNS CNS X CNS CNS
GPU6 CNS CNS CNS CNS CNS CNS X CNS
GPU7 CNS CNS CNS CNS CNS CNS CNS X

CUDA version 11.3, NVIDIA driver version 460.91.03
Server model: ASUS ESC8000 G4

It seems the status is “chipset not supported” (CNS), but since these GPUs are connected via PIX or PXB and have the same architecture, shouldn’t they be able to access each other via P2P?

VT-d is disabled, but the P2P bandwidth is still very low, and training with 8 GPUs is almost as slow as training with 1 GPU due to the communication overhead.
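In case it helps, here is a minimal sketch that asks the CUDA runtime the same question for every device pair (the CUDA-side counterpart of the “-p2p r” matrix above); on a machine where peer access is unavailable, cudaDeviceCanAccessPeer should report 0 for every pair:

```cpp
// Minimal sketch: query whether the CUDA runtime reports peer access
// between every pair of GPUs.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU%d -> GPU%d : %s\n", i, j,
                   canAccess ? "peer access supported" : "NOT supported");
        }
    }
    return 0;
}
```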

It seems to be a somewhat challenging process. See this thread:

Thanks for the pointer. The server is not using NVLink. Does the RTX 3090 have to use NVLink to have P2P access?

It’s certainly by far the best way if supported.

I was primarily replying to a previous post (since removed) that suggested using it, which is why I offered the thread above.

In light of the apparent NVLink difficulties (and even if you can get it working, it appears to be limited to 2 cards only), you’re stuck with PCIe transfers.

The ASUS ESC8000 G4 only supports PCIe Gen 3 x16 for the GPUs, while the 3090 has a Gen 4 interface, so there’s a bandwidth limitation to start with (roughly 16 GB/s per direction for Gen 3 x16 versus roughly 32 GB/s for Gen 4 x16). I can’t offer detailed advice, as I have no direct experience with large multi-GPU setups - perhaps njuffa will respond.
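If you want to double-check the negotiated link on your end, NVML can report the current and maximum PCIe generation and width per GPU. A minimal sketch along those lines (link against -lnvidia-ml):

```cpp
// Minimal sketch: report the negotiated vs. maximum PCIe link for each GPU
// via NVML, to confirm whether the cards are actually running at Gen 3 x16.
#include <stdio.h>
#include <nvml.h>

int main(void) {
    unsigned int count = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        unsigned int curGen = 0, maxGen = 0, curWidth = 0, maxWidth = 0;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &curGen);
        nvmlDeviceGetMaxPcieLinkGeneration(dev, &maxGen);
        nvmlDeviceGetCurrPcieLinkWidth(dev, &curWidth);
        nvmlDeviceGetMaxPcieLinkWidth(dev, &maxWidth);
        printf("GPU%u: PCIe Gen %u x%u (card supports up to Gen %u x%u)\n",
               i, curGen, curWidth, maxGen, maxWidth);
    }
    nvmlShutdown();
    return 0;
}
```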

Depending on the nature of your workload, this thread might be worth checking as well: