Multi-GPU does not work with Nvidia A100 PCIe GPUs

OS: Ubuntu 20.04
Barebone: Tyan Transport HX TN83-B8251
CPUs: AMD EPYC 7302 x 2
GPUs: Nvidia A100 PCIe x 4
RAM: 256GB
SSD: Samsung enterprise-grade NVMe
Nvidia driver version: 460.32.03
CUDA version: 11.2.1

Symptoms:

  1. Running p2pBandwidthLatencyTest gives the following results:

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3
     0  1154.84    11.38    11.42    11.54
     1    11.45  1168.66    11.47    11.40
     2    11.54    11.55  1158.27    11.64
     3    11.54    11.61    11.73  1293.46
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D        0        1        2        3
     0  1290.26     2.60     2.32     2.32
     1     2.60  1294.53     2.60     2.60
     2     2.09     2.60  1290.26     2.60
     3     1.99     2.29     2.60  1291.32
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3
     0  1171.73    15.45    15.75    15.82
     1    15.81  1280.21    15.80    15.84
     2    15.83    15.95  1307.53    16.02
     3    15.71    15.98    15.96  1311.37
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3
     0  1306.44     4.64     4.64     4.64
     1     4.64  1308.08     5.20     5.20
     2     5.20     5.20  1307.53     5.20
     3     5.20     5.20     5.20  1308.08

The test reports that peer-to-peer access is available, but the P2P=Enabled numbers are terrible: roughly 2-5 GB/s versus 11-16 GB/s with P2P disabled (a minimal PyTorch cross-check is sketched right after this list).

  2. PyTorch hangs when calling

torch.distributed.init_process_group(backend='nccl', init_method='env://')
In this case the runtime environment is the latest NGC Docker image (pytorch 20.12).
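Regarding symptom 1, here is the rough cross-check I mentioned above. It is my own sketch, not part of the CUDA samples: it just times repeated device-to-device tensor copies between every GPU pair in plain PyTorch, so the numbers are only a loose proxy for what p2pBandwidthLatencyTest measures, and the buffer size and iteration count are arbitrary.

import time

import torch

def copy_bandwidth_gbps(src_idx, dst_idx, size_mb=256, iters=20):
    """Time repeated dst <- src tensor copies and return an approximate GB/s."""
    src = torch.device(f"cuda:{src_idx}")
    dst = torch.device(f"cuda:{dst_idx}")
    x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device=src)
    y = torch.empty_like(x, device=dst)
    y.copy_(x)  # warm-up; lets CUDA create contexts and set up peer access if possible
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    t0 = time.time()
    for _ in range(iters):
        y.copy_(x)
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    elapsed = time.time() - t0
    return (size_mb / 1024.0) * iters / elapsed

if __name__ == "__main__":
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                print(f"GPU {i} -> GPU {j}: {copy_bandwidth_gbps(i, j):7.2f} GB/s")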

At the moment this A100 machine is completely unusable for multi-GPU work in PyTorch. It works perfectly with a single A100. As another note, multi-GPU functionality works fine with the older NGC TensorFlow (1.15) images.
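For symptom 2, the minimal script below is what I am using to poke at the hang; it is a debugging sketch, not a fix. NCCL_DEBUG and NCCL_P2P_DISABLE are documented NCCL environment variables: the first prints NCCL's transport choices, and setting the second to 1 makes NCCL avoid the P2P path entirely (slower, but it tells me whether the broken P2P path is what init or the first collective is hanging on). It assumes a launcher that exports RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK (e.g. torch.distributed.launch with --use_env, or torchrun).

import os

os.environ.setdefault("NCCL_DEBUG", "INFO")     # show NCCL transport selection in the logs
os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # experiment: skip the P2P path entirely

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])  # assumes the launcher exports LOCAL_RANK
    torch.cuda.set_device(local_rank)
    # Smallest possible collective: if this completes, basic NCCL comms work.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce ok, value = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()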

I suggest discussing the issue with your server vendor, Tyan. This is a problem with their system BIOS. We won’t be able to sort it out here. You may also want to check whether a newer system BIOS is available for your motherboard.

We are experiencing the same issue; the hardware is an HP ProLiant DL385 Gen10 Plus v2 with 3 GRID A100 PCIe 40GB GPUs.

I’m not sure what a GRID A100 PCIe 40GB GPU is. If you mean that you are running A100 in a virtualized setting (e.g. with a vCS profile), then I’m not surprised that P2P is not working. See here, P2P is supported over NVLink only. The A100 PCIe in HPE DL38x platforms cannot be configured with the NVLink bridge, so you do not have NVLink.

In any event, my recommendation is the same, contact your system vendor (HPE).

I reported what is listed in lspci; we have 3 GPUs, and virtualization is not enabled on them.

In a 3-GPU configuration on DL38x, two of the GPUs will be attached to one CPU socket and the third GPU will be attached to the other. P2P is not supported between GPUs on one socket and GPUs on the other socket. This may explain some of your observations.
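As a quick sanity check, you can print which GPU pairs actually report peer capability. This is only a sketch and assumes your PyTorch build exposes torch.cuda.can_device_access_peer (which wraps cudaDeviceCanAccessPeer); `nvidia-smi topo -m` shows the same information together with the PCIe / CPU-socket topology.

import torch

n = torch.cuda.device_count()
for i in range(n):
    cells = []
    for j in range(n):
        if i == j:
            cells.append("  - ")
        else:
            cells.append(" yes" if torch.cuda.can_device_access_peer(i, j) else "  no")
    print(f"GPU {i}:" + "".join(cells))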

In any event, my recommendation is the same, contact your system vendor (HPE).