Multi-GPU does not work with Nvidia A100 PCIe GPUs

OS: Ubuntu 20.04
Barebone: Tyan Transport HX TN83-B8251
CPUs: AMD EPYC 7302 x 2
GPUs: Nvidia A100 PCIe x 4
RAM: 256GB
SSD: NVMe Samsung Enterprise Level
Nvidia driver version: 460.32.03
CUDA version: 11.2.1


  1. Running the CUDA sample p2pBandwidthLatencyTest gives the following results:

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0        1        2        3
     0  1154.84    11.38    11.42    11.54
     1    11.45  1168.66    11.47    11.40
     2    11.54    11.55  1158.27    11.64
     3    11.54    11.61    11.73  1293.46
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0        1        2        3
     0  1290.26     2.60     2.32     2.32
     1     2.60  1294.53     2.60     2.60
     2     2.09     2.60  1290.26     2.60
     3     1.99     2.29     2.60  1291.32
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0        1        2        3
     0  1171.73    15.45    15.75    15.82
     1    15.81  1280.21    15.80    15.84
     2    15.83    15.95  1307.53    16.02
     3    15.71    15.98    15.96  1311.37
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0        1        2        3
     0  1306.44     4.64     4.64     4.64
     1     4.64  1308.08     5.20     5.20
     2     5.20     5.20  1307.53     5.20
     3     5.20     5.20     5.20  1308.08

The test reports that peer-to-peer access is available, but the P2P=Enabled results are terrible.
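To put a number on how bad it is, the drop can be quantified directly from the matrices above. A minimal sketch (the helper name is my own; the sample data is copied from the unidirectional matrices in this post):

```python
# Sketch: compare off-diagonal (device-to-device) bandwidth between the
# P2P=Disabled and P2P=Enabled unidirectional matrices pasted above.

def off_diagonal_mean(matrix_text):
    """Parse a D\\D bandwidth matrix and average the off-diagonal entries."""
    rows = [line.split() for line in matrix_text.strip().splitlines()]
    total, count = 0.0, 0
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if i != j:
                total += float(value)
                count += 1
    return total / count

disabled = """
1154.84 11.38 11.42 11.54
11.45 1168.66 11.47 11.40
11.54 11.55 1158.27 11.64
11.54 11.61 11.73 1293.46
"""

enabled = """
1290.26 2.60 2.32 2.32
2.60 1294.53 2.60 2.60
2.09 2.60 1290.26 2.60
1.99 2.29 2.60 1291.32
"""

d = off_diagonal_mean(disabled)
e = off_diagonal_mean(enabled)
print(f"P2P disabled: {d:.2f} GB/s, P2P enabled: {e:.2f} GB/s")
# On a healthy system, enabling P2P should never *reduce* device-to-device
# bandwidth; here it drops by roughly a factor of 4-5.
```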

  2. PyTorch hangs when calling

torch.distributed.init_process_group(backend='nccl', init_method='env://')

In this case the runtime environment is the latest NGC Docker image (pytorch 20.12).

Right now this A100 machine is completely unusable for multi-GPU PyTorch workloads. It works perfectly with a single A100. As another data point, multi-GPU functionality works fine with older NGC TensorFlow (1.15) images.
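Since NCCL is the component that exercises the (apparently broken) PCIe peer-to-peer path, one thing worth trying is forcing NCCL to fall back to its shared-memory/socket transports before the process group is initialized. A minimal sketch (this is a diagnostic workaround, not a fix for the underlying platform issue; the init call is shown commented out since it needs a proper multi-process launch):

```python
import os

# Workaround sketch: disable NCCL's peer-to-peer (CUDA IPC) transport so it
# falls back to shared memory / sockets, and enable NCCL logging so the
# transport selection (and any hang) can be diagnosed from the output.
os.environ["NCCL_P2P_DISABLE"] = "1"   # skip the P2P transport entirely
os.environ["NCCL_DEBUG"] = "INFO"      # print NCCL topology/transport choices

# These must be set before the first NCCL call, i.e. before:
# import torch
# torch.distributed.init_process_group(backend="nccl", init_method="env://")
```

If training proceeds (more slowly) with `NCCL_P2P_DISABLE=1`, that points squarely at the P2P path rather than at PyTorch itself.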

I suggest discussing the issue with your server vendor, Tyan. This is a problem with their system BIOS, and we won't be able to sort it out here. You may also want to check whether a newer system BIOS is available for your motherboard.

We are experiencing the same issue - the hardware is an HP ProLiant DL385 Gen10 Plus v2 with 3 GRID A100 PCIe 40GB GPUs.

I’m not sure what a GRID A100 PCIe 40GB GPU is. If you mean that you are running A100 in a virtualized setting (e.g. with a vCS profile), then I’m not surprised that P2P is not working. See here, P2P is supported over NVLink only. The A100 PCIe in HPE DL38x platforms cannot be configured with the NVLink bridge, so you do not have NVLink.

In any event, my recommendation is the same, contact your system vendor (HPE).

I reported what lspci lists - we have 3 GPUs, and virtualization is not enabled.

In a 3 GPU configuration on DL38x, two of the GPUs will be attached to one CPU socket, the 3rd GPU will be attached to the other. P2P is not supported from GPUs on one socket to GPUs on another socket. This may explain some of your observations.
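The topology constraint described above can be sketched as follows (the socket assignment here is an assumption for illustration, matching the 2+1 layout described for the DL38x):

```python
# Illustrative sketch of the 3-GPU DL38x layout described above:
# GPUs 0 and 1 on CPU socket 0, GPU 2 on CPU socket 1 (assumed mapping).
gpu_socket = {0: 0, 1: 0, 2: 1}

def p2p_possible(a, b):
    """P2P is only possible between GPUs attached to the same CPU socket."""
    return gpu_socket[a] == gpu_socket[b]

for a in range(3):
    for b in range(a + 1, 3):
        status = "possible" if p2p_possible(a, b) else "not possible (cross-socket)"
        print(f"GPU{a} <-> GPU{b}: P2P {status}")
```

So in this layout only the GPU0-GPU1 pair can ever use P2P; any traffic involving GPU2 must cross the inter-socket link.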

In any event, my recommendation is the same, contact your system vendor (HPE).

We have a very similar issue.
With the up-to-date NGC image pytorch:21.10-py3, a task on one A100 GPU works.
But it is impossible to use multiple GPUs: 2, 3, … 8.
So the A100 x8 platform is completely unusable for PyTorch.
It simply hangs on the lines dealing with multi-GPU distribution, as already mentioned.
We also tried pytorch:20.02-py3 and 20.03-py3: only in those cases does PyTorch not hang, although the results are terrible and unusable - the loss diverges instead of converging as it does on a DGX-1 V100 with the same task.
Various multi-GPU tasks using TensorFlow 2.4 (so quite old) work perfectly. How is this possible?
On the machine we have: NVIDIA-SMI 450.142.00, Driver Version 450.142.00, CUDA Version 11.0
SRAS4124GSTNROTO94 Supermicro assembled server based on AS-3124GS-TNR 2xRome
Supermicro A+ Server 4124GS-TNR
GPU-NVTA100-40 Supermicro NVIDIA A100 40GB CoWos HBM2 PCIe 4.0 Passive Cooling - 8
GPU-NVTNVLINK-A100 Supermicro/NVIDIA NVLINK Bridge Ampere 2-Way 2 Slot x16 12

Your A100 GPUs are PCIe GPUs (which is different from the configuration in, for example, a DGX-1 V100, where all GPUs are interconnected by NVLink). In the PCIe case, your GPUs have at most pairwise NVLink connections, using the bridge(s). This will certainly have implications for multi-GPU DL training activity, although it should mostly affect performance.

Beyond that, I don’t have enough information to discover what problems you are having, and this particular forum is not really about how to use NGC DL containers. There are forums for NGC as well as forums for Deep Learning Training and Inference.
