Multi-GPU does not work with Nvidia A100 PCIe GPUs

OS: Ubuntu 20.04
Barebone: Tyan Transport HX TN83-B8251
CPUs: AMD EPYC 7302 x 2
GPUs: Nvidia A100 PCIe x 4
RAM: 256GB
SSD: NVMe Samsung Enterprise Level
Nvidia driver version: 460.32.03
CUDA version: 11.2.1


  1. Running the CUDA sample p2pBandwidthLatencyTest gives the following results:

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0        1        2        3
     0  1154.84    11.38    11.42    11.54
     1    11.45  1168.66    11.47    11.40
     2    11.54    11.55  1158.27    11.64
     3    11.54    11.61    11.73  1293.46
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0        1        2        3
     0  1290.26     2.60     2.32     2.32
     1     2.60  1294.53     2.60     2.60
     2     2.09     2.60  1290.26     2.60
     3     1.99     2.29     2.60  1291.32
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0        1        2        3
     0  1171.73    15.45    15.75    15.82
     1    15.81  1280.21    15.80    15.84
     2    15.83    15.95  1307.53    16.02
     3    15.71    15.98    15.96  1311.37
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0        1        2        3
     0  1306.44     4.64     4.64     4.64
     1     4.64  1308.08     5.20     5.20
     2     5.20     5.20  1307.53     5.20
     3     5.20     5.20     5.20  1308.08

The test reports that peer-to-peer access is available, but the P2P=Enabled results are terrible.
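To put a number on how bad it is, the drop can be quantified directly from the matrices above. A minimal sketch (the helper name is my own; the sample data is copied from the unidirectional matrices in this post):

```python
# Sketch: compare off-diagonal (device-to-device) bandwidth between the
# P2P=Disabled and P2P=Enabled unidirectional matrices pasted above.

def off_diagonal_mean(matrix_text):
    """Parse a D\\D bandwidth matrix and average the off-diagonal entries."""
    rows = [line.split() for line in matrix_text.strip().splitlines()]
    total, count = 0.0, 0
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if i != j:
                total += float(value)
                count += 1
    return total / count

disabled = """
1154.84 11.38 11.42 11.54
11.45 1168.66 11.47 11.40
11.54 11.55 1158.27 11.64
11.54 11.61 11.73 1293.46
"""

enabled = """
1290.26 2.60 2.32 2.32
2.60 1294.53 2.60 2.60
2.09 2.60 1290.26 2.60
1.99 2.29 2.60 1291.32
"""

d = off_diagonal_mean(disabled)
e = off_diagonal_mean(enabled)
print(f"P2P disabled: {d:.2f} GB/s, P2P enabled: {e:.2f} GB/s")
# On a healthy system, enabling P2P should never *reduce* device-to-device
# bandwidth; here it drops by roughly a factor of 4-5.
```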

  2. PyTorch hangs when calling

torch.distributed.init_process_group(backend='nccl', init_method='env://')

In this case the runtime environment is the latest NGC Docker image (pytorch 20.12).

Right now this A100 machine is completely unusable for multi-GPU PyTorch workloads. It works perfectly with a single A100. As another data point, multi-GPU functionality works fine with older NGC TensorFlow (1.15) images.
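Since NCCL is the component that exercises the (apparently broken) PCIe peer-to-peer path, one thing worth trying is forcing NCCL to fall back to its shared-memory/socket transports before the process group is initialized. A minimal sketch (this is a diagnostic workaround, not a fix for the underlying platform issue; the init call is shown commented out since it needs a proper multi-process launch):

```python
import os

# Workaround sketch: disable NCCL's peer-to-peer (CUDA IPC) transport so it
# falls back to shared memory / sockets, and enable NCCL logging so the
# transport selection (and any hang) can be diagnosed from the output.
os.environ["NCCL_P2P_DISABLE"] = "1"   # skip the P2P transport entirely
os.environ["NCCL_DEBUG"] = "INFO"      # print NCCL topology/transport choices

# These must be set before the first NCCL call, i.e. before:
# import torch
# torch.distributed.init_process_group(backend="nccl", init_method="env://")
```

If training proceeds (more slowly) with `NCCL_P2P_DISABLE=1`, that points squarely at the P2P path rather than at PyTorch itself.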

I suggest discussing the issue with your server vendor, Tyan. This is a problem with their system BIOS, and we won't be able to sort it out here. You may also want to check whether a newer system BIOS is available for your motherboard.

We are experiencing the same issue - the hardware is an HP ProLiant DL385 Gen10 Plus v2 with 3 GRID A100 PCIe 40GB GPUs.

I’m not sure what a GRID A100 PCIe 40GB GPU is. If you mean that you are running A100 in a virtualized setting (e.g. with a vCS profile), then I’m not surprised that P2P is not working. See here, P2P is supported over NVLink only. The A100 PCIe in HPE DL38x platforms cannot be configured with the NVLink bridge, so you do not have NVLink.

In any event, my recommendation is the same, contact your system vendor (HPE).

I reported what lspci lists - we have 3 GPUs, and virtualization is not enabled.

In a 3 GPU configuration on DL38x, two of the GPUs will be attached to one CPU socket, the 3rd GPU will be attached to the other. P2P is not supported from GPUs on one socket to GPUs on another socket. This may explain some of your observations.
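The topology constraint described above can be sketched as follows (the socket assignment here is an assumption for illustration, matching the 2+1 layout described for the DL38x):

```python
# Illustrative sketch of the 3-GPU DL38x layout described above:
# GPUs 0 and 1 on CPU socket 0, GPU 2 on CPU socket 1 (assumed mapping).
gpu_socket = {0: 0, 1: 0, 2: 1}

def p2p_possible(a, b):
    """P2P is only possible between GPUs attached to the same CPU socket."""
    return gpu_socket[a] == gpu_socket[b]

for a in range(3):
    for b in range(a + 1, 3):
        status = "possible" if p2p_possible(a, b) else "not possible (cross-socket)"
        print(f"GPU{a} <-> GPU{b}: P2P {status}")
```

So in this layout only the GPU0-GPU1 pair can ever use P2P; any traffic involving GPU2 must cross the inter-socket link.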

In any event, my recommendation is the same, contact your system vendor (HPE).

We have a very similar issue.
With the up-to-date NGC image pytorch:21.10-py3, a task on one A100 GPU works.
But it is impossible to use multiple GPUs: 2, 3, … 8.
So the A100 x8 platform is completely unusable for PyTorch.
It simply hangs on the lines dealing with multi-GPU distribution, as already mentioned.
We also tried pytorch:20.02-py3 and 20.03-py3: only in those cases does PyTorch not hang, although the results are terrible and unusable - the loss diverges instead of converging as it does on a DGX-1 V100 with the same task.
Various multi-GPU tasks using TensorFlow 2.4 (so quite old) work perfectly. How is this possible?
On the machine we have: NVIDIA-SMI 450.142.00, Driver Version 450.142.00, CUDA Version 11.0
SRAS4124GSTNROTO94 Supermicro assembled server based on AS-3124GS-TNR 2xRome
Supermicro A+ Server 4124GS-TNR
GPU-NVTA100-40 Supermicro NVIDIA A100 40GB CoWos HBM2 PCIe 4.0 Passive Cooling - 8
GPU-NVTNVLINK-A100 Supermicro/NVIDIA NVLINK Bridge Ampere 2-Way 2 Slot x16 12

Your A100 GPUs are PCIe GPUs (which is different from the configuration in, for example, a DGX-1 V100, where all GPUs are interconnected by NVLink). In the PCIe case, your GPUs have at most pairwise NVLink connections, using the bridge(s). This will certainly have implications for multi-GPU DL training activity, although it should mostly affect performance.

Beyond that, I don’t have enough information to discover what problems you are having, and this particular forum is not really about how to use NGC DL containers. There are forums for NGC as well as forums for Deep Learning Training and Inference.
