simpleP2P failing for A5000 on Ubuntu Server 20.04

Description

I am using 4 x NVIDIA RTX A5000 GPUs and am facing the following problems:

  1. simpleP2P fails with verification errors (full output below).
  2. The DistributedDataParallel wrapper hangs when invoked with the NCCL backend.
  3. Training works on each GPU individually, and on all 4 GPUs when I use the gloo backend instead of nccl (a minimal repro sketch follows this list).
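
To make the hang easier to reproduce, here is a minimal single-node DDP sketch. It is only an illustration of the kind of script that stalls; the port, layer sizes, and process-spawning approach are assumptions, not taken from my real training code. With backend="nccl" it gets stuck at DDP construction or at the first backward all-reduce; with backend="gloo" it completes.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Assumed rendezvous settings for a single-node run
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)  # "gloo" here works
    torch.cuda.set_device(rank)
    # DDP construction broadcasts parameters over the backend; this is where the nccl run hangs
    model = DDP(torch.nn.Linear(16, 16).cuda(rank), device_ids=[rank])
    loss = model(torch.randn(8, 16, device=rank)).sum()
    loss.backward()  # first NCCL all-reduce of gradients
    print(f"rank {rank}: backward finished")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

Running this with NCCL_DEBUG=INFO set in the environment shows where the ranks stall.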

Environment

GPU Type: NVIDIA RTX A5000 (x4)
Nvidia Driver Version: 470.57.02
CUDA Version: 11.4
CUDNN Version: 8.2.4
Operating System + Version: Ubuntu 20.04
PyTorch Version (if applicable): 1.9.1+cu111
Baremetal or Container (if container which image + tag): Baremetal

[./simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 4
Checking GPU(s) for support of peer to peer memory access…
Peer access from NVIDIA RTX A5000 (GPU0) → NVIDIA RTX A5000 (GPU1) : Yes
Peer access from NVIDIA RTX A5000 (GPU0) → NVIDIA RTX A5000 (GPU2) : Yes
Peer access from NVIDIA RTX A5000 (GPU0) → NVIDIA RTX A5000 (GPU3) : Yes
Peer access from NVIDIA RTX A5000 (GPU1) → NVIDIA RTX A5000 (GPU0) : Yes
Peer access from NVIDIA RTX A5000 (GPU1) → NVIDIA RTX A5000 (GPU2) : Yes
Peer access from NVIDIA RTX A5000 (GPU1) → NVIDIA RTX A5000 (GPU3) : Yes
Peer access from NVIDIA RTX A5000 (GPU2) → NVIDIA RTX A5000 (GPU0) : Yes
Peer access from NVIDIA RTX A5000 (GPU2) → NVIDIA RTX A5000 (GPU1) : Yes
Peer access from NVIDIA RTX A5000 (GPU2) → NVIDIA RTX A5000 (GPU3) : Yes
Peer access from NVIDIA RTX A5000 (GPU3) → NVIDIA RTX A5000 (GPU0) : Yes
Peer access from NVIDIA RTX A5000 (GPU3) → NVIDIA RTX A5000 (GPU1) : Yes
Peer access from NVIDIA RTX A5000 (GPU3) → NVIDIA RTX A5000 (GPU2) : Yes
Enabling peer access between GPU0 and GPU1…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.19GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access…
Shutting down…
Test failed!
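
For reference, the same GPU-to-GPU data path can also be exercised from PyTorch with a short check. This is only a sketch of such a check (the tensor size and devices are arbitrary), and a cross-device tensor copy may not take exactly the same code path as the CUDA sample; but if direct peer copies are broken, the round-trip comparison would fail even though the peer-access query reports True, which would line up with the nan verification errors above.

import torch

# Sanity check of GPU0 <-> GPU1 copies (assumes at least 2 visible GPUs)
assert torch.cuda.device_count() >= 2
print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

src = torch.arange(64, dtype=torch.float32, device="cuda:0")
dst = src.to("cuda:1")          # device-to-device copy
back = dst.to("cuda:0")         # copy back and compare on the source GPU
print("round-trip intact:", torch.equal(src, back))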

Hi,

This forum focuses on updates and issues related to TensorRT. We recommend you post your concern on the relevant platform to get better help.

Thank you.