Distributed Data Parallel Training fails, NCCL WARN Error : ring 0 does not contain rank 1

Description

I am trying to run DDP training across 4 nodes, each with 1 GPU, using the PyTorch Lightning framework with strategy="ddp" and the NCCL backend. Each node has one NVIDIA RTX 3090.
NCCL version 2.14.3+cuda11.7
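
The Trainer in LIGHTININGSCRIPT.py is set up roughly like this (a simplified sketch; MyModel and MyDataModule are placeholders for my actual classes, and the exact Trainer arguments may differ):

# Simplified sketch of the DDP setup; MyModel and MyDataModule are
# placeholders for the actual LightningModule / LightningDataModule.
import pytorch_lightning as pl

from my_project import MyModel, MyDataModule  # placeholder import

model = MyModel()
data = MyDataModule()

trainer = pl.Trainer(
    gpus=1,           # one RTX 3090 per node
    num_nodes=4,      # four single-GPU nodes -> WORLD_SIZE=4
    strategy="ddp",   # DistributedDataParallel, NCCL backend on GPU
)
trainer.fit(model, data)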

Environment

GPU Type: NVIDIA RTX 3090
Nvidia Driver Version: 515.86.01
CUDA Version: 11.7
CUDNN Version:
Operating System + Version: Ubuntu 20.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

On each node, I set the following environment variables:
export MASTER_ADDR=<IP of eno2 from my ifconfig>
export MASTER_PORT=xxxx (a free port)
export WORLD_SIZE=4
export NODE_RANK=<corresponding node rank, 0 to 3>
and then ran:
NCCL_DEBUG=INFO python3 LIGHTININGSCRIPT.py
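
For reference, a quick sanity check along these lines (a sketch, not part of the training script) prints the rendezvous variables and the NCCL version that each node's PyTorch build reports, so the four nodes can be compared:

# check_env.py - hypothetical helper; prints the DDP rendezvous settings
# and the NCCL version bundled with the local PyTorch install on this node.
import os

import torch
import torch.cuda.nccl
import torch.distributed as dist

for var in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "NODE_RANK"):
    print(f"{var}={os.environ.get(var)}")

print("torch:", torch.__version__)
print("NCCL available:", dist.is_nccl_available())
print("NCCL version:", torch.cuda.nccl.version())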

This is the output from the main node:
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

distributed_backend=nccl
All distributed processes registered. Starting with 4 processes

KOR-C-008J2:546882:546882 [0] NCCL INFO Bootstrap : Using eno2:10.165.178.196<0>
KOR-C-008J2:546882:546882 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
KOR-C-008J2:546882:546882 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
KOR-C-008J2:546882:547125 [0] NCCL INFO Failed to open libibverbs.so[.1]
KOR-C-008J2:546882:547125 [0] NCCL INFO NET/Socket : Using [0]eno2:10.xxx.xxx.xxx<0> [1]br-fb29d128b7b0:192.168.49.1<0>
KOR-C-008J2:546882:547125 [0] NCCL INFO Using network Socket
KOR-C-008J2:546882:547125 [0] NCCL INFO Setting affinity for GPU 0 to 0fffffff
KOR-C-008J2:546882:547125 [0] NCCL INFO Channel 00/02 : 0 0 0 0

KOR-C-008J2:546882:547125 [0] graph/rings.cc:51 NCCL WARN Error : ring 0 does not contain rank 1
KOR-C-008J2:546882:547125 [0] NCCL INFO graph/connect.cc:317 -> 3
KOR-C-008J2:546882:547125 [0] NCCL INFO init.cc:759 -> 3
KOR-C-008J2:546882:547125 [0] NCCL INFO init.cc:1089 -> 3
KOR-C-008J2:546882:547125 [0] NCCL INFO group.cc:64 -> 3 [Async thread]
KOR-C-008J2:546882:546882 [0] NCCL INFO group.cc:421 -> 3
KOR-C-008J2:546882:546882 [0] NCCL INFO group.cc:106 -> 3
KOR-C-008J2:546882:546882 [0] NCCL INFO comm 0x55ae17d27580 rank 0 nranks 4 cudaDev 0 busId 21000 - Abort COMPLETE
/home/vjj2kor/miniconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py:512: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
  rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Traceback (most recent call last):
  File "main.py", line 124, in <module>
    trainer.fit(model, data)
  File "/home/vjj2kor/miniconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in fit
    self._call_and_handle_interrupt(
  File "/home/vjj2kor/miniconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/vjj2kor/miniconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/vjj2kor/miniconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1132, in _run
    self._call_setup_hook()  # allow user to setup lightning_module in accelerator environment
  File "/home/vjj2kor/miniconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1428, in _call_setup_hook
    self.training_type_plugin.barrier("pre_setup")
  File "/home/vjj2kor/miniconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 405, in barrier
    torch.distributed.barrier(device_ids=self.determine_ddp_device_ids())
  File "/home/vjj2kor/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3145, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Error : ring 0 does not contain rank 1

On another node, the output was:
KOR-C-008J0:2448267:2448393 [0] bootstrap.cc:75 NCCL WARN Message truncated : received 988 bytes instead of 984
KOR-C-008J0:2448267:2448393 [0] NCCL INFO bootstrap.cc:413 -> 3
KOR-C-008J0:2448267:2448393 [0] NCCL INFO init.cc:672 -> 3
KOR-C-008J0:2448267:2448393 [0] NCCL INFO init.cc:904 -> 3
KOR-C-008J0:2448267:2448393 [0] NCCL INFO group.cc:72 -> 3 [Async thread]

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484810403/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, internal error, NCCL version 2.10.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

Hi @s.sivaramakrishnan,

This forum is for issues related to TensorRT.

I believe you will get better assistance on the relevant forum.

Thanks
