Encountered NCCL communicator error while using multi-GPU training

Hi, I encountered the following error while training with 2 GPUs:
“RuntimeError: NCCL communicator was aborted on rank 1.”
When I switched to 1-GPU training, the error disappeared. Can anyone help me with this? I have successfully trained other cases with both 2 GPUs and 1 GPU, but ran into this problem after I added a new input parameter to the network.

The error log is:
Traceback (most recent call last):
  File "/home/users/*/*/*/Case.py", line 471, in run
    slv.solve()
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 173, in solve
    self._train_loop(sigterm_handler)
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 664, in _train_loop
    dist.reduce(loss, 0, op=dist.ReduceOp.AVG)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 145, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2074, in reduce
    work = default_pg.reduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 1.
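In case it helps narrow things down, here is a minimal standalone sketch (not from my training case) that exercises the same `dist.reduce(..., op=dist.ReduceOp.AVG)` call shown in the traceback on 2 GPUs, assuming PyTorch >= 1.11 with the NCCL backend and a `torchrun` launch; the file name `repro.py` is just a placeholder. When the NCCL setup itself is healthy, this runs without the communicator abort, which would suggest the problem is specific to my modified network rather than the environment.

```python
# Minimal 2-GPU reproducer for the dist.reduce call from the traceback.
# Launch (assumed): torchrun --nproc_per_node=2 repro.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its rank index; after the AVG reduce,
    # rank 0 should hold the mean across ranks (0.5 for 2 GPUs).
    loss = torch.tensor([float(dist.get_rank())], device=f"cuda:{local_rank}")
    dist.reduce(loss, dst=0, op=dist.ReduceOp.AVG)

    if dist.get_rank() == 0:
        print("reduced loss:", loss.item())

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```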