PyTorch Lightning crash on 2 DGX Sparks connected by QSFP

Setup:

I have two DGX Sparks connected by a full-speed QSFP cable. I am trying to run a test based on the ControlNet project with updated libraries.

Tests:

Completed the NCCL for Two Sparks playbook, so the connection is working at full bandwidth.

Tried the PyTorch Fine-Tuning playbook, but it crashed for what seem to be unrelated reasons. The ControlNet tutorial does not run in a container and has fewer layers of abstraction (just PyTorch Lightning), so it seemed an easier place to begin debugging.

Ran the training process successfully on one node.

On multi-node launch, node 0 waits for node 1 to join, then the training process appears to start on both nodes.

Tried reducing the precision and increasing the batch size, but the same crash occurred.

The Problem:

The process crashes at what appears to be the end of the epoch with the following final (verbose) logging:

Node 0 (hostname = zerozero):

zerozero:20258:20535 [0] NCCL INFO AllReduce: 4104 Bytes -> Algo RING proto LL channel{Lo..Hi}={0..0}
[rank0]:[E227 23:41:58.367404475 ProcessGroupNCCL.cpp:688] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2407, OpType=ALLREDUCE, NumelIn=14749440, NumelOut=14749440, Timeout(ms)=1800000) ran for 1800067 milliseconds before timing out.
[rank0]:[E227 23:41:58.379476073 ProcessGroupNCCL.cpp:2277] [PG ID 0 PG GUID 0(default_pg) Rank 0]  failure detected by watchdog at work sequence id: 2407 PG status: last enqueued work: 2498, last completed work: 2406
[rank0]:[E227 23:41:58.380165062 ProcessGroupNCCL.cpp:735] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E227 23:41:58.380194023 ProcessGroupNCCL.cpp:2610] [PG ID 0 PG GUID 0(default_pg) Rank 0] First PG on this rank to signal dumping.
[rank0]:[E227 23:41:59.353027560 ProcessGroupNCCL.cpp:1890] [PG ID 0 PG GUID 0(default_pg) Rank 0] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 2498, last completed NCCL work: 2406.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc. 
[rank0]:[E227 23:41:59.353231985 ProcessGroupNCCL.cpp:1606] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1, only active collectives: 0
[rank0]:[E227 23:42:58.380395627 ProcessGroupNCCL.cpp:749] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E227 23:42:58.380420092 ProcessGroupNCCL.cpp:763] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E227 23:42:58.390475801 ProcessGroupNCCL.cpp:2093] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2407, OpType=ALLREDUCE, NumelIn=14749440, NumelOut=14749440, Timeout(ms)=1800000) ran for 1800067 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:691 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xc8 (0xe00945975978 in /home/ghare/miniconda3/envs/control/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x208 (0xe0094686d9e8 in /home/ghare/miniconda3/envs/control/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xe7c (0xe0094687521c in /home/ghare/miniconda3/envs/control/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xe8 (0xe00946876628 in /home/ghare/miniconda3/envs/control/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xda294 (0xe009faa2a294 in /home/ghare/miniconda3/envs/control/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x8595c (0xe009fc88595c in /lib/aarch64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0xebb4c (0xe009fc8ebb4c in /lib/aarch64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2407, OpType=ALLREDUCE, NumelIn=14749440, NumelOut=14749440, Timeout(ms)=1800000) ran for 1800067 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:691 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xc8 (0xe00945975978 in /home/ghare/miniconda3/envs/control/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x208 (0xe0094686d9e8 in /home/ghare/miniconda3/envs/control/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xe7c (0xe0094687521c in /home/ghare/miniconda3/envs/control/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xe8 (0xe00946876628 in /home/ghare/miniconda3/envs/control/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xda294 (0xe009faa2a294 in /home/ghare/miniconda3/envs/control/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x8595c (0xe009fc88595c in /lib/aarch64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0xebb4c (0xe009fc8ebb4c in /lib/aarch64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xc8 (0xe00945975978 in /home/ghare/miniconda3/envs/control/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::Watchdog::run() + 0x550 (0xe00946876a90 in /home/ghare/miniconda3/envs/control/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xda294 (0xe009faa2a294 in /home/ghare/miniconda3/envs/control/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x8595c (0xe009fc88595c in /lib/aarch64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0xebb4c (0xe009fc8ebb4c in /lib/aarch64-linux-gnu/libc.so.6)

Aborted (core dumped)

Node 1 (hostname = spark-b185):

spark-b185:13212:13344 [0] NCCL INFO AllReduce: opCount 966 sendbuff 0xf5d0e76c8c00 recvbuff 0xf5d0e76c8c00 count 14749440 datatype 7 op 0 root 0 comm 0xf881770 [nranks=2] stream 0x10dd7dd0
[rank1]:[E227 23:41:58.035946582 ProcessGroupNCCL.cpp:1825] [PG ID 0 PG GUID 0(default_pg) Rank 1] Observed flight recorder dump signal from another rank via TCPStore.
[rank1]:[E227 23:41:58.037313229 ProcessGroupNCCL.cpp:1890] [PG ID 0 PG GUID 0(default_pg) Rank 1] Received a dump signal due to a collective timeout from  rank 0 and we will try our best to dump the debug info. Last enqueued NCCL work: 2406, last completed NCCL work: 2406.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc. 
[rank1]:[E227 23:41:58.048410765 ProcessGroupNCCL.cpp:1606] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1, only active collectives: 0
spark-b185:13212:13311 [0] NCCL INFO RAS current socket connection with 169.254.61.128<57777> closed by peer on receive; terminating it
spark-b185:13212:13311 [0] NCCL INFO RAS trying to reconnect with 169.254.61.128<57777> (experiencingDelays 0, startRetryTime 1.00s)
spark-b185:13212:13311 [0] NCCL INFO RAS connect timeout warning (5s) on socket connection with 169.254.61.128<57777>
spark-b185:13212:13311 [0] NCCL INFO RAS link 1: no more fallbacks to add (total 1)
spark-b185:13212:13311 [0] NCCL INFO RAS link -1: no more fallbacks to add (total 1)
spark-b185:13212:13311 [0] NCCL INFO RAS init timeout error (20s) on socket connection with 169.254.61.128<57777> (experiencingDelays 1, startRetryTime 21.00s, socket status 1)
spark-b185:13212:13311 [0] NCCL INFO RAS trying to reconnect with 169.254.61.128<57777> (experiencingDelays 1, startRetryTime 21.00s)
spark-b185:13212:13311 [0] NCCL INFO RAS init timeout error (20s) on socket connection with 169.254.61.128<57777> (experiencingDelays 1, startRetryTime 41.00s, socket status 1)
spark-b185:13212:13311 [0] NCCL INFO RAS trying to reconnect with 169.254.61.128<57777> (experiencingDelays 1, startRetryTime 41.00s)
spark-b185:13212:13311 [0] NCCL INFO RAS connect retry timeout (60s) on socket connection with 169.254.61.128<57777>
spark-b185:13212:13311 [0] NCCL INFO RAS handling deadPeer (addr 169.254.61.128<57777>)
spark-b185:13212:13311 [0] NCCL INFO RAS link 1: no more fallbacks to add (total 1)
spark-b185:13212:13311 [0] NCCL INFO RAS link -1: no more fallbacks to add (total 1)
spark-b185:13212:13311 [0] NCCL INFO RAS link 1: dropping primary connection with 169.254.61.128<57777>
spark-b185:13212:13311 [0] NCCL INFO RAS link -1: dropping primary connection with 169.254.61.128<57777>
spark-b185:13212:13311 [0] NCCL INFO RAS terminating a connection with 169.254.61.128<57777>
spark-b185:13212:13311 [0] NCCL INFO Mem Realloc old size 0, new size 112 pointer 0x10dd63b0
spark-b185:13212:13311 [0] NCCL INFO RAS declaring peer 169.254.61.128<57777> as DEAD; rasDeadPeersHash 0xd62332af59f8abc2
[rank1]:[F227 23:49:58.073326024 ProcessGroupNCCL.cpp:1631] [PG ID 0 PG GUID 0(default_pg) Rank 1] [PG ID 0 PG GUID 0(default_pg) Rank 1] Terminating the process after attempting to dump debug info, due to collective timeout or exception.
Aborted (core dumped)

From the logs, it appears to me that Node 0's watchdog waits the full 30 minutes (Timeout(ms)=1800000) for AllReduce SeqNum 2407, which Node 1 never issues: rank 0 has enqueued work up to 2498, while rank 1's last enqueued and last completed work are both 2406. Once rank 0's process dies, Node 1's RAS thread gives up reconnecting after its 60 s retry window and terminates too. So each node effectively concludes that the other has become unresponsive, but rank 1 seems to be the one that stops issuing collectives at the epoch boundary.
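One diagnostic step the log itself suggests is enabling the NCCL flight recorder, so the next timeout dumps a stack trace of the failed collective. A sketch of the environment I plan to set on both nodes before the next launch (the variable names come from the PyTorch error output and its distributed-debugging docs; the buffer size value is an arbitrary choice):

```shell
# Enable the flight recorder so a timeout dumps the failed collective's
# stack trace (the rank 0 log says it is currently disabled).
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000

# Verbose NCCL logging, restricted to collective calls, to see the last
# collective each rank actually issued before the hang.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL

# Extra c10d-level consistency checks for mismatched collectives across ranks.
export TORCH_DISTRIBUTED_DEBUG=DETAIL
```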

If anyone has suggestions for fixes or further tests, I'll be happy to try them!
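One hypothesis consistent with rank 1 going quiet right at the epoch boundary is that the two ranks see different numbers of batches, so one rank finishes its epoch early and stops issuing collectives while the other keeps waiting. A minimal synthetic check I intend to run (plain torch, no Lightning; the dataset size here is made up):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Synthetic dataset with a length that does not divide evenly by
# world size * batch size, mimicking a real uneven dataset.
dataset = TensorDataset(torch.arange(101).float())

for rank in range(2):
    # num_replicas/rank given explicitly, so no process group is needed.
    # drop_last=True trims the dataset so every rank gets the same count.
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank,
                                 drop_last=True)
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)
    print(f"rank {rank}: {len(loader)} batches")
# Both ranks should report the same batch count; if they differ, one rank
# would be left waiting in a collective at epoch end.
```

If the real training dataloader shows different batch counts per rank, that would explain an AllReduce that one rank never posts.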

You said it crashes at the end of the epoch, which I'm assuming involves a heavier closing computation. Have you monitored temps with nvtop? My head node will crash during an agent compaction (a huge prefill calculation to generate a conversation summary). I'm currently putting a fan in front of them to avoid the crash. I should note that this only started after recent firmware updates.