Setup
I have two DGX Sparks (Gigabyte model) connected with this QSFP56 cable. I am running distributed training (DDP) with the Sparks in a Ray cluster. Repo.
Problem
I am encountering an NCCL timeout error when running distributed training across the cluster:
(RayTrainWorker pid=327294, ip=192.168.200.13) [rank0]:[E409 04:26:47.344717386 ProcessGroupNCCL.cpp:689] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1155874, OpType=ALLREDUCE, NumelIn=9445376, NumelOut=9445376, Timeout(ms)=1800000) ran for 1800002 milliseconds before timing out.
(RayTrainWorker pid=327294, ip=192.168.200.13) ount 11a31e sendbuff 0xfd35fb41a000 recvbuff 0xfd35fb41a000 count 7351296 datatype 7 op 0 root 0 comm 0xfd3c95b04b00 [nranks=2] stream 0xfd3c96ba5e10
(RayTrainWorker pid=327294, ip=192.168.200.13) aitopatom-0512:327294:329134 [0] NCCL INFO AllReduce: opCount 11a32f sendbuff 0xfd36285c6000 recvbuff 0xfd36285c6000 count 52516864 datatype 7 op 0 root 0 comm 0xfd3c95b04b00 [nranks=2] stream 0xfd3c96ba5e10 [repeated 36x across cluster]
(RayTrainWorker pid=327294, ip=192.168.200.13) [rank0]:[E409 04:26:47.352963809 ProcessGroupNCCL.cpp:2495] Rank 0
(RayTrainWorker pid=327294, ip=192.168.200.13) - [0] Timeout at collective: ALLREDUCE, #1155888
(RayTrainWorker pid=327294, ip=192.168.200.13) - To our best knowledge, the lagging/dead/mismatched ranks that caused the desync are:
(RayTrainWorker pid=327294, ip=192.168.200.13) [1] finished collective #1155873, but didn't join collective #1155874 (count from 1)
(RayTrainWorker pid=327294, ip=192.168.200.13) - Snapshot of ranks' latest states:
(RayTrainWorker pid=327294, ip=192.168.200.13) #1155873 finished ranks:
(RayTrainWorker pid=327294, ip=192.168.200.13) [1] finished ALLREDUCE
(RayTrainWorker pid=327294, ip=192.168.200.13) #1155888 started ranks:
(RayTrainWorker pid=327294, ip=192.168.200.13) [0] started ALLREDUCE
...
This occurs mid-epoch, after training has been running successfully for 36-72 hours and has already completed a few epochs. It is not tied to a particular rank or physical host; either rank can fail to join a collective operation, regardless of which host is assigned rank 0 or 1.
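While I wait for more data, one mitigation I'm considering is raising the collective timeout so a slow-but-alive rank isn't killed before debug dumps land. A minimal sketch (not my actual training code; the torch call is shown for context only):

```python
# Sketch: raising the NCCL collective timeout while debugging.
# The watchdog value in the log, Timeout(ms)=1800000, is the 30-minute
# default; passing a longer `timeout` to init_process_group gives a lagging
# rank more room before the watchdog fires. Values here are illustrative.
from datetime import timedelta

WATCHDOG_DEFAULT = timedelta(milliseconds=1_800_000)  # the 30 min in the log
DEBUG_TIMEOUT = timedelta(hours=2)                    # illustrative longer value

# In the worker setup (requires torch; not executed here):
# import torch.distributed as dist
# dist.init_process_group(backend="nccl", timeout=DEBUG_TIMEOUT)

assert WATCHDOG_DEFAULT == timedelta(minutes=30)
```

Since Ray Train sets up the process group for me, I believe the equivalent knob is `ray.train.torch.TorchConfig(timeout_s=...)`, but I haven't confirmed that path yet.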
Logs:
NCCL startup/NET/IB logs - nccl_timeout_logs_head.txt (35.6 KB)
Full error logs - nccl_timeout_logs_trunk.txt (48.0 KB)
Notes:
The NCCL playbook says to ignore enP2p<...> interfaces. Does that mean I need to set NCCL_SOCKET_IFNAME directly to the enp1<...> interface, instead of letting NCCL discover and use both interfaces?
I did not follow the playbook to install NCCL; it was already installed. Do I need to reinstall it with the instructions provided?
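On the interface question, this is my understanding of the two ways to steer NCCL's selection (based on my reading of the NCCL env-var docs; the interface/HCA names are from my own hosts):

```shell
# Option 1: pin NCCL explicitly to the ConnectX interface and HCA.
export NCCL_SOCKET_IFNAME="enp1s0f1np1"
export NCCL_IB_HCA="rocep1s0f1"

# Option 2: keep auto-discovery but exclude the enP2p* interfaces.
# A leading ^ turns the list into an exclusion list, matched by prefix.
# export NCCL_SOCKET_IFNAME="^enP2p"
```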
Thanks in advance!
Edit: I have already kicked off another run with "NCCL_SOCKET_IFNAME": "enp1s0f1np1" and "NCCL_IB_HCA": "rocep1s0f1", but it will be at least a few days before I know whether it still crashes.
Edit 2: I am setting these env vars:
"TORCH_FR_BUFFER_SIZE": "1048576",
"TORCH_NCCL_TRACE_BUFFER_SIZE": "1048576",
"TORCH_NCCL_DUMP_ON_TIMEOUT": "1",
"TORCH_NCCL_DESYNC_DEBUG": "1",
"NCCL_DEBUG": "INFO",
"NCCL_DEBUG_SUBSYS": "INIT,NET,COLL",
but still seeing
Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
Not sure why the buffer-size env vars are not enabling FlightRecorder, or whether the stack trace was not found for some other reason.
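One thing I still want to rule out (this is a guess on my part, not something I've confirmed): env vars exported only in the driver shell do not automatically reach Ray worker processes, so FlightRecorder could be disabled in the workers even though the vars look set locally. A simplified sketch of pushing them through Ray's runtime_env (my exact config is in the linked repo):

```python
# Debug env vars bundled so they reach every Ray worker, not just the driver.
debug_env = {
    "TORCH_NCCL_TRACE_BUFFER_SIZE": "1048576",
    "TORCH_NCCL_DUMP_ON_TIMEOUT": "1",
    "TORCH_NCCL_DESYNC_DEBUG": "1",
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "INIT,NET,COLL",
}

# Passed at cluster connect time (requires ray; not executed here):
# import ray
# ray.init(address="auto", runtime_env={"env_vars": debug_env})
```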