Hi,
I’m running distributed training on NVIDIA A100 GPUs using the NVIDIA CUDA 11.3 container, and I’m hitting an NCCL error during execution. Below is a summary of the error messages:
NCCL operation ncclGroupEnd() failed: unhandled cuda error
NCCL WARN Cuda failure 'invalid device function'
NCCL INFO NET/Plugin : Plugin load returned 0 : dlopen hook: 'libnccl-net.so': cannot open shared object file: No such file or directory.
Setup:
- GPU: NVIDIA A100
- CUDA Version: 11.3 (NVIDIA container, which ships NCCL 2.9.9 by default)
- NCCL Version: 2.14.3 (we removed the stock 2.9.9 and built 2.14.3 from source)
- Driver Version: cudaDriverVersion 12040, i.e. a CUDA 12.4-capable driver (backward compatible with the CUDA 11.3 runtime)
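For reference, my understanding is that `Cuda failure 'invalid device function'` from enqueue.cc usually means the loaded libnccl contains no kernels for the GPU's compute capability (sm_80 for A100), e.g. if the source build used a different toolkit or a restricted NVCC_GENCODE; `cuobjdump --list-elf libnccl.so` should list the embedded architectures. As a first sanity check, I can print what NCCL sees from Python (a minimal sketch using CuPy's bindings, which Alpa's cupy backend also wraps):

```python
# Sanity check (sketch): confirm what NCCL must target and what is loaded.
import cupy
from cupy.cuda import nccl

props = cupy.cuda.runtime.getDeviceProperties(0)
print("compute capability:", props["major"], props["minor"])  # A100 -> 8 0 (sm_80)
print("driver version:", cupy.cuda.runtime.driverGetVersion())    # 12040 in my log
print("runtime version:", cupy.cuda.runtime.runtimeGetVersion())  # expect 11030 for CUDA 11.3
print("NCCL version:", nccl.get_version())                        # 21403 == 2.14.3
```

If the compute capability and versions all look right, I plan to rebuild NCCL with NVCC_GENCODE explicitly set to include sm_80 and retest.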
I am using Ray for distributed communication across two nodes. The communicators are created successfully, but the broadcast that follows fails with the errors above. Could this be an incompatibility between NCCL 2.14.3 and CUDA 11.3? I also tried NCCL 2.19 built from source and hit similar errors when initializing communication.
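To isolate this from Alpa and Ray, I also put together a minimal two-rank broadcast test against CuPy's NCCL bindings (a sketch: the RANK variable and the shared-file exchange of the unique id are placeholders I added, not part of the Alpa setup):

```python
# Two-rank NCCL broadcast repro (sketch): run with RANK=0 on one node and
# RANK=1 on the other. The unique id is shared via a file for brevity;
# /shared/nccl_id.pkl is a hypothetical path on shared storage.
import os
import pickle
import time

import cupy
from cupy.cuda import nccl

RANK = int(os.environ["RANK"])
NRANKS = 2
ID_FILE = "/shared/nccl_id.pkl"  # hypothetical shared path

if RANK == 0:
    uid = nccl.get_unique_id()
    with open(ID_FILE, "wb") as f:
        pickle.dump(uid, f)
else:
    while not os.path.exists(ID_FILE):  # naive wait for rank 0
        time.sleep(0.5)
    time.sleep(0.5)  # crude guard against reading a half-written file
    with open(ID_FILE, "rb") as f:
        uid = pickle.load(f)

comm = nccl.NcclCommunicator(NRANKS, uid, RANK)
buf = (cupy.arange(8, dtype=cupy.float32) if RANK == 0
       else cupy.zeros(8, dtype=cupy.float32))
# bcast(buffer_ptr, count, datatype, root, stream)
comm.bcast(buf.data.ptr, buf.size, nccl.NCCL_FLOAT32, 0,
           cupy.cuda.Stream.null.ptr)
cupy.cuda.Stream.null.synchronize()
print("rank", RANK, "->", buf)  # rank 1 should see 0..7 on success
```

If this minimal script fails with the same 'invalid device function', the problem would be in the NCCL/CUDA stack itself rather than in Alpa's resharding path.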
In this setup, one GPU on one node communicates with one GPU on the other node over NCCL's socket transport (NCCL_IB_DISABLE=1 is set, so NET/Socket is used instead of InfiniBand). Below is the debug output:
(MeshHostWorker pid=2098475) a100-ib-9:2098475:2098475 [0] NCCL INFO Bootstrap : Using eth0:192.168.0.5<0>
(MeshHostWorker pid=2098475) a100-ib-9:2098475:2098475 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : dlopen hook: 'libnccl-net.so': cannot open shared object file: No such file or directory.
(MeshHostWorker pid=2098475) a100-ib-9:2098475:2098475 [0] NCCL INFO cudaDriverVersion 12040
(MeshHostWorker pid=2098475) NCCL version 2.14.3+cuda11.3
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94174 [0] NCCL INFO cudaDriverVersion 12040
(MeshHostWorker pid=94174, ip=192.168.0.9)
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94174 [0] enqueue.cc:100 NCCL WARN Cuda failure 'invalid device function'
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94174 [0] NCCL INFO Bootstrap : Using eth0:192.168.0.9<0>
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94174 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : dlopen hook: 'libnccl-net.so': cannot open shared object file: No such file or directory.
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.0.9<0> [1]br-89a2c375e0e1:172.18.0.1<0>
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Using network Socket
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Channel 00/0 : 0[140] -> 1[a0] [receive] via NET/Socket/0
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Channel 01/0 : 0[140] -> 1[a0] [receive] via NET/Socket/0
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Channel 00/0 : 1[a0] -> 0[140] [send] via NET/Socket/0
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Channel 01/0 : 1[a0] -> 0[140] [send] via NET/Socket/0
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Connected all rings
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Connected all trees
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO comm 0x4290fa0 rank 1 nranks 2 cudaDev 0 busId a0 - Init COMPLETE
(MeshHostWorker pid=94174, ip=192.168.0.9) NCCL version 2.14.3+cuda11.3
compile_pipeshard_executable::driver executable: 0.51 s
- Compile (driver): 33.56 s
- Compile (worker): 8.44 s
Iteration 0 ...
2025-01-20 13:20:36,585 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.run_executable() (pid=94174, ip=192.168.0.9, actor_id=92ec63d22394d7f60c97868905000000, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f42fb19d3d0>)
File "/workspace/alpa/alpa/device_mesh.py", line 283, in run_executable
self.executables[uuid].execute_on_worker(*args, **kwargs)
File "/workspace/alpa/alpa/pipeline_parallel/pipeshard_executable.py", line 561, in execute_on_worker
self.worker.run_resharding_broadcast_task(instruction.task_uuid,
File "/workspace/alpa/alpa/device_mesh.py", line 550, in run_resharding_broadcast_task
worker_nccl_util.broadcast(self, ary_uuid, broadcast_spec.comm_key,
File "/workspace/alpa/alpa/collective/worker_nccl_util.py", line 35, in broadcast
return _switch_impl(cupy_impl.broadcast, xla_impl.broadcast, worker, uuid,
File "/workspace/alpa/alpa/collective/worker_nccl_util.py", line 13, in _switch_impl
return xla_fn(*args)
File "/workspace/alpa/alpa/collective/worker_nccl_util_xla.py", line 160, in broadcast
col.broadcast_partialgpu(buffers, n_elements, comm_key, world_size,
File "/workspace/alpa/alpa/collective/collective.py", line 458, in broadcast_partialgpu
g.broadcast_partialgpu(tensor_list, opts)
File "/workspace/alpa/alpa/collective/collective_group/xla_nccl_collective_group.py", line 252, in broadcast_partialgpu
self.xla_comm_group.nccl_broadcast_partial_gpus(
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/alpa_nccl_group_base.cc:267: NCCL operation ncclGroupEnd() failed: unhandled cuda error
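Since we removed the container's stock NCCL 2.9.9 by hand, one more thing I want to rule out is a stale copy shadowing the 2.14.3 build on the loader path. A quick check (sketch) of what the process actually resolves:

```python
# Check which libnccl.so.2 the dynamic loader resolves and its version
# (sketch). ncclGetVersion encodes 2.14.3 as 21403.
import ctypes

lib = ctypes.CDLL("libnccl.so.2")
ver = ctypes.c_int()
lib.ncclGetVersion(ctypes.byref(ver))
print("loaded NCCL version:", ver.value)

# The resolved path shows up in the process map:
with open("/proc/self/maps") as maps:
    paths = {line.split()[-1] for line in maps if "libnccl" in line}
print("loaded from:", paths)
```

If the reported version or path is not the 2.14.3 build, I will fix the LD_LIBRARY_PATH / ldconfig ordering first. Any other pointers would be much appreciated.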