NCCL Error: “invalid device function” - Is it due to NCCL version incompatibility with CUDA 11.3?

Hi,

I’m currently working with NVIDIA A100 GPUs and using the CUDA 11.3 container for distributed training. However, I’m encountering an NCCL-related issue during execution. Below is a summary of the error messages:

NCCL operation ncclGroupEnd() failed: unhandled cuda error
NCCL WARN Cuda failure 'invalid device function'
NCCL INFO NET/Plugin : Plugin load returned 0 : dlopen hook: 'libnccl-net.so': cannot open shared object file: No such file or directory.

Setup:

  • GPU: NVIDIA A100
  • CUDA Version: 11.3 (NVIDIA container, which originally shipped NCCL 2.9.9 by default)
  • NCCL Version: 2.14.3 (we removed the default 2.9.9 and built 2.14.3 from source; see the version-check sketch after this list)
  • Driver: reports cudaDriverVersion 12040, i.e. a CUDA 12.4 driver (backward compatible with CUDA 11.3)

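A small check along these lines, run inside the container, can confirm which versions the worker processes actually see at runtime (a minimal sketch using cupy, which Alpa's worker_nccl_util already references; the values in the comments are what I would expect for this setup):

import cupy as cp
from cupy.cuda import nccl, runtime

# Versions as seen by the running process, independent of what was installed
print("NCCL version      :", nccl.get_version())            # expect 21403 for 2.14.3
print("CUDA runtime      :", runtime.runtimeGetVersion())    # expect 11030 for 11.3
print("CUDA driver       :", runtime.driverGetVersion())     # 12040 on these nodes
print("Compute capability:", cp.cuda.Device(0).compute_capability)  # '80' on A100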
I am using Ray for distributed communication across two nodes. The communicators are created successfully, but I hit the error above as soon as a broadcast is issued. Could this be caused by an incompatibility between NCCL 2.14.3 and CUDA 11.3? I also tried NCCL 2.19 built from source and ran into similar issues when initializing communication.
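Since we removed the container's default NCCL, I also want to rule out that the process is still resolving a leftover libnccl rather than the 2.14.3 build. A Linux-only sketch like the following (ncclGetVersion is the standard NCCL C API) shows which library file actually gets mapped and which version it reports; if Alpa's jaxlib links NCCL statically, this only covers the dynamically loaded copy:

import ctypes

# Resolve libnccl.so.2 the same way a runtime dlopen would
lib = ctypes.CDLL("libnccl.so.2")
ver = ctypes.c_int()
lib.ncclGetVersion(ctypes.byref(ver))
print("loaded NCCL version code:", ver.value)   # 21403 corresponds to 2.14.3

# Which file was actually mapped into this process (Linux only)
with open("/proc/self/maps") as f:
    print({line.split()[-1] for line in f if "libnccl" in line})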

Below is the debug output. In this setup, one GPU on one node is communicating with one GPU on the other node, and they are using the plain TCP socket transport (IB is disabled via NCCL_IB_DISABLE=1, so NCCL falls back to NET/Socket):

(MeshHostWorker pid=2098475) a100-ib-9:2098475:2098475 [0] NCCL INFO Bootstrap : Using eth0:192.168.0.5<0>
(MeshHostWorker pid=2098475) a100-ib-9:2098475:2098475 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : dlopen hook: 'libnccl-net.so': cannot open shared object file: No such file or directory.
(MeshHostWorker pid=2098475) a100-ib-9:2098475:2098475 [0] NCCL INFO cudaDriverVersion 12040
(MeshHostWorker pid=2098475) NCCL version 2.14.3+cuda11.3
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94174 [0] NCCL INFO cudaDriverVersion 12040
(MeshHostWorker pid=94174, ip=192.168.0.9) 
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94174 [0] enqueue.cc:100 NCCL WARN Cuda failure 'invalid device function'
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94174 [0] NCCL INFO Bootstrap : Using eth0:192.168.0.9<0>
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94174 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : dlopen hook: 'libnccl-net.so': cannot open shared object file: No such file or directory.
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.0.9<0> [1]br-89a2c375e0e1:172.18.0.1<0>
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Using network Socket
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Channel 00/0 : 0[140] -> 1[a0] [receive] via NET/Socket/0
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Channel 01/0 : 0[140] -> 1[a0] [receive] via NET/Socket/0
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Channel 00/0 : 1[a0] -> 0[140] [send] via NET/Socket/0
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Channel 01/0 : 1[a0] -> 0[140] [send] via NET/Socket/0
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Connected all rings
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO Connected all trees
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
(MeshHostWorker pid=94174, ip=192.168.0.9) i-w7emi71b:94174:94340 [0] NCCL INFO comm 0x4290fa0 rank 1 nranks 2 cudaDev 0 busId a0 - Init COMPLETE
(MeshHostWorker pid=94174, ip=192.168.0.9) NCCL version 2.14.3+cuda11.3
compile_pipeshard_executable::driver executable: 0.51 s
 - Compile (driver): 33.56 s
 - Compile (worker): 8.44 s
Iteration 0 ...
2025-01-20 13:20:36,585 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.run_executable() (pid=94174, ip=192.168.0.9, actor_id=92ec63d22394d7f60c97868905000000, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f42fb19d3d0>)
  File "/workspace/alpa/alpa/device_mesh.py", line 283, in run_executable
    self.executables[uuid].execute_on_worker(*args, **kwargs)
  File "/workspace/alpa/alpa/pipeline_parallel/pipeshard_executable.py", line 561, in execute_on_worker
    self.worker.run_resharding_broadcast_task(instruction.task_uuid,
  File "/workspace/alpa/alpa/device_mesh.py", line 550, in run_resharding_broadcast_task
    worker_nccl_util.broadcast(self, ary_uuid, broadcast_spec.comm_key,
  File "/workspace/alpa/alpa/collective/worker_nccl_util.py", line 35, in broadcast
    return _switch_impl(cupy_impl.broadcast, xla_impl.broadcast, worker, uuid,
  File "/workspace/alpa/alpa/collective/worker_nccl_util.py", line 13, in _switch_impl
    return xla_fn(*args)
  File "/workspace/alpa/alpa/collective/worker_nccl_util_xla.py", line 160, in broadcast
    col.broadcast_partialgpu(buffers, n_elements, comm_key, world_size,
  File "/workspace/alpa/alpa/collective/collective.py", line 458, in broadcast_partialgpu
    g.broadcast_partialgpu(tensor_list, opts)
  File "/workspace/alpa/alpa/collective/collective_group/xla_nccl_collective_group.py", line 252, in broadcast_partialgpu
    self.xla_comm_group.nccl_broadcast_partial_gpus(
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/alpa_nccl_group_base.cc:267: NCCL operation ncclGroupEnd() failed: unhandled cuda error