CUDA NCCL Error "operation not supported" Multi-GPUs

Terry-W · June 25, 2025, 9:38pm

Hi Forums,

Setup:

GPU: two Quadro M4000 GPUs
CUDA Version: cuda_12.4.r12.4/compiler.34097967_0
NCCL Version: libnccl-dev 2.27.3-1+cuda12.4

The GPUs are each connected independently via PCIe on my motherboard; there is no NVLink between them.

I tried to train a PyTorch model on both GPUs using nn.DataParallel().
However, I ran into the error 'unhandled cuda error (run with NCCL_DEBUG=INFO for details)'.

Running the nccl-tests binary ./build/all_reduce_perf with NCCL_DEBUG=INFO, I got this error:

Authorization required, but no authorization protocol specified
# nThread 1 nGpus 1 minBytes 8 maxBytes 536870912 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 204118 on hom device  0 [0000:15:00] Quadro M4000
home:204118:204118 [0] NCCL INFO Bootstrap: Using eno1:10.39.120.16<0>
home:204118:204118 [0] NCCL INFO cudaDriverVersion 12040
home:204118:204118 [0] NCCL INFO NCCL version 2.27.3+cuda12.4
home:204118:204136 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. 
home:204118:204136 [0] NCCL INFO NET/IB : No device found.
home:204118:204136 [0] NCCL INFO NET/IB : Using [RO]; OOB eno1:10.39.120.16<0>
home:204118:204136 [0] NCCL INFO NET/Socket : Using [0]eno1:10.39.120.16<0>
home:204118:204136 [0] NCCL INFO Initialized NET plugin Socket
home:204118:204136 [0] NCCL INFO Assigned NET plugin Socket to comm
home:204118:204136 [0] NCCL INFO Using network Socket

home:204118:204136 [0] init.cc:426 NCCL WARN Cuda failure 'operation not supported'

I see "NET/IB : No device found." Does that mean NCCL can't find my 2 GPUs? nvidia-smi finds both GPUs without a problem.
Thanks!

rs277 · June 26, 2025, 4:18am

I suspect the "NET/IB : No device found" message refers to an InfiniBand network interface, which you presumably don't have fitted, hence the "INFO" status. It is not about your GPUs.

A wild guess: looking at common.mk in nccl-tests, the minimum hardware version supported is Pascal (sm_60), and this is perhaps what causes the "operation not supported" failure.

You are on Maxwell, sm_52, so try adding an entry for -gencode=arch=compute_52,code=sm_52
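As a rough sketch of how that entry is formed, the helper below builds the -gencode string from a compute capability. The function names are made up for illustration, and the second helper assumes PyTorch with CUDA support is installed (it returns an empty list otherwise). Where exactly the entry goes is also an assumption on my part: I believe the list in common.mk is held in a variable called NVCC_GENCODE, so check the file before editing.

```python
def gencode_flag(major, minor):
    """Build one nvcc -gencode entry for a given compute capability."""
    cc = f"{major}{minor}"
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"


def flags_for_visible_gpus():
    """Return one -gencode entry per distinct GPU architecture PyTorch can see.

    Hypothetical helper: returns [] if PyTorch or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    # A set collapses identical cards (e.g. two M4000s) into one entry.
    caps = {
        torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    }
    return [gencode_flag(major, minor) for major, minor in sorted(caps)]


# Quadro M4000 is Maxwell, compute capability 5.2.
print(gencode_flag(5, 2))  # -gencode=arch=compute_52,code=sm_52
```

On a machine with both M4000s visible, flags_for_visible_gpus() should give that same single entry, since the two cards share one architecture. After adding it, rebuild nccl-tests and rerun all_reduce_perf.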
