Hi Forums,
Setup:
- GPU: two Quadro M4000 GPUs
- CUDA Version: cuda_12.4.r12.4/compiler.34097967_0
- NCCL Version: libnccl-dev 2.27.3-1+cuda12.4
The GPUs are connected independently via PCIe on my motherboard; there is no NVLink between them.
I tried to train a PyTorch model on both GPUs using nn.DataParallel().
However, I ran into the error 'unhandled cuda error (run with NCCL_DEBUG=INFO for details)'.
Running the nccl-tests binary ./build/all_reduce_perf with NCCL_DEBUG=INFO, I got this error:
Authorization required, but no authorization protocol specified
# nThread 1 nGpus 1 minBytes 8 maxBytes 536870912 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 204118 on hom device 0 [0000:15:00] Quadro M4000
home:204118:204118 [0] NCCL INFO Bootstrap: Using eno1:10.39.120.16<0>
home:204118:204118 [0] NCCL INFO cudaDriverVersion 12040
home:204118:204118 [0] NCCL INFO NCCL version 2.27.3+cuda12.4
home:204118:204136 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so.
home:204118:204136 [0] NCCL INFO NET/IB : No device found.
home:204118:204136 [0] NCCL INFO NET/IB : Using [RO]; OOB eno1:10.39.120.16<0>
home:204118:204136 [0] NCCL INFO NET/Socket : Using [0]eno1:10.39.120.16<0>
home:204118:204136 [0] NCCL INFO Initialized NET plugin Socket
home:204118:204136 [0] NCCL INFO Assigned NET plugin Socket to comm
home:204118:204136 [0] NCCL INFO Using network Socket
home:204118:204136 [0] init.cc:426 NCCL WARN Cuda failure 'operation not supported'
I see "NET/IB : No device found." in the output. Does that mean NCCL can’t find my two GPUs? nvidia-smi finds both GPUs without any problem.
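To separate the actual failure from the informational lines in the output above, here is a minimal sketch that pulls out only the warning lines. The helper name `nccl_warnings` is hypothetical, and it assumes that NCCL warnings always carry the literal marker "NCCL WARN", which I am inferring only from the init.cc line in my own log.

```python
import re

# Assumption: every NCCL warning contains the literal text "NCCL WARN",
# as seen in the init.cc line of the log above.
WARN_RE = re.compile(r"NCCL WARN\s+(.*)")


def nccl_warnings(log_text):
    """Return the message part of every NCCL WARN line in log_text."""
    warnings = []
    for line in log_text.splitlines():
        match = WARN_RE.search(line)
        if match:
            warnings.append(match.group(1).strip())
    return warnings


# Three lines copied from the log above.
sample = """\
home:204118:204136 [0] NCCL INFO NET/IB : No device found.
home:204118:204136 [0] NCCL INFO Using network Socket
home:204118:204136 [0] init.cc:426 NCCL WARN Cuda failure 'operation not supported'
"""

for message in nccl_warnings(sample):
    print(message)
```

On my log this leaves only the "Cuda failure 'operation not supported'" line; everything else, including the NET/IB message, is logged at INFO level.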
Thanks!