Environment:
Windows 10 (OS Build 20161.1000)
GPU: 2x GeForce GTX 1080 (the test works when I use only one GPU, CUDA_VISIBLE_DEVICES=0)
WSL2
First, I came across the exception in this PyTorch sample code:
import torch
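t = torch.randn(5, 5)
# Copy one tensor to GPUs 0 and 1 through the internal broadcast binding
torch._C._broadcast(t, (0, 1))
tensors = [torch.randn(5).long().cuda(), torch.randn(5).cuda()]
# Broadcast a list of CUDA tensors to GPUs 0 and 1, coalesced into buffers of up to 10485760 bytes (10 MiB)
torch._C._broadcast_coalesced(tensors, (0, 1), 10485760)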
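For comparison, the single-GPU configuration that works (CUDA_VISIBLE_DEVICES=0) could be set up from Python roughly as follows. This is only a minimal sketch: `visible_devices_env` is a hypothetical helper name, not part of PyTorch or NCCL, and the variable has to be in the environment before the process initializes CUDA.

```python
import os


def visible_devices_env(gpu_indices, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    gpu_indices: iterable of GPU indices to expose, e.g. [0] or [0, 1].
    base_env: mapping to copy from; defaults to the current os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env


# Environment for a child process that should see only GPU 0.
single_gpu_env = visible_devices_env([0])

# Assumed usage: pass it to a child process, for example
# subprocess.run(["python", "repro.py"], env=single_gpu_env)
# where repro.py is a hypothetical file holding the sample above.
```

Building a separate environment for a child process avoids changing the variable after the current process has already touched CUDA, where it would no longer take effect.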
Then I downloaded the nccl-tests code from GitHub (NVIDIA/nccl-tests)
and ran this command:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
The log is:
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 53 on pengwa-z8 device 0 [0x15] GeForce GTX 1080
# Rank 1 Pid 53 on pengwa-z8 device 1 [0x2d] GeForce GTX 1080
pengwa-z8:53:53 [0] NCCL INFO Bootstrap : Using [0]eth0:172.29.141.51<0>
pengwa-z8:53:53 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
pengwa-z8:53:53 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
pengwa-z8:53:53 [0] NCCL INFO NET/Socket : Using [0]eth0:172.29.141.51<0>
pengwa-z8:53:53 [0] NCCL INFO Using network Socket
NCCL version 2.7.6+cuda11.0
pengwa-z8:53:61 [1] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:15/../../0000:15:00.0
pengwa-z8:53:61 [1] NCCL INFO graph/xml.cc:469 -> 2
pengwa-z8:53:61 [1] NCCL INFO graph/xml.cc:660 -> 2
pengwa-z8:53:61 [1] NCCL INFO graph/topo.cc:523 -> 2
pengwa-z8:53:60 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:15/../../0000:15:00.0
pengwa-z8:53:60 [0] NCCL INFO graph/xml.cc:469 -> 2
pengwa-z8:53:60 [0] NCCL INFO graph/xml.cc:660 -> 2
pengwa-z8:53:60 [0] NCCL INFO graph/topo.cc:523 -> 2
pengwa-z8:53:61 [1] NCCL INFO init.cc:586 -> 2
pengwa-z8:53:60 [0] NCCL INFO init.cc:586 -> 2
pengwa-z8:53:60 [0] NCCL INFO init.cc:845 -> 2
(torch16) zhanyi@pengwa-z8:~/git/nccl-tests$
pengwa-z8:53:60 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
pengwa-z8:53:61 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
pengwa-z8:53:53 [1] NCCL INFO init.cc:911 -> 2
pengwa-z8: **Test NCCL failure** common.cu:777 'unhandled system error'
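The only warnings in the log before the failure are the two "Could not find real path of ..." lines, so a quick check is whether that sysfs path exists inside WSL2 at all. Below is a minimal sketch using only the Python standard library; `missing_sysfs_paths` and `path_resolves` are hypothetical helper names, and the regular expression assumes the path ends in a PCI address of the form 0000:15:00.0, as it does in this log.

```python
import os
import re

# Matches the path in an NCCL "Could not find real path of ..." warning.
# Assumption: the path ends with a PCI address such as 0000:15:00.0.
WARN_RE = re.compile(
    r"Could not find real path of "
    r"(/\S*?[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def missing_sysfs_paths(log_lines):
    """Return the paths named in the warnings, in order, without duplicates."""
    seen = []
    for line in log_lines:
        match = WARN_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def path_resolves(path):
    """True if the path exists after resolving symlinks and '..' components."""
    return os.path.exists(os.path.realpath(path))


if __name__ == "__main__":
    sample = [
        "pengwa-z8:53:61 [1] graph/xml.cc:332 NCCL WARN Could not find real "
        "path of /sys/class/pci_bus/0000:15/../../0000:15:00.0",
    ]
    for path in missing_sysfs_paths(sample):
        print(path, "resolves" if path_resolves(path) else "does not resolve")
```

If the path does not resolve on the machine that produced the log, that would be consistent with the warning, and it would suggest the PCI topology is not exposed under /sys in this WSL2 environment.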