NCCL version mismatch causes multi-GPU training freeze

Hi, I’m using CUDA 11.3 in my Docker container. Multi-GPU training freezes, so I thought it might be solved if I changed torch.cuda.nccl.version…

I would really like to know where NCCL 2.10.3 is located and how I can remove it.

Also, is there any way to find NCCL 2.10.3 in my environment? apt search nccl didn’t show the 2.10.3 version that torch.cuda.nccl.version() reports. I wonder whether torch would fall back to 2.9.9 as the default version if I removed 2.10.3.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
Python 3.8.8 (default, Apr 13 2021, 19:58:26) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
>>> torch.cuda.nccl.version()
(2, 10, 3)
$ apt search nccl
libhttpasyncclient-java/focal 4.1.4-1 all
  HTTP/1.1 compliant asynchronous HTTP agent implementation

libnccl-dev/unknown 2.11.4-1+cuda11.6 amd64 [upgradable from: 2.9.9-1+cuda11.3]
  NVIDIA Collective Communication Library (NCCL) Development Files

libnccl2/unknown 2.11.4-1+cuda11.6 amd64 [upgradable from: 2.9.9-1+cuda11.3]
  NVIDIA Collective Communication Library (NCCL) Runtime

libpuppetlabs-http-client-clojure/focal 0.9.0-1 all
  Clojure wrapper around libhttpasyncclient-java

libvncclient1/focal-updates,focal-security 0.9.12+dfsg-9ubuntu0.3 amd64
  API to write one's own VNC server - client library

python-ncclient-doc/focal 0.6.0-2.1 all
  Documentation for python-ncclient (Python library for NETCONF clients)

python3-ncclient/focal 0.6.0-2.1 all
  Python library for NETCONF clients (Python 3)
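
For anyone trying to reproduce the "where is 2.10.3 located" check, here is a minimal sketch that lists files named libnccl* in a few directories and prints what torch reports. The system directories are assumptions about a typical Ubuntu/CUDA image, and the helper names are made up for illustration. If nothing shows up under torch's own lib directory, my understanding is that the pip/conda PyTorch binaries ship with their own NCCL build compiled into the torch libraries, in which case there is no separate file to remove and the apt libnccl2 package would not change what torch.cuda.nccl.version() returns.

```python
import glob
import importlib.util
import os


def format_nccl_version(version):
    """Turn a tuple such as (2, 10, 3) into '2.10.3'; other values pass through str()."""
    if isinstance(version, tuple):
        return ".".join(str(part) for part in version)
    return str(version)


def find_nccl_libraries(search_dirs):
    """Return sorted paths of files named libnccl* anywhere under the given directories."""
    found = []
    for directory in search_dirs:
        pattern = os.path.join(directory, "**", "libnccl*")
        found.extend(glob.glob(pattern, recursive=True))
    return sorted(set(found))


def torch_lib_dir():
    """Locate <site-packages>/torch/lib without importing torch; None if torch is absent."""
    spec = importlib.util.find_spec("torch")
    if spec is None or spec.origin is None:
        return None
    return os.path.join(os.path.dirname(spec.origin), "lib")


if __name__ == "__main__":
    # Assumed locations; adjust for your own image.
    dirs = ["/usr/lib/x86_64-linux-gnu", "/usr/local/cuda/lib64"]
    lib_dir = torch_lib_dir()
    if lib_dir is not None:
        dirs.append(lib_dir)

    for path in find_nccl_libraries(dirs):
        print("found:", path)

    try:
        import torch
        print("torch reports NCCL", format_nccl_version(torch.cuda.nccl.version()))
    except (ImportError, AttributeError, RuntimeError) as exc:
        print("could not query torch for its NCCL version:", exc)
```

Comparing the printed paths with the version torch reports should show whether the 2.9.9 package from apt and the 2.10.3 that torch uses are actually the same library or two independent copies.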