NCCL declaring NVIDIA GPU missing when using PyTorch distributed

Hi Nvidia Community,
I’m trying to configure my Zorin OS 16.1 system (which is based on Ubuntu 20.04) to run Microsoft FLUTE (a framework for simulating federated learning), which relies on torch.distributed to simulate communication between clients; in particular, it requires a fully configured NCCL backend and pushes computation to CUDA.
For extra context, I’m using an NVIDIA GeForce RTX 3060 and I have installed the CUDA toolkit at version 11.7 (I used the network installer for this, because the default command installs 12.x with dependency errors that have been reported on the forums before).
In this post I’d like to ask for help [moderators, I hope I’m posting in the right place] going from driver installation, to verifying my CUDA installation, to finally debugging the error reported by NCCL.
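For reference, the part that has to succeed boils down to something like the following (a minimal sketch of what torch.distributed.run plus the NCCL backend needs per worker, not FLUTE’s actual code):

import os
import torch
import torch.distributed as dist

# LOCAL_RANK is set by torch.distributed.run for each spawned worker
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)          # fails immediately if no CUDA device is visible
dist.init_process_group(backend="nccl")    # NCCL only works with CUDA devices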

Just to sum up the situation, here are the peculiar outputs that led me to believe I have a driver installation issue.
Using the command provided by MS FLUTE:

python -m torch.distributed.run --nproc_per_node=3 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru -backend NCCL

which returns output containing:

[W CUDAFunctions.cpp:109] Warning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (function operator())

and

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6438) of binary: /home/crns/anaconda3/envs/FLUTE/bin/python

which traces the PyTorch distributed call to:

File "/home/crns/anaconda3/envs/FLUTE/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
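For what it’s worth, to separate a possible driver problem from the NCCL error itself, a quick sanity check inside the FLUTE conda environment (nothing FLUTE-specific, just stock PyTorch calls) would be:

import torch
print(torch.__version__, torch.version.cuda)      # PyTorch build and the CUDA version it was compiled against
print(torch.cuda.is_available())                  # False points at a driver/setup problem rather than at NCCL
print(torch.cuda.device_count())                  # ProcessGroupNCCL needs at least one visible device
print(torch.distributed.is_nccl_available())      # whether this PyTorch build ships NCCL support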

Digging deeper, the command nvidia-smi returns:

Failed to initialize NVML: Driver/library version mismatch

and lsmod | grep nvidia returns:

nvidia_drm 61440 0
nvidia_modeset 1142784 1 nvidia_drm
nvidia 4497408 1 nvidia_modeset
drm_kms_helper 258048 1 nvidia_drm
drm 557056 3 drm_kms_helper,nvidia,nvidia_drm

There were also some potentially conflicting steps: I first ran sudo sh NVIDIA-Linux-x86_64-525.85.05.run --dkms, but the CUDA 11.7 installer mentions it needs a 470 driver, so I had to remove and purge that installation and run sudo apt install cuda-drivers-470 nvidia-container-runtime -y instead, forcing Secure Boot off, which had me enroll a MOK password on reboot, to reach the errors described above.

Just to double-check: my GPU, as stated above, is an RTX 3060. I also tried installing the driver from NVIDIA’s site, which defaults to 525, and somehow still ended up with non-working NCCL / driver detection. Here are some commands that show my device info. lspci | grep VGA returns:

01:00.0 VGA compatible controller: NVIDIA Corporation Device 2504 (rev a1)

and sudo lshw -C video returns:
*-display
description: VGA compatible controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:01:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: iomemory:400-3ff iomemory:400-3ff irq:16 memory:a0000000-a0ffffff memory:4000000000-400fffffff memory:4010000000-4011ffffff ioport:5000(size=128) memory:c0000-dffff

I tried to trim down the errors and boil things down rather than posting a wall of text from my terminal, but if more output is needed I’m happy to help you help me. Thanks in advance, NVIDIA community.


The best place to get help for pytorch issues is the pytorch forums.

Pytorch distributed (and NCCL) would typically be used on a machine that has multiple GPUs. I haven’t sorted out your case carefully, but my guess would be that you are trying to use multiple workers in pytorch distributed, those workers are expecting multiple GPUs, and each worker sets the CUDA_VISIBLE_DEVICES variable so that it uses a separate GPU. One or more of those workers is setting the variable to indicate some GPU other than GPU 0, which doesn’t exist in your machine (your launch uses --nproc_per_node=3, but you have a single RTX 3060). This doesn’t look like a defect of any sort, but rather incorrect usage of pytorch.
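One quick way to check that (a sketch, assuming you can drop a few lines near the top of e2e_trainer.py, or of any small script launched the same way through torch.distributed.run) is to have each worker print its rank and what it can see:

import os
import torch

local_rank = os.environ.get("LOCAL_RANK", "<unset>")          # set per worker by torch.distributed.run
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")   # what this worker is allowed to see
print(f"local_rank={local_rank} CUDA_VISIBLE_DEVICES={visible} device_count={torch.cuda.device_count()}")

With --nproc_per_node=3 on a single-GPU machine, the workers beyond local rank 0 have no GPU of their own to use.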

I suggest asking for help on the pytorch forums. NVIDIA has experts on pytorch there.

If you think you have a broken driver installation, my suggestion would be to verify your CUDA install using the method provided in the CUDA linux install guide. If you have trouble with that, the forum to use is the CUDA setup and installation forum. Otherwise I suggest you post in the pytorch forums. I’m unlikely to be able to help further here.