Hi Nvidia Community,
I’m trying to configure my Zorin OS 16.1 system (based on Ubuntu 20.04) to run Microsoft FLUTE, a framework for simulating federated learning. FLUTE relies on torch.distributed to simulate communication between clients, so it needs a fully working NCCL backend and a working CUDA setup.
For extra context, I’m using an NVIDIA GeForce RTX 3060 and I have installed CUDA toolkit 11.7 for this (via the network installer, because the default command pulls in CUDA 12 with dependency errors that have been reported on these forums before).
In this post I’d like to ask for help [moderators, I hope I’m posting in the right place] going from driver installation and verifying my CUDA installation to finally debugging the error reported by NCCL.
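For reference, these are the checks I run inside the FLUTE conda environment to see whether PyTorch can reach CUDA and NCCL at all (the environment name FLUTE simply matches the paths in the traceback below):
conda activate FLUTE
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import torch; print(torch.cuda.is_available(), torch.distributed.is_nccl_available())"
nvcc --version
The first line reports the PyTorch build and the CUDA version it was compiled against, the second checks whether the runtime can actually see a GPU and whether the NCCL backend is compiled in, and nvcc --version confirms which toolkit is on my PATH.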
To sum up the situation, here are the outputs that led me to believe I have a driver installation issue.
Using the command provided by MS FLUTE:
python -m torch.distributed.run --nproc_per_node=3 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru **-backend NCCL**
which returns output containing:
[W CUDAFunctions.cpp:109] Warning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (function operator())
and
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6438) of binary: /home/crns/anaconda3/envs/FLUTE/bin/python
which traces the PyTorch distributed call to:
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/crns/anaconda3/envs/FLUTE/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
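If a minimal reproduction outside FLUTE is useful, this is the bare single-process NCCL init I can run to separate FLUTE itself from torch.distributed (the address and port values are just placeholders I chose):
MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 python -c "import torch.distributed as dist; dist.init_process_group(backend='nccl', rank=0, world_size=1); print('nccl init ok'); dist.destroy_process_group()"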
Digging deeper, the command nvidia-smi returns:
Failed to initialize NVML: Driver/library version mismatch
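To figure out where the mismatch comes from, these are the versions I have been comparing, the loaded kernel module on one side and the installed userspace libraries on the other:
cat /proc/driver/nvidia/version
modinfo nvidia | grep ^version
dpkg -l | grep -iE 'nvidia-driver|libnvidia|cuda-drivers'
As I understand it, this NVML error usually means the nvidia kernel module currently loaded in memory belongs to a different driver version than the libraries nvidia-smi links against, typically after an upgrade without a reboot.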
while lsmod | grep nvidia returns:
nvidia_drm 61440 0
nvidia_modeset 1142784 1 nvidia_drm
nvidia 4497408 1 nvidia_modeset
drm_kms_helper 258048 1 nvidia_drm
drm 557056 3 drm_kms_helper,nvidia,nvidia_drm
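From what I have read, this mismatch often clears up after a full reboot, or after unloading and reloading the modules so the driver currently on disk gets loaded; this is what I plan to try from a text console with nothing using the GPU:
sudo modprobe -r nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
nvidia-smi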
The mismatch may come from potentially conflicting install steps: I first ran sudo sh NVIDIA-Linux-x86_64-525.85.05.run --dkms, then, since CUDA 11.7 says it needs a 470-series driver, I removed and purged that installation and ran sudo apt install cuda-drivers-470 nvidia-container-runtime -y. I also forced Secure Boot off, which had me enroll a MOK password on reboot, before reaching the errors described above.
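Since the .run installer and the apt packages can leave overlapping files behind, these are the checks I am using to see what each install left on the system (the runfile also ships its own uninstaller, sudo nvidia-uninstall, in case that is relevant):
dkms status
dpkg -l | grep -iE 'nvidia-driver|cuda-drivers'
ls -l /usr/lib/x86_64-linux-gnu/libcuda.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*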
Just to double-check: my GPU, as stated above, is an RTX 3060. I also tried installing the driver from NVIDIA’s site, which defaults to 525, and still ended up with non-working NCCL / driver detection. Here are some commands that show my device info. lspci | grep VGA returns:
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2504 (rev a1)
and sudo lshw -C video returns:
*-display
description: VGA compatible controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:01:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: iomemory:400-3ff iomemory:400-3ff irq:16 memory:a0000000-a0ffffff memory:4000000000-400fffffff memory:4010000000-4011ffffff ioport:5000(size=128) memory:c0000-dffff
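For completeness, once the driver is healthy I would expect the card to show up by name in both of these (as far as I can tell, PCI device ID 2504 corresponds to the RTX 3060):
lspci -nn -d 10de:
nvidia-smi -L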
I tried to trim down the errors and boil things down instead of pasting a wall of text from my terminal, but if more output is needed I’m happy to help you help me. Thanks in advance, NVIDIA community.