Hello,
Product: DGX Station
OS: Ubuntu 16.04
NVIDIA-SMI 384.183
Driver Version: 384.183
CUDA Version: 9.0
I have been trying to get PyTorch working on a DGX Station. After several attempts with pip, conda, and nvidia-docker, the issue seems to boil down to the DGX Station still running the NVIDIA driver it shipped with. I have been trying to update the driver from 384 to 455 so that I can install a recent CUDA release, which in turn should let me run the latest PyTorch version. I tried to update the driver by running the CUDA 11.1 runfile installer, which bundles driver 455.23.05, and this is how it went.
I pressed Ctrl+Alt+F1 to switch to a text console, then ran:
sudo service lightdm stop
sudo stop nvidia-digits-server
sudo service docker stop
sudo service nvidia-docker stop
sudo rmmod nvidia-uvm
sudo sh cuda_11.1.0_455.23.05_linux.run
But the installation fails. The two logs are below.
cuda_installer.log:
[INFO]: Driver installation detected by command: apt list --installed | grep -e nvidia-driver-[0-9][0-9][0-9] -e nvidia-[0-9][0-9][0-9]
[INFO]: Cleaning up window
[INFO]: Complete
[INFO]: Checking compiler version…
[INFO]: gcc location: /usr/bin/gcc
[INFO]: gcc version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)
[INFO]: Initializing menu
[INFO]: Setup complete
[INFO]: Components to install:
[INFO]: Driver
[INFO]: 455.23.05
[INFO]: Executing NVIDIA-Linux-x86_64-455.23.05.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 455.23.05 failed, quitting
nvidia_installer.log:
nvidia-installer log file ‘/var/log/nvidia-installer.log’
creation time: Tue Mar 23 18:12:58 2021
installer version: 455.23.05
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
nvidia-installer command line:
./nvidia-installer
--ui=none
--no-questions
--accept-license
--disable-nouveau
--no-cc-version-check
--install-libglvnd
Using built-in stream user interface
-> Detected 40 CPUs online; setting concurrency level to 32.
ERROR: An NVIDIA kernel module ‘nvidia-uvm’ appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module’s usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
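Since I already ran rmmod on nvidia-uvm and the installer still reports it as loaded, one thing I can check before rerunning is which NVIDIA kernel modules are still resident and what their use counts are. Below is a minimal sketch of that check. The function name list_nvidia_modules is my own (hypothetical), and it assumes the usual lsmod column layout of module name, size, and use count.

```shell
#!/bin/sh
# Hypothetical helper: filter lsmod-style input (name, size, use count, ...)
# down to modules whose name starts with "nvidia", printing name and use count.
list_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1, $3 }'
}

# Only query the live kernel if lsmod is available on this machine.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | list_nvidia_modules
fi
```

Note that lsmod prints module names with underscores (nvidia_uvm rather than nvidia-uvm). If nvidia_uvm still shows up here with a nonzero use count, that would suggest something is still holding the module, for example one of the programs named in the error message.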
Does anyone know how I can update the driver on a DGX Station (V100 GPU version) without hitting these errors?