DGX Station V-100 Driver update fail

anand.kadumberi · March 23, 2021, 10:44pm

Hello,
Product: DGX Station
OS: Ubuntu 16.04
NVIDIA-SMI 384.183
Driver Version: 384.183
CUDA Version: 9.0
I have been trying to get PyTorch working on the DGX Station. After several attempts with pip, conda, and nvidia-docker, the issue seems to boil down to the DGX Station still being on the NVIDIA driver it shipped with. I have been trying to update the driver from the existing 384 to 455 so that I can get the latest CUDA, which in turn will let me run the latest PyTorch version. I tried to update the driver by installing the latest CUDA compatible with PyTorch, and this is how it went.
I hit Ctrl+Alt+F1 and ran:

sudo service lightdm stop
sudo stop nvidia-digits-server
sudo service docker stop
sudo service nvidia-docker stop
sudo rmmod nvidia-uvm
sudo sh cuda_11.1.0_455.23.05_linux.run

But the installation fails.
cuda_installer.log:
[INFO]: Driver installation detected by command: apt list --installed | grep -e nvidia-driver-[0-9][0-9][0-9] -e nvidia-[0-9][0-9][0-9]
[INFO]: Cleaning up window
[INFO]: Complete
[INFO]: Checking compiler version…
[INFO]: gcc location: /usr/bin/gcc

[INFO]: gcc version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)

[INFO]: Initializing menu
[INFO]: Setup complete
[INFO]: Components to install:
[INFO]: Driver
[INFO]: 455.23.05
[INFO]: Executing NVIDIA-Linux-x86_64-455.23.05.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 455.23.05 failed, quitting

nvidia_installer.log:
nvidia-installer log file ‘/var/log/nvidia-installer.log’
creation time: Tue Mar 23 18:12:58 2021
installer version: 455.23.05

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
./nvidia-installer
--ui=none
--no-questions
--accept-license
--disable-nouveau
--no-cc-version-check
--install-libglvnd

Using built-in stream user interface
→ Detected 40 CPUs online; setting concurrency level to 32.
ERROR: An NVIDIA kernel module ‘nvidia-uvm’ appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module’s usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Does anyone know how I can update the drivers without errors on the DGX Station (V100 GPU version)?

ScottEllis · April 16, 2021, 8:39pm

Wow, that version you’re running does look pretty old!

In general, for DGX systems running DGX OS, the recommendation is to upgrade drivers from the already configured repositories (e.g., the DGX repository for the OS version you’re running) and not to run the driver installer (the same holds true for CUDA). This ensures the system stays on tested, supported versions and avoids issues like the one you’re seeing now.
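
To see whether the driver currently on the system came from packages, something along these lines may help. It is only a minimal sketch: it reuses the package-name pattern that the CUDA installer log above ran against apt list --installed, the function name filter_nvidia_driver_pkgs is an illustrative helper rather than an NVIDIA tool, and the actual package names on a given DGX OS release may differ.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines on stdin that look like NVIDIA
# driver packages. The pattern mirrors the one shown in cuda_installer.log
# (nvidia-driver-NNN or nvidia-NNN).
filter_nvidia_driver_pkgs() {
    grep -e 'nvidia-driver-[0-9][0-9][0-9]' -e 'nvidia-[0-9][0-9][0-9]'
}

# Usage (read-only, makes no changes to the system):
#   apt list --installed 2>/dev/null | filter_nvidia_driver_pkgs
```

If this prints a package, the driver is under apt's control, and upgrading it through the repositories keeps the package database and the installed kernel modules in agreement.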

General instructions for upgrading can be found in the DGX OS release notes (DGX OS 4 Desktop Software Release Notes :: DGX Systems Documentation). From where you are right now, I’d recommend doing at least an apt update && apt full-upgrade to get up to the latest DGX OS version within the major version you’re running.
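
The upgrade suggested above could look roughly like the sketch below. It assumes the DGX repositories are already configured and that the commands are run with sudo; the function names upgrade_dgx_packages and driver_major are illustrative, not part of DGX OS. The reboot is there so the newly installed kernel modules get loaded. The release notes remain the reference for the exact procedure on a given release.

```shell
#!/bin/sh
# Illustrative wrapper (not an NVIDIA tool): upgrade within the current
# DGX OS major version using the repositories already configured.
upgrade_dgx_packages() {
    sudo apt update && sudo apt full-upgrade
}

# Illustrative helper: extract the major driver branch from a version
# string such as "384.183" by dropping everything from the first dot on.
driver_major() {
    printf '%s\n' "${1%%.*}"
}

# Typical sequence, run manually:
#   upgrade_dgx_packages
#   sudo reboot
#   # after the reboot, check which driver branch is now active:
#   driver_major "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
```

Defining the steps as functions means sourcing the file changes nothing on its own; the upgrade only starts when upgrade_dgx_packages is called explicitly.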

You may also consider updating to DGX OS 5 (Ubuntu 20.04 based), which brings in all sorts of improvements! See DGX OS 5 User Guide :: DGX Systems Documentation for that.

Note also that as a DGX customer, you can create a support ticket (EnterpriseSupport@nvidia.com or click “Create Ticket” on the Enterprise Support Portal) to get more immediate help.

ScottE
