DGX Station V-100 Driver update fail

Hello,
Product: DGX Station
OS: Ubuntu 16.04
NVIDIA-SMI 384.183
Driver Version: 384.183
CUDA Version: 9.0
I have been trying to get PyTorch working on a DGX Station. After several attempts with pip, conda, and nvidia-docker, the issue seems to boil down to my DGX Station still running the NVIDIA driver it was purchased with. I have been trying to update the driver from the existing 384 to 455 so that I can use a newer CUDA, which in turn would let me run the latest PyTorch version. I tried to update the driver by running the installer for the latest PyTorch-compatible CUDA, and this is how it went:
I hit ctrl+alt+f1

sudo service lightdm stop
sudo stop nvidia-digits-server
sudo service docker stop
sudo service nvidia-docker stop
sudo rmmod nvidia-uvm
sudo sh cuda_11.1.0_455.23.05_linux.run

But the installation fails
cuda_installer.log:
[INFO]: Driver installation detected by command: apt list --installed | grep -e nvidia-driver-[0-9][0-9][0-9] -e nvidia-[0-9][0-9][0-9]
[INFO]: Cleaning up window
[INFO]: Complete
[INFO]: Checking compiler version…
[INFO]: gcc location: /usr/bin/gcc

[INFO]: gcc version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)

[INFO]: Initializing menu
[INFO]: Setup complete
[INFO]: Components to install:
[INFO]: Driver
[INFO]: 455.23.05
[INFO]: Executing NVIDIA-Linux-x86_64-455.23.05.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 455.23.05 failed, quitting

nvidia_installer.log:
nvidia-installer log file ‘/var/log/nvidia-installer.log’
creation time: Tue Mar 23 18:12:58 2021
installer version: 455.23.05

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
./nvidia-installer
--ui=none
--no-questions
--accept-license
--disable-nouveau
--no-cc-version-check
--install-libglvnd

Using built-in stream user interface
-> Detected 40 CPUs online; setting concurrency level to 32.
ERROR: An NVIDIA kernel module ‘nvidia-uvm’ appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module’s usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Does anyone know how I can update the drivers on a DGX Station (V100 GPU version) without errors?

Wow, that version you’re running does look pretty old!

In general, for DGX systems running DGX OS, the recommendation is to upgrade drivers from the already configured repositories (e.g., the DGX repository for the OS version you’re running) and not to run the driver installer (the same holds true for CUDA). This keeps the system on tested, supported versions and avoids issues like the one you’re seeing now.
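
For context on the installer error in the log: it was raised because the nvidia-uvm kernel module was still loaded when the runfile started. If you want to see which NVIDIA modules are loaded and how many users each has, a minimal sketch follows. It assumes the usual three-column lsmod layout (module, size, use count), and nvidia_modules is a hypothetical helper name, not part of any NVIDIA tool.

```shell
# Hypothetical helper: filter lsmod-style text on stdin down to the
# NVIDIA kernel modules, printing "module use_count" per line.
# Assumes the standard lsmod columns: Module, Size, Used by.
nvidia_modules() {
    # NR > 1 skips the header row; $3 is the use count.
    awk 'NR > 1 && $1 ~ /^nvidia/ { print $1, $3 }'
}

# Typical use on the machine itself (not executed here):
#   lsmod | nvidia_modules
```

A nonzero use count generally means something still holds the module, which would be consistent with the message the installer printed. This is only a diagnostic; the repository-based upgrade remains the recommended route.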

General instructions for upgrading can be found in the DGX OS release notes at DGX OS Desktop Software Release Notes :: DGX Systems Documentation. From where you are right now, I’d recommend doing at least an apt update && apt full-upgrade to get up to the latest DGX OS version within the major version you’re running.
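
As a rough sketch of that upgrade, the commented flow below records the driver version, upgrades from the configured repositories, and checks the result after a reboot. The helper names driver_major and driver_at_least are hypothetical, the 455 threshold is simply the version mentioned in the question, and which driver version the upgrade actually delivers depends on your DGX OS release, so treat this as an illustration and follow the release notes for the authoritative steps.

```shell
# Hypothetical helper: print the major number of a driver version
# string, e.g. "384.183" gives "384".
driver_major() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: succeed if the driver version in $1 has a major
# number greater than or equal to $2.
driver_at_least() {
    [ "$(driver_major "$1")" -ge "$2" ]
}

# Sketch of the flow on the DGX Station (not executed here):
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
#   sudo apt update && sudo apt full-upgrade
#   sudo reboot
#   v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_at_least "$v" 455 || echo "driver $v is older than 455"
```

The reboot is there on the assumption that the new kernel module is only loaded after a restart.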

You may also consider updating to DGX OS 5 (Ubuntu 20.04 based), which brings in all sorts of improvements! See DGX OS 5.0 User Guide :: DGX Systems Documentation for that.

Note also that as a DGX customer, you can create a support ticket (EnterpriseSupport@nvidia.com or click “Create Ticket” on the Enterprise Support Portal) to get more immediate help.

ScottE