Failed to initialize NVML: Unknown Error when running nvidia-smi on Docker container

Hi, I’m new to the forum. I started using Docker a few months ago and I’m working on my graduation thesis.
I know there are many topics like this in the forum. I’ve already read all of them, but they didn’t solve my problem.
I’m currently using Ubuntu 20.04 in dual boot with Windows 10. I’ve already disabled Secure Boot. Windows runs on disk C, the primary, and Ubuntu runs on disk D, so they are separate.
I believe my computer meets all the requirements to run CUDA. I checked them against the Installation Guide Linux :: CUDA Toolkit Documentation.
Something is weird, though. I can install the driver and switch to my Nvidia GPU (sudo prime-select nvidia) directly from the Ubuntu terminal, reboot and run the command “nvidia-smi”. It works without problems.
But when I build my Docker image using the same commands inside the Dockerfile, I receive this error:
Failed to initialize NVML: Unknown Error

SPECS OF MY SYSTEM
$ uname -m && cat /etc/*release
x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Kernel version: $uname -r
Linux 5.4.0-48-generic x86_64

Docker version: $docker -v
Docker version 19.03.13

GPUs on computer (integrated and dedicated)
$sudo lshw -c video
*-display
description: VGA compatible controller
product: GP107M [GeForce GTX 1050 Ti Mobile]
vendor: NVIDIA Corporation
physical id: 0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nouveau latency=0

*-display
description: VGA compatible controller
product: UHD Graphics 630 (Mobile)
vendor: Intel Corporation
physical id: 2
version: 00
width: 64 bits
clock: 33MHz
capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
configuration: driver=i915 latency=0

Nvidia GPU: GeForce GTX 1050 Ti
$lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GP107M [GeForce GTX 1050 Ti Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev ff)

Recommended Nvidia Driver:
nvidia-driver-450

$dpkg --get-selections | egrep "nvidia|bbswitch"
libnvidia-cfg1-450:amd64 install
libnvidia-common-450 install
libnvidia-compute-450:amd64 install
libnvidia-decode-450:amd64 install
libnvidia-encode-450:amd64 install
libnvidia-extra-450:amd64 install
libnvidia-fbc1-450:amd64 install
libnvidia-gl-450:amd64 install
libnvidia-ifr1-450:amd64 install
nvidia-compute-utils-450 install
nvidia-dkms-450 install
nvidia-driver-450 install
nvidia-kernel-common-450 install
nvidia-kernel-source-450 install
nvidia-prime install
nvidia-settings install
nvidia-utils-450 install
xserver-xorg-video-nvidia-450 install

GCC version: $gcc -v
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

GLIBC version: $ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.1) 2.31

Here is my Dockerfile:
#OpenCV with CUDA Acceleration Test | by Mikkel Wilson | Medium
FROM nvidia/cuda:11.0-devel-ubuntu20.04

RUN apt-get update
#I’ve added my user to the docker group so “sudo” would be unnecessary, but for safety I’ve used it anyway
RUN apt-get install -y sudo unzip nano git wget coreutils

RUN sudo apt-get update

#Verify the System has the Correct Kernel Headers and Development Packages Installed
#The kernel headers and development packages for the currently running kernel can be installed with:
RUN sudo apt-get install -y linux-headers-$(uname -r)

#---------------------------- AVOID TZDATA & KEYBOARD CONFIG ------------------------------------------
#Avoiding user interaction with tzdata
#for apt to be noninteractive

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y keyboard-configuration
ENV TZ=Europe/Minsk
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update
#------------------------------------------------------------------------------------------------

#------------------------------ SET-UP Driver GPU ------------------------------------------

#lspci needs pciutils - to check Linux system hardware information GPU
RUN sudo apt-get install -y pciutils
#use: lspci -k | grep -A 2 -i "VGA" or lspci | grep VGA or lspci -vnnn | perl -lne 'print if /^\d+:.+([\S+:\S+])/' | grep VGA

#How To Switch Between Intel and Nvidia Graphics Card on Ubuntu

RUN sudo apt-get update

#Install nvidia-smi and Nvidia driver for my GPU: Nvidia GeForce GTX 1050 Ti
RUN sudo apt-get install ubuntu-drivers-common
#All compatible drivers for my GPU
RUN sudo ubuntu-drivers devices
RUN sudo ubuntu-drivers autoinstall
#Or I could only install the 450 because it’s the one recommended.
#RUN sudo apt-get install -y nvidia-driver-450

#Now I switch to my Nvidia GPU
RUN sudo prime-select nvidia
#If the Nvidia GPU was selected, the following command shows the result
RUN prime-select query
#-------------------------------------------------------------------------------------------------------

#--------------------------------------- CUDA Installation ------------------------------------------------------

#RUN apt-get install nvidia-container-toolkit #DOES NOT WORK!

#------------- CUDA Toolkit 11.1 - Installer for Linux Ubuntu 20.04 x86_64
#CUDA Toolkit 11.7 Update 1 Downloads | NVIDIA Developer
RUN sudo apt-get update
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
RUN sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
RUN wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
#dpkg → is the default package manager on Ubuntu. You can use it to install, configure, update or remove packages.
RUN sudo dpkg -i cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
RUN sudo apt-key add /var/cuda-repo-ubuntu2004-11-1-local/7fa2af80.pub
RUN sudo apt-get update

#---------------------------- AVOID TZDATA & KEYBOARD CONFIG ------------------------------------------
#Avoiding user interaction with tzdata
#for apt to be noninteractive

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y keyboard-configuration
ENV TZ=Europe/Minsk
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update
#------------------------------------------------------------------------------------------------

RUN sudo apt-get -y install cuda
RUN sudo apt-get update

#----------- End Dockerfile --------------------------------------

I’ve already tried the following commands but NONE have solved my problem…

1) Remove all Nvidia packages and reinstall
RUN sudo apt-get autoremove -y --purge $(dpkg --get-selections | grep nvidia | awk '{print $1}')
RUN sudo ubuntu-drivers autoinstall

2) Disable the nouveau driver
RUN mkdir -p /etc/modprobe.d/ && touch /etc/modprobe.d/blacklist-nvidia-nouveau.conf
RUN echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nvidia-nouveau.conf
RUN echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
RUN cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf
RUN apt-get update

3) Nvidia fallback service, nouveau and bbswitch
#NVIDIA GPU, Optimus Prime and Ubuntu 18.04 Woes | by Amitosh Swain Mahapatra | Medium
RUN sudo systemctl disable nvidia-fallback.service
RUN sudo apt-get --reinstall install -y grub-pc
RUN sudo apt-get update

#Blacklist the nouveau driver using the GRUB config. In /etc/default/grub look for the line GRUB_CMDLINE_LINUX.
#Add nouveau.blacklist=1 to that parameter. If the line is not present, add the line GRUB_CMDLINE_LINUX="nouveau.blacklist=1"
WORKDIR /etc/default/
RUN sed -i grub -e '11s!GRUB_CMDLINE_LINUX=""!GRUB_CMDLINE_LINUX="nouveau.blacklist=1"!'
RUN sudo update-grub
WORKDIR /
#bbswitch (only for laptop users interested for power savings, if your system supports it.)
RUN sudo apt-get install -y bbswitch-dkms
#Configure the system to load it by appending bbswitch in /etc/modules
#To disable the card on boot run
RUN sudo echo "options bbswitch load_state=0" | sudo tee /etc/modprobe.d/bbswitch.conf
RUN sudo apt-get update
RUN sudo prime-select intel
RUN sudo prime-select nvidia
RUN sudo apt-get update

Can someone help me? Thank you very much.

The way NVIDIA recommends you set up a machine for docker usage is:

  1. Install the (latest) NVIDIA driver on the base machine
  2. Install the NVIDIA Container Toolkit on the base machine
  3. Install docker-ce on the base machine

None of those things should be or need to be installed in the container. You should not install the driver in the container.
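
As a rough illustration of the three host-side steps above, the sketch below checks that the relevant command-line tools are present on the base machine. The helper name `missing_tools` is made up for this example. `nvidia-smi`, `docker` and `nvidia-container-cli` are the binaries that the driver, Docker and the container toolkit normally provide, but package contents can differ between versions, so treat that list as an assumption.

```shell
#!/bin/sh
# missing_tools: print each named command that is NOT found on PATH.
# (Hypothetical helper, not part of any NVIDIA or Docker tooling.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example, run on the host (not inside a container):
#   missing_tools nvidia-smi docker nvidia-container-cli
# An empty result means all three commands were found.
```

Finding the commands only shows that the packages are installed. It does not prove that the driver is actually loaded.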

Therefore this is bad bad bad in your dockerfile:

sudo apt-get -y install cuda

To verify your machine is set up properly for Docker usage, try running an NVIDIA CUDA container from Docker Hub first, before trying your own container. To learn how to build a Dockerfile properly using this method, try studying the Dockerfiles for the NVIDIA CUDA containers on Docker Hub.
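
A minimal sketch of that verification step, assuming Docker 19.03 or newer (which has the `--gpus` flag) and the container toolkit already installed on the host. The image tag is simply the one used in the Dockerfile in the question and may no longer be published on Docker Hub. `gpu_smoke_test_cmd` is a hypothetical helper name.

```shell
#!/bin/sh
# gpu_smoke_test_cmd: print the docker command that runs nvidia-smi
# inside a stock CUDA image. Printing instead of executing lets you
# inspect the command before running it.
gpu_smoke_test_cmd() {
    image="${1:-nvidia/cuda:11.0-devel-ubuntu20.04}"
    printf 'docker run --rm --gpus all %s nvidia-smi\n' "$image"
}

# To actually run it on the host:
#   sh -c "$(gpu_smoke_test_cmd)"
```

If this prints the usual nvidia-smi table, the host side is fine and any remaining problem is most likely in your own image.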


Hi Robert,
thank you so much for your advice. I believed that the Nvidia drivers had to be installed inside the container so that, when I distributed it to other hosts, the whole configuration process would be automatic for every supported Nvidia GPU, and that this would be possible using the command “sudo ubuntu-drivers autoinstall”. Obviously the target host must have Docker installed.
Thanks for correcting my mistakes.
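
For anyone landing here with the same assumption, a Dockerfile along these lines keeps the driver out of the image. This is only a sketch: the base image is the one from the original Dockerfile, the package list is trimmed from it, and `write_dockerfile` and the `cuda-test` tag are made-up names.

```shell
#!/bin/sh
# write_dockerfile: write a driver-free Dockerfile into the given directory.
write_dockerfile() {
    dir="$1"
    mkdir -p "$dir"
    cat > "$dir/Dockerfile" <<'EOF'
# The CUDA toolkit comes from the base image; the driver stays on the host.
FROM nvidia/cuda:11.0-devel-ubuntu20.04

# Keep apt from prompting (tzdata, keyboard-configuration) during the build.
ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
    && apt-get install -y --no-install-recommends unzip nano git wget \
    && rm -rf /var/lib/apt/lists/*
EOF
}

# Usage on the host:
#   write_dockerfile ./cuda-test
#   docker build -t cuda-test ./cuda-test
#   docker run --rm --gpus all cuda-test nvidia-smi
```

The image stays portable across hosts because each host supplies its own driver at `docker run` time through the container toolkit.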

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
I initially followed the official guide, but when I ran the container I received this error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\n\""": unknown.
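
The "driver not loaded" part of that message suggests the NVIDIA kernel module was not active on the host, which would match the `driver=nouveau` line in the lshw output earlier in the thread. A small sketch for checking this, assuming the usual `/proc/modules` format where the module name is the first field. `gpu_kernel_module` is a hypothetical helper.

```shell
#!/bin/sh
# gpu_kernel_module: read /proc/modules-style text on stdin and report
# which GPU kernel module is loaded: "nvidia", "nouveau", or "none".
# If both are listed, "nvidia" is reported.
gpu_kernel_module() {
    awk '
        $1 == "nvidia"  { nv = 1 }
        $1 == "nouveau" { nouveau = 1 }
        END {
            if (nv)           print "nvidia"
            else if (nouveau) print "nouveau"
            else              print "none"
        }
    '
}

# Usage on the host:
#   gpu_kernel_module < /proc/modules
```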

So I followed another guide: Ubuntu 20.04 LTS : NVIDIA Container Toolkit : Server World
The solution for me was disabling the nouveau driver and rebooting the system.
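
For reference, disabling nouveau on the host usually comes down to a modprobe blacklist file like the one attempted in the Dockerfile above, but written on the base machine. A sketch, assuming the same file name as earlier in this thread. `write_nouveau_blacklist` is a hypothetical helper, and the `update-initramfs` step is the usual Ubuntu follow-up rather than something stated in this thread.

```shell
#!/bin/sh
# write_nouveau_blacklist: write the blacklist file into the given
# modprobe.d directory (default: /etc/modprobe.d, which needs root).
write_nouveau_blacklist() {
    dir="${1:-/etc/modprobe.d}"
    mkdir -p "$dir"
    printf '%s\n' \
        'blacklist nouveau' \
        'options nouveau modeset=0' \
        > "$dir/blacklist-nvidia-nouveau.conf"
}

# On the host, as root (NOT in a Dockerfile):
#   write_nouveau_blacklist
#   update-initramfs -u
#   reboot
```

The reboot matters because the blacklist only takes effect the next time the kernel decides which module to load for the GPU.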

Now I’ve managed to create a Docker container with CUDA and OpenCV.
Have a nice day!