Failed to initialize NVML: Unknown Error when running nvidia-smi on Docker container

matteo.springolo98 · October 14, 2020, 8:41pm

Hi, I’m new to the forum. I started using Docker a few months ago and I’m working on my graduation thesis.
I know there are many topics like this on the forum. I’ve read all of them, but none solved my problem.
I’m currently running Ubuntu 20.04 in dual boot with Windows 10. I’ve already disabled Secure Boot. Windows is on the primary disk (C) and Ubuntu is on a second disk (D), so they are separate.
I believe my computer meets all the requirements for CUDA; I checked them against Installation Guide Linux :: CUDA Toolkit Documentation.
Something is weird, though. I can install the driver and switch to my Nvidia GPU (sudo prime-select nvidia) directly from the Ubuntu terminal, reboot, and run “nvidia-smi”. It works without problems.
But when I build my Docker image using the same commands inside the Dockerfile, I get this error:
Failed to initialize NVML: Unknown Error

SPECS OF MY SYSTEM
$ uname -m && cat /etc/*release
x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="Data privacy | Ubuntu"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Kernel version: $uname -r
Linux 5.4.0-48-generic x86_64

Docker version: $docker -v
Docker version 19.03.13

GPUs on the computer (integrated and dedicated):
$sudo lshw -c video
*-display
description: VGA compatible controller
product: GP107M [GeForce GTX 1050 Ti Mobile]
vendor: NVIDIA Corporation
physical id: 0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nouveau latency=0

*-display
description: VGA compatible controller
product: UHD Graphics 630 (Mobile)
vendor: Intel Corporation
physical id: 2
version: 00
width: 64 bits
clock: 33MHz
capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
configuration: driver=i915 latency=0

Nvidia GPU: GeForce GTX 1050 Ti
$lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GP107M [GeForce GTX 1050 Ti Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev ff)

Recommended Nvidia Driver:
nvidia-driver-450

$dpkg --get-selections | egrep "nvidia|bbswitch"
libnvidia-cfg1-450:amd64 install
libnvidia-common-450 install
libnvidia-compute-450:amd64 install
libnvidia-decode-450:amd64 install
libnvidia-encode-450:amd64 install
libnvidia-extra-450:amd64 install
libnvidia-fbc1-450:amd64 install
libnvidia-gl-450:amd64 install
libnvidia-ifr1-450:amd64 install
nvidia-compute-utils-450 install
nvidia-dkms-450 install
nvidia-driver-450 install
nvidia-kernel-common-450 install
nvidia-kernel-source-450 install
nvidia-prime install
nvidia-settings install
nvidia-utils-450 install
xserver-xorg-video-nvidia-450 install

GCC version: $gcc -v
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

GLIBC version: $ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.1) 2.31

Here is my Dockerfile:
#OpenCV with CUDA Acceleration Test | by Mikkel Wilson | Medium
FROM nvidia/cuda:11.0-devel-ubuntu20.04

RUN apt-get update
#I've added my user to the docker group so "sudo" would be unnecessary, but to be safe I've used it anyway
RUN apt-get install -y sudo unzip nano git wget coreutils

RUN sudo apt-get update

#Verify the System has the Correct Kernel Headers and Development Packages Installed
#The kernel headers and development packages for the currently running kernel can be installed with:
RUN sudo apt-get install -y linux-headers-$(uname -r)

#---------------------------- AVOID TZDATA & KEYBOARD CONFIG ------------------------------------------
#Avoiding user interaction with tzdata
#for apt to be noninteractive

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y keyboard-configuration
ENV TZ=Europe/Minsk
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update
#------------------------------------------------------------------------------------------------

#------------------------------ SET-UP Driver GPU ------------------------------------------

#lspci needs pciutils - to check Linux system hardware information GPU
RUN sudo apt-get install -y pciutils
#use: lspci -k | grep -A 2 -i "VGA" or lspci | grep VGA or lspci -vnnn | perl -lne 'print if /^\d+:.+([\S+:\S+])/' | grep VGA

#How To Switch Between Intel and Nvidia Graphics Card on Ubuntu

RUN sudo apt-get update

#Install nvidia-smi and Nvidia driver for my GPU: Nvidia GeForce GTX 1050 Ti
RUN sudo apt-get install ubuntu-drivers-common
#All compatible drivers for my GPU
RUN sudo ubuntu-drivers devices
RUN sudo ubuntu-drivers autoinstall
#Or I could only install the 450 because it’s the one recommended.
#RUN sudo apt-get install -y nvidia-driver-450

#Now I switch to my Nvidia GPU
RUN sudo prime-select nvidia
#If Nvidia GPU was selected I see the result with the follow command
RUN prime-select query
#-------------------------------------------------------------------------------------------------------

#--------------------------------------- CUDA Installation ------------------------------------------------------

#RUN apt-get install nvidia-container-toolkit #NOT WORK!

#------------- CUDA Toolkit 11.1 - Installer for Linux Ubuntu 20.04 x86_64
#CUDA Toolkit 11.7 Update 1 Downloads | NVIDIA Developer
RUN sudo apt-get update
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
RUN sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
RUN wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
#dpkg → is the default package manager on Ubuntu. You can use it to install, configure, update or remove packages.
RUN sudo dpkg -i cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
RUN sudo apt-key add /var/cuda-repo-ubuntu2004-11-1-local/7fa2af80.pub
RUN sudo apt-get update

#---------------------------- AVOID TZDATA & KEYBOARD CONFIG ------------------------------------------
#Avoiding user interaction with tzdata
#for apt to be noninteractive

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y keyboard-configuration
ENV TZ=Europe/Minsk
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update
#------------------------------------------------------------------------------------------------

RUN sudo apt-get -y install cuda
RUN sudo apt-get update

#----------- End Dockerfile --------------------------------------

I’ve already tried the following commands but NONE have solved my problem…

1) Remove all Nvidia packages and reinstall
RUN sudo apt-get autoremove -y --purge $(dpkg --get-selections| grep nvidia | awk '{print $1}')
RUN sudo ubuntu-drivers autoinstall

2) Disable the Nouveau driver
RUN mkdir -p /etc/modprobe.d/ && touch /etc/modprobe.d/blacklist-nvidia-nouveau.conf
RUN echo "blacklist nouveau">/etc/modprobe.d/blacklist-nvidia-nouveau.conf
RUN echo "options nouveau modeset=0">>/etc/modprobe.d/blacklist-nvidia-nouveau.conf
RUN cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf
RUN apt-get update

3) Nvidia fallback service & Nouveau & bbswitch
#NVIDIA GPU, Optimus Prime and Ubuntu 18.04 Woes | by Amitosh Swain Mahapatra | Medium
RUN sudo systemctl disable nvidia-fallback.service
RUN sudo apt-get --reinstall install -y grub-pc
RUN sudo apt-get update

#Blacklist nouveau driver using GRUB config. In /etc/default/grub look for a line GRUB_CMDLINE_LINUX .
#Add nouveau.blacklist=1 into that parameter. #If the line is not present add this line GRUB_CMDLINE_LINUX="nouveau.blacklist=1"
WORKDIR /etc/default/
RUN sed -i grub -e '11s!GRUB_CMDLINE_LINUX=""!GRUB_CMDLINE_LINUX="nouveau.blacklist=1"!'
RUN sudo update-grub
WORKDIR /
#bbswitch (only for laptop users interested for power savings, if your system supports it.)
RUN sudo apt-get install -y bbswitch-dkms
#Configure the system to load it by appending bbswitch in /etc/modules
#To disable the card on boot run
RUN sudo echo "options bbswitch load_state=0" | sudo tee /etc/modprobe.d/bbswitch.conf
RUN sudo apt-get update
RUN sudo prime-select intel
RUN sudo prime-select nvidia
RUN sudo apt-get update

Can someone help me? Thank you very much.

Robert_Crovella · October 15, 2020, 10:13pm

The way NVIDIA recommends you set up a machine for docker usage is:

Install the (latest) NVIDIA driver on the base machine
Install the NVIDIA container toolkit on the base machine
Install docker-ce on the base machine

None of those things should be or need to be installed in the container. You should not install the driver in the container.

Therefore this is bad bad bad in your dockerfile:

sudo apt-get -y install cuda
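
For contrast, a Dockerfile that follows this advice leaves the driver to the host and only adds user-space packages on top of a CUDA base image. The following is a minimal sketch, not a tested recipe, written as a shell heredoc so it can be pasted as-is. The image name `my-cuda-app` and the package list are placeholders, and the base tag is simply the one from the Dockerfile above; it may no longer be published on Docker Hub, in which case a current `nvidia/cuda` tag would have to be substituted.

```shell
#!/bin/sh
# Write a minimal Dockerfile that installs no driver, no `cuda` metapackage
# and no kernel headers. The container toolkit on the host makes the host's
# driver libraries available when the container starts.
cat > Dockerfile <<'EOF'
# Base tag taken from the original post; swap in a current tag if needed.
FROM nvidia/cuda:11.0-devel-ubuntu20.04

# Keep apt non-interactive during the build (avoids tzdata/keyboard prompts).
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Europe/Minsk

# User-space tools only; update and install in a single layer.
RUN apt-get update \
 && apt-get install -y --no-install-recommends unzip nano git wget \
 && rm -rf /var/lib/apt/lists/*

# Runs at container start, which is when the GPU is actually visible.
CMD ["nvidia-smi"]
EOF

# Build and run (needs docker and the NVIDIA container toolkit on the host):
#   docker build -t my-cuda-app .
#   docker run --rm --gpus all my-cuda-app
```

Note that `nvidia-smi` is only called at run time. During `docker build` no GPU is attached, so GPU checks placed in `RUN` lines cannot succeed.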

To verify your machine is set up properly for docker usage, try running an NVIDIA CUDA container from Docker Hub first, before trying your own container. To learn how to build a Dockerfile properly using this method, try studying the Dockerfiles for the NVIDIA CUDA containers on Docker Hub.
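
The verification step can be sketched as a short host-side script. The helper name `check_host_tools` is made up for illustration, and the image tag in the `docker run` line is an assumption (it matches the CUDA 11.0 images used in this thread but may no longer be published), so treat this as a starting point rather than an official procedure.

```shell
#!/bin/sh
# Report whether each host-side prerequisite is on PATH.
# Prints one "found"/"missing" line per tool; returns non-zero if any is missing.
check_host_tools() {
    status=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# nvidia-smi comes with the driver, nvidia-container-cli is installed
# alongside the container toolkit, and docker comes with docker-ce.
if check_host_tools nvidia-smi nvidia-container-cli docker; then
    # Smoke test with a stock CUDA image before building anything custom.
    # The tag is an assumption; pick one that is currently published.
    docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
else
    echo "Fix the host setup first; nothing needs to be installed in the container."
fi
```

If the stock image prints the usual `nvidia-smi` table, the host is fine and any remaining problem is in the custom Dockerfile.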

matteo.springolo98 · October 18, 2020, 1:04pm

Hi Robert,
thank you so much for your advice. I believed the Nvidia drivers had to be installed inside the container so that, when I distributed it to other hosts, the whole configuration process would be automatic for every supported Nvidia GPU; I thought the command “sudo ubuntu-drivers autoinstall” would make that possible. Obviously the target host must have Docker installed.
Thanks for correcting my mistakes.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
I initially followed the official guide, but when I ran the container I received this error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\n\""": unknown.
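
The `driver not loaded` part of this message indicates that the NVIDIA kernel module was not loaded on the host at that moment. One way to see which of the two competing modules is active is to look at `/proc/modules`, the same data `lsmod` prints. A small sketch; the helper name `module_loaded` is hypothetical:

```shell
#!/bin/sh
# Return 0 if the named kernel module appears in the module list.
# $1 = module name, $2 = module list file (defaults to /proc/modules).
module_loaded() {
    # Each line starts with the module name followed by a space, so
    # "nvidia " does not match "nvidia_drm" or "nvidia_modeset".
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

for mod in nvidia nouveau; do
    if module_loaded "$mod"; then
        echo "$mod: loaded"
    else
        echo "$mod: not loaded"
    fi
done
```

A host that reports nouveau as loaded and nvidia as not loaded would be consistent with the `driver=nouveau` line in the `lshw` output from the first post.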

So I followed another guide: Ubuntu 20.04 LTS : NVIDIA Container Toolkit : Server World
The solution for me was to disable the nouveau driver and reboot the system.
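
For readers landing here with the same error, disabling nouveau on the host usually comes down to a modprobe blacklist file plus an initramfs rebuild. The sketch below reuses the two config lines from the first post; the function name and the file name `blacklist-nouveau.conf` are illustrative choices, and the exact steps in the linked guide may differ.

```shell
#!/bin/sh
# Write a modprobe config that keeps nouveau from loading.
# $1 = target file (defaults to the usual modprobe.d location, which needs root).
write_nouveau_blacklist() {
    target="${1:-/etc/modprobe.d/blacklist-nouveau.conf}"
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$target"
}

# On the host (not in a Dockerfile), as root:
#   write_nouveau_blacklist
#   update-initramfs -u    # rebuild the initramfs so the blacklist applies at boot
#   reboot
# After the reboot, nvidia-smi on the host should list the GPU.
```

The same lines had no effect inside the Dockerfile because kernel modules belong to the host kernel; a container build cannot load, unload, or blacklist them.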

Now I’ve managed to create a Docker container with CUDA and OpenCV.
Have a nice day!
