AWS AMI - couldn't communicate with the NVIDIA driver

System information (version)
  • Instance name: ec2-*
  • Type: Amazon Linux 2023 AMI
  • Operating System / Platform: Ubuntu
  • Nvidia-driver: nvidia-driver-470/nvidia-driver-470-server
  • Cuda: 11.4
Detailed description
  • After installing the NVIDIA driver and nvidia-docker2 to use the GPU with Docker, starting a container fails with the following error: OCI runtime create failed.
Steps to reproduce

After logging into the AMI, nvidia-smi shows that the driver is not reachable:

nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
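
A quick way to narrow this down is to check whether the NVIDIA kernel module is actually loaded. This is only a diagnostic sketch, assuming a standard Ubuntu setup with DKMS:

lsmod | grep nvidia           # no output means the kernel module is not loaded
dkms status                   # shows whether DKMS built the module for the running kernel
sudo dmesg | grep -i nvidia   # kernel messages about driver build/load failures
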
  • Update and upgrade Ubuntu
  • Install the NVIDIA 470 server driver
sudo apt install nvidia-driver-470-server

DKMS: install completed.
Setting up xserver-xorg-video-nvidia-470-server (470.182.03-0ubuntu0.20.04.1) ...
Setting up libnvidia-common-470-server (470.182.03-0ubuntu0.20.04.1) ...
Setting up libnvidia-decode-470-server:amd64 (470.182.03-0ubuntu0.20.04.1) ...
Setting up nvidia-compute-utils-470-server (470.182.03-0ubuntu0.20.04.1) ...
Warning: The home dir /nonexistent you specified can't be accessed: No such file or directory
Adding system user `nvidia-persistenced' (UID 119) ...
Adding new group `nvidia-persistenced' (GID 123) ...
Adding new user `nvidia-persistenced' (UID 119) with group `nvidia-persistenced' ...
Not creating home directory `/nonexistent'.
Setting up libnvidia-encode-470-server:amd64 (470.182.03-0ubuntu0.20.04.1) ...
Setting up libnvidia-gl-470-server:amd64 (470.182.03-0ubuntu0.20.04.1) ...
Setting up libnvidia-ifr1-470-server:amd64 (470.182.03-0ubuntu0.20.04.1) ...
Setting up nvidia-driver-470-server (470.182.03-0ubuntu0.20.04.1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.9) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for initramfs-tools (0.136ubuntu6.7) ...
update-initramfs: Generating /boot/initrd.img-5.11.0-1028-aws
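
Note that after the DKMS build the new module is not loaded into the running kernel automatically. A minimal check at this point (sketch, assuming Secure Boot is not blocking the unsigned module):

sudo modprobe nvidia   # load the freshly built module (or simply reboot)
nvidia-smi             # should now list the GPU if the module loaded
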
  • Install nvidia-docker2
sudo apt-get install -y nvidia-docker2

Reading package lists... Done
Building dependency tree       
Reading state information... Done
nvidia-docker2 is already the newest version (2.13.0-1).
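
For reference, nvidia-docker2 registers the NVIDIA runtime in /etc/docker/daemon.json, and the Docker daemon has to be restarted to pick it up. A typical configuration looks roughly like this (sketch; the exact file contents may differ on a given system):

cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

sudo systemctl restart docker   # reload the daemon so the nvidia runtime is available
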
  • Run the docker container:
nvidia-docker run -it -d --restart=always --name tao_container -v `pwd`:/workspace --net=host nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3

d60b57c75e359a8e4f485939fb61a4c6473203639e8b4a057f2b2b0dcba688e4
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
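As a cross-check (not a fix), the same image can be started through Docker's native GPU support instead of the nvidia-docker wrapper; if the kernel module is not loaded this fails with a similar nvml error, which points at the driver rather than the Docker integration. Sketch, assuming Docker 19.03+ and that the image ships nvidia-smi:

docker run --rm --gpus all nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3 nvidia-smi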

  • Check nvidia hardware information:
nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0531 03:17:04.143207 2346 nvc.c:376] initializing library context (version=1.13.1, build=6f4aea0fca16aaff01bab2567adb34ec30847a0e)
I0531 03:17:04.143270 2346 nvc.c:350] using root /
I0531 03:17:04.143294 2346 nvc.c:351] using ldcache /etc/ld.so.cache
I0531 03:17:04.143307 2346 nvc.c:352] using unprivileged user 1000:1000
I0531 03:17:04.143338 2346 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0531 03:17:04.143543 2346 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0531 03:17:04.148556 2346 nvc.c:258] failed to detect NVIDIA devices
W0531 03:17:04.148871 2347 nvc.c:273] failed to set inheritable capabilities
W0531 03:17:04.148936 2347 nvc.c:274] skipping kernel modules load due to failure
I0531 03:17:04.149402 2348 rpc.c:71] starting driver rpc service
I0531 03:17:04.447027 2346 rpc.c:135] driver rpc service terminated with signal 15
nvidia-container-cli: initialization error: nvml error: driver not loaded
I0531 03:17:04.447106 2346 nvc.c:434] shutting down library context

Hi @Morganh, can you help me check this?

Can anyone help me?

Hi Robert,

I am not sure @Morganh knows about AWS setup issues, and neither do I. So I took the liberty of moving your post to the AWS category. Hopefully they will have some suggestions.

Thanks!


@Robert_Hoang
You can refer to Docker instantiation failed with error (TAO Toolkit - Yolo_v4_tiny) to check if it works.


Thanks, Markus. I hope someone can help.

Hi @Morganh, nice to hear from you. I tried it, but it doesn't work. This error is most likely caused by:

W0531 03:17:04.148556 2346 nvc.c:258] failed to detect NVIDIA devices

The strange thing is that the AMI should already have an NVIDIA GPU and the NVIDIA driver.
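
For completeness, a hardware-level check that does not depend on the driver at all (just a sketch):

lspci | grep -i nvidia   # should list the GPU; if nothing shows up, the instance type has no GPU attached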

Please run nvidia-smi again to check. Also, please try to reboot.

I rebooted both before and after installing the driver, but it didn't work. nvidia-smi still says to install the NVIDIA driver, even though the driver was installed with apt-get.

Please retry as below.
Uninstall:
sudo apt purge nvidia-driver-*
sudo apt autoremove
sudo apt autoclean

Install: sudo apt install nvidia-driver-525
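
After the reinstall, a minimal verification sequence would be (sketch, assuming the new driver builds cleanly):

sudo reboot
# once the instance is back up:
nvidia-smi                      # should now show the GPU and the 525 driver
sudo systemctl restart docker   # make sure Docker picks up the nvidia runtime again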
