System information (version)
- Instance name: ec2-*
- Type: Amazon Linux 2023 AMI
- Operating System / Platform: Ubuntu 20.04
- NVIDIA driver: nvidia-driver-470 / nvidia-driver-470-server
- CUDA: 11.4
Detailed description
- After installing nvidia-driver and nvidia-docker2 in order to use the GPU from Docker, starting a container fails with the following error: OCI runtime create failed.
Steps to reproduce
After logging into the instance, nvidia-smi confirms that no working NVIDIA driver is present:
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
- Update and upgrade Ubuntu packages
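(The exact commands are not recorded in this report; presumably the standard ones:)
sudo apt update
sudo apt upgrade -y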
- Install nvidia driver 470 server
sudo apt install nvidia-driver-470-server
DKMS: install completed.
Setting up xserver-xorg-video-nvidia-470-server (470.182.03-0ubuntu0.20.04.1) ...
Setting up libnvidia-common-470-server (470.182.03-0ubuntu0.20.04.1) ...
Setting up libnvidia-decode-470-server:amd64 (470.182.03-0ubuntu0.20.04.1) ...
Setting up nvidia-compute-utils-470-server (470.182.03-0ubuntu0.20.04.1) ...
Warning: The home dir /nonexistent you specified can't be accessed: No such file or directory
Adding system user `nvidia-persistenced' (UID 119) ...
Adding new group `nvidia-persistenced' (GID 123) ...
Adding new user `nvidia-persistenced' (UID 119) with group `nvidia-persistenced' ...
Not creating home directory `/nonexistent'.
Setting up libnvidia-encode-470-server:amd64 (470.182.03-0ubuntu0.20.04.1) ...
Setting up libnvidia-gl-470-server:amd64 (470.182.03-0ubuntu0.20.04.1) ...
Setting up libnvidia-ifr1-470-server:amd64 (470.182.03-0ubuntu0.20.04.1) ...
Setting up nvidia-driver-470-server (470.182.03-0ubuntu0.20.04.1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.9) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for initramfs-tools (0.136ubuntu6.7) ...
update-initramfs: Generating /boot/initrd.img-5.11.0-1028-aws
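Not part of the original steps, but a quick way to confirm whether the DKMS module was actually built and loaded at this point:
# list DKMS modules and their build status
dkms status
# check whether the nvidia kernel module is currently loaded
lsmod | grep nvidia
nvidia-smi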
- Install nvidia-docker2
sudo apt-get install -y nvidia-docker2
Reading package lists... Done
Building dependency tree
Reading state information... Done
nvidia-docker2 is already the newest version (2.13.0-1).
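As a sanity check (not captured in the original report), nvidia-docker2 registers the nvidia runtime in Docker's daemon configuration, and the Docker daemon must be restarted for the change to take effect:
# default path written by the nvidia-docker2 package
cat /etc/docker/daemon.json
sudo systemctl restart docker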
- Run the docker container:
nvidia-docker run -it -d --restart=always --name tao_container -v `pwd`:/workspace --net=host nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
d60b57c75e359a8e4f485939fb61a4c6473203639e8b4a057f2b2b0dcba688e4
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
- Check nvidia hardware information:
nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0531 03:17:04.143207 2346 nvc.c:376] initializing library context (version=1.13.1, build=6f4aea0fca16aaff01bab2567adb34ec30847a0e)
I0531 03:17:04.143270 2346 nvc.c:350] using root /
I0531 03:17:04.143294 2346 nvc.c:351] using ldcache /etc/ld.so.cache
I0531 03:17:04.143307 2346 nvc.c:352] using unprivileged user 1000:1000
I0531 03:17:04.143338 2346 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0531 03:17:04.143543 2346 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0531 03:17:04.148556 2346 nvc.c:258] failed to detect NVIDIA devices
W0531 03:17:04.148871 2347 nvc.c:273] failed to set inheritable capabilities
W0531 03:17:04.148936 2347 nvc.c:274] skipping kernel modules load due to failure
I0531 03:17:04.149402 2348 rpc.c:71] starting driver rpc service
I0531 03:17:04.447027 2346 rpc.c:135] driver rpc service terminated with signal 15
nvidia-container-cli: initialization error: nvml error: driver not loaded
I0531 03:17:04.447106 2346 nvc.c:434] shutting down library context
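The repeated "nvml error: driver not loaded" above suggests the NVIDIA kernel module was never loaded after the DKMS install. A minimal set of follow-up checks, assuming a standard Ubuntu setup (not yet run as part of this report):
# check whether the module is loaded and try loading it manually
lsmod | grep nvidia
sudo modprobe nvidia
# if modprobe fails, the kernel log usually explains why
sudo dmesg | grep -i nvidia
# a reboot may be needed so the regenerated initramfs and new module take effect
sudo reboot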