NVIDIA GPU-Optimized AMI is missing drivers and can't run the PyTorch NGC container

This AMI (AWS Marketplace: NVIDIA GPU-Optimized AMI, AMI ID "ami-041855406987a648b") is supposed to come preconfigured to run NVIDIA GPU Cloud (NGC) containers such as the PyTorch one.

However, on a freshly launched p3.2xlarge instance on AWS, the GPU driver setup fails.

After SSHing in, I see this error message:

Installing drivers ...
modprobe: FATAL: Module nvidia not found in directory /lib/modules/6.2.0-1011-aws
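
That modprobe failure suggests the NVIDIA kernel module was never built for the running 6.2.0-1011-aws kernel. A couple of quick checks (assuming the stock DKMS tooling on this Ubuntu-based AMI) should show whether it exists at all:

~$ lsmod | grep nvidia                         # lists the nvidia module only if it is currently loaded
~$ dkms status                                 # shows whether an nvidia module has been built for the running kernel
~$ ls /lib/modules/$(uname -r)/updates/dkms    # where DKMS normally installs the built module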

And sure enough, running containers such as PyTorch (PyTorch | NVIDIA NGC) does not work:

~$ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.11-py3
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
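
Before blaming Docker, it is probably worth confirming on the host that the driver really isn't loaded, and looking at the first-boot output, which may include the failed driver install (the log path is an assumption based on standard Ubuntu cloud images):

~$ nvidia-smi                                  # fails if the driver is not loaded
~$ sudo cat /var/log/cloud-init-output.log     # first-boot output; the "Installing drivers ..." step may log its errors here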

Any idea how to debug this? Shouldn’t it work out of the box?


I ran into the same issue; the first time I got the same error. I logged out, reconnected over SSH, and this time it installed the driver.
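
If it helps anyone else, a quick way to confirm the driver actually came up after reconnecting, before retrying the NGC container (assuming the same Ubuntu-based AMI):

~$ nvidia-smi                                                                # should now report the driver version and the V100 on a p3.2xlarge
~$ docker run --gpus all --rm nvcr.io/nvidia/pytorch:23.11-py3 nvidia-smi    # the container should see the GPU too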
