NVIDIA GPU-Optimized AMI is missing drivers and can't run the PyTorch NGC container

This AMI (AWS Marketplace: NVIDIA GPU-Optimized AMI, AMI ID "ami-041855406987a648b") is supposed to come preconfigured to run NVIDIA GPU Cloud (NGC) containers such as the PyTorch one.

However, on a freshly launched p3.2xlarge instance on AWS, the GPU driver setup fails.

After SSHing in, I see this error message:

Installing drivers ...
modprobe: FATAL: Module nvidia not found in directory /lib/modules/6.2.0-1011-aws
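
That modprobe failure suggests the NVIDIA kernel module was never built for the running 6.2.0-1011-aws kernel. A couple of quick checks (assuming the stock DKMS tooling on this Ubuntu-based AMI) should show whether it exists at all:

~$ lsmod | grep nvidia                         # lists the nvidia module only if it is currently loaded
~$ dkms status                                 # shows whether an nvidia module has been built for the running kernel
~$ ls /lib/modules/$(uname -r)/updates/dkms    # where DKMS normally installs the built module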

And sure enough, running containers such as PyTorch (PyTorch | NVIDIA NGC) does not work:

~$ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.11-py3
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
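
Before blaming Docker, it is probably worth confirming on the host that the driver really isn't loaded, and looking at the first-boot output, which may include the failed driver install (the log path is an assumption based on standard Ubuntu cloud images):

~$ nvidia-smi                                  # fails if the driver is not loaded
~$ sudo cat /var/log/cloud-init-output.log     # first-boot output; the "Installing drivers ..." step may log its errors here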

Any idea how to debug this? Shouldn’t it work out of the box?


I ran into the same issue; the first time I got the same error. I logged out, reconnected over SSH, and this time it installed the driver.
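
If it helps anyone else, a quick way to confirm the driver actually came up after reconnecting, before retrying the NGC container (assuming the same Ubuntu-based AMI):

~$ nvidia-smi                                                                # should now report the driver version and the V100 on a p3.2xlarge
~$ docker run --gpus all --rm nvcr.io/nvidia/pytorch:23.11-py3 nvidia-smi    # the container should see the GPU too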
