This AMI (AWS Marketplace: NVIDIA GPU-Optimized AMI, AMI ID “ami-041855406987a648b”) is supposed to come preconfigured to run NVIDIA GPU Cloud (NGC) containers such as the PyTorch one.
However, when I launch it on a p3.2xlarge instance, the NVIDIA driver fails to load.
After SSHing in, I see this error message:
Installing drivers ...
modprobe: FATAL: Module nvidia not found in directory /lib/modules/6.2.0-1011-aws
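
In case it helps, my assumption is that the AMI's first-boot step tried to build/load the driver for this kernel and never produced a module. I was planning to confirm that with something like the commands below (the cloud-init log path is just my guess at where the "Installing drivers ..." step writes its output):

# Kernel the instance actually booted into
uname -r
# Were any nvidia modules ever built for this kernel?
find /lib/modules/$(uname -r) -name 'nvidia*'
# If the driver is DKMS-managed, this should show whether the build failed
dkms status
# First-boot output (standard cloud-init location on Ubuntu), where the driver install should have logged
sudo cat /var/log/cloud-init-output.log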
And sure enough, running containers such as PyTorch (PyTorch | NVIDIA NGC) does not work:
~$ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.11-py3
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
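
If I read that error correctly, the container toolkit itself is doing its job and the failure is purely the missing host driver, so presumably even a plain check on the host fails before Docker is involved:

# Talks to the driver directly; with no nvidia module loaded I'd expect this to fail as well
nvidia-smi
lsmod | grep nvidia

I suppose I could try forcing a rebuild with sudo dkms autoinstall (assuming the driver is DKMS-managed), but I'd rather understand why the AMI's own setup failed in the first place.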
Any idea how to debug this? Shouldn’t it work out of the box?