Docker pull nvcr.io/nvidia/driver:550.54.15-amzn2 not found

Hi,
I installed the latest GPU Operator in a Kubernetes (EKS) cluster, but one of the pods fails with this message:

Normal   Pulling    36m (x4 over 37m)      kubelet            Pulling image "nvcr.io/nvidia/driver:550.54.15-amzn2"
  Warning  Failed     36m (x4 over 37m)      kubelet            Failed to pull image "nvcr.io/nvidia/driver:550.54.15-amzn2": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:550.54.15-amzn2": failed to resolve reference "nvcr.io/nvidia/driver:550.54.15-amzn2": nvcr.io/nvidia/driver:550.54.15-amzn2: not found

However, another image in the same pod pulls successfully:

Normal   Pulling    37m                    kubelet            Pulling image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.8"
  Normal   Pulled     37m                    kubelet            Successfully pulled image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.8" in 11.662s (11.663s including waiting)
  

I tested locally on my laptop and could indeed pull nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.8, but pulling the driver image failed:

% docker pull nvcr.io/nvidia/driver:550.54.15-amzn2
Error response from daemon: manifest for nvcr.io/nvidia/driver:550.54.15-amzn2 not found: manifest unknown: manifest unknown
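
One way to check which driver tags actually exist in the registry is to list them with skopeo (a sketch, assuming skopeo is installed and nvcr.io allows anonymous tag listing):

skopeo list-tags docker://nvcr.io/nvidia/driver | grep amzn2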

This happens when installing the gpu-operator from the Helm chart. I used Terraform for the installation:

resource "helm_release" "nvidia_gpu_operator" {
  name       = "gpu-operator"
  namespace  = "kube-system"
  repository = "https://helm.ngc.nvidia.com/nvidia"
  chart      = "gpu-operator"
  version    = "v24.3.0"
}
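
If only this particular tag is missing, the chart also exposes driver.version, so in principle a different driver branch could be pinned. With Helm directly that would look something like the below (the same value can go into a set block on the helm_release resource); the version here is just a placeholder, since I haven't confirmed which branches actually publish an -amzn2 image:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace kube-system \
  --version v24.3.0 \
  --set driver.version=<version-with-an-amzn2-tag>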

Cluster is AWS EKS 1.29. Node is g4dn.xlarge.

Should I try installing a previous version of the operator instead?

OK, so I assumed the latest EKS 1.29 AMI for g4dn already has the NVIDIA drivers installed, and changed the chart configuration to driver.enabled=false. That error is gone… but now a lot of containers report:

  Warning  FailedCreatePodSandBox  2m20s (x371 over 82m)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

I know this is only a warning, but the pods still aren't initialising:

nvidia-container-toolkit-daemonset-x75nc                      0/1     Init:0/1   0          86m
nvidia-dcgm-exporter-5np72                                    0/1     Init:0/1   0          86m
nvidia-device-plugin-daemonset-h4xk9                          0/1     Init:0/1   0          86m
nvidia-operator-validator-6ps8g                               0/1     Init:0/4   0          86m

Not sure what to do now.
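
The only things I can think of checking are whether containerd on the node actually has an "nvidia" runtime configured, and whether an "nvidia" RuntimeClass exists in the cluster (a sketch, assuming shell access to the node and the default containerd config path):

# from the cluster: is there an "nvidia" RuntimeClass?
kubectl get runtimeclass

# on the g4dn node: does containerd's config mention an nvidia runtime?
grep -n nvidia /etc/containerd/config.toml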

No, I don't think that's it… the main problem here is that I want the cluster to recognise the g4dn instance as having one NVIDIA GPU. Right now the node advertises CPU and RAM but not the nvidia.com/gpu resource, so GPU pods can't be scheduled there.
I thought installing the GPU Operator would solve this, but it won't start because it can't find the drivers, so I'm a bit stuck…
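
For reference, this is how I'm checking whether the GPU resource shows up on the node (it prints nothing while the device plugin hasn't registered the GPU; <node-name> is the g4dn node):

kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'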

After reading the docs, I don't think the NVIDIA operator is compatible with Amazon Linux 2, so I'm closing this.
