Docker pull not found

I installed the latest GPU operator in a Kubernetes (EKS) cluster, but one of the pods fails with this event:

Normal   Pulling    36m (x4 over 37m)      kubelet            Pulling image ""
  Warning  Failed     36m (x4 over 37m)      kubelet            Failed to pull image "": rpc error: code = NotFound desc = failed to pull and unpack image "": failed to resolve reference "": not found

However, the other image in the same pod downloads successfully:

Normal   Pulling    37m                    kubelet            Pulling image ""
  Normal   Pulled     37m                    kubelet            Successfully pulled image "" in 11.662s (11.663s including waiting)

I tested locally on my laptop: I managed to pull one image, but the other failed:

% docker pull
Error response from daemon: manifest for not found: manifest unknown: manifest unknown

This happens when installing the gpu-operator from the Helm chart. I used Terraform to install it:

resource "helm_release" "nvidia_gpu_operator" {
  name       = "gpu-operator"
  namespace  = "kube-system"
  repository = ""
  chart      = "gpu-operator"
  version    = "v24.3.0"
}

Cluster is AWS EKS 1.29. Node is g4dn.xlarge.

Should I try installing the previous operator instead?
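
For what it’s worth, “manifest unknown” from the registry usually means the chart is rendering an image:tag that doesn’t exist, so pinning the component’s version in the Helm values can work around it. A minimal sketch of how that could look in the Terraform above, assuming the failing image is the validator (the value key and tag here are illustrative and should be checked against the chart’s values.yaml for v24.3.0):

```hcl
resource "helm_release" "nvidia_gpu_operator" {
  name       = "gpu-operator"
  namespace  = "kube-system"
  repository = ""          # same (redacted) repository as above
  chart      = "gpu-operator"
  version    = "v24.3.0"

  # Hypothetical override: pin the validator image tag to one that
  # actually exists in the registry. The key name must match the
  # chart's values.yaml for this version.
  set {
    name  = "validator.version"
    value = "v24.3.0"
  }
}
```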

OK, so I assume the latest EKS 1.29 AMI for g4dn has the NVIDIA drivers installed. I changed the chart configuration to driver.enabled=false and that error is gone… but now a lot of containers say:

  Warning  FailedCreatePodSandBox  2m20s (x371 over 82m)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
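
In case it helps interpretation: this error comes from containerd, not from the operator itself. The operator’s pods use a RuntimeClass whose handler is named nvidia, and the container-toolkit daemonset is what normally writes that handler into containerd’s config; since that daemonset is stuck in Init, the handler never appears. A sketch of the containerd entry the toolkit would normally manage (paths are typical defaults and may differ on the EKS AMI):

```toml
# Fragment of /etc/containerd/config.toml that nvidia-container-toolkit
# normally generates; shown only to illustrate what "no runtime for
# nvidia is configured" refers to.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
```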

I know this is a warning, but the pods still never get past their init containers:

nvidia-container-toolkit-daemonset-x75nc                      0/1     Init:0/1   0          86m
nvidia-dcgm-exporter-5np72                                    0/1     Init:0/1   0          86m
nvidia-device-plugin-daemonset-h4xk9                          0/1     Init:0/1   0          86m
nvidia-operator-validator-6ps8g                               0/1     Init:0/4   0          86m

Not sure what to do now.

No, I don’t think that’s it… the main problem is that I want the cluster to recognise the g4dn instance as having one NVIDIA GPU. Right now the node advertises its CPU and RAM but not the GPU resource, so GPU pods can’t be scheduled there.
I thought installing the GPU operator would solve this, but it won’t start because it can’t find the drivers, so I’m a bit stuck…
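
Once the device plugin is healthy, the node should list nvidia.com/gpu under its allocatable resources, and a quick way to confirm scheduling works is a throwaway pod that requests the GPU. A minimal sketch (the pod name and CUDA image tag are just examples, not from this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test        # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.3.1-base-ubuntu22.04   # example tag; any CUDA base image works
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # only satisfiable once the device plugin advertises the GPU
```

If this pod stays Pending with “insufficient nvidia.com/gpu”, the device plugin still isn’t registering the card.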

After reading the docs, I don’t think the NVIDIA GPU operator is compatible with Amazon Linux 2, so I’m closing this.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.