Announcing containerd Support for the NVIDIA GPU Operator

Originally published at: Announcing containerd Support for the NVIDIA GPU Operator | NVIDIA Technical Blog

For many years, docker was the only container runtime supported by Kubernetes. Over time, support for other runtimes has not only become possible but often preferred, as standardization around a common container runtime interface (CRI) has solidified in the broader container ecosystem. Runtimes such as containerd and cri-o have grown in popularity as docker has…

The Cloud Native team at NVIDIA is excited to announce the new containerd support in the GPU Operator. If you have any questions or comments, please let us know!

Where would I go to see the resulting containerd config.toml after successfully running this?

Hi rcvanvo,

By default the containerd config is located at /etc/containerd/config.toml, and that is the one that gets updated if you don’t specify a different one.
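
For reference, after the operator’s container toolkit has configured containerd, the runtime entries in that file typically look roughly like the following. This is an illustrative sketch for containerd 1.4.x with the v2 CRI config format; the exact section names and the BinaryName path vary between versions and installs:

# excerpt of /etc/containerd/config.toml (illustrative)
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"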

Hey,
Will it work for Nvidia Jetson as well?

I would also love to know if it will work with Jetson Nanos. Setting up a GPU-enabled k3s cluster with Jetson Nanos is turning out to be quite the headache!

Hi @alon2 and @josiase

Here is a link to the page listing the supported platforms for the GPU Operator:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html

Unfortunately, Jetson is not supported at present, as noted on that page:

The GPU Operator only supports platforms using discrete GPUs - Jetson or other embedded products with integrated GPUs are not supported.


Thank you for the response @kklues. With this in mind, I don’t see a way to access GPUs from Jetson Nano worker nodes when running a k3s cluster :(

We are trying to install the GPU Operator in our containerd-based cluster.

The DaemonSet pod that installs the NVIDIA driver exits with the following error message:

Creating directory NVIDIA-Linux-x86_64-460.32.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.32.03..........................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that an NVIDIA kernel module matching this driver version is installed separately.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 4.15.0-136-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Could not unload NVIDIA driver kernel modules, driver is in use

What is causing this error? Do you need more information? We hope you can help us. :)

We are running Kubernetes v1.19.7, GPU Operator Helm chart v1.6.0, and containerd v1.4.3.

Please use https://github.com/NVIDIA/gpu-operator/issues to file any issues related to the GPU Operator.
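
In the meantime, the “driver is in use” message usually means something on the host is still holding the NVIDIA kernel modules (for example a pre-installed host driver with processes attached to the GPU). A rough way to check on the affected node (a sketch; exact module and process names will vary):

# Are the NVIDIA kernel modules loaded, and do they have active references?
lsmod | grep '^nvidia'

# Which processes still have the GPU device files open
# (e.g. nvidia-persistenced, Xorg, monitoring agents)?
sudo lsof /dev/nvidia* 2>/dev/null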

Thanks @kklues. I will try my luck there.

@kklues bummer… is there any way to use containerd on Jetson instead of docker?

Is there a way - even without the GPU operator - to make nvidia-runtime work with containerd instead of docker with k3s on a Jetson (ARM64)?

@klein.shaked
Please see my response here for some guidance:
https://github.com/containerd/containerd/issues/4834#issuecomment-786854732
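
For anyone finding this thread later: k3s bundles its own containerd, so the general approach is to give k3s a containerd config template that adds the NVIDIA runtime (k3s regenerates its config on every start). A rough sketch using the default k3s paths, assuming the NVIDIA container runtime is already installed on the Jetson:

# k3s rewrites config.toml on startup, so customizations go into config.toml.tmpl
cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml \
   /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

# Append an nvidia runtime entry to the template, for example:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#       BinaryName = "/usr/bin/nvidia-container-runtime"

# Then restart k3s (k3s-agent on worker nodes) to regenerate the config:
sudo systemctl restart k3s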

Hello,
is this the correct place to ask a technical question?

We have previously used the “nvidia-device-plugin”, which adds GPUs as a resource to Kubernetes. Our Kubernetes Jobs are scheduled based on the percentage of GPU required (just as it is for CPUs and memory).

With the Kubernetes upgrade beyond 1.20 and Docker being removed, I found that the preferred installation now uses the “GPU Operator” according to (1), which seems to work very well. However, I have not been able to get GPUs to show up as schedulable resources yet (2). As such, our Jobs currently cannot execute.

Is there a part I am missing? Can someone please point me to the instructions I need?

Thank you very much.

(1) https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html
(2) Excerpt from kubectl describe node:
Allocated resources:
  Resource           Requests    Limits
  --------           --------    ------
  cpu                100m (0%)   100m (0%)
  memory             50Mi (0%)   50Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0           ← missing from GPU-operator setup
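
For completeness, my understanding is that once the operator’s device plugin is healthy, the GPU should appear under the node’s Capacity/Allocatable and is requested by Jobs in whole-GPU units via resource limits. This is roughly what I am checking (a sketch; names are illustrative):

# Did the operator pods (driver, toolkit, device plugin) come up?
# The operands typically run in the gpu-operator-resources namespace for this chart version.
kubectl get pods -n gpu-operator-resources

# The device plugin should advertise the resource on each GPU node:
kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'

# A Job container then requests GPUs via resource limits, e.g.:
#   resources:
#     limits:
#       nvidia.com/gpu: 1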