Nvidia-container-runtime 1.13 (experimental branch) on k3s

Hi!
Currently, I am using three Jetson Nanos to set up a K3s + containerd cluster with GPU support. By default, they come with nvidia-container-runtime 1.7, which has a bug, so I upgraded to the experimental branch to get nvidia-container-runtime 1.13. That worked, but now the containers no longer pull the CUDA and cudart libraries from the host OS as described on the L4T Base webpage (given that JetPack 5.x uses 1.9). Therefore, my question is: is there any way to configure the runtime to pull the libraries from the OS, or do I need to build the images with the CUDA libraries?

I also tried bind-mounting the paths into my container, but then TF reserves only between 10 MB and 50 MB of memory, so my application fails with OOM.

Any guidance will be much appreciated!

Hi,

We need to check this with our internal team.
We will update you with more information later.

Thanks.

Firstly, it is not required to use the experimental branch. The stable repositories for installing the NVIDIA Container Toolkit packages can be configured by following the steps outlined in our documentation. For Tegra support in the device plugin, at least NVIDIA Container Toolkit v1.11.0 is required, as this automatically includes the mounts required to detect JetPack-based systems.

With regard to not including the CUDA libraries from the host: this was a design decision to enable portability of containers. Ideally, containers would package the runtime dependencies such as the CUDA Toolkit and Runtime library (the same holds for cuDNN and cuBLAS, for example).

There is an (undocumented) option to revert to the previous (<1.10.0) behaviour, but it should only be used as a last resort.

Note that, as implemented, the NVIDIA Container Runtime considers the files l4t.csv, drivers.csv, and devices.csv in /etc/nvidia-container-runtime/host-files-for-container.d by default. Would including the relevant libraries in drivers.csv be a reasonable workaround for you at the moment?
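
Before editing drivers.csv, it may help to see what the existing files already list. The following is a minimal sketch, not an official tool: it assumes each line in these files has the form `kind, path` (for example `lib`, `dev`, `sym`, or `dir` followed by a host path), and the helper names `parse_mount_spec` and `missing_on_host` are hypothetical. It prints every entry and flags paths that do not exist on the host.

```python
import os
from pathlib import Path

# Directory named in the post above; the runtime reads its CSV files by default.
CSV_DIR = Path("/etc/nvidia-container-runtime/host-files-for-container.d")


def parse_mount_spec(text):
    """Return (kind, path) pairs from the text of one CSV file.

    Assumes one "kind, path" entry per line. Blank lines, lines starting
    with "#", and lines without a comma are skipped defensively.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        kind, sep, path = line.partition(",")
        if not sep:
            continue
        entries.append((kind.strip(), path.strip()))
    return entries


def missing_on_host(entries):
    """Return the entries whose path is not present on this machine."""
    # lexists() also counts dangling symlinks as present.
    return [(kind, path) for kind, path in entries if not os.path.lexists(path)]


if __name__ == "__main__":
    if not CSV_DIR.is_dir():
        print(f"{CSV_DIR} not found on this machine")
    else:
        for csv_file in sorted(CSV_DIR.glob("*.csv")):
            entries = parse_mount_spec(csv_file.read_text())
            missing = set(missing_on_host(entries))
            print(f"{csv_file.name}: {len(entries)} entries")
            for kind, path in entries:
                flag = "  (missing on host)" if (kind, path) in missing else ""
                print(f"  {kind:4s} {path}{flag}")
```

Running it on the Jetson before and after adding library entries to drivers.csv shows whether the new lines are parsed and whether the listed paths actually exist.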

Hi @elezar!

Sorry for my late response but thank you for your comments! It worked as before!

May I ask why, when using bind mounts, TF reserves only a few MB, but through the runtime it reserves what it needs (sometimes ~200 MB to 1 GB)?

Again many thanks!

Best,
Isaac

Hi,

This depends on the amount of memory available at runtime.
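
Since the GPU on a Jetson shares physical memory with the CPU, one rough way to see what is available just before the application starts is to read `MemAvailable` from /proc/meminfo. This is only a sketch: the helper names are hypothetical, and whether this figure matches what TF ends up reserving is an assumption.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their integer values.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts without a unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_mb(path="/proc/meminfo"):
    """Return MemAvailable in MB, or 0.0 if the field is absent."""
    with open(path) as f:
        return parse_meminfo(f.read()).get("MemAvailable", 0) / 1024


if __name__ == "__main__":
    print(f"MemAvailable: {available_mb():.0f} MB")
```

Running it inside the container in both setups (bind mounts and the runtime) gives a baseline to compare against the amount TF reports reserving.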
Thanks.