I have been trying to compile libnvidia-container from source (with some success) on multiple platforms, including musl-based.
When I actually run nvidia-container-cli, it starts, but as soon as I invoke an action such as nvidia-container-cli list, I get an error loading libnvidia-ml.so.1:
$ ./nvidia-container-cli list
nvidia-container-cli: initialization error: load library failed: Error loading shared library libnvidia-ml.so.1: no such file or directory
Looking at the library on a normal JetPack install, I see that it lives at the following path, installed by the following deb package:
$ find /usr -name 'libnvidia-ml*'
$ dpkg -S /usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs/libnvidia-ml.so
- Is the source for that available anywhere? I want to get it installed on OSes that do not use rpm/yum/zypper, including musl-based ones (no glibc).
- Why is the library search coded into the app itself, rather than left to the normal dynamic linker and LD_LIBRARY_PATH (or similar) to find it?
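(For what it's worth, dlopen(3) does honor LD_LIBRARY_PATH, so as a stopgap I can point it at the stub directory from my find output above. A sketch, untested on musl:)

```shell
# Workaround sketch: dlopen(3) searches LD_LIBRARY_PATH, so exposing the
# directory holding libnvidia-ml.so should let nvidia-container-cli find it.
# The stub path is the JetPack 5 location from the find output above.
STUBS=/usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs
export LD_LIBRARY_PATH=$STUBS${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
```

Then re-run ./nvidia-container-cli list; whether the stub is a functional NVML at runtime is a separate question.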
It looks like the library is included in the CUDA package.
Here is the OTA download link for your reference: Index
Those are Debian packages, not plain binaries. It is always possible to extract them, but that gets messy, and they won't work on non-glibc-based systems anyway.
Is the source available?
Also, I find it curious: in order to run containers with GPU access, I need the CUDA libraries both outside the container (to start it) and inside the container (to run CUDA apps)?
@avi24 on JetPack 4, yes, you need CUDA/cuDNN/TensorRT installed on your device, and they will be mounted into the container by --runtime nvidia. You don't need the CUDA/cuDNN/TensorRT packages installed inside JetPack 4 containers; the containers should be derived from l4t-base.
On JetPack 5 (which you are presumably using, since your CUDA version is 11.4), the CUDA Toolkit etc. are installed inside the container and not mounted by --runtime nvidia. However, there are still some lower-level drivers that get mounted, which you can find in l4t.csv.
Hi @dusty_nv, thanks for hopping in to answer.
On JetPack 5 (which you are presumably using since your CUDA version is 11.4), the CUDA Toolkit etc. are installed inside the container and not mounted by --runtime nvidia
So the host and the container each have their own copy. I can always mount them myself (e.g. with docker run -v, or the equivalent for other container runtimes), but the NVIDIA Container Toolkit no longer takes ownership of mounting them in. Is that correct?
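By "mount them myself" I mean something like the following; the image tag and paths are illustrative guesses on my part, not tested:

```shell
# Sketch: manually bind-mount the host CUDA toolkit into a container rather
# than relying on the toolkit's CSV-driven mounts. Image tag and paths are
# placeholders.
CUDA_DIR=/usr/local/cuda-11.4
CMD="docker run --rm --runtime nvidia \
  -v $CUDA_DIR:$CUDA_DIR:ro \
  nvcr.io/nvidia/l4t-base:r35.1.0 ls $CUDA_DIR"
echo "$CMD"
# only attempt it where docker exists; harmless no-op elsewhere
if command -v docker >/dev/null 2>&1; then eval "$CMD" || true; fi
```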
But doesn’t the host still need them? Without them, we are missing the .so libraries needed for
That is correct - if you check l4t.csv, there are no files from /usr/local/cuda listed in there. There are, however, lower-level GPU drivers that get mounted from
Your issue with nvidia-container-cli and libnvidia-ml.so aside, you should only need the CUDA Toolkit on the device if you want to use it outside a container (like compiling code with NVCC, etc.). FWIW, I haven't used nvidia-container-cli and just stick with docker run --runtime nvidia.
That was exactly my issue: nvidia-container-cli doesn't link libnvidia-ml at build time, but actually searches for libnvidia-ml.so.1 at runtime. I found the search paths defined here and the library loaded here in the source. It uses dlopen; no idea why.