I have been trying to compile libnvidia-container from source (with some success) on multiple platforms, including musl-based.
When I actually run nvidia-container-cli, it starts, but as soon as I invoke an action such as nvidia-container-cli list, I get an error loading libnvidia-ml.so.1:
$ ./nvidia-container-cli list
nvidia-container-cli: initialization error: load library failed: Error loading shared library libnvidia-ml.so.1: no such file or directory
Looking at the library on a normal JetPack install, I see that it lives at the following path, installed by the following deb package:
$ find /usr -name 'libnvidia-ml*'
$ dpkg -S /usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs/libnvidia-ml.so
- Is the source for that available anywhere? I want to get it installed on OSes that do not use rpm/yum/zypper, including musl-based ones (no glibc).
- Why is the library search coded into the app itself, rather than left to the normal dynamic linker and LD_LIBRARY_PATH (or similar) to find it?
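(For what it's worth, dlopen(3) does honor LD_LIBRARY_PATH, so as a stopgap I can point it at the stub directory from my find output above. A sketch, untested on musl:)

```shell
# Workaround sketch: dlopen(3) searches LD_LIBRARY_PATH, so exposing the
# directory holding libnvidia-ml.so should let nvidia-container-cli find it.
# The stub path is the JetPack 5 location from the find output above.
STUBS=/usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs
export LD_LIBRARY_PATH=$STUBS${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
```

Then re-run ./nvidia-container-cli list; whether the stub is a functional NVML at runtime is a separate question.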
It looks like the library is included in the CUDA package.
Here is the OTA download link for your reference: Index
Those are Debian packages, not plain binaries. It is always possible to extract them, but that gets messy, and they won't work on non-glibc-based systems anyway.
Is the source available?
Also, I find it curious: in order to run containers with GPU access, I need the CUDA libraries both outside the container (to start it) and inside the container (to run CUDA apps)?
@avi24 on JetPack 4, yes, you need CUDA/cuDNN/TensorRT installed on your device, and they will be mounted into the container by --runtime nvidia. You don't need the CUDA/cuDNN/TensorRT packages installed inside JetPack 4 containers; the containers should be derived from l4t-base.
On JetPack 5 (which you are presumably using, since your CUDA version is 11.4), the CUDA Toolkit etc. are installed inside the container and not mounted by --runtime nvidia. However, there are still some lower-level drivers that get mounted, which you can find in l4t.csv.
Hi @dusty_nv, thanks for hopping in to answer.
On JetPack 5 (which you are presumably using since your CUDA version is 11.4), the CUDA Toolkit etc. are installed inside the container and not mounted by --runtime nvidia
So the host and the container each have their own copy. I can always mount them myself (e.g. with docker run -v, or the equivalent for other container runtimes), but the NVIDIA Container Toolkit no longer takes ownership of mounting them in. Is that correct?
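By "mount them myself" I mean something like the following; the image tag and paths are illustrative guesses on my part, not tested:

```shell
# Sketch: manually bind-mount the host CUDA toolkit into a container rather
# than relying on the toolkit's CSV-driven mounts. Image tag and paths are
# placeholders.
CUDA_DIR=/usr/local/cuda-11.4
CMD="docker run --rm --runtime nvidia \
  -v $CUDA_DIR:$CUDA_DIR:ro \
  nvcr.io/nvidia/l4t-base:r35.1.0 ls $CUDA_DIR"
echo "$CMD"
# only attempt it where docker exists; harmless no-op elsewhere
if command -v docker >/dev/null 2>&1; then eval "$CMD" || true; fi
```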
But doesn’t the host still need them? Without them, we are missing the .so libraries needed for
That is correct - if you check l4t.csv, there are no files from /usr/local/cuda listed in there. There are, however, lower-level GPU drivers that get mounted from
Your issue with nvidia-container-cli and libnvidia-ml.so aside, you should only need the CUDA Toolkit on the device if you want to use it outside a container (like compiling code with NVCC, etc.). FWIW, I haven't used nvidia-container-cli and just stick with docker run --runtime nvidia.
That was exactly my issue: nvidia-container-cli doesn't link libnvidia-ml at build time, but actually searches for libnvidia-ml.so.1 at runtime. I found the search paths defined here and the library loaded here in the source. It uses dlopen; no idea why.