Libnvidia-ml location and source?

I have been trying to compile libnvidia-container from source (with some success) on multiple platforms, including musl-based.

When I actually try to run nvidia-container-cli, it runs, but as soon as I try an action like nvidia-container-cli list, it fails to find libnvidia-ml.so.1:

$ ./nvidia-container-cli list
nvidia-container-cli: initialization error: load library failed: Error loading shared library libnvidia-ml.so.1: no such file or directory

Looking on a stock JetPack install, I see that the library is at the following path, installed by the following deb package:

$ find /usr -name 'libnvidia-ml*'
/usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs/libnvidia-ml.so
$ dpkg -S /usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs/libnvidia-ml.so
cuda-nvml-dev-11-4: /usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs/libnvidia-ml.so
  1. Is the source for that library available anywhere? I want to get it installed on OSes that do not use rpm/yum/zypper, including musl-based ones (no glibc).
  2. Why is the search coded into the app itself, rather than using the normal linker and LD_LIBRARY_PATH or similar to find it?

Thanks.

Hi,

It looks like the library is included in the CUDA package.
Here is the OTA download link for your reference: Index

Thanks.

Hi,

Those are Debian packages, not plain binary files. It is always possible to extract them, but that gets messy.
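To be concrete, something like this unpacks the stub without dpkg (the package filename is illustrative, and newer packages may ship data.tar.zst instead of data.tar.xz):

$ # download the .deb, then pull the NVML stub out of its data archive
$ ar x cuda-nvml-dev-11-4_*_arm64.deb data.tar.xz
$ tar -xJf data.tar.xz ./usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs/libnvidia-ml.so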

They also won’t work on non-glibc-based systems.

Is the source available?

Also, I find it interesting: in order to run containers with access to GPUs, I need the CUDA libraries both outside the container (to start it) and inside the container (to run CUDA apps)?

@avi24 on JetPack 4, yes, you need CUDA/cuDNN/TensorRT installed on your device, and they will be mounted into the container by --runtime nvidia. You don’t need the CUDA/cuDNN/TensorRT packages installed inside JetPack 4 containers. The containers should be derived from l4t-base.
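For example, a typical JetPack 4 invocation looks something like this (the image tag is just illustrative):

$ # JetPack 4: host CUDA/cuDNN/TensorRT get mounted into the container at start
$ sudo docker run -it --rm --runtime nvidia nvcr.io/nvidia/l4t-base:r32.7.1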

On JetPack 5 (which you are presumably using since your CUDA version is 11.4), the CUDA Toolkit etc. are installed inside the container and not mounted by --runtime nvidia. However, there are still some drivers that get mounted, which you can find in /etc/nvidia-container-runtime/host-files-for-container.d/l4t.csv

Hi @dusty_nv ; thanks for hopping in to answer.

On JetPack 5 (which you are presumably using since your CUDA version is 11.4)

Correct.

the CUDA Toolkit etc. are installed inside the container and not mounted by --runtime nvidia

So the host and the container each have their own copies. I can always mount them myself (e.g. with docker run -v, or the equivalent for other container runtimes), but the NVIDIA Container Toolkit no longer takes ownership of mounting them in. Is that correct?
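For example, something along these lines (host path and image tag are just illustrative):

$ # manually bind-mount the host CUDA toolkit into the container
$ docker run -it --rm --runtime nvidia \
      -v /usr/local/cuda-11.4:/usr/local/cuda-11.4:ro \
      nvcr.io/nvidia/l4t-base:r35.1.0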

But doesn’t the host still need them? Without them, aren’t we missing the .so libraries needed by libnvidia-container, etc.?

That is correct - if you check l4t.csv, there are no files from /usr/local/cuda listed in there. There are however lower-level GPU drivers that get mounted from /usr/lib/aarch64-linux-gnu/tegra/
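One way to see this for yourself (assuming the default CSV location):

$ # nothing under /usr/local/cuda is listed...
$ grep -c '/usr/local/cuda' /etc/nvidia-container-runtime/host-files-for-container.d/l4t.csv
0
$ # ...but the Tegra driver libraries are
$ grep 'aarch64-linux-gnu/tegra' /etc/nvidia-container-runtime/host-files-for-container.d/l4t.csv | head -3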

Your issue with nvidia-container-cli and libnvidia-ml.so aside, you should only need the CUDA Toolkit on the device if you need to use it outside a container (like for compiling code with NVCC, etc.). FWIW, I haven’t used nvidia-container-cli and just stick with docker run --runtime nvidia

That was my issue: nvidia-container-cli doesn’t have it linked in, but actually searches for libnvidia-ml.so.1 at runtime. I found it in the source, defined here and loaded here, via dlopen; no idea why.

I just reread what you wrote 2 weeks ago @dusty_nv:

FWIW, I haven’t used nvidia-container-cli and just stick with docker run --runtime nvidia

I need to refresh my memory, but doesn’t docker run --runtime nvidia actually execute the thin runc wrapper, which calls libnvidia-container, which then has the dependency on those libraries anyway? See the architecture doc here.
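(For context, --runtime nvidia maps to that wrapper through Docker’s runtime config; a typical /etc/docker/daemon.json on Jetson looks roughly like this:)

$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}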

Or is it possible that libnvidia-container does not actually depend on it, and only nvidia-container-cli does, so maybe I can bypass it? I will be working via containerd, not docker, but the idea is the same.

I guess we can try and see what it depends upon.
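A quick check (binary paths assume the default install locations):

$ # what the CLI links against at load time
$ ldd /usr/bin/nvidia-container-cli
$ # and the runc wrapper that --runtime nvidia invokes
$ ldd /usr/bin/nvidia-container-runtime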

Aha! You are right, it does work! I am not sure why nvidia-container-cli has that runtime dependency, but I am able to bypass it; libnvidia-container and /usr/bin/nvidia-container-runtime (the runc wrapper) do not need it.

Still much to figure out with CDI, but getting there.
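In case it helps anyone else landing here, the CDI spec is generated on the host with nvidia-ctk (the output path is whatever your runtime is configured to read; /etc/cdi/nvidia.yaml is just a common default):

$ # generate the CDI spec describing the GPU devices/mounts, then confirm it
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ nvidia-ctk cdi list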

@dusty_nv wrote:

the CUDA Toolkit etc. are installed inside the container and not mounted by --runtime nvidia. However, there are still some drivers that get mounted, which you can find in /etc/nvidia-container-runtime/host-files-for-container.d/l4t.csv

I see a lot of device and firmware entries; those all make sense.

What about the libraries, almost all under /usr/lib/aarch64-linux-gnu/? Why would those be mounted in? Shouldn’t those be part of the container filesystem?

Those are typically lower-level drivers that are tied to the JetPack/L4T version, so they aren’t installed into the containers themselves, since the aim is to have JetPack 5 containers be more portable across JetPack versions.

Does that mean that I’m getting those because I’m running an older version of JetPack? Or of the container (it cannot be that, since the CDI YAML is created before any container is run)? Or of the nvidia-ctk executable?

Sorry Avi, I don’t personally have experience using other docker runtimes, etc., and typically stick to the default --runtime nvidia to maintain compatibility. For more in-depth knowledge about the NVIDIA container runtime, you might want to file an issue against the libnvidia-container GitHub repo. Happy to help otherwise, though.

Yeah, maybe I will. Thanks @dusty_nv

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.