Running CUDA in LXD container: nvidia-smi doesn't show running processes

I recently got CUDA running in an LXD container. Everything seems to work fine, but one thing is bugging me: when I run nvidia-smi inside the LXD container, the process list is empty, even though GPU utilization is reported correctly.

I would like to know where nvidia-smi gets the list of running processes from: the driver or the device nodes? By the way, nvidia-smi does show the running processes when I run it on the host machine. Can someone give me some pointers on how to get this fixed?

Thanks in advance!

I think I’ve just figured out the reason, but I don’t think there is an easy solution.

Running strace nvidia-smi, I can see that it gets the PIDs of the currently running processes and then reads the file /proc/PID/cmdline for each one.
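To make the lookup seen in strace concrete, here is a minimal Python sketch of reading /proc/PID/cmdline. This is not nvidia-smi's actual code, only an illustration of the same file access. The function name read_cmdline and the proc_root parameter are hypothetical; proc_root exists only so the function can be pointed at another directory.

```python
import os

def read_cmdline(pid, proc_root="/proc"):
    """Return the argument list of `pid`, or None if its cmdline file is unreadable.

    `proc_root` is a hypothetical parameter for illustration and testing.
    """
    path = os.path.join(proc_root, str(pid), "cmdline")
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except (FileNotFoundError, ProcessLookupError, PermissionError):
        # No such PID in this /proc (or no access to it)
        return None
    # Arguments are NUL-separated; drop empty fields such as the one
    # left behind by the trailing NUL.
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]
```

Called with a PID that has no entry under /proc, the function returns None, which is the situation a tool ends up in when it is handed a PID from a different PID namespace.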

The problem is that processes in the container have different PIDs from the same processes as seen on the host. nvidia-smi gets the host PIDs, then tries to read /proc/PID/cmdline, which doesn’t exist in the container, so it doesn’t report the processes.

So the problem is that we have no idea which PIDs on the host correspond to which PIDs in the container. Even if we knew the mapping, there would have to be a program that monitors the processes in real time and adds/removes symlinks, while risking clashes with existing PIDs in the container.
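On the mapping part only: on Linux 4.1 and later, /proc/PID/status read from the host contains an NSpid line. It lists the PID in each nested PID namespace, outermost first. The sketch below parses that line. The function names and the sample PIDs are made up for illustration, and this does nothing to make nvidia-smi itself translate PIDs.

```python
def parse_nspid(status_text):
    """Return the PIDs on the NSpid line, outermost namespace first.

    Returns an empty list if the line is absent (kernels before 4.1).
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []

def innermost_pid(status_text):
    """Return the PID inside the innermost namespace, or None if not nested."""
    pids = parse_nspid(status_text)
    # A single entry means the process is not in a nested PID namespace
    # relative to the /proc that was read.
    return pids[-1] if len(pids) > 1 else None

# Hypothetical excerpt of /proc/23456/status as read on the host
sample = "Name:\tpython\nPid:\t23456\nNSpid:\t23456\t17\n"
print(parse_nspid(sample), innermost_pid(sample))
```

In this made-up sample, host PID 23456 is PID 17 inside the container. Reading the status files of container processes from the host typically needs root.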

Anyone got any ideas?

I may be misunderstanding the situation, but from your description this sounds to me like a flaw in the containerization technology used: Isn’t the whole point of containerization a sort of para-virtualization that provides isolation and abstraction, while making apps running in the container believe they are running on the bare operating system? You may want to discuss this issue with the containerization software vendor.

NVIDIA has preconfigured Docker containers that may be of interest:

https://github.com/NVIDIA/nvidia-docker/wiki/Using-nvidia-docker

I don’t know the details of LXD, but with Docker you get similar behavior by default.
That’s because Docker uses a PID namespace. Linux provides several kinds of namespaces:
http://lwn.net/Articles/531114/
http://man7.org/linux/man-pages/man7/pid_namespaces.7.html
https://blog.yadutaf.fr/2014/01/05/introduction-to-linux-namespaces-part-3-pid/
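To check whether two processes share a PID namespace, one option is to compare the /proc/PID/ns/pid symlinks, whose targets look like pid:[4026531836]. The Python sketch below does that. The function names and the proc_root parameter are hypothetical, and reading another process's link may require root.

```python
import os
import re

def pid_namespace_id(pid="self", proc_root="/proc"):
    """Return the inode number that identifies the PID namespace of `pid`."""
    target = os.readlink(os.path.join(proc_root, str(pid), "ns", "pid"))
    match = re.fullmatch(r"pid:\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return int(match.group(1))

def same_pid_namespace(pid_a, pid_b, proc_root="/proc"):
    """True if both processes are in the same PID namespace."""
    return pid_namespace_id(pid_a, proc_root) == pid_namespace_id(pid_b, proc_root)
```

Run on the host, comparing PID 1 with a process started inside a container should return False when the container has its own PID namespace, which is the default described above.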

In Docker the fix is simple; pass “--pid=host”:
$ nvidia-docker run -ti --pid=host nvidia/cuda nvidia-smi