Unable to run an NVIDIA Docker container on AGX Xavier

I have been trying to do something very simple:

docker run --runtime=nvidia --rm nvidia/cuda

However, I got this error:

docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/8cb963c23bee566216d2d890e60f62ae497be2857ef31e519ebd31e43e91a865/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown.

So I tried to do sudo apt install nvidia-container-runtime

but I got

E: Unable to locate package nvidia-container-runtime

So I followed the advice on this page and ran:

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update

With this I could run sudo apt install nvidia-container-runtime.
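
(As a quick sanity check at this point, just as a sketch, something like the following should confirm the runtime binary is on the PATH and registered with Docker:)

which nvidia-container-runtime
docker info | grep -i runtime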

Then I tried to run the Docker container from the start of this question:
docker run --runtime=nvidia --rm nvidia/cuda

and now I got a completely different error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.

I don't know how to proceed from here to be able to run the container. Any help will be greatly appreciated.

^^ The above is for the x86_64 architecture;
for the Xavier AGX you may want to use the L4T containers from ngc.nvidia.com:

nvcr.io/nvidia/l4t-base:r32.4.2
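
For example, something along these lines (just a sketch; pick the tag that matches your JetPack/L4T release):

sudo docker run -it --rm --runtime nvidia nvcr.io/nvidia/l4t-base:r32.4.2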

Sorry to jump in here. I just started with the AGX Xavier and am having an issue with Docker.

thuy@worker03-xavieragx:~$ sudo docker run --runtime nvidia --network host -it -e DISPLAY=$DISPLAY -v /tmp/.X11-unix/:/tmp/.X11-unix nvcr.io/nvidia/l4t-base:r32.3.1
Unable to find image 'nvcr.io/nvidia/l4t-base:r32.3.1' locally
r32.3.1: Pulling from nvidia/l4t-base
8aaa03d29a6e: Pull complete
e73d3a974854: Pull complete
2c14cdba18f5: Pull complete
23dd63c7659b: Pull complete
3bd414bd9504: Pull complete
cafd526eb263: Pull complete
483b0873e636: Pull complete
2568c5428ff2: Pull complete
6bcd9356d42f: Pull complete
c7f6d0180a4e: Pull complete
beddc9b83fb0: Pull complete
656f2307c79e: Pull complete
fe2e73a571b7: Pull complete
f5decba41c07: Pull complete
f0b6e413c48c: Pull complete
Digest: sha256:e8987d52ddb9496948e02656fc62d46561abce25bfe83203f4bc24c67e094578
Status: Downloaded newer image for nvcr.io/nvidia/l4t-base:r32.3.1
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
ERRO[0072] error waiting for container: context canceled

thuy@worker03-xavieragx:~$ nvidia-container-cli list
nvidia-container-cli: initialization error: driver error: failed to process request

Not sure why I'm getting the driver error, as everything was installed through JetPack 4.4, which should include all the necessary NVIDIA drivers.

Can you give me some pointers here or let me know if I should open a new thread for this?

Thanks,
Thuy


Does the issue persist if you run the Docker container without the runtime argument?
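
For example, something like this (just a sketch; use whichever tag you already pulled):

sudo docker run -it --rm nvcr.io/nvidia/l4t-base:r32.3.1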

No, there is no issue with Docker itself. After switching to the second Xavier, it is working fine (so I can remove the current drivers and reinstall them later). I'm now having an issue with this Xavier when it joins an existing Kubernetes cluster: the nvidia-device-plugin-daemonset does not work. I just want to expose the GPU to the cluster.

I used this command for the plugin on the master node:
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.6.0/nvidia-device-plugin.yml
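
(To check whether the plugin actually exposed the GPU, a sketch like the following should show an nvidia.com/gpu resource on the node; <node-name> is just a placeholder for the Xavier node:)

kubectl describe node <node-name> | grep -i "nvidia.com/gpu"   # <node-name> is a placeholder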

thuy@thuy-xavier-02:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/arm64"}

thuy@thuy-xavier-02:~$ docker version
Client:
Version: 19.03.6
API version: 1.40
Go version: go1.12.17
Git commit: 369ce74a3c
Built: Fri Feb 28 23:47:53 2020
OS/Arch: linux/arm64
Experimental: false

thuy@thuy-xavier-02:~$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
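
(A commonly suggested tweak for Kubernetes on Jetson, sketched under the assumption that the device-plugin pods do not pass --runtime themselves, is to set "default-runtime": "nvidia" in this file and restart Docker:)

sudo tee /etc/docker/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker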

thuy@thuy-xavier-02:~$ nvidia-docker version
NVIDIA Docker: 2.0.3
Client:
Version: 19.03.6
API version: 1.40
Go version: go1.12.17
Git commit: 369ce74a3c
Built: Fri Feb 28 23:47:53 2020
OS/Arch: linux/arm64
Experimental: false

Sorry for switching the topic,

I figured out that the issue is that the nvidia-device-plugin in Kubernetes requires nvidia-smi, but I don't have nvidia-smi, only tegrastats.

I don't know if I should try to install nvidia-smi on the Xavier AGX board, or try to figure out how to make the nvidia-device-plugin work with tegrastats.

If you have any experience with this, it’d be great to know.

Thanks,
Thuy