Nvidia docker runtime does not seem to work with docker compose

Hi everyone,

I have a custom Docker container based on NVIDIA’s CUDA container, and a docker-compose file that runs this same container.

This setup has been working for a while, but it recently stopped working. I can no longer run nvidia-smi inside the container, and Vulkan now fails with the following error message:

Error: ExtensionRestrictionNotMet(ExtensionRestrictionError { extension: "khr_display", restriction: NotSupported })

(though this might not be related to the driver issue).

Here is a snippet of the relevant docker-compose file:

    runtime: nvidia
    ...
    environment:
      - NVIDIA_DISABLE_REQUIRE=1
      - TZ=Europe/Lisbon
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all

As I mentioned earlier, this used to work, and I do have the nvidia-container-toolkit installed.
It looks like there is some kind of issue related to GPU communication with the container.

Something that might also be interesting is that nvidia-smi has the correct output if I use docker run instead of docker compose up.
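
For reference, the kind of docker run invocation that still shows the GPU correctly looks roughly like this (the image tag is just an example, not my exact image):

# sketch of a docker run that works, using the envvar-only approach
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi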

If you have any idea what could be causing this issue, I would greatly appreciate your feedback.

Thanks in advance,
Francisco


I have figured out the issue: for some reason, updating Docker broke the container’s ability to talk to the GPU properly. Here are the broken Docker package versions:

  • docker-buildx-plugin/jammy 0.17.1-1~ubuntu.22.04~jammy amd64
  • docker-ce-cli/jammy 5:27.3.1-1~ubuntu.22.04~jammy amd64
  • docker-ce-rootless-extras/jammy 5:27.3.1-1~ubuntu.22.04~jammy amd64
  • docker-ce/jammy 5:27.3.1-1~ubuntu.22.04~jammy amd64
  • docker-compose-plugin/jammy 2.29.7-1~ubuntu.22.04~jammy amd64

These are the package versions I am using now and they seem to work:

  • docker-buildx-plugin/0.16.2-1~ubuntu.22.04~jammy
  • docker-ce-cli/5:27.1.2-1~ubuntu.22.04~jammy
  • docker-ce-rootless-extras/5:27.1.2-1~ubuntu.22.04~jammy
  • docker-ce/upgradable 5:27.1.2-1~ubuntu.22.04~jammy
  • docker-compose-plugin/2.29.1-1~ubuntu.22.04~jammy
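
In case anyone wants to compare against their own machine, listing the installed Docker packages is straightforward (a sketch; the exact output format depends on your apt version):

# list installed docker packages and their versions
apt list --installed 2>/dev/null | grep -E '^docker'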

We have the same issue, and @francisco.torrinha’s fix worked for us. Here’s the relevant chunk of our compose:

  runtime: nvidia
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]

And the NVIDIA-related portion of an .env we pass:

NVIDIA_VISIBLE_DEVICES: all
NVIDIA_DRIVER_CAPABILITIES: all

Hi Francisco,

I just ran into the same exact issue after flashing an Orin-NX and installing CUDA and other parts of the Jetson SDK. How did you safely downgrade those relevant docker packages without breaking the rest of the Nvidia and CUDA dependencies?

I also found that if I ran docker run --runtime nvidia ... then things would work as expected, so clearly something broke in this new version of docker compose.

Hi Noah,
The easiest and safest way we found to do this was to uninstall the broken Docker version and install the working versions with apt, like so:

apt install docker-buildx-plugin=0.16.2-1~ubuntu.22.04~jammy docker-ce=5:27.1.2-1~ubuntu.22.04~jammy docker-compose-plugin=2.29.1-1~ubuntu.22.04~jammy

Keep in mind this works for Ubuntu 22.04; if you are using any other Ubuntu version the package versions will be slightly different, but they shouldn’t be too hard to figure out.
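
For other Ubuntu releases, something like apt-cache madison shows which versions are available in your configured repositories, so you can pick the matching older ones:

# show the candidate versions available for each docker package
apt-cache madison docker-ce docker-ce-cli docker-buildx-plugin docker-compose-plugin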


This was our process (mind you, just on an x86_64 Ubuntu 22.04 machine, not a Jetson):

Create this file as /etc/apt/preferences.d/docker-pinned:

Package: docker-buildx-plugin
Pin: version 0.16.2-1~ubuntu.22.04~jammy
Pin-Priority: 1000

Package: docker-ce-cli docker-ce-rootless-extras docker-ce
Pin: version 5:27.1.2-1~ubuntu.22.04~jammy
Pin-Priority: 1000

Package: docker-compose-plugin
Pin: version 2.29.1-1~ubuntu.22.04~jammy
Pin-Priority: 1000

Then running apt update/upgrade as normal should perform the downgrade and keep those packages pinned. Again, no idea if this will work on a Jetson.
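
Applying and verifying the downgrade should look roughly like this (a sketch; --allow-downgrades is only needed if apt refuses to go backwards on its own):

sudo apt-get update
# with the pin in place, apt should select the older versions again
sudo apt-get install --allow-downgrades \
  docker-ce docker-ce-cli docker-ce-rootless-extras \
  docker-buildx-plugin docker-compose-plugin
# confirm which version is installed and which one the pin selects
apt-cache policy docker-ce docker-compose-plugin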


@francisco.torrinha could you provide the versions of the *nvidia-container* components that you have installed on your system?

Note that for an envvar-only approach to work, the NVIDIA Container Runtime must be set as the default runtime in Docker. The runtime reads these envvars and ensures that the correct modifications are made to the container being started. It could be that the interaction with the runtime was changed in one of the compose updates.
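
For reference, setting the default runtime typically looks something like this in /etc/docker/daemon.json (a minimal sketch; this overwrites an existing daemon.json, so merge by hand if you already have other settings):

# write a minimal daemon.json with nvidia as the default runtime
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
# restart the daemon so the change takes effect
sudo systemctl restart docker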

Note that the resources.reservations.devices approach that @cclaunch calls out would trigger the injection using the nvidia-container-runtime-hook directly. This works for most use cases, but may have problems with Vulkan applications specifically, since we have implemented recent enhancements there in the nvidia-container-runtime instead. It is also not currently applicable for using iGPUs on Tegra-based systems.

I currently do not have access to the machine I am having issues with, but I am fairly certain I was using version 1.16.1-1 of the nvidia-container-runtime.

I am specifying the default runtime as the nvidia one in my daemon.json file.
This has always worked for us in the past, until we updated docker compose, so my best guess would be that, yes, some change in docker compose is probably causing this weird behaviour.

The *nvidia-container* packages we have (which we did not pin) are: 1.16.2-1 (again on amd64 / Ubuntu 24.04)

edit: correction, this was 24.04, but I think it’s the same on our 22.04 machines as well; I can check that for sure later

edit2: confirmed it’s the same package version on our 20.04 and 22.04 machines as well
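
For anyone else reporting theirs, something like this lists the installed *nvidia-container* packages (a sketch; the exact package names can differ slightly between setups):

# list the installed nvidia container toolkit / libnvidia-container packages
dpkg -l | grep -E 'nvidia-container|libnvidia-container'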

I just wanted to chime in and say that I’m seeing the same issue on a Debian 11 (bullseye) system. The good and bad docker package versions listed by @francisco.torrinha were the same ones on my system and downgrading solves the issue for now. Also running with nvidia-container-toolkit 1.16.2.

So is this an issue with the toolkit or with docker or both?

Found this issue in the docker-compose GitHub repo:

Seems like a fix may be coming in the 2.30.x release.

edit: adding the “count” field with some valid value in the docker-compose.yml also works around the issue with the current 2.29.7 release.

  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]
            count: 1

Confirmed the count: workaround works for us.
