Nvidia docker runtime does not seem to work with docker compose

Hi everyone,

I have a custom Docker container based on NVIDIA’s CUDA container, and a docker-compose file that runs this same container.

This setup has been working for a while, but it recently stopped working. I can no longer run nvidia-smi inside the container, and Vulkan now fails with the following error message:

Error: ExtensionRestrictionNotMet(ExtensionRestrictionError { extension: "khr_display", restriction: NotSupported })

(though this might not be related to the driver issue).

Here is a snippet of the relevant docker-compose file:

    runtime: nvidia
    ...
    environment:
      - NVIDIA_DISABLE_REQUIRE=1
      - TZ=Europe/Lisbon
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all

As I mentioned earlier, this used to work, and I do have the nvidia-container-toolkit installed.
It looks like there is some kind of issue related to GPU communication with the container.

Something that might also be interesting is that nvidia-smi has the correct output if I use docker run instead of docker compose up.
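
For reference, the kind of docker run invocation that still shows the GPU correctly looks roughly like this (the image tag is just an example, not my exact image):

# sketch of a docker run that works, using the envvar-only approach
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi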

If you have any idea what could be causing this issue, I would greatly appreciate your feedback.

Thanks in advance,
Francisco


I have figured out the issue: for some reason, updating Docker broke the container’s ability to talk to the GPU properly. Here are the broken Docker package versions:

  • docker-buildx-plugin/jammy 0.17.1-1~ubuntu.22.04~jammy amd64
  • docker-ce-cli/jammy 5:27.3.1-1~ubuntu.22.04~jammy amd64
  • docker-ce-rootless-extras/jammy 5:27.3.1-1~ubuntu.22.04~jammy amd64
  • docker-ce/jammy 5:27.3.1-1~ubuntu.22.04~jammy amd64
  • docker-compose-plugin/jammy 2.29.7-1~ubuntu.22.04~jammy amd64

These are the package versions I am using now and they seem to work:

  • docker-buildx-plugin/0.16.2-1~ubuntu.22.04~jammy
  • docker-ce-cli/5:27.1.2-1~ubuntu.22.04~jammy
  • docker-ce-rootless-extras/5:27.1.2-1~ubuntu.22.04~jammy
  • docker-ce/upgradable 5:27.1.2-1~ubuntu.22.04~jammy
  • docker-compose-plugin/2.29.1-1~ubuntu.22.04~jammy
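
In case anyone wants to compare against their own machine, listing the installed Docker packages is straightforward (a sketch; the exact output format depends on your apt version):

# list installed docker packages and their versions
apt list --installed 2>/dev/null | grep -E '^docker'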

We have the same issue, and @francisco.torrinha’s fix worked for us. Here’s the relevant chunk of our compose:

  runtime: nvidia
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]

And the NVIDIA-related portion of an .env we pass:

NVIDIA_VISIBLE_DEVICES: all
NVIDIA_DRIVER_CAPABILITIES: all

Hi Francisco,

I just ran into the same exact issue after flashing an Orin-NX and installing CUDA and other parts of the Jetson SDK. How did you safely downgrade those relevant docker packages without breaking the rest of the Nvidia and CUDA dependencies?

I also found that if I ran docker run --runtime nvidia ... then things would work as expected, so clearly something broke in this new version of docker compose.

Hi Noah,
The easiest and safest way we found to do this was to uninstall the broken Docker version and install the working versions with apt, like so:

apt install docker-buildx-plugin=0.16.2-1~ubuntu.22.04~jammy docker-ce=5:27.1.2-1~ubuntu.22.04~jammy docker-compose-plugin=2.29.1-1~ubuntu.22.04~jammy

Keep in mind this works for Ubuntu 22.04; if you are using any other Ubuntu version the package versions will be slightly different, but they shouldn’t be too hard to figure out.
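
For other Ubuntu releases, something like apt-cache madison shows which versions are available in your configured repositories, so you can pick the matching older ones:

# show the candidate versions available for each docker package
apt-cache madison docker-ce docker-ce-cli docker-buildx-plugin docker-compose-plugin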


This was our process (mind you, just on an x86_64 Ubuntu 22.04 machine, not a Jetson):

Create this file as /etc/apt/preferences.d/docker-pinned:

Package: docker-buildx-plugin
Pin: version 0.16.2-1~ubuntu.22.04~jammy
Pin-Priority: 1000

Package: docker-ce-cli docker-ce-rootless-extras docker-ce
Pin: version 5:27.1.2-1~ubuntu.22.04~jammy
Pin-Priority: 1000

Package: docker-compose-plugin
Pin: version 2.29.1-1~ubuntu.22.04~jammy
Pin-Priority: 1000

Then running apt update/upgrade as normal should perform the downgrade and keep those packages pinned. Again, no idea if this will work on a Jetson.
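
Applying and verifying the downgrade should look roughly like this (a sketch; --allow-downgrades is only needed if apt refuses to go backwards on its own):

sudo apt-get update
# with the pin in place, apt should select the older versions again
sudo apt-get install --allow-downgrades \
  docker-ce docker-ce-cli docker-ce-rootless-extras \
  docker-buildx-plugin docker-compose-plugin
# confirm which version is installed and which one the pin selects
apt-cache policy docker-ce docker-compose-plugin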


@francisco.torrinha could you provide the versions of the *nvidia-container* components that you have installed on your system?

Note that for an envvar-only approach to work, the NVIDIA Container Runtime must be set as the default runtime in Docker. The runtime reads these envvars and ensures that the correct modifications are made to the container being started. It could be that the interaction with the runtime was changed in one of the compose updates.
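
For reference, setting the default runtime typically looks something like this in /etc/docker/daemon.json (a minimal sketch; this overwrites an existing daemon.json, so merge by hand if you already have other settings):

# write a minimal daemon.json with nvidia as the default runtime
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
# restart the daemon so the change takes effect
sudo systemctl restart docker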

Note that the resources.reservations.devices approach that @cclaunch calls out would trigger the injection using the nvidia-container-runtime-hook directly. This works for most use cases, but may have problems with Vulkan applications specifically, since we have implemented recent enhancements there in the nvidia-container-runtime instead. It is also not currently applicable for using iGPUs on Tegra-based systems.

I currently do not have access to the machine I am having issues with, but I am fairly certain I was using version 1.16.1-1 of the nvidia-container-runtime.

I am specifying the default runtime as the nvidia one in my daemon.json file.
This has always worked for us in the past, until we updated docker compose, so my best guess would be that, yes, some change in docker compose is probably causing this weird behaviour.

The *nvidia-container* packages we have (which we did not pin) are: 1.16.2-1 (again on amd64 / Ubuntu 24.04)

edit: correction, this was 24.04, but I think it’s the same on our 22.04 machines as well; I can check that for sure later

edit2: confirmed it’s the same package version on our 20.04 and 22.04 machines as well
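
For anyone else reporting theirs, something like this lists the installed *nvidia-container* packages (a sketch; the exact package names can differ slightly between setups):

# list the installed nvidia container toolkit / libnvidia-container packages
dpkg -l | grep -E 'nvidia-container|libnvidia-container'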

I just wanted to chime in and say that I’m seeing the same issue on a Debian 11 (bullseye) system. The good and bad docker package versions listed by @francisco.torrinha were the same ones on my system and downgrading solves the issue for now. Also running with nvidia-container-toolkit 1.16.2.

So is this an issue with the toolkit or with docker or both?

Found this issue in the docker-compose GitHub repo:

Seems like a fix may be coming in the 2.30.x release.

edit: adding the “count” field with some valid value in the docker-compose.yml also works around the issue with the current 2.29.7 release.

  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]
            count: 1

Confirmed the count: workaround works for us.
