I have a custom Docker image based on NVIDIA's CUDA image, and a Docker Compose file that runs this container.
This setup had been working for a while, but it has now stopped. I can no longer run nvidia-smi inside the container, and Vulkan has also stopped working, with the following error message:
As I referenced earlier, this used to work, and I do have the NVIDIA Container Toolkit installed.
It looks like there is some kind of issue with GPU access from inside the container.
Something that might also be interesting: nvidia-smi produces the correct output if I start the container with docker run instead of docker compose up.
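For context, the docker run check is nothing special, just something along these lines (the image tag here is only an example, not the exact image we use):

```bash
# Works: the GPU is visible when the container is started with docker run
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```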
If you have any idea what could be causing this issue, I would greatly appreciate your feedback.
I have figured out the issue: updating Docker broke the container's ability to talk to the GPU. Here is the broken Docker version:
I just ran into the exact same issue after flashing an Orin NX and installing CUDA and other parts of the Jetson SDK. How did you safely downgrade the relevant Docker packages without breaking the rest of the NVIDIA and CUDA dependencies?
I also found that if I ran docker run --runtime nvidia ... then things would work as expected, so clearly something broke in this new version of Docker Compose.
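For reference, the compose-side equivalent of that flag would be something along these lines (the service and image names here are placeholders, not our actual compose file):

```yaml
services:
  app:
    image: my-cuda-image:latest   # placeholder for the custom CUDA-based image
    runtime: nvidia               # same effect as docker run --runtime nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```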
Hi Noah,
The easiest and safest way we found to do this is to uninstall the broken Docker version and install the known-good versions with apt.
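Roughly along these lines; the version strings below are placeholders rather than the exact known-good versions, so substitute the ones you have confirmed working (apt-cache madison docker-ce lists what is available):

```bash
# Remove the broken Docker packages
sudo apt-get remove docker-ce docker-ce-cli docker-compose-plugin

# Reinstall pinned, known-good versions (placeholders shown here)
sudo apt-get install \
  docker-ce=<good-version> \
  docker-ce-cli=<good-version> \
  docker-compose-plugin=<good-version>

# Hold them so a routine upgrade does not pull the broken versions back in
sudo apt-mark hold docker-ce docker-ce-cli docker-compose-plugin
```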
Keep in mind this works for Ubuntu 22.04; if you are using another Ubuntu version, the package versions might differ slightly, but they shouldn't be too hard to figure out.
@francisco.torrinha could you provide the versions of the *nvidia-container* components that you have installed on your system?
Note that for an envvar-only approach to work, the NVIDIA Container Runtime must be set as the default runtime in Docker. The runtime reads this envvar and ensures that the correct modifications are made to the container being started. It could be that the interaction with the runtime changed in one of the Compose updates.
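To spell that out, the envvar-only setup is roughly: make nvidia the default runtime in /etc/docker/daemon.json and pass the NVIDIA_* variables to the container. A minimal sketch, assuming the usual install paths:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

with the container started with NVIDIA_VISIBLE_DEVICES=all (and typically NVIDIA_DRIVER_CAPABILITIES=all) in its environment.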
Note that the resources.reservations.devices that @cclaunch calls out would trigger the injection using the nvidia-container-runtime-hook directly. This works for most use cases, but may have problems with Vulkan applications specifically, since we have implemented the recent Vulkan-related enhancements in the nvidia-container-runtime instead. It is also not currently applicable for using iGPUs on Tegra-based systems.
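For reference, that resources.reservations.devices request looks roughly like this in a compose file (service and image names are placeholders):

```yaml
services:
  app:
    image: my-cuda-image:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```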
I currently do not have access to the machine I am having issues with, but I am fairly certain I was using version 1.16.1-1 of the nvidia-container-runtime.
I am specifying the default runtime as the nvidia one in my daemon.json file.
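For what it's worth, a quick way to confirm what Docker actually reports as the default runtime:

```bash
docker info | grep -i runtime
# expected to show something like: Default Runtime: nvidia
```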
This has always worked for us in the past, until we updated Docker Compose, so my best guess would be that, yes, some change in Docker Compose is probably causing this weird behaviour.
I just wanted to chime in and say that I'm seeing the same issue on a Debian 11 (bullseye) system. The good and bad Docker package versions listed by @francisco.torrinha were the same ones on my system, and downgrading solves the issue for now. I'm also running nvidia-container-toolkit 1.16.2.
So is this an issue with the toolkit, with Docker, or both?