Hi, I’m trying to get Docker-in-Docker (dind) to work with CUDA on a Jetson Xavier NX developer kit.
Setup
Output of jetson_release -v on the Jetson host:
- NVIDIA Jetson Xavier NX (Developer Kit Version)
* Jetpack 4.5 [L4T 32.5.0]
* NV Power Mode: MODE_15W_6CORE - Type: 2
* jetson_stats.service: active
- Board info:
* Type: Xavier NX (Developer Kit Version)
* SOC Family: tegra194 - ID:25
* Module: P3668 - Board: P3509-000
* Code Name: jakku
* CUDA GPU architecture (ARCH_BIN): 7.2
* Serial Number: 1423220027077
- Libraries:
* CUDA: 10.2.89
* cuDNN: 8.0.0.180
* TensorRT: 7.1.3.0
* Visionworks: 1.6.0.501
* OpenCV: 4.1.1 compiled CUDA: NO
* VPI: ii libnvvpi1 1.0.12 arm64 NVIDIA Vision Programming Interface library
* Vulkan: 1.2.70
- jetson-stats:
* Version 3.1.0
* Works on Python 3.6.9
I use a custom dind image inspired by GitHub - Henderake/dind-nvidia-docker (nvidia-docker in DinD, Docker inside Docker):
Dockerfile:
FROM nvcr.io/nvidia/l4t-base:r32.5.0 AS base
RUN apt-get update -q && \
apt-get install -yq \
apt-transport-https \
ca-certificates \
curl \
gnupg \
lsb-release && \
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg && \
echo \
"deb [arch=arm64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null && \
apt-get update -q && apt-get install -yq docker-ce docker-ce-cli containerd.io
# https://github.com/docker/docker/blob/master/project/PACKAGERS.md#runtime-dependencies
RUN set -eux; \
apt-get update -q && \
apt-get install -yq \
btrfs-tools \
e2fsprogs \
iptables \
xfsprogs \
xz-utils \
# pigz: https://github.com/moby/moby/pull/35697 (faster gzip implementation)
pigz \
wget
# set up subuid/subgid so that "--userns-remap=default" works out-of-the-box
RUN set -x \
&& addgroup --system dockremap \
&& adduser --system --ingroup dockremap dockremap \
&& echo 'dockremap:165536:65536' >> /etc/subuid \
&& echo 'dockremap:165536:65536' >> /etc/subgid
# https://github.com/docker/docker/tree/master/hack/dind
ENV DIND_COMMIT ed89041433a031cafc0a0f19cfe573c31688d377
RUN set -eux; \
wget -O /usr/local/bin/dind "https://raw.githubusercontent.com/docker/docker/${DIND_COMMIT}/hack/dind"; \
chmod +x /usr/local/bin/dind
##### Install nvidia docker for jetson #####
RUN apt-key adv --fetch-key http://repo.download.nvidia.com/jetson/jetson-ota-public.asc && \
echo "deb https://repo.download.nvidia.com/jetson/common r32.5 main" | tee /etc/apt/sources.list.d/nvidia-docker.list && \
echo "deb https://repo.download.nvidia.com/jetson/t194 r32.5 main" >> /etc/apt/sources.list.d/nvidia-docker.list && \
apt-get update -qq && \
apt-get install -yq nvidia-docker2 nvidia-container-csv-cuda nvidia-container-csv-cudnn && \
sed -i '2i \ \ \ \ "default-runtime": "nvidia",' /etc/docker/daemon.json
COPY dockerd-entrypoint.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/dockerd-entrypoint.sh
VOLUME /var/lib/docker
EXPOSE 2375
ENTRYPOINT ["dockerd-entrypoint.sh"]
CMD []
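I build the image like this (with the dockerd-entrypoint.sh shown below sitting next to the Dockerfile; the dind tag is my own choice and matches the run command in the Problem section):
docker build -t dind .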
dockerd-entrypoint.sh:
#!/bin/sh
set -e
# no arguments passed
# or first arg is `-f` or `--some-option`
if [ "$#" -eq 0 ] || [ "${1#-}" != "$1" ]; then
# add our default arguments
set -- dockerd \
--host=unix:///var/run/docker.sock \
--host=tcp://0.0.0.0:2375 \
"$@"
fi
if [ "$1" = 'dockerd' ]; then
# if we're running Docker, let's pipe through dind
set -- "$(which dind)" "$@"
# explicitly remove Docker's default PID file to ensure that it can start properly if it was stopped uncleanly (and thus didn't clean up the PID file)
find /run /var/run -iname 'docker*.pid' -delete
fi
exec "$@"
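With these default arguments the inner daemon listens on both the Unix socket and TCP port 2375, so (assuming the container is named dind as in the run command below) the inner engine can also be reached from outside, for example:
# talk to the inner daemon via the outer one
docker exec dind docker info
# or, if port 2375 is published with -p 2375:2375:
# docker -H tcp://localhost:2375 info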
To test whether CUDA works properly I use the deviceQuery utility from the CUDA samples, which was built directly on the Jetson host and gets mounted into Docker containers thanks to the --runtime nvidia Docker option.
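For completeness, this is roughly how it was built on the host (the path is JetPack’s default sample location for CUDA 10.2):
# on the Jetson host: build deviceQuery from the bundled CUDA samples
cd /usr/local/cuda-10.2/samples/1_Utilities/deviceQuery
sudo make
As I understand it, the nvidia runtime mounts host files listed in the CSVs under /etc/nvidia-container-runtime/host-files-for-container.d/, which is why the binary then shows up inside the containers.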
Problem
deviceQuery works correctly when run directly on the Jetson host, and it also works correctly when run from the dind container, but when run from a container running inside the dind Docker engine it fails with this error:
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
I start the dind container (let’s call it container A) using the command docker run --privileged -d --runtime nvidia --name dind dind and then start a bash session within container A using docker exec -it dind bash. Within this bash session on container A, deviceQuery also works properly.
Following this setup, I run another container (let’s call it container B) from within the bash session on container A using the command docker run --rm --privileged --runtime nvidia nvcr.io/nvidia/l4t-base:r32.5.0 /usr/local/cuda-10.2/samples/1_Utilities/deviceQuery/deviceQuery, and this command fails with the aforementioned “CUDA driver version is insufficient for CUDA runtime version” error.
Is there any way to get more information about the version mismatch, e.g. displaying which versions are actually available and which are required?
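For example, I imagine a sanity check run in each environment (host, container A, container B) along these lines, assuming the standard L4T paths:
# L4T / driver release string
cat /etc/nv_tegra_release
# is the Tegra CUDA driver library present (i.e. was it mounted by the nvidia runtime)?
ls -l /usr/lib/aarch64-linux-gnu/tegra/libcuda.so* 2>/dev/null || echo "libcuda.so not found"
# CUDA toolkit version shipped or mounted in the environment
cat /usr/local/cuda/version.txt 2>/dev/null
If libcuda.so turns out to be missing in container B, that would point at the inner nvidia runtime not mounting the driver libraries rather than an actual version mismatch.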
Am I missing any deb packages in the dind image that are required for this kind of workflow? The currently installed package list (nvidia-docker2 nvidia-container-csv-cuda nvidia-container-csv-cudnn) was put together empirically based on what was failing.
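For context, my understanding is that the nvidia-container-csv-* packages drop mount lists under /etc/nvidia-container-runtime/host-files-for-container.d/, so inside container A one can check what the inner nvidia runtime would mount:
# list the CSV mount lists installed by the nvidia-container-csv-* packages
ls /etc/nvidia-container-runtime/host-files-for-container.d/
# peek at the entries one CSV pulls in (file name assumed; adjust to what ls shows)
head -n 5 /etc/nvidia-container-runtime/host-files-for-container.d/cuda.csv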
All the recommendations I can find about this error say to “update the CUDA drivers”, but AFAIK in this case that should be handled by matching the L4T versions used (the Jetson host, container A, and container B all run L4T 32.5.0).
Any help with this issue would be much appreciated.