Jetson Xavier NX docker-in-docker issue

Hi, I’m trying to get docker-in-docker (dind) to work with CUDA on a Jetson Xavier NX Developer Kit.

Setup

Output of jetson_release -v on the Jetson host:

- NVIDIA Jetson Xavier NX (Developer Kit Version)
   * Jetpack 4.5 [L4T 32.5.0]
   * NV Power Mode: MODE_15W_6CORE - Type: 2
   * jetson_stats.service: active
 - Board info:
   * Type: Xavier NX (Developer Kit Version)
   * SOC Family: tegra194 - ID:25
   * Module: P3668 - Board: P3509-000
   * Code Name: jakku
   * CUDA GPU architecture (ARCH_BIN): 7.2
   * Serial Number: 1423220027077
 - Libraries:
   * CUDA: 10.2.89
   * cuDNN: 8.0.0.180
   * TensorRT: 7.1.3.0
   * Visionworks: 1.6.0.501
   * OpenCV: 4.1.1 compiled CUDA: NO
   * VPI: ii libnvvpi1 1.0.12 arm64 NVIDIA Vision Programming Interface library
   * Vulkan: 1.2.70
 - jetson-stats:
   * Version 3.1.0
   * Works on Python 3.6.9

I use a custom dind image inspired by Henderake/dind-nvidia-docker on GitHub (nvidia-docker in DinD, Docker inside Docker).
Dockerfile:

FROM nvcr.io/nvidia/l4t-base:r32.5.0 AS base

RUN apt-get update -q && \
    apt-get install -yq \
        apt-transport-https \
        ca-certificates \
        curl \
        gnupg \
        lsb-release && \
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg && \
    echo \
      "deb [arch=arm64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null && \
    apt-get update -q && apt-get install -yq docker-ce docker-ce-cli containerd.io

# https://github.com/docker/docker/blob/master/project/PACKAGERS.md#runtime-dependencies
RUN set -eux; \
    apt-get update -q && \
	apt-get install -yq \
		btrfs-tools \
		e2fsprogs \
		iptables \
		xfsprogs \
		xz-utils \
# pigz: https://github.com/moby/moby/pull/35697 (faster gzip implementation)
		pigz \
		wget


# set up subuid/subgid so that "--userns-remap=default" works out-of-the-box
RUN set -x \
	&& addgroup --system dockremap \
	&& adduser --system --ingroup dockremap dockremap \
	&& echo 'dockremap:165536:65536' >> /etc/subuid \
	&& echo 'dockremap:165536:65536' >> /etc/subgid

# https://github.com/docker/docker/tree/master/hack/dind
ENV DIND_COMMIT ed89041433a031cafc0a0f19cfe573c31688d377

RUN set -eux; \
	wget -O /usr/local/bin/dind "https://raw.githubusercontent.com/docker/docker/${DIND_COMMIT}/hack/dind"; \
	chmod +x /usr/local/bin/dind

##### Install nvidia docker for jetson #####
RUN apt-key adv --fetch-key http://repo.download.nvidia.com/jetson/jetson-ota-public.asc && \
  echo "deb https://repo.download.nvidia.com/jetson/common r32.5 main" | tee /etc/apt/sources.list.d/nvidia-docker.list && \
  echo "deb https://repo.download.nvidia.com/jetson/t194 r32.5 main" >> /etc/apt/sources.list.d/nvidia-docker.list && \
  apt-get update -qq && \
  apt-get install -yq nvidia-docker2 nvidia-container-csv-cuda nvidia-container-csv-cudnn  && \
  sed -i '2i \ \ \ \ "default-runtime": "nvidia",' /etc/docker/daemon.json

COPY dockerd-entrypoint.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/dockerd-entrypoint.sh

VOLUME /var/lib/docker
EXPOSE 2375

ENTRYPOINT ["dockerd-entrypoint.sh"]
CMD []

dockerd-entrypoint.sh:

#!/bin/sh
set -e

# no arguments passed
# or first arg is `-f` or `--some-option`
if [ "$#" -eq 0 ] || [ "${1#-}" != "$1" ]; then
	# add our default arguments
	set -- dockerd \
		--host=unix:///var/run/docker.sock \
		--host=tcp://0.0.0.0:2375 \
		"$@"
fi

if [ "$1" = 'dockerd' ]; then
	# if we're running Docker, let's pipe through dind
	set -- "$(which dind)" "$@"

	# explicitly remove Docker's default PID file to ensure that it can start properly if it was stopped uncleanly (and thus didn't clean up the PID file)
	find /run /var/run -iname 'docker*.pid' -delete
fi

exec "$@"

To test whether CUDA works properly I use the deviceQuery utility from the CUDA samples. It was built directly on the Jetson host and gets mounted into Docker containers by the --runtime nvidia option.

Problem

deviceQuery works correctly when run directly on the Jetson host and when run from the dind container, but when run from a container started inside the dind Docker engine it fails with this error:

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version

I start the dind container (let’s call it container A) with:

docker run --privileged -d --runtime nvidia --name dind dind

and then open a bash session inside container A with:

docker exec -it dind bash

Within this bash session on container A, deviceQuery also works properly.

From that bash session on container A, I then run another container (let’s call it container B):

docker run --rm --privileged --runtime nvidia nvcr.io/nvidia/l4t-base:r32.5.0 /usr/local/cuda-10.2/samples/1_Utilities/deviceQuery/deviceQuery

This command fails with the aforementioned “CUDA driver version is insufficient for CUDA runtime version” error.

Is there any way I can get more information about the version mismatch, e.g. by displaying which versions are actually available and which are required?

Am I missing any deb packages in the dind image that are required to run this kind of workflow? The currently installed package list nvidia-docker2 nvidia-container-csv-cuda nvidia-container-csv-cudnn was built empirically based on what was failing.

Every recommendation I can find for this error says to “update the CUDA drivers”, but AFAIK in this case that should already be handled by matching the L4T versions (the Jetson host, container A and container B are all on L4T 32.5.0).

Any help with this issue would be much appreciated.

Hi,

There are some OTA update commands in your Dockerfile.

RUN apt-key adv --fetch-key http://repo.download.nvidia.com/jetson/jetson-ota-public.asc && \
  echo "deb https://repo.download.nvidia.com/jetson/common r32.5 main" | tee /etc/apt/sources.list.d/nvidia-docker.list && \
  echo "deb https://repo.download.nvidia.com/jetson/t194 r32.5 main" >> /etc/apt/sources.list.d/nvidia-docker.list && \
  apt-get update -qq && \
  apt-get install -yq nvidia-docker2 nvidia-container-csv-cuda nvidia-container-csv-cudnn  && \
  sed -i '2i \ \ \ \ "default-runtime": "nvidia",' /etc/docker/daemon.json

Would you mind checking if your container is upgraded to r32.5.1 after the command?

Thanks.

Hi, thank you for checking out my question.

Do you have any hints on how to check whether it has been upgraded to r32.5.1? The only way I know of is cat /etc/nv_tegra_release, and that file does not exist in the container.

Also, the output of the apt-get install commands shows only the versions of the installed packages, which are versioned separately from L4T, e.g.:

Get:34 https://repo.download.nvidia.com/jetson/common r32.5/main arm64 cuda-cudart-10-2 arm64 10.2.89-1 [94.5 kB]
Get:35 https://repo.download.nvidia.com/jetson/common r32.5/main arm64 cuda-driver-dev-10-2 arm64 10.2.89-1 [8296 B]
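
One possible way to check the release from inside a container is sketched below. It is only a sketch under assumptions: it assumes /etc/nv_tegra_release, when present, starts with a line of the form "# R32 (release), REVISION: 5.0, ...", and it assumes that the version of the nvidia-l4t-core package (if that package is installed at all) tracks the L4T release. The function names parse_l4t_release and l4t_release are made up for this example.

```shell
#!/bin/sh
# Sketch: report the L4T release visible in the current filesystem.
# Assumption: the release file starts with a line like
#   "# R32 (release), REVISION: 5.0, GCID: ..., BOARD: ..."

# Print "<major>.<revision>" parsed from an nv_tegra_release-style file.
parse_l4t_release() {
    # $1: path to the release file
    sed -n '1s/^# R\([0-9][0-9]*\) (release), REVISION: \([0-9.]*\).*/\1.\2/p' "$1"
}

# Print the L4T release, or "unknown" if it cannot be determined.
l4t_release() {
    release_file="${1:-/etc/nv_tegra_release}"
    if [ -r "$release_file" ]; then
        parse_l4t_release "$release_file"
    elif command -v dpkg-query >/dev/null 2>&1; then
        # Fallback (assumption): nvidia-l4t-core's package version follows the L4T release.
        dpkg-query --showformat='${Version}\n' --show nvidia-l4t-core 2>/dev/null \
            || echo "unknown"
    else
        echo "unknown"
    fi
}

echo "L4T release: $(l4t_release)"
```

Running this on the host, in container A and in container B would show whether all three report the same release. If it prints "unknown" inside a container, that only means neither source of information is available there.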

Hi,

CUDA driver version is insufficient for CUDA runtime version

Usually, this error indicates that the GPU driver and the CUDA library are incompatible.
For Jetson, the GPU driver is integrated into the OS, so it depends on the release version.
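
Because the driver libraries come from the filesystem of whichever system runs the Docker engine, it may help to check which files that engine can actually hand to a container. The sketch below rests on an assumption about JetPack 4.x: that the NVIDIA container runtime mounts the files listed in CSV files under /etc/nvidia-container-runtime/host-files-for-container.d/, with one "<type>, <path>" entry per line. The helper name missing_csv_paths and the CSV_DIR override are made up for this example.

```shell
#!/bin/sh
# Sketch: list CSV-declared paths that do not exist in the current filesystem.
# Assumption: each CSV line has the form "<type>, <path>" (type: dev, lib, sym, dir).

# Print every path listed in the given CSV file that is absent here.
missing_csv_paths() {
    # $1: CSV file to inspect
    while IFS=, read -r kind path; do
        # Trim surrounding whitespace from the path field.
        path=$(printf '%s' "$path" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
        [ -n "$path" ] || continue
        # -e follows symlinks; -L additionally accepts dangling symlinks.
        if [ ! -e "$path" ] && [ ! -L "$path" ]; then
            printf '%s\n' "$path"
        fi
    done < "$1"
}

csv_dir="${CSV_DIR:-/etc/nvidia-container-runtime/host-files-for-container.d}"
for csv in "$csv_dir"/*.csv; do
    [ -f "$csv" ] || continue   # no CSV files found at all
    echo "== $csv"
    missing_csv_paths "$csv"
done
```

Comparing the output on the host with the output inside the dind container would show whether the inner engine sees fewer CSV files, or lists files that are not present there. Either case could leave the innermost container without the driver libraries.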

It seems that you want to reuse the docker running on the NX.
Have you tried installing a separate Docker engine in the dind container rather than reusing the one on the NX?

Thanks.

Actually, this problem only occurs when I am NOT reusing the Docker engine from the NX. The dind container has its own Docker engine, separate from the one on the NX. It is installed in this part of the Dockerfile:

FROM nvcr.io/nvidia/l4t-base:r32.5.0 AS base

RUN apt-get update -q && \
    apt-get install -yq \
        apt-transport-https \
        ca-certificates \
        curl \
        gnupg \
        lsb-release && \
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg && \
    echo \
      "deb [arch=arm64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null && \
    apt-get update -q && apt-get install -yq docker-ce docker-ce-cli containerd.io

# https://github.com/docker/docker/blob/master/project/PACKAGERS.md#runtime-dependencies
RUN set -eux; \
    apt-get update -q && \
	apt-get install -yq \
		btrfs-tools \
		e2fsprogs \
		iptables \
		xfsprogs \
		xz-utils \
# pigz: https://github.com/moby/moby/pull/35697 (faster gzip implementation)
		pigz \
		wget


# set up subuid/subgid so that "--userns-remap=default" works out-of-the-box
RUN set -x \
	&& addgroup --system dockremap \
	&& adduser --system --ingroup dockremap dockremap \
	&& echo 'dockremap:165536:65536' >> /etc/subuid \
	&& echo 'dockremap:165536:65536' >> /etc/subgid

# https://github.com/docker/docker/tree/master/hack/dind
ENV DIND_COMMIT ed89041433a031cafc0a0f19cfe573c31688d377

RUN set -eux; \
	wget -O /usr/local/bin/dind "https://raw.githubusercontent.com/docker/docker/${DIND_COMMIT}/hack/dind"; \
	chmod +x /usr/local/bin/dind

When I reuse the Docker engine from the NX by bind-mounting the Docker socket, everything works as expected.