Failed to invoke the NVIDIA runtime with docker compose on AGX Orin

Hi Experts,

I’ve built a Docker image with the CUDA driver on an AGX Orin.

When I run the container using the command docker run -it --rm --runtime nvidia <image name>, everything works fine.

However, when I try to deploy my CI pipeline on the GitLab server using the docker-compose.yml file below, the test cases do not run correctly.

I believe the NVIDIA runtime is not being invoked properly.

version: '3'

services:
  gitlab-runner:
    image: gitlab/gitlab-runner:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    container_name: gitlab-runner
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /etc/gitlab-runner/config.toml:/etc/gitlab-runner/config.toml

This is my gitlab-runner config:

[[runners]]
  name = "ddjetson AGX Orin"
  url = "xxxxx"
  id = 404
  token = "xxxxx"
  token_obtained_at = 2024-08-16T03:20:53Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  [runners.custom_build_dir]
  [runners.cache]
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "jetsondev"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
    network_mtu = 0
    pull_policy = "if-not-present"
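A note on the setup above: `runtime: nvidia` in the compose file applies to the gitlab-runner container itself. With the docker executor and the host's /var/run/docker.sock mounted, the job containers are created by the host Docker daemon using the settings under `[runners.docker]`, and that section has no `runtime` entry here. One way to confirm which runtime a given container actually received is to ask `docker inspect`. The following is only a minimal sketch: the helper names `inspect_runtime_cmd` and `container_runtime` are made up for illustration, and it assumes the `docker` CLI is on the PATH of the host where it runs.

```python
import subprocess
import sys


def inspect_runtime_cmd(container):
    """Build the docker inspect command that prints a container's runtime."""
    return ["docker", "inspect", "--format", "{{.HostConfig.Runtime}}", container]


def container_runtime(container):
    """Return the runtime name (e.g. 'nvidia' or 'runc'), or None if unknown.

    None is returned when docker is missing, times out, or the container
    does not exist. (Hypothetical helper, not part of any library.)
    """
    try:
        result = subprocess.run(
            inspect_runtime_cmd(container),
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    # Pass one or more container names or IDs on the command line.
    for name in sys.argv[1:]:
        print(name, "->", container_runtime(name))
```

Run it on the Orin host while a CI job is active, passing both `gitlab-runner` and the job container's name (visible in `docker ps`), and compare the two results.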

Hi,

Did you use the iGPU driver to build the CUDA image?
Could you share the Dockerfile with us?

Thanks.

Hi @AastaLLL,

Yes, I built the image on AGX Orin with the command docker build -t jetsondev .

Dockerfile

FROM nvcr.io/nvidia/l4t-cuda:12.2.12-runtime

# Install nvidia-l4t-core
RUN \
    echo "deb https://repo.download.nvidia.com/jetson/common r36.3 main" >> /etc/apt/sources.list && \
    echo "deb https://repo.download.nvidia.com/jetson/t234 r36.3 main" >> /etc/apt/sources.list && \
    apt-key adv --fetch-key http://repo.download.nvidia.com/jetson/jetson-ota-public.asc && \
    mkdir -p /opt/nvidia/l4t-packages/ && \
    touch /opt/nvidia/l4t-packages/.nv-l4t-disable-boot-fw-update-in-preinstall

RUN apt-get update \
    && echo "Y" | apt-get install -y --no-install-recommends nvidia-l4t-core

ENV UDEV=1

# Install CUDA Toolkit 12.5
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/arm64/cuda-keyring_1.1-1_all.deb \
  && dpkg -i cuda-keyring_1.1-1_all.deb \
  && apt-get update \
  && apt-get -y install cuda-toolkit-12-5

# Install CUDA Compat 12.5
RUN apt-get update \
  && apt-get -y install cuda-compat-12-5

# Install necessary dependencies including gcc
RUN apt-get update \
    && apt-get install -y wget gdb build-essential git cmake libzmq3-dev pkg-config curl vim python3 python3-pip docker-compose ninja-build \
    && rm -rf /var/lib/apt/lists/*

# Install jtop 
RUN pip3 install jetson-stats

WORKDIR /

# Install GCC 12 and G++ 12
RUN apt-get update \
    && apt-get install -y software-properties-common \
    && add-apt-repository ppa:ubuntu-toolchain-r/test \
    && apt-get update \
    && apt-get install -y gcc-12 g++-12 \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 100 \
    && update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 100

##### Install necessary packages
COPY ./requirements.txt /
RUN pip3 install -r requirements.txt && rm -rf requirements.txt

# Install Google Test
RUN git clone https://github.com/google/googletest.git \
  && cd googletest \
  && mkdir build \
  && cd build \
  && cmake .. \
  && make -j12 \
  && make -j12 install \
  && cd ../.. \
  && rm -rf googletest

# Add lines to ~/.bashrc
RUN echo 'export PATH=/usr/local/cuda-12.5/bin:$PATH' >> ~/.bashrc \
  && echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.5/compat:$LD_LIBRARY_PATH' >> ~/.bashrc

I also updated my gitlab-runner config file, but only some of my test cases pass.

[[runners]]
  name = "ddjetson AGX Orin"
  url = "xxxxx"
  id = 404
  token = "xxxxx"
  token_obtained_at = 2024-08-16T03:20:53Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  [runners.custom_build_dir]
  [runners.cache]
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "jetsondev"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = "1g"
    network_mtu = 0
    pull_policy = "if-not-present"
    runtime = "nvidia"

Hi,

Is there an error message or failure log you can share with us?

It looks like you manually upgraded the CUDA version from 12.2 to 12.5.
Have you checked whether docker compose works with the default CUDA 12.2?

Thanks.
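To collect the kind of log requested above, a small probe can be run both in an interactive `docker run -it --rm --runtime nvidia jetsondev` session and as a step in the CI job, and the two outputs compared. The sketch below is an assumption-laden example, not an official NVIDIA tool: `probe_cuda` is a hypothetical name, and it only calls the CUDA driver API functions `cuInit`, `cuDriverGetVersion` and `cuDeviceGetCount` through `ctypes`. It also prints `PATH` and `LD_LIBRARY_PATH`, because variables exported in ~/.bashrc are typically picked up only by interactive shells and so may be missing in a CI job.

```python
import ctypes
import os
import shutil


def probe_cuda(lib_name="libcuda.so.1"):
    """Report whether the CUDA driver library loads and initializes.

    Hypothetical diagnostic helper; the returned dict is meant to be
    pasted into a bug report, not parsed by other tools.
    """
    report = {
        "PATH": os.environ.get("PATH", ""),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
        "nvcc": shutil.which("nvcc"),  # None if nvcc is not on PATH
        "libcuda_loaded": False,
        "cuInit": None,
        "driver_version": None,
        "device_count": None,
    }
    try:
        cuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        # Library not found or not loadable in this container.
        report["error"] = str(exc)
        return report

    report["libcuda_loaded"] = True
    # cuInit returns a CUresult; 0 means CUDA_SUCCESS.
    report["cuInit"] = cuda.cuInit(0)

    version = ctypes.c_int(0)
    if cuda.cuDriverGetVersion(ctypes.byref(version)) == 0:
        report["driver_version"] = version.value

    if report["cuInit"] == 0:
        count = ctypes.c_int(0)
        if cuda.cuDeviceGetCount(ctypes.byref(count)) == 0:
            report["device_count"] = count.value
    return report


if __name__ == "__main__":
    for key, value in probe_cuda().items():
        print(f"{key}: {value}")
```

If the library fails to load or `cuInit` is non-zero only in the CI job, that points at the runtime or environment of the job container rather than at the test cases themselves.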
