Normal user cannot use cuda device in L4T-36.2 docker

Hi Nvidia,

I am using the l4t-36.3 Docker image, and torch.cuda.is_available() returns True when I am the root user inside the container. However, after I switch to a new user, torch.cuda.is_available() returns False.
Here is the full error message:

/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 801: operation not supported (Triggered internally at /tmp/pytorch/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False

I am using a Jetson AGX Orin 64GB developer version with JetPack 6.0, Docker 27.3.1, and docker-compose 1.29.2.

The exact same Dockerfile worked on JetPack 5.1.1.
Here is the Dockerfile:

FROM nvcr.io/nvidia/l4t-ml:r36.2.0-py3

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update --no-install-recommends \
    && apt-get install -y apt-utils

RUN apt-get install -y \
  build-essential \
  cmake \
  cppcheck \
  gdb \
  git \
  lsb-release \
  software-properties-common \
  sudo \
  vim \
  wget \
  tmux \
  curl \
  less \
  net-tools \
  byobu \
  libgl-dev \
  iputils-ping \
  nano \
  unzip \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*


# Add a user with the same user_id as the user outside the container
# Requires a docker build argument `user_id`
ARG user_id=$user_id
ENV USERNAME developer
RUN useradd -U --uid ${user_id} -ms /bin/bash $USERNAME \
 && echo "$USERNAME:$USERNAME" | chpasswd \
 && adduser $USERNAME sudo \
 && echo "$USERNAME ALL=NOPASSWD: ALL" >> /etc/sudoers.d/$USERNAME

# Commands below run as the developer user
USER $USERNAME

# When running a container start in the developer's home folder
WORKDIR /home/$USERNAME

# Set the timezone
RUN export DEBIAN_FRONTEND=noninteractive \
 && sudo apt-get update \
 && sudo -E apt-get install -y \
   tzdata \
 && sudo ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime \
 && sudo dpkg-reconfigure --frontend noninteractive tzdata \
 && sudo apt-get clean 



RUN mkdir ~/.mmpug

RUN touch ~/.Xauthority

RUN sudo usermod -a -G dialout developer \
 && sudo usermod -a -G tty developer \
 && sudo usermod -a -G video developer \
 && sudo usermod -a -G root developer \
 && sudo groupadd -f -r gpio \
 && sudo usermod -a -G gpio developer

# for ros2
RUN sudo apt update && sudo apt install -y locales \
 && sudo locale-gen en_US en_US.UTF-8 \
 && sudo update-locale LC_ALL=en_US.UTF-8 LANG=en_US.UTF-8 \
 && export LANG=en_US.UTF-8

RUN sudo apt install -y software-properties-common \
 && sudo add-apt-repository universe \
 && sudo apt update && sudo apt install curl -y \
 && sudo curl -sSL https://raw.githubusercontent.com/ros/rosdistro/master/ros.key -o /usr/share/keyrings/ros-archive-keyring.gpg \
 && echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/ros-archive-keyring.gpg] http://packages.ros.org/ros2/ubuntu $(. /etc/os-release && echo $UBUNTU_CODENAME) main" | sudo tee /etc/apt/sources.list.d/ros2.list > /dev/null

After I switch to the normal user, CUDA is no longer available:

xhost:  unable to open display ""
root@ubuntu:/# python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> 
KeyboardInterrupt
>>> 
root@ubuntu:/# USER developer
bash: USER: command not found
root@ubuntu:/# su developer  
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

developer@ubuntu:/$ python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 801: operation not supported (Triggered internally at /tmp/pytorch/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> 

I have tried usermod -aG sudo,video,i2c "$USER", but it didn't work.
Please help, thanks.
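
One way to narrow this down is to compare what the two accounts can actually open under /dev. The sketch below is only a diagnostic aid, not a fix: run it once as root and once as developer inside the container and compare the output. The device patterns in DEVICE_PATTERNS are assumptions about which nodes the container exposes, and describe_node and report are hypothetical helper names, so adjust them to match your setup.

```python
import glob
import grp
import os
import pwd
import stat

# Assumed device-node patterns; edit to match what `ls /dev` shows in your container.
DEVICE_PATTERNS = ["/dev/nvmap", "/dev/nvhost*", "/dev/nvgpu/*/*", "/dev/dri/*"]


def describe_node(path):
    """Return mode, owning group, and whether the current user can read and write the node."""
    st = os.stat(path)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)  # the gid has no name inside the container
    return {
        "path": path,
        "mode": stat.filemode(st.st_mode),
        "group": group,
        "rw_ok": os.access(path, os.R_OK | os.W_OK),
    }


def report(patterns=DEVICE_PATTERNS):
    """Print the current identity and the access status of every matching node."""
    try:
        user = pwd.getpwuid(os.geteuid()).pw_name
    except KeyError:
        user = str(os.geteuid())  # the uid has no passwd entry
    print(f"user={user} uid={os.geteuid()} groups={sorted(os.getgroups())}")
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            info = describe_node(path)
            print(f"{info['mode']} {info['group']:<10} rw_ok={info['rw_ok']} {info['path']}")


if __name__ == "__main__":
    report()
```

If a node shows rw_ok=True for root but rw_ok=False for developer, the owning group printed on that line is the one worth checking against the developer account's group list.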

Hi,

How do you launch the container?
Could you try the settings below to see if they help?

Thanks

I use docker start -ai to start the container. Can you show me which settings you mean?

docker_execute_command="
  docker exec
    --privileged
    -e DISPLAY=${DISPLAY}
    -e LINES=`tput lines`
    -it ${container}

I use docker-compose to create the container:

  base:
    # extend gpu or non-gpu
    build:
      args:
        - ARCH_T=$JAVIS_ARCH_T
        - JAVIS_ROS_DISTRO=$JAVIS_ROS_DISTRO
        - DOCKER_IMAGE_VERSION=$DOCKER_IMAGE_VERSION
        - user_id=$JAVIS_USERID
        - group_id=$JAVIS_GROUPID
    extends:
      service: ${JAVIS_HOST_TYPE}
    privileged: true
    security_opt:
      - seccomp:unconfined
    ipc: host
    volumes:
      # javis workspace
      - ${JAVIS_PATH}:/home/developer/javis_ws/
      # gui configurations
      - /tmp/.X11-unix:/tmp/.X11-unix
      - /etc/localtime:/etc/localtime:ro
      - /dev/input:/dev/input
      - /dev/:/dev/
      - /etc/hosts:/etc/hosts
      - ~/.javis/auto/deploy.conf:/home/developer/.javis/auto/deploy.conf
      - ${JAVIS_LOGGING_DIR}:/logging
      - /var/log/syslog:/syslog
      - /usr/lib/aarch64-linux-gnu/tegra:/usr/lib/aarch64-linux-gnu/tegra
      #- $XAUTHORITY:/home/developer/.Xauthority:rw
    environment:
      # Set environment params for GUI container passthrough
      - DISPLAY
      - QT_X11_NO_MITSHM=1
      # - XAUTHORITY=/tmp/.docker.xauth
      # - QT_QPA_PLATFORM='offscreen'
      - JAVIS_ROS_DISTRO=${JAVIS_ROS_DISTRO}
      # deployer export for exec call
      - DEPLOYER_TOP_PATH=/home/developer/javis_ws/operations//javis_deploy/deployer/
      - DEPLOYER_BIN=/home/developer/javis_ws/operations//javis_deploy/deployer/bin/
      - DEPLOYER_BOOKS_PATH=/home/developer/javis_ws/operations//javis_deploy/books/
      - JAVIS_PATH=/home/developer/javis_ws/
      - JAVIS_SRC_PATH=/home/developer/javis_ws/src/
      # Set the hostnames of different systems
      - ROS_MASTER_IP=$JAVIS_HOSTNAME
      - ROS_HOSTNAME=$JAVIS_HOSTNAME
      - JAVIS_USERID=$JAVIS_USERID
      - JAVIS_GROUPID=$JAVIS_GROUPID
      - JAVIS_SYSTEM_ID=$JAVIS_SYSTEM_ID
      - JAVIS_SYSTEM_TYPE=$JAVIS_SYSTEM_TYPE
      - JAVIS_SYSTEM_COMPONENT=$JAVIS_SYSTEM_COMPONENT
      - JAVIS_SETUP_SUPPRESS_CHECKS=true
    # entrypoint:
      # - /docker-entrypoint/ws-shell.bash
    tty: true
    runtime: nvidia
    # use host network
    network_mode: "host"
  javis_test:
    image: javis/${JAVIS_ARCH_T}.test:${DOCKER_IMAGE_VERSION}
    build:
      dockerfile: ${JAVIS_DOCKER_PATH}/javis/services/test.dockerfile
      context: ${JAVIS_DOCKER_PATH}/javis/
    extends:
      service: base
    container_name: javis_test
    privileged: true
    #ulimits:
    #  nice: 40
    environment:
      - ROS_SOURCED_WORKSPACE=/home/developer/javis_ws/install/javis_test/setup.bash
    volumes:
      - /usr/lib/aarch64-linux-gnu/tegra:/usr/lib/aarch64-linux-gnu/tegra
      - /usr/src/jetson_multimedia_api:/usr/src/jetson_multimedia_api
      - /usr/src/jetson_multimedia_api/argus:/usr/src/jetson_multimedia_api/argus
      - /etc/nv_tegra_release:/etc/nv_tegra_release
      - /usr/sbin/nvargus-daemon:/usr/sbin/nvargus-daemon
      - /tmp/argus_socket:/tmp/argus_socket
      - /tmp:/tmp
      - /var/nvidia/nvcam/settings/:/var/nvidia/nvcam/settings/
      - /etc/systemd/system:/etc/systemd/system
      - /etc/udev/rules.d/:/etc/udev/rules.d/
    runtime: nvidia
    devices:
      - /dev/i2c-8:/dev/i2c-8
      - /dev/video0:/dev/video0
      - /dev/video1:/dev/video1
      - /dev/video2:/dev/video2
      - /dev/video3:/dev/video3
      - /dev/video4:/dev/video4
      - /dev/video5:/dev/video5
      - /dev/video6:/dev/video6
    ipc: "host"

Hi,

Sorry, I meant the command below:

Thanks.

I have tried adding the docker-default-runtime option. After sudo systemctl restart docker, here is the output:

jiahe@ubuntu:~$ sudo docker info | grep 'Default Runtime'
 Default Runtime: nvidia
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

I then rebuilt the image and the container, but CUDA in torch is still not available without root.
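
For a second check that does not depend on docker info, the daemon configuration file can be read directly. This is a minimal sketch that assumes the configuration lives at /etc/docker/daemon.json (the usual location, but it may differ on your system); default_runtime is a hypothetical helper name.

```python
import json


def default_runtime(path="/etc/docker/daemon.json"):
    """Return the configured default runtime and the names of all registered runtimes."""
    with open(path) as f:
        cfg = json.load(f)
    # "default-runtime" is absent when Docker falls back to its built-in default.
    return cfg.get("default-runtime"), sorted(cfg.get("runtimes", {}))


if __name__ == "__main__":
    try:
        print(default_runtime())
    except FileNotFoundError:
        print("daemon.json not found at the assumed path")
```

This only confirms what the daemon was told to use; it does not show whether a given compose service overrides it.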

Hi,

Would you mind checking whether a simple docker run command (instead of docker-compose) works?

Thanks.

docker run works

Still, I hope to do everything with docker-compose, since the whole project is built on it.

Hi,

You can find below the steps to set up docker rootless mode.

Could you apply similar steps to the docker-compose tool to see if it can also run with a non-root account?

Thanks.
