Rootless Podman Container - CUDA Operation Not Supported - Error Code 801

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other

Target Operating System
X Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
other

SDK Manager Version
1.9.3.10904
X other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

Hi, I am trying to run the deviceQuery sample program from the /opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery directory inside a rootless podman container.

I believe this program uses the CUDA runtime library to interact with the GPU.

When I run the program directly on the host, it works.

However, when I run it inside a rootless podman container, cudaGetDeviceCount fails with error 801 ("operation not supported"):

root@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 801
-> operation not supported
Result = FAIL
root@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery# exit
exit
nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          12.1 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28954 MBytes (30360899584 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.1, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery$

I followed the "Installing the NVIDIA Container Toolkit" and "Support for Container Device Interface" guides from the NVIDIA Container Toolkit 1.16.0 documentation to set up Podman.
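
For reference, the CDI-based setup from those guides boils down to roughly the following. This is a sketch, not my exact history: the command names come from the Container Toolkit documentation, and the podman flags mirror what I use later in this post.

```
# Generate a CDI specification describing the GPU devices and driver libraries
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Confirm that the toolkit can resolve the CDI device names
nvidia-ctk cdi list

# Request the GPU via CDI from a rootless container
podman run --rm --security-opt=label=disable \
  --device nvidia.com/gpu=all \
  -v $(pwd):$(pwd) -w $(pwd) \
  ubuntu:20.04 ./deviceQuery
```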

I am fairly sure this is not a driver/CUDA version compatibility issue.

Can someone shed some light on this? I have been stuck on it for many days.

nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery$ podman info
host:
  arch: arm64
  buildahVersion: 1.33.2
  cgroupControllers:
  - memory
  - pids
  cgroupManager: cgroupfs
  cgroupVersion: v2
  conmon:
    package: Unknown
    path: /usr/local/libexec/podman/conmon
    version: 'conmon version 2.1.12, commit: 3bc422cd8aaec542d85d1a80f2d38e6e69046b5b'
  cpuUtilization:
    idlePercent: 99.86
    systemPercent: 0.08
    userPercent: 0.07
  cpus: 12
  databaseBackend: sqlite
  distribution:
    codename: focal
    distribution: ubuntu
    version: "20.04"
  eventLogger: file
  freeLocks: 2047
  hostname: tegra-ubuntu
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.15.122-rt-tegra
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 25436360704
  memTotal: 30360899584
  networkBackend: cni
  networkBackendInfo:
    backend: cni
    dns: {}
    package: containernetworking-plugins_0.8.5-1_arm64
    path: /usr/lib/cni
  ociRuntime:
    name: crun
    package: crun_0.12.1+dfsg-1_arm64
    path: /usr/bin/crun
    version: |-
      crun version 0.12.1
      commit: df5f2b2369b3d9f36d175e1183b26e5cee55dd0a
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    exists: false
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: ""
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_0.4.3-1_arm64
    version: |-
      slirp4netns version 0.4.3
      commit: 2244b9b6461afeccad1678fac3d6e478c28b4ad6
  swapFree: 0
  swapTotal: 0
  uptime: 7h 56m 59.00s (Approximately 0.29 days)
  variant: v8
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/nvidia/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/nvidia/.local/share/containers/storage
  graphRootAllocated: 27292614656
  graphRootUsed: 10870382592
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 5
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/nvidia/.local/share/containers/storage/volumes
version:
  APIVersion: 4.8.3
  Built: 1723775305
  BuiltTime: Fri Aug 16 02:28:25 2024
  GitCommit: 85dc30df56566a654700722a4dd190e1b9680ee7
  GoVersion: go1.22.6
  Os: linux
  OsArch: linux/arm64
  Version: 4.8.3
NVRM version: NVIDIA UNIX Open Kernel Module for aarch64  541.2.0  Release Build  (root@xinronch-ubuntu)  Wed Aug 14 18:24:07 UTC 2024
GCC version:  gcc version 9.3.0 (Buildroot 2020.08)
{
   "cuda" : {
      "name" : "CUDA SDK",
      "version" : "11.4.30"
   },
   "cuda_cudart" : {
      "name" : "CUDA Runtime (cudart)",
      "version" : "11.4.532"
   },
   "cuda_cuobjdump" : {
      "name" : "cuobjdump",
      "version" : "11.4.532"
   },
   "cuda_cupti" : {
      "name" : "CUPTI",
      "version" : "11.4.532"
   },
   "cuda_cuxxfilt" : {
      "name" : "CUDA cu++ filt",
      "version" : "11.4.532"
   },
   "cuda_gdb" : {
      "name" : "CUDA GDB",
      "version" : "11.4.532"
   },
   "cuda_nvcc" : {
      "name" : "CUDA NVCC",
      "version" : "11.4.532"
   },
   "cuda_nvdisasm" : {
      "name" : "CUDA nvdisasm",
      "version" : "11.4.532"
   },
   "cuda_nvprune" : {
      "name" : "CUDA nvprune",
      "version" : "11.4.532"
   },
   "cuda_nvrtc" : {
      "name" : "CUDA NVRTC",
      "version" : "11.4.532"
   },
   "cuda_nvtx" : {
      "name" : "CUDA NVTX",
      "version" : "11.4.532"
   },
   "cuda_samples" : {
      "name" : "CUDA Samples",
      "version" : "11.4.532"
   },
   "cuda_sanitizer_api" : {
      "name" : "CUDA Compute Sanitizer API",
      "version" : "11.4.532"
   },
   "cuda_thrust" : {
      "name" : "CUDA C++ Core Compute Libraries",
      "version" : "11.4.532"
   },
   "libcublas" : {
      "name" : "CUDA cuBLAS",
      "version" : "11.6.6.316"
   },
   "libcudla" : {
      "name" : "CUDA cuDLA",
      "version" : "11.4.532"
   },
   "libcufft" : {
      "name" : "CUDA cuFFT",
      "version" : "10.6.0.436"
   },
   "libcurand" : {
      "name" : "CUDA cuRAND",
      "version" : "10.2.5.531"
   },
   "libcusolver" : {
      "name" : "CUDA cuSOLVER",
      "version" : "11.2.0.531"
   },
   "libcusparse" : {
      "name" : "CUDA cuSPARSE",
      "version" : "11.6.0.531"
   },
   "libnpp" : {
      "name" : "CUDA NPP",
      "version" : "11.4.0.521"
   },
   "nsight_compute" : {
      "name" : "Nsight Compute",
      "version" : "2021.2.10.1"
   }
}

Permissions of the device files as seen inside the container:

root@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery# ls -ld /dev/n*
ls: /usr/lib/aarch64-linux-gnu/libselinux.so.1: no version information available (required by ls)
crw-rw-rw- 1 root root   1,   3 Aug 17 00:44 /dev/null
drwxr-xr-x 3 root root       60 Aug 17 00:44 /dev/nvgpu
crw-rw-rw- 1 root root 472,   0 Aug 17 00:44 /dev/nvhost-as-gpu
crw-rw-rw- 1 root root 472,   2 Aug 17 00:44 /dev/nvhost-ctrl-gpu
crw-rw-rw- 1 root root 486,   0 Aug 17 00:44 /dev/nvhost-ctrl-isp
crw-rw-rw- 1 root root 498,   0 Aug 17 00:44 /dev/nvhost-ctrl-isp-thi
crw-rw-rw- 1 root root 490,   0 Aug 17 00:44 /dev/nvhost-ctrl-nvcsi
crw-rw-rw- 1 root root 495,   0 Aug 17 00:44 /dev/nvhost-ctrl-nvdla0
crw-rw-rw- 1 root root 493,   0 Aug 17 00:44 /dev/nvhost-ctrl-nvdla1
crw-rw-rw- 1 root root 492,   0 Aug 17 00:44 /dev/nvhost-ctrl-pva0
crw-rw-rw- 1 root root 489,   0 Aug 17 00:44 /dev/nvhost-ctrl-vi0
crw-rw-rw- 1 root root 501,   0 Aug 17 00:44 /dev/nvhost-ctrl-vi0-thi
crw-rw-rw- 1 root root 488,   0 Aug 17 00:44 /dev/nvhost-ctrl-vi1
crw-rw-rw- 1 root root 500,   0 Aug 17 00:44 /dev/nvhost-ctrl-vi1-thi
crw-rw-rw- 1 root root 472,   3 Aug 17 00:44 /dev/nvhost-ctxsw-gpu
crw-rw-rw- 1 root root 472,   4 Aug 17 00:44 /dev/nvhost-dbg-gpu
crw-rw-rw- 1 root root 472,   1 Aug 17 00:44 /dev/nvhost-gpu
crw-rw-rw- 1 root root 472,   9 Aug 17 00:44 /dev/nvhost-nvsched-gpu
crw-rw-rw- 1 root root 472,  10 Aug 17 00:44 /dev/nvhost-nvsched_ctrl_fifo-gpu
crw-rw-rw- 1 root root 502,   0 Aug 17 00:44 /dev/nvhost-power-gpu
crw-rw-rw- 1 root root 472,   6 Aug 17 00:44 /dev/nvhost-prof-ctx-gpu
crw-rw-rw- 1 root root 472,   7 Aug 17 00:44 /dev/nvhost-prof-dev-gpu
crw-rw-rw- 1 root root 472,   5 Aug 17 00:44 /dev/nvhost-prof-gpu
crw-rw-rw- 1 root root 472,   8 Aug 17 00:44 /dev/nvhost-sched-gpu
crw-rw-rw- 1 root root 472,  11 Aug 17 00:44 /dev/nvhost-tsg-gpu
crwxrwxrwx 1 root root 195, 254 Aug 17 00:44 /dev/nvidia-modeset
crwxrwxrwx 1 root root 195,   0 Aug 17 00:44 /dev/nvidia0
crwxrwxrwx 1 root root 195, 255 Aug 17 00:44 /dev/nvidiactl
crw-rw-rw- 1 root root  10,  79 Aug 17 00:44 /dev/nvmap

I tried running with sudo and with --privileged; the issue persists.

config.toml

#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
no-cgroups = true
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"

devices.csv

dev, /dev/nvhost-tsg-gpu
dev, /dev/nvhost-nvsched_ctrl_fifo-gpu
dev, /dev/nvhost-nvsched-gpu
dev, /dev/nvhost-sched-gpu
dev, /dev/nvhost-prof-dev-gpu
dev, /dev/nvhost-prof-ctx-gpu
dev, /dev/nvhost-prof-gpu
dev, /dev/nvhost-dbg-gpu
dev, /dev/nvhost-ctxsw-gpu
dev, /dev/nvhost-ctrl-gpu
dev, /dev/nvhost-gpu
dev, /dev/nvhost-as-gpu
dev, /dev/nvhost-ctrl-isp
dev, /dev/nvhost-ctrl-vi1
dev, /dev/nvhost-ctrl-nvcsi
dev, /dev/nvhost-ctrl-vi0
dev, /dev/nvhost-ctrl-pva0
dev, /dev/nvhost-ctrl-nvdla1
dev, /dev/nvhost-ctrl-nvdla0
dev, /dev/nvhost-ctrl-isp-thi
dev, /dev/nvhost-ctrl-vi1-thi
dev, /dev/nvhost-ctrl-vi0-thi
dev, /dev/nvhost-power-gpu
dev, /dev/nvgpu/igpu0/tsg
dev, /dev/nvgpu/igpu0/nvsched_ctrl_fifo
dev, /dev/nvgpu/igpu0/nvsched
dev, /dev/nvgpu/igpu0/sched
dev, /dev/nvgpu/igpu0/prof-dev
dev, /dev/nvgpu/igpu0/prof-ctx
dev, /dev/nvgpu/igpu0/prof
dev, /dev/nvgpu/igpu0/dbg
dev, /dev/nvgpu/igpu0/ctxsw
dev, /dev/nvgpu/igpu0/ctrl
dev, /dev/nvgpu/igpu0/channel
dev, /dev/nvgpu/igpu0/as
dev, /dev/nvgpu/igpu0/power
dev, /dev/nvmap
dev, /dev/nvidia0
dev, /dev/nvidiactl
dev, /dev/nvidia-modeset
nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ sudo podman run --rm -v $(pwd):$(pwd) -w $(pwd) --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset --device nvidia.com/gpu=0 --runtime=runc --gpus all -e NVIDIA_VISIBLE_DEVICES=1  -e CUDA_VISIBLE_DEVICES=1  --security-opt label=disable  --security-opt apparmor=unconfined --read-only=false   --security-opt seccomp=unconfined -it --network=host ubuntu:20.04 ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 801
-> operation not supported
Result = FAIL

I wrote a simple C++ program that calls cudaGetDeviceCount, and it also fails with error 801 (cudaErrorNotSupported, "operation not supported") when run inside a podman container:
```
#include <cuda_runtime.h>
#include <iostream>

int main() {
    int device_count = 0;
    cudaError_t error = cudaGetDeviceCount(&device_count);

    if (error != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }

    std::cout << "Number of CUDA devices: " << device_count << std::endl;

    for (int i = 0; i < device_count; ++i) {
        cudaDeviceProp device_prop;
        cudaGetDeviceProperties(&device_prop, i);

        std::cout << "Device " << i << ": " << device_prop.name << std::endl;
    }

    return 0;
}
```
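
For completeness, this is roughly how I build and launch the test program. It is a sketch: the nvcc path assumes the CUDA 11.4 toolchain under /usr/local/cuda, the binary name matches the simple_cuda_test shown in the directory listing further down, and the podman flags mirror the run above.

```
# Build against the CUDA 11.4 toolchain shipped with DRIVE OS
/usr/local/cuda/bin/nvcc -o simple_cuda_test simple_cuda_test.cpp

# Launch inside a rootless podman container (this is where it fails with 801)
podman run --rm --security-opt=label=disable \
  --device nvidia.com/gpu=all \
  -v $(pwd):$(pwd) -w $(pwd) \
  ubuntu:20.04 ./simple_cuda_test
```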

Dear @xinronch,
Could you please check using the native docker client instead of podman?

Hi @SivaRamaKrishnaNV

I am running into the same issue with Docker. I followed the "Installing the NVIDIA Container Toolkit" guide from the NVIDIA Container Toolkit 1.16.0 documentation.

I tried to run the deviceQuery executable (from the CUDA samples) inside a rootful docker container, and the same "error code 801: operation not supported" error was observed:

nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ ls
cudaNvSci  deviceQuery  deviceQueryDrv  matrixMul  simpleStreams  simple_cuda_test  simple_cuda_test.cpp  simple_cuda_test.cu  test  test.cpp
nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ sudo docker run --rm -v $(pwd):$(pwd) -w $(pwd) --runtime=nvidia --gpus=all ubuntu:20.04 ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 801
-> operation not supported
Result = FAIL

All the NVIDIA-related device files are accessible inside the docker container with the expected permissions:

nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ sudo docker run --rm -v $(pwd):$(pwd) -w $(pwd) --runtime=nvidia --gpus=all ubuntu:20.04 ls -ld  /dev/nv*
drwxr-xr-x 3 root root       60 Aug 19 19:14 /dev/nvgpu
crw-rw-rw- 1 root root 472,   0 Aug 19 19:14 /dev/nvhost-as-gpu
crw-rw-rw- 1 root root 472,   2 Aug 19 19:14 /dev/nvhost-ctrl-gpu
crw-rw-rw- 1 root root 487,   0 Aug 19 19:14 /dev/nvhost-ctrl-isp
crw-rw-rw- 1 root root 496,   0 Aug 19 19:14 /dev/nvhost-ctrl-isp-thi
crw-rw-rw- 1 root root 490,   0 Aug 19 19:14 /dev/nvhost-ctrl-nvcsi
crw-rw-rw- 1 root root 495,   0 Aug 19 19:14 /dev/nvhost-ctrl-nvdla0
crw-rw-rw- 1 root root 494,   0 Aug 19 19:14 /dev/nvhost-ctrl-nvdla1
crw-rw-rw- 1 root root 492,   0 Aug 19 19:14 /dev/nvhost-ctrl-pva0
crw-rw-rw- 1 root root 488,   0 Aug 19 19:14 /dev/nvhost-ctrl-vi0
crw-rw-rw- 1 root root 498,   0 Aug 19 19:14 /dev/nvhost-ctrl-vi0-thi
crw-rw-rw- 1 root root 486,   0 Aug 19 19:14 /dev/nvhost-ctrl-vi1
crw-rw-rw- 1 root root 497,   0 Aug 19 19:14 /dev/nvhost-ctrl-vi1-thi
crw-rw-rw- 1 root root 472,   3 Aug 19 19:14 /dev/nvhost-ctxsw-gpu
crw-rw-rw- 1 root root 472,   4 Aug 19 19:14 /dev/nvhost-dbg-gpu
crw-rw-rw- 1 root root 472,   1 Aug 19 19:14 /dev/nvhost-gpu
crw-rw-rw- 1 root root 472,   9 Aug 19 19:14 /dev/nvhost-nvsched-gpu
crw-rw-rw- 1 root root 472,  10 Aug 19 19:14 /dev/nvhost-nvsched_ctrl_fifo-gpu
crw-rw-rw- 1 root root 502,   0 Aug 19 19:14 /dev/nvhost-power-gpu
crw-rw-rw- 1 root root 472,   6 Aug 19 19:14 /dev/nvhost-prof-ctx-gpu
crw-rw-rw- 1 root root 472,   7 Aug 19 19:14 /dev/nvhost-prof-dev-gpu
crw-rw-rw- 1 root root 472,   5 Aug 19 19:14 /dev/nvhost-prof-gpu
crw-rw-rw- 1 root root 472,   8 Aug 19 19:14 /dev/nvhost-sched-gpu
crw-rw-rw- 1 root root 472,  11 Aug 19 19:14 /dev/nvhost-tsg-gpu
crw-rw-rw- 1 root root 195, 254 Aug 19 19:14 /dev/nvidia-modeset
crw-rw-rw- 1 root root 195,   0 Aug 19 19:14 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Aug 19 19:14 /dev/nvidiactl
crw-rw-rw- 1 root root  10,  79 Aug 19 19:14 /dev/nvmap
crw-rw-rw- 1 root root 509,   0 Aug 19 19:14 /dev/nvpps0
crw-rw-rw- 1 root root 504,   0 Aug 19 19:14 /dev/nvsciipc

Docker, containerd, and runc versions:

nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ docker --version
Docker version 27.1.2, build d01f264
nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ runc --version
runc version 1.1.13
commit: v1.1.13-0-g58aa920
spec: 1.0.2-dev
go: go1.21.13
libseccomp: 2.5.1
nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ containerd --version
containerd containerd.io 1.7.20 8fc6bcff51318944179630522a095cc9dbf9f353

/etc/nvidia-container-runtime/config.toml

nvidia@tegra-ubuntu:~$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
no-cgroups = true
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"

devices.csv

nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ cat /etc/nvidia-container-runtime/h*/devices.csv
dev, /dev/nvhost-tsg-gpu
dev, /dev/nvhost-nvsched_ctrl_fifo-gpu
dev, /dev/nvhost-nvsched-gpu
dev, /dev/nvhost-sched-gpu
dev, /dev/nvhost-prof-dev-gpu
dev, /dev/nvhost-prof-ctx-gpu
dev, /dev/nvhost-prof-gpu
dev, /dev/nvhost-dbg-gpu
dev, /dev/nvhost-ctxsw-gpu
dev, /dev/nvhost-ctrl-gpu
dev, /dev/nvhost-gpu
dev, /dev/nvhost-as-gpu
dev, /dev/nvhost-ctrl-isp
dev, /dev/nvhost-ctrl-vi1
dev, /dev/nvhost-ctrl-nvcsi
dev, /dev/nvhost-ctrl-vi0
dev, /dev/nvhost-ctrl-pva0
dev, /dev/nvhost-ctrl-nvdla1
dev, /dev/nvhost-ctrl-nvdla0
dev, /dev/nvhost-ctrl-isp-thi
dev, /dev/nvhost-ctrl-vi1-thi
dev, /dev/nvhost-ctrl-vi0-thi
dev, /dev/nvhost-power-gpu
dev, /dev/nvgpu/igpu0/tsg
dev, /dev/nvgpu/igpu0/nvsched_ctrl_fifo
dev, /dev/nvgpu/igpu0/nvsched
dev, /dev/nvgpu/igpu0/sched
dev, /dev/nvgpu/igpu0/prof-dev
dev, /dev/nvgpu/igpu0/prof-ctx
dev, /dev/nvgpu/igpu0/prof
dev, /dev/nvgpu/igpu0/dbg
dev, /dev/nvgpu/igpu0/ctxsw
dev, /dev/nvgpu/igpu0/ctrl
dev, /dev/nvgpu/igpu0/channel
dev, /dev/nvgpu/igpu0/as
dev, /dev/nvgpu/igpu0/power
dev, /dev/nvmap
dev, /dev/nvidia0
dev, /dev/nvidiactl
dev, /dev/nvidia-modeset
dev, /dev/nvsciipc
dev, /dev/nvpps0

Confirmed that running the deviceQuery sample CUDA executable inside a --privileged Docker container works, but --privileged should be avoided because it does not align with security best practices.

nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ sudo docker run --rm -v $(pwd):$(pwd) -w $(pwd) --runtime=nvidia --gpus=all --privileged ubuntu:20.04 ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          12.1 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28954 MBytes (30360899584 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.1, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS

Confirmed that the deviceQuery sample CUDA executable can be run within a --privileged podman container as well:

nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ sudo podman run --runtime=runc --privileged -it --rm -v $(pwd):$(pwd) -w $(pwd)  --device=nvidia.com/gpu=all  --network none ubuntu ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          12.1 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28954 MBytes (30360899584 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.1, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS

Docker

  • --privileged mode
    • able to run the deviceQuery sample CUDA executable inside the container, but --privileged mode should be avoided per security best practices
  • rootful
    • running the deviceQuery executable failed with error code 801 (operation not supported)
  • rootless
    • running the deviceQuery executable failed with error code 801 (operation not supported)

Podman

  • --privileged mode
    • able to run the deviceQuery sample CUDA executable inside the container, but --privileged mode should be avoided per security best practices
  • rootful
    • running the deviceQuery executable failed with error code 801 (operation not supported)
  • rootless
    • running the deviceQuery executable failed with error code 801 (operation not supported)

@SivaRamaKrishnaNV
Can you please help us look into why the sample CUDA executable cannot be run inside rootful or rootless docker/podman containers? If this is not possible, it means the only way to access an NVIDIA GPU from a container is to run the container in --privileged mode (i.e. docker run --privileged …)

NVIDIA Driver and CUDA Versions

nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for aarch64  541.2.0  Release Build  (root@xinronch-ubuntu)  Wed Aug 14 18:24:07 UTC 2024
GCC version:  gcc version 9.3.0 (Buildroot 2020.08)
nvidia@tegra-ubuntu:/opt/nvidia/drive-linux/NVIDIA_CUDA-11.4_Samples/bin/aarch64/linux/release$ cat /usr/local/cuda/version.json
{
   "cuda" : {
      "name" : "CUDA SDK",
      "version" : "11.4.30"
   },
   "cuda_cudart" : {
      "name" : "CUDA Runtime (cudart)",
      "version" : "11.4.532"
   },
   "cuda_cuobjdump" : {
      "name" : "cuobjdump",
      "version" : "11.4.532"
   },
   "cuda_cupti" : {
      "name" : "CUPTI",
      "version" : "11.4.532"
   },
   "cuda_cuxxfilt" : {
      "name" : "CUDA cu++ filt",
      "version" : "11.4.532"
   },
   "cuda_gdb" : {
      "name" : "CUDA GDB",
      "version" : "11.4.532"
   },
   "cuda_nvcc" : {
      "name" : "CUDA NVCC",
      "version" : "11.4.532"
   },
   "cuda_nvdisasm" : {
      "name" : "CUDA nvdisasm",
      "version" : "11.4.532"
   },
   "cuda_nvprune" : {
      "name" : "CUDA nvprune",
      "version" : "11.4.532"
   },
   "cuda_nvrtc" : {
      "name" : "CUDA NVRTC",
      "version" : "11.4.532"
   },
   "cuda_nvtx" : {
      "name" : "CUDA NVTX",
      "version" : "11.4.532"
   },
   "cuda_samples" : {
      "name" : "CUDA Samples",
      "version" : "11.4.532"
   },
   "cuda_sanitizer_api" : {
      "name" : "CUDA Compute Sanitizer API",
      "version" : "11.4.532"
   },
   "cuda_thrust" : {
      "name" : "CUDA C++ Core Compute Libraries",
      "version" : "11.4.532"
   },
   "libcublas" : {
      "name" : "CUDA cuBLAS",
      "version" : "11.6.6.316"
   },
   "libcudla" : {
      "name" : "CUDA cuDLA",
      "version" : "11.4.532"
   },
   "libcufft" : {
      "name" : "CUDA cuFFT",
      "version" : "10.6.0.436"
   },
   "libcurand" : {
      "name" : "CUDA cuRAND",
      "version" : "10.2.5.531"
   },
   "libcusolver" : {
      "name" : "CUDA cuSOLVER",
      "version" : "11.2.0.531"
   },
   "libcusparse" : {
      "name" : "CUDA cuSPARSE",
      "version" : "11.6.0.531"
   },
   "libnpp" : {
      "name" : "CUDA NPP",
      "version" : "11.4.0.521"
   },
   "nsight_compute" : {
      "name" : "Nsight Compute",
      "version" : "2021.2.10.1"
   }
}

Yes. The Docker service provided on the target is considered experimental in DRIVE OS and is not recommended for production. Using the --privileged flag keeps sample execution and testing simple when the sample requires access to certain host paths and resources.

I wrote a simple C++ program that calls the cudaGetDeviceCount API of the CUDART library provided by DRIVE OS and ran it in a rootful (sudo) docker container; the same issue persisted. Does this mean that running CUDA workloads inside Docker containers on DRIVE OS is experimental and generally not recommended for production?

#include <cuda_runtime.h>
#include <iostream>

int main() {
    int device_count = 0;
    cudaError_t error = cudaGetDeviceCount(&device_count);

    if (error != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }

    std::cout << "Number of CUDA devices: " << device_count << std::endl;

    for (int i = 0; i < device_count; ++i) {
        cudaDeviceProp device_prop;
        cudaGetDeviceProperties(&device_prop, i);

        std::cout << "Device " << i << ": " << device_prop.name << std::endl;
    }

    return 0;
}
 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 801
-> operation not supported
Result = FAIL
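
For reference, the container invocation for this test was essentially the same docker run pattern as the deviceQuery attempt above; only the executable name differs (a sketch, flags may vary slightly):

```
# Same rootful docker pattern as the earlier deviceQuery run
sudo docker run --rm -v $(pwd):$(pwd) -w $(pwd) \
  --runtime=nvidia --gpus=all \
  ubuntu:20.04 ./simple_cuda_test
```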

Dear @xinronch,
I could run the application as shown below. Please file a new topic in case of any further issues.

nvidia@tegra-ubuntu:/usr/local/cuda-11.4/samples/1_Utilities/test$ ls
Makefile  test  test.cpp  test.o
nvidia@tegra-ubuntu:/usr/local/cuda-11.4/samples/1_Utilities/test$ cat test.cpp
#include <cuda_runtime.h>
#include <iostream>

int main() {
    int device_count = 0;
    cudaError_t error = cudaGetDeviceCount(&device_count);

    if (error != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }

    std::cout << "Number of CUDA devices: " << device_count << std::endl;

    for (int i = 0; i < device_count; ++i) {
        cudaDeviceProp device_prop;
        cudaGetDeviceProperties(&device_prop, i);

        std::cout << "Device " << i << ": " << device_prop.name << std::endl;
    }

    return 0;
}
nvidia@tegra-ubuntu:/usr/local/cuda-11.4/samples/1_Utilities/test$ sudo docker run --rm --privileged --network host --runtime nvidia --gpus all -v $(pwd):$(pwd) -w $(pwd) ubuntu:20.04 ./test
Number of CUDA devices: 1
Device 0: Orin