[BUG] Running cuda-samples in the target Docker container requires unintended extra permissions

Required Info:

  • Software Version
    DRIVE OS 6.0.6
  • Target OS
    Linux
  • SDK Manager Version
    1.9.2.10884
  • Host Machine Version
    native Ubuntu Linux 20.04 Host installed with DRIVE OS DOCKER Containers

Describe the bug

Following up on this blog post: https://developer.nvidia.com/blog/running-docker-containers-directly-on-nvidia-drive-agx-orin/#entry-content-comments.

On the target host, running the CUDA samples does not require sudo, while in the target Docker container, sudo is needed.

To Reproduce

# in host: compile the cuda-sample
mkdir cuda-sample && cd ./cuda-sample
cp -r /usr/local/cuda/samples/ ./
cd samples/1_Utilities/deviceQuery
make clean && make
# start and enter the container
./docker/run/orin_start.sh
./docker/run/orin_into.sh

The key command in orin_start.sh is:

+ docker run --runtime nvidia --gpus all -it -d --privileged --name gw_orin_20.04_nvidia -e DOCKER_USER=nvidia -e USER=nvidia -e DOCKER_USER_ID=1000 -e DOCKER_GRP=nvidia -e DOCKER_GRP_ID=1000 -e DOCKER_IMG=arm64v8/ros:foxy -e USE_GPU=1 -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=compute,graphics,video,utility,display -e DISPLAY -v /home/nvidia/zhensheng/orin_ws/nv_driveworks_demo/target:/target -v /usr/local/driveworks-5.10:/usr/local/driveworks-5.10 -v /usr/local/cuda-11.4:/usr/local/cuda-11.4 -v /dev:/dev -v /home/nvidia/zhensheng/cuda-sample:/home/nvidia/zhensheng/cuda-sample -v /home/nvidia/.cache:/home/nvidia/.cache -v /dev/bus/usb:/dev/bus/usb -v /media:/media -v /tmp/.X11-unix:/tmp/.X11-unix:rw -v /etc/localtime:/etc/localtime:ro -v /usr/src:/usr/src -v /lib/mgaules:/lib/mgaules --net host --ipc host --cap-add SYS_ADMIN --cap-add SYS_PTRACE -w /target --add-host in_orin_docker:127.0.0.1 --add-host tegra-ubuntu:127.0.0.1 --hostname in_orin_docker --shm-size 2G -v /dev/null:/dev/raw1394 arm64v8/ros:foxy /bin/bash

Expected behavior

# in host
./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.8 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28458 MBytes (29840424960 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS

Actual behavior

# in target-docker-container without sudo
./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

NvRmMemInitNvmap failed with Permission denied
351: NvMap init failed


****NvRmMemMgrInit failed**** error type: 196626


cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL

In the target Docker container, running the CUDA sample with sudo gives the expected result:

# in target-docker-container with sudo
sudo ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.8 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28458 MBytes (29840424960 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS

Additional context

  1. How can the extra sudo operation be avoided?

Thanks.

Dear @lizhensheng,
Could you share the Docker image you used, so we can reproduce locally? Is it the same ubuntu:20.04 used in the blog?

You can check the whole docker run command above.

For more info, the non-root user is added with this:

https://github.com/ZhenshengLee/nv_driveworks_demo/blob/42be4c663e5c8345f1cbdc988f43610b579a48e8/docker/scripts/target_adduser.sh#L5-L10

The /bin/bash shell is executed with this:

https://github.com/ZhenshengLee/nv_driveworks_demo/blob/42be4c663e5c8345f1cbdc988f43610b579a48e8/docker/run/orin_into.sh#L52-L56

I have already checked other topics in the forum; the following steps were tried but did not resolve the issue:

  1. usermod -aG sudo,video,i2c
  2. docker run --privileged
  3. docker exec -u $USER
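
A quick way to check where the permission failure comes from (a hedged diagnostic sketch: /dev/nvmap's group ownership is an assumption based on typical Tegra setups, and nvidia is the container user from orin_start.sh):

```shell
# Hedged diagnostic sketch: the NvRmMemInitNvmap "Permission denied" error is
# usually about device-node access, not the deviceQuery binary's file mode.
# /dev is bind-mounted into the container, and on Tegra systems /dev/nvmap is
# typically group-owned by "video". "nvidia" is assumed to be the container user.
user=nvidia

# Who owns the nvmap node? (only meaningful on the target, inside the container)
ls -l /dev/nvmap 2>/dev/null || echo "/dev/nvmap not present on this machine"

# Is the user actually in the video group in this session?
if id -nG "$user" 2>/dev/null | tr ' ' '\n' | grep -qx video; then
    msg="user $user is in the video group"
else
    msg="user $user is NOT in the video group; try: sudo usermod -aG video $user, then re-enter the container"
fi
echo "$msg"
```

Note that group changes only apply to new sessions: after a usermod, the container has to be re-entered (a fresh docker exec) before id reflects the new group.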

FYI, thanks.

arm64v8/ros:foxy is on Docker Hub (Image Layer Details - arm64v8/ros:foxy) and is based on ubuntu:20.04.

Friendly ping @SivaRamaKrishnaNV @VickNV for updates.

Dear @lizhensheng,
I have yet to get an update from the engineering team. May I know if this blocks your development?

@SivaRamaKrishnaNV
Yes, it blocks development in the target Docker container.

I haven’t found any workable solution to this permission issue.

As for the topic “[BUG] dwcgf error of NvSciIpcOpenEndpoint with shm header not cleared” (nvidia.com), I can’t reply because it’s closed.

What I know is that running multiple CGF app instances causes the shm-header-not-cleared error.

Thanks.

Dear @lizhensheng,
Can you share the output of ls -la /usr/local/cuda-11.4/samples/bin/aarch64/linux/release/deviceQuery to confirm whether other users lack execute permission?
Regarding the “shm-header-not-cleared” issue, I noticed that killing the LoaderLite process avoids it after launching a CGF app.

I quickly tested your command and noticed no issue.

nvidia@tegra-ubuntu:~$ docker run --runtime nvidia --gpus all -it -d --privileged --name gw_orin_20.04_nvidia -e DOCKER_USER=nvidia -e USER=nvidia -e DOCKER_USER_ID=1000 -e DOCKER_GRP=nvidia -e DOCKER_GRP_ID=1000 -e DOCKER_IMG=arm64v8/ros:foxy -e USE_GPU=1 -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=compute,graphics,video,utility,display -e DISPLAY -v /home/nvidia/zhensheng/orin_ws/nv_driveworks_demo/target:/target -v /usr/local/driveworks-5.10:/usr/local/driveworks-5.10 -v /usr/local/cuda-11.4:/usr/local/cuda-11.4 -v /dev:/dev -v /home/nvidia/zhensheng/cuda-sample:/home/nvidia/zhensheng/cuda-sample -v /home/nvidia/.cache:/home/nvidia/.cache -v /dev/bus/usb:/dev/bus/usb -v /media:/media -v /tmp/.X11-unix:/tmp/.X11-unix:rw -v /etc/localtime:/etc/localtime:ro -v /usr/src:/usr/src -v /lib/mgaules:/lib/mgaules --net host --ipc host --cap-add SYS_ADMIN --cap-add SYS_PTRACE -w /target --add-host in_orin_docker:127.0.0.1 --add-host tegra-ubuntu:127.0.0.1 --hostname in_orin_docker --shm-size 2G -v /dev/null:/dev/raw1394 arm64v8/ros:foxy /bin/bash
1b9daa04c4462d7bbcca9a5623fc50d5cdaf48017dc595f7b0c77ed9c86600d6
nvidia@tegra-ubuntu:~$ docker ps -a
CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS                      PORTS     NAMES
1b9daa04c446   arm64v8/ros:foxy       "/ros_entrypoint.sh …"   39 seconds ago   Up 38 seconds                         gw_orin_20.04_nvidia
c954285372e1   arm64v8/ros:foxy       "/ros_entrypoint.sh …"   55 minutes ago   Exited (0) 55 minutes ago             friendly_benz
661b1990f6bc   arm64v8/ubuntu:focal   "/bin/bash"              5 days ago       Exited (0) 5 days ago                 bold_chatterjee
54c7fe337a5e   arm64v8/ros:foxy       "/ros_entrypoint.sh …"   6 days ago       Exited (255) 11 hours ago             my_gw_orin_20.04_nvidia
nvidia@tegra-ubuntu:~$ docker attach 1b9daa04c446
root@in_orin_docker:/target# cd /usr/local/cuda-11.4/bin/
root@in_orin_docker:/usr/local/cuda-11.4/bin# ls
bin2c              crt       cuda-gdb        cuda-install-samples-11.4.sh  cuobjdump  nvcc          nvdisasm  nvprune
compute-sanitizer  cu++filt  cuda-gdbserver  cudafe++                      fatbinary  nvcc.profile  nvlink    ptxas
root@in_orin_docker:/usr/local/cuda-11.4/bin# cd ../samples/bin/aarch64/linux/release/
root@in_orin_docker:/usr/local/cuda-11.4/samples/bin/aarch64/linux/release# ls
deviceQuery  matrixMul
root@in_orin_docker:/usr/local/cuda-11.4/samples/bin/aarch64/linux/release# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.8 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28458 MBytes (29840424960 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
root@in_orin_docker:/usr/local/cuda-11.4/samples/bin/aarch64/linux/release#

@SivaRamaKrishnaNV

You are running deviceQuery as root, which is not the scenario being reported.

You can reproduce the behavior as follows.

In the host: make and run deviceQuery

nvidia@tegra-ubuntu:~/zhensheng/orin_ws/nv_driveworks_demo/target$ cd /usr/local/cuda-11.4/samples/1_Utilities/deviceQuery/
nvidia@tegra-ubuntu:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ sudo make
[sudo] password for nvidia: 
/usr/local/cuda-11.4/bin/nvcc -ccbin g++ -I../../common/inc  -m64    --threads 0 --std=c++11 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda-11.4/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery deviceQuery.o 
mkdir -p ../../bin/aarch64/linux/release
cp deviceQuery ../../bin/aarch64/linux/release
nvidia@tegra-ubuntu:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$  ls -la /usr/local/cuda-11.4/samples/bin/aarch64/linux/release/deviceQuery 
-rwxr-xr-x. 1 root root 819928 Feb 13 17:11 /usr/local/cuda-11.4/samples/bin/aarch64/linux/release/deviceQuery
nvidia@tegra-ubuntu:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.8 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28458 MBytes (29840424960 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS

In the container: run deviceQuery as user nvidia

docker exec \
    -u nvidia \
    -e HISTFILE=/target/.dev_bash_hist \
    -it gw_orin_20.04_nvidia \
    /bin/bash

nvidia@tegra-ubuntu:~/zhensheng/orin_ws/nv_driveworks_demo/target$ ./docker/run/orin_into.sh 
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

selecting project in: /gw_demo
nvidia@in_orin_docker:/target$ cd /usr/local/cuda-11.4/samples/1_Utilities/deviceQuery/
nvidia@in_orin_docker:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ ls
Makefile  NsightEclipse.xml  deviceQuery.cpp  readme.txt
nvidia@in_orin_docker:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ sudo make 
/usr/local/cuda-11.4/bin/nvcc -ccbin g++ -I../../common/inc  -m64    --threads 0 --std=c++11 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery.o -c deviceQuery.cpp
Assembler messages:
Fatal error: can't create deviceQuery.o: Read-only file system
make: *** [Makefile:326: deviceQuery.o] Error 255

nvidia@in_orin_docker:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ ls -la /usr/local/cuda-11.4/samples/bin/aarch64/linux/release/deviceQuery 
-rwxr-xr-x. 1 root root 819928 Feb 13 17:11 /usr/local/cuda-11.4/samples/bin/aarch64/linux/release/deviceQuery

nvidia@in_orin_docker:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

NvRmMemInitNvmap failed with Permission denied
351: NvMap init failed


****NvRmMemMgrInit failed**** error type: 196626


cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL

This could be a useful reference: l4t-ros2-docker/Dockerfile at main · atinfinity/l4t-ros2-docker · GitHub

I copied the CUDA samples from /usr/local/cuda/samples to /home/nvidia/zhensheng and tested there.

nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ make clean .
rm -f deviceQuery deviceQuery.o
rm -rf ../../bin/aarch64/linux/release/deviceQuery
make: Nothing to be done for '.'.
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ make .
make: Nothing to be done for '.'.
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ ls
Makefile  NsightEclipse.xml  deviceQuery.cpp  readme.txt
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ make all
/usr/local/cuda-11.4/bin/nvcc -ccbin g++ -I../../common/inc  -m64    --threads 0 --std=c++11 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda-11.4/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery deviceQuery.o 
mkdir -p ../../bin/aarch64/linux/release
cp deviceQuery ../../bin/aarch64/linux/release
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ ls -la
total 876
drwxr-xr-x. 3 nvidia nvidia   4096 Feb 13 17:34 .
drwxr-xr-x. 7 nvidia nvidia   4096 Feb  4 05:19 ..
drwxr-xr-x. 2 nvidia nvidia   4096 Feb  4 05:19 .vscode
-rw-r--r--. 1 nvidia nvidia  12414 Feb  4 05:19 Makefile
-rw-r--r--. 1 nvidia nvidia   1789 Feb  4 05:19 NsightEclipse.xml
-rwxr-xr-x. 1 nvidia nvidia 819928 Feb 13 17:34 deviceQuery
-rw-r--r--. 1 nvidia nvidia  12721 Feb  4 05:19 deviceQuery.cpp
-rw-r--r--. 1 nvidia nvidia  19352 Feb 13 17:33 deviceQuery.o
-rw-r--r--. 1 nvidia nvidia    168 Feb  4 05:19 readme.txt
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

NvRmMemInitNvmap failed with Permission denied
351: NvMap init failed


****NvRmMemMgrInit failed**** error type: 196626


cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ sudo ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.8 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28458 MBytes (29840424960 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ 

Adding the container user to the video group with usermod -aG video "$DOCKER_USER" solves the issue.
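
To avoid repeating the usermod in every new container, the group membership could be baked into the image. A minimal sketch, not the project's actual Dockerfile; DOCKER_USER and the GID 44 are assumptions here, since /dev is bind-mounted from the host, the video group inside the container should carry the host's GID (check with getent group video on the target):

```dockerfile
# Hedged sketch: assumes the non-root user already exists in the image and that
# the host's "video" GID is 44 (verify with `getent group video` on the target).
# The GIDs must match because /dev is bind-mounted from the host.
ARG DOCKER_USER=nvidia
RUN groupadd -f -g 44 video && \
    usermod -aG video "${DOCKER_USER}"
```

If the user is instead created at container start (as docker/scripts/target_adduser.sh does here), the same usermod -aG video "$DOCKER_USER" line can simply be appended to that script, followed by re-entering the container so the new group takes effect.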
