[BUG] Running cuda-samples in the target Docker container requires unintended extra permissions

Required Info:

  • Software Version
    DRIVE OS 6.0.6
  • Target OS
    Linux
  • SDK Manager Version
    1.9.2.10884
  • Host Machine Version
    native Ubuntu Linux 20.04 Host installed with DRIVE OS DOCKER Containers

Describe the bug

Following up on this blog post: https://developer.nvidia.com/blog/running-docker-containers-directly-on-nvidia-drive-agx-orin/#entry-content-comments.

On the target host, running the CUDA samples does not require sudo, while in the target Docker container, sudo is needed.

To Reproduce

# in host: compile the cuda-sample
mkdir cuda-sample && cd ./cuda-sample
cp -r /usr/local/cuda/samples/ ./
cd samples/1_Utilities/deviceQuery
make clean && make
# start and enter the container
./docker/run/orin_start.sh
./docker/run/orin_into.sh

The key command in orin_start.sh is:

+ docker run --runtime nvidia --gpus all -it -d --privileged --name gw_orin_20.04_nvidia -e DOCKER_USER=nvidia -e USER=nvidia -e DOCKER_USER_ID=1000 -e DOCKER_GRP=nvidia -e DOCKER_GRP_ID=1000 -e DOCKER_IMG=arm64v8/ros:foxy -e USE_GPU=1 -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=compute,graphics,video,utility,display -e DISPLAY -v /home/nvidia/zhensheng/orin_ws/nv_driveworks_demo/target:/target -v /usr/local/driveworks-5.10:/usr/local/driveworks-5.10 -v /usr/local/cuda-11.4:/usr/local/cuda-11.4 -v /dev:/dev -v /home/nvidia/zhensheng/cuda-sample:/home/nvidia/zhensheng/cuda-sample -v /home/nvidia/.cache:/home/nvidia/.cache -v /dev/bus/usb:/dev/bus/usb -v /media:/media -v /tmp/.X11-unix:/tmp/.X11-unix:rw -v /etc/localtime:/etc/localtime:ro -v /usr/src:/usr/src -v /lib/mgaules:/lib/mgaules --net host --ipc host --cap-add SYS_ADMIN --cap-add SYS_PTRACE -w /target --add-host in_orin_docker:127.0.0.1 --add-host tegra-ubuntu:127.0.0.1 --hostname in_orin_docker --shm-size 2G -v /dev/null:/dev/raw1394 arm64v8/ros:foxy /bin/bash

Expected behavior

# in host
./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.8 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28458 MBytes (29840424960 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS

Actual behavior

# in target-docker-container without sudo
./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

NvRmMemInitNvmap failed with Permission denied
351: NvMap init failed


****NvRmMemMgrInit failed**** error type: 196626


cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL

In the target Docker container, running the CUDA sample with sudo gives the expected result:

# in target-docker-container with sudo
sudo ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.8 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28458 MBytes (29840424960 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS

Additional context

  1. How can the extra sudo operation be avoided?

Thanks.

Dear @lizhensheng,
Could you share the Docker image you used, so we can reproduce locally? Is it the same ubuntu:20.04 used in the blog?

You can check the whole docker run command above.

For more info, the non-root user is added with this:

https://github.com/ZhenshengLee/nv_driveworks_demo/blob/42be4c663e5c8345f1cbdc988f43610b579a48e8/docker/scripts/target_adduser.sh#L5-L10

The /bin/bash shell is executed with this:

https://github.com/ZhenshengLee/nv_driveworks_demo/blob/42be4c663e5c8345f1cbdc988f43610b579a48e8/docker/run/orin_into.sh#L52-L56

I have already checked other topics in the forum; the following steps were tried but did not resolve the issue:

  1. usermod -aG sudo,video,i2c
  2. docker run --privileged
  3. docker exec -u $USER
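
A quick way to check where the permission failure comes from (a hedged diagnostic sketch: /dev/nvmap's group ownership is an assumption based on typical Tegra setups, and nvidia is the container user from orin_start.sh):

```shell
# Hedged diagnostic sketch: the NvRmMemInitNvmap "Permission denied" error is
# usually about device-node access, not the deviceQuery binary's file mode.
# /dev is bind-mounted into the container, and on Tegra systems /dev/nvmap is
# typically group-owned by "video". "nvidia" is assumed to be the container user.
user=nvidia

# Who owns the nvmap node? (only meaningful on the target, inside the container)
ls -l /dev/nvmap 2>/dev/null || echo "/dev/nvmap not present on this machine"

# Is the user actually in the video group in this session?
if id -nG "$user" 2>/dev/null | tr ' ' '\n' | grep -qx video; then
    msg="user $user is in the video group"
else
    msg="user $user is NOT in the video group; try: sudo usermod -aG video $user, then re-enter the container"
fi
echo "$msg"
```

Note that group changes only apply to new sessions: after a usermod, the container has to be re-entered (a fresh docker exec) before id reflects the new group.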

FYI, thanks.

arm64v8/ros:foxy is on Docker Hub (Image Layer Details - arm64v8/ros:foxy) and is based on ubuntu:20.04.

Friendly ping @SivaRamaKrishnaNV @VickNV for updates.

Dear @lizhensheng,
I have yet to get an update from the engineering team. May I know if this blocks your development?

@SivaRamaKrishnaNV
Yes, it blocks development in the target Docker container.

I haven’t found any workable solution to this permission issue.

As for the topic “[BUG] dwcgf error of NvSciIpcOpenEndpoint with shm header not cleared” (nvidia.com), I can’t reply because it’s closed.

What I know is that running multiple CGF app instances causes the shm-header-not-cleared error.

Thanks.

Dear @lizhensheng,
Can you share the output of ls -la /usr/local/cuda-11.4/samples/bin/aarch64/linux/release/deviceQuery to confirm whether other users lack execute permission?
Regarding the “shm-header-not-cleared” issue, I noticed that killing the LoaderLite process avoids it after launching a CGF app.

I quickly tested your command and noticed no issue.

nvidia@tegra-ubuntu:~$ docker run --runtime nvidia --gpus all -it -d --privileged --name gw_orin_20.04_nvidia -e DOCKER_USER=nvidia -e USER=nvidia -e DOCKER_USER_ID=1000 -e DOCKER_GRP=nvidia -e DOCKER_GRP_ID=1000 -e DOCKER_IMG=arm64v8/ros:foxy -e USE_GPU=1 -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=compute,graphics,video,utility,display -e DISPLAY -v /home/nvidia/zhensheng/orin_ws/nv_driveworks_demo/target:/target -v /usr/local/driveworks-5.10:/usr/local/driveworks-5.10 -v /usr/local/cuda-11.4:/usr/local/cuda-11.4 -v /dev:/dev -v /home/nvidia/zhensheng/cuda-sample:/home/nvidia/zhensheng/cuda-sample -v /home/nvidia/.cache:/home/nvidia/.cache -v /dev/bus/usb:/dev/bus/usb -v /media:/media -v /tmp/.X11-unix:/tmp/.X11-unix:rw -v /etc/localtime:/etc/localtime:ro -v /usr/src:/usr/src -v /lib/mgaules:/lib/mgaules --net host --ipc host --cap-add SYS_ADMIN --cap-add SYS_PTRACE -w /target --add-host in_orin_docker:127.0.0.1 --add-host tegra-ubuntu:127.0.0.1 --hostname in_orin_docker --shm-size 2G -v /dev/null:/dev/raw1394 arm64v8/ros:foxy /bin/bash
1b9daa04c4462d7bbcca9a5623fc50d5cdaf48017dc595f7b0c77ed9c86600d6
nvidia@tegra-ubuntu:~$ docker ps -a
CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS                      PORTS     NAMES
1b9daa04c446   arm64v8/ros:foxy       "/ros_entrypoint.sh …"   39 seconds ago   Up 38 seconds                         gw_orin_20.04_nvidia
c954285372e1   arm64v8/ros:foxy       "/ros_entrypoint.sh …"   55 minutes ago   Exited (0) 55 minutes ago             friendly_benz
661b1990f6bc   arm64v8/ubuntu:focal   "/bin/bash"              5 days ago       Exited (0) 5 days ago                 bold_chatterjee
54c7fe337a5e   arm64v8/ros:foxy       "/ros_entrypoint.sh …"   6 days ago       Exited (255) 11 hours ago             my_gw_orin_20.04_nvidia
nvidia@tegra-ubuntu:~$ docker attach 1b9daa04c446
root@in_orin_docker:/target# cd /usr/local/cuda-11.4/bin/
root@in_orin_docker:/usr/local/cuda-11.4/bin# ls
bin2c              crt       cuda-gdb        cuda-install-samples-11.4.sh  cuobjdump  nvcc          nvdisasm  nvprune
compute-sanitizer  cu++filt  cuda-gdbserver  cudafe++                      fatbinary  nvcc.profile  nvlink    ptxas
root@in_orin_docker:/usr/local/cuda-11.4/bin# cd ../samples/bin/aarch64/linux/release/
root@in_orin_docker:/usr/local/cuda-11.4/samples/bin/aarch64/linux/release# ls
deviceQuery  matrixMul
root@in_orin_docker:/usr/local/cuda-11.4/samples/bin/aarch64/linux/release# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.8 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28458 MBytes (29840424960 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
root@in_orin_docker:/usr/local/cuda-11.4/samples/bin/aarch64/linux/release#

@SivaRamaKrishnaNV

You are running deviceQuery as root, which is not the scenario being reported.

You can reproduce the behavior as follows.

In the host: make and run deviceQuery

nvidia@tegra-ubuntu:~/zhensheng/orin_ws/nv_driveworks_demo/target$ cd /usr/local/cuda-11.4/samples/1_Utilities/deviceQuery/
nvidia@tegra-ubuntu:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ sudo make
[sudo] password for nvidia: 
/usr/local/cuda-11.4/bin/nvcc -ccbin g++ -I../../common/inc  -m64    --threads 0 --std=c++11 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda-11.4/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery deviceQuery.o 
mkdir -p ../../bin/aarch64/linux/release
cp deviceQuery ../../bin/aarch64/linux/release
nvidia@tegra-ubuntu:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$  ls -la /usr/local/cuda-11.4/samples/bin/aarch64/linux/release/deviceQuery 
-rwxr-xr-x. 1 root root 819928 Feb 13 17:11 /usr/local/cuda-11.4/samples/bin/aarch64/linux/release/deviceQuery
nvidia@tegra-ubuntu:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.8 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28458 MBytes (29840424960 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS

In the container: run deviceQuery as user nvidia

docker exec \
    -u nvidia \
    -e HISTFILE=/target/.dev_bash_hist \
    -it gw_orin_20.04_nvidia \
    /bin/bash

nvidia@tegra-ubuntu:~/zhensheng/orin_ws/nv_driveworks_demo/target$ ./docker/run/orin_into.sh 
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

selecting project in: /gw_demo
nvidia@in_orin_docker:/target$ cd /usr/local/cuda-11.4/samples/1_Utilities/deviceQuery/
nvidia@in_orin_docker:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ ls
Makefile  NsightEclipse.xml  deviceQuery.cpp  readme.txt
nvidia@in_orin_docker:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ sudo make 
/usr/local/cuda-11.4/bin/nvcc -ccbin g++ -I../../common/inc  -m64    --threads 0 --std=c++11 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery.o -c deviceQuery.cpp
Assembler messages:
Fatal error: can't create deviceQuery.o: Read-only file system
make: *** [Makefile:326: deviceQuery.o] Error 255

nvidia@in_orin_docker:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ ls -la /usr/local/cuda-11.4/samples/bin/aarch64/linux/release/deviceQuery 
-rwxr-xr-x. 1 root root 819928 Feb 13 17:11 /usr/local/cuda-11.4/samples/bin/aarch64/linux/release/deviceQuery

nvidia@in_orin_docker:/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

NvRmMemInitNvmap failed with Permission denied
351: NvMap init failed


****NvRmMemMgrInit failed**** error type: 196626


cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL

This could be a useful reference: l4t-ros2-docker/Dockerfile at main · atinfinity/l4t-ros2-docker · GitHub

I copied the CUDA samples from /usr/local/cuda/samples to /home/nvidia/zhensheng and tested there.

nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ make clean .
rm -f deviceQuery deviceQuery.o
rm -rf ../../bin/aarch64/linux/release/deviceQuery
make: Nothing to be done for '.'.
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ make .
make: Nothing to be done for '.'.
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ ls
Makefile  NsightEclipse.xml  deviceQuery.cpp  readme.txt
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ make all
/usr/local/cuda-11.4/bin/nvcc -ccbin g++ -I../../common/inc  -m64    --threads 0 --std=c++11 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda-11.4/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery deviceQuery.o 
mkdir -p ../../bin/aarch64/linux/release
cp deviceQuery ../../bin/aarch64/linux/release
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ ls -la
total 876
drwxr-xr-x. 3 nvidia nvidia   4096 Feb 13 17:34 .
drwxr-xr-x. 7 nvidia nvidia   4096 Feb  4 05:19 ..
drwxr-xr-x. 2 nvidia nvidia   4096 Feb  4 05:19 .vscode
-rw-r--r--. 1 nvidia nvidia  12414 Feb  4 05:19 Makefile
-rw-r--r--. 1 nvidia nvidia   1789 Feb  4 05:19 NsightEclipse.xml
-rwxr-xr-x. 1 nvidia nvidia 819928 Feb 13 17:34 deviceQuery
-rw-r--r--. 1 nvidia nvidia  12721 Feb  4 05:19 deviceQuery.cpp
-rw-r--r--. 1 nvidia nvidia  19352 Feb 13 17:33 deviceQuery.o
-rw-r--r--. 1 nvidia nvidia    168 Feb  4 05:19 readme.txt
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

NvRmMemInitNvmap failed with Permission denied
351: NvMap init failed


****NvRmMemMgrInit failed**** error type: 196626


cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ sudo ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.8 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 28458 MBytes (29840424960 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1275 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS
nvidia@in_orin_docker:~/zhensheng/cuda-sample/samples/1_Utilities/deviceQuery$ 

Adding the container user to the video group with usermod -aG video "$DOCKER_USER" solves the issue.
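
To avoid repeating the usermod in every new container, the group membership could be baked into the image. A minimal sketch, not the project's actual Dockerfile; DOCKER_USER and the GID 44 are assumptions here, since /dev is bind-mounted from the host, the video group inside the container should carry the host's GID (check with getent group video on the target):

```dockerfile
# Hedged sketch: assumes the non-root user already exists in the image and that
# the host's "video" GID is 44 (verify with `getent group video` on the target).
# The GIDs must match because /dev is bind-mounted from the host.
ARG DOCKER_USER=nvidia
RUN groupadd -f -g 44 video && \
    usermod -aG video "${DOCKER_USER}"
```

If the user is instead created at container start (as docker/scripts/target_adduser.sh does here), the same usermod -aG video "$DOCKER_USER" line can simply be appended to that script, followed by re-entering the container so the new group takes effect.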
