Environments are nearly identical, but a minimal code snippet won't run in Docker

Overview: I have a minimal code snippet that runs directly on the OS but not in Docker. Running the snippet below inside the Docker container raises CUDNN_STATUS_NOT_INITIALIZED; directly on the OS it runs without error.
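
A quicker way to see the same failure, without the full snippet below (a minimal sketch using the standard torch.backends.cudnn API):

import torch

print("CUDA available:", torch.cuda.is_available())
print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN version seen by torch:", torch.backends.cudnn.version())

# A tiny conv2d forces cuDNN handle creation; inside the container this
# should hit the same CUDNN_STATUS_NOT_INITIALIZED as the full snippet.
x = torch.zeros(1, 1, 8, 8, device="cuda")
w = torch.zeros(1, 1, 3, 3, device="cuda")
print(torch.nn.functional.conv2d(x, w).shape)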

What I have already tried (3-4 days spent on this so far):

  1. Both JP 6.1 and JP 6.0
  2. Compiling torch 2.3.1 and 2.5.0 on the Jetson myself (using different nvidia/cuda images from Docker Hub)
  3. Trying on an Orin NX as well
  4. Trying different numpy and numba versions

Here’s “dpkg -l | grep cuda” from inside the broken environment:

ii  cuda-cccl-12-2                  12.2.140-1                              arm64        CUDA CCCL
ii  cuda-command-line-tools-12-2    12.2.2-1                                arm64        CUDA command-line tools
ii  cuda-compiler-12-2              12.2.2-1                                arm64        CUDA compiler
ii  cuda-crt-12-2                   12.2.140-1                              arm64        CUDA crt
ii  cuda-cudart-12-2                12.2.140-1                              arm64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-12-2            12.2.140-1                              arm64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-12-2             12.2.140-1                              arm64        CUDA cuobjdump
ii  cuda-cupti-12-2                 12.2.142-1                              arm64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-12-2             12.2.142-1                              arm64        CUDA profiling tools interface.
ii  cuda-cuxxfilt-12-2              12.2.140-1                              arm64        CUDA cuxxfilt
ii  cuda-driver-dev-12-2            12.2.140-1                              arm64        CUDA Driver native dev stub library
ii  cuda-gdb-12-2                   12.2.140-1                              arm64        CUDA-GDB
ii  cuda-keyring                    1.0-1                                   all          GPG keyring for the CUDA repository
ii  cuda-libraries-12-2             12.2.2-1                                arm64        CUDA Libraries 12.2 meta-package
ii  cuda-libraries-dev-12-2         12.2.2-1                                arm64        CUDA Libraries 12.2 development meta-package
ii  cuda-minimal-build-12-2         12.2.2-1                                arm64        Minimal CUDA 12.2 toolkit build packages.
ii  cuda-nsight-compute-12-2        12.2.2-1                                arm64        NVIDIA Nsight Compute
ii  cuda-nvcc-12-2                  12.2.140-1                              arm64        CUDA nvcc
ii  cuda-nvdisasm-12-2              12.2.140-1                              arm64        CUDA disassembler
ii  cuda-nvml-dev-12-2              12.2.140-1                              arm64        NVML native dev links, headers
ii  cuda-nvprune-12-2               12.2.140-1                              arm64        CUDA nvprune
ii  cuda-nvrtc-12-2                 12.2.140-1                              arm64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-12-2             12.2.140-1                              arm64        NVRTC native dev links, headers
ii  cuda-nvtx-12-2                  12.2.140-1                              arm64        NVIDIA Tools Extension
ii  cuda-nvvm-12-2                  12.2.140-1                              arm64        CUDA nvvm
ii  cuda-profiler-api-12-2          12.2.140-1                              arm64        CUDA Profiler API
ii  cuda-sanitizer-12-2             12.2.140-1                              arm64        CUDA Sanitizer
ii  cuda-toolkit-12-2-config-common 12.2.140-1                              all          Common config package for CUDA Toolkit 12.2.
ii  cuda-toolkit-12-config-common   12.2.140-1                              all          Common config package for CUDA Toolkit 12.
ii  cuda-toolkit-config-common      12.2.140-1                              all          Common config package for CUDA Toolkit.
hi  libcudnn8                       8.9.6.50-1+cuda12.2                     arm64        cuDNN runtime libraries
ii  libcudnn8-dev                   8.9.6.50-1+cuda12.2                     arm64        cuDNN development libraries and headers
hi  libnccl-dev                     2.19.3-1+cuda12.2                       arm64        NVIDIA Collective Communication Library (NCCL) Development Files
hi  libnccl2                        2.19.3-1+cuda12.2                       arm64        NVIDIA Collective Communication Library (NCCL) Runtime
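
For completeness, this is how I check which cuDNN the dynamic linker actually resolves inside the container (standard ldconfig/ldd usage; the torch install path may differ on your setup):

ldconfig -p | grep libcudnn
ldd /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so | grep -i cudnn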

Hardware: AGX Orin 64GB
JP: 6.0 rev 2
OS: Ubuntu 22.04

Minimal example code snippet:

import numpy as np
import torch as th
import torch.fft
import torch.nn.functional as F
from scipy.ndimage._filters import _gaussian_kernel1d

def cuda_downsample(th_img, factor=2):
    # Build a 1D Gaussian kernel (scipy's private helper) and flip it for convolution.
    gaussian_kernel = _gaussian_kernel1d(sigma=factor * 0.5, order=0, radius=int(4 * factor * 0.5 + 0.5))[::-1].copy()
    th_gaussian_kernel = th.as_tensor(gaussian_kernel, dtype=th.float32, device="cuda")
    # Separable Gaussian blur; these F.conv2d calls dispatch to cuDNN, and this is
    # where CUDNN_STATUS_NOT_INITIALIZED is raised inside the container.
    temp = F.conv2d(th_img, th_gaussian_kernel[None, None, :, None])  # convolve y
    th_filteredImage = F.conv2d(temp, th_gaussian_kernel[None, None, None, :])  # convolve x
    # Decimate by `factor` after filtering.
    h2, w2 = np.floor(np.array(th_filteredImage.shape[2:]) / float(factor)).astype(int)
    return th_filteredImage[:, :, :h2 * factor:factor, :w2 * factor:factor]

def main():
    # 12 MP single-channel placeholder image in NCHW layout.
    img = np.zeros(3024 * 4032, dtype=np.uint32)
    img = np.reshape(img, (1, 1, 3024, 4032))

    torch_img_grey = th.as_tensor(img, dtype=th.float32, device="cuda")
    torch_img_grey = cuda_downsample(torch_img_grey)

if __name__ == "__main__":
    main()

Additional notes:

  1. deviceQuery would not compile; I didn't investigate this further.
  2. The Python packages for torch, torchvision, numpy, numba, and scipy are the same in both environments.
  3. The Jetson is freshly flashed and I haven't changed anything on it aside from installing some basic packages via apt.
  4. These devices are air gapped and don't have access to the internet (see the transfer sketch after this list).
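
Since the devices are offline, images and wheels come from an internal mirror (the custom/reg names in the Dockerfile below). If you want to reproduce without a mirror, the equivalent offline transfer is the standard save/load round trip:

# On an internet-connected machine:
docker pull --platform=linux/arm64 nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04
docker save nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04 -o cuda-12.2.2.tar
# On the air-gapped Jetson:
docker load -i cuda-12.2.2.tar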

Env 1 (code runs as expected):
Running directly on the OS

Env 2 (code does not run):
Dockerfile:
The base image is from the following command:

docker pull --platform=linux/arm64 nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04

The torch and torchvision wheels are from this post and are the matching versions for JP 6.0 with CUDA 12.2.

FROM custom/reg/cuda:12.2.2-cudnn8-devel-ubuntu22.04

RUN apt update && \
    apt install -y build-essential libopenblas-base libopenmpi-dev libomp-dev python3 vim python3-pip

RUN pip3 install "numba==0.60.0" "numpy==1.23.4" "scipy==1.10.0"
RUN pip3 install -i custom/reg/simple torch torchvision

COPY . /test
WORKDIR /test
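
I build and run the container like this (assuming the nvidia container runtime that JetPack configures; the image tag is just a placeholder, and CUDA is visible inside, per the collect_env output below):

docker build -t cudnn-test .
docker run --rm -it --runtime nvidia cudnn-test python3 main.py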

Env 1 collect_env:

/usr/lib/python3.10/runpy.py:126: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Collecting environment information...
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.136-tegra-aarch64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Orin (nvgpu)
Nvidia driver version: N/A
cuDNN version: Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.8.9.4
/usr/lib/aarch64-linux-gnu/libcudnn_adv_infer.so.8.9.4
/usr/lib/aarch64-linux-gnu/libcudnn_adv_train.so.8.9.4
/usr/lib/aarch64-linux-gnu/libcudnn_cnn_infer.so.8.9.4
/usr/lib/aarch64-linux-gnu/libcudnn_cnn_train.so.8.9.4
/usr/lib/aarch64-linux-gnu/libcudnn_ops_infer.so.8.9.4
/usr/lib/aarch64-linux-gnu/libcudnn_ops_train.so.8.9.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       aarch64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
CPU(s):                             12
On-line CPU(s) list:                0-7
Off-line CPU(s) list:               8-11
Vendor ID:                          ARM
Model name:                         Cortex-A78AE
Model:                              1
Thread(s) per core:                 1
Core(s) per cluster:                4
Socket(s):                          -
Cluster(s):                         2
Stepping:                           r0p1
CPU max MHz:                        2201.6001
CPU min MHz:                        115.2000
BogoMIPS:                           62.50
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp uscat ilrcpc flagm paca pacg
L1d cache:                          512 KiB (8 instances)
L1i cache:                          512 KiB (8 instances)
L2 cache:                           2 MiB (8 instances)
L3 cache:                           4 MiB (2 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; CSV2, but not BHB
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.23.4
[pip3] onnx-graphsurgeon==0.3.12
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0a0+6043bc2
[conda] Could not collect

Env 2 collect_env:

/usr/lib/python3.10/runpy.py:126: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Collecting environment information...
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.136-tegra-aarch64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Orin (nvgpu)
Nvidia driver version: N/A
cuDNN version: Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/aarch64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/aarch64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/aarch64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/aarch64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/aarch64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/aarch64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       aarch64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
CPU(s):                             12
On-line CPU(s) list:                0-7
Off-line CPU(s) list:               8-11
Vendor ID:                          ARM
Model name:                         Cortex-A78AE
Model:                              1
Thread(s) per core:                 1
Core(s) per cluster:                4
Socket(s):                          -
Cluster(s):                         2
Stepping:                           r0p1
CPU max MHz:                        2201.6001
CPU min MHz:                        115.2000
BogoMIPS:                           62.50
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp uscat ilrcpc flagm paca pacg
L1d cache:                          512 KiB (8 instances)
L1i cache:                          512 KiB (8 instances)
L2 cache:                           2 MiB (8 instances)
L3 cache:                           4 MiB (2 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; CSV2, but not BHB
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.23.4
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0a0+6043bc2
[conda] Could not collect

SDK Manager about: (screenshot of SDK Manager's About dialog)

Hi,

Could you try to use the l4t-based CUDA image?

Thanks.

I actually tried this too but ran into other issues. I think it was because something was compiled against cuDNN 8.x, and the cuDNN installation instructions now lead to installing cuDNN 9.x.

Using the wheels from here; I think they're linked against cuDNN 8.

Starting from the 12.2.2-devel-arm64-ubuntu22.04 image and following the cuDNN installation guide, I just get this error:

root@5c57fc15fdb4:/test# dpkg -l | grep cudnn
ii  cudnn9-cuda-12                  9.5.1.17-1                              arm64        NVIDIA cuDNN for CUDA 12
ii  cudnn9-cuda-12-6                9.5.1.17-1                              arm64        NVIDIA cuDNN for CUDA 12.6
ii  libcudnn9-cuda-12               9.5.1.17-1                              arm64        cuDNN runtime libraries for CUDA 12.6
ii  libcudnn9-dev-cuda-12           9.5.1.17-1                              arm64        cuDNN development headers and symlinks for CUDA 12.6
ii  libcudnn9-static-cuda-12        9.5.1.17-1                              arm64        cuDNN static libraries for CUDA 12.6
root@5c57fc15fdb4:/test# python3 main.py 
Traceback (most recent call last):
  File "/test/main.py", line 2, in <module>
    import torch as th
  File "/usr/local/lib/python3.10/dist-packages/torch/__init__.py", line 237, in <module>
    from torch._C import *  # noqa: F403
ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory
root@5c57fc15fdb4:/test# 

Update: inside the 12.2.2-devel-arm64-ubuntu22.04 image I installed the three cuDNN 8 packages (libcudnn8, libcudnn8-dev, libcudnn8-samples), pinned to the version given on NVIDIA's Installing cuDNN on Linux page.
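
Concretely, the commands were along these lines (version string from the guide; the samples package name is my best recollection):

apt install libcudnn8=8.9.4.25-1+cuda12.2 \
    libcudnn8-dev=8.9.4.25-1+cuda12.2 \
    libcudnn8-samples=8.9.4.25-1+cuda12.2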

I then compiled and ran the mnistCUDNN example from the cuDNN installation verification section of that page and hit the following error. It's the same one I've been getting all week.

root@c2e76405b2ad:~/cudnn_samples_v8/mnistCUDNN# make clean && make
rm -rf *o
rm -rf mnistCUDNN
CUDA_VERSION is 12020
Linking agains cublasLt = true
CUDA VERSION: 12020
TARGET ARCH: aarch64
HOST_ARCH: aarch64
TARGET OS: linux
SMS: 50 53 60 61 62 70 72 75 80 86 87 90
/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -ccbin g++ -m64 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o fp16_dev.o -c fp16_dev.cu
g++ -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include   -o fp16_emu.o -c fp16_emu.cpp
g++ -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include   -o mnistCUDNN.o -c mnistCUDNN.cpp
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o mnistCUDNN fp16_dev.o fp16_emu.o mnistCUDNN.o -I/usr/local/cuda/include -I/usr/local/cuda/include -IFreeImage/include -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 -lcublasLt -LFreeImage/lib/linux/aarch64 -LFreeImage/lib/linux -lcudart -lcublas -lcudnn -lfreeimage -lstdc++ -lm
root@c2e76405b2ad:~/cudnn_samples_v8/mnistCUDNN# ./mnistCUDNN
Executing: mnistCUDNN
cudnnGetVersion() : 8904 , CUDNN_VERSION from cudnn.h : 8904 (8.9.4)
Host compiler version : GCC 11.4.0

There are 1 CUDA capable devices on your machine :
device 0 : sms  8  Capabilities 8.7, SmClock 1300.0 Mhz, MemSize (Mb) 62841, MemClock 612.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
ERROR: cudnn failure (CUDNN_STATUS_NOT_INITIALIZED) in mnistCUDNN.cpp:414
Aborting...
root@c2e76405b2ad:~/cudnn_samples_v8/mnistCUDNN# 

Hi,

You will need a container that supports the iGPU.
Images of this kind are usually tagged with the l4t keyword.

For example, if your device is set up with JetPack 6.0 and CUDA 12.2,
please try nvcr.io/nvidia/l4t-cuda:12.2.12-devel.
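
Something like this (the nvidia runtime is required for iGPU access):

docker pull nvcr.io/nvidia/l4t-cuda:12.2.12-devel
docker run -it --rm --runtime nvidia nvcr.io/nvidia/l4t-cuda:12.2.12-devel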

Thanks.