CUDA runtime on Jetson Orin AGX

Hello,

How can I enable the CUDA runtime on the Orin AGX? I have updated to CUDA 12.2, and nvcc --version returns:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:22:54_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

If I run

import torch

if torch.cuda.is_available():
    print("CUDA is available on your system.")
    print("CUDA version:", torch.version.cuda)
else:
    print("CUDA is not available on your system.")

it prints:

CUDA is not available on your system.

btw, this is torch 2.0.1. I had CUDA 12 previously working on Xavier NX.

Hi,

Please try the below commands to see if they help:

export PATH=/usr/local/cuda-12/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda-12.0/compat

The compat library is required when running CUDA 12 on a Jetson whose GPU driver is older than the CUDA toolkit.

If Torch is still not working, please check if the package has been built with CUDA support.
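For example, a quick check from Python (a minimal sketch; torch.version.cuda is None on wheels built without CUDA):

import torch

print(torch.__version__)          # e.g. 2.0.0+nv23.05 for the NVIDIA JetPack wheel
print(torch.version.cuda)         # None if the wheel was built without CUDA support
print(torch.cuda.is_available())  # False if the runtime or driver cannot be used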
Thanks.

Hi,

just to confirm: this is a brand new Jetson Orin AGX we just purchased. The GPU driver is older??
After modifying PATH and LD_LIBRARY_PATH as you suggested, CUDA is still not available. Testing further, it appears the torch 2.0.1 we have was not built with CUDA support, based on torch.version.cuda (which returned None). Could you direct me to the right torch build that supports CUDA on the Orin AGX for Ubuntu 20.04 aarch64 (and how to install it)?

Thank you!

@hg1 before PyTorch, first try making sure that you can run the CUDA deviceQuery sample to confirm the GPU is working:

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 30589 MBytes (32074559488 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1300 MHz (1.30 GHz)
  Memory Clock rate:                             1300 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS

Then you can find the PyTorch wheels that were built for JetPack (with CUDA enabled) here:
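Installing one of those wheels is a single pip command; the filename below is just an example for JetPack 5 / Python 3.8, so substitute the wheel matching your release:

python3 -m pip install --no-cache-dir torch-2.0.0+nv23.05-cp38-cp38-linux_aarch64.whl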

I found deviceQuery for CUDA 11.4; will this work for CUDA 12 or 12.2? I'm hoping I won't have to roll back to CUDA 11.4. According to nvcc --version, the CUDA compiler I have is:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:22:54_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0


 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 62797 MBytes (65847091200 bytes)
  (008) Multiprocessors, (128) CUDA Cores/MP:    1024 CUDA Cores
  GPU Max Clock rate:                            1300 MHz (1.30 GHz)
  Memory Clock rate:                             612 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.4, NumDevs = 1
Result = PASS

Just tried installing torch-2.0.0+nv23.5, which appears to be the one to pick here: [Installing PyTorch for Jetson Platform - NVIDIA Docs].
All goes well until I try to install torchvision: torchvision just decides to uninstall torch-2.0.0+nv23.5 and installs torch-2.0.1 without CUDA support instead.

How do I get around this?

If you are installing torchvision like pip3 install torchvision, it will install a torchvision/PyTorch that was built without GPU support. Instead, install torchvision from source as described under the Installation section of this post: PyTorch for Jetson
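A sketch of that from-source build (the branch and BUILD_VERSION below are examples for torch 2.0; match them to your PyTorch version):

sudo apt-get install libjpeg-dev libpng-dev zlib1g-dev
git clone --branch v0.15.1 https://github.com/pytorch/vision torchvision
cd torchvision
export BUILD_VERSION=0.15.1
python3 setup.py install --user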

I don’t know if the PyTorch wheels will work with CUDA 12 or not, because they were built with CUDA 11.4. You might need to rebuild them for CUDA 12 if you have an issue.

when I try to install torchvision from source

(git clone https://github.com/pytorch/vision; python setup.py install)

I get the error

The detected CUDA version (12.2) mismatches the version that was used to compile
PyTorch (11.4). Please make sure to use the same CUDA versions.

but

import torch
print(torch.__version__)
returns:
2.0.0+nv23.05

Which I believe is the latest one that supports CUDA?

ok… it seems to be building now without the initial error, after getting rid of the CUDA 12.2 entries in PATH and LD_LIBRARY_PATH (and pointing only to 12.0)
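For reference, a sketch of the resulting environment, assuming the toolkits live under the default /usr/local/cuda-* locations:

export PATH=/usr/local/cuda-12.0/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64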

@dusty_nv
But I continue getting errors while building torchvision from source using:

python setup.py build develop
and
python setup.py install

It appears that it doesn't find libpng-dev and libjpeg-dev, even though I verified that they are installed…

any idea?

@hg1 I'm not sure why it wouldn't find those if they are in fact installed (besides the fact that you might want to use python3 instead of python), but here is the procedure I use to build torchvision in a container:

You can follow along with steps similar to those in the RUN statements.
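If the build cannot find libpng/libjpeg even though the packages are installed, a few quick checks can help narrow it down (a sketch; as far as I recall, torchvision's setup.py locates libpng via libpng-config):

dpkg -l | grep -E 'libjpeg|libpng'              # confirm the -dev packages are installed
ls /usr/include/jpeglib.h /usr/include/png.h    # confirm the headers exist
libpng-config --version                         # confirm libpng-config is on the PATH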

I had it all working in a container before on the Xavier NX. When I tried to transfer the image (about 16GB), there wasn't enough room on the Orin AGX for the container (although I have 64GB).

I am not that familiar with docker containers, and I would love to just build without it on the Orin AGX.

So far no luck building torchvision from scratch (source).

Would you have the steps to build torchvision without the container?

I was able to build torchvision from source, yet I'm getting the following and seeing two versions of CUDA??? Does that seem OK?

if torch.cuda.is_available():
    print("CUDA is available on your system.")
    print("CUDA version:", torch.version.cuda)
else:
    print("CUDA is not available on your system.")

if torch.version.cuda is not None:
    print(f"PyTorch was built with CUDA support. CUDA version: {torch.version.cuda}")
else:
    print("PyTorch was NOT built with CUDA support.")

returns:

CUDA is available on your system.
CUDA version: 11.4
PyTorch was built with CUDA support. CUDA version: 11.4

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:22:54_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

and in the Python shell:

>>> torch.__version__
'2.0.0+nv23.05'
>>> import torchvision
/home/hg111/torch-2.0.0-env/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/hg111/torch-2.0.0-env/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSs' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
>>> torchvision.__version__
'0.16.0a0+a6dea86'
>>> import cv2
>>> cv2.__version__
'4.8.0'


I believe this is reporting the version of CUDA that PyTorch was built with (which is correct - those PyTorch wheels were built with 11.4). I am unsure what other issues you may encounter using CUDA 12 with those wheels; I haven't tried that.

It appears that this is not the way to go for a working CUDA 12.x runtime with PyTorch… I have downloaded the latest Docker image for PyTorch. However, 16GB will take too much space on the Orin AGX 64GB. Could you guide me through moving the Docker data to my external disk?

I have changed the docker.service ExecStart to:
ExecStart=/usr/bin/dockerd --data-root /media/hg/docker/data -H fd:// --containerd=/run/containerd/containerd.sock
and in daemon.json added:
"data-root": "/media/hg/docker/data"

but I get an error:
Job for docker.service failed because the control process exited with error code

I am wondering if I shouldn't have just symlinked /var/lib/docker to my external disk location before I even pulled the Docker image from NGC, but hopefully I don't have to start everything from scratch and can use the image I already downloaded.

what am I missing?


@hg1 are you sure that your external storage is mounted at boot? It should have an entry in /etc/fstab - otherwise, it won’t be mounted at the time the Docker daemon starts.
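For example, an /etc/fstab entry along these lines (the device, filesystem type, and mount point here are placeholders for illustration):

/dev/sda1  /media/hg/docker  ext4  defaults  0  2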

You should be able to copy your existing /var/lib/docker to your external storage to keep your previous containers. This is the general procedure I follow for relocating the Docker data directory:
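In outline, the sketch below stops the daemon, copies the data, points data-root at the new location, and restarts (paths are examples; adjust them to your mount point):

sudo systemctl stop docker
sudo rsync -aP /var/lib/docker/ /media/hg/docker/data/
# then add "data-root": "/media/hg/docker/data" to /etc/docker/daemon.json
sudo systemctl start docker
docker info | grep "Docker Root Dir"   # confirm the new location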

ok, I checked the mount and am now trying to run the mv command based on the IBM guideline, but I am getting lines like this:

mv: cannot create regular file './docker/overlay2/ecb9239999c15c88d7e073ad517e78d7726d6c77979689c3a4fbf5955879de04/diff/usr/local/nvm/alias/lts/*': Invalid argument

@hg1 you might need to do it with sudo; those docker data dirs have root-only files and subdirectories in them.

Worst case, start with a fresh data-root directory and re-download your images.

it was with sudo: sudo mv /var/lib/docker .
(the dot being the destination mounted dir)

I was afraid you would suggest re-downloading the image… since I couldn't just stop the mv command :)

are you sure I can't just download directly to the mounted external drive?

You need to have the Docker data-root pointing to your external drive; otherwise, all the images you pull will still be downloaded to your Jetson's internal storage. At this point, if the mv command doesn't work, I would personally just start with a fresh data-root on your external storage.
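Starting fresh would look roughly like this (a sketch, reusing the data-root path from your earlier post):

sudo systemctl stop docker
sudo mkdir -p /media/hg1/nv_dockers/docker
# keep "data-root": "/media/hg1/nv_dockers/docker" in /etc/docker/daemon.json
sudo systemctl start docker

and then re-pull your images from NGC.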

is there anything other than adding the line to daemon.json? I have it like this:

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "data-root": "/media/hg1/nv_dockers/docker"
}

but I keep getting the "docker.service failed because the control process exited with error code" error, even though the directory is verified.

here is my latest sudo dockerd --debug output:

sudo dockerd --debug
INFO[2023-07-25T19:21:54.948114011-07:00] Starting up
DEBU[2023-07-25T19:21:54.949902024-07:00] Listener created for HTTP on unix (/var/run/docker.sock)
INFO[2023-07-25T19:21:54.950314347-07:00] detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf
DEBU[2023-07-25T19:21:54.951306675-07:00] Golang's threads limit set to 448290
INFO[2023-07-25T19:21:54.952127257-07:00] parsed scheme: "unix" module=grpc
INFO[2023-07-25T19:21:54.952151065-07:00] scheme "unix" not registered, fallback to default scheme module=grpc
DEBU[2023-07-25T19:21:54.952171353-07:00] metrics API listening on /var/run/docker/metrics.sock
INFO[2023-07-25T19:21:54.952202425-07:00] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 }] } module=grpc
INFO[2023-07-25T19:21:54.952323322-07:00] ClientConn switching balancer to "pick_first" module=grpc
INFO[2023-07-25T19:21:54.954857837-07:00] parsed scheme: "unix" module=grpc
INFO[2023-07-25T19:21:54.954901230-07:00] scheme "unix" not registered, fallback to default scheme module=grpc
INFO[2023-07-25T19:21:54.954937006-07:00] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 }] } module=grpc
INFO[2023-07-25T19:21:54.954952174-07:00] ClientConn switching balancer to "pick_first" module=grpc
DEBU[2023-07-25T19:21:54.956473017-07:00] processing event stream module=libcontainerd namespace=plugins.moby
DEBU[2023-07-25T19:21:54.956704347-07:00] Using default logging driver json-file
DEBU[2023-07-25T19:21:54.957226879-07:00] [graphdriver] priority list: [btrfs zfs overlay2 fuse-overlayfs aufs overlay devicemapper vfs]
ERRO[2023-07-25T19:21:54.977323670-07:00] failed to mount overlay: invalid argument storage-driver=overlay2
ERRO[2023-07-25T19:21:54.977360726-07:00] [graphdriver] prior storage driver overlay2 failed: driver not supported
DEBU[2023-07-25T19:21:54.978275805-07:00] Cleaning up old mountid : start.
failed to start daemon: error initializing graphdriver: driver not supported

The only way I have been able to "work around" this issue so far is to add "storage-driver": "vfs" to the daemon.json, but I have no idea how it might impact the downloaded image.
As I understand from your previous messages and the IBM site, I shouldn’t need to resort to this. ???