Container hosting on AGX Xavier with JetPack 5.0

I have an AGX Xavier development kit that I’m planning to upgrade to JetPack 5.0. If I do, will I be able to host containers that are built on a JetPack 4.6.1 base? Will the CUDA tools mounted into the container work?

Quick answer: based on an early out-of-the-box attempt, no.

$ docker run --gpus all --runtime nvidia --network host --rm -it nvcr.io/nvidia/l4t-ml:r32.7.1-py3

and then from Jupyter:

# python3
Python 3.6.9 (default, Dec  8 2021, 21:08:43) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2022-04-10 00:37:34.001877: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.10.2'; dlerror: libcudart.so.10.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2022-04-10 00:37:34.001983: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-04-10 00:37:34.002319: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.10.2'; dlerror: libcudart.so.10.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2022-04-10 00:37:34.002372: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Segmentation fault (core dumped)

I’ll try to see if I can fix that by manually installing CUDA 10.2 into the container…

That’s what I figured. Looks like some of my projects go on hold until the chip shortage is over. :-(

I’m currently trying to see if copying over those .so files will work. In theory it should, since the CUDA driver ABI remains compatible with older cudart versions.
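A quick way to sanity-check that from inside the container is to confirm that the driver library is still passed through by the nvidia runtime while the 10.2 runtime libraries are absent (a rough check; exact library locations can vary between L4T releases):

$ find / -name 'libcuda.so*' 2>/dev/null        # driver library, mounted in from the host
$ find / -name 'libcudart.so.10.2*' 2>/dev/null # CUDA 10.2 runtime; turns up nothing on a JetPack 5.0 host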

I’ll keep you posted on whether it works…

I think that I have it working properly now for both PyTorch and TensorFlow.

(Dockerfile below)

FROM nvcr.io/nvidia/l4t-ml:r32.7.1-py3

# Pull the CUDA 10.2 user-space libraries (and cuDNN 8) straight from the NVIDIA Jetson
# apt repo, since a JetPack 5.0 host no longer mounts them into the container.

# CUDA runtime (libcudart)
RUN wget https://repo.download.nvidia.com/jetson/common/pool/main/c/cuda-cudart/cuda-cudart-10-2_10.2.300-1_arm64.deb && dpkg-deb -x cuda-cudart-10-2_10.2.300-1_arm64.deb cudart && rm cuda-cudart-10-2_10.2.300-1_arm64.deb && cp -r cudart/usr/local/cuda-10.2/targets/aarch64-linux/lib/* /usr/local/cuda-10.2/targets/aarch64-linux/lib && rm -rf cudart

# cuRAND
RUN wget https://repo.download.nvidia.com/jetson/common/pool/main/libc/libcurand/libcurand-10-2_10.1.2.300-1_arm64.deb && dpkg-deb -x libcurand-10-2_10.1.2.300-1_arm64.deb curand && rm libcurand-10-2_10.1.2.300-1_arm64.deb && cp -r curand/usr/local/cuda-10.2/targets/aarch64-linux/lib/* /usr/local/cuda-10.2/targets/aarch64-linux/lib && rm -rf curand

# cuFFT
RUN wget https://repo.download.nvidia.com/jetson/common/pool/main/libc/libcufft/libcufft-10-2_10.1.2.300-1_arm64.deb && dpkg-deb -x libcufft-10-2_10.1.2.300-1_arm64.deb cufft && rm libcufft-10-2_10.1.2.300-1_arm64.deb && cp -r cufft/usr/local/cuda-10.2/targets/aarch64-linux/lib/* /usr/local/cuda-10.2/targets/aarch64-linux/lib && rm -rf cufft

# cuBLAS
RUN wget https://repo.download.nvidia.com/jetson/common/pool/main/libc/libcublas/libcublas10_10.2.3.300-1_arm64.deb && dpkg-deb -x libcublas10_10.2.3.300-1_arm64.deb cublas && rm libcublas10_10.2.3.300-1_arm64.deb && cp -r cublas/usr/local/cuda-10.2/targets/aarch64-linux/lib/* /usr/local/cuda-10.2/targets/aarch64-linux/lib && rm -rf cublas

# cuDNN 8 (the extracted package tree is laid out relative to /, so overlay the whole tree)
RUN wget https://repo.download.nvidia.com/jetson/common/pool/main/c/cudnn/libcudnn8_8.2.1.32-1+cuda10.2_arm64.deb && dpkg-deb -x libcudnn8_8.2.1.32-1+cuda10.2_arm64.deb cudnn && rm libcudnn8_8.2.1.32-1+cuda10.2_arm64.deb && cp -rf cudnn/* / && rm -rf cudnn

# NVTX
RUN wget https://repo.download.nvidia.com/jetson/common/pool/main/c/cuda-nvtx/cuda-nvtx-10-2_10.2.300-1_arm64.deb && dpkg-deb -x cuda-nvtx-10-2_10.2.300-1_arm64.deb nvtx && rm cuda-nvtx-10-2_10.2.300-1_arm64.deb && cp -rf nvtx/* / && rm -rf nvtx

# cuSPARSE
RUN wget https://repo.download.nvidia.com/jetson/common/pool/main/libc/libcusparse/libcusparse-10-2_10.3.1.300-1_arm64.deb && dpkg-deb -x libcusparse-10-2_10.3.1.300-1_arm64.deb cusparse && rm libcusparse-10-2_10.3.1.300-1_arm64.deb && cp -rf cusparse/* / && rm -rf cusparse

# cuSOLVER
RUN wget https://repo.download.nvidia.com/jetson/common/pool/main/libc/libcusolver/libcusolver-10-2_10.3.0.300-1_arm64.deb && dpkg-deb -x libcusolver-10-2_10.3.0.300-1_arm64.deb cusolver && rm libcusolver-10-2_10.3.0.300-1_arm64.deb && cp -r cusolver/* / && rm -rf cusolver
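To sanity-check a build like this, running both frameworks against the GPU from the rebuilt image should be enough (the image tag here is just an example):

$ docker build -t l4t-ml-cuda102:r32.7.1 .
$ docker run --runtime nvidia --network host --rm -it l4t-ml-cuda102:r32.7.1 \
    python3 -c "import torch; print(torch.cuda.is_available())"
$ docker run --runtime nvidia --network host --rm -it l4t-ml-cuda102:r32.7.1 \
    python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"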

With the benefit of hindsight, I could simplify it a bit more now, but this version is simple enough to work reliably.

With the new L4T, the CUDA runtime-side libraries are no longer mounted into the container (and they wouldn’t be the matching version anyway), so we need to fetch them manually.

Thankfully they stopped mounting those libraries even for the current release in JetPack 5.0, so this won’t be an issue going forward.
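For reference, the csv-mode nvidia container runtime on Jetson decides what to pass through based on the files under /etc/nvidia-container-runtime/host-files-for-container.d/, so you can check on a given host whether any CUDA user-space libraries are still being mounted (paths as I understand them; worth verifying on your own system):

$ ls /etc/nvidia-container-runtime/host-files-for-container.d/
$ grep -h libcudart /etc/nvidia-container-runtime/host-files-for-container.d/*.csv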

Looks promising for my personal projects, but it’s not something I’d foist upon an open source community.

In hindsight, NVIDIA made a mistake by mounting those higher-level libraries into the guest, since they aren’t ABI-stable between CUDA releases. It was always going to break down at some point.

There’s essentially no fix that doesn’t involve shipping a CUDA 10.2 install on the new JetPack 5.0… or manually installing those libraries into the containers.

For the former option, this means installing CUDA 10.2 from JetPack 4.x on the host and then using -v to mount it as a volume at /usr/local/cuda-10.2 inside the container, which could either be documented or done automatically in that scenario.
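As a rough sketch of that first option, assuming CUDA 10.2 from the JetPack 4.x repos has been installed on the JetPack 5.0 host under /usr/local/cuda-10.2, the mount would look something like this:

$ docker run --runtime nvidia --network host --rm -it \
    -v /usr/local/cuda-10.2:/usr/local/cuda-10.2:ro \
    nvcr.io/nvidia/l4t-ml:r32.7.1-py3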

For the latter option, it takes only light changes to the container to fetch those libraries, which is what I’ve done in this case.
