For various reasons we are still on L4T R32.5 (JetPack 4.x). We are planning to upgrade to JetPack 5.x, but it may be a while. In the meantime we would like to use a later version of PyTorch (1.11 or later). I am trying to understand our options here. Here is my understanding and what I have tried so far.
Trying to run any of the JetPack 5.x containers with a later version of PyTorch does not work on JetPack 4.x, which I guess is understandable.
I started with the docker container dustynv/pytorch:1.10-r32.7.1
This comes with PyTorch 1.10 compiled for Python 3.6
Since we want to move past Python 3.6, I installed Anaconda in the container and created a conda env with Python 3.8
In this conda env I installed torch-1.11.0-cp38-cp38-linux_aarch64.whl from https://nvidia.box.com/shared/static/ssf2v7pf5i245fk4i0q926hy4imzs2ph.whl
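For reference, the sequence looked roughly like this (the Anaconda installer filename, install prefix, and wheel path are illustrative, not exact):

```
# Start the JetPack 4.x PyTorch container (runtime flags may differ per setup)
sudo docker run -it --runtime nvidia dustynv/pytorch:1.10-r32.7.1

# Inside the container: install Anaconda (installer assumed already downloaded),
# then create and activate a Python 3.8 environment
bash Anaconda3-*-Linux-aarch64.sh -b -p /opt/anaconda3
source /opt/anaconda3/bin/activate
conda create -n torch38 python=3.8 -y
conda activate torch38

# Install the JetPack 5.x wheel (this is the install that later fails at import)
pip install torch-1.11.0-cp38-cp38-linux_aarch64.whl
```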
It installs okay, but I cannot import torch; I get this error:
OSError: libmpi_cxx.so.40: cannot open shared object file: No such file or directory
I think this is because torch-1.11.0-cp38-cp38-linux_aarch64.whl is compiled for JetPack 5.x
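A quick way to confirm which shared libraries are missing is to run ldd against the torch libraries inside the env (the site-packages path below is illustrative):

```
# Anything reported as "not found" (e.g. libmpi_cxx.so.40) is a library the
# wheel was linked against that does not exist on the JetPack 4 rootfs
ldd /opt/anaconda3/envs/torch38/lib/python3.8/site-packages/torch/lib/libtorch_python.so | grep "not found"
ldd /opt/anaconda3/envs/torch38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so | grep "not found"
```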
So what are my options here?
Is compiling from source the only option? For example, should I build later versions of PyTorch from source inside of, say, dustynv/pytorch:1.10-r32.7.1?
Or is there any way I can use the pre-built wheels?
Hi @giriman1, yes, you need to build PyTorch from source against your desired version of Python and the CUDA 10.2 that comes with JetPack 4.6. The JetPack 5 wheels for PyTorch won’t work on JetPack 4 because they were built against a different version of CUDA (and other dependencies, such as the MPI issue you encountered)
I wouldn’t bother using dustynv/pytorch:1.10-r32.7.1 as a base for this, since it is already “polluted” with the Python 3.6 stuff. Instead, either build the wheel outside of a container, or in your own container derived from l4t-base with your desired environment. When you kick off the PyTorch build, it will soon print a detailed configuration - confirm that it’s using the version of Python/etc. that you want before waiting for it to complete :)
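Something along these lines should work as a starting point (the package list is just a sketch, not exhaustive):

```
# Start from a clean L4T r32.7.1 base image
sudo docker run -it --runtime nvidia nvcr.io/nvidia/l4t-base:r32.7.1

# Inside the container: typical PyTorch build dependencies, plus whichever
# Python version you want to target
apt-get update
apt-get install -y build-essential git cmake ninja-build \
                   libopenblas-dev libopenmpi-dev python3-pip
```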
p.s. there is a preliminary pytorch:builder dockerfile here for PyTorch 2.0 that you can reference:
@dusty_nv I am running into some issues while trying to build the wheel. It gets stuck doing Re-running CMake... over and over - it seems to be stuck in a configuration loop forever.
Here are the steps I did:
Docker version I am using is Docker version 19.03.6, build 369ce74a3c
Used nvcr.io/nvidia/l4t-base:r32.7.1 as base image
Installed all deps
Created a conda env with Python 3.8
Cloned PyTorch 1.12.1. This is the latest version I could use with CUDA 10.2, which is the version that comes with my JetPack 4.5
Applied the closest patch pytorch-1.10-jetpack-4.5.1.patch
Set various exports, like so: export USE_NCCL=0 && export USE_DISTRIBUTED=1 && export USE_QNNPACK=0 && export USE_PYTORCH_QNNPACK=0 && export TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2" && export USE_NATIVE_ARCH=1 && export PYTORCH_BUILD_VERSION="1.12.1" && export PYTORCH_BUILD_NUMBER=1
Built the wheel like so: python3 setup.py bdist_wheel (a consolidated sketch of these steps is below)
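Here is the consolidated sketch of the steps above (the patch location and the requirements install are illustrative; branch/tag and exports are as I used them):

```
# Clone the release tag and apply the Jetson patch (patch path is illustrative)
git clone --recursive --branch v1.12.1 https://github.com/pytorch/pytorch
cd pytorch
git apply ../pytorch-1.10-jetpack-4.5.1.patch

# Build configuration (same exports as above)
export USE_NCCL=0
export USE_DISTRIBUTED=1
export USE_QNNPACK=0
export USE_PYTORCH_QNNPACK=0
export TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2"
export USE_NATIVE_ARCH=1
export PYTORCH_BUILD_VERSION="1.12.1"
export PYTORCH_BUILD_NUMBER=1

# Install Python build requirements, then build the wheel (lands in dist/)
pip install -r requirements.txt
python3 setup.py bdist_wheel
```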
Hi @giriman1, hmm not sure - here’s the only relevant thing I could find about that issue:
At a glance though, it looks like PyTorch picked up the right environment configuration. I haven’t used conda and am not sure if that’s related or not.
@dusty_nv I think this has to do with the system time on the NX. The clock was way in the past on my Xavier, so CMakeLists.txt ended up with a timestamp way in the past and CMake kept re-running its configuration. I disabled NTP and set the system time manually. Now I get past configuration - it is no longer stuck in the re-running CMake loop and is compiling now.
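Roughly what I did to fix the clock (the timestamp is just an example):

```
# Stop NTP from overwriting the clock, then set the time manually
sudo timedatectl set-ntp false
sudo date -s "2023-03-15 10:00:00"   # example timestamp only
```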
Summarizing my final solution in case someone else finds this thread at a later date:
tl;dr: I built PyTorch 1.10.0 from source for Python 3.8 inside of a conda env on my Xavier NX.
If you are on JetPack 4.5 you are on CUDA 10.2
It is not possible to upgrade this, at least I could not get CUDA 11.x to work on JetPack 4.5
PyTorch supports CUDA 10.2 only up to version 1.12.1, so you need to pick a release less than or equal to 1.12.1. All later releases drop CUDA 10.2 support and require newer versions of CUDA
We cannot compile PyTorch 1.12.1 for JetPack 4.5, since PyTorch 1.12 requires a newer version of cuDNN
We cannot compile PyTorch 1.11.0 for JetPack 4.5 either; the build fails with an nvcc error. Exact error: nvcc fatal : 'arch=native': expected a number
We can compile PyTorch 1.10.0 fine
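Once the wheel was built, a quick sanity check from the same conda env looked roughly like this (the exact wheel filename under dist/ may differ):

```
# Install the freshly built wheel and confirm CUDA is visible
pip install dist/torch-1.10.0-cp38-cp38-linux_aarch64.whl
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```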
Some other points to note:
I did the build inside of docker container
Make sure the system time on your Jetson is current. If it is in the past or future, it confuses CMake and your PyTorch build will get stuck.
As @dusty_nv suggested above, I used nvcr.io/nvidia/l4t-base:r32.7.1 as the base image
If you are building on a Jetson, it will take a while. My build took almost 4 hours.
My Xavier only has 8 GB RAM and 3.6 GB of swap, which is not enough to build PyTorch from source, so I added an additional 16 GB of swap (sketch below). Based on what I saw during compilation, I think you need at least 8 GB of swap on top of the 8 GB of RAM to build PyTorch comfortably.
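For reference, adding the extra swap looked something like this (size and file path are just what I used):

```
# Create and enable a 16 GB swap file for the duration of the build
sudo fallocate -l 16G /mnt/16GB.swap
sudo chmod 600 /mnt/16GB.swap
sudo mkswap /mnt/16GB.swap
sudo swapon /mnt/16GB.swap
free -h   # confirm the extra swap is active

# If memory is still tight, limiting parallel compile jobs can also help
# export MAX_JOBS=4
```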
When you are able to upgrade to JetPack 5, it will be easier to build newer versions of PyTorch.
@dusty_nv unfortunately robot OEMs do not seem to have figured out a clean way to handle JetPack upgrades. I still hold out hope for being able to move to JetPack 5 sometime soon, but it is not clear when or how. Re-flashing the multiple Jetson modules on a robot without the OEM’s well-documented and tested process is tricky :) We shall see, fingers crossed :)