PyTorch for Jetson

Hi dusty, I face the same problem of torchvision interoperability with PyTorch. If I try using the container, then the memory on eMMC maxes out.

I am using a Xavier AGX module and I have inserted a 128 GB SD card. Is there a way to flash the SD card and use it as the root filesystem? How can I go about this?

Then I can clone the container and use it directly. Please let me know if there is a way.

What may be easier is to just change the Docker data-root directory to a directory on your SD card. Then the containers will all be stored on the SD card. You can do it like this: https://www.ibm.com/docs/en/z-logdata-analytics/5.1.0?topic=compose-relocating-docker-root-directory

Also make sure your SD card gets mounted at boot-up with an entry in /etc/fstab, or else the SD card containing the Docker data-root won’t have been mounted yet when the Docker daemon initializes.
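For example, a minimal sketch of checking the card and wiring up the mount (the device name /dev/mmcblk1p1 and mount point /mnt/sdcard are assumptions; check yours first):

lsblk                           # find the SD card's device name
sudo blkid /dev/mmcblk1p1       # device name is an assumption; this prints the UUID

# hypothetical /etc/fstab entry to auto-mount the card at boot
# (fill in your own UUID and mount point):
# UUID=xxxx-xxxx  /mnt/sdcard  ext4  defaults  0  2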

I was facing the exact same issue yesterday. The issue was resolved after downgrading Pillow:

(rlms) nvidia@xavier:/srv/rlms/detect$ python3
Python 3.6.9 (default, Jun 29 2022, 11:45:57)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import PIL
>>> print(PIL.__version__)
7.1.2
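In case it helps someone else, a minimal sketch of the downgrade (7.1.2 is just the version shown above; other pre-9 releases may also work, but that's untested here):

pip3 install pillow==7.1.2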

Hi dusty. For some reason, Docker doesn’t relocate even after following all the instructions, and goes back to /var/lib/docker.

I need to run YOLOv5. Is it possible to install the Python 3.6 wheels for torch and torchvision, then upgrade Python to 3.8 and run YOLOv5? Does that work?

What I do is add "data-root": "/new_dir_structure/docker" to /etc/docker/daemon.json, and then reboot after making sure my drive gets auto-mounted in /etc/fstab.
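Concretely, a hedged sketch of both pieces (the /new_dir_structure/docker path is only an example location on the mounted drive):

# contents of /etc/docker/daemon.json after adding the key (keep any existing keys):
# {
#     "data-root": "/new_dir_structure/docker"
# }

sudo systemctl restart docker            # or reboot
docker info | grep 'Docker Root Dir'     # should print the new location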

If you are on JetPack 4.x, we don’t have pre-built PyTorch wheels for Python 3.8, but you could build them yourself. On JetPack 5.x, the PyTorch wheels are built for Python 3.8. Upgrading Python from 3.6 to 3.8 won’t automatically upgrade PyTorch, because you still need a PyTorch wheel built for that Python version.

May I know if there is any way to reduce the memory usage of PyTorch? It uses around 2 GB of memory when I initialize the CUDA context, even though I am doing very simple inferencing. I suspect it is related to the number of ops that need to be loaded into CUDA kernels, so I did some research and found a SELECTED_OP_LIST option in PyTorch, but it doesn’t seem to work outside of mobile builds.

Hi @richardfat7, unfortunately we have not found a way to reduce it. It appears you are correct that SELECTED_OP_LIST only applies to the mobile builds.


Hi @dusty_nv, have you ever experienced the issue below? The PyTorch version is 1.11.0 and torchaudio is 0.11.0. torchaudio was installed via pip3.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jetson/.local/lib/python3.8/site-packages/torchaudio/__init__.py", line 1, in <module>
    from torchaudio import (  # noqa: F401
  File "/home/jetson/.local/lib/python3.8/site-packages/torchaudio/_extension.py", line 103, in <module>
    _init_extension()
  File "/home/jetson/.local/lib/python3.8/site-packages/torchaudio/_extension.py", line 88, in _init_extension
    _load_lib("libtorchaudio")
  File "/home/jetson/.local/lib/python3.8/site-packages/torchaudio/_extension.py", line 51, in _load_lib
    torch.ops.load_library(path)
  File "/home/jetson/.local/lib/python3.8/site-packages/torch/_ops.py", line 220, in load_library
    ctypes.CDLL(path)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/jetson/.local/lib/python3.8/site-packages/torchaudio/lib/libtorchaudio.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv

Hi @shahizat, I haven’t gotten that - I would try building torchaudio manually from source and see if that helps.

I build torchaudio in this Dockerfile here: https://github.com/dusty-nv/jetson-containers/blob/e36e937c69415ccc4f7be2fc9903c5432c0a68ba/Dockerfile.pytorch#L93
There are also pre-built containers up on NGC with PyTorch + torchvision + torchaudio here: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/l4t-pytorch
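Outside of a container, a rough sketch of that source build (the v0.11.0 branch is an assumption matching the torch 1.11 / torchaudio 0.11 pairing above):

# build torchaudio from source against the already-installed PyTorch
git clone --branch v0.11.0 https://github.com/pytorch/audio torchaudio
cd torchaudio
python3 setup.py install --user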

Hi @dusty_nv,
I have noticed that there is torch 1.11 for Jetson Orin, but no corresponding torchvision for torch 1.11. Is there a compatible torchvision that I can install for torch 1.11 on the Jetson Orin?

Thanks

Hi @dusty_nv,

So if I want to install a PyTorch version below 1.11, say 1.7, will I have to reinstall JetPack 4.6?
Can the Jetson Orin take JetPack 4.6?

Hi @powlook, PyTorch 1.11 would use torchvision v0.12 - I’ve updated the original post above to add these.

Jetson Orin only supports JetPack 5.x. You could try building an earlier PyTorch from source, but I’m not sure if older PyTorch versions would support the minimum CUDA/cuDNN versions from JetPack 5.x.
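For reference, a minimal sketch of building the matching torchvision v0.12 from source (the BUILD_VERSION export follows the Build From Source instructions in the original post; the --user install is an assumption):

git clone --branch v0.12.0 https://github.com/pytorch/vision torchvision
cd torchvision
export BUILD_VERSION=0.12.0      # version reported by the resulting package
python3 setup.py install --user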

Hi. It is not for the Nano, but still:
the package (torch.__version__ reports 1.12.0a0+84d1cb9.nv22.4)
doesn’t seem to include torch.distributed:

>>> import torch
>>> print(torch.distributed.is_available())
False

Hi @Andrey1984, I don’t believe these new official wheels are built with distributed enabled, so you would need to build PyTorch from source with distributed turned on if you need that.

@dusty_nv Thank you for following up
so far I have tried building from source with ninja and with cmake; both failed:

[  5%] Built target libprotoc
make: *** [Makefile:141: all] Error 2

Could you expand on how to build with distributed enabled, please?

I just follow my normal build process (in the Build From Source section above) and make sure I have libopenmpi-dev installed first. Then, unless you explicitly set the USE_DISTRIBUTED=0 environment variable, it will enable distributed.
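As a hedged sketch of the pre-build setup (only the flags mentioned in this thread; everything else follows the Build From Source section):

sudo apt-get install -y libopenmpi-dev   # MPI backend needed for distributed
export USE_DISTRIBUTED=1                 # the default unless explicitly set to 0
export USE_NCCL=0                        # NCCL is not supported on Jetson
python3 setup.py bdist_wheel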

Thanks
it seems I am running into:


FAILED: lib/libtorch_global_deps.so 
: && /usr/bin/cc -fPIC -fopenmp -DNDEBUG -O3 -DNDEBUG -DNDEBUG  -Wl,--no-as-needed -rdynamic -shared -Wl,-soname,libtorch_global_deps.so -o lib/libtorch_global_deps.so caffe2/CMakeFiles/torch_global_deps.dir/__/torch/csrc/empty.c.o  -Wl,-rpath,/usr/lib/aarch64-linux-gnu/openmpi/lib:/usr/local/cuda-11.4/lib64::::::::  /usr/lib/aarch64-linux-gnu/openmpi/lib/libmpi_cxx.so  /usr/lib/aarch64-linux-gnu/openmpi/lib/libmpi.so  /usr/local/cuda-11.4/lib64/libcurand.so  /usr/local/cuda-11.4/lib64/libcufft.so  /usr/local/cuda-11.4/lib64/libcublas.so  /usr/lib/aarch64-linux-gnu/libcudnn.so  /usr/local/cuda-11.4/lib64/libcudart.so  -lLIBNVTOOLSEXT-NOTFOUND && :
/usr/bin/ld: cannot find -lLIBNVTOOLSEXT-NOTFOUND
collect2: error: ld returned 1 exit status
[39/1962] Building CXX object third_pa...Files/kineto_base.dir/src/Logger.cpp.o
../third_party/kineto/libkineto/src/Logger.cpp:28:32: warning: unknown option after ‘#pragma GCC diagnostic’ kind [-Wpragmas]
   28 | #pragma GCC diagnostic ignored "-Wglobal-constructors"
      |                                ^~~~~~~~~~~~~~~~~~~~~~~
[46/1962] Building NVCC (Device) objec...dir/nccl/gloo_cuda_generated_nccl.cu.o
ninja: build stopped: subcommand failed.


It seems I got through by exporting paths:

export PATH=/usr/local/cuda-11.4/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

It looks like it’s trying to use NCCL, but this isn’t supported on Jetson - did you set export USE_NCCL=0?


I passed this step using that argument.
It seems that for PyTorch 1.13.0 the torchvision version needs to be as follows:
git clone --branch v0.13.1 https://github.com/pytorch/vision torchvision
Right?
Building torchvision seems to result in:

  File "/home/nvidia/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 544, in unix_cuda_flags
    cflags + _get_cuda_arch_flags(cflags))
  File "/home/nvidia/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1789, in _get_cuda_arch_flags
    raise ValueError(f"Unknown CUDA arch ({arch}) or GPU not supported")
ValueError: Unknown CUDA arch (8.7) or GPU not supported
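For anyone hitting this, a quick hedged check is whether the installed torch wheel knows about the Orin GPU at all; if the capability prints (8, 7) but the build still fails, the installed torch predates sm_87 support and torchvision needs to be built against a newer torch:

python3 -c "import torch; print(torch.__version__, torch.cuda.get_device_capability())"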

Thanks