PyTorch for Jetson

Hello @dusty_nv,

Using different pieces of code here and there, I sometimes get this error message on my Jetson Xavier AGX (I use the same code on a Jetson Nano but never get this error):
RuntimeError: CUDA error: no kernel image is available for execution on the device
which leads me to think that there is a problem with the torch installation.

I tried the PyTorch 1.7.0 and 1.8.0 wheels with no success (meaning they installed correctly according to the verification steps, but still give me this error), so I thought I would try to build it from source.

I have L4T 32.5.1, so I’m wondering: should I apply one of the patches you provide before attempting to build torch from source? (For compatibility with the code I’m trying to use, my goal is to build PyTorch 1.7.)

Thank you for your help

jetson-nano@jetsonnano-desktop:~$ cat /etc/nv_tegra_release
# R32 (release), REVISION: 4.4, GCID: 23942405, BOARD: t210ref, EABI: aarch64, DATE: Fri Oct 16 19:44:43 UTC 2020
(venv) jetson-nano@jetsonnano-desktop:~$ pip3 install torch-1.6.0-cp36-cp36m-linux_aarch64.whl 
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
ERROR: torch-1.6.0-cp36-cp36m-linux_aarch64.whl is not a supported wheel on this platform.

So, what should I do to solve this problem?

Hi @dusty_nv,
On the Xavier AGX, I installed libtorch with the “xxx.whl” from Box.
But whether I build a project with CMake or compile it with Qt Creator 4.5.2, I run into problems:

  1. My device only has CUDA 10.2, but it links against CUDA 10.0. I guess the problem is in the libtorch CMake files.
  2. The Qt build reports a header file problem.

Thank you in advance!

Here is the relevant linker output:
CMakeFiles/libtorch-yolov5.dir/link.txt:1:/usr/bin/c++ -Wall CMakeFiles/libtorch-yolov5.dir/src/detector.cpp.o CMakeFiles/libtorch-yolov5.dir/src/main.cpp.o -o libtorch-yolov5 -L/usr/local/cuda-10.0/lib64…
CMakeFiles/libtorch-yolov5.dir/build.make:157:libtorch-yolov5: /usr/local/cuda-10.0/lib64/libnvToolsExt.so
CMakeFiles/libtorch-yolov5.dir/build.make:158:libtorch-yolov5: /usr/local/cuda-10.0/lib64/libcudart.so
ai@ai-desktop:~/Documents/road-crack-detection-cpp_copy/buildTestCUDA$ cd /usr/local
ai@ai-desktop:/usr/local$ ls -alh
total 44K
drwxr-xr-x 11 root root 4.0K 8月 21 2020 .
drwxr-xr-x 12 root root 4.0K 8月 21 2020 ..
drwxr-xr-x 2 root root 4.0K 3月 4 11:38 bin
lrwxrwxrwx 1 root root 9 8月 21 2020 cuda -> cuda-10.2
drwxr-xr-x 12 root root 4.0K 8月 21 2020 cuda-10.2
drwxr-xr-x 2 root root 4.0K 4月 27 2018 etc
drwxr-xr-x 2 root root 4.0K 4月 27 2018 games
drwxr-xr-x 4 root root 4.0K 9月 8 18:17 include
drwxr-xr-x 5 root root 4.0K 9月 8 18:17 lib
lrwxrwxrwx 1 root root 9 4月 27 2018 man -> share/man
drwxr-xr-x 2 root root 4.0K 4月 27 2018 sbin
drwxr-xr-x 8 root root 4.0K 9月 8 18:17 share
drwxr-xr-x 2 root root 4.0K 4月 27 2018 src

 >>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nvidia/.local/lib/python3.6/site-packages/torch/__init__.py", line 135, in <module>
    _load_global_deps()
  File "/home/nvidia/.local/lib/python3.6/site-packages/torch/__init__.py", line 93, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.10.2: cannot open shared object file: No such file or directory

This problem has troubled me for so long! What should I do? Thanks a lot!

Hi @2570868576, it looks like you have installed a PyTorch wheel built against a newer version of JetPack. Can you try one of these wheels built for JetPack 4.3?

Also, before you do that, run pip3 uninstall torch to uninstall the previous wheel you installed.
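For example (a minimal sketch; the actual filename depends on which JetPack 4.3 wheel you download):

pip3 uninstall torch
pip3 install torch-1.6.0-cp36-cp36m-linux_aarch64.whl   # example filename; use the JetPack 4.3 wheel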

Hi @yyjqr789, which version of JetPack are you running on your AGX Xavier? The wheel you downloaded is for JetPack 4.2 or 4.3.

Regarding the code error you got, unfortunately I’m not familiar with using libtorch directly and haven’t seen that error before. Does the code you are trying to compile expect a different version of PyTorch?

Hi @l.weingart, are you sure that it is PyTorch which is throwing this error? These wheels were built on Xavier with TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2" (meaning CUDA kernels are compiled for Nano/TX1/TX2/Xavier). You can set the same variable when you compile torchvision too.
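For example, when building torchvision from source (a sketch; pick the torchvision branch matching your PyTorch version, v0.8.1 here is just an example):

export TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2"   # same arches as the prebuilt wheels
git clone --branch v0.8.1 https://github.com/pytorch/vision torchvision
cd torchvision
python3 setup.py install --user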

If you compile PyTorch yourself for Jetson, yes you should apply the patch (i.e. the 1.7 patch for PyTorch 1.7) and remember to set the environment variables too.
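For reference, a sketch of the environment based on the build-from-source instructions at the top of this thread (run from inside the patched PyTorch source tree; adjust the version to match what you are building):

export USE_NCCL=0
export USE_DISTRIBUTED=0                    # skip this if you need distributed support
export TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2"   # Nano/TX1, TX2, Xavier
export PYTORCH_BUILD_VERSION=1.7.0
export PYTORCH_BUILD_NUMBER=1
python3 setup.py bdist_wheel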

Hello @dusty_nv ,

Thank you for your reply.
Actually I went ahead and built it with the patch.

I have to admit that now that the build is complete, I’m a bit at a loss as to how to install it.
Could you please help with the next step: how do I install it now that it has compiled successfully?

Thank you

Sure thing - the built wheel should be under the pytorch/dist directory. Uninstall your previous PyTorch install with pip3 uninstall torch and then install this wheel instead.
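For example (a minimal sketch; your wheel filename will differ depending on the version you built):

cd pytorch/dist
pip3 uninstall torch
pip3 install torch-1.7.0-cp36-cp36m-linux_aarch64.whl   # replace with your actual filename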

Awesome, thank you.
I was looking in the build directory… :-/

Hi @dusty_nv,

No, I’m not sure, and to be frank, after having compiled it on the Xavier, I don’t think so.
Here is my problem: I bought a Jetson Nano in December and I successfully installed tools to detect human posture in videos and everything works well on it.
Then in January I bought a Jetson Xavier and tried to install the same tool suite, but every time I try to use it, it ends in a segmentation fault.
The tool suite is mmpose, from open-mmlab.

From the issue I opened on their GitHub, it was hinted that the problem was coming from torch, but that isn’t certain either.
Also, when I search for RuntimeError: CUDA error: no kernel image is available for execution on the device on Google, the results are often related to torch (even though that doesn’t mean much, I agree :-p ).

I reinstalled everything on the Xavier from scratch (reflashed the system, the JetPack, etc.), but it was impossible to make it work as it does on the Nano.

I’m at a loss for ideas.
I’m pretty sure this is not the right thread to discuss this, but I wanted to reply to your question.

Cheers

It looks like mmcv compiles CUDA kernels. I’m not familiar with these projects, but try setting MMCV_CUDA_ARGS='-gencode=arch=compute_72,code=sm_72' before you install mmcv.
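For example (a sketch; the exact package name and flags depend on the mmcv version your project needs):

export MMCV_CUDA_ARGS='-gencode=arch=compute_72,code=sm_72'
pip3 uninstall mmcv
pip3 install mmcv --no-cache-dir   # --no-cache-dir forces the CUDA extensions to be rebuilt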

My Xavier is on JetPack 4.4. Is the 1.8.0 version suitable for JetPack 4.4?

JetPack 4.4 (L4T R32.4.3) / JetPack 4.4.1 (L4T R32.4.4) / JetPack 4.5 (L4T R32.5.0)

I used 1.8.0 on the Xavier NX; it installed OK and runs OK. Thank you!
But in Qt I still encounter some errors, the same as mentioned before. I will check.

Hi @yyjqr789, yes the 1.8.0 wheel should work on JetPack 4.4. I am not familiar with using libtorch though.

Hi, how do I verify that PyTorch is using the GPU? Judging by jtop, I assume it is not using it.

Hello @dusty_nv ,

I rebuilt mmcv using MMCV_CUDA_ARGS='-gencode=arch=compute_72,code=sm_72' and the runtime error RuntimeError: CUDA error: no kernel image is available for execution on the device disappeared.
Thank you very much!

However, using mmpose still ends up in segmentation fault.
I’ll keep working with them to try and isolate the error.
I’m puzzled because it works just fine on the Nano… :-p

Hi @philipp.becker, have you called .cuda() on your tensors, model, and loss function (criterion)?

If so, you should see GPU usage. If the data/model is very small, the usage may be harder to detect because the GPU load it generates is so low.
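As a quick sanity check from the terminal (these are standard PyTorch calls):

python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

If that prints True and a device name, the wheel can see the GPU; how much activity then shows up in jtop depends on the size of the workload.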

The PyTorch training examples from Hello AI World and also this torchvision test from the l4t-pytorch container use the GPU.

Hi, I’m trying to put together an L4T Base container to simulate a TX2 using JetPack 4.3 and PyTorch 1.1 with CUDA 10.0.
To do so, I’ve tried to start from the nvcr.io/nvidia/l4t-base:r32.3.1 image and I’ve run into several issues. Here’s my Dockerfile:

FROM nvcr.io/nvidia/l4t-base:r32.3.1

COPY nvidia-l4t-apt-source.list /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
RUN apt-key adv --fetch-key https://repo.download.nvidia.com/jetson/jetson-ota-public.asc

# Update, upgrade and install basics
RUN apt-get update -y
RUN apt-get install -y apt-utils git curl ca-certificates bzip2 cmake tree htop bmon iotop g++ \
 && apt-get install -y libglib2.0-0 libsm6 libxext6 libxrender-dev nano wget python3-pip pkg-config ffmpeg
RUN python3 -m pip install --upgrade pip

ENV NVIDIA_VISIBLE_DEVICES=all

RUN apt-get install -y \
  cuda-cudart-10-0 \
  cuda-cusparse-10-0 \
  cuda-cusparse-dev-10-0 \
  cuda-cudart-dev-10-0 \
  cuda-cufft-10-0 \
  cuda-cufft-dev-10-0 \
  cuda-curand-10-0 \
  cuda-curand-dev-10-0 \
  libcudnn7 \
  libcudnn7-dev \
  cuda-cublas-10-0 \
  cuda-cublas-dev-10-0

# Install PyTorch and TorchVision
# Taken from https://forums.developer.nvidia.com/t/pytorch-for-jetson-version-1-8-0-now-available/72048
RUN wget https://nvidia.box.com/shared/static/mmu3xb3sp4o8qg9tji90kkxl1eijjfc6.whl -O torch-1.1.0-cp36-cp36m-linux_aarch64.whl \
 && apt-get -y install python3-pip libopenblas-base libopenmpi-dev \
 && python3 -m pip install Cython \
 && python3 -m pip install numpy torch-1.1.0-cp36-cp36m-linux_aarch64.whl

where nvidia-l4t-apt-source.list contains:

deb https://repo.download.nvidia.com/jetson/common r32 main
deb https://repo.download.nvidia.com/jetson/t186 r32 main

Trying to import PyTorch in the container results in an error:

Python 3.6.9 (default, Jan 26 2021, 15:33:00) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 79, in <module>
    from torch._C import *
ImportError: libnvToolsExt.so.1: cannot open shared object file: No such file or directory

I’ve checked out the l4t-pytorch containers, but they require JetPack 4.4 or newer.

Do you have any idea how to create such a container?
Thanks !

Hi @robin.blanchard00, l4t-base already includes the CUDA/cuDNN libraries (these are mounted from the host at runtime when --runtime nvidia is used during docker run). So I would skip installing all that CUDA stuff into the container and see if that helps. Then just run it with --runtime nvidia.
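For example (my-l4t-torch is a hypothetical tag for the image built from your Dockerfile):

docker build -t my-l4t-torch .
docker run -it --rm --runtime nvidia my-l4t-torch python3 -c "import torch; print(torch.__version__)"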

Hi @dusty_nv , thanks for your help!

Yep that’s what I was doing before. Here’s the Dockerfile then:

FROM nvcr.io/nvidia/l4t-base:r32.3.1

COPY nvidia-l4t-apt-source.list /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
RUN apt-key adv --fetch-key https://repo.download.nvidia.com/jetson/jetson-ota-public.asc

# Update, upgrade and install basics
RUN apt-get update -y
RUN apt-get install -y apt-utils git curl ca-certificates bzip2 cmake tree htop bmon iotop g++ \
 && apt-get install -y libglib2.0-0 libsm6 libxext6 libxrender-dev nano wget python3-pip pkg-config ffmpeg
RUN python3 -m pip install --upgrade pip

ENV NVIDIA_VISIBLE_DEVICES=all

# Install PyTorch and TorchVision
# Taken from https://forums.developer.nvidia.com/t/pytorch-for-jetson-version-1-8-0-now-available/72048
RUN wget https://nvidia.box.com/shared/static/mmu3xb3sp4o8qg9tji90kkxl1eijjfc6.whl -O torch-1.1.0-cp36-cp36m-linux_aarch64.whl \
 && apt-get -y install python3-pip libopenblas-base libopenmpi-dev \
 && python3 -m pip install Cython \
 && python3 -m pip install numpy torch-1.1.0-cp36-cp36m-linux_aarch64.whl

Running with --runtime nvidia and importing torch results in:

>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 79, in <module>
    from torch._C import *
ImportError: libcudart.so.10.0: cannot open shared object file: No such file or directory