PyTorch 2.0.0.nv23.05

I am using PyTorch 2.0.0.nv23.05 on my Jetson Orin Nano.

I am working on data parallelism with PyTorch, but I got this error:
from torch.distributed import init_process_group, destroy_process_group
ImportError: cannot import name 'init_process_group' from 'torch.distributed'

Is the PyTorch build for Jetson not the same as standard PyTorch?
I can run my program on my Linux desktop, but I get this error when I run it on the Jetson.

Hi @hlau2, that PyTorch wheel for Jetson wasn’t built with USE_DISTRIBUTED enabled, so it doesn’t have torch.distributed available. You can either disable the distributed code in your script, or rebuild PyTorch with USE_DISTRIBUTED enabled.
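If you go the first route, here is a minimal sketch of how the distributed setup could be guarded so the same script runs on both machines. This assumes the script is launched with torchrun when you want multiple processes (so WORLD_SIZE is set in the environment), and the "nccl" backend is just a placeholder for whatever backend your build actually provides:

import os
import torch
import torch.distributed  # the module itself imports even on builds without distributed support

def maybe_init_distributed():
    # Only set up a process group if this build has distributed support
    # and the script was started by a multi-process launcher.
    if torch.distributed.is_available() and "WORLD_SIZE" in os.environ:
        from torch.distributed import init_process_group
        init_process_group(backend="nccl")  # placeholder; e.g. "gloo" if NCCL is unavailable
        return True
    return False  # fall back to plain single-process training

def maybe_destroy_distributed(enabled):
    if enabled:
        from torch.distributed import destroy_process_group
        destroy_process_group()

That way the desktop run can still use the process group, while the Jetson wheel quietly falls back to single-process training.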

You can find instructions on building PyTorch from source in this topic:

Thanks for your reply.

Does the build from source provide PyTorch v1.11?

If I want to build the v2.1.0 wheel for Jetson myself, how can I enable USE_DISTRIBUTED?
Which file should I edit?

And it is interesting that the official documentation does not contain this information, but a forum post does.

@hlau2 if you build PyTorch 2.1, you don’t need any patches for Jetson and can build it straight away like normal.

To enable torch.distributed in the build, just export USE_DISTRIBUTED=1 and apt-get install libopenblas-dev libopenmpi-dev beforehand.

I cloned PyTorch v2.1.0, installed the packages, and set:
export USE_NCCL=0
export USE_DISTRIBUTED=1
export USE_QNNPACK=0
export USE_PYTORCH_QNNPACK=0
export TORCH_CUDA_ARCH_LIST="7.2;8.7"
export PYTORCH_BUILD_VERSION=2.1.0
export PYTORCH_BUILD_NUMBER=1

and then ran python3 setup.py bdist_wheel

It took a few hours and then got stuck because it ran out of memory. Is that expected?

UPDATE: The build actually died:

g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
[413/1605] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/LogcumsumexpKernel.cu.o
ninja: build stopped: subcommand failed.

Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.distributed.is_available()
False
>>> import torch.distributed as dist

I built PyTorch with USE_DISTRIBUTED=1, and I can import torch.distributed now.

But torch.distributed.is_available() still returns False.

Why?

@hlau2 I haven’t used distributed mode, but I would check torch.__config__.show() and the PyTorch source to see what torch.distributed.is_available() is checking for.
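For reference, here is a quick way to inspect that from Python (a minimal check, assuming a standard PyTorch install; the backend queries are only defined when distributed support was compiled in, hence the guard):

import torch

# Print the compile-time configuration of this wheel
# (CUDA/cuDNN versions and build settings such as USE_NCCL, USE_MPI).
print(torch.__config__.show())

# True only if the wheel was built with USE_DISTRIBUTED enabled.
print(torch.distributed.is_available())

if torch.distributed.is_available():
    # Which process-group backends this particular build ships with.
    print(torch.distributed.is_gloo_available())
    print(torch.distributed.is_nccl_available())
    print(torch.distributed.is_mpi_available())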
