Support for the PyTorch distributed package

I am trying to connect my Jetson Nano to another node via the PyTorch distributed package. However, the package is not included in the PyTorch wheel provided by NVIDIA (https://devtalk.nvidia.com/default/topic/1049071/jetson-nano/pytorch-for-jetson-nano-version-1-3-0-now-available/), and my attempt to build PyTorch with distributed support is failing (https://github.com/pytorch/pytorch/issues/31710).

Could NVIDIA consider building PyTorch wheels that include torch.distributed? Since JetPack ships with OpenMPI, it may also be possible to exclude NCCL (export USE_NCCL=0) and build with MPI backend support.

Hi,

Thanks for your suggestion.
I will pass this request to our internal team.

Thanks!

Hi,

We got some feedback from our internal team.

When PyTorch 1.4 is released, we will try to build it with distributed support, but we cannot guarantee it.
In the meantime, we also recommend re-building the PyTorch package from source with distributed enabled.

Thanks.

Looks like PyTorch 1.4.0 is close to release. Thank you for your support!

Correct me if I’m wrong; by this:

In the meantime, we also recommend re-building the PyTorch package from source with distributed enabled.
do you mean that it is recommended to build PyTorch with distributed support once the NVIDIA-made wheels for PyTorch 1.4.0 are out?

If you don’t wish to wait for PyTorch v1.4.0, in the meantime you could attempt to build v1.3.0 from source with USE_DISTRIBUTED enabled.

When v1.4.0 is released, I’ll attempt to build it with USE_DISTRIBUTED (without USE_NCCL) if all goes well.

I’m curious, would you mind sharing some info about your use case for PyTorch distributed learning on Jetson? Do you have a cluster of Nanos?

Hi,

Thanks for looking into this issue.

Well, I can’t disclose the details since they are part of a paper I’m working on, but I am not actually doing distributed learning. I’m looking into ways to toss torch.Tensor objects across the network, and PyTorch natively supports this via the dist.send / dist.recv pair.
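
In case it helps illustrate what I mean, here is a minimal sketch of that pattern, assuming the MPI backend and a two-process mpirun launch (the script name and tensor contents are just placeholders):

```python
# Minimal point-to-point tensor transfer sketch using the MPI backend.
# Launch with something like: mpirun -np 2 python send_recv.py
import torch
import torch.distributed as dist

# With the MPI backend, rank and world size come from the MPI launcher,
# so no init_method or environment variables are needed here.
dist.init_process_group(backend="mpi")
rank = dist.get_rank()

if rank == 0:
    t = torch.arange(4, dtype=torch.float32)  # tensor to ship to rank 1
    dist.send(t, dst=1)
else:
    t = torch.zeros(4, dtype=torch.float32)   # pre-allocated receive buffer
    dist.recv(t, src=0)
    print(f"rank {rank} received {t}")
```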

So far, I have verified the process of building PyTorch with CUDA-enabled OpenMPI on a desktop. I then successfully built OpenMPI with CUDA support on my Jetson Nano (the default OpenMPI on Jetson is built without CUDA support). But right when I was about to build PyTorch itself, my power supply cable failed…

I’ll continue to post updates on my progress building PyTorch here.

For the PyTorch v1.4.0 release, I can try building PyTorch against the default OpenMPI, but if that requires recompiling OpenMPI with CUDA support, I don’t wish to make it a requirement for all users installing PyTorch on Jetson. Hopefully PyTorch can be compiled against the default OpenMPI, and then individual users who need it can recompile OpenMPI with CUDA support and swap out that backend.

Yes, I understand. Certainly not everyone needs OpenMPI with CUDA support. However, I doubt that the MPI backend can be swapped that easily: when I built PyTorch, the configuration step explicitly reported that it had found an OpenMPI with CUDA support.

Anyway, I think I successfully built PyTorch 1.3.0 with OpenMPI (and hence the distributed package). I installed CUDA-aware OpenMPI 4.0.2 following https://discuss.pytorch.org/t/segfault-using-cuda-with-openmpi/11140/2, then followed the usual build instructions on JetPack (L4T 32.2.3), except without export USE_DISTRIBUTED=0 and with a 5 GB swap file.

I have tested OpenMPI communication with PyTorch, and it seems to work. However, one problem I ran into is that when I try to send/receive tensors that reside on CUDA devices, OpenMPI complains that cuMemHostRegister is not implemented. Searching Google led me to the conclusion that this can’t really be fixed, but please let me know if anyone has resolved this issue. Maybe it’s not worth looking into, since tensor communications work anyway.
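
If it helps, a rough sketch of the workaround I have in mind is to stage CUDA tensors through host memory before handing them to MPI, so only the non-CUDA-aware code path is exercised (the helper names below are made up for illustration):

```python
# Hypothetical workaround sketch: stage CUDA tensors through host memory so
# OpenMPI only ever sees CPU buffers (avoiding the CUDA-aware code path).
import torch
import torch.distributed as dist

def send_cuda_tensor(t, dst):
    # Copy the device tensor to the host before handing it to MPI.
    dist.send(t.cpu(), dst=dst)

def recv_cuda_tensor(shape, dtype, src, device="cuda"):
    buf = torch.empty(shape, dtype=dtype)  # host-side receive buffer
    dist.recv(buf, src=src)
    return buf.to(device)                  # move the data back to the GPU
```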

Thanks.

OK, the PyTorch v1.4.0 wheels are now posted here: https://devtalk.nvidia.com/default/topic/1049071/jetson-nano/pytorch-for-jetson-nano-version-1-4-0-now-available/

They are built with USE_DISTRIBUTED=1 using the default OpenMPI backend. This is the version of OpenMPI from the Ubuntu repo, so, as you pointed out, I don’t believe it is CUDA-aware.
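
If you want to sanity-check a given wheel, a quick sketch like this should confirm that distributed and the MPI backend were compiled in:

```python
# Quick sanity check that torch.distributed and the MPI backend are present.
import torch.distributed as dist

print(dist.is_available())      # True if torch.distributed was compiled in
print(dist.is_mpi_available())  # True if the MPI backend was compiled in
```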