Request for PyTorch Wheel with MPI Backend on Jetson Orin

Hi, I’m trying to run DeepSpeed with MPI backend on Jetson Orin AGX. Below is a detailed summary of my environment and what I’ve tried so far.

🚀 Goal

I want to run DeepSpeed distributed training with MPI backend on Jetson Orin AGX 64GB, inside a Docker container.

Environment details:

  • Docker image: dustynv/pytorch:2.7-r36.4.0-cu128-24.04
  • Driver Version: 540.4.0
  • CUDA Version: 12.8
  • Python: 3.12.3
  • OS Info: R36 (release), REVISION: 4.4, GCID: 41062509, BOARD: generic, EABI: aarch64, DATE: Mon Jun 16 16:07:13 UTC 2025

MPI/UCX status:

  • I have successfully built UCX and OpenMPI from source.

  • Verified CUDA-aware support:

    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
    mca:mpi:base:param:mpi_built_with_cuda_support:value:true
    

🔎 Problem Description

1. Container Test

Inside dustynv/pytorch:2.7-r36.4.0-cu128-24.04 container:

>>> import torch
>>> torch.__version__
'2.7.0'
>>> torch.distributed.is_available()
True
>>> torch.distributed.is_gloo_available()
True
>>> torch.distributed.is_nccl_available()
True
>>> torch.distributed.is_mpi_available()
False

👉 MPI backend is missing, so DeepSpeed + MPI cannot run.
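The interactive checks above can be collected into one small helper. This is an illustrative sketch (the function name `available_backends` is mine, not part of any API); it degrades gracefully if torch isn't installed:

```python
def available_backends():
    """Report which torch.distributed backends this build supports.

    Returns None if torch is not installed, an empty dict if the
    distributed package is unavailable, otherwise a backend -> bool map.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return None  # torch not installed in this environment
    if not dist.is_available():
        return {}
    return {
        "gloo": dist.is_gloo_available(),
        "nccl": dist.is_nccl_available(),
        "mpi": dist.is_mpi_available(),
    }

if __name__ == "__main__":
    print(available_backends())
```

On the stock `dustynv/pytorch` container this would show `mpi: False`, which is exactly the problem described here.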


2. Wheel Test

(a) torch-2.5.0a0+872d972e41


(b) torch-2.8.0 (Python 3.12 build)


(c) torch-2.3.0 (old build)


🙏 Request for Support

A PyTorch wheel/container for Jetson Orin built with MPI backend (USE_DISTRIBUTED=1, USE_MPI=1).

I can adjust to different JetPack/Python versions if needed, as long as MPI backend support is included.

This is required in order to successfully run DeepSpeed + MPI distributed training on Jetson Orin AGX.

This will take a while to compile, but as far as I know it is the best, and perhaps only, way to get torch with MPI-enabled distributed support on Jetson.

Make sure the AGX Orin is using all cores:

nvpmodel -q

If the mode is not 3, run the following and let the board reboot:

sudo nvpmodel -m 3

Then

git clone -b release/2.8 https://github.com/pytorch/pytorch
cd pytorch
git submodule sync
git submodule update --init --recursive
pip install -r requirements.txt
pip install openmpi mpi4py
# install the cuda-nvtx package matching your CUDA version: cuda-nvtx-12-6, cuda-nvtx-12-8, or cuda-nvtx-12-9
sudo apt install cuda-nvtx-12-<YourCudaMinorVersion> libopenmpi3 libopenmpi-dev openmpi-bin openmpi-common

export MAX_JOBS=6
export TORCH_CUDA_ARCH_LIST="8.7"
export USE_CUDA=1
export USE_CUDNN=1
export USE_PRIORITIZED_TEXT_FOR_LD=1
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export USE_DISTRIBUTED=1
export USE_MPI=1
export PATH=$PATH:'/usr/lib/pkgconfig'
export USE_ROCM=0
python -m pip install --no-build-isolation -v .

python -m pip wheel --no-build-isolation -v . -w dist
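The `cuda-nvtx` package name in the apt step above follows NVIDIA's `cuda-nvtx-<major>-<minor>` naming scheme. A tiny illustrative helper (my own, purely for clarity) to derive it from a CUDA version string:

```python
def nvtx_package(cuda_version: str) -> str:
    """Map a CUDA version string like '12.8' to its apt nvtx package name.

    Illustrative only; assumes NVIDIA's cuda-nvtx-<major>-<minor> scheme.
    """
    major, minor = cuda_version.split(".")[:2]
    return f"cuda-nvtx-{major}-{minor}"

print(nvtx_package("12.8"))  # -> cuda-nvtx-12-8
```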

Then copy your pytorch/dist/torch*.whl to your Dockerfile directory then add a
RUN to your Dockerfile.
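For that last step, a minimal sketch of the Dockerfile fragment (the wheel filename is a placeholder; match it to whatever filename your build actually produced in pytorch/dist/):

```dockerfile
# Assumes the wheel built above was copied next to the Dockerfile.
COPY torch-2.8.0*.whl /tmp/
RUN pip install --no-cache-dir /tmp/torch-2.8.0*.whl && \
    rm /tmp/torch-2.8.0*.whl
```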


Hi,

Could you try to install OpenMPI first?
Since PyTorch uses it as a backend, you will need to enable it in your environment first.

You can find the building details below:

Thanks.

Thanks for the reply. 🙏 Just to clarify my setup:

  1. I’ve already compiled OpenMPI from source inside the container, with CUDA-aware support enabled.

  2. The environment verifies this successfully:

    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
    mca:mpi:base:param:mpi_built_with_cuda_support:value:true
    
  3. UCX is also built and available inside the same container.

However, when running:

import torch
print(torch.distributed.is_mpi_available())

…it still returns False.

So from what I understand, this is likely because the current PyTorch wheels available for Jetson (from the l4t-pytorch container or official wheels) are not built with USE_MPI=1, so the MPI backend is unavailable regardless of the environment setup.
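One way to check this from the wheel itself is to inspect the compile-time configuration that `torch.__config__.show()` dumps. A guarded sketch (the helper name is mine, and the exact flag formatting can vary between builds, so treat the string match as a heuristic):

```python
def built_with_mpi():
    """Best-effort check of whether this torch build compiled the MPI backend.

    Returns None if torch isn't installed. torch.__config__.show() prints the
    CMake build summary; USE_MPI appearing as ON/1 there suggests MPI support.
    """
    try:
        import torch
    except ImportError:
        return None
    cfg = torch.__config__.show()
    return "USE_MPI=ON" in cfg or "USE_MPI=1" in cfg

if __name__ == "__main__":
    print("Built with MPI:", built_with_mpi())
```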

Thanks again!

Thanks so much for your detailed build instructions — it’s super helpful 🙏
I’ve heard from others that building PyTorch on Jetson with the distributed + MPI backend might take some effort, so I’ve been hoping for a prebuilt wheel to test with first.
But if none are available, I’ll definitely consider trying the build process following your steps.

Just in case — would you mind sharing a bit about your build environment? (e.g. JetPack version, Python version, any specific configuration you used)
It might help me avoid common pitfalls when attempting the build.

Appreciate your support again!


Jetpack 6.2.1 agx orin 32gb

I’ve built pytorch from source with python 3.8, 3.10, and recently, with the above instructions, in a new 3.12 venv.


Here’s something that should work, added to your Dockerfile.

RUN git clone -b release/2.8 https://github.com/pytorch/pytorch && \
    cd pytorch && \
    git submodule sync && \
    git submodule update --init --recursive && \
    pip install -r requirements.txt && \
    pip install openmpi mpi4py && \
    apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
        cuda-nvtx-12-6 &&\
    export MAX_JOBS=6 && \
    export TORCH_CUDA_ARCH_LIST="8.7" && \
    export USE_CUDA=1 && \
    export USE_CUDNN=1 && \
    export USE_PRIORITIZED_TEXT_FOR_LD=1 && \
    export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH" && \
    export USE_DISTRIBUTED=1 && \
    export USE_MPI=1 && \
    export PATH="$PATH:/usr/lib/pkgconfig" && \
    export USE_ROCM=0 && \
    python -m pip install --no-build-isolation -v . && \
    cd .. && rm -rf pytorch
    #only want next line if you want to create a torch*.whl
    #python -m pip wheel --no-build-isolation -v . -w dist 

Thanks a lot for sharing your build details! 🙏
I’m currently giving it a try following your instructions.

By the way, may I ask what base image you’re using for your Docker build?
It would help me match the environment more closely and avoid any unnecessary mismatches.

Thanks again!

Hi,

Yes, the prebuilt package is built with the jetson-containers script.
Based on the source below, it enables USE_DISTRIBUTED but not USE_MPI.

https://github.com/dusty-nv/jetson-containers/blob/master/packages/pytorch/build.sh#L36

To unblock your work, please try building it from source after modifying the build.sh script shared above.
Or you can refer to the steps shared by @whitesscott in the topic below:

Thanks.


Hi @whitesscott and everyone,

Thanks to all who replied — including the official response — even though I had already finished compiling by then, I really appreciate everyone’s help and input.

Following the instructions shared by @whitesscott, I successfully compiled PyTorch with MPI support on the first attempt. I’d like to share my build process, environment, and verification steps for reference.

Base Docker image used:
FROM dustynv/pytorch:2.7-r36.4.0-cu128-24.04

Inside the container, I first uninstalled the pre-installed PyTorch and then followed @whitesscott’s instructions. Missing dependencies were installed manually as needed.

Environment (inside the container):

  • OS: Ubuntu 24.04.1 LTS
  • L4T version: 36.4.4
  • Driver Version: 540.4.0
  • CUDA Version: 12.8
  • Python version: 3.12.3
  • PyTorch version: 2.8.0a0+gitc525a02
  • cmake version: 4.0.0
  • gcc version: 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)

MPI check:

import torch
torch.distributed.is_mpi_available()  # >>> True

Allreduce test:

# Launch with e.g.: mpirun -np 2 python allreduce_test.py  (filename is just an example)
import torch
import torch.distributed as dist

BACKEND = "mpi"

def main():
    # With the MPI backend, rank/world size come from the MPI launcher.
    dist.init_process_group(backend=BACKEND)

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Spread ranks across available GPUs (a single GPU on Jetson).
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    tensor = torch.ones(1, device=device) * rank
    print(f"[Rank {rank}/{world_size}] Before: {tensor}")
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

    print(f"[Rank {rank}/{world_size}] After all_reduce: {tensor}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

This code ran successfully and produced the expected all-reduce results across ranks.

Overall, compiling PyTorch with MPI support was smoother than I expected. It may not require an exact match of all environment versions, so hopefully this gives others more confidence to try it out.

Thanks again to everyone for the guidance and discussion!


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.