Hi, I’m trying to run DeepSpeed with MPI backend on Jetson Orin AGX. Below is a detailed summary of my environment and what I’ve tried so far.
🚀 Goal
I want to run DeepSpeed distributed training with the MPI backend on a Jetson Orin AGX 64GB, inside a Docker container.
Environment details:
Docker image: dustynv/pytorch:2.7-r36.4.0-cu128-24.04
Driver Version: 540.4.0
CUDA Version: 12.8
Python: 3.12.3
OS Info: R36 (release), REVISION: 4.4, GCID: 41062509, BOARD: generic, EABI: aarch64, DATE: Mon Jun 16 16:07:13 UTC 2025
MPI/UCX status:
I have successfully built UCX and OpenMPI from source.
Verified CUDA-aware support:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
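If you want to script this check (for example in a container build or CI step), a small helper can parse the `--parsable` output. This is just a sketch based on the output line shown above; it only runs `ompi_info` when the binary is actually on PATH:

```python
import shutil
import subprocess

def has_cuda_aware_mpi(parsable_output: str) -> bool:
    """True if `ompi_info --parsable --all` output reports CUDA-aware MPI."""
    needle = "mca:mpi:base:param:mpi_built_with_cuda_support:value:true"
    return any(line.strip() == needle for line in parsable_output.splitlines())

# Only query ompi_info if it is available (e.g. inside the container).
if shutil.which("ompi_info"):
    out = subprocess.run(["ompi_info", "--parsable", "--all"],
                         capture_output=True, text=True).stdout
    print("CUDA-aware MPI:", has_cuda_aware_mpi(out))
```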
🔎 Problem Description
1. Container Test
Inside dustynv/pytorch:2.7-r36.4.0-cu128-24.04 container:
>>> import torch
>>> torch.__version__
'2.7.0'
>>> torch.distributed.is_available()
True
>>> torch.distributed.is_gloo_available()
True
>>> torch.distributed.is_nccl_available()
True
>>> torch.distributed.is_mpi_available()
False
👉 MPI backend is missing, so DeepSpeed + MPI cannot run.
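The interactive checks above can be collapsed into one script; this is a small sketch that surveys every backend at once and degrades gracefully if torch is not installed:

```python
import importlib.util

# Survey which torch.distributed backends this build exposes.
backend_status = {}
if importlib.util.find_spec("torch") is None:
    print("torch not installed")
else:
    import torch.distributed as dist
    backend_status = {
        "gloo": dist.is_gloo_available(),
        "nccl": dist.is_nccl_available(),
        "mpi": dist.is_mpi_available(),
    }
    for name, ok in backend_status.items():
        print(f"{name}: {ok}")
```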
2. Wheel Test
I also tested the following wheels; none of them report the MPI backend as available:
(a) torch-2.5.0a0+872d972e41
(b) torch-2.8.0 (Python 3.12 build)
(c) torch-2.3.0 (old build)
🙏 Request for Support
A PyTorch wheel/container for Jetson Orin built with MPI backend (USE_DISTRIBUTED=1, USE_MPI=1).
I can adjust to different JetPack/Python versions if needed, as long as MPI backend support is included.
This is required in order to successfully run DeepSpeed + MPI distributed training on Jetson Orin AGX.
This will take a while to compile, but as far as I know it's the best, and perhaps only, way to get torch with distributed support.
Make sure the AGX Orin is using all cores:
nvpmodel -q
If the reported mode is not 3, run the following and let it reboot:
sudo nvpmodel -m 3
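If you'd rather assert the power mode from a script before kicking off the long build, the `nvpmodel -q` output can be parsed. Note the assumed output format here (a mode-name line followed by the numeric mode ID on its own line) is based on JetPack 6; adjust if yours differs:

```python
def power_mode_id(nvpmodel_q_output: str):
    """Extract the numeric power-mode ID from `nvpmodel -q` output.

    Assumes the ID appears on its own line, e.g.:
        NV Power Mode: MAXN
        3
    Returns None if no bare numeric line is found.
    """
    for line in nvpmodel_q_output.splitlines():
        if line.strip().isdigit():
            return int(line.strip())
    return None
```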
Then
git clone -b release/2.8 https://github.com/pytorch/pytorch
cd pytorch
git submodule sync
git submodule update --init --recursive
pip install -r requirements.txt
pip install openmpi mpi4py
sudo apt install cuda-nvtx-12-<your CUDA minor version: 6, 8, or 9> libopenmpi3 libopenmpi-dev openmpi-bin openmpi-common
export MAX_JOBS=6
export TORCH_CUDA_ARCH_LIST="8.7"
export USE_CUDA=1
export USE_CUDNN=1
export USE_PRIORITIZED_TEXT_FOR_LD=1
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export USE_DISTRIBUTED=1
export USE_MPI=1
export PATH="$PATH:/usr/lib/pkgconfig"
export USE_ROCM=0
python -m pip install --no-build-isolation -v .
python -m pip wheel --no-build-isolation -v . -w dist
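Once the install finishes, a quick sanity check confirms the MPI backend actually made it into the build (guarded so the same snippet also runs where torch happens to be absent):

```python
import importlib.util

# Verify the freshly built torch exposes the MPI distributed backend.
if importlib.util.find_spec("torch") is None:
    result = "torch not installed"
else:
    import torch
    result = (f"torch {torch.__version__}, "
              f"MPI backend: {torch.distributed.is_mpi_available()}")
print(result)
```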
Then copy your pytorch/dist/torch*.whl to your Dockerfile directory and add a RUN step to your Dockerfile that installs it.
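For that wheel route, the Dockerfile addition can be as small as the following sketch (the wheel filename is a placeholder for whatever `pip wheel` produced in `dist/`; `--no-deps` avoids re-resolving dependencies already in the base image):

```dockerfile
FROM dustynv/pytorch:2.7-r36.4.0-cu128-24.04

# Install the locally built wheel over the image's bundled torch.
# torch-*.whl is a placeholder: use the actual filename from pytorch/dist/.
COPY torch-*.whl /tmp/
RUN pip install --force-reinstall --no-deps /tmp/torch-*.whl && \
    rm /tmp/torch-*.whl
```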
Hi,
Could you try to install OpenMPI first?
Since PyTorch uses it as a backend, you will need to enable it in your environment first.
You can find the building details below:
I am a student, and I have a Jetson Orin NX and a Jetson Orin Nano; the configuration is as follows:
Jetpack 6
CUDA 12.2
PyTorch 2.3.0 (from source)
PyTorch Geometric 2.6.0 (from source)
Python 3.9
I want to use them to build a distributed environment. With CPU + GLOO it works, but when I switched to GPU and NCCL I ran into failures. I then found that NCCL does not support Jetson, while MPI does, so I tried building OpenMPI from source:
export CUDA…
Thanks.
Thanks for the reply. 🙏 Just to clarify my setup:
I’ve already compiled OpenMPI from source inside the container, with CUDA-aware support enabled.
The environment verifies this successfully:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
UCX is also built and available inside the same container.
However, when running:
import torch
print(torch.distributed.is_mpi_available())
…it still returns False.
So from what I understand, this is likely because the PyTorch wheels currently available for Jetson (from the l4t-pytorch container or the official wheels) are not built with USE_MPI=1, hence the MPI backend is unavailable regardless of the environment setup.
Thanks again!
Thanks so much for your detailed build instructions — it’s super helpful 🙏
I’ve heard from others that building PyTorch on Jetson with the distributed + MPI backend might take some effort, so I’ve been hoping for a prebuilt wheel to test with first.
But if none are available, I’ll definitely consider trying the build process following your steps.
Just in case — would you mind sharing a bit about your build environment? (e.g. JetPack version, Python version, any specific configuration you used)
It might help me avoid common pitfalls when attempting the build.
Appreciate your support again!
Jetpack 6.2.1, AGX Orin 32GB.
I’ve built PyTorch from source with Python 3.8, 3.10, and most recently, with the above instructions, in a new 3.12 venv.
Here’s something that should work when added to your Dockerfile.
RUN git clone -b release/2.8 https://github.com/pytorch/pytorch && \
    cd pytorch && \
    git submodule sync && \
    git submodule update --init --recursive && \
    pip install -r requirements.txt && \
    pip install openmpi mpi4py && \
    apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
        cuda-nvtx-12-6 && \
    export MAX_JOBS=6 && \
    export TORCH_CUDA_ARCH_LIST="8.7" && \
    export USE_CUDA=1 && \
    export USE_CUDNN=1 && \
    export USE_PRIORITIZED_TEXT_FOR_LD=1 && \
    export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH" && \
    export USE_DISTRIBUTED=1 && \
    export USE_MPI=1 && \
    export PATH="$PATH:/usr/lib/pkgconfig" && \
    export USE_ROCM=0 && \
    python -m pip install --no-build-isolation -v . && \
    cd .. && rm -rf pytorch
# Only keep the next line if you want to create a torch*.whl:
# python -m pip wheel --no-build-isolation -v . -w dist
whitesscott:
Jetpack 6.2.1 agx orin 32gb
I’ve built pytorch from source with python 3.8, 3.10 and recently with above instructions with a new 3.12 venv.
Here’s something that should work when added to your Dockerfile.
…
Thanks a lot for sharing your build details! 🙏
I’m currently giving it a try following your instructions.
By the way, may I ask what base image you’re using for your Docker build?
It would help me match the environment more closely and avoid any unnecessary mismatches.
Thanks again!
Hi,
Yes, the prebuilt package is built with the scripts from jetson-containers.
Based on the source below, it enables USE_DISTRIBUTED but not USE_MPI:
https://github.com/dusty-nv/jetson-containers/blob/master/packages/pytorch/build.sh#L36
To unblock your work, please try building it from source after modifying the build.sh script shared above.
Or you can refer to the steps shared by @whitesscott in the topic below:
Build current versions of UCX, OpenMPI, and PyTorch with CUDA/distributed.
Compile and install UCX with cuda
sudo apt update
sudo apt install -y build-essential git pkg-config \
autoconf automake libtool m4 \
libnuma-dev hwloc libhwloc-dev
export CUDA_HOME="/usr/local/cuda"
export UCX_PREFIX="/opt/ucx-1.20.0"
export PATH="${CUDA_HOME}/bin:${UCX_PREFIX}/bin:$PATH"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${UCX_PREFIX}/lib:$LD_LIBRARY_PATH"
sudo mkdir -p "$UCX_PREFIX"
git clone http…
Thanks.
1 Like
Hi @whitesscott and everyone,
Thanks to all who replied — including the official response — even though I had already finished compiling by then, I really appreciate everyone’s help and input.
Following the instructions shared by @whitesscott , I successfully compiled PyTorch with MPI support on the first attempt. I’d like to share my build process, environment, and verification steps for reference.
Base Docker image used:
FROM dustynv/pytorch:2.7-r36.4.0-cu128-24.04
Inside the container, I first uninstalled the pre-installed PyTorch and then followed @whitesscott ’s instructions. Missing dependencies were installed manually as needed.
Environment (inside the container):
OS: Ubuntu 24.04.1 LTS
L4T version: 36.4.4
Driver Version: 540.4.0
CUDA Version: 12.8
Python version: 3.12.3
PyTorch version: 2.8.0a0+gitc525a02
cmake version: 4.0.0
gcc version: 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
MPI check:
import torch
torch.distributed.is_mpi_available() # >>> True
Allreduce test:
import torch
import torch.distributed as dist

BACKEND = "mpi"

def main():
    dist.init_process_group(backend=BACKEND)
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    tensor = torch.ones(1, device=device) * rank
    print(f"[Rank {rank}] Before: {tensor}")
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"[Rank {rank}] After all_reduce: {tensor}")

if __name__ == "__main__":
    main()
This code ran successfully and produced the expected all-reduce results across ranks.
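The script is launched under mpirun (e.g. `mpirun -np 2 python allreduce_test.py`, where the filename is whatever you saved it as). For anyone curious how the rank-to-device mapping works before `init_process_group` runs, each process also inherits its identity through Open MPI environment variables; this is a sketch assuming Open MPI's `OMPI_COMM_WORLD_*` variable names:

```python
import os

def local_device_index(env, device_count):
    """Map an Open MPI-launched process to a CUDA device index.

    Prefers OMPI_COMM_WORLD_LOCAL_RANK (the rank within this node),
    falling back to the global rank, then 0. On a single-GPU Jetson this
    is always 0, but the same mapping works on multi-GPU nodes.
    """
    rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK",
                       env.get("OMPI_COMM_WORLD_RANK", 0)))
    return rank % max(device_count, 1)

print(local_device_index(dict(os.environ), 1))
```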
Overall, compiling PyTorch with MPI support was smoother than I expected. An exact match of every environment version may not be necessary, so hopefully this gives others more confidence to try it out.
Thanks again to everyone for the guidance and discussion!
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.