Build current versions of UCX, Open MPI, and PyTorch with CUDA and distributed support.
Step 1: compile and install UCX with CUDA support.
sudo apt update
sudo apt install -y build-essential git pkg-config \
  autoconf automake libtool m4 \
  libnuma-dev hwloc libhwloc-dev

export CUDA_HOME="/usr/local/cuda"
export UCX_PREFIX="/opt/ucx-1.20.0"
export PATH="${CUDA_HOME}/bin:${UCX_PREFIX}/bin:$PATH"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${UCX_PREFIX}/lib:$LD_LIBRARY_PATH"

# -p makes this idempotent: a plain mkdir errors out if the prefix already
# exists from a previous run.
sudo mkdir -p "$UCX_PREFIX"

git clone https://github.com/openucx/ucx.git && cd ucx || exit 1

# Runtime UCX settings — not required for the build itself.
# NOTE(review): UCX_TLS="cuda" by itself disables every network transport at
# run time; for actual multi-node jobs you usually want something like
# UCX_TLS="tcp,cuda_copy,cuda_ipc". Confirm against your deployment.
export UCX_TLS="cuda"
export UCX_NET_DEVICES="eno1" # change to your preferred NIC.

# Jetson/L4T ships the NVIDIA driver libraries under the tegra directory, so
# the linker needs this extra search path.
export LDFLAGS="-L/usr/lib/aarch64-linux-gnu/tegra"

./autogen.sh
./configure --prefix="$UCX_PREFIX" \
  --with-cuda="/usr/local/cuda" \
  --enable-mt \
  --disable-assertions \
  --disable-debug \
  --disable-params-check
make -j"$(nproc)"   # use all cores instead of a hard-coded -j6
sudo make install

# Persist the UCX paths for future login shells.
sudo tee /etc/profile.d/ucx.sh >/dev/null <<'UCXEOF'
export PATH=/opt/ucx-1.20.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/ucx-1.20.0/lib:$LD_LIBRARY_PATH
UCXEOF
source /etc/profile.d/ucx.sh
cd ..
Step 2: compile and install Open MPI with CUDA support.
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.8.tar.gz
tar xfz openmpi-5.0.8.tar.gz
cd openmpi-5.0.8 || exit 1

export OMPI_PREFIX="/opt/openmpi-5.0.8"
# -p: no error if the prefix already exists (safe to re-run).
sudo mkdir -p "$OMPI_PREFIX"

# --with-cuda-libdir must point at the directory holding the real libcuda.so;
# on Jetson (JetPack 6.x) that is /usr/lib/aarch64-linux-gnu, not the CUDA
# toolkit's stubs directory. See the note at the bottom of this file.
./configure --prefix="$OMPI_PREFIX" \
  --with-cuda="$CUDA_HOME" \
  --with-ucx="$UCX_PREFIX" \
  --with-ucx-libdir="$UCX_PREFIX/lib" \
  --with-cuda-libdir=/usr/lib/aarch64-linux-gnu \
  --enable-mpirun-prefix-by-default
make -j"$(nproc)"   # use all cores instead of a hard-coded -j6
sudo make install

# Persist the Open MPI paths for future login shells.
sudo tee /etc/profile.d/openmpi.sh >/dev/null <<'EOF'
export PATH=/opt/openmpi-5.0.8/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-5.0.8/lib:$LD_LIBRARY_PATH
EOF
source /etc/profile.d/openmpi.sh
Step 3: build pytorch/pytorch with MPI and distributed support.
Add "-b release/2.8" after "git clone" below, or just use main,
which is currently 2.9.0a0 (as of today, 2.8 is not quite released).
git clone https://github.com/pytorch/pytorch
cd pytorch || exit 1
git submodule sync
git submodule update --init --recursive
pip install -r requirements.txt

# NVTX headers for the installed CUDA version. Adjust the package suffix
# (12-6, 12-8, 12-9, ...) to match `nvcc --version`; the original line used
# a placeholder that is not valid shell.
sudo apt install -y cuda-nvtx-12-9

export MAX_JOBS=6                   # cap parallel compile jobs (Jetson RAM limit)
export TORCH_CUDA_ARCH_LIST="8.7"   # Orin GPU architecture (sm_87)
export USE_CUDA=1
export USE_CUDNN=1
export USE_PRIORITIZED_TEXT_FOR_LD=1
# Double quotes so $LD_LIBRARY_PATH actually expands — the original used
# single quotes, which stored the literal string '$LD_LIBRARY_PATH' and
# discarded the UCX/Open MPI paths set earlier.
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export USE_DISTRIBUTED=1
export USE_MPI=1
# A pkg-config search directory belongs in PKG_CONFIG_PATH; appending it to
# PATH (as the original did) has no effect on pkg-config lookups.
export PKG_CONFIG_PATH="/usr/lib/pkgconfig:${PKG_CONFIG_PATH:-}"
export USE_ROCM=0
python -m pip install --no-build-isolation -v .
To also build a wheel you can keep, run the following (it takes a couple of minutes):
# Build a reusable wheel into ./dist without re-resolving build deps.
python -m pip wheel -v --no-build-isolation --wheel-dir dist .
Your pip wheel can then be found at ./dist/torch*.whl
Step 4: test the torch + Open MPI build with a small all-reduce script:
# Minimal MPI all-reduce smoke test for the torch build. The pasted original
# had lost its Python indentation; restored here, plus a clean process-group
# shutdown so PyTorch does not warn about a leaked process group on exit.
cat > allreduce_mpi.py <<'EOF'
import torch
import torch.distributed as dist


def main():
    # Fall back to CPU so the script still runs where CUDA is unavailable.
    gpu_ok = torch.cuda.is_available()
    device = torch.device("cuda:0" if gpu_ok else "cpu")
    dtype = torch.float32  # <= keep it FP32

    dist.init_process_group("mpi")
    rank = dist.get_rank()
    world = dist.get_world_size()

    tensor = torch.ones(4, device=device, dtype=dtype) * rank
    dist.all_reduce(tensor)  # default reduce op is SUM

    # SUM across ranks 0..world-1 puts 0+1+...+(world-1) in every slot.
    expected = torch.ones(4, device=device, dtype=dtype) * sum(range(world))
    assert torch.allclose(tensor, expected), f"rank {rank}: got {tensor}, expected {expected}"
    print(f"[{rank}/{world}] All-reduce OK → {tensor[0].item()}")

    # Tear down cleanly so every rank exits without warnings.
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
EOF

mpirun -n 4 python allreduce_mpi.py
It took a couple of hours and several recompilations to finally discover that this
is what Open MPI 5.x.x wanted in order to enable CUDA:
--with-cuda-libdir=/usr/lib/aarch64-linux-gnu
Locations of libcuda.so on JetPack 6.2.1 with CUDA 12.9 (per dpkg):
dpkg -S $(find /usr -name libcuda.so)
cuda-driver-dev-12-9: /usr/local/cuda-12.9/targets/aarch64-linux/lib/stubs/libcuda.so
cuda-compat-12-9: /usr/local/cuda-12.9/compat/libcuda.so
nvidia-l4t-cuda: /usr/lib/aarch64-linux-gnu/nvidia/libcuda.so
nvidia-l4t-cuda: /usr/lib/aarch64-linux-gnu/libcuda.so