How to build UCX, OpenMPI, and PyTorch with CUDA/distributed on AGX Orin

Build current versions of UCX, OpenMPI, and PyTorch with CUDA and distributed support.

Compile and install UCX with CUDA


sudo apt update
sudo apt install -y build-essential git pkg-config \
     autoconf automake libtool m4 \
     libnuma-dev hwloc libhwloc-dev
export CUDA_HOME="/usr/local/cuda" 
export UCX_PREFIX="/opt/ucx-1.20.0"
export PATH="${CUDA_HOME}/bin:${UCX_PREFIX}/bin:$PATH"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${UCX_PREFIX}/lib:$LD_LIBRARY_PATH"

sudo mkdir -p "$UCX_PREFIX"

git clone https://github.com/openucx/ucx.git && cd ucx

export UCX_TLS="cuda"
export UCX_NET_DEVICES="eno1"  # change to your preferred NIC.
export LDFLAGS="-L/usr/lib/aarch64-linux-gnu/tegra"

./autogen.sh
./configure --prefix=$UCX_PREFIX \
            --with-cuda="/usr/local/cuda" \
            --enable-mt              \
            --disable-assertions     \
            --disable-debug \
            --disable-params-check

make -j6

sudo make install

sudo tee /etc/profile.d/ucx.sh >/dev/null <<'UCXEOF'
export PATH=/opt/ucx-1.20.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/ucx-1.20.0/lib:$LD_LIBRARY_PATH
UCXEOF

source /etc/profile.d/ucx.sh

cd ..
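Before moving on, a quick sanity check can confirm the install. ucx_info ships with UCX; the grep pattern here is just a heuristic, and the snippet prints a hint instead of failing when UCX is not on PATH yet:

```shell
# Optional sanity check: confirm the freshly built UCX is on PATH and
# reports CUDA transports. Prints a hint instead of failing if it is not.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -v | head -n 1                 # library version banner
    ucx_info -d | grep -i cuda || echo "no CUDA transports reported" >&2
else
    echo "ucx_info not on PATH; source /etc/profile.d/ucx.sh first" >&2
fi
```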

Compile and install OpenMPI with CUDA

wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.8.tar.gz
tar xfz openmpi-5.0.8.tar.gz
cd openmpi-5.0.8

export OMPI_PREFIX="/opt/openmpi-5.0.8"
sudo mkdir -p "$OMPI_PREFIX"

./configure --prefix=$OMPI_PREFIX \
    --with-cuda=$CUDA_HOME \
    --with-ucx=$UCX_PREFIX \
    --with-ucx-libdir=$UCX_PREFIX/lib \
    --with-cuda-libdir=/usr/lib/aarch64-linux-gnu \
    --enable-mpirun-prefix-by-default

make -j6

sudo make install


sudo tee /etc/profile.d/openmpi.sh >/dev/null <<'EOF'
export PATH=/opt/openmpi-5.0.8/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-5.0.8/lib:$LD_LIBRARY_PATH
EOF

source /etc/profile.d/openmpi.sh
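As with UCX, a hedged check that the build picked up CUDA. Recent OpenMPI releases expose a cuda_support MCA parameter in ompi_info's output; treat the exact string as an assumption:

```shell
# Optional sanity check: ask ompi_info whether this OpenMPI build has
# CUDA support compiled in. Prints a hint instead of failing otherwise.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --parsable | grep -i cuda_support \
        || echo "ompi_info reports no CUDA support" >&2
else
    echo "ompi_info not on PATH; source /etc/profile.d/openmpi.sh first" >&2
fi
```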


Build pytorch/pytorch with MPI and distributed support.
Add "-b release/2.8" after "git clone" below, or just use main, which is
currently 2.9.0a0 (as of today, 2.8 is not quite released).

git clone https://github.com/pytorch/pytorch
cd pytorch
git submodule sync
git submodule update --init --recursive
pip install -r requirements.txt
sudo apt install cuda-nvtx-12-<N>   # N is your CUDA 12 minor version: 6, 8, or 9

export MAX_JOBS=6
export TORCH_CUDA_ARCH_LIST="8.7"
export USE_CUDA=1
export USE_CUDNN=1
export USE_PRIORITIZED_TEXT_FOR_LD=1
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export USE_DISTRIBUTED=1
export USE_MPI=1
export PKG_CONFIG_PATH="/usr/lib/pkgconfig:$PKG_CONFIG_PATH"
export USE_ROCM=0

python -m pip install --no-build-isolation -v .
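Once the install finishes, a quick import check verifies that CUDA and the MPI backend made it into the build. This is a sketch that simply falls through with a note if torch is not importable yet:

```shell
# Verify the freshly installed torch: CUDA runtime and MPI backend.
# Skips with a note if python or torch is not available yet.
command -v python >/dev/null 2>&1 && python - 2>/dev/null <<'PY' || echo "torch not importable yet" >&2
import torch
import torch.distributed as dist
print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("MPI backend built:", dist.is_mpi_available())
PY
```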

To build a wheel to keep, run the following; it takes a couple of minutes.

python -m pip wheel --no-build-isolation -v . -w dist

Your pip wheel can now be found here:

./dist/torch*.whl


Test torch/openmpi:

cat > allreduce_mpi.py <<'EOF'
import torch
import torch.distributed as dist

def main():
    gpu_ok = torch.cuda.is_available()
    device = torch.device("cuda:0" if gpu_ok else "cpu")
    dtype  = torch.float32                       # <= keep it FP32

    dist.init_process_group("mpi")

    rank  = dist.get_rank()
    world = dist.get_world_size()

    tensor    = torch.ones(4, device=device, dtype=dtype) * rank
    dist.all_reduce(tensor)

    expected  = torch.ones(4, device=device, dtype=dtype) * sum(range(world))
    assert torch.allclose(tensor, expected)

    print(f"[{rank}/{world}] All-reduce OK → {tensor[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
EOF

mpirun -n 4 python allreduce_mpi.py
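Each rank contributes rank × ones(4), so the reduced value is the sum of the rank ids 0..world-1. A quick sketch of that arithmetic for 4 ranks:

```shell
# Expected all-reduce value: sum of rank ids 0..world-1 = world*(world-1)/2.
world=4
expected=$(( world * (world - 1) / 2 ))
echo "expected all-reduce value: $expected"   # -> expected all-reduce value: 6
```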


It took a couple of hours and several recompilations to finally discover that this is what OpenMPI 5.x.x wanted in order to enable CUDA:

--with-cuda-libdir=/usr/lib/aarch64-linux-gnu


Where libcuda.so lives on JetPack 6.2.1 with CUDA 12.9:

dpkg -S $(find /usr -name libcuda.so)

cuda-driver-dev-12-9: /usr/local/cuda-12.9/targets/aarch64-linux/lib/stubs/libcuda.so
cuda-compat-12-9: /usr/local/cuda-12.9/compat/libcuda.so
nvidia-l4t-cuda: /usr/lib/aarch64-linux-gnu/nvidia/libcuda.so
nvidia-l4t-cuda: /usr/lib/aarch64-linux-gnu/libcuda.so
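Of the four copies, the nvidia-l4t-cuda one under /usr/lib/aarch64-linux-gnu is the real driver library; the stubs and compat copies exist for cross-linking and container compatibility (my reading of the packaging, not an official statement). To see which copy the dynamic loader actually resolves:

```shell
# Show which libcuda.so the dynamic loader would resolve at runtime.
# Prints a note instead of failing on machines without the driver.
ldconfig -p 2>/dev/null | grep libcuda.so || echo "libcuda.so not in the ldconfig cache" >&2
```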

*** Please note that this reply is generated by LLM automatically ***

To build current versions of UCX, OpenMPI, and PyTorch with CUDA and distributed support on Jetson AGX Orin, follow these steps:

Step 1: Install dependencies

  • Install the necessary build dependencies using the following command:

sudo apt-get update && sudo apt-get install -y build-essential git cmake libnuma-dev libudev-dev libibverbs-dev librdmacm-dev libmlx5-dev libmlx4-dev

Step 2: Build and install UCX with CUDA

  • Clone the UCX repository and check out the desired branch (e.g., release-1.12.0):

git clone https://github.com/openucx/ucx.git
cd ucx
git checkout release-1.12.0

  • Configure UCX with CUDA support:

./configure --enable-cuda --with-cuda=/usr/local/cuda --with-cudalibdir=/usr/lib/aarch64-linux-gnu

  • Build and install UCX:

make -j$(nproc)
sudo make install
Step 3: Build and install OpenMPI with CUDA

  • Clone the OpenMPI repository and check out the desired branch (e.g., v5.0.x):

git clone https://github.com/open-mpi/ompi.git
cd ompi
git checkout v5.0.x

  • Configure OpenMPI with CUDA support:

./configure --enable-cuda --with-cuda=/usr/local/cuda --with-cudalibdir=/usr/lib/aarch64-linux-gnu

  • Build and install OpenMPI:

make -j$(nproc)
sudo make install

Step 4: Build PyTorch with MPI and distributed support

  • Clone the PyTorch repository and check out the desired branch (e.g., release/1.12):

git clone https://github.com/pytorch/pytorch.git
cd pytorch
git checkout release/1.12

  • Install the necessary dependencies:

pip install -r requirements.txt

  • Build PyTorch with MPI and distributed support:

python setup.py build

  • Install PyTorch:

python setup.py install

Step 5: Build a wheel for PyTorch

  • Build a wheel for PyTorch using the following command:

python setup.py bdist_wheel

  • The wheel can be found in the dist directory.

Step 6: Test PyTorch with OpenMPI

  • Test PyTorch with OpenMPI using the following command:

mpirun -n 4 python allreduce.py

This will run the `allreduce.py` script with 4 processes using OpenMPI.

Note: Make sure to replace the branch names and versions with the desired ones. Also, the build process may take a few hours depending on the system configuration.


Hi,

Thanks a lot for the sharing.
Just want to double-check. Does the below command work as expected?

mpirun -n 4 python allreduce_mpi.py

Thanks.

Yes

date; mpirun -n 4 python allreduce_mpi.py; date
Mon Aug  4 10:06:55 PM PDT 2025
[1/4] All-reduce OK → 6.0
[0/4] All-reduce OK → 6.0
[2/4] All-reduce OK → 6.0
[3/4] All-reduce OK → 6.0
Mon Aug  4 10:07:00 PM PDT 2025

Thanks. This is helpful.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.