I’m looking to deploy an LLM across multiple AGX Orins to get more compute. Can I do this with Triton Inference Server?
Hi,
Do you want to split a single query across multiple Orins?
Or do you want to run different queries on different Orins at the same time to increase throughput?
You can find our tutorial for LLM below:
Thanks.
Hi,
I want to connect multiple AGX Orins to form a cluster so I can deploy a large model that cannot run on a single Jetson.
Using Llama 3.1 70B as an example, I want to deploy it on 2 x AGX Orin (32GB) so that the model can run successfully.
Thank you.
One way to do it is MPI.
I’ve done the following, several months back, as a proof of concept: mpi4py with mpiexec or mpirun worked to run a Python script across, in my case, 3 machines with different operating systems and CPU architectures.
On both Orins do the following:
- Install OpenMPI and mpi4py on both Orins
sudo apt update
sudo apt install openmpi-bin libopenmpi-dev
pip install mpi4py
- MPI requires password-less SSH access between nodes.
#Generate SSH keys if you don’t already have them.
ssh-keygen -t rsa -b 4096
Copy the public key from the first Orin to the second, and from the second Orin to the first.
ssh-copy-id username@remote_jetson_ip
- Confirm password-less SSH access from both Orins.
ssh username@remote_jetson_ip
- Create a directory to hold your Python script, the hostfile, and any other needed files on both Orins.
#hostfile
orin1_ip_address slots=6
orin2_ip_address slots=6
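As a quick sanity check before running any Python (this step isn’t in the list above; it assumes the hostfile is saved as a file literally named hostfile), you can run a trivial command on both nodes:
mpirun --hostfile hostfile --map-by node -np 2 hostname
Each Orin should print its hostname, confirming MPI can launch processes on both machines.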
#A test Python script; place it in mirrored directories on both Orins
from mpi4py import MPI
import torch
import os
import multiprocessing
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Rank {rank}/{size} running on {device} using {multiprocessing.cpu_count()} CPU cores")
# Each rank contributes its own rank number; Allreduce sums them across all ranks.
# Note: reducing a GPU tensor in place like this needs a CUDA-aware MPI build; otherwise move the tensor to the CPU first.
tensor = torch.tensor([rank], dtype=torch.float32, device=device)
comm.Allreduce(MPI.IN_PLACE, tensor, op=MPI.SUM)
print(f"Rank {rank} has tensor {tensor}")
Hi,
If Llama 3.3 70B is an option for you, it can run on AGX Orin:
Thanks.
It can only run on the AGX Orin 64GB, not the 32GB.
Llama 3.3 70B is just an example; I want to deploy an LLM distributed across multiple AGX Orins. Thanks.
Thanks @whitesscott
Have you tried deploying an LLM this way as well? Thanks.
No, but I’m looking for a better method and think I’ll try accelerate or deepspeed.
I think both would need PyTorch built with
export USE_DISTRIBUTED=1
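You can check whether your current wheel already has distributed support with:
python3 -c "import torch; print(torch.distributed.is_available())"
In case it helps, here is a rough sketch of what a DeepSpeed tensor-parallel inference script could look like. The model name, the tp size of 2, and the file name are placeholders, and I haven’t tried this on Jetson, so treat it as a starting point rather than a verified recipe; it assumes a PyTorch build with distributed support and a working multi-node setup.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder; pick a model that fits after sharding
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Shard the model's weights across the participating processes (tensor parallelism)
engine = deepspeed.init_inference(model, mp_size=2, dtype=torch.float16)

inputs = tokenizer("Hello from two Orins:", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The deepspeed launcher accepts a hostfile in the same hostname slots=N format as the OpenMPI one above, so something like deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 1 run_llm.py should spread the ranks across both Orins (run_llm.py being whatever you name the script).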