I’m exploring the possibility of running multiple LLM (Large Language Model) instances in parallel on the Jetson AGX Orin using multithreading or multiprocessing. The goal is to serve multiple queries simultaneously to improve throughput and responsiveness in real-time applications. Is this feasible given the AGX Orin’s GPU and CPU architecture, and what would be the recommended approach: threading, multiprocessing, or containerization? Any insights on resource allocation, performance optimization, or example implementations would be highly appreciated. A sketch of the pattern I have in mind follows below.
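For context, here is roughly the multiprocessing pattern I am considering: a pool of worker processes, each holding its own model instance, pulling prompts from a shared queue. This is only a minimal sketch; `load_model` is a stub standing in for a real engine load (llama.cpp, TensorRT-LLM, etc.), and the names and worker count are illustrative assumptions, not Jetson-specific APIs.

```python
import multiprocessing as mp

def load_model(worker_id):
    # Placeholder for loading a real LLM engine (e.g. a llama.cpp or
    # TensorRT-LLM model) inside this worker process. Stubbed so the
    # sketch runs anywhere; each process would hold its own model copy.
    def generate(prompt):
        return f"[worker {worker_id}] response to: {prompt}"
    return generate

def worker(worker_id, tasks, results):
    generate = load_model(worker_id)
    while True:
        prompt = tasks.get()
        if prompt is None:          # poison pill: shut the worker down
            break
        results.put(generate(prompt))

if __name__ == "__main__":
    NUM_WORKERS = 2                 # the AGX Orin has a single GPU, so
                                    # all workers share it; 2-3 is realistic
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(i, tasks, results))
             for i in range(NUM_WORKERS)]
    for p in procs:
        p.start()

    queries = ["What is CUDA?", "Summarize Jetson AGX Orin.", "Hello"]
    for q in queries:
        tasks.put(q)
    for _ in queries:
        print(results.get())        # results may arrive out of order

    for _ in procs:
        tasks.put(None)             # one poison pill per worker
    for p in procs:
        p.join()
```

Multiprocessing sidesteps Python’s GIL for the CPU-side work, but the GPU remains a single shared resource, so whether the model instances actually execute concurrently on it depends on the inference runtime.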
Related topics
| Topic | Replies | Views | Activity |
|---|---|---|---|
| Best way to put more than one Jetson AGX Orin working together | 7 | 1687 | September 26, 2023 |
| How to implement hardware parallel processing with multiple Jetson AGX Orin units? | 2 | 132 | June 13, 2024 |
| LLMs on DLA | 4 | 60 | April 2, 2025 |
| Is there a plan to support MiG on Orin AGX? | 4 | 56 | March 17, 2025 |
| Multi-tasking performance GPU/CPU | 3 | 823 | June 9, 2022 |
| Running LLMs with TensorRT-LLM on Nvidia Jetson AGX Orin Dev Kit | 1 | 499 | December 8, 2024 |
| TensorRT-LLM for Jetson | 10 | 1924 | April 21, 2025 |
| Want to run a Local LLM on Nvidia Jetson AGX Orin | 3 | 2858 | July 17, 2024 |
| MPS on AGX Orin? | 2 | 1315 | July 6, 2022 |
| Does AGX Orin support the Multi-Instance GPU (MIG)? | 4 | 1057 | May 11, 2022 |