Hi, has anyone tried running both an SLM like TinyLlama-1.1B and YOLOv8 object detection on a Jetson Orin Nano 8GB? Do you think it would be able to run them simultaneously?
Hi @gaiusjulius, presuming your models fit into memory, you can run them simultaneously by throttling the SLM/LLM token generation rate to sustain the desired performance.
Run YOLO on a separate CUDA stream, but if you encounter stuttering you may need to add sleep() calls to the inner loop of the LLM inference, so it doesn’t consume 100% of the GPU generating tokens as fast as possible. Text-based chat with the user tends to be an irregular workload, while the vision workload stays fairly constant.
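As a rough illustration of the sleep() idea, here is a minimal sketch of a token loop capped at a target rate. It is not taken from any particular LLM library: `throttled_generate`, `next_token`, and the `None` end-of-sequence signal are hypothetical names, and `next_token` stands in for whatever single decode step your runtime exposes.

```python
import time


def throttled_generate(next_token, max_tokens, tokens_per_sec):
    """Yield tokens from next_token() no faster than tokens_per_sec.

    next_token is a hypothetical callable that runs one decode step and
    returns a token, or None when generation is finished.
    """
    min_interval = 1.0 / tokens_per_sec
    for _ in range(max_tokens):
        start = time.monotonic()
        token = next_token()  # one LLM decode step (GPU busy here)
        if token is None:     # assumed end-of-sequence signal
            break
        yield token
        # Sleep off the rest of this token's time budget so the GPU
        # has idle gaps that the vision model can use.
        remaining = min_interval - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)


if __name__ == "__main__":
    fake_tokens = iter(["Hello", ",", " world", "!"])
    for tok in throttled_generate(lambda: next(fake_tokens, None),
                                  max_tokens=16, tokens_per_sec=10):
        print(tok, end="", flush=True)
    print()
```

The cap is something you would tune by hand: lower `tokens_per_sec` until the YOLO frame rate stops stuttering. Reading speed is a natural floor for a text chat, so there is usually some headroom.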
In this video we ran VLM, LLM, vectorDB, ASR, and TTS simultaneously on AGX Orin:
There are rate limiters in there for controlling how fast each stream/model runs, and those get dialed in. I also had to pay attention to the CUDA streams (if your LLM is running in a different process than YOLO, that shouldn’t be necessary, as they are already in different CUDA contexts at the driver level).
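For the per-model rate limiters, something along these lines could work when both models live in one process on separate threads. This is only a sketch under my own assumptions, not the code from the demo: `RateLimiter`, `run_limited`, and the placeholder step functions are hypothetical, and the real inference calls would go where the counters are incremented.

```python
import threading
import time


class RateLimiter:
    """Caps a loop at max_hz iterations per second; adjustable at runtime."""

    def __init__(self, max_hz):
        self._lock = threading.Lock()
        self._interval = 1.0 / max_hz
        self._next_time = time.monotonic()

    def set_rate(self, max_hz):
        # Lets you dial the rate in while the loops are running.
        with self._lock:
            self._interval = 1.0 / max_hz

    def wait(self):
        # Block until the next slot; no burst catch-up after a slow step.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_time - now)
            self._next_time = max(now, self._next_time) + self._interval
        if delay > 0:
            time.sleep(delay)


def run_limited(step, limiter, stop_event):
    """Call step() repeatedly, paced by limiter, until stop_event is set."""
    while not stop_event.is_set():
        limiter.wait()
        step()


if __name__ == "__main__":
    counts = {"vision": 0, "llm": 0}

    def vision_step():  # placeholder for one YOLO frame
        counts["vision"] += 1

    def llm_step():     # placeholder for one LLM token
        counts["llm"] += 1

    stop = threading.Event()
    workers = [
        threading.Thread(target=run_limited, args=(vision_step, RateLimiter(30), stop)),
        threading.Thread(target=run_limited, args=(llm_step, RateLimiter(8), stop)),
    ]
    for w in workers:
        w.start()
    time.sleep(1.0)
    stop.set()
    for w in workers:
        w.join()
    print(counts)
```

Giving each model its own limiter keeps the steady vision loop at a fixed rate while the bursty chat loop gets a ceiling, and `set_rate` makes it possible to adjust either one without restarting.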
Thank you for this information!