Hi!
I have recently started testing Llama 3.3 and Llama 4 on my host. When I run Llama 3.3 it is quite slow, and when I run Llama 4 my Jetson AGX Orin Developer Kit (64 GB) stops responding after about 3 seconds and I have to force restart the machine.
What options are there to optimize the performance of the Jetson so that the mentioned LLMs can be used a little better?
Relevant information:
- OS: Ubuntu 22.04.5 LTS aarch64
- Host: NVIDIA Jetson AGX Orin Developer Kit (64 GB)
- Filesystem available: 1.6 TB (Samsung SSD 990 PRO 2 TB)
- Kernel: 5.15.148-tegra
Llama 3.1 runs well on a 32 GB AGX Orin dev kit.
I believe those two models are probably too large to run on a 64 GB AGX Orin. Here's Meta's Llama FAQ on hardware requirements:
“Hardware requirements vary based on the specific Llama model being used, latency, throughput and cost constraints. For the larger Llama models to achieve low latency, one would split the model across multiple inference chips (typically a GPU) with tensor parallelism. Llama models are known to execute in a performant manner on a wide variety of hardware including GPUs, CPUs (both x86 and ARM based), TPUs, NPUs and AI Accelerators. The smaller Llama models typically run on system-on-chip (SOC) platforms found on PC, Mobile and other Edge devices.”
Here's a blog post that covers, among other things, the computing resources required:
“The Llama 4 Scout model is released as BF16 weights, but can fit within a single H100 GPU with on-the-fly int4 quantization; the Llama 4 Maverick model is released as both BF16 and FP8 quantized weights. The FP8 quantized weights fit on a single H100 DGX host while still maintaining quality. We provide code for on-the-fly int4 quantization which minimizes performance degradation as well.”
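To put rough numbers on that, here is a back-of-envelope estimate of weight memory alone (my own approximation; KV cache, activations and the rest of the system come on top of this, and the parameter counts are approximate):

```python
# Back-of-envelope estimate of model weight memory (weights only;
# KV cache and runtime overhead are extra).
def weight_gb(params_billions, bytes_per_param):
    return params_billions * 1e9 * bytes_per_param / 1024**3

print(f"Llama 3.3 70B @ BF16: ~{weight_gb(70, 2):.0f} GB")    # ~130 GB
print(f"Llama 3.3 70B @ int4: ~{weight_gb(70, 0.5):.0f} GB")  # ~33 GB
print(f"Llama 4 Scout (~109B total params) @ int4: ~{weight_gb(109, 0.5):.0f} GB")  # ~51 GB
```

So a 4-bit 70B model can fit in 64 GB of unified memory but without much headroom, which would line up with it running slowly, while Llama 4 Scout at 4 bits plus KV cache and the OS pushes right up against the 64 GB limit.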
Hi,
How do you run it on the Orin?
Please find our tutorial to run LLAMA with MLC below:
Thanks
PS: Right after I typed the below, I realized AastaLLL was asking the original poster how they run it. I'll leave this here in case anyone finds it helpful.
Here's an app.py.txt to run Llama 3.1 on an AGX Orin 32 GB dev kit (a rough sketch of what it does follows the setup steps below).
To run it with python app.py, first:
pip install -U accelerate huggingface_hub transformers gradio
and install torch matching your CUDA version from jetson-ai-lab.
Then build and install bitsandbytes:
git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git
cd bitsandbytes/
cmake -DCOMPUTE_BACKEND=cuda -S .
make -j 6
python -m pip wheel . -w dist
pip install dist/bitsandbytes*.whl
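Roughly, the app.py does something along these lines. This is only an illustrative sketch of the same idea, not the attached file; the model name, generation length and port here are placeholders, and the attachment has the details:

```python
# Illustrative sketch: 4-bit quantized Llama chat behind a Gradio UI.
import torch
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model choice

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

def chat(message, history):
    # Gradio's ChatInterface passes (message, history); only the latest
    # message is used in this minimal sketch.
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": message}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

gr.ChatInterface(chat).launch(server_name="0.0.0.0", server_port=7860)
```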
Using ngrok http 7860 on my California-based AGX Orin, I've had family on the East Coast use it and other LLMs over the internet.
Thank you very much for your very quick reply.
I have run it both via the terminal (“ollama run llama3.3” or “ollama run llama4”)
as well as via the web UI “Open WebUI”.
Do you recommend installing and maintaining the other web UIs (Flowise, n8n, LLaMa Factory) in parallel with Open WebUI?
Which of the mentioned Web UIs are faster?
Thanks and best!
Hi,
Since MLC optimizes the model based on hardware, it should have better performance.
Do you want to run the model with MLC instead?
In our experience, the bottleneck for LLMs is the inference rather than the UI,
so you can pick one based on your preference.
Thanks.
Hi,
Thank you for your reply.
Yes, I would like to run the model with MLC to test the speed.
I would like to test the Llama 3.3 70B model with MLC. I assume that MLC does not interfere with the already installed Open WebUI (https://openwebui.com/) and its configuration, correct?
Thanks and best!
Hi,
The details can be found on the models page of the NVIDIA Jetson AI Lab.
You will need to run the model with MLC first.
The server will be launched at the IP:port specified in your config.
After that, you can launch another Open WebUI container that communicates with MLC at that IP:port.
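For example, something along these lines (the model name, port, and environment variables below are placeholders/assumptions rather than the exact steps from the Jetson AI Lab tutorial):

```
# Serve a quantized model with MLC's OpenAI-compatible server
# (model name and port are assumptions; see the Jetson AI Lab models page).
mlc_llm serve HF://mlc-ai/Llama-3.3-70B-Instruct-q4f16_1-MLC --host 0.0.0.0 --port 8000

# Point a separate Open WebUI container at that endpoint:
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://<jetson-ip>:8000/v1 \
  -e OPENAI_API_KEY=none \
  ghcr.io/open-webui/open-webui:main
```

Since this is a separate container that only talks to MLC over HTTP, it should not interfere with your existing Open WebUI setup.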
Thanks.