Hi!
I have recently started testing Llama 3.3 and Llama 4 on my host. When I run Llama 3.3 it is quite slow, and when I run Llama 4 my Jetson AGX Orin Developer Kit (64 GB) stops responding after about 3 seconds and I have to force restart the machine.
What options are there to optimize the performance of the Jetson so that the mentioned LLMs can be used a little better?
Relevant information:
- OS: Ubuntu 22.04.5 LTS aarch64
- Host: NVIDIA Jetson AGX Orin Developer Kit (64 GB)
- Filesystem available: 1.6 TB (Samsung SSD 990 PRO 2 TB)
- Kernel: 5.15.148-tegra
Llama 3.1 runs well on a 32 GB AGX Orin dev kit.
I believe those two models are probably too large to run on a 64 GB AGX Orin. Here's Meta's Llama FAQ on hardware requirements:
“Hardware requirements vary based on the specific Llama model being used, latency, throughput and cost constraints. For the larger Llama models to achieve low latency, one would split the model across multiple inference chips (typically a GPU) with tensor parallelism. Llama models are known to execute in a performant manner on a wide variety of hardware including GPUs, CPUs (both x86 and ARM based), TPUs, NPUs and AI Accelerators. The smaller Llama models typically run on system-on-chip (SOC) platforms found on PC, Mobile and other Edge devices.”
Here's a blog post that covers, among other things, the computing resources required:
“The Llama 4 Scout model is released as BF16 weights, but can fit within a single H100 GPU with on-the-fly int4 quantization; the Llama 4 Maverick model is released as both BF16 and FP8 quantized weights. The FP8 quantized weights fit on a single H100 DGX host while still maintaining quality. We provide code for on-the-fly int4 quantization which minimizes performance degradation as well.”
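To put rough numbers on that, here is a back-of-envelope estimate of weight memory alone (my own approximation; KV cache, activations and the rest of the system come on top of this, and the parameter counts are approximate):

```python
# Back-of-envelope estimate of model weight memory (weights only;
# KV cache and runtime overhead are extra).
def weight_gb(params_billions, bytes_per_param):
    return params_billions * 1e9 * bytes_per_param / 1024**3

print(f"Llama 3.3 70B @ BF16: ~{weight_gb(70, 2):.0f} GB")    # ~130 GB
print(f"Llama 3.3 70B @ int4: ~{weight_gb(70, 0.5):.0f} GB")  # ~33 GB
print(f"Llama 4 Scout (~109B total params) @ int4: ~{weight_gb(109, 0.5):.0f} GB")  # ~51 GB
```

So a 4-bit 70B model can fit in 64 GB of unified memory but without much headroom, which would line up with it running slowly, while Llama 4 Scout at 4 bits plus KV cache and the OS pushes right up against the 64 GB limit.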
Hi,
How do you run it on the Orin?
Please find our tutorial to run LLAMA with MLC below:
Thanks
PS: Right after I typed the below, I realized AastaLLL was asking the original poster how they run it. I'll leave this here in case anyone finds it helpful.
Here's an app.py.txt to run Llama 3.1 on an AGX Orin 32 GB dev kit (a rough sketch of what it does follows the setup steps below).
To run it with python app.py, first:
pip install -U accelerate huggingface_hub transformers gradio
and install torch matching your CUDA version from jetson-ai-lab.
Then build and install bitsandbytes:
git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git
cd bitsandbytes/
cmake -DCOMPUTE_BACKEND=cuda -S .
make -j 6
python -m pip wheel . -w dist
pip install dist/bitsandbytes*.whl
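Roughly, the app.py does something along these lines. This is only an illustrative sketch of the same idea, not the attached file; the model name, generation length and port here are placeholders, and the attachment has the details:

```python
# Illustrative sketch: 4-bit quantized Llama chat behind a Gradio UI.
import torch
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model choice

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

def chat(message, history):
    # Gradio's ChatInterface passes (message, history); only the latest
    # message is used in this minimal sketch.
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": message}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

gr.ChatInterface(chat).launch(server_name="0.0.0.0", server_port=7860)
```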
Using ngrok http 7860 on my California-based AGX Orin, I've had family on the East Coast use it and other LLMs over the internet.
Thank you very much for your very quick reply.
I have run it both via the terminal (“ollama run llama3.3” or “ollama run llama4”)
as well as via the web UI “Open WebUI”.
Do you recommend installing and maintaining the other web UIs (Flowise, n8n, LLaMa Factory) in parallel with Open WebUI?
Which of the mentioned Web UIs are faster?
Thanks and best!
Hi,
Since MLC optimizes the model based on hardware, it should have better performance.
Do you want to run the model with MLC instead?
In our experience, the bottleneck for LLMs is the inference rather than the UI,
so you can pick one based on your preference.
Thanks.
Hi,
Thank you for your reply.
Yes, I would like to run the model with MLC to test the speed.
I would like to test the Llama 3.3 70B model with MLC. I assume that MLC does not interfere with the already installed Open WebUI (https://openwebui.com/) and its configuration, correct?
Thanks and best!
Hi,
The details can be found on the models page of the NVIDIA Jetson AI Lab.
You will need to run the model with MLC first.
The server will be launched at the IP:port specified in your config.
After that, you can launch another Open WebUI container that communicates with MLC at that IP:port.
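For example, something along these lines (the model name, port, and environment variables below are placeholders/assumptions rather than the exact steps from the Jetson AI Lab tutorial):

```
# Serve a quantized model with MLC's OpenAI-compatible server
# (model name and port are assumptions; see the Jetson AI Lab models page).
mlc_llm serve HF://mlc-ai/Llama-3.3-70B-Instruct-q4f16_1-MLC --host 0.0.0.0 --port 8000

# Point a separate Open WebUI container at that endpoint:
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://<jetson-ip>:8000/v1 \
  -e OPENAI_API_KEY=none \
  ghcr.io/open-webui/open-webui:main
```

Since this is a separate container that only talks to MLC over HTTP, it should not interfere with your existing Open WebUI setup.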
Thanks.