I have recently started testing llama3.3 and llama4 on my host. When I run llama3.3 it is quite slow, and when I run llama4 my Jetson AGX Orin Developer Kit (64 GB) stops responding after about 3 seconds, so I have to force-restart the machine.
What options are there to optimize the performance of our Jetson so that we can run these LLMs a little better?
I believe those two models are probably too large to run on an AGX Orin 64 GB. Here's Meta Llama's FAQ on hardware requirements:
“Hardware requirements vary based on the specific Llama model being used, latency, throughput and cost constraints. For the larger Llama models to achieve low latency, one would split the model across multiple inference chips (typically a GPU) with tensor parallelism. Llama models are known to execute in a performant manner on a wide variety of hardware including GPUs, CPUs (both x86 and ARM based), TPUs, NPUs and AI Accelerators. The smaller Llama models typically run on system-on-chip (SOC) platforms found on PC, Mobile and other Edge devices.”
“The Llama 4 Scout model is released as BF16 weights, but can fit within a single H100 GPU with on-the-fly int4 quantization; the Llama 4 Maverick model is released as both BF16 and FP8 quantized weights. The FP8 quantized weights fit on a single H100 DGX host while still maintaining quality. We provide code for on-the-fly int4 quantization which minimizes performance degradation as well.”
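Meta ships its own on-the-fly int4 code, but as an illustration of the same idea using Jetson-friendly tooling, here is a minimal sketch of 4-bit quantized loading with transformers + bitsandbytes (the model ID is an assumption on my part; pick something that actually fits in 64 GB):

```python
# Hedged sketch: load a Llama checkpoint with on-the-fly 4-bit (NF4) quantization.
# The model ID is an illustrative assumption, not a tested configuration on Orin.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical choice; use a size that fits your memory

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Hello from a Jetson AGX Orin!", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```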
P.S. (Right after I typed the below, I realized AastaLLL was asking the original poster how to run it. I'll leave this here in case anyone finds it helpful.)
Here's an app.py.txt to run llama3.1 on an AGX Orin 32 GB dev kit.
To run it with python app.py, first pip install -U accelerate huggingface_hub transformers gradio.
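For anyone who can't grab the attachment, a rough sketch of that kind of Gradio wrapper might look like the following. This is not the attached app.py; the model ID, prompt handling, and generation settings are my assumptions:

```python
# Rough sketch of a Gradio chat wrapper around a 4-bit-quantized Llama model.
# Not the attached app.py; model ID and generation settings are illustrative assumptions.
import torch
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical; use a model you have access to

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

def respond(message, history):
    # Assumes Gradio's (user, assistant) tuple history; rebuild it as chat-template messages.
    messages = []
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Listen on all interfaces so ngrok (or another machine on the LAN) can reach port 7860.
gr.ChatInterface(respond).launch(server_name="0.0.0.0", server_port=7860)
```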
Install torch for your CUDA version from jetson-ai-lab.
And build and install bitsandbytes:
git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git
cd bitsandbytes/
cmake -DCOMPUTE_BACKEND=cuda -S .
make -j 6
python -m pip wheel . -w dist
pip install dist/bitsandbytes*.whl
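Once the wheel installs, a quick Python smoke test can confirm the CUDA backend actually loaded. This is just a sketch; module paths can shift between bitsandbytes versions:

```python
# Smoke test: check that the freshly built bitsandbytes wheel sees the GPU.
import torch
import bitsandbytes as bnb

print("bitsandbytes version:", bnb.__version__)
print("CUDA available:", torch.cuda.is_available())

# Allocate a tiny 4-bit linear layer on the GPU and run one forward pass.
layer = bnb.nn.Linear4bit(64, 64, compute_dtype=torch.float16).to("cuda")
x = torch.randn(1, 64, dtype=torch.float16, device="cuda")
print("Linear4bit output shape:", layer(x).shape)
```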
Using ngrok http 7860 on my California-based AGX Orin, I've had family on the East Coast use it and other LLMs over the internet.
Yes, I would like to run the model with MLC to test the speed.
I would like to test the Llama 3.3 70B model with MLC. I assume that MLC will not interfere with the already installed Open WebUI (https://openwebui.com/) and its configuration - correct?