Problem: slow LLM inference speed on Jetson AGX Orin 64GB
I tried to deploy an LLM and run an inference service on an NVIDIA Jetson AGX Orin 64GB using the official “Ollama” Docker image, but found that the inference speed was slow, only about 50% of NVIDIA’s published benchmarks (Benchmarks - NVIDIA Jetson AI Lab).
I have tried to investigate the cause and improve the speed, but nothing has worked so far.
Some environment info of my Orin system:
- LSB_RELEASE: Ubuntu 20.04
- CUDA_VERSION: 12.2
- L4T_VERSION: 35.4.1
- JETPACK_VERSION: 5.1
Some of the things I’ve tried:
- Changed the “Power Mode” of the Jetson AGX Orin to MAXN.
- Migrated the Docker directory (data root) to the SSD; the LLMs are also stored on the SSD. (Commands are shown after this list.)
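Roughly the steps I used for the two items above (on the AGX Orin, nvpmodel mode 0 should correspond to MAXN, and `docker info` can confirm where the data root points):

# Query the current power mode, then switch to MAXN (mode 0 on AGX Orin)
sudo nvpmodel -q
sudo nvpmodel -m 0
# Lock the clocks to their maximum for the current power mode
sudo jetson_clocks
# Confirm the Docker data root now lives on the SSD
sudo docker info | grep "Docker Root Dir"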
And some tricks to improve “Ollama” inference speed:
- OLLAMA_FLASH_ATTENTION is set to 1.
- Preload a model into Ollama to get faster response times (a sketch is shown after this list). Refer to: (ollama/docs/faq.md at main · ollama/ollama · GitHub)
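The preloading trick from the FAQ is basically an empty generate request; something like the following (the model name here is just an example from my setup, and keep_alive of -1 keeps the model loaded indefinitely):

# Send an empty request to load the model into memory ahead of time
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'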
The “docker run” command I used to start the “Ollama” container:
sudo docker run -dit --runtime nvidia --gpus=all --rm --network=host \
  -v /ssd/llm/ollama:/root/.ollama \
  -e JETSON_JETPACK=5 \
  -e OLLAMA_HOST=0.0.0.0:11434 \
  -e OLLAMA_FLASH_ATTENTION=1 \
  -e OLLAMA_DEBUG=1 \
  --name ollama ollama/ollama
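To check whether the container was actually using the GPU during inference, I mostly looked at the Ollama server logs (verbose because of OLLAMA_DEBUG=1) and at tegrastats on the host; these are just generic checks, nothing Ollama-specific:

# Follow the Ollama server logs from the running container
sudo docker logs -f ollama
# Watch GPU/CPU utilization on the host while a prompt is being processed
sudo tegrastats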