LLM library recommendations for maximum token speed

Hey everyone.

I’m looking for the best way to run LLMs on the AGX Orin. So far I’ve tested the Python llama-cpp wheel with CUDA 12.6 and had fairly underwhelming performance with everything from a 0.5B model to a 75B model; it’s consistently slow for me. When I switched to CUDA 12.9 (unsupported) and llama.cpp 3.16 (also unsupported) I was pushing 300+ tokens a second with the llama.cpp C++ binaries using a 7B Mistral model, which was more what I was expecting from 275 TOPS. However, this came with a mountain of corruption issues in the CUDA buffers and so on. So it was fast, but outputting garbage fast is still, well… outputting garbage. I’ve also tested Ollama and ran into pretty slow results as well, with the 7B model touching about 15 tokens a second. And I struggled to find a version of TensorRT that I could install at all, with constant build failures leading to dead ends.
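As a sanity check on those numbers (the figures below are my assumptions, not measured values): single-stream token generation is memory-bandwidth bound, since each generated token requires reading roughly the whole model once from DRAM. A back-of-envelope ceiling for a 7B model on this board:

```shell
# Back-of-envelope decode ceiling for single-stream generation.
# Assumptions: AGX Orin 64GB theoretical peak memory bandwidth,
# and a 7B model at ~4-bit quantization.
BW_GBS=204.8    # GB/s, theoretical peak memory bandwidth
MODEL_GB=4.0    # approx. bytes read per token (~model size at Q4)
awk -v bw="$BW_GBS" -v m="$MODEL_GB" \
  'BEGIN { printf "single-stream ceiling: ~%.0f tokens/s\n", bw/m }'
```

By that estimate, ~50 tokens a second is about the realistic single-stream ceiling for a dense 7B here, which also makes the 300+ tokens a second run look too good to be true.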

I was hoping to get some updated resources that are well documented and currently supported on the Jetson AGX Orin, with clear install instructions, that deliver better than 10–20 tokens a second with a 7B model. Something that actually uses the full performance of the device.

I spent about a week on this, and there are just so many incompatible documents in the NVIDIA documentation that I was getting pretty lost navigating all the dead ends. Note also that this is for a real-time robotics application that requires large token outputs at high speed, so at least 75 tokens a second.

So if anyone has any tutorials or recommendations that work as of today, that would be great. I know this device can run at blistering speeds; I’m just super confused as to how to get to that point.

My current platform is:
Jetson AGX Orin Developer Kit 64GB, JetPack 6.2, CUDA 12.6.

Hi,

Please note that you can maximize the device performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

In the link below, you can find the command and container to deploy a certain model on the Jetson:

Thanks.

So, the option is to run everything in vLLM with Docker? That’s brutal; I was hoping this platform would have matured a bit more.

Is there a known timeline for the AGX to be brought up to modern standards to match its hardware capabilities?

Thanks

Hi,

You can also try to build vLLM on Jetson.
Below is the related script for your reference:

Thanks.

Alright, I have now sunk a few days into trying to get the two methods you provided working.

The Docker commands provided in this Models | Jetson AI Lab page are not recognized. I just get `docker: command not found`, but I can assure you I’ve installed Docker and spent time going back and forth, even sending LLM agents on a wild goose chase to figure it out.
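For anyone hitting the same wall, here are the quick checks I’d run for `docker: command not found` (a sketch; the exact fix depends on how Docker was installed):

```shell
# Quick sanity checks for "docker: command not found".
# (A sketch; names and paths are the usual defaults, not guaranteed.)
if command -v docker >/dev/null 2>&1; then
  echo "docker binary: found at $(command -v docker)"
else
  echo "docker binary: not on PATH"
fi
# Even with the binary present, the current user must be in the docker
# group (and must log out/in after `sudo usermod -aG docker $USER`):
if id -nG | grep -qw docker; then
  echo "docker group: current user is a member"
else
  echo "docker group: current user is not a member"
fi
```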

Second, this jetson-containers/packages/llm/vllm/build.sh at master · dusty-nv/jetson-containers · GitHub
fails and refuses to build for CUDA. I’m not sure what is going on over there at NVIDIA, but this is getting weird.

So I guess this brings me back to my original question:

Is there a known timeline for the AGX to be brought up to modern standards, or to any functional standard in the near future? Or did I just buy a really expensive paperweight?

Hi,

Please find below how to set up Docker:
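For reference, after installing the runtime, the usual `/etc/docker/daemon.json` on JetPack looks roughly like this (a sketch based on the standard container runtime setup; setting `default-runtime` to `nvidia` lets containers access the GPU without passing `--runtime nvidia` each time):

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
```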

After that, you only need to pull the container to run a model.
If you prefer to install vLLM locally, please try the repo above to build it.

Thanks.

Alright, I’m back. I have successfully built the TensorRT-LLM AGX branch, and it was an absolute nightmare. The PyTorch packages are profoundly inconsistent and crash during the build without a reliable error code. It really came down to luck, to be honest, so I had an agent just keep re-running the build until it succeeded. It took a number of tries, but it eventually worked. Very strange.

So in its current state, TensorRT-LLM on the AGX does work, but it requires a very hands-on build process. I can only recommend using an agent to troubleshoot the failures, or just restarting the build over and over until it goes through.

One huge note: TensorRT-LLM is extremely slow for me. I cannot get more than 20 tokens a second on any model size. I’ve tried everything from a 0.5B generated engine to a 34B model; all cap at 20 tokens a second. It feels hard-locked. When running the system in max mode, it gets to 22 tokens a second but then immediately throttles and drops to 5 tokens a second. This is consistent across the board with all model sizes.

I was never able to get the Docker version to work; it just crashes.

I wish I had consistent error codes to help troubleshoot or provide some useful debugging info, but every single crash had an entirely unique and unrelated error code. Sometimes all it took was re-running the build over and over without any changes until it eventually worked.

I also notice there is a newer JetPack version I’ve yet to test; hopefully there are some performance corrections there. I will test with it and update this post with my findings.

All llama-cpp-python versions and all llama.cpp versions have a buffer corruption issue with the KV cache shift, on all models and all configurations. You will get reliable output back and forth for one or two exchanges, then just garbage. I can’t find any way to document it, as the crash seems to come from the CUDA drivers themselves, and the error codes and outputs are inconsistent and unreliable.
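One workaround worth noting if the corruption really is in the KV cache shift path: avoid ever triggering the shift, either via llama.cpp’s `--no-context-shift` flag (I believe recent builds have it; check `llama-cli --help` on yours) or by sizing requests so the cache never fills. A sketch of the sizing arithmetic, with assumed numbers:

```shell
# Avoid ever triggering the KV-cache shift by keeping prompt + output
# under the context window. (A sketch; the numbers are assumptions.)
# Flag-based alternative on recent llama.cpp builds:
#   llama-cli -m model.gguf -ngl 99 -c 8192 --no-context-shift
N_CTX=8192        # context window passed to llama.cpp (-c)
MAX_OUTPUT=1024   # tokens reserved for generation (-n)
echo "cap each prompt at $((N_CTX - MAX_OUTPUT)) tokens to stay shift-free"
```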

I will update shortly with my latest findings.

You did not provide the model used for comparison.
This is what I get with my AGX Orin 64GB:

  • gemma3:4b → eval rate: 34 tokens/s
  • gpt-oss:20b → eval rate: 29 tokens/s (same rate for 150 tokens or 700 tokens)

Your setup seems more optimized than mine, you should be getting higher numbers.
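For anyone comparing numbers: Ollama’s “eval rate” is derived from the `eval_count` and `eval_duration` fields it reports, with durations in nanoseconds. A sketch with made-up sample values:

```shell
# How Ollama's "eval rate" is derived: eval_count / eval_duration.
# The API reports eval_duration in nanoseconds. The sample values
# below are made up to mirror a ~29 tokens/s run.
EVAL_COUNT=700
EVAL_DURATION_NS=24137931034    # ~24.1 s
awk -v c="$EVAL_COUNT" -v d="$EVAL_DURATION_NS" \
  'BEGIN { printf "eval rate: %.1f tokens/s\n", c / (d / 1e9) }'
```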

I was testing with a number of fine-tuned models from Hugging Face. The stock models just don’t run at anywhere near usable speeds for me.

I actually got fed up with troubleshooting the horrific setup documentation, and I burned 90 dollars on Opus 4.6 to fix my system setup and get it running. vLLM works now with the Docker setup, but it’s really not as fast as it should be. Max mode is very odd: it will run faster for a second, then throttle down to some very slow speeds. Very unstable.

I’m testing with a lot of varying models; some are Qwen and others are Mistral, which are the fastest ones I could find for this device. I have settled on Qwen3 4B and Mistral 4B Instruct; they get decent enough speeds and are semi-capable.

I’m just wondering now when they’re going to take the brakes off this platform; the numbers say it should be a lot faster than it’s currently showing.

One place I’ve noticed things are fast is running multiple models concurrently. For example, running four Qwen 4B instances, the per-instance speed is identical to running a single instance.
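That concurrency behavior is expected when decode is memory-bandwidth bound: batched decode reads the weights from DRAM once per step and shares that read across streams, so per-stream speed holds roughly flat while aggregate throughput scales with the batch. Rough arithmetic (the per-stream rate below is an assumed figure):

```shell
# Aggregate throughput when decode is batched: the weight read is
# amortized across streams, so throughput ~ per-stream rate * streams
# (until compute or KV-cache bandwidth becomes the new bottleneck).
PER_STREAM=20   # tokens/s for one instance (assumed, approximate)
STREAMS=4
echo "aggregate: $((PER_STREAM * STREAMS)) tokens/s across $STREAMS streams"
```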

Based on the specs of this device, I’m expecting at least double the current token output speed. The roadmap says the Orin platforms are being updated to JetPack 7 in Q2, so maybe that’s when things will get faster. Fingers crossed things pick up speed then.

Hi,

Could you try our vLLM container below? (Download the one that was built for r36.4.)

Based on the benchmark result here, we can get 231 tokens/s on AGX Orin 64GB with concurrency=8.
You can find the environment and command on the same page above.

Thanks.