LLM library recommendations for maximum token speed

Hey everyone.

I’m looking for the best way to run LLMs on the AGX Orin. So far I’ve tested the Python llama-cpp wheel with CUDA 12.6 and had fairly underwhelming performance with everything from a 0.5B model to a 75B model; it’s consistently slow for me. When I switched to CUDA 12.9 (unsupported) and llama.cpp 3.16 (also unsupported) I was pushing 300+ tokens a second with the llama.cpp C++ binaries using a 7B Mistral model, which was more what I was expecting from 275 TOPS. However, this came with a mountain of corruption issues in the CUDA buffers and so on. So it was fast, but outputting garbage fast is still, well… outputting garbage. I’ve also tested Ollama and ran into pretty slow results as well, with the 7B model touching about 15 tokens a second. And I struggled to find a version of TensorRT that I could install at all, with constant build failures leading to dead ends.
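As a sanity check on those numbers (the figures below are my assumptions, not measured values): single-stream token generation is memory-bandwidth bound, since each generated token requires reading roughly the whole model once from DRAM. A back-of-envelope ceiling for a 7B model on this board:

```shell
# Back-of-envelope decode ceiling for single-stream generation.
# Assumptions: AGX Orin 64GB theoretical peak memory bandwidth,
# and a 7B model at ~4-bit quantization.
BW_GBS=204.8    # GB/s, theoretical peak memory bandwidth
MODEL_GB=4.0    # approx. bytes read per token (~model size at Q4)
awk -v bw="$BW_GBS" -v m="$MODEL_GB" \
  'BEGIN { printf "single-stream ceiling: ~%.0f tokens/s\n", bw/m }'
```

By that estimate, ~50 tokens a second is about the realistic single-stream ceiling for a dense 7B here, which also makes the 300+ tokens a second run look too good to be true.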

I was hoping to get some updated resources that are well documented and currently supported on the Jetson AGX Orin, with clear install instructions, that deliver better than 10–20 tokens a second with a 7B model. Something that actually uses the full performance of the device.

I spent about a week on this, and there are just so many incompatible documents in the NVIDIA documentation that I was getting pretty lost navigating all the dead ends. Note also that this is for a real-time robotics application that requires large token outputs at high speed, so at least 75 tokens a second.

So if anyone has any tutorials or recommendations that work as of today, that would be great. I know this device can run at blistering speeds; I’m just super confused as to how to get to that point.

My current platform is:
Jetson AGX Orin Developer Kit 64GB, JetPack 6.2, CUDA 12.6.

Hi,

Please note that you can maximize the device performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

In the link below, you can find the command and container to deploy a certain model on the Jetson:

Thanks.

So, the option is to run everything in vLLM with Docker? That’s brutal; I was hoping this platform would have matured a bit more.

Is there a known timeline for the AGX to be brought up to modern standards to match its hardware capabilities?

Thanks

Hi,

You can also try to build vLLM on Jetson.
Below is the related script for your reference:

Thanks.

Alright, I have now sunk a few days into trying to get the two methods you provided working.

The Docker commands provided in this Models | Jetson AI Lab page are not recognized. I just get `docker: command not found`, but I can assure you I’ve installed Docker and spent time going back and forth, even sending LLM agents on a wild goose chase to figure it out.
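For anyone hitting the same wall, here are the quick checks I’d run for `docker: command not found` (a sketch; the exact fix depends on how Docker was installed):

```shell
# Quick sanity checks for "docker: command not found".
# (A sketch; names and paths are the usual defaults, not guaranteed.)
if command -v docker >/dev/null 2>&1; then
  echo "docker binary: found at $(command -v docker)"
else
  echo "docker binary: not on PATH"
fi
# Even with the binary present, the current user must be in the docker
# group (and must log out/in after `sudo usermod -aG docker $USER`):
if id -nG | grep -qw docker; then
  echo "docker group: current user is a member"
else
  echo "docker group: current user is not a member"
fi
```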

Second, this jetson-containers/packages/llm/vllm/build.sh at master · dusty-nv/jetson-containers · GitHub
fails and refuses to build for CUDA. I’m not sure what is going on over there at NVIDIA, but this is getting weird.

So I guess this brings me back to my original question:

Is there a known timeline for the AGX to be brought up to modern standards, or to any functional standard in the near future? Or did I just buy a really expensive paperweight?

Hi,

Please find below how to set up Docker:
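For reference, after installing the runtime, the usual `/etc/docker/daemon.json` on JetPack looks roughly like this (a sketch based on the standard container runtime setup; setting `default-runtime` to `nvidia` lets containers access the GPU without passing `--runtime nvidia` each time):

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
```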

After that, you only need to pull the container to run a model.
If you prefer to install vLLM locally, please try the repo above to build it.

Thanks.

Alright, I’m back. I have successfully built the TensorRT-LLM AGX branch, and it was an absolute nightmare. The PyTorch packages are profoundly inconsistent and crash during the build without a reliable error code. It really came down to luck, to be honest, so I had an agent just keep re-running the build until it succeeded. It took a number of tries, but it eventually worked. Very strange.

So in its current state, TensorRT-LLM on the AGX does work, but it requires a very hands-on build process. I can only recommend using an agent to troubleshoot the failures, or just restarting the build over and over until it goes through.

One huge note: TensorRT-LLM is extremely slow for me. I cannot get more than 20 tokens a second on any model size. I’ve tried everything from a 0.5B generated engine to a 34B model; all cap at 20 tokens a second. It feels hard-locked. When running the system in max mode, it gets to 22 tokens a second but then immediately throttles and drops to 5 tokens a second. This is consistent across the board with all model sizes.

I was never able to get the Docker version to work; it just crashes.

I wish I had consistent error codes to help troubleshoot or provide some useful debugging info, but every single crash had an entirely unique and unrelated error code. Sometimes all it took was re-running the build over and over without any changes until it eventually worked.

I also notice there is a newer JetPack version I’ve yet to test; hopefully there are some performance corrections there. I will test with it and update this post with my findings.

All llama-cpp-python versions and all llama.cpp versions have a buffer corruption issue with the KV cache shift, on all models and all configurations. You will get reliable output back and forth for one or two exchanges, then just garbage. I can’t find any way to document it, as the crash seems to come from the CUDA drivers themselves, and the error codes and outputs are inconsistent and unreliable.
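One workaround worth noting if the corruption really is in the KV cache shift path: avoid ever triggering the shift, either via llama.cpp’s `--no-context-shift` flag (I believe recent builds have it; check `llama-cli --help` on yours) or by sizing requests so the cache never fills. A sketch of the sizing arithmetic, with assumed numbers:

```shell
# Avoid ever triggering the KV-cache shift by keeping prompt + output
# under the context window. (A sketch; the numbers are assumptions.)
# Flag-based alternative on recent llama.cpp builds:
#   llama-cli -m model.gguf -ngl 99 -c 8192 --no-context-shift
N_CTX=8192        # context window passed to llama.cpp (-c)
MAX_OUTPUT=1024   # tokens reserved for generation (-n)
echo "cap each prompt at $((N_CTX - MAX_OUTPUT)) tokens to stay shift-free"
```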

I will update shortly with my latest findings.

You did not provide the model used for comparison.
This is what I get with my AGX Orin 64GB:

  • gemma3:4b → eval rate: 34 tokens/s
  • gpt-oss:20b → eval rate: 29 tokens/s (same rate for 150 tokens or 700 tokens)

Your setup seems more optimized than mine, you should be getting higher numbers.
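For anyone comparing numbers: Ollama’s “eval rate” is derived from the `eval_count` and `eval_duration` fields it reports, with durations in nanoseconds. A sketch with made-up sample values:

```shell
# How Ollama's "eval rate" is derived: eval_count / eval_duration.
# The API reports eval_duration in nanoseconds. The sample values
# below are made up to mirror a ~29 tokens/s run.
EVAL_COUNT=700
EVAL_DURATION_NS=24137931034    # ~24.1 s
awk -v c="$EVAL_COUNT" -v d="$EVAL_DURATION_NS" \
  'BEGIN { printf "eval rate: %.1f tokens/s\n", c / (d / 1e9) }'
```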

I was testing with a number of fine-tuned models from Hugging Face. The stock models just don’t run at anywhere near usable speeds for me.

I actually got fed up with troubleshooting the horrific setup documentation, and I burned 90 dollars on Opus 4.6 to fix my system setup and get it running. vLLM works now with the Docker setup, but it’s really not as fast as it should be. Max mode is very odd: it will run faster for a second, then throttle down to some very slow speeds. Very unstable.

I’m testing with a lot of varying models; some are Qwen and others are Mistral, which are the fastest ones I could find for this device. I have settled on Qwen3 4B and Mistral 4B Instruct; they get decent enough speeds and are semi-capable.

I’m just wondering now when they’re going to take the brakes off this platform; the numbers say it should be a lot faster than it’s currently showing.

One place I’ve noticed things are fast is running multiple models concurrently. For example, running four Qwen 4B instances, the per-instance speed is identical to running a single instance.
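That concurrency behavior is expected when decode is memory-bandwidth bound: batched decode reads the weights from DRAM once per step and shares that read across streams, so per-stream speed holds roughly flat while aggregate throughput scales with the batch. Rough arithmetic (the per-stream rate below is an assumed figure):

```shell
# Aggregate throughput when decode is batched: the weight read is
# amortized across streams, so throughput ~ per-stream rate * streams
# (until compute or KV-cache bandwidth becomes the new bottleneck).
PER_STREAM=20   # tokens/s for one instance (assumed, approximate)
STREAMS=4
echo "aggregate: $((PER_STREAM * STREAMS)) tokens/s across $STREAMS streams"
```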

Based on the specs of this device, I’m expecting at least double the current token output speed. The roadmap says the Orin platforms are being updated to JetPack 7 in Q2, so maybe that’s when things will get faster. Fingers crossed things pick up speed then.

Hi,

Could you try our vLLM container below? (Download the one that was built for r36.4.)

Based on the benchmark result here, we can get 231 tokens/s on AGX Orin 64GB with concurrency=8.
You can find the environment and command on the same page above.

Thanks.