Token speed of LLMs on Jetson AGX Orin

I ran some language models with the Python transformers library on a Jetson AGX Orin. The models were downloaded from Hugging Face, e.g. DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Qwen-1.5B.
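By token speed I mean generated tokens divided by generation wall time. A minimal sketch of that measurement (the timing helper is generic; the transformers calls in the comments show the usual `AutoModelForCausalLM` pattern, not my exact script, and the model id is assumed):

```python
import time

def tokens_per_sec(generate_fn, num_new_tokens):
    """Time one generation call and return output tokens per second."""
    start = time.perf_counter()
    generate_fn()
    return num_new_tokens / (time.perf_counter() - start)

# With transformers it would be used roughly like this:
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
#   model = AutoModelForCausalLM.from_pretrained(
#       "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", device_map="cuda")
#   ids = tok("Hello", return_tensors="pt").input_ids.to("cuda")
#   rate = tokens_per_sec(lambda: model.generate(ids, max_new_tokens=128), 128)

# Self-contained demo with a dummy "model" that sleeps 0.5 s for 100 tokens:
demo_rate = tokens_per_sec(lambda: time.sleep(0.5), 100)
print(f"{demo_rate:.0f} tokens/s")
```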

But I found that the output token speed of DeepSeek-R1-Distill-Qwen-1.5B is less than 20 tokens/s and that of DeepSeek-R1-Distill-Qwen-7B is about 10 tokens/s, while the numbers published by NVIDIA are much higher: 180.4 tokens/s for DeepSeek-R1-Distill-Qwen-7B and 16.96 tokens/s for DeepSeek-R1-Distill-Qwen-32B. Besides, Llama 3.3 70B is larger than 64 GB, so it can't be run on Orin with Python transformers at all.

I wonder what approach NVIDIA used to run these LLMs and obtain those tokens/s numbers.

Hi,

The table was measured with vLLM using quantization = w4a16, max concurrency = 8, input seq len = 2048, and output seq len = 128.
Could you try a similar setting to see if you get the same performance?
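For reference, that setting maps onto vLLM's offline throughput benchmark roughly as below. This is only a sketch: the checkpoint placeholder and the choice of AWQ (one w4a16 scheme vLLM supports) are assumptions, not the exact command NVIDIA ran.

```shell
# benchmark_throughput.py ships in the vLLM source tree under benchmarks/.
# --max-num-seqs 8 caps in-flight requests (the "max concurrency = 8" above);
# input/output lengths match the published table.
python benchmarks/benchmark_throughput.py \
    --model <a-w4a16-quantized-DeepSeek-R1-Distill-Qwen-7B-checkpoint> \
    --quantization awq \
    --max-num-seqs 8 \
    --input-len 2048 \
    --output-len 128 \
    --num-prompts 64
```

The script prints aggregate throughput in output tokens/s, which is comparable to the table's numbers.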

Thanks.

Hello, could you tell me the versions of vLLM, PyTorch, and JetPack you used?

My versions are:

Jetpack 6.2.1

pytorch 2.5.0a0+872d972e41.nv24.08

I can't tell which vLLM version is suitable.

Hi,

Sorry for the late update.
You can find the detailed steps in the link below:

Thanks.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.