I run some language models on a Jetson AGX Orin using the Python `transformers` library. The models are downloaded from Hugging Face, e.g. DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-1.5B, and so on.
But I found that the output token speed of DeepSeek-R1-Distill-Qwen-1.5B is less than 20 tokens/s and that of DeepSeek-R1-Distill-Qwen-7B is about 10 tokens/s, which is far below the numbers NVIDIA published: 180.4 tokens/s for DeepSeek-R1-Distill-Qwen-7B and 16.96 tokens/s even for DeepSeek-R1-Distill-Qwen-32B. Besides, Llama 3.3 70B is larger than 64 GB, so it cannot be run on the Orin with Python `transformers` at all.
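For reference, here is roughly how I measure throughput. This is a minimal sketch, assuming greedy decoding with `model.generate`; the model ID and prompt are just examples, and timing the whole `generate` call includes the prefill phase, so the figure slightly understates pure decode speed.

```python
import time


def tokens_per_second(new_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock seconds."""
    return new_tokens / elapsed_s


if __name__ == "__main__":
    # Assumes a CUDA device and the model already cached locally.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # example model
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="cuda"
    )

    inputs = tok("Explain KV caching in one paragraph.", return_tensors="pt").to("cuda")
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    elapsed = time.perf_counter() - start

    # Count only newly generated tokens, not the prompt tokens.
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{tokens_per_second(new_tokens, elapsed):.1f} tokens/s")
```

On my device this kind of measurement is what yields the ~10-20 tokens/s figures above.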
I wonder what approach NVIDIA used to run these LLMs and obtain those output tokens/s numbers.
