Problem: slow LLM inference speed on Jetson AGX Orin 64GB
I tried to deploy an LLM and run an inference service on an NVIDIA Jetson AGX Orin 64GB using the official “Ollama” Docker image, but found that the inference speed was slow, only about 50% of NVIDIA’s published benchmarks (Benchmarks - NVIDIA Jetson AI Lab).
I have tried to investigate the cause and improve the speed, but nothing has worked so far.
Some environment info of my Orin system:
- LSB_RELEASE: Ubuntu 20.04
- CUDA_VERSION: 12.2
- L4T_VERSION: 35.4.1
- JETPACK_VERSION: 5.1
Some of the things I’ve tried:
- Changed the “Power Mode” of the Jetson AGX Orin to MAXN.
- Migrated the Docker directory (data root) to the SSD; the LLMs are also stored on the SSD. (Commands are shown after this list.)
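Roughly the steps I used for the two items above (on the AGX Orin, nvpmodel mode 0 should correspond to MAXN, and `docker info` can confirm where the data root points):

# Query the current power mode, then switch to MAXN (mode 0 on AGX Orin)
sudo nvpmodel -q
sudo nvpmodel -m 0
# Lock the clocks to their maximum for the current power mode
sudo jetson_clocks
# Confirm the Docker data root now lives on the SSD
sudo docker info | grep "Docker Root Dir"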
And some tricks to improve “Ollama” inference speed:
- OLLAMA_FLASH_ATTENTION is set to 1.
- Preload a model into Ollama to get faster response times (a sketch is shown after this list). Refer to: (ollama/docs/faq.md at main · ollama/ollama · GitHub)
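The preloading trick from the FAQ is basically an empty generate request; something like the following (the model name here is just an example from my setup, and keep_alive of -1 keeps the model loaded indefinitely):

# Send an empty request to load the model into memory ahead of time
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'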
The “docker run” command I used to start the “Ollama” container:
sudo docker run -dit --runtime nvidia --gpus=all --rm --network=host \
  -v /ssd/llm/ollama:/root/.ollama \
  -e JETSON_JETPACK=5 \
  -e OLLAMA_HOST=0.0.0.0:11434 \
  -e OLLAMA_FLASH_ATTENTION=1 \
  -e OLLAMA_DEBUG=1 \
  --name ollama ollama/ollama
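To check whether the container was actually using the GPU during inference, I mostly looked at the Ollama server logs (verbose because of OLLAMA_DEBUG=1) and at tegrastats on the host; these are just generic checks, nothing Ollama-specific:

# Follow the Ollama server logs from the running container
sudo docker logs -f ollama
# Watch GPU/CPU utilization on the host while a prompt is being processed
sudo tegrastats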