Problem: slow LLM inference speed on Jetson AGX Orin 64GB

On an “NVIDIA Jetson AGX Orin 64GB”, I deployed an LLM inference service with the official “Ollama” Docker image, but found that inference was slow, reaching only about 50% of NVIDIA’s published numbers (Benchmarks - NVIDIA Jetson AI Lab).

I have tried to find the cause and improve the speed, but nothing has helped so far.

Some environment info of my Orin system:

  • LSB_RELEASE: Ubuntu 20.04
  • CUDA_VERSION: 12.2
  • L4T_VERSION: 35.4.1
  • JETPACK_VERSION: 5.1

Some of the things I’ve tried:

  • Changed the “Power Mode” of the Jetson AGX Orin to MAXN (see the command sketch after this list).
  • Migrated the Docker directory (data root) to the SSD; the LLM models are also stored on the SSD (see the daemon.json sketch after this list).

And some tricks to improve “Ollama” inference speed:

  • OLLAMA_FLASH_ATTENTION is set to 1.
  • Preloaded a model into Ollama to get faster response times (see the curl sketch after this list). Refer to: (ollama/docs/faq.md at main · ollama/ollama · GitHub)
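
For the power-mode step, a minimal sketch of the commands involved (assuming mode ID 0 maps to MAXN on the AGX Orin; check your unit with nvpmodel -q):

sudo nvpmodel -q          # show the current power mode
sudo nvpmodel -m 0        # switch to MAXN (mode 0 on AGX Orin)
sudo jetson_clocks        # lock CPU/GPU/EMC clocks at their maximum
sudo jetson_clocks --show # confirm the clocks are pinned

Note that even in MAXN the clocks still scale dynamically unless jetson_clocks pins them, so short benchmark runs can read lower than expected.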
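For the Docker data-root migration, a sketch of the /etc/docker/daemon.json change (the /ssd/docker path is only a placeholder for my SSD mount point; keep your existing “runtimes” entry for the NVIDIA runtime as-is):

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "data-root": "/ssd/docker"
}

Then copy the existing data and restart the daemon:

sudo rsync -aP /var/lib/docker/ /ssd/docker/
sudo systemctl restart docker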
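For the preload trick, a sketch of the empty generate request from the Ollama FAQ (the model name is only an example; keep_alive set to -1 keeps the model resident instead of unloading it after the default 5 minutes):

curl http://localhost:11434/api/generate -d '{"model": "llama3"}'
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'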

The “docker run” command I used to start an “Ollama” container:

sudo docker run -dit --runtime nvidia --gpus=all --rm --network=host -v /ssd/llm/ollama:/root/.ollama -e JETSON_JETPACK=5 -e OLLAMA_HOST=0.0.0.0:11434 -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_DEBUG=1 --name ollama ollama/ollama
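
To quantify “slow”, here is one way to read tokens per second straight from the running container (the model name is just an example):

sudo docker exec -it ollama ollama run llama3 --verbose

With --verbose, Ollama prints timing statistics after each response (prompt eval rate and eval rate, in tokens/s), which can be compared directly against the Jetson AI Lab numbers.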

Hi,

The benchmark scores are generated with MLC.
You can find the benchmark script below:

Thanks.
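
For context, the Jetson AI Lab numbers come from MLC running inside the jetson-containers environment. A rough sketch of bringing that environment up (assuming the dusty-nv/jetson-containers tooling; the benchmark itself is then run inside this container with the referenced script):

git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh
jetson-containers run $(autotag mlc)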