Problem: slow LLM inference speed on Jetson AGX Orin 64GB

I deployed an LLM on an NVIDIA Jetson AGX Orin 64GB and ran an inference service with the official Ollama Docker image, but inference was slow: only about 50% of the speed reported in NVIDIA's benchmarks (Benchmarks - NVIDIA Jetson AI Lab).

I have tried to find the cause and improve the speed, but nothing I tried has helped so far.

Some environment info of my Orin system:

  • LSB_RELEASE: Ubuntu 20.04
  • CUDA_VERSION: 12.2
  • L4T_VERSION: 35.4.1
  • JETPACK_VERSION: 5.1

Some of the things I’ve tried:

The docker run command I used to start the Ollama container:

sudo docker run -dit --runtime nvidia --gpus=all --rm --network=host -v /ssd/llm/ollama:/root/.ollama -e JETSON_JETPACK=5 -e OLLAMA_HOST=0.0.0.0:11434 -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_DEBUG=1 --name ollama ollama/ollama
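For anyone who wants to reproduce the comparison, here is a minimal sketch of how the decode speed could be measured against the container started above. It assumes the server is reachable at localhost:11434 (the port set in OLLAMA_HOST) and relies on the eval_count and eval_duration fields that Ollama returns from /api/generate, where eval_duration is in nanoseconds. The model name in the usage comment is only a placeholder, not the model from my tests.

```python
import json
import urllib.request

# Assumption: the Ollama container is listening on the port from OLLAMA_HOST.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Decode speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens and
    eval_duration is the generation time in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model: str, prompt: str) -> float:
    """Send one non-streaming request and return generated tokens per second."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))


# Usage (placeholder model name, replace with a model you have pulled):
#   print(f"{measure('llama3', 'Why is the sky blue?'):.1f} tokens/s")
```

This only reports generation speed, so prompt processing and model load time are excluded. Running it a few times and discarding the first run avoids counting the initial model load.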

Hi,

The benchmark scores were generated with MLC.
You can find the benchmark script below:

Thanks.