Ollama LLM inference problems on Jetson Orin Nano: CUDA memory allocation failure and CPU memory error

Hardware:
Jetson Orin Nano 8GB

Software:
JetPack (Ubuntu 22.04)
Ollama running locally
Model: qwen2.5:1.5b-instruct (~986MB)

Problem:

When running the model with GPU enabled, the inference fails with:

error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model

When forcing CPU inference, the model sometimes fails with a memory-related error such as:
“index out of range / memory index error”
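For reference, one common way to force CPU-only inference in Ollama is to set num_gpu to 0 so no layers are offloaded to the GPU (a sketch; the model tag matches the one above, and the custom model name is arbitrary):

```
# Modelfile: run the model entirely on CPU
FROM qwen2.5:1.5b-instruct
PARAMETER num_gpu 0
```

Then build and run it with `ollama create qwen-cpu -f Modelfile` and `ollama run qwen-cpu`.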

System state from tegrastats:
RAM ~3.8GB free
lfb ~23x4MB (largest free block = 4MB)
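The lfb figure is the key number here: the model needs a contiguous CUDA buffer on the order of 1 GB, but the largest free block is only 4 MB. A small sketch that extracts these fields from a tegrastats line (the line format is assumed from typical r36.x output):

```python
import re

def parse_lfb(line: str):
    """Parse the RAM/lfb fields from a tegrastats line, e.g.
    'RAM 3820/7620MB (lfb 23x4MB) ...'."""
    m = re.search(r"RAM (\d+)/(\d+)MB \(lfb (\d+)x(\d+)MB\)", line)
    if not m:
        return None
    used, total, blocks, block_mb = map(int, m.groups())
    return {
        "free_mb": total - used,
        # A ~986 MB model cannot be placed in 4 MB contiguous blocks
        "largest_free_block_mb": block_mb,
        "contiguous_free_mb": blocks * block_mb,
    }

sample = "RAM 3820/7620MB (lfb 23x4MB) SWAP 0/3810MB"
print(parse_lfb(sample))
# → {'free_mb': 3800, 'largest_free_block_mb': 4, 'contiguous_free_mb': 92}
```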

Observations:
The system has sufficient total RAM, but only very small contiguous blocks are free. Because the Orin Nano's CPU and GPU share the same physical memory, this fragmentation directly limits the size of CUDA buffer that can be allocated, even though total free memory looks adequate.
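Regarding contiguous memory: a few system-level steps sometimes help free larger blocks before loading a model (a sketch of common Jetson practice; the display manager name is an assumption about your setup):

```shell
# Stop the desktop session; it holds a sizeable chunk of the shared CPU/GPU memory
sudo systemctl stop gdm        # assumption: GDM is the display manager in use
# Flush filesystem caches so freed pages can coalesce into larger blocks
sudo sync
sudo sysctl vm.drop_caches=3
# Re-check the lfb value in tegrastats before starting Ollama
```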

Questions:

  1. Is there a recommended configuration for running LLMs on Jetson Orin Nano?
  2. Is it possible to increase contiguous memory for CUDA allocations?
  3. Are there Jetson-specific optimizations for llama.cpp / Ollama models?
  4. Would using TensorRT-LLM or another inference backend help?

CPU inference works sometimes but is slow.
GPU inference consistently fails with CUDA memory allocation errors.

Hi,

Could you check which software version you are using?

$ cat /etc/nv_tegra_release 

There is a known memory issue in r36.4.7; the fix is included in JetPack 6.2.2 (r36.5), so please upgrade your device to get it.

Thanks.