Ollama LLM inference problems on Jetson Orin Nano: CUDA memory allocation failure and CPU memory error

Hardware:
Jetson Orin Nano 8GB

Software:
JetPack (Ubuntu 22.04)
Ollama running locally
Model: qwen2.5:1.5b-instruct (~986MB)

Problem:

When running the model with GPU enabled, the inference fails with:

error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model

When forcing CPU-only inference, the model sometimes fails with a memory-related error such as:
“index out of range / memory index error”
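For reference, CPU-only mode was forced roughly like this; this assumes Ollama's standard fallback to CPU when no CUDA device is visible, and the documented num_gpu request option (model name as installed locally):

```shell
# Option A: hide the GPU from the Ollama server so it falls back to CPU.
CUDA_VISIBLE_DEVICES="" ollama serve

# Option B: request zero GPU-offloaded layers for a single generation
# via the REST API's "num_gpu" option.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:1.5b-instruct",
  "prompt": "Hello",
  "options": { "num_gpu": 0 }
}'
```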

System state from tegrastats:
RAM: ~3.8 GB free
lfb: ~23x4MB (largest contiguous free block = 4 MB)
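A back-of-the-envelope check of those numbers (model size and lfb are taken from above; whether the CUDA allocator actually needs one physically contiguous block of the full model size is exactly what I'm unsure about):

```python
# Rough sanity check: total free RAM vs. largest contiguous free block.
model_buffer_mb = 986        # approx. qwen2.5:1.5b-instruct weights
free_ram_mb = 3800           # "RAM ~3.8GB free" from tegrastats
largest_free_block_mb = 4    # "lfb 23x4MB" -> largest block is 4 MB

fits_in_total = free_ram_mb >= model_buffer_mb
fits_in_contiguous = largest_free_block_mb >= model_buffer_mb
print(fits_in_total, fits_in_contiguous)  # True False
```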

Observations:
The system has enough total free RAM for the model, but only very small contiguous blocks.
This points to physical memory fragmentation of Jetson's unified (shared CPU/GPU) memory.
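Workarounds I'm considering before loading the model; these are commonly suggested for Jetson boards, though I'm not sure they fix the fragmentation itself (the swap-file path and size here are my own choice):

```shell
# Stop the desktop GUI to free RAM (JetPack boots a full Ubuntu desktop).
sudo systemctl isolate multi-user.target

# Disable the default zram swap and use a disk-backed swap file instead.
sudo systemctl stop nvzramconfig
sudo fallocate -l 4G /swapfile     # path and size are arbitrary choices
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Drop clean page/dentry caches so the kernel can coalesce free pages.
sudo sysctl vm.drop_caches=3
```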

Questions:

  1. Is there a recommended configuration for running LLMs on Jetson Orin Nano?
  2. Is it possible to increase contiguous memory for CUDA allocations?
  3. Are there Jetson-specific optimizations for llama.cpp / Ollama models?
  4. Would using TensorRT-LLM or another inference backend help?
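As a concrete example of the kind of tuning I mean in question 3, I tried building a derived model with a smaller context window and partial GPU offload. The parameter values are guesses (on llama.cpp backends, num_gpu is the number of layers offloaded to the GPU), and "qwen-small" is just an arbitrary name:

```shell
# Hypothetical Modelfile: shrink the KV cache (num_ctx) and offload
# only some layers to the GPU (num_gpu).
cat > Modelfile <<'EOF'
FROM qwen2.5:1.5b-instruct
PARAMETER num_ctx 1024
PARAMETER num_gpu 12
EOF

ollama create qwen-small -f Modelfile
ollama run qwen-small
```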

CPU inference works intermittently but is slow.
GPU inference fails consistently with the CUDA buffer allocation error above.