Hardware:
Jetson Orin Nano 8GB
Software:
JetPack (Ubuntu 22.04)
Ollama running locally
Model: qwen2.5:1.5b-instruct (~986MB)
Problem:
When running the model with GPU enabled, the inference fails with:
error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
When forcing CPU-only inference, the model sometimes fails with a memory-related error such as:
"index out of range / memory index error"
System state from tegrastats:
RAM ~3.8GB free
lfb 23x4MB (23 contiguous free blocks of 4MB each, i.e. the largest free block is only 4MB)
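For reference, this is how I read the lfb figure out of the tegrastats output. The sample line below is hypothetical (values chosen to match the state above), but the `lfb NxSMB` field format is what tegrastats prints: N contiguous free blocks of S MB each.

```shell
# Hypothetical tegrastats RAM field; "lfb 23x4MB" = 23 free blocks of 4MB each.
line='RAM 3953/7620MB (lfb 23x4MB) SWAP 0/3810MB'

# Pull out the block size (the number after the "x") with GNU grep.
lfb=$(echo "$line" | grep -oP 'lfb \d+x\K\d+')
echo "largest free block: ${lfb}MB"
```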
Observations:
The system has sufficient total RAM, but the largest contiguous free block is only 4MB.
This suggests physical memory fragmentation, which hits Jetson particularly hard because the CPU and GPU share the same physical RAM, so large CUDA buffer allocations compete with regular system memory.
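To cross-check the fragmentation independently of tegrastats, the kernel's buddy allocator state can be inspected directly. Column N counts free blocks of order N (2^N pages, so 4KB up to 4MB with 4KB pages); zeros in the rightmost columns mean no large contiguous blocks are left:

```shell
# Each column N = number of free blocks of 2^N pages per zone.
# Zeros on the right-hand side indicate heavy fragmentation.
cat /proc/buddyinfo
```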
Questions:
- Is there a recommended configuration for running LLMs on Jetson Orin Nano?
- Is it possible to increase contiguous memory for CUDA allocations?
- Are there Jetson-specific optimizations for llama.cpp / Ollama models?
- Would using TensorRT-LLM or another inference backend help?
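One workaround I am considering (untested, so corrections welcome): offload only some of the layers to the GPU so that each individual CUDA buffer is smaller and more likely to fit into the fragmented heap. In Ollama this should be settable per model via a Modelfile; the layer count of 12 below is an arbitrary guess, not a tuned value:

```
FROM qwen2.5:1.5b-instruct
PARAMETER num_gpu 12
```

Then `ollama create qwen-partial -f Modelfile` and `ollama run qwen-partial`. With plain llama.cpp the equivalent knob would be `--n-gpu-layers`.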
CPU inference works sometimes but is slow.
GPU inference consistently fails with CUDA memory allocation errors.