According to the Jetson Orin Nano Super presentation, it should be able to run llama3.2:3b, llama3.1:8b, gemma2:9b, and similar models. At the moment we seem to be locked out of most of that performance, because those models simply won't run on the Jetson Orin Nano Super. The only models I got to work reliably are gemma3:4b and llama3.2:1b; every other model fails outright, either because ollama is unable to allocate the CUDA0 buffer or with an actual out-of-memory error.
After much trial and error involving the following steps and attempts:
- cloning the OS to the SSD via: GitHub - jetsonhacks/migrate-jetson-to-ssd: Use a SD card to setup a Jetson to boot and run from SSD
- making the system headless and applying the memory optimisations from: 🔖 Memory optimization - NVIDIA Jetson AI Lab (roughly the commands sketched after this list)
- installing jetson-containers
- running jetson-containers run --name ollama $(autotag ollama) failed for me due to a permissions issue; however, it allowed me to copy the docker command, which I turned into a compose.yaml (see below). jetson-containers picked the following image, which I had no luck with: dustynv/ollama:r36.4-cu129-24.04
- in another thread ( Updating Orin Nano breaks Ollama - #9 by pdobrien3 ) user AastaLLL provided a smaller docker command using the image dustynv/ollama:main-r36.4.0, which worked for me with the settings below
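For reference, the headless/memory-optimisation step boiled down to roughly the commands below. They are adapted from the linked guide; the swap path and size are only examples for an SSD mounted at /ssd, and the zram service name (nvzramconfig) may differ on your JetPack image, so double-check before disabling it.

# boot to a text console so the desktop doesn't occupy RAM
sudo systemctl set-default multi-user.target
# disable zram (compressed swap held in RAM) in favour of swap on the SSD
sudo systemctl disable nvzramconfig
# create and enable a swap file on the SSD (path and size are examples)
sudo fallocate -l 16G /ssd/16GB.swap
sudo chmod 600 /ssd/16GB.swap
sudo mkswap /ssd/16GB.swap
sudo swapon /ssd/16GB.swap
# make the swap file persistent across reboots
echo "/ssd/16GB.swap none swap sw 0 0" | sudo tee -a /etc/fstab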
This doesn't solve or work around the underlying issue, and it doesn't enable the use of all the promised models, but at least my Jetson Orin Nano Super is no longer just a paperweight!
As always: your mileage may vary, and make any changes only after backing up and at your own risk.
ollama-compose.yaml
services:
  ollama:
    runtime: nvidia
    environment:
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics
      - PULSE_SERVER=unix:/run/user/1000/pulse/native
      # OOM mitigations recommended in the GitHub comment quoted below
      - OLLAMA_GPU_OVERHEAD=536870912
      - OLLAMA_FLASH_ATTENTION=1
      - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_CONTEXT_LENGTH=2048
      - OLLAMA_NEW_ENGINE=1
    stdin_open: true
    tty: true
    network_mode: host
    shm_size: 8g
    volumes:
      - /tmp/argus_socket:/tmp/argus_socket
      - /etc/enctune.conf:/etc/enctune.conf
      - /etc/nv_tegra_release:/etc/nv_tegra_release
      - /tmp/nv_jetson_model:/tmp/nv_jetson_model
      - /var/run/dbus:/var/run/dbus
      - /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket
      - /var/run/docker.sock:/var/run/docker.sock
      # data/model storage; adjust the host path to your jetson-containers data directory
      - /home/helium/jetson-containers/data:/data
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /run/user/1000/pulse:/run/user/1000/pulse
    devices:
      - /dev/snd
      - /dev/bus/usb
      - /dev/i2c-0
      - /dev/i2c-1
      - /dev/i2c-2
      - /dev/i2c-4
      - /dev/i2c-5
      - /dev/i2c-7
    container_name: ollama
    image: dustynv/ollama:main-r36.4.0
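To bring the container up, save the file (I use the name ollama-compose.yaml, but the name is arbitrary) and start it with the docker compose plugin; because of network_mode: host, the ollama API then listens on localhost:11434. A minimal sketch, assuming the file sits in the current directory:

# start the container in the background
docker compose -f ollama-compose.yaml up -d
# pull and chat with one of the models that actually fits
docker exec -it ollama ollama run llama3.2:1b
# quick sanity check against the API
curl http://localhost:11434/api/tags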
Edit: In the above compose.yaml I included some environment variables that were recommended in one of the ollama GitHub issues that I can’t link, because new users can only post 4 links at most. You can find it yourself under GitHub → ollama/issues/8597#issuecomment-2614533288. Or to quote the user from GitHub:
[rick-github] (removed the links from this quote):
Earlier log lines would show the memory calculations, but there are some standard OOM mitigations:
- Set OLLAMA_GPU_OVERHEAD to give the runner a buffer to grow into (eg, OLLAMA_GPU_OVERHEAD=536870912 to reserve 512M).
- Enable flash attention by setting OLLAMA_FLASH_ATTENTION=1 in the server environment. Flash attention is a more efficient use of memory and may reduce memory pressure (note FA is not supported on all model architectures or GPUs, check the logs for flash to verify it's active).
- If flash attention is enabled, further gains can be achieved with KV quantization.
- Reduce the number of layers that ollama thinks it can offload to the GPU by setting num_gpu, see here.
- In Linux with Nvidia devices, set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1. This will allow the GPU to offload to CPU memory if VRAM is exhausted. This is only useful for small amounts of memory as there is a performance penalty. However, in the case where the goal is to reduce OOMs, the amount offloaded will be small and the impact minimal.
- Set OLLAMA_NUM_PARALLEL to 1. This reduces the size of the KV cache; the default is 2 if ollama thinks there's available VRAM.
- Reduce the size of the KV cache by lowering the value of num_ctx, either in a Modelfile or an API call, or by setting OLLAMA_CONTEXT_LENGTH.
- The ollama engine has a better allocation strategy, try using it by setting OLLAMA_NEW_ENGINE=1 in the server environment. Note this only works for model architectures supported by the ollama engine, see here for the currently supported families.
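For the num_ctx / num_gpu suggestions in the quote, here is a minimal sketch of passing them per request through the ollama API; the num_gpu value of 20 is only a placeholder to tune downward until the model fits and is not taken from the quoted comment:

# override context size and the number of offloaded layers for a single request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Hello",
  "options": { "num_ctx": 2048, "num_gpu": 20 }
}'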

