Updating Orin Nano breaks Ollama

I have the same situation. Before updating, Gemma3 worked fine (CLI); after the update I get: cudaMalloc failed: out of memory. I use the docker image dustynv/ollama:0.6.8-r36.4-cu126-22.04. I also tried smaller models and smaller quantizations. No success. I don't see any solution yet.
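
For reference, this is roughly how I start the container (from memory, so the exact flags may differ on your setup); --runtime nvidia is what exposes the GPU inside it:

sudo docker run --runtime nvidia -it --rm --network host \
  dustynv/ollama:0.6.8-r36.4-cu126-22.04

# Inside the container, a smaller quantization is just a different tag, e.g.:
ollama run gemma3:4b-it-q4_K_M   # example tag; use whichever smaller quant is available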

Thanks for the reply @AastaLLL

I’ve decided I’m going in a different direction completely. I’m a video games programmer, so I’m going to treat the Orin Nano as I would any games console. I won’t have to rely on somebody else’s OS, no updates to wait for, no ‘wheels’ that don’t exist yet. I’m going to start in Rust (not that I’ve written anything in Rust before, but that’s part of the adventure). I’m going to write my own drivers that will talk directly to the hardware (I’m definitely going to need information from NVIDIA for this).

I’m going to develop my own AI model completely from scratch, not ‘fine-tune’ somebody else’s. It’s going to be a lot of hard work, but hey, isn’t this what having an AI development board is all about?

Are there some serious technical documents I can download? Hopefully there’s no NDA and the tech spec is openly available - or is it?

Regards,
Muttley

Hi, @all

The “cudaMalloc failed: out of memory” or “error loading model: unable to allocate CUDA0 buffer” error is a known issue that happens after upgrading to r36.4.7.
We are actively working on the issue, and you can find more information in the topic below:

Hi, @muttleydosomething

Is the document below what you are looking for?

Thanks.

I ran into the same issue today before finding this thread. I have an Orin Nano where I’ve been running Gemma3:4b for about 8 months without issue. I recently upgraded Ollama from 0.12.6 to 0.13.0, also without issue, and Gemma3:4b still ran fine. Where I found the problem is when (after upgrading to Ollama 0.13.0) I wanted to run qwen-vl:2b. Upon starting to load that model from the CLI, it threw the error in this thread.

I hadn’t updated the OS in a few months, so I ran apt update and brought the OS up to date, thinking that Ollama 0.13 needed something that was missing in the OS. Even after updating the OS, it still threw the error.

I looked in jtop and I’m running L4T 36.4.7 (the affected version) with CUDA 12.6. Still thinking that Ollama 0.13 was missing something, I upgraded CUDA to 12.8, but got the same error message.
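
(In case it helps anyone double-checking their own board, the L4T and CUDA versions can also be read straight from the command line on a stock JetPack install; these are standard commands, nothing specific to this issue:)

cat /etc/nv_tegra_release          # prints the L4T release, e.g. R36 (release), REVISION: 4.7
dpkg-query --show nvidia-l4t-core  # package version of the L4T core, e.g. 36.4.x
nvcc --version                     # CUDA toolkit version, if /usr/local/cuda/bin is on PATH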

At this point I downgraded Ollama to 0.12.8, thinking that something was botched with 0.13, but the error still continued when loading qwen-vl:2b and gemma3:4b. So I downgraded Ollama to 0.12.6 (I think that was the version I was on before upgrading last Thursday). On 0.12.6, gemma3 and granite3 worked and I could flip between the models without issue. I tried to run qwen-vl and Ollama said I needed at least 0.12.7 to run it. So I upgraded Ollama to 0.12.7 (stick with me, I’m almost at the point). Now, with Ollama 0.12.7 loading qwen-vl, the system threw this error on the CLI:

Error: 500 Internal Server Error: llama runner process has terminated: CUDA error: out of memory
  current device: 0, in function ggml_backend_cuda_buffer_set_tensor at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:743
  cudaMemcpyAsyncReserve((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, ((cudaStream_t)0x2))
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:88: CUDA error

This was the first time a thrown error actually gave me something to work with.

I’m running ollama on the OS and not via docker or jetson containers.
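
For anyone retracing the version hopping above on a bare-metal install: the official install script accepts a version pin, which is how I move between releases (the version number below is just the one I ended up on):

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.12.7 sh
ollama --version   # confirm which build is actually running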

Initially I had the env variables set to this:
Environment="OLLAMA_NUM_THREADS=6"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_CUDA=1"
Environment="OLLAMA_MAX_LOADED=2"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_FLASH_ATTENTION=1"

I changed the following to see if I could reduce the memory requirement so that qwen-vl would run:
Environment="OLLAMA_NUM_THREADS=4"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_CUDA=1"
Environment="OLLAMA_MAX_LOADED=1"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_FLASH_ATTENTION=1"

With those new settings in place I’m able to switch between gemma3:4b, granite3.3:2b, and qwen-vl:2b without running into the memory issue. While this isn’t ‘fixed’, it does work, or I’m just darn lucky.

So just to recap: L4T at 36.4.7 (i.e. the “known issue that happens when upgrading to the r36.4.7” version), Ollama 0.12.7 (official release) running on bare metal (not dockerized), and CUDA 12.8, with the above environment variables, does work. Not an ideal situation, but it allows me to keep moving forward with my project.

@hooper217 Thanks, but it doesn’t work for me.

alrough@alrough-ubuntu:~$ ollama run llama3.2:3b
Error: 500 Internal Server Error: llama runner process has terminated: cudaMalloc failed: out of memory

I edited the ollama service file and set the values as you did.

sudo systemctl edit ollama.service

alrough@alrough-ubuntu:~$ systemctl show ollama.service | grep OLLAMA
Environment=PATH=/home/alrough/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin OLLAMA_NUM_THREADS=4 OLLAMA_HOST=0.0.0.0 OLLAMA_CUDA=1 OLLAMA_MAX_LOADED=1 OLLAMA_NUM_PARALLEL=2 OLLAMA_KEEP_ALIVE=-1 OLLAMA_FLASH_ATTENTION=1

I’m getting the impression that this is ending up being a memory allocation condition, or (doubtful) a memory leak. I worked with the Nano again today with interesting results.

For full disclosure, I purchased this Nano to be the AI component for my Home Assistant installation. I found it really too slow for interactive responses, so I switched over to a different solution. Since it was just sitting there, I decided to offload Whisper and Piper to the Nano to get better and quicker STT and TTS, and that worked great. Both of those running were consuming about 2GB of memory. I then decided to spin Ollama back up with gemma3:4b to do non-interactive AI tasks like image and intent analysis.

I recently upgraded to Ollama 0.13 to be able to run qwen-vl:2b for image analysis, to see if it was faster than the dense Gemma3 model. That’s where I hit this issue.

This morning I decided to upgrade again, from what I had working on Ollama 0.12.7 to 0.13.0, and the issue reappeared. I can load gemma3, but if I unload it and try to reload it immediately I get the error. Same result if I switch models. Looking at jtop or htop, I see that consumed memory with everything running is about 6.9GB, which isn’t the best place to be.

So I started out trying to free memory. I stopped Whisper and it gave back about 1.5GB of RAM, but I was still having issues loading qwen3-vl, and it took a really long time for qwen3-vl to load (more on this in a bit).

(Oh, just to be clear, I’m running in CLI/multi-user mode, not via the GUI, since I always run the server headless; that also conserves RAM.) Gemma3 has a 131K context window, so I told Ollama to set the default context size to 8192. I’m not sure whether that helps if the model has a different context length assigned; looking at the model inside Ollama it still says the context length is 131072, so that parameter might not have worked (see the check after the settings below). These are my current settings:

Environment="OLLAMA_NUM_THREADS=2"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_CUDA=1"
Environment="OLLAMA_MAX_LOADED=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_KEEP_ALIVE=6h"
Environment="OLLAMA_MAX_QUEUE=256"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
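
To see whether the 8192 override actually took, ollama show isn’t the right place to look, since it reports the model’s maximum context from the model file (131072 for gemma3) rather than the runtime value. A rough way I check instead (assuming the standard systemd service, so logs go to journald):

ollama run gemma3:4b "hello" >/dev/null
ollama ps                                                # the loaded SIZE shrinks when the smaller context is in effect
journalctl -u ollama --since "5 min ago" | grep -i ctx   # the runner logs the context it was started with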

The last thing I did today is run apt to update the OS and then rebooted the nano.

After the reboot I can once again run gemma3, switch to granite, and then back to gemma3 without getting this out of memory error. I can load and run qwen-vl:2b and switch back to gemma3 without the error. I don’t know if it was the memory conservation steps I took, or something in the apt update that ‘masked’ the issue.

So it’s fixed, right? Well, almost. I noticed that qwen3-vl was running doggedly slow. I looked at jtop to see the memory allocations and saw the GPU wasn’t being used at all for processing the query; it was all running on the 6-core ARM CPU. This makes me think there isn’t enough GPU memory available to run qwen3-vl, so it runs entirely on the CPU (Ollama will do that if there isn’t enough GPU VRAM available). It did this even though the GPU and CPU have shared memory.
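
A quick way to confirm where a loaded model actually landed, rather than inferring it from jtop: ollama ps prints a PROCESSOR column with the CPU/GPU split (the output below is illustrative, not copied from my box):

ollama ps
# NAME           ID            SIZE     PROCESSOR    UNTIL
# qwen3-vl:2b    xxxxxxxxxxxx  ~4 GB    100% CPU     6 hours from now
#
# "100% CPU" is the giveaway that Ollama fell back to CPU-only inference.

sudo tegrastats   # GR3D_FREQ sitting near 0% during a query tells the same story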

So as of today, 27-Nov-25, I’m successfully running Ollama 0.13.0 with gemma3:4b and granite3.3 on the GPU, with only Piper as an add-on application. I will set aside qwen-vl:2b for now (I was only interested in it for image recognition/description anyway). I’m coming to the realization that 8GB of shared memory is not enough if you want to run anything but the really tiny models.

