"unable to allocate CUDA0 buffer" after Updating Ubuntu Packages

According to the Jetson Orin Nano Super presentation, it should be able to run llama3.2:3b, llama3.1:8b, gemma2:9b, etc. Currently, we seem to be locked out of most of that performance, because these models won't run on the Jetson Orin Nano Super at all. At the moment, the only models I can get to work reliably are gemma3:4b and llama3.2:1b; every other model fails outright, either with ollama being unable to allocate a CUDA0 buffer or with actual out-of-memory errors.

After much trial and error involving various steps and attempts, I ended up with the docker-compose configuration below.

It doesn't solve or work around the underlying issue, and it doesn't enable all of the promised models, but at least my Jetson Orin Nano Super isn't just a paperweight anymore!

As always: your mileage may vary, make changes only after backing up, and do so at your own risk.


ollama-compose.yaml

services:
    ollama:
        runtime: nvidia
        environment:
            - NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics
            - PULSE_SERVER=unix:/run/user/1000/pulse/native
            - OLLAMA_GPU_OVERHEAD=536870912
            - OLLAMA_FLASH_ATTENTION=1
            - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
            - OLLAMA_NUM_PARALLEL=1
            - OLLAMA_CONTEXT_LENGTH=2048
            - OLLAMA_NEW_ENGINE=1
        stdin_open: true
        tty: true
        network_mode: host
        shm_size: 8g
        volumes:
            - /tmp/argus_socket:/tmp/argus_socket
            - /etc/enctune.conf:/etc/enctune.conf
            - /etc/nv_tegra_release:/etc/nv_tegra_release
            - /tmp/nv_jetson_model:/tmp/nv_jetson_model
            - /var/run/dbus:/var/run/dbus
            - /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket
            - /var/run/docker.sock:/var/run/docker.sock
            - /home/helium/jetson-containers/data:/data
            - /etc/localtime:/etc/localtime:ro
            - /etc/timezone:/etc/timezone:ro
            - /run/user/1000/pulse:/run/user/1000/pulse
        devices:
            - /dev/snd
            - /dev/bus/usb
            - /dev/i2c-0
            - /dev/i2c-1
            - /dev/i2c-2
            - /dev/i2c-4
            - /dev/i2c-5
            - /dev/i2c-7
        container_name: ollama
        image: dustynv/ollama:main-r36.4.0
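
To bring it up (assuming the file is saved as ollama-compose.yaml and the nvidia runtime is configured, as it is on a stock JetPack install), something like this should do:

$ docker compose -f ollama-compose.yaml up -d        # or docker-compose, depending on your install
$ docker exec -it ollama ollama run llama3.2:1b      # quick smoke test with a model that fits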

Edit: In the above compose.yaml I included some environment variables that were recommended in one of the ollama GitHub issues that I can’t link, because new users can only post 4 links at most. You can find it yourself under GitHub → ollama/issues/8597#issuecomment-2614533288. Or to quote the user from GitHub:

[rick-github] (removed the links from this quote):

Earlier log lines would show the memory calculations, but there are some standard OOM mitigations:

  1. Set OLLAMA_GPU_OVERHEAD to give the runner a buffer to grow into (e.g., OLLAMA_GPU_OVERHEAD=536870912 to reserve 512M)
  2. Enable flash attention by setting OLLAMA_FLASH_ATTENTION=1 in the server environment. Flash attention is a more efficient use of memory and may reduce memory pressure (note FA is not supported on all model architectures or GPUs, check the logs for flash to verify it’s active).
  3. If flash attention is enabled, further gains can be achieved with KV quantization.
  4. Reduce the number of layers that ollama thinks it can offload to the GPU by setting num_gpu, see here.
  5. In Linux with Nvidia devices, set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1. This will allow the GPU to offload to CPU memory if VRAM is exhausted. This is only useful for small amounts of memory as there is a performance penalty. However, in the case where the goal is to reduce OOMs, the amount offloaded will be small and the impact minimal.
  6. Set OLLAMA_NUM_PARALLEL to 1. This reduces the size of the KV cache, the default is 2 if ollama thinks there’s available VRAM.
  7. Reduce the size of the KV cache by lowering the value of num_ctx, either in a Modelfile or an API call, or by setting OLLAMA_CONTEXT_LENGTH.
  8. The ollama engine has a better allocation strategy, try using it by setting OLLAMA_NEW_ENGINE=1 in the server environment. Note this only works for model architectures supported by the ollama engine, see here for the currently supported families.
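
For items 4 and 7 above, the num_gpu and num_ctx settings can also be baked into a Modelfile instead of the server environment. A minimal sketch (the base model tag and the layer count are only examples, adjust them for your model and for what actually fits in VRAM):

FROM llama3.2:3b
# item 7: smaller context window -> smaller KV cache
PARAMETER num_ctx 2048
# item 4: offload fewer layers to the GPU (20 is just an illustrative value)
PARAMETER num_gpu 20

$ ollama create llama3.2-lowmem -f Modelfile
$ ollama run llama3.2-lowmem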

Hi, @antreask

Thanks for the clarification. We will check with our internal team for more info.

Hi, @JSC2718

Unfortunately, we cannot receive a device from the forum users.
But we will try other settings to see if we can reproduce this locally.

Hi, @all
We need some help to move this issue forward.

  1. Do you use the Orin Nano devkit?
  2. Any issues when running a CUDA sample (please test with vectorAdd as it allocates a buffer)
$ git clone -b v12.5 https://github.com/NVIDIA/cuda-samples.git
$ cd cuda-samples/Samples/0_Introduction/vectorAdd
$ make
$ ./vectorAdd 
  3. Does this have a 100% failure rate, or does it sometimes work and sometimes fail?
  4. When the error occurs (unable to allocate CUDA0 buffer), please check whether any error is shown in dmesg:
$ sudo dmesg

We will test the steps shared by @peamouth_monobasicity and will provide more updates later.
Thanks.

  1. No

  2. Running correctly

    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
    
  3. It depends.

    After a reboot, if I try to load llama3.2:3b or gemma3:4b, I get the following error:

    llama runner process has terminated: cudaMalloc failed: out of memory

    If I first load a smaller model like gemma3:1b (which loads successfully) and then try to load llama3.2, I get:

    error loading model: unable to allocate CUDA0 buffer

    No matter how many times I retried llama3.2, I kept seeing the same error.
    But when I loaded gemma3:4b again (which succeeded), llama3.2 then loaded successfully afterward.

    It sometimes works after certain models are loaded first.
    Note: Ollama is running directly on the Jetson host system (no Docker container).

  4. No new GPU-related messages appear in dmesg during or after the failure. The output before and after the error is essentially identical.
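
For reference, the load-order behaviour from point 3 boils down to roughly this sequence (model tags as above, prompts omitted):

$ ollama run llama3.2:3b    # right after reboot: cudaMalloc failed: out of memory
$ ollama run gemma3:1b      # small model loads fine
$ ollama run llama3.2:3b    # error loading model: unable to allocate CUDA0 buffer (same on every retry)
$ ollama run gemma3:4b      # loads successfully
$ ollama run llama3.2:3b    # now loads successfully as well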

Hello,

Let me add here more information that might help.

I bought a Jetson Orin Nano two weeks ago and I have the same issue, even though I did not update to the latest version of Ubuntu.

Here are the details of my current Jetson:

  • Jetson Firmware 36.4.7
  • SSD installed and configured to be used
  • OS Version Ubuntu 22.04.5 LTS

I tried to follow the Ollama tutorial:

When I run the command ollama run llama3.2:3b, I get the following error after the model download finishes:
Error: 500 Internal Server Error: llama runner process has terminated: cudaMalloc failed: out of memory

Just like @antreask, if I try gemma3:1b it works.

Let me know if you need more information.

Regards, Tiago

@AastaLLL

  1. Do you use the Orin Nano devkit?

Not using anything related to SDK at the moment.

  2. Any issues when running a CUDA sample (please test with vectorAdd as it allocates a buffer)

No issues.

  3. Does this have a 100% failure rate, or does it sometimes work and sometimes fail?

100% failure rate

  4. When the error occurs (unable to allocate CUDA0 buffer), please check whether any error is shown in dmesg:
[   15.244633] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for aarch64  540.4.0  Release Build  (buildbrain@mobile-u64-6354-d6000)  Thu Sep 18 15:33:51 PDT 2025
[   15.252731] [drm] [nvidia-drm] [GPU ID 0x00020000] Loading driver
[   15.595509] [drm] Initialized nvidia-drm 0.0.0 20160202 for 13800000.display on minor 1
[   15.596184] checking generic (27de00000 1680000) vs hw (27de00000 1680000)
[   15.596188] fb0: switching to nvidia-drm from simple
[   15.596966] Console: switching to colour dummy device 80x25
[   15.636796] Console: switching to colour frame buffer device 430x90
[   15.636818] nv_platform 13800000.display: [drm] fb0: nvidia-drmdrmfb frame buffer device
[   15.738550] NVRM nvAssertFailedNoLog: Assertion failed: minRequiredIsoBandwidthKBPS <= clientBwValues[DISPLAY_ICC_BW_CLIENT_EXT].minRequiredIsoBandwidthKBPS @ kern_disp_0402.c:111
[   15.738562] CPU: 1 PID: 95 Comm: kworker/u12:2 Tainted: G           O      5.15.148-tegra #1
[   15.738567] Hardware name: NVIDIA NVIDIA Jetson Orin Nano Engineering Reference Developer Kit Super/Jetson, BIOS 36.4.7-gcid-42132812 09/18/2025
[   15.738569] Workqueue: dce-async-ipc-wq tegra_dce_client_ipc_send_recv [tegra_dce]
[   15.738584] Call trace:
[   15.738585]  dump_backtrace+0x0/0x1d0
[   15.738595]  show_stack+0x34/0x50
[   15.738599]  dump_stack_lvl+0x68/0x8c
[   15.738603]  dump_stack+0x18/0x3c
[   15.738605]  os_dump_stack+0x1c/0x28 [nvidia]
[   15.738714]  tlsEntryGet+0x110/0x120 [nvidia]
[   15.738814]  kdispArbAndAllocDisplayBandwidth_v04_02+0x274/0x290 [nvidia]
[   15.738915]  kdispInvokeDisplayModesetCallback_KERNEL+0xa8/0xf0 [nvidia]
[   15.739014]  hypervisorIsVgxHyper_IMPL+0x144/0x260 [nvidia]
[   15.739113]  tegra_dce_client_ipc_send_recv+0x100/0x200 [tegra_dce]
[   15.739120]  process_one_work+0x208/0x500
[   15.739125]  worker_thread+0x144/0x4a0
[   15.739129]  kthread+0x184/0x1a0
[   15.739132]  ret_from_fork+0x10/0x20
[   17.472195] ACK 04 d4
[   17.474467] ACK 04 d4
[   17.540971] IPv6: ADDRCONF(NETDEV_CHANGE): wlP1p1s0: link becomes ready
[   19.491634] rfkill: input handler disabled
[   26.447582] pwm-tegra-tachometer 39c0000.tachometer: Tachometer Overflow is detected
[   29.702715] audit: type=1326 audit(1761443099.790:5): auid=1000 uid=1000 gid=1000 ses=2 subj=kernel pid=3202 comm="chrome" exe="/snap/chromium/3264/usr/lib/chromium-browser/chrome" sig=0 arch=c00000b7 syscall=444 compat=0 ip=0xffff9fb45b68 code=0x50000
[   88.447605] pwm-tegra-tachometer 39c0000.tachometer: Tachometer Overflow is detected
[  420.617026] cpufreq: cpu0,cur:1371000,set:1728000,delta:357000,set ndiv:135
[  422.619902] cpufreq: cpu0,cur:1058000,set:729600,delta:328400,set ndiv:57
[  458.649589] cpufreq: cpu0,cur:1247000,set:1728000,delta:481000,set ndiv:135
[  468.652488] cpufreq: cpu0,cur:976000,set:729600,delta:246400,set ndiv:57
[  486.666749] cpufreq: cpu0,cur:1251000,set:1728000,delta:477000,set ndiv:135
[  533.721534] cpufreq: cpu0,cur:1243000,set:1728000,delta:485000,set ndiv:135
[  541.727364] cpufreq: cpu0,cur:1034000,set:883200,delta:150800,set ndiv:69
[  548.733490] cpufreq: cpu0,cur:1070000,set:729600,delta:340400,set ndiv:57
[  561.747642] cpufreq: cpu0,cur:1477000,set:1728000,delta:251000,set ndiv:135
[  583.771919] cpufreq: cpu0,cur:956000,set:729600,delta:226400,set ndiv:57
[  603.791719] cpufreq: cpu0,cur:1432000,set:1728000,delta:296000,set ndiv:135
[  603.796785] cpufreq: cpu0,cur:1475000,set:1728000,delta:253000,set ndiv:135
[  613.804893] cpufreq: cpu0,cur:1239000,set:1728000,delta:489000,set ndiv:135
[  637.826978] cpufreq: cpu0,cur:1243000,set:1728000,delta:485000,set ndiv:135
[  737.636292] nvgpu: 17000000.gpu ga10b_pmu_pg_handle_idle_snap_rpc:438  [ERR]  IDLE SNAP RPC received
[  737.636304] nvgpu: 17000000.gpu ga10b_pmu_pg_handle_idle_snap_rpc:439  [ERR]  IDLE SNAP ctrl_id:0
[  737.636306] nvgpu: 17000000.gpu ga10b_pmu_pg_handle_idle_snap_rpc:440  [ERR]  IDLE SNAP reason:0x1
[  737.636308] nvgpu: 17000000.gpu ga10b_pmu_pg_handle_idle_snap_rpc:444  [ERR]  IDLE_SNAP reason:ERR_IDLE_FLIP_POWERING_DOWN
[  737.636310] nvgpu: 17000000.gpu ga10b_pmu_pg_handle_idle_snap_rpc:455  [ERR]  IDLE SNAP idle_status: 0xfffba9fe
[  737.636311] nvgpu: 17000000.gpu ga10b_pmu_pg_handle_idle_snap_rpc:457  [ERR]  IDLE SNAP idle_status1: 0xffffffe3
[  737.636313] nvgpu: 17000000.gpu ga10b_pmu_pg_handle_idle_snap_rpc:459  [ERR]  IDLE SNAP idle_status2: 0xffffffff
[  744.924895] cpufreq: cpu0,cur:955000,set:729600,delta:225400,set ndiv:57
[  753.932246] cpufreq: cpu0,cur:1242000,set:1728000,delta:486000,set ndiv:135
[  908.069322] cpufreq: cpu0,cur:1152000,set:729600,delta:422400,set ndiv:57

Hi, all

Thanks a lot for your information. Here are some updates:

We got this issue reproduced locally.
In our experience, the error occurs when we set up the Orin Nano with the SD card image and upgrade.
If the system is flashed via SDK Manager, no error appears (from ollama) after upgrading to r36.4.7.

Please let us know if you see something different from our testing.
Thanks.

That can’t be right. I used the SDK Manager and installed directly onto the NVMe storage.

Even now, I’m having the same issues with the latest version.

The problem with the insufficient RAM is still there.

Hello @AastaLLL,

I have also encountered this issue since the update, with a small 1.8b model (moondream:v2).
Sometimes, but very rarely, the model loads. I tried it both with the dustynv Docker image and natively with version 0.6.8 recompiled against the r36.4.7 dependencies.

I added logging in Ollama to see all memory allocations and tried to reproduce the failure with a Go program.
Here are some excerpts from the Ollama logs:

ollama.log (20.0 KB)

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 732.30 MiB on device 0

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 768.00 MiB on device 0
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 768.00 MiB on device 0: cudaMalloc failed: out of memory

Here are the logs from my program:

Free GPU memory: 3674.66 MiB
Total GPU memory: 7619.86 MiB
Allocating buffer 1 of 732 MiB…
Buffer 1 allocated: 0x2040ae000
Allocating buffer 2 of 768 MiB…
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
Failed to allocate buffer 2 (768 MiB)
Releasing allocated GPU memory…
Cleanup complete.

My Go program, which reproduces the same allocation pattern as Ollama:
(cuda_alloc.go)

package main

/*
#cgo LDFLAGS: -lcudart
#include <cuda_runtime.h>
#include <stdio.h>

void *cudaAlloc(size_t size) {
    void *ptr;
    cudaError_t err = cudaMalloc(&ptr, size);
    if (err != cudaSuccess) {
        return NULL;
    }
    return ptr;
}

void cudaFreeMem(void *ptr) {
    cudaFree(ptr);
}

void getCudaMemInfo(size_t *freeMem, size_t *totalMem) {
    cudaMemGetInfo(freeMem, totalMem);
}
*/
import "C"
import (
    "fmt"
    "unsafe"
)

func main() {
    var freeMem, totalMem C.size_t
    C.getCudaMemInfo(&freeMem, &totalMem)
    fmt.Printf("Free GPU memory: %.2f MiB\n", float64(freeMem)/(1024*1024))
    fmt.Printf("Total GPU memory: %.2f MiB\n", float64(totalMem)/(1024*1024))

    sizes := []int{
        732 * 1024 * 1024, // 732 MiB
        768 * 1024 * 1024, // 768 MiB
    }
    buffers := make([]unsafe.Pointer, 0)

    for i, sz := range sizes {
        fmt.Printf("Allocating buffer %d of %d MiB...\n", i+1, sz/(1024*1024))
        ptr := C.cudaAlloc(C.size_t(sz))
        if ptr == nil {
            fmt.Printf("Failed to allocate buffer %d (%d MiB)\n", i+1, sz/(1024*1024))
            break
        } else {
            fmt.Printf("Buffer %d allocated: %v\n", i+1, ptr)
            buffers = append(buffers, ptr)
        }
    }

    fmt.Println("Releasing allocated GPU memory...")
    for _, buf := range buffers {
        C.cudaFreeMem(buf)
    }
    fmt.Println("Cleanup complete.")
}

Commands to build and execute:

CGO_LDFLAGS="-L/usr/local/cuda/lib64" CGO_CFLAGS="-I/usr/local/cuda/include" go build -o cuda_alloc cuda_alloc.go
./cuda_alloc

Hi, both

Thanks a lot for the information. Here are some updates from our side.

The underlying issue that triggers the “unable to allocate CUDA0 buffer” is the following error:

NvMapMemAllocInternalTagged: 1075072515 error 12

More precisely, the device fails to allocate a large chunk of memory (e.g., 3 GiB) even though it has enough free memory.

We tried to set up the SD card device with the SDK Manager, and the issue is gone.
Originally, we thought this might be a temporary workaround, but it seems not to be true.

We will discuss this internally and update more information here.
Thanks.

@AastaLLL

Unfortunately, we cannot receive a device from the forum users.

Can I upload my image for you to dd onto a device, e.g. from clonezilla or otherwise?

This is getting pretty silly.
I can still load big models like llama3.1:8b-instruct-q4_K_M on one of my Jetsons (not updated), but almost nothing on the new one, and on top of that the non-updated one is running a whole bunch of other Docker containers with no issue.

I flashed both via the SDK Manager, directly to NVMe, so that has nothing to do with it.

The issue is caused by the firmware update from version 36.4.4 to 36.4.7.

The only way I was able to resolve it was by reinstalling the device using an SD card flashed with JetPack 6.2.1.

You can still download the image from the following page:

Direct Link to the 36.4.4 image: https://developer.nvidia.com/downloads/embedded/L4T/r36_Release_v4.4/jp62-r1-orin-nano-sd-card-image.zip

After the reinstall, don't run any updates (apt update/upgrade, etc.).
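
If you want to guard against the BSP being pulled in again by accident, one option (just a sketch, assuming the packages follow the stock nvidia-l4t-* naming) is to put the L4T packages on hold:

$ dpkg -l | grep nvidia-l4t    # list the installed L4T/BSP packages
$ sudo apt-mark hold $(dpkg-query -W -f='${Package} ' 'nvidia-l4t-*')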

Hi, @JSC2718

Thanks a lot for the help.
We can reproduce this issue with the super mode SD Card image, so we do have the environment to debug now.

Hi, @all
We have verified a working process for upgrading to r36.4.7.
Please give it a try and let us know the following:

1. Flash Orin Nano as below

Create a username and password with the below script:

$ sudo ./l4t_create_default_user.sh [-u <user>] [-p <pswd>] [-n <host>] [-a] [-h]

Manually flash the system with the command from the relevant Flashing Support guide:

NVMe: Flashing Support — NVIDIA Jetson Linux Developer Guide
SDCard: Flashing Support — NVIDIA Jetson Linux Developer Guide

2. Upgrade to r36.4.7 manually

Please do the upgrade manually:

$ sudo apt update
$ sudo apt dist-upgrade

3. Test the app
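
For example, to confirm the upgrade took effect and to re-test with one of the models reported in this thread:

$ cat /etc/nv_tegra_release    # should now report the r36.4.7 BSP
$ ollama run llama3.2:3b       # a model that previously failed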

We are still working with our internal team to understand why the upgrade causes the memory allocation failure.
In the meantime, please try the above to see whether it avoids the issue for now.

Thanks.

Given my knowledge level, I will more than likely need to wait until someone provides more specific commands to follow. In the meantime, I wonder whether this apt update/apt upgrade issue, once fixed, will help. Is anyone else seeing similar unmet dependencies?

Please release a fixed SD card version or provide the option via SDK Manager.

I am certainly no expert and had a steep learning curve, but there must be another way to provide a proper working solution (or update) without a complete reflash of the system that would lose all my carefully tuned containers, programs, and models, which ran without issues before the update.

This is exactly what is happening to me.

Hi @AastaLLL ,
I tried this procedure and I am experiencing fewer issues with Ollama and the moondream:v2 model. For example, with Ollama running natively and OpenWebUI running in Docker, it works. But if I run both in Docker, the error occurs more regularly.

On the other hand, the memory issue is less noticeable in my memory allocation test, where I can now go up to 2048 MiB.

I ran tests under L4T 36.4.4 and had no problems with large allocations (e.g., 4096 MiB).
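
In case it helps anyone else reproduce these numbers, here is a sketch of the kind of single-allocation probe I mean (a hypothetical cuda_alloc_probe.go, a simplified variant of the cuda_alloc.go I posted above, with the size in MiB taken from the command line; adjust the CUDA paths if your install differs):

package main

/*
#cgo CFLAGS: -I/usr/local/cuda/include
#cgo LDFLAGS: -L/usr/local/cuda/lib64 -lcudart
#include <cuda_runtime.h>

// Try a single cudaMalloc of the requested size; return NULL on failure.
void *cudaAlloc(size_t size) {
    void *ptr;
    if (cudaMalloc(&ptr, size) != cudaSuccess) {
        return NULL;
    }
    return ptr;
}

void cudaFreeMem(void *ptr) {
    cudaFree(ptr);
}
*/
import "C"
import (
    "fmt"
    "os"
    "strconv"
)

func main() {
    // Size of the single test allocation in MiB; default 2048, overridable on the command line.
    mib := 2048
    if len(os.Args) > 1 {
        if v, err := strconv.Atoi(os.Args[1]); err == nil {
            mib = v
        }
    }
    fmt.Printf("Trying a single cudaMalloc of %d MiB...\n", mib)
    ptr := C.cudaAlloc(C.size_t(mib) * 1024 * 1024)
    if ptr == nil {
        fmt.Println("Allocation FAILED (check dmesg for NvMap errors)")
        os.Exit(1)
    }
    fmt.Println("Allocation OK")
    C.cudaFreeMem(ptr)
}

Build and run, e.g.:

$ go build -o cuda_alloc_probe cuda_alloc_probe.go
$ ./cuda_alloc_probe 2048
$ ./cuda_alloc_probe 4096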

Hi,

Thanks a lot for the testing.

Just to confirm: is the required memory for both (Ollama and OpenWebUI) less than the available system memory?
For example, were you able to run both in Docker on r36.4.4 before?

Could you help us check the ollama log to see if there are some errors related to NvMap?

NvMapMemAllocInternalTagged: 1075072515 error 12

Thanks.

Before the update I was able to run multiple Docker containers (Frigate NVR with multiple camera feeds, OpenWebUI, n8n, Watchtower) at the same time, with Ollama running natively with 4 GB models and larger. Of course there was some throttling and there were available-memory warnings with the larger models, but no errors, and Ollama kept running.

2 Likes