Hi,
I’m trying to run QMD (to speed up the index search in openclaw) with node-llama-cpp on a Jetson Orin AGX (aarch64). My goal is to use the GPU backend for embeddings and queries. However, I consistently get CUDA out-of-memory errors, even with small models (~300M parameters).
Steps I followed:
- Installed node-llama-cpp globally via bun.
- Tried to disable CUDA with environment variables:
  export LLAMA_CUDA=0
  export GGML_CUDA_ENABLE=0
  export GGML_CUDA_MAXPOOL=0
- Forced a CPU-only build → works fine, but obviously slower.
- Attempted the GPU build (default precompiled binary) → "CUDA error: out of memory" on pool allocation.
- Compared with Jetson-specific llama.cpp tutorials: they compile from source on the Jetson, and the same GPU model runs fine without manually setting pool sizes.
Observations:
- node-llama-cpp precompiled binaries seem to reserve a large fixed CUDA VMM pool, which fails on Jetson’s limited GPU memory.
- Jetson tutorials appear to adapt the pool size dynamically, avoiding OOM.
- Even with LLAMA_CUDA=0, the binary still detects CUDA and tries to initialize the GPU backend.
Questions:
- How does NVIDIA / node-llama-cpp handle VMM pool allocation for small-memory GPUs like Jetson?
- Is there a recommended way to run node-llama-cpp on the GPU on Jetson without recompiling the entire package manually?
- Are there any environment variables, CMake flags, or patches recommended for dynamic pool allocation?
I’m looking for guidance on making the GPU runtime work efficiently on a Jetson Orin AGX with node-llama-cpp + QMD.
Thanks in advance!
Hi,
Could you share which OS version you are using?
$ cat /etc/nv_tegra_release
There are some issues related to allocatable memory in r36.4.7.
Please make sure you have upgraded the environment to r36.5.
Thanks.
Hi,
We were running on JetPack r36.4.3 when we encountered the CUDA OOM issue.
The root cause was that the precompiled node-llama-cpp binary enables CUDA VMM by default, which attempts to reserve a large virtual memory pool. On Jetson (unified memory, limited allocatable contiguous GPU memory), this caused immediate CUDA out-of-memory errors even for small models (~300M).
We resolved it by rebuilding node-llama-cpp from source and explicitly disabling CUDA VMM via CMake options.
The key environment variables we used were:
export NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA=ON
export NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM=ON
Then we forced a source build:
npm install -g node-llama-cpp --build-from-source
Setting GGML_CUDA_NO_VMM=ON prevented the large fixed CUDA VMM pool reservation and allowed the GPU backend to work correctly on r36.4.3 without OOM.
Additionally, since QMD was launched by openclaw and did not inherit our shell environment, we wrapped the QMD binary with a small launcher script to inject the required environment variables before execution.
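For reference, the launcher was along these lines (a minimal sketch; the file names qmd-wrapper.sh and qmd.real are placeholders, not the actual paths we used):

```shell
# Generate a wrapper that injects the required environment variables and
# then hands off to the real binary. "qmd.real" is a placeholder name:
# point it at the actual QMD executable and have openclaw launch the
# wrapper instead of QMD directly.
cat > qmd-wrapper.sh <<'EOF'
#!/bin/sh
export NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA=ON
export NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM=ON
exec "$(dirname "$0")/qmd.real" "$@"
EOF
chmod +x qmd-wrapper.sh
```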
After these changes, embeddings and queries run successfully on the GPU.
Best regards.
Hi,
For AGX Orin, the memory size is 32 GiB or 64 GiB.
You should not run out of memory with a small model, even though VMM pre-allocates some memory for the KV cache.
Have you measured the remaining memory before and after running node-llama-cpp?
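For example (free is generic Linux; tegrastats is Jetson-specific):

```shell
# Jetson uses unified memory, so CPU RAM and GPU memory come from the
# same pool and `free` reflects what the GPU backend can still allocate.
# Take a snapshot before and after loading the model:
free -m
# For a live per-second view of RAM/GPU usage on Jetson, you can also run:
#   sudo tegrastats --interval 1000
```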
Thanks.