Hi,
I’m trying to run QMD (to speed up the index search in openclaw) with node-llama-cpp on a Jetson Orin AGX (aarch64). My goal is to use the GPU backend for embeddings and queries. However, I consistently get CUDA out-of-memory errors, even with small models (~300M parameters).
Steps I followed:
- Installed node-llama-cpp globally via bun.
- Tried to disable CUDA with environment variables:
  export LLAMA_CUDA=0
  export GGML_CUDA_ENABLE=0
  export GGML_CUDA_MAXPOOL=0
- Forced a CPU-only build → works fine, but obviously slower.
- Attempted the GPU build (default precompiled binary) → "CUDA error: out of memory" on pool allocation.
- Compared with Jetson-specific llama.cpp tutorials: they compile from source on the Jetson, and the same GPU model runs fine without manually setting pool sizes.
Observations:
- node-llama-cpp precompiled binaries seem to reserve a large fixed CUDA VMM pool, which fails on Jetson’s limited GPU memory.
- Jetson tutorials appear to adapt the pool size dynamically, avoiding OOM.
- Even with LLAMA_CUDA=0, the binary still detects CUDA and tries to initialize the GPU backend.
Questions:
- How does NVIDIA / node-llama-cpp handle VMM pool allocation for small-memory GPUs like Jetson?
- Is there a recommended way to run node-llama-cpp on the GPU on Jetson without recompiling the entire package manually?
- Are there any environment variables, CMake flags, or patches recommended for dynamic pool allocation?
I’m looking for guidance on making the GPU runtime work efficiently on a Jetson Orin AGX with node-llama-cpp + QMD.
Thanks in advance!
Hi,
Could you share which OS version you are using?
$ cat /etc/nv_tegra_release
There are some issues related to allocatable memory in r36.4.7.
Please make sure you have upgraded the environment to r36.5.
Thanks.
Hi,
We were running on JetPack r36.4.3 when we encountered the CUDA OOM issue.
The root cause was that the precompiled node-llama-cpp binary enables CUDA VMM by default, which attempts to reserve a large virtual memory pool. On Jetson (unified memory, limited allocatable contiguous GPU memory), this caused immediate CUDA out-of-memory errors even for small models (~300M).
We resolved it by rebuilding node-llama-cpp from source and explicitly disabling CUDA VMM via CMake options.
The key environment variables we used were:
export NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA=ON
export NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM=ON
Then we forced a source build:
npm install -g node-llama-cpp --build-from-source
Setting GGML_CUDA_NO_VMM=ON prevented the large fixed CUDA VMM pool reservation and allowed the GPU backend to work correctly on r36.4.3 without OOM.
Additionally, since QMD was launched by openclaw and did not inherit our shell environment, we wrapped the QMD binary with a small launcher script to inject the required environment variables before execution.
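For reference, the launcher was along these lines (a minimal sketch; the file names qmd-wrapper.sh and qmd.real are placeholders, not the actual paths we used):

```shell
# Generate a wrapper that injects the required environment variables and
# then hands off to the real binary. "qmd.real" is a placeholder name:
# point it at the actual QMD executable and have openclaw launch the
# wrapper instead of QMD directly.
cat > qmd-wrapper.sh <<'EOF'
#!/bin/sh
export NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA=ON
export NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM=ON
exec "$(dirname "$0")/qmd.real" "$@"
EOF
chmod +x qmd-wrapper.sh
```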
After these changes, embeddings and queries run successfully on the GPU.
Best regards.
Hi,
For AGX Orin, the memory size is 32 GiB or 64 GiB.
You should not run out of memory with a small model, even though VMM pre-allocates some memory for the KV cache.
Have you measured the remaining memory before and after running node-llama-cpp?
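For example (free is generic Linux; tegrastats is Jetson-specific):

```shell
# Jetson uses unified memory, so CPU RAM and GPU memory come from the
# same pool and `free` reflects what the GPU backend can still allocate.
# Take a snapshot before and after loading the model:
free -m
# For a live per-second view of RAM/GPU usage on Jetson, you can also run:
#   sudo tegrastats --interval 1000
```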
Thanks.