PyTorch CUDACachingAllocator NVML assertion when sharing CUDA context with llama.cpp on Orin Nano 8 GB (JetPack 6.2.2)

On Jetson Orin Nano Super 8 GB running JetPack 6.2.2 (L4T R36.5.0), I
cannot run llama.cpp dev-build (CUDA-enabled) and any PyTorch-based NeMo
ASR model concurrently on the same device. PyTorch fails at allocator
init with:

RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at
"/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":838

Stack:

  • L4T R36.5.0 (JetPack 6.2.2), kernel 5.15.148-tegra, MAXN_SUPER
  • llama.cpp built from source at commit f3c3e0e with
    -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87
  • PyTorch from NVIDIA’s official Jetson wheel (jp/v62/), pinned to numpy<2
  • NeMo 2.0.0
  • cma=512M on kernel cmdline

The failure fires regardless of which side starts first:

  • llama.cpp first, then PyTorch tries model.to("cuda"): NVML assertion above
  • PyTorch first (model on CUDA), then llama.cpp starts: cudaMalloc fails
    OOM on the 929 MB weight buffer (different failure mode, presumably
    NvMap fragmentation from PyTorch having subdivided the pool)

CTranslate2-based ASR providers (faster-whisper, Røst-CT2) are not
affected — those use their own CUDA binding, not PyTorch’s caching
allocator. The issue looks specific to the PyTorch
CUDACachingAllocator + Tegra NVML interaction.

I’ve written up the full reproducer, three hypotheses for the root
cause, and a list of workarounds I’ve tried (none fully working) here:

Two questions:

  1. Has anyone seen this resolved on a different llama.cpp commit or
    PyTorch build for Jetson?
  2. Is there an NVML-related env var or build flag I should try?
    PYTORCH_NO_CUDA_NVML=1 didn’t change the behaviour.

If anyone has run the official Package llama_cpp · GitHub container
alongside a PyTorch ASR model on JetPack 6.2.x, I’d love to know if that
combination works.

Hi,

It looks like the error is OOM.
Could you verify this with tegrastats as well?

$ sudo tegrastats

Thanks.

Hey @AastaLLL,

Thanks for the quick reply!

I ran the reproducer with tegrastats at a 500 ms cadence, plus pre/post
snapshots of /proc/meminfo, /proc/buddyinfo and free -m. The short
version: it isn’t a classical OOM, but it is a contiguous-memory
problem as I see it.
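
For anyone comparing their own /proc/buddyinfo snapshots, here is a minimal sketch of how the largest free contiguous block per zone can be read off. It assumes the usual buddyinfo layout (one column per order, starting at order 0) and 4 KiB pages; the SAMPLE text is made up for illustration and is not from my capture.

```python
import re

PAGE_SIZE_KIB = 4  # assumption: 4 KiB pages


def largest_free_block_kib(buddyinfo_text):
    """Map (node, zone) -> size in KiB of the largest free block listed."""
    result = {}
    for line in buddyinfo_text.splitlines():
        m = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)\s+(.*)", line)
        if not m:
            continue
        # Column i holds the number of free blocks of 2**i pages.
        counts = [int(c) for c in m.group(3).split()]
        orders = [order for order, n in enumerate(counts) if n > 0]
        largest = (PAGE_SIZE_KIB << max(orders)) if orders else 0
        result[(int(m.group(1)), m.group(2))] = largest
    return result


# Hypothetical snapshot, for illustration only.
SAMPLE = """\
Node 0, zone      DMA     12      8      5      3      2      1      1      0      0      0      0
Node 0, zone   Normal    340    120     60     22      9      4      2      1      1      0      0
"""

if __name__ == "__main__":
    for key, kib in largest_free_block_kib(SAMPLE).items():
        print(key, kib, "KiB")
```

On a real device the text would come from open("/proc/buddyinfo").read(), taken before and after the failing load.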

RAM isn’t the issue. MemAvailable stays around 6 GB the whole time,
nothing gets killed, and there’s no swap on Jetson anyway. The same
NeMo model loads on CUDA fine in 9.8 s when llama.cpp isn’t already
running.

What does happen is that the CMA pool gets squeezed. tegrastats shows
lfb dropping to 1x4MB across three consecutive samples right before
the assertion fires, with one intermediate sample where lfb is 6x1MB,
meaning no 4 MB contiguous block at all. And just before the PyTorch
traceback, dmesg has four entries of:

NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0

It seems NvMap is trying to grab a ~1 GB contiguous DMA buffer and
failing. The total CMA pool is 512 MB (cma=512M on the kernel cmdline;
cma=1G refuses to reserve at boot on Tegra234 because of the hardware
carveouts), and llama.cpp’s ggml-CUDA context has already taken roughly
300 MB of it by the time PyTorch starts. There just isn’t 1 GB of
contiguous CMA left to give out.
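
The "~1 GB" and "out of memory" reading of that dmesg line can be checked with a small sketch like the one below. It rests on my assumption that the first number is a size in bytes and that "error 12" is a plain Linux errno; neither is confirmed by NvMap documentation. decode_nvmap_line is a hypothetical helper name.

```python
import errno
import re

NVMAP_RE = re.compile(r"NvMapMemAllocInternalTagged:\s+(\d+)\s+error\s+(\d+)")


def decode_nvmap_line(line):
    """Decode size and errno from an NvMapMemAllocInternalTagged line.

    Assumption: the first number is bytes, the second a Linux errno.
    Returns None for lines that do not match.
    """
    m = NVMAP_RE.search(line)
    if m is None:
        return None
    size, code = int(m.group(1)), int(m.group(2))
    return {
        "bytes": size,
        "mib": round(size / 2**20, 1),
        "errno": errno.errorcode.get(code, "UNKNOWN"),
    }


if __name__ == "__main__":
    print(decode_nvmap_line("NvMapMemAllocInternalTagged: 1075072515 error 12"))
```

Under those assumptions the request is a little over 1 GiB, which is twice the whole 512 MB pool.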

PyTorch’s CUDACachingAllocator then hits its NVML query and the
assertion fires. I’m not fully sure whether the NVML failure is a
direct downstream effect of the NvMap state, or a parallel symptom of
CUDA context init failing under CMA exhaustion. Either way, the
user-visible failure is a PyTorch internal assertion that gives no
hint about NvMap or CMA, which is what made this hard to track down
in the first place for me.

Full annotated capture with the raw tegrastats samples, NvMap stderr,
the PyTorch trace and the pre/post snapshots:

Wrapper script in case anyone wants to capture this on their own stack:

Three things I’m hoping you or someone on the Tegra team can help
with:

  1. Is there any way to give NvMap more CMA on Orin Nano than 512 MB?
    cma=1G isn’t an option for the carveout reason above. Are there
    NvMap-specific tunables in /proc/sys, sysfs or as kernel module
    params that would let two CUDA processes share the pool more
    cooperatively rather than one pinning a chunk and starving the
    other?

  2. Is there a PyTorch build flag or runtime env var that disables
    NVML tracking in CUDACachingAllocator on embedded targets, where
    NVML doesn’t behave the way it does on discrete GPUs? I tried
    PYTORCH_NO_CUDA_NVML=1 with no effect on the official Jetson wheel.

  3. When the JetPack 7.2 NVIDIA llama.cpp container
    ( Package llama_cpp · GitHub ) ships for
    Orin Nano, will it sidestep this by using a different CMA strategy,
    or will the same contention show up when it shares a device with
    PyTorch?

Happy to run anything else you want me to try. Stack details are
unchanged from the OP btw.

Thanks.

Hi,

We are not aware of the CMA issue.
Do you mean the assertion is triggered by running out of CMA instead of running out of memory?
If so, based on your observation, is it PyTorch or NvMap that requires the CMA?

We need to discuss this issue with our internal team, but we want to clarify the above question first.

Thanks.

Hi,

Thanks for picking this up. I really appreciate that you’re bringing it to the internal team.

To answer your two questions:

1) Is it CMA exhaustion or general OOM?

CMA. MemAvailable stayed at ~6 GB the entire failure window. The line that actually triggers the chain is NvMap not being able to satisfy a contiguous ~1 GB DMA buffer:

NvMapMemAllocInternalTagged: 1075072515 error 12   (~1.0 GB, ENOMEM)

So plenty of RAM total, just not enough physically contiguous memory in the CMA pool.

2) Does PyTorch or NvMap require CMA?

NvMap is the one that ends up requesting CMA. PyTorch never asks for it directly. On Tegra UMA the chain is roughly: PyTorch CUDACachingAllocator to CUDA driver (libcuda) to NvMap kernel driver to CMA pool. The last step is because the DMA buffer needs to be physically contiguous.

llama.cpp’s ggml-CUDA goes through the same path. That’s why they collide: two NvMap clients drawing from the same CMA pool. Before anything starts, CmaFree is ~472 MB. After llama-server boots and loads Gemma 4 E4B, CmaFree is ~168 MB. PyTorch then needs ~1 GB contiguous, NvMap returns ENOMEM, and CUDACachingAllocator asserts on its NVML query.
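
The CmaFree figures above come from /proc/meminfo. A minimal sketch of the read-out, in case someone wants to log it around their own model loads; cma_kib and read_cma are hypothetical helper names, and the sample values in the usage note are illustrative rather than copied from my capture.

```python
def cma_kib(meminfo_text):
    """Extract CmaTotal and CmaFree (in kB, as reported) from meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("CmaTotal", "CmaFree"):
            fields[key] = int(rest.split()[0])
    return fields


def read_cma(path="/proc/meminfo"):
    """Read the CMA counters from the live system."""
    with open(path) as f:
        return cma_kib(f.read())
```

For example, cma_kib("CmaTotal:  524288 kB\nCmaFree:  172032 kB\n") gives both counters as integers, so calling read_cma() before llama-server starts, after it is healthy, and right before the PyTorch allocation shows how the pool moves at each step.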

On a discrete GPU this wouldn’t happen because each CUDA process gets its own VRAM carveout. On Tegra UMA they’re sharing NvMap and therefore CMA.

One extra data point since the 14 May capture, in case it’s useful for the internal discussion. I ran an 8-hour continuous-load test overnight (1187 inference cycles, llama-server only, no PyTorch-CUDA in the mix). The thinking behind the test was: does CMA pressure build up slowly over time, or is it a one-shot thing at model-load?

The answer was: one-shot. CmaFree drops at model-load and stays there. lfb holds at 1×4 MB for the full 8 hours, RAM holds steady at ~6.9 GB, TJ around 52.5 C. No drift, no leak, no thermal issue. So whatever the internal team ends up looking at, this is an init-time allocation pattern, not something that worsens under load.
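
For checking the lfb trend across a long tegrastats log, a sketch along these lines is enough. It assumes the "(lfb NxSMB)" field format that tegrastats prints on my unit; the SAMPLES lines are shortened, made-up examples, and parse_lfb / smallest_lfb_kib are hypothetical helper names.

```python
import re

LFB_RE = re.compile(r"\(lfb (\d+)x(\d+)(kB|MB)\)")


def parse_lfb(line):
    """Return (block_count, block_size_kib) from one tegrastats line, or None."""
    m = LFB_RE.search(line)
    if m is None:
        return None
    count, size, unit = int(m.group(1)), int(m.group(2)), m.group(3)
    return count, size * (1024 if unit == "MB" else 1)


def smallest_lfb_kib(lines):
    """Smallest 'largest free block' size seen across a log, in KiB."""
    sizes = [p[1] for p in map(parse_lfb, lines) if p is not None]
    return min(sizes) if sizes else None


# Shortened, hypothetical tegrastats lines for illustration.
SAMPLES = [
    "RAM 6900/7620MB (lfb 1x4MB) CPU [10%@1510,8%@1510]",
    "RAM 6905/7620MB (lfb 6x1MB) CPU [12%@1510,9%@1510]",
]

if __name__ == "__main__":
    print(smallest_lfb_kib(SAMPLES), "KiB")
```

A log where the smallest value never falls below 4096 KiB matches the "holds at 1×4 MB" behaviour described above.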

If raw tegrastats samples or anything else would help, just say the word and I’ll send the full log over.

Hi,

There have been some recent fixes related to NvMap (in r36.4.7 and r36.5).
Could you also test this issue on JetPack 6.2.1 (r36.4.4) to see if this is related?

Thanks

Hi.

We’re already running L4T 36.5.0 (GCID 43688277, built 2026-01-16), and we still reproduce the issue on this build, so whatever NvMap fixes are in r36.5.0 do not fully resolve my case. Could you confirm whether the fix you mean is in r36.5.0 or a later r36.5.x version? If it landed after my build date I may still be missing it.

I understand the value of comparing against r36.4.4 to classify whether this is the same NvMap issue your fixes targeted. The constraint on our side is that this is our working test/pilot device, so a downgrade-reflash is disruptive right now. We can run the r36.4.4 comparison on a spare unit, or maybe at a later time. Would the r36.5.0 reproduction below already be enough for you to assess, or is the side-by-side specifically what you need?

Our repro: llama.cpp (CUDA backend) fails to allocate its ~2 GB contiguous CUDA0 weight buffer on reload:

NvMapMemAllocInternalTagged: 1075072515 error 12
cudaMalloc failed: out of memory

It triggers specifically when other resident processes (~3 GB anon pages) prevent the kernel from compacting CMA: CmaFree stays low and lfb collapses to small blocks even with several GB MemAvailable. A sync + drop_caches + compact_memory cycle restores enough contiguous memory for the load to succeed, which points at CMA fragmentation rather than true OOM.
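
The sync + drop_caches + compact_memory cycle can be sketched as below. It uses the standard Linux knobs /proc/sys/vm/drop_caches and /proc/sys/vm/compact_memory and needs root on a real system; reclaim_and_compact and its proc_root parameter are hypothetical names, the latter only there so the function can be tried against a scratch directory.

```python
import os


def reclaim_and_compact(proc_root="/proc"):
    """Flush dirty pages, drop clean caches, then ask the kernel to compact.

    Needs root when proc_root is the real /proc.
    """
    os.sync()  # write dirty pages back so the caches can actually be dropped
    vm = os.path.join(proc_root, "sys", "vm")
    with open(os.path.join(vm, "drop_caches"), "w") as f:
        f.write("3\n")  # 3 = page cache plus dentries and inodes
    with open(os.path.join(vm, "compact_memory"), "w") as f:
        f.write("1\n")  # any write triggers compaction of all zones


# Usage (as root), right before reloading the model:
#   reclaim_and_compact()
```

This is a workaround rather than a fix: it only helps when the pages in the way are reclaimable or movable.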

Is there a recommended cma= sizing or NvMap tunable for the 8 GB Orin Nano under mixed anon + contiguous pressure?

Thanks.

Hi,

The r36.5 fix is related to a CVE issue on NvMap:

The test for r36.4.4 helps us figure out if this issue is caused by the recent changes.
Thanks.

Hi AastaLLL,

Thanks, that clarifies it!

If the r36.5 NvMap change is the security/CVE patch, that’s helpful to know. My image is from 2026-01-16, so I may be missing the patch if it landed on the package server after that. We’ll pull the latest NvMap security updates via the Debian package server, re-test on r36.5.0 first, and report here whether our reproduction changes.

On the r36.4.4 side-by-side: I understand its value for classifying whether this is a regression versus long-standing behaviour, and I would like to give you that data. The honest constraint is that I currently have a single Orin Nano: it’s my active test/pilot device, and a downgrade-reflash would take my only working bench offline. So I can’t run the comparison right now, but I will do it on a spare unit as soon as I can free one up, and report the results here.

While that’s pending, would you be able to share a recommended cma= sizing or NvMap tunable for the 8 GB Orin Nano under mixed anon + contiguous pressure? Even an interim tunable would help. Our current workaround (sync + drop_caches + compact_memory before reload) reliably restores enough contiguous memory to keep us running, but we’d prefer a supported tunable over a periodic-reclaim workaround.

Thanks again for the continued help, it’s much appreciated.

Hi

We have checked with our internal team but NvMap doesn’t allocate memory from the CMA. The allocation failure logs may indicate a scenario where an NvMap client requests a contiguous buffer, NvMap forwards the request to the kernel, and the allocation fails because the kernel cannot provide contiguous memory.

To check it further, we want to add some debug prints internally and check the size and flag of the memory request.
How can we reproduce this issue?
Should we follow the reproducing section mentioned below:

Thanks

Hi,

Thanks, and good to know on the CMA point!

I’ve fixed that in my notes.

Yes, the Reproducing section covers it. The one that hits this assertion is scripts/concurrent-loadtest.sh (llama-server on CUDA, wait for /health, then a NeMo model .to('cuda')). If you want the memory snapshots sitting next to your prints, scripts/nvml-diag-capture.sh runs the same thing with tegrastats and pre/post meminfo + buddyinfo.

You don’t need NeMo or a model download to trigger it, though: it’s the first CUDA allocation from PyTorch that fires it, so once llama-server holds a context, this is enough:

# terminal 1
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 LLAMA_ARG_FIT=off \
  llama-server -m /path/to/gemma-4-E4B-it-Q4_K_M.gguf \
    --host 127.0.0.1 --port 8080 \
    --ctx-size 1024 --parallel 1 --batch-size 128 --fit off -ngl 28 \
    --threads 4 --no-warmup --reasoning off --reasoning-budget 0

# terminal 2, once /health is ok
python3 -c "import torch; torch.zeros(1).cuda()"

llama.cpp is commit f3c3e0e (-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87), any Gemma-sized Q4_K_M gguf, PyTorch from the official Jetson wheel (jp/v62, numpy<2). Rest of the stack is the OP: L4T 36.5.0 (GCID 43688277, 2026-01-16), MAXN_SUPER, cma=512M, multi-user.target.

The size I can read off the NvMap line is 1075072515, four times, error 12. The flag is the part I can’t get from userspace, so your debug prints are the right way at it. Happy to run a debug build if that’s easier on your end.

One thing, so the prints land on the right target: two symptoms in my posts print that same NvMap line. The concurrent case (PyTorch + llama.cpp) is the CUDACachingAllocator.cpp:838 assertion; that’s this thread and what the repro above hits. The other one, a single-process llama.cpp reload under 3 GB of anon pressure giving cudaMalloc OOM on its weight buffer, is a separate thing I just work around. The concurrent case is the one I’d love the prints on!

If it helps your aim, the two things that visibly move the contiguous allocation here are cma sizing and a compact_memory cycle before load, but I’ll leave the why to what your prints show.

Thanks!