On a Jetson Orin Nano Super 8 GB running JetPack 6.2.2 (L4T R36.5.0), I
cannot run a CUDA-enabled llama.cpp dev build and any PyTorch-based NeMo
ASR model concurrently on the same device. PyTorch fails at allocator
init with:
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at
"/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":838
Stack:
- L4T R36.5.0 (JetPack 6.2.2), kernel 5.15.148-tegra, MAXN_SUPER
- llama.cpp built from source at commit f3c3e0e with -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87
- PyTorch from NVIDIA’s official Jetson wheel (jp/v62/), pinned to numpy<2
- NeMo 2.0.0
- cma=512M on kernel cmdline
The failure fires regardless of which side starts first:
- llama.cpp first, then PyTorch tries model.to("cuda"): the NVML assertion above
- PyTorch first (model on CUDA), then llama.cpp starts: cudaMalloc fails
with OOM on the 929 MB weight buffer (a different failure mode, presumably
NvMap fragmentation from PyTorch having subdivided the pool)
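Because the Orin's GPU shares system DRAM with the CPU, one rough way to see how much memory is left before the second process starts is to read /proc/meminfo on the device. The following is a minimal diagnostic sketch, not part of the reproducer and not a fix. The helper names parse_meminfo and summarize are hypothetical, and the CmaTotal/CmaFree fields are only present when the kernel exposes CMA statistics. It does not show NvMap fragmentation directly, only the overall headroom.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Size fields are reported by the kernel in kB; a few fields
    (e.g. HugePages_Total) are plain counts with no unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields, keys=("MemTotal", "MemAvailable", "CmaTotal", "CmaFree")):
    """Return the selected size fields in MiB; missing fields map to None."""
    return {k: (fields[k] // 1024 if k in fields else None) for k in keys}


if __name__ == "__main__":
    # Assumption: run on the Jetson itself, once before and once after
    # the first process (llama.cpp or PyTorch) has loaded its model.
    path = "/proc/meminfo"
    if os.path.exists(path):
        with open(path) as f:
            print(summarize(parse_meminfo(f.read())))
    else:
        print("no /proc/meminfo on this system")
```

Comparing the numbers from the two runs gives a first indication of whether the second allocation could fit at all, independent of which allocator is asking.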
CTranslate2-based ASR providers (faster-whisper, Røst-CT2) are not
affected — those use their own CUDA binding, not PyTorch’s caching
allocator. The issue looks specific to the PyTorch
CUDACachingAllocator + Tegra NVML interaction.
I’ve written up the full reproducer, three hypotheses for the root
cause, and a list of workarounds I’ve tried (none fully working) here:
Two questions:
- Has anyone seen this resolved on a different llama.cpp commit or
PyTorch build for Jetson?
- Is there an NVML-related env var or build flag I should try?
PYTORCH_NO_CUDA_NVML=1 didn’t change the behaviour.
If anyone has run the official llama_cpp container (GitHub package)
alongside a PyTorch ASR model on JetPack 6.2.x, I’d love to know if that
combination works.