Hi, I'm completely new to the Jetson Orin Nano and confused about how to run multiple small LLMs together on the GPU. Can anyone help me with it?
I’m trying to run Qwen3-ASR-0.6B on an NVIDIA Jetson Orin Nano Super (8 GB unified memory) and I consistently hit a CUDA allocator crash coming from NVML. I’m wondering if there is a known workaround or recommended setup for Jetson-class devices.
Hardware / software
- Board: Jetson Orin Nano Super, 8 GB unified memory (CPU + GPU)
- OS: Ubuntu 22.04 (JetPack 6 / R36.4.7)
- CUDA: 12.6
- PyTorch: 2.5.0a0+872d972e41.nv24.08 (NVIDIA's official Jetson build)
- Transformers: 4.57.6
- qwen-asr: latest from pip
Core error
When I move the model to CUDA or run `transcribe`, I eventually get:

```
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED
at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":838
```

From debugging, this is triggered when the PyTorch CUDA caching allocator calls `nvmlDeviceGetMemoryInfo()`. On Jetson, NVML returns an error instead of `NVML_SUCCESS`, so the internal assert fires.
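To double-check that NVML itself is the problem, the same query the allocator makes can be probed outside PyTorch. This is just a sketch of mine (it assumes the optional `pynvml` package, and the helper name is my own); on Jetson I'd expect the init or memory query to fail:

```python
def nvml_memory_probe() -> str:
    """Attempt the same NVML memory query the caching allocator makes."""
    try:
        import pynvml  # NVIDIA's NVML bindings (nvidia-ml-py)
    except ImportError:
        return "pynvml not installed"
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        pynvml.nvmlShutdown()
        return f"NVML ok: {info.free} bytes free"
    except pynvml.NVMLError as e:
        # This is the path I believe Jetson takes, which trips the assert.
        return f"NVML error: {e}"

print(nvml_memory_probe())
```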
What I’ve already tried

- Simple `.to("cuda")`:

  ```python
  model = Qwen3ASRModel.from_pretrained(MODEL_DIR)
  model = model.to("cuda")
  ```

  → Crashes with the NVML assert during `.to("cuda")`.

- `device_map="auto"` + `max_memory`:

  ```python
  model = Qwen3ASRModel.from_pretrained(
      MODEL_DIR,
      torch_dtype=torch.float16,
      device_map="auto",
      max_memory={0: "2GiB", "cpu": "4GiB"},
  )
  ```

  → Same NVML assert during `from_pretrained`.

- Disabling the caching allocator:

  ```python
  os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
  model = Qwen3ASRModel.from_pretrained(MODEL_DIR, torch_dtype=torch.float16)
  model = model.to("cuda")
  ```

  → The model loads on CUDA, but `model.transcribe()` hits CUDA OOM in the audio encoder (`F.conv2d`), probably due to fragmentation without the caching allocator.

- Allocator tweaks like:

  ```shell
  export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
  # or
  export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
  ```

  → `expandable_segments:False` lets the model load, but `transcribe()` still eventually triggers the same NVML assert when a `cudaMalloc` fails.

- CPU-only fp32:

  ```python
  model = Qwen3ASRModel.from_pretrained(MODEL_DIR)
  model.transcribe(audio="audio.wav", language="English")
  ```

  → Works, but is far too slow on the ARM CPU (~5–10+ minutes per utterance), so it is not usable for my real-time use case.
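One detail I had to watch with the allocator environment variables: as far as I understand, PyTorch reads them when the CUDA allocator initializes, so to be safe I set them before `import torch`. A small guard sketch (the helper name is my own):

```python
import os
import sys

def set_cuda_alloc_conf(conf: str) -> None:
    """Set PYTORCH_CUDA_ALLOC_CONF, failing loudly if torch is already imported.

    The allocator picks the setting up during CUDA initialization; setting it
    after CUDA is already up appears to be silently ignored.
    """
    if "torch" in sys.modules:
        raise RuntimeError("torch already imported; allocator config may be ignored")
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf

set_cuda_alloc_conf("expandable_segments:False")
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```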
Other PyTorch-based components (e.g., Silero VAD) and LLMs (Ollama tiny models, MedGemma) run fine on this device.
My questions
- Is Jetson (8 GB unified memory) an officially supported / tested target for Qwen3-ASR-0.6B?
- Is there a recommended way to run Qwen3-ASR on Jetson that avoids this NVML-related crash (e.g., specific PyTorch version, flags, or a non-PyTorch runtime such as ONNX/TensorRT export that you support)?
- Is there a smaller / more Jetson-friendly Qwen ASR variant that you recommend for 8 GB unified memory devices?
- If the answer is to patch PyTorch's `CUDACachingAllocator` (remove the `NVML_SUCCESS == r` assert), do you have:
  - any guidance on whether this is safe for Qwen3-ASR, and
  - a known-good Jetson configuration (model size, dtype, sequence length, etc.) that fits within 8 GB unified memory?
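In the meantime I'm considering a CPU-fallback wrapper so my app degrades instead of crashing. A sketch under the assumption that the model object exposes `.transcribe()`, `.to()`, and `.float()`; the wrapper and the error-string matching are my own, not part of the qwen-asr API:

```python
def transcribe_with_cpu_fallback(model, audio, language="English"):
    """Try GPU transcription; on the NVML assert or a CUDA OOM, retry on CPU fp32.

    Assumes `model` exposes .transcribe() / .to() / .float() like Qwen3ASRModel.
    """
    try:
        return model.transcribe(audio=audio, language=language)
    except RuntimeError as e:
        msg = str(e)
        # Both failure modes I see surface as RuntimeError with these substrings.
        if "NVML" in msg or "out of memory" in msg:
            model = model.to("cpu").float()
            return model.transcribe(audio=audio, language=language)
        raise
```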
For reference, this is the file I am running when I hit the error:
```python
import os
import gc
import ctypes
import time

# Must be set before importing torch so the caching allocator is disabled.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch
from safetensors.torch import load_file

LIBC = ctypes.CDLL("libc.so.6")


def mem_free():
    """Return (free, total) GPU memory in MB after trimming host allocations."""
    gc.collect()
    LIBC.malloc_trim(0)
    free, total = torch.cuda.mem_get_info()
    return free / 1e6, total / 1e6


print("=" * 60, flush=True)
print("Qwen3-ASR-0.6B — Jetson GPU Test (direct load)", flush=True)
print("=" * 60, flush=True)
print(f"PyTorch: {torch.__version__}", flush=True)
free, total = mem_free()
print(f"GPU: {free:.0f} MB free / {total:.0f} MB total", flush=True)

MODEL_DIR = os.path.join(os.path.dirname(__file__), "models", "Qwen3-ASR-0.6B")
AUDIO_FILE = os.path.join(os.path.dirname(__file__), "models", "audios", "audio1_hello.mp3")
SAFETENSORS = os.path.join(MODEL_DIR, "model.safetensors")

print("\n[1] Creating model skeleton on meta device...", flush=True)
t0 = time.time()
from transformers import AutoConfig, AutoProcessor
from qwen_asr import Qwen3ASRModel
from qwen_asr.core.transformers_backend.modeling_qwen3_asr import Qwen3ASRForConditionalGeneration

config = AutoConfig.from_pretrained(MODEL_DIR, trust_remote_code=True)
with torch.device("meta"):
    meta_model = Qwen3ASRForConditionalGeneration(config)
print(f"    Meta model created in {time.time() - t0:.1f}s", flush=True)
free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print("\n[2] Loading weights from safetensors directly to CUDA (fp16)...", flush=True)
t0 = time.time()
state_dict = load_file(SAFETENSORS, device="cuda")
print(f"    Loaded state_dict to CUDA in {time.time() - t0:.1f}s", flush=True)
free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print("\n[3] Assigning weights to model...", flush=True)
t0 = time.time()
missing, unexpected = meta_model.load_state_dict(state_dict, strict=False, assign=True)
del state_dict
gc.collect()
LIBC.malloc_trim(0)
print(f"    Assigned in {time.time() - t0:.1f}s", flush=True)
if missing:
    print(f"    Missing keys: {len(missing)}", flush=True)
if unexpected:
    print(f"    Unexpected keys: {len(unexpected)}", flush=True)
meta_model = meta_model.half()
free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print("\n[4] Creating ASR wrapper...", flush=True)
processor = AutoProcessor.from_pretrained(MODEL_DIR, fix_mistral_regex=True)
model = Qwen3ASRModel(
    backend="transformers",
    model=meta_model,
    processor=processor,
    sampling_params=None,
    forced_aligner=None,
)
free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print(f"\n[5] Transcribing: {os.path.basename(AUDIO_FILE)}", flush=True)
t0 = time.time()
results = model.transcribe(audio=AUDIO_FILE, language="English")
elapsed = time.time() - t0
text = results[0].text.strip() if results else "(empty)"
print(f"    Cold: {elapsed:.2f}s", flush=True)
print(f'    Text: "{text}"', flush=True)

print("\n[6] Warm run...", flush=True)
t0 = time.time()
results = model.transcribe(audio=AUDIO_FILE, language="English")
elapsed = time.time() - t0
text = results[0].text.strip() if results else "(empty)"
print(f"    Warm: {elapsed:.2f}s", flush=True)
print(f'    Text: "{text}"', flush=True)

free, _ = mem_free()
print(f"\nGPU free at end: {free:.0f} MB", flush=True)
print("=" * 60, flush=True)
print("DONE — GPU inference works!", flush=True)
```
