How to use Qwen3-ASR-0.6B on a Jetson Orin Nano?

Hi, I am completely new to the Jetson Orin Nano and confused about how to run multiple small LLMs together on the GPU. Can anyone help me with it?

I’m trying to run Qwen3-ASR-0.6B on an NVIDIA Jetson Orin Nano Super (8 GB unified memory) and I consistently hit a CUDA allocator crash coming from NVML. I’m wondering if there is a known workaround or recommended setup for Jetson-class devices.

Hardware / software

  • Board: Jetson Orin Nano Super, 8 GB unified memory (CPU + GPU)
  • OS: Ubuntu 22.04 (JetPack 6 / R36.4.7)
  • CUDA: 12.6
  • PyTorch: 2.5.0a0+872d972e41.nv24.08 (NVIDIA’s official Jetson build)
  • Transformers: 4.57.6
  • qwen-asr: latest from pip

Core error

When I move the model to CUDA or run transcribe, I eventually get:

RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED
  at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":838

From debugging, this is triggered when the PyTorch CUDA caching allocator calls nvmlDeviceGetMemoryInfo(). On Jetson, NVML returns an error instead of NVML_SUCCESS, so the internal assert fires.
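To confirm this locally, a quick ctypes probe (my own sketch, not part of PyTorch or qwen-asr) can check whether NVML even loads and initialises. On Jetson's integrated-GPU driver stack this typically fails or returns an error, which is exactly the condition that trips PyTorch's internal assert:

```python
import ctypes

def nvml_available():
    # Try to load the NVML shared library and initialise it. On Jetson
    # (iGPU, no discrete-GPU driver stack) this usually fails, which is
    # why the allocator's nvmlDeviceGetMemoryInfo() call asserts.
    try:
        nvml = ctypes.CDLL("libnvidia-ml.so.1")
    except OSError:
        return False
    # NVML_SUCCESS == 0
    return nvml.nvmlInit_v2() == 0

print("NVML available:", nvml_available())
```

If this prints `False` on your board, the crash is an environment limitation rather than anything specific to the Qwen3-ASR model.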

What I’ve already tried

  • Simple .to("cuda"):

    model = Qwen3ASRModel.from_pretrained(MODEL_DIR)
    model = model.to("cuda")
    

    → Crashes with the NVML assert during .to("cuda").

  • device_map="auto" + max_memory:

    model = Qwen3ASRModel.from_pretrained(
        MODEL_DIR,
        torch_dtype=torch.float16,
        device_map="auto",
        max_memory={0: "2GiB", "cpu": "4GiB"},
    )
    

    → Same NVML assert during from_pretrained.

  • Disable caching allocator:

    os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
    model = Qwen3ASRModel.from_pretrained(MODEL_DIR, torch_dtype=torch.float16)
    model = model.to("cuda")
    

    → Model loads on CUDA, but model.transcribe() hits CUDA OOM in the audio encoder (F.conv2d), probably due to fragmentation without the caching allocator.

  • Allocator tweaks like:

    export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
    # or
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
    

    expandable_segments:False lets the model load, but transcribe() still eventually triggers the same NVML assert when a cudaMalloc fails.

  • CPU-only fp32:

    model = Qwen3ASRModel.from_pretrained(MODEL_DIR)
    model.transcribe(audio="audio.wav", language="English")
    

    → Works, but is far too slow on the ARM CPU (~5–10+ minutes per utterance), so not usable for my real-time use case.
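One pitfall worth noting with the `PYTORCH_NO_CUDA_MEMORY_CACHING` attempt above: allocator-related environment variables are read when PyTorch initialises CUDA, so they must be set before `import torch`. A minimal guard (the `set_alloc_env` helper name is my own) makes the ordering requirement explicit:

```python
import os
import sys

def set_alloc_env(var, value):
    # Allocator env vars are picked up at CUDA init time, so setting them
    # after torch has been imported (and possibly touched CUDA) is a no-op.
    if "torch" in sys.modules:
        raise RuntimeError(f"{var} must be set before importing torch")
    os.environ[var] = value

set_alloc_env("PYTORCH_NO_CUDA_MEMORY_CACHING", "1")
# Only now: import torch
```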

Other PyTorch-based components (e.g., Silero VAD) and LLMs (Ollama tiny models, MedGemma) run fine on this device.

My questions

  • Is Jetson (8 GB unified memory) an officially supported / tested target for Qwen3-ASR-0.6B?
  • Is there a recommended way to run Qwen3-ASR on Jetson that avoids this NVML-related crash (e.g., specific PyTorch version, flags, or a non-PyTorch runtime such as ONNX/TensorRT export that you support)?
  • Is there a smaller / more Jetson-friendly Qwen ASR variant that you recommend for 8 GB unified memory devices?
  • If the answer is to patch PyTorch’s CUDACachingAllocator (remove the NVML_SUCCESS == r assert), do you have:
    • Any guidance on whether this is safe for Qwen3-ASR, and
    • A known-good Jetson configuration (model size, dtype, sequence length, etc.) that fits within 8 GB unified memory?
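On the last question, a back-of-envelope budget (my own arithmetic, not an official figure) suggests the weights themselves are not the problem on an 8 GB board; it is activations plus the rest of the system sharing unified memory:

```python
def fp16_weight_mem_gib(n_params):
    # fp16 stores 2 bytes per parameter.
    return n_params * 2 / 1024**3

# Rough estimate for a 0.6B-parameter model in fp16.
weights = fp16_weight_mem_gib(0.6e9)
print(f"fp16 weights: ~{weights:.2f} GiB")
```

This lands around 1.1 GiB for the weights alone, so in principle the model should fit; the question is what encoder activation peaks and KV-cache sizes look like at typical utterance lengths.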

When I run the script below, I hit the errors described above.

import os
import gc
import ctypes
import time

os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch
from safetensors.torch import load_file

LIBC = ctypes.CDLL("libc.so.6")

def mem_free():
    gc.collect()
    LIBC.malloc_trim(0)
    free, total = torch.cuda.mem_get_info()
    return free / 1e6, total / 1e6

print("=" * 60, flush=True)
print("Qwen3-ASR-0.6B — Jetson GPU Test (direct load)", flush=True)
print("=" * 60, flush=True)
print(f"PyTorch: {torch.__version__}", flush=True)

free, total = mem_free()
print(f"GPU: {free:.0f} MB free / {total:.0f} MB total", flush=True)

MODEL_DIR = os.path.join(os.path.dirname(__file__), "models", "Qwen3-ASR-0.6B")
AUDIO_FILE = os.path.join(os.path.dirname(__file__), "models", "audios", "audio1_hello.mp3")
SAFETENSORS = os.path.join(MODEL_DIR, "model.safetensors")

print("\n[1] Creating model skeleton on meta device...", flush=True)
t0 = time.time()
from transformers import AutoConfig, AutoProcessor
from qwen_asr import Qwen3ASRModel
from qwen_asr.core.transformers_backend.modeling_qwen3_asr import Qwen3ASRForConditionalGeneration

config = AutoConfig.from_pretrained(MODEL_DIR, trust_remote_code=True)
with torch.device("meta"):
    meta_model = Qwen3ASRForConditionalGeneration(config)
print(f"    Meta model created in {time.time() - t0:.1f}s", flush=True)

free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print("\n[2] Loading weights from safetensors directly to CUDA (fp16)...", flush=True)
t0 = time.time()
state_dict = load_file(SAFETENSORS, device="cuda")
print(f"    Loaded state_dict to CUDA in {time.time() - t0:.1f}s", flush=True)

free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print("\n[3] Assigning weights to model...", flush=True)
t0 = time.time()
missing, unexpected = meta_model.load_state_dict(state_dict, strict=False, assign=True)
del state_dict
gc.collect()
LIBC.malloc_trim(0)
print(f"    Assigned in {time.time() - t0:.1f}s", flush=True)
if missing:
    print(f"    Missing keys: {len(missing)}", flush=True)
if unexpected:
    print(f"    Unexpected keys: {len(unexpected)}", flush=True)

meta_model = meta_model.half()

free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print("\n[4] Creating ASR wrapper...", flush=True)
processor = AutoProcessor.from_pretrained(MODEL_DIR, fix_mistral_regex=True)
model = Qwen3ASRModel(
    backend="transformers",
    model=meta_model,
    processor=processor,
    sampling_params=None,
    forced_aligner=None,
)

free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print(f"\n[5] Transcribing: {os.path.basename(AUDIO_FILE)}", flush=True)
t0 = time.time()
results = model.transcribe(audio=AUDIO_FILE, language="English")
elapsed = time.time() - t0
text = results[0].text.strip() if results else "(empty)"
print(f"    Cold: {elapsed:.2f}s", flush=True)
print(f'    Text: "{text}"', flush=True)

print("\n[6] Warm run...", flush=True)
t0 = time.time()
results = model.transcribe(audio=AUDIO_FILE, language="English")
elapsed = time.time() - t0
text = results[0].text.strip() if results else "(empty)"
print(f"    Warm: {elapsed:.2f}s", flush=True)
print(f'    Text: "{text}"', flush=True)

free, _ = mem_free()
print(f"\nGPU free at end: {free:.0f} MB", flush=True)
print("=" * 60, flush=True)
print("DONE — GPU inference works!", flush=True)

Hi,

There is a known memory issue in r36.4.7.
Please upgrade your device to JetPack 6.2.2 / r36.5 to get the fix:

After that, you can find some related commands in the link below:
(We don’t have a command for the ASR model, but you can try the Qwen3 LLM model first.)

Thanks.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.