Hi, I'm completely new to the Jetson Orin Nano and confused about how to run multiple small LLMs together on the GPU. Can anyone help me with it?
I’m trying to run Qwen3-ASR-0.6B on an NVIDIA Jetson Orin Nano Super (8 GB unified memory) and I consistently hit a CUDA allocator crash coming from NVML. I’m wondering if there is a known workaround or recommended setup for Jetson-class devices.
Hardware / software
- Board: Jetson Orin Nano Super, 8 GB unified memory (CPU + GPU)
- OS: Ubuntu 22.04 (JetPack 6 / R36.4.7)
- CUDA: 12.6
- PyTorch: 2.5.0a0+872d972e41.nv24.08 (NVIDIA's official Jetson build)
- Transformers: 4.57.6
- qwen-asr: latest from pip
Core error
When I move the model to CUDA or run `transcribe`, I eventually get:

```
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED
at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":838
```

From debugging, this is triggered when the PyTorch CUDA caching allocator calls `nvmlDeviceGetMemoryInfo()`. On Jetson, NVML returns an error instead of `NVML_SUCCESS`, so the internal assert fires.
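To double-check that NVML itself is the problem, the same query the allocator makes can be probed outside PyTorch. This is just a sketch of mine (it assumes the optional `pynvml` package, and the helper name is my own); on Jetson I'd expect the init or memory query to fail:

```python
def nvml_memory_probe() -> str:
    """Attempt the same NVML memory query the caching allocator makes."""
    try:
        import pynvml  # NVIDIA's NVML bindings (nvidia-ml-py)
    except ImportError:
        return "pynvml not installed"
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        pynvml.nvmlShutdown()
        return f"NVML ok: {info.free} bytes free"
    except pynvml.NVMLError as e:
        # This is the path I believe Jetson takes, which trips the assert.
        return f"NVML error: {e}"

print(nvml_memory_probe())
```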
What I’ve already tried

- Simple `.to("cuda")`:

  ```python
  model = Qwen3ASRModel.from_pretrained(MODEL_DIR)
  model = model.to("cuda")
  ```

  → Crashes with the NVML assert during `.to("cuda")`.

- `device_map="auto"` + `max_memory`:

  ```python
  model = Qwen3ASRModel.from_pretrained(
      MODEL_DIR,
      torch_dtype=torch.float16,
      device_map="auto",
      max_memory={0: "2GiB", "cpu": "4GiB"},
  )
  ```

  → Same NVML assert during `from_pretrained`.

- Disabling the caching allocator:

  ```python
  os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
  model = Qwen3ASRModel.from_pretrained(MODEL_DIR, torch_dtype=torch.float16)
  model = model.to("cuda")
  ```

  → The model loads on CUDA, but `model.transcribe()` hits CUDA OOM in the audio encoder (`F.conv2d`), probably due to fragmentation without the caching allocator.

- Allocator tweaks like:

  ```shell
  export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
  # or
  export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
  ```

  → `expandable_segments:False` lets the model load, but `transcribe()` still eventually triggers the same NVML assert when a `cudaMalloc` fails.

- CPU-only fp32:

  ```python
  model = Qwen3ASRModel.from_pretrained(MODEL_DIR)
  model.transcribe(audio="audio.wav", language="English")
  ```

  → Works, but is far too slow on the ARM CPU (~5–10+ minutes per utterance), so it is not usable for my real-time use case.
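One detail I had to watch with the allocator environment variables: as far as I understand, PyTorch reads them when the CUDA allocator initializes, so to be safe I set them before `import torch`. A small guard sketch (the helper name is my own):

```python
import os
import sys

def set_cuda_alloc_conf(conf: str) -> None:
    """Set PYTORCH_CUDA_ALLOC_CONF, failing loudly if torch is already imported.

    The allocator picks the setting up during CUDA initialization; setting it
    after CUDA is already up appears to be silently ignored.
    """
    if "torch" in sys.modules:
        raise RuntimeError("torch already imported; allocator config may be ignored")
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf

set_cuda_alloc_conf("expandable_segments:False")
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```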
Other PyTorch-based components (e.g., Silero VAD) and LLMs (Ollama tiny models, MedGemma) run fine on this device.
My questions
- Is Jetson (8 GB unified memory) an officially supported / tested target for Qwen3-ASR-0.6B?
- Is there a recommended way to run Qwen3-ASR on Jetson that avoids this NVML-related crash (e.g., specific PyTorch version, flags, or a non-PyTorch runtime such as ONNX/TensorRT export that you support)?
- Is there a smaller / more Jetson-friendly Qwen ASR variant that you recommend for 8 GB unified memory devices?
- If the answer is to patch PyTorch's `CUDACachingAllocator` (remove the `NVML_SUCCESS == r` assert), do you have:
  - any guidance on whether this is safe for Qwen3-ASR, and
  - a known-good Jetson configuration (model size, dtype, sequence length, etc.) that fits within 8 GB unified memory?
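In the meantime I'm considering a CPU-fallback wrapper so my app degrades instead of crashing. A sketch under the assumption that the model object exposes `.transcribe()`, `.to()`, and `.float()`; the wrapper and the error-string matching are my own, not part of the qwen-asr API:

```python
def transcribe_with_cpu_fallback(model, audio, language="English"):
    """Try GPU transcription; on the NVML assert or a CUDA OOM, retry on CPU fp32.

    Assumes `model` exposes .transcribe() / .to() / .float() like Qwen3ASRModel.
    """
    try:
        return model.transcribe(audio=audio, language=language)
    except RuntimeError as e:
        msg = str(e)
        # Both failure modes I see surface as RuntimeError with these substrings.
        if "NVML" in msg or "out of memory" in msg:
            model = model.to("cpu").float()
            return model.transcribe(audio=audio, language=language)
        raise
```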
For reference, this is the file I am running when I hit the error:
```python
import os
import gc
import ctypes
import time

# Must be set before importing torch so the caching allocator is disabled.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch
from safetensors.torch import load_file

LIBC = ctypes.CDLL("libc.so.6")


def mem_free():
    """Return (free, total) GPU memory in MB after trimming host allocations."""
    gc.collect()
    LIBC.malloc_trim(0)
    free, total = torch.cuda.mem_get_info()
    return free / 1e6, total / 1e6


print("=" * 60, flush=True)
print("Qwen3-ASR-0.6B — Jetson GPU Test (direct load)", flush=True)
print("=" * 60, flush=True)
print(f"PyTorch: {torch.__version__}", flush=True)
free, total = mem_free()
print(f"GPU: {free:.0f} MB free / {total:.0f} MB total", flush=True)

MODEL_DIR = os.path.join(os.path.dirname(__file__), "models", "Qwen3-ASR-0.6B")
AUDIO_FILE = os.path.join(os.path.dirname(__file__), "models", "audios", "audio1_hello.mp3")
SAFETENSORS = os.path.join(MODEL_DIR, "model.safetensors")

print("\n[1] Creating model skeleton on meta device...", flush=True)
t0 = time.time()
from transformers import AutoConfig, AutoProcessor
from qwen_asr import Qwen3ASRModel
from qwen_asr.core.transformers_backend.modeling_qwen3_asr import Qwen3ASRForConditionalGeneration

config = AutoConfig.from_pretrained(MODEL_DIR, trust_remote_code=True)
with torch.device("meta"):
    meta_model = Qwen3ASRForConditionalGeneration(config)
print(f"    Meta model created in {time.time() - t0:.1f}s", flush=True)
free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print("\n[2] Loading weights from safetensors directly to CUDA (fp16)...", flush=True)
t0 = time.time()
state_dict = load_file(SAFETENSORS, device="cuda")
print(f"    Loaded state_dict to CUDA in {time.time() - t0:.1f}s", flush=True)
free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print("\n[3] Assigning weights to model...", flush=True)
t0 = time.time()
missing, unexpected = meta_model.load_state_dict(state_dict, strict=False, assign=True)
del state_dict
gc.collect()
LIBC.malloc_trim(0)
print(f"    Assigned in {time.time() - t0:.1f}s", flush=True)
if missing:
    print(f"    Missing keys: {len(missing)}", flush=True)
if unexpected:
    print(f"    Unexpected keys: {len(unexpected)}", flush=True)
meta_model = meta_model.half()
free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print("\n[4] Creating ASR wrapper...", flush=True)
processor = AutoProcessor.from_pretrained(MODEL_DIR, fix_mistral_regex=True)
model = Qwen3ASRModel(
    backend="transformers",
    model=meta_model,
    processor=processor,
    sampling_params=None,
    forced_aligner=None,
)
free, _ = mem_free()
print(f"    GPU free: {free:.0f} MB", flush=True)

print(f"\n[5] Transcribing: {os.path.basename(AUDIO_FILE)}", flush=True)
t0 = time.time()
results = model.transcribe(audio=AUDIO_FILE, language="English")
elapsed = time.time() - t0
text = results[0].text.strip() if results else "(empty)"
print(f"    Cold: {elapsed:.2f}s", flush=True)
print(f'    Text: "{text}"', flush=True)

print("\n[6] Warm run...", flush=True)
t0 = time.time()
results = model.transcribe(audio=AUDIO_FILE, language="English")
elapsed = time.time() - t0
text = results[0].text.strip() if results else "(empty)"
print(f"    Warm: {elapsed:.2f}s", flush=True)
print(f'    Text: "{text}"', flush=True)

free, _ = mem_free()
print(f"\nGPU free at end: {free:.0f} MB", flush=True)
print("=" * 60, flush=True)
print("DONE — GPU inference works!", flush=True)
```
