Orin Nano Qwen3-VL-4B

Looking to run Qwen3-VL-4B with the Orin Nano.

Anyone get it running?

I created a new conda environment:

I installed torch, torchvision, and torchaudio from https://pypi.jetson-ai-lab.io/jp6/cu126

Installed latest transformers with pip install git+https://github.com/huggingface/transformers --index-url https://pypi.jetson-ai-lab.io/jp6/cu126
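For reference, the setup steps above might look like this as a single shell session (a sketch: the environment name and Python version are my assumptions; the index URL and package list are from the post):

```shell
# Hypothetical environment setup for Qwen3-VL on JetPack 6 / CUDA 12.6.
conda create -n qwen3vl python=3.10 -y
conda activate qwen3vl

# Jetson-specific wheels from the jetson-ai-lab index
pip install torch torchvision torchaudio \
    --index-url https://pypi.jetson-ai-lab.io/jp6/cu126

# Latest transformers from source, with the Jetson index available for dependencies
pip install git+https://github.com/huggingface/transformers \
    --extra-index-url https://pypi.jetson-ai-lab.io/jp6/cu126
```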

There’s an unsloth version that’s 4-bit Quantized, but I haven’t been able to get that working either.
unsloth/Qwen3-VL-4B-Instruct-unsloth-bnb-4bit · Hugging Face

Any ideas would be appreciated! ❤️

# -*- coding: utf-8 -*-

import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

import os
os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'

def prepare_inputs_for_vllm(messages, processor):
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # qwen_vl_utils 0.0.14+ required
    image_inputs, video_inputs, video_kwargs = process_vision_info(
        messages,
        image_patch_size=processor.image_processor.patch_size,
        return_video_kwargs=True,
        return_video_metadata=True
    )
    print(f"video_kwargs: {video_kwargs}")

    mm_data = {}
    if image_inputs is not None:
        mm_data['image'] = image_inputs
    if video_inputs is not None:
        mm_data['video'] = video_inputs

    return {
        'prompt': text,
        'multi_modal_data': mm_data,
        'mm_processor_kwargs': video_kwargs
    }

if __name__ == '__main__':
    # messages = [
    #     {
    #         "role": "user",
    #         "content": [
    #             {
    #                 "type": "video",
    #                 "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
    #             },
    #             {"type": "text", "text": "How long is this video?"},
    #         ],
    #     }
    # ]

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-VL/receipt.png",
                },
                {"type": "text", "text": "Read all the text in the image."},
            ],
        }
    ]

    # TODO: change to your own checkpoint path
    checkpoint_path = "Qwen/Qwen3-VL-4B-Instruct-FP8"
    processor = AutoProcessor.from_pretrained(checkpoint_path)
    inputs = [prepare_inputs_for_vllm(message, processor) for message in [messages]]

    llm = LLM(
        model=checkpoint_path,
        trust_remote_code=True,
        gpu_memory_utilization=0.70,
        enforce_eager=False,
        tensor_parallel_size=torch.cuda.device_count(),
        seed=0
    )

    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=1024,
        top_k=-1,
        stop_token_ids=[],
    )

    for i, input_ in enumerate(inputs):
        print()
        print('=' * 40)
        print(f"Inputs[{i}]: {input_['prompt']=!r}")
    print('\n' + '>' * 40)

    outputs = llm.generate(inputs, sampling_params=sampling_params)
    for i, output in enumerate(outputs):
        generated_text = output.outputs[0].text
        print()
        print('=' * 40)
        print(f"Generated text: {generated_text!r}")

Hi,

We haven't tried Qwen3-VL-4B, but we have tested Qwen2.5-VL-3B on Orin Nano, and it works correctly.
Please note that you will need to apply the memory optimization mentioned in the link below:

Please find our detailed setup (container, parameters, and commands) for running Qwen2.5-VL-3B in the link below:
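For context, the memory optimizations commonly recommended for Jetson boards look roughly like this (a sketch of standard practice, not necessarily what the linked topic describes; the swapfile path and size are hypothetical):

```shell
# Disable zram (compressed RAM swap), which competes for the same physical memory
sudo systemctl disable nvzramconfig

# Create and enable a disk-backed swapfile instead (path/size are examples)
sudo fallocate -l 16G /mnt/16GB.swap
sudo chmod 600 /mnt/16GB.swap
sudo mkswap /mnt/16GB.swap
sudo swapon /mnt/16GB.swap

# Boot to console instead of the desktop GUI to free additional RAM
sudo systemctl set-default multi-user.target
```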

Thanks.

Thank you, I will poke around and see if I can get it running.

Not 4B, but so far I've managed to get 2B (FP16) running with Transformers after a fresh SD card install. I'm still trying to get my SSD flashed, but this works in the meantime. It uses nearly all the RAM, but 4B should work with 4-bit quants once I get that working.
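A quick back-of-envelope calculation shows why 2B at FP16 nearly fills an 8 GB Orin Nano while 4B should fit with 4-bit quantization (weight-only estimate; KV cache, activations, and the OS need headroom on top):

```python
def weight_gib(n_params_billions: float, bits_per_param: float) -> float:
    """Approximate weight-only memory in GiB (ignores KV cache and activations)."""
    return n_params_billions * 1e9 * bits_per_param / 8 / 2**30

# 2B at FP16 (~16 bits/param): roughly 3.7 GiB of weights alone
print(round(weight_gib(2, 16), 1))   # -> 3.7
# 4B at FP16 would be ~7.5 GiB -- over budget on an 8 GB board with the OS loaded
print(round(weight_gib(4, 16), 1))   # -> 7.5
# 4B at 4-bit: ~1.9 GiB of weights, leaving headroom for KV cache
print(round(weight_gib(4, 4), 1))    # -> 1.9
```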

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# default: Load the model on the available device(s)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct", dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Running the above Python script, I get this error, but the model still produces output. Is there anything I should be concerned about or adjust?

NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
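The "error 12" in those NvMap messages likely corresponds to the standard Linux errno 12, ENOMEM ("Cannot allocate memory"), which would be consistent with the model using nearly all the RAM. A quick way to check what an errno value means:

```python
import errno
import os

# errno 12 on Linux is ENOMEM
print(errno.errorcode[12])   # -> 'ENOMEM'
print(os.strerror(12))       # human-readable description
```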

Hey, have you had any more luck since? My nano just came in, so as soon as I can figure out how to boot from SSD, I plan to try the Qwen3 family, including VL.

Qwen3-VL support was just added to llama.cpp yesterday, so that would probably be the path of least resistance. You'll probably have to build it from source, but that's easy.
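Building and running it might look like this (a sketch: the GGUF and mmproj file names are assumptions; check Hugging Face for actual Qwen3-VL GGUF conversions):

```shell
# Build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j4

# Multimodal inference needs both the main GGUF and the mmproj (vision) file
./build/bin/llama-mtmd-cli \
    -m Qwen3-VL-4B-Instruct-Q4_K_M.gguf \
    --mmproj mmproj-Qwen3-VL-4B-Instruct-f16.gguf \
    --image receipt.png \
    -p "Read all the text in the image."
```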

Hi,

NvMapMemAllocInternalTagged: 1075072515 error 12

This is a known issue on r36.4.7, which reports the same error message above.
You can find more details in the topic below:

Currently, we are still working on the issue internally.
Will let you know once we have any new updates for this.

Thanks.

Hi,

We have fixed the memory issue internally.
Please check the topic shared above for more information.

Thanks.
