Hi
We tested this on Thor with JetPack 7.1.
The model works without any issue. Please try it again:
$ sudo docker run -it --rm --pull always --runtime=nvidia --network host ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor vllm serve cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit
latest-jetson-thor: Pulling from nvidia-ai-iot/vllm
Digest: sha256:b587dd56b4cb076209ad5156a626ac75f5a976d0e8e7d1e6a9fccd56d1bd65e8
Status: Image is up to date for ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
/opt/venv/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299]
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299] █▄█▀ █ █ █ █ model cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299]
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:233] non-default args: {'model_tag': 'cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit', 'model': 'cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit'}
config.json: 7.53kB [00:00, 33.2MB/s]
preprocessor_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 753/753 [00:00<00:00, 10.4MB/s]
(APIServer pid=1) INFO 04-10 03:18:14 [model.py:549] Resolved architecture: Qwen3VLForConditionalGeneration
(APIServer pid=1) INFO 04-10 03:18:14 [model.py:1678] Using max model len 262144
(APIServer pid=1) INFO 04-10 03:18:15 [vllm.py:790] Asynchronous scheduling is enabled.
tokenizer_config.json: 5.45kB [00:00, 29.7MB/s]
vocab.json: 2.78MB [00:00, 29.0MB/s]
merges.txt: 1.67MB [00:00, 152MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 75.7MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 707/707 [00:00<00:00, 8.80MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 6.66MB/s]
chat_template.jinja: 5.29kB [00:00, 30.3MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 218/218 [00:00<00:00, 2.19MB/s]
video_preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 861/861 [00:00<00:00, 8.77MB/s]
/opt/venv/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
(EngineCore pid=101) INFO 04-10 03:18:46 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit', speculative_config=None, tokenizer='cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=101) INFO 04-10 03:18:50 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.173.99.143:54629 backend=nccl
(EngineCore pid=101) INFO 04-10 03:18:50 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=101) INFO 04-10 03:19:09 [gpu_model_runner.py:4735] Starting to load model cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit...
(EngineCore pid=101) INFO 04-10 03:19:10 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=101) INFO 04-10 03:19:10 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=101) INFO 04-10 03:19:10 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=101) INFO 04-10 03:19:10 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(EngineCore pid=101) INFO 04-10 03:19:11 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=101) INFO 04-10 03:19:11 [flash_attn.py:596] Using FlashAttention version 2
model.safetensors.index.json: 121kB [00:00, 313MB/s]
model-00002-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 2.55G/2.55G [01:49<00:00, 23.3MB/s]
model-00001-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [03:34<00:00, 23.3MB/s]
(EngineCore pid=101) INFO 04-10 03:22:49 [weight_utils.py:581] Time spent downloading weights for cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit: 215.894969 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:02<00:02, 2.46s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00, 1.66s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00, 1.78s/it]
(EngineCore pid=101)
(EngineCore pid=101) INFO 04-10 03:22:53 [default_loader.py:384] Loading weights took 3.70 seconds
(EngineCore pid=101) INFO 04-10 03:22:54 [gpu_model_runner.py:4820] Model loading took 7.37 GiB memory and 223.565475 seconds
(EngineCore pid=101) INFO 04-10 03:22:54 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=101) INFO 04-10 03:23:12 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/37ea329476/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=101) INFO 04-10 03:23:12 [backends.py:1111] Dynamo bytecode transform time: 9.87 s
(EngineCore pid=101) INFO 04-10 03:23:22 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=101) INFO 04-10 03:23:32 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 19.36 s
(EngineCore pid=101) INFO 04-10 03:23:34 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/7869b98bd6347361d90c877c245a07f5b60add6fc9cffafb3437e0c4d90b95b6/rank_0_0/model
(EngineCore pid=101) INFO 04-10 03:23:34 [monitor.py:48] torch.compile took 32.15 s in total
(EngineCore pid=101) INFO 04-10 03:24:00 [monitor.py:76] Initial profiling/warmup run took 25.68 s
(EngineCore pid=101) INFO 04-10 03:24:07 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=101) INFO 04-10 03:24:07 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=101) INFO 04-10 03:24:09 [gpu_model_runner.py:5955] Estimated CUDA graph memory: 0.08 GiB total
(EngineCore pid=101) INFO 04-10 03:24:09 [gpu_worker.py:436] Available KV cache memory: 95.19 GiB
(EngineCore pid=101) INFO 04-10 03:24:09 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9007 to maintain the same effective KV cache size.
(EngineCore pid=101) INFO 04-10 03:24:09 [kv_cache_utils.py:1319] GPU KV cache size: 693,168 tokens
(EngineCore pid=101) INFO 04-10 03:24:09 [kv_cache_utils.py:1324] Maximum concurrency for 262,144 tokens per request: 2.64x
(EngineCore pid=101) 2026-04-10 03:24:13,300 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=101) 2026-04-10 03:24:13,323 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████| 51/51 [00:05<00:00, 9.16it/s]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:02<00:00, 13.10it/s]
(EngineCore pid=101) INFO 04-10 03:24:22 [gpu_model_runner.py:6046] Graph capturing finished in 9 secs, took 0.01 GiB
(EngineCore pid=101) INFO 04-10 03:24:22 [gpu_worker.py:597] CUDA graph pool memory: 0.01 GiB (actual), 0.08 GiB (estimated), difference: 0.07 GiB (803.9%).
(EngineCore pid=101) INFO 04-10 03:24:22 [core.py:283] init engine (profile, create kv cache, warmup model) took 88.61 seconds
(APIServer pid=1) INFO 04-10 03:24:23 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=1) WARNING 04-10 03:24:24 [model.py:1435] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 04-10 03:24:35 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 04-10 03:24:47 [base.py:231] Multi-modal warmup completed in 11.244s
(APIServer pid=1) INFO 04-10 03:24:48 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
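Once the log shows "Application startup complete", you can verify the server from another terminal. Below is a minimal sketch against the OpenAI-compatible API, assuming the default port 8000 shown in the log above; the image URL is only a placeholder, replace it with your own:

$ curl http://localhost:8000/v1/models

$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit",
          "messages": [
            {"role": "user", "content": [
              {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
              {"type": "text", "text": "Describe this image."}
            ]}
          ]
        }'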
Thanks.