Your GPU does not have native support for FP4 computation but FP4 quantization is being used

Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
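For context, a rough illustration of what that warning means: with weight-only FP4, weights are snapped to the 4-bit E2M1 value grid to save memory, but the matmul itself still runs at full precision, so compute-heavy workloads gain nothing. This is an illustrative numpy sketch with a single per-tensor scale; the real Marlin kernels use packed int4 storage and per-group scales:

```python
import numpy as np

# The eight non-negative magnitudes representable in E2M1 (FP4)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(w):
    # Per-tensor scale so the largest weight maps to the top of the grid
    scale = np.abs(w).max() / FP4_GRID[-1]
    mag = np.abs(w) / scale
    # Snap each magnitude to the nearest FP4 grid point
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(-1)
    return np.sign(w) * FP4_GRID[idx], scale

def weight_only_matmul(x, wq, scale):
    # Compute still happens at full precision: only the weights were
    # compressed, which saves memory bandwidth but not FLOPs
    return x @ (wq * scale)
```

This is only a sketch of the compression scheme, not the kernel itself; the point is that `weight_only_matmul` dequantizes before multiplying instead of using FP4 tensor cores.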

Updated to the latest Docker image for vLLM: nvcr.io/nvidia/vllm:25.12-py3
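For reference, one way to launch that image; the mount path and port are illustrative assumptions, adjust them to your setup:

```shell
# Illustrative launch command; --gpus all and --ipc=host are commonly
# needed for vLLM containers (shared memory for tensor transfers)
docker run --rm --gpus all --ipc=host \
  -p 8355:8355 \
  -v /huggingface_hub:/huggingface_hub \
  nvcr.io/nvidia/vllm:25.12-py3
```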

and ran gpt-oss-120b:

python3 -m vllm.entrypoints.openai.api_server \
  --model "/huggingface_hub/models/openai/gpt_oss_120b" \
  --host 0.0.0.0 \
  --port 8355 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype=auto \
  --async-scheduling

Why is it still falling back to the Marlin kernel instead of true FP4?
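One quick sanity check is the GPU's compute capability, since the warning fires when the hardware path is unavailable. A minimal sketch, under the assumption that native FP4 tensor cores start with Blackwell (SM 10.0); the `supports_native_fp4` helper is hypothetical, not a vLLM API:

```python
def supports_native_fp4(major: int, minor: int) -> bool:
    # Assumption: native NVFP4 tensor cores arrived with Blackwell (SM 10.0);
    # Hopper (9.0) and Ada (8.9) get weight-only fallbacks like Marlin instead
    return (major, minor) >= (10, 0)

try:
    import torch
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"SM {major}.{minor}: native FP4 = {supports_native_fp4(major, minor)}")
except ImportError:
    pass  # torch not installed; the helper above is still usable standalone
```

If this reports an SM below 10.0, the Marlin fallback is expected behavior rather than a bug.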


Yes, gpt-oss support is still broken in vLLM for Blackwell, even in the main branch.

The FlashInfer path doesn’t work, and Marlin is slow. I managed to get the Triton backend working, but it’s about the same speed as Marlin, so there’s no reason to switch.


Any plans on fixing this?

I kind of did… check the other thread.

Cool! I’ll have to digest the results of that thread. Sounds like you had to really hack into it.

I JUST got it working in podman after a bunch of failed attempts. Had to do some gymnastics with my startup script, too. In any case, thanks for the jump start!