Your GPU does not have native support for FP4 computation but FP4 quantization is being used

Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
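For context, a rough illustration of what that warning means: with weight-only FP4, weights are snapped to the 4-bit E2M1 value grid to save memory, but the matmul itself still runs at full precision, so compute-heavy workloads gain nothing. This is an illustrative numpy sketch with a single per-tensor scale; the real Marlin kernels use packed int4 storage and per-group scales:

```python
import numpy as np

# The eight non-negative magnitudes representable in E2M1 (FP4)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(w):
    # Per-tensor scale so the largest weight maps to the top of the grid
    scale = np.abs(w).max() / FP4_GRID[-1]
    mag = np.abs(w) / scale
    # Snap each magnitude to the nearest FP4 grid point
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(-1)
    return np.sign(w) * FP4_GRID[idx], scale

def weight_only_matmul(x, wq, scale):
    # Compute still happens at full precision: only the weights were
    # compressed, which saves memory bandwidth but not FLOPs
    return x @ (wq * scale)
```

This is only a sketch of the compression scheme, not the kernel itself; the point is that `weight_only_matmul` dequantizes before multiplying instead of using FP4 tensor cores.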

Updated to the latest Docker image for vLLM: nvcr.io/nvidia/vllm:25.12-py3
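For reference, one way to launch that image; the mount path and port are illustrative assumptions, adjust them to your setup:

```shell
# Illustrative launch command; --gpus all and --ipc=host are commonly
# needed for vLLM containers (shared memory for tensor transfers)
docker run --rm --gpus all --ipc=host \
  -p 8355:8355 \
  -v /huggingface_hub:/huggingface_hub \
  nvcr.io/nvidia/vllm:25.12-py3
```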

and ran gpt-oss-120b:

python3 -m vllm.entrypoints.openai.api_server \
  --model "/huggingface_hub/models/openai/gpt_oss_120b" \
  --host 0.0.0.0 \
  --port 8355 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype=auto \
  --async-scheduling

Why is it still falling back to the Marlin kernel instead of true FP4?
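One quick sanity check is the GPU's compute capability, since the warning fires when the hardware path is unavailable. A minimal sketch, under the assumption that native FP4 tensor cores start with Blackwell (SM 10.0); the `supports_native_fp4` helper is hypothetical, not a vLLM API:

```python
def supports_native_fp4(major: int, minor: int) -> bool:
    # Assumption: native NVFP4 tensor cores arrived with Blackwell (SM 10.0);
    # Hopper (9.0) and Ada (8.9) get weight-only fallbacks like Marlin instead
    return (major, minor) >= (10, 0)

try:
    import torch
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"SM {major}.{minor}: native FP4 = {supports_native_fp4(major, minor)}")
except ImportError:
    pass  # torch not installed; the helper above is still usable standalone
```

If this reports an SM below 10.0, the Marlin fallback is expected behavior rather than a bug.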


Yes, gpt-oss support is still broken in vLLM for Blackwell, even in the main branch.

The FlashInfer path doesn’t work, and Marlin is slow. I managed to get the Triton backend working, but it’s about the same speed as Marlin, so there’s no reason to switch.


Any plans on fixing this?

I kind of did… check the other thread.

Cool! I’ll have to digest the results of that thread. Sounds like you had to really hack into it.

I JUST got it working in podman after a bunch of failed attempts. Had to do some gymnastics with my startup script, too. In any case, thanks for the jump start!