Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads
Updated to the latest Docker image for vLLM: nvcr.io/nvidia/vllm:25.12-py3
Yes, gpt-oss support is still broken in vLLM for Blackwell, even in the main branch.
The FlashInfer path doesn’t work, and Marlin is slow. I managed to get the Triton backend working, but it’s about the same speed as Marlin, so there’s no reason to bother.
I JUST got it working in podman after a bunch of failed attempts. Had to do some gymnastics with my startup script, too. In any case, thanks for the jump start!
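For anyone else fighting with Podman: here’s a rough sketch of the kind of launch command that tends to work for the NGC vLLM image. This is an assumption-heavy example, not my exact script — the CDI device flag requires the NVIDIA Container Toolkit CDI spec to be generated first, and the model path, port, and `vllm serve` arguments are placeholders you’d swap for your own.

```shell
# Sketch only — assumes you have already generated the CDI spec, e.g.:
#   sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Model path and port below are illustrative placeholders.
podman run --rm -it \
  --device nvidia.com/gpu=all \        # expose GPUs via CDI (Podman's rootless-friendly path)
  --security-opt label=disable \       # needed on SELinux hosts so the container can see the devices
  --ipc=host \                         # vLLM uses shared memory between worker processes
  -p 8000:8000 \
  -v "$HOME/models:/models:Z" \        # :Z relabels the volume for SELinux
  nvcr.io/nvidia/vllm:25.12-py3 \
  vllm serve /models/my-model --host 0.0.0.0 --port 8000
```

The main "gymnastics" in my experience are the CDI generation step and the SELinux flags — plain `--gpus all` from Docker doesn’t carry over to Podman.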