Eugr/spark-vllm-docker mxfp4 build error

There’s a new --clean flag that @eugr added recently. You can also run docker builder prune from time to time.

I haven’t pushed it to GitHub yet - will push alongside other changes later today. I wouldn’t recommend deleting ~/.cache/huggingface unless you want to re-download all models again. To purge unused stuff from HF cache directories you can use uvx hf cache prune.

Hey, I ran some tests today (gpt-oss-120b, single Spark) and my results are around 20% better than the best on sparkarena. I don’t know why; maybe something got updated in the meantime:

The command:

sparkrun benchmark ./gpt-oss-120b-vllm.yaml --skip-run --hosts localhost --profile spark-arena-v1

The yaml file is:

recipe_version: '1'
name: OpenAI GPT-OSS 120B Solo
description: vLLM serving openai/gpt-oss-120b with MXFP4 quantization and FlashInfer
model: openai/gpt-oss-120b
runtime: vllm
container: sparkarena/spark-vllm-docker:mxfp4
defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.7
  max_num_batched_tokens: 8192
env:
  VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: '1'
command: |
  vllm serve openai/gpt-oss-120b \
    --enable-auto-tool-choice \
    --tool-call-parser openai \
    --reasoning-parser openai_gptoss \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-prefix-caching \
    --load-format fastsafetensors \
    --quantization mxfp4 \
    --mxfp4-backend CUTLASS \
    --mxfp4-layers moe,qkv,o,lm_head \
    --attention-backend FLASHINFER \
    --kv-cache-dtype fp8 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --host {host} \
    --port {port}
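For anyone reading the recipe: the {port}, {host}, {gpu_memory_utilization} and {max_num_batched_tokens} placeholders in the command presumably get filled from the defaults block. The sketch below only illustrates that idea with plain Python string formatting. It is an assumption about the mechanism, not sparkrun's actual code, and render_command is a hypothetical helper name.

```python
# Illustration only: assumes the recipe's {placeholders} are filled from the
# "defaults" block. This is NOT sparkrun's implementation.
defaults = {
    "port": 8000,
    "host": "0.0.0.0",
    "gpu_memory_utilization": 0.7,
    "max_num_batched_tokens": 8192,
}

# Shortened version of the recipe command, keeping only the templated flags.
command_template = (
    "vllm serve openai/gpt-oss-120b "
    "--gpu-memory-utilization {gpu_memory_utilization} "
    "--max-num-batched-tokens {max_num_batched_tokens} "
    "--host {host} --port {port}"
)


def render_command(template: str, values: dict) -> str:
    """Hypothetical helper: substitute {name} placeholders with recipe values."""
    return template.format(**values)


rendered = render_command(command_template, defaults)
print(rendered)
```

Overriding a single value, for example dict(defaults, gpu_memory_utilization=0.8), would then change only that flag in the rendered command.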

The results:

profilfromsparkarena__benchmark_gpt-oss-120b-vllm_spark-arena-v1_tp1.zip (8.0 KB)

Hello, I tried to find a more appropriate topic for general spark-vllm-docker discussion, but I probably missed it. Anyway, as I understand it (from the Dockerfile.mxfp4 in the repo), the mxfp4 build is not compatible with tf5.

I am wondering: since recent vLLM versions now use transformers 5, what difference does building with the --tf5 flag make? (I checked the regular build and it still had transformers 5 inside.)
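One way to check which transformers major version a given image actually ships is to run a few lines of Python inside the container. This is a minimal sketch using only the standard library; major_version and installed_major are hypothetical helper names, and it assumes the packages were installed with normal pip metadata.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading major number of a version string like '5.0.0rc1'."""
    return int(version_string.split(".")[0])


def installed_major(package: str):
    """Return the installed major version of a package, or None if it is absent."""
    try:
        return major_version(version(package))
    except PackageNotFoundError:
        return None


# Report what is present in the current environment (e.g. inside the container).
for pkg in ("transformers", "vllm"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```

Running it in both the regular image and the --tf5 image would show whether the two builds really differ in their transformers version.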

Also, as I understand it, mxfp4 depends on CUTLASS, FlashInfer, and the way they are used in the vLLM codebase, which prevents it from being “patched” into recent vLLM versions.

Is this perhaps connected with the mxfp4 issues?