@eugr recently added a new --clean option. You can also run docker builder prune from time to time.
I haven't pushed it to GitHub yet; I'll push it alongside other changes later today. I wouldn't recommend deleting ~/.cache/huggingface unless you want to re-download all your models. To purge unused files from the HF cache directories, you can use uvx hf cache prune.
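Before pruning anything, it can help to see which cached models take the most space. A minimal sketch, assuming the default Hugging Face hub location ~/.cache/huggingface/hub (it moves if HF_HOME is set); the helper names dir_size_bytes and cache_usage are made up for this example:

```python
import os
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of regular files under root, skipping symlinks.

    Snapshot entries in the HF cache are symlinks into blobs/, so skipping
    them avoids counting the same file twice.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def cache_usage(hub_dir):
    """Return {entry name: size in bytes} for each directory in hub_dir."""
    hub = Path(hub_dir).expanduser()
    if not hub.is_dir():
        return {}
    return {
        entry.name: dir_size_bytes(entry)
        for entry in sorted(hub.iterdir())
        if entry.is_dir()
    }


if __name__ == "__main__":
    usage = cache_usage("~/.cache/huggingface/hub")
    # Largest entries first.
    for name, size in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size / 1e9:8.2f} GB  {name}")
```

This only reads the directory tree and deletes nothing, so it is safe to run before deciding what to prune.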
Hey, I ran some tests today (gpt-oss-120b, single Spark) and my results are around 20% better than the best on sparkarena. I don't know why; maybe something got updated in the meantime:
The command:
sparkrun benchmark ./gpt-oss-120b-vllm.yaml --skip-run --hosts localhost --profile spark-arena-v1
The YAML file is:
recipe_version: '1'
name: OpenAI GPT-OSS 120B Solo
description: vLLM serving openai/gpt-oss-120b with MXFP4 quantization and FlashInfer
model: openai/gpt-oss-120b
runtime: vllm
container: sparkarena/spark-vllm-docker:mxfp4
defaults:
  port: 8000
  host: 0.0.0.0
  gpu_memory_utilization: 0.7
  max_num_batched_tokens: 8192
env:
  VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: '1'
command: |
  vllm serve openai/gpt-oss-120b
    --enable-auto-tool-choice
    --tool-call-parser openai
    --reasoning-parser openai_gptoss
    --enable-auto-tool-choice
    --gpu-memory-utilization {gpu_memory_utilization}
    --enable-prefix-caching
    --load-format fastsafetensors
    --quantization mxfp4
    --mxfp4-backend CUTLASS
    --mxfp4-layers moe,qkv,o,lm_head
    --attention-backend FLASHINFER
    --kv-cache-dtype fp8
    --max-num-batched-tokens {max_num_batched_tokens}
    --host {host}
    --port {port}
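In case the {placeholders} in the command are unclear: as far as I can tell they are filled in from the defaults section. The sketch below only illustrates that idea with plain Python string formatting. It is an assumption about the mechanism, not sparkrun's actual code, and render_command is a made-up helper name.

```python
# Values copied from the recipe's "defaults" section.
defaults = {
    "port": 8000,
    "host": "0.0.0.0",
    "gpu_memory_utilization": 0.7,
    "max_num_batched_tokens": 8192,
}

# Shortened version of the recipe's command, keeping only the templated flags.
command_template = (
    "vllm serve openai/gpt-oss-120b "
    "--gpu-memory-utilization {gpu_memory_utilization} "
    "--max-num-batched-tokens {max_num_batched_tokens} "
    "--host {host} --port {port}"
)


def render_command(template, defaults, overrides=None):
    """Fill {name} placeholders, letting overrides win over defaults."""
    values = {**defaults, **(overrides or {})}
    return template.format(**values)


rendered = render_command(command_template, defaults)
print(rendered)
```

Passing an override such as {"port": 9000} as the third argument changes only that value and leaves the rest of the defaults in place.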
The results:
profilfromsparkarena__benchmark_gpt-oss-120b-vllm_spark-arena-v1_tp1.zip (8.0 KB)
Hello, I tried to find a more appropriate topic for general spark-vllm-docker discussion, but I probably missed it. Anyway, as I understand it (from Dockerfile.mxfp4 in the repo), the feature is not compatible with tf5 (transformers 5).
I am wondering: since recent vLLM versions now use transformers 5, what difference does building with the --tf5 flag make? (I checked the regular build and it still had transformers 5 inside.)
Also, as I understand it, mxfp4 actually depends on CUTLASS, FlashInfer, and the way they are used in the vLLM codebase, which prevents it from being "patched" into recent vLLM versions.
Is this perhaps connected to the mxfp4 issues?
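On the transformers question, one way to check what a given image actually ships is to ask Python for the installed package version from inside the container. A minimal sketch using only the standard library; the function names are made up for this example, and it would have to be run with the container's own Python interpreter:

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '5.0.0'."""
    return int(version_string.split(".")[0])


def installed_major(package="transformers"):
    """Return the installed major version of package, or None if absent."""
    try:
        return major_version(version(package))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("transformers major version:", installed_major())
```

Running the same check in an image built with --tf5 and in one built without it would show whether the flag changes the installed major version at all.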