Thought I'd share this with the community, for those still struggling to run gpt-oss-120b.
When using @eugr's community build, however, the recipe does not work as-is, because there are issues with the latest vLLM releases.
To run it correctly (though not optimally), you can use the recipe below. Only a couple of config changes are needed: the vLLM environment variable and the mxfp4 layers flag, both of which were causing me issues. With this you can also use the --no-ray option, which stops your CPU from spinning at 100%:
./run-recipe.sh openai-gpt-oss-120b --no-ray
# Recipe: OpenAI GPT-OSS 120B
# OpenAI's open-source 120B MoE model with MXFP4 quantization support
recipe_version: "1"
name: OpenAI GPT-OSS 120B
description: vLLM serving openai/gpt-oss-120b with MXFP4 quantization and FlashInfer
# HuggingFace model to download (optional, for --download-model)
model: openai/gpt-oss-120b
# Container image to use
container: vllm-node-mxfp4
# Build arguments for build-and-copy.sh
build_args:
  - '--exp-mxfp4'
# No mods required for this model
mods: []
# Default settings (can be overridden via CLI)
defaults:
  port: 3000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.70
  max_num_batched_tokens: 8192
# Environment variables to set in the container
env:
  VLLM_MXFP4_BACKEND: TRTLLM_MXFP
  # VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: "1"
# The vLLM serve command template
# Uses MXFP4 quantization for memory efficiency
command: |
  vllm serve openai/gpt-oss-120b \
    --tool-call-parser openai \
    --reasoning-parser openai_gptoss \
    --enable-auto-tool-choice \
    --tensor-parallel-size {tensor_parallel} \
    --distributed-executor-backend ray \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-prefix-caching \
    --load-format fastsafetensors \
    --quantization mxfp4 \
    --mxfp4-backend cutlass \
    --attention-backend FLASHINFER \
    --kv-cache-dtype fp8 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --host {host} \
    --port {port}
  # --mxfp4-layers moe,qkv,o,lm_head \
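
Once the server is up, you can sanity-check it through vLLM's OpenAI-compatible API. This is just a quick sketch, assuming the recipe defaults (port 3000) and that you're hitting the server from the same host:

# Confirm the model loaded and the server is ready
curl http://localhost:3000/v1/models

# Send a minimal chat completion request
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'

If the first call returns openai/gpt-oss-120b in the model list, the recipe worked and the second call should stream back a normal completion.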