Hello! I am tinkering with vLLM and gpt-oss-120b. I am using @eugr's community build https://github.com/eugr/spark-vllm-docker/ (latest) and https://github.com/eugr/llama-benchy for this purpose.
When I run the Docker image with the `--reasoning-parser=openai_gptoss` switch, vLLM returns a BadRequestError:
```shell
docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network=host --ipc=host \
  --shm-size 64g \
  -v "$HOME/models/gpt-oss-120b:/model" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "$HOME/tiktoken_encodings:/tiktoken_encodings" \
  -e TIKTOKEN_ENCODINGS_BASE=/tiktoken_encodings \
  vllm-node \
  bash -c -i "
    vllm serve \
      --served-model-name "openai/gpt-oss-120b" \
      --host 0.0.0.0 --port 8000 \
      --gpu-memory-utilization 0.7 \
      --load-format fastsafetensors \
      --reasoning-parser=openai_gptoss \
      --enable-prefix-caching \
      --enable-auto-tool-choice \
      --tool-call-parser openai "
```
```json
{
  "error": {
    "message": "gpt-oss has a special branch for parsing reasoning in non-streaming mode. This method shouldn't be used.",
    "type": "BadRequestError",
    "param": null,
    "code": 400
  }
}
```
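For reference, here is a minimal sketch of the kind of request that hits this error. The endpoint URL and model name are the ones from my setup; the only difference between the two payloads is the `stream` flag. The error message suggests the parser's non-streaming entry point was invoked even though gpt-oss handles non-streaming reasoning on a dedicated branch — my (unverified) assumption is that a streaming request takes a different code path:

```python
import json

BASE_URL = "http://spark-a1ab.local:8000/v1"  # from my benchmark run below
MODEL = "openai/gpt-oss-120b"

def chat_payload(stream: bool) -> dict:
    """Build a Chat Completions payload; only `stream` differs between the cases."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": stream,
    }

# Non-streaming request: this is the shape that triggers the 400 above.
non_streaming = chat_payload(stream=False)

# Streaming request: my assumption is that this uses the streaming reasoning
# parser instead and may not hit the error; I have not confirmed this.
streaming = chat_payload(stream=True)

print(json.dumps(non_streaming, indent=2))
```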
When I run the same Docker image without that switch, gpt-oss-120b is served normally and llama-benchy records the following performance:
```
llama-benchy (0.1.1)
Date: 2026-01-23 08:06:36
Benchmarking model: openai/gpt-oss-120b at http://spark-a1ab.local:8000/v1
Loading text from cache: /home/adg/.cache/llama-benchy/f88f98465dba5c34bf03e8a31393fea9.txt
Total tokens available in text corpus: 192160
Warming up...
Warmup (User only) complete. Delta: 8 tokens (Server: 29, Local: 21)
Warmup (System+Empty) complete. Delta: 13 tokens (Server: 34, Local: 21)
Measuring latency using mode: generation...
Average latency (generation): 21.78 ms
Running test: pp=2048, tg=32, depth=0
Running test: pp=2048, tg=32, depth=4096
Running test: pp=2048, tg=32, depth=8192
Running test: pp=2048, tg=32, depth=16384
Running test: pp=2048, tg=32, depth=32768
```
| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------------------|----------------:|------------------:|----------------:|----------------:|----------------:|
| openai/gpt-oss-120b | pp2048 | 44671.30 ± 53.85 | 66.65 ± 0.16 | 44.87 ± 0.16 | 66.69 ± 0.15 |
| openai/gpt-oss-120b | tg32 | 108.82 ± 0.28 | | | |
| openai/gpt-oss-120b | pp2048 @ d4096 | 30216.38 ± 102.26 | 223.14 ± 2.03 | 201.35 ± 2.03 | 223.18 ± 2.03 |
| openai/gpt-oss-120b | tg32 @ d4096 | 89.76 ± 0.46 | | | |
| openai/gpt-oss-120b | pp2048 @ d8192 | 24934.57 ± 75.09 | 424.99 ± 0.82 | 403.20 ± 0.82 | 425.04 ± 0.83 |
| openai/gpt-oss-120b | tg32 @ d8192 | 76.39 ± 0.03 | | | |
| openai/gpt-oss-120b | pp2048 @ d16384 | 18704.14 ± 59.24 | 989.74 ± 8.26 | 967.96 ± 8.26 | 989.79 ± 8.27 |
| openai/gpt-oss-120b | tg32 @ d16384 | 59.16 ± 0.19 | | | |
| openai/gpt-oss-120b | pp2048 @ d32768 | 12543.98 ± 44.95 | 2755.51 ± 11.09 | 2733.73 ± 11.09 | 2755.59 ± 11.08 |
| openai/gpt-oss-120b | tg32 @ d32768 | 40.64 ± 0.05 | | | |
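As a back-of-the-envelope sanity check on the table, the reported `est_ppt` is consistent with the prefill throughput once the depth tokens are counted as part of the prompt. This assumes llama-benchy's prefill `t/s` is measured over `pp + depth` tokens, which is my reading of the numbers, not something I verified in its source:

```python
# Rows from the table above: (pp_tokens, depth_tokens, prefill_tps, est_ppt_ms)
rows = [
    (2048,     0, 44671.30,   44.87),
    (2048,  4096, 30216.38,  201.35),
    (2048,  8192, 24934.57,  403.20),
    (2048, 16384, 18704.14,  967.96),
    (2048, 32768, 12543.98, 2733.73),
]

for pp, depth, tps, est_ppt in rows:
    # Predicted prompt-processing time if t/s covers pp + depth tokens.
    predicted_ms = (pp + depth) / tps * 1000.0
    rel_err = abs(predicted_ms - est_ppt) / est_ppt
    print(f"depth={depth:>5}: predicted {predicted_ms:8.2f} ms "
          f"vs reported {est_ppt:8.2f} ms ({rel_err:.1%} off)")
```

All rows agree to within a few percent, so the throughput and latency columns tell a consistent story.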
In the absence of a reasoning parser, the model behaves like a "standard" language model, treating everything it generates, chain-of-thought and answer alike, as the main response content. Arguably the chain-of-thought tokens are indeed part of the generated tokens, so I regard the benchmark above as representative of the effective inference performance achievable on our Sparks.
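To illustrate what a reasoning parser would change: gpt-oss emits its output on Harmony "channels" (`analysis` for the chain of thought, `final` for the user-visible answer), and a parser splits these into separate `reasoning_content` and `content` fields instead of returning one combined string. A toy sketch of that split — the marker syntax below follows the Harmony format as I understand it, and this is emphatically not vLLM's actual parser:

```python
import re

# Toy Harmony-style transcript: analysis channel (chain of thought) followed
# by the final channel (answer). Marker syntax is illustrative.
raw = (
    "<|channel|>analysis<|message|>Let me think about 2+2...<|end|>"
    "<|channel|>final<|message|>2 + 2 = 4<|end|>"
)

def split_channels(text: str) -> dict:
    """Collect message text per channel from a Harmony-style transcript."""
    channels = {}
    for name, body in re.findall(
        r"<\|channel\|>(\w+)<\|message\|>(.*?)<\|end\|>", text, re.S
    ):
        channels[name] = channels.get(name, "") + body
    return channels

parsed = split_channels(raw)
reasoning_content = parsed.get("analysis", "")  # exposed separately by a parser
content = parsed.get("final", "")               # what the client sees as the answer
print(content)
```

Without the parser, both channels land in `content`, which is why the tg numbers above cover all generated tokens.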
That being said, I would like to get the reasoning-parser path in good order. Can someone enlighten me as to why I am getting the 400 BadRequestError when running vllm-node with that switch, please?
Thank you! :-)