Support for openai_gptoss reasoning parser in vLLM, and its impact on the effective inference performance on Spark

Hello! I am tinkering with vLLM and gpt-oss-120b, using @eugr's community build https://github.com/eugr/spark-vllm-docker/ (latest) and https://github.com/eugr/llama-benchy for this purpose.

When I run the Docker image with the --reasoning-parser=openai_gptoss switch, vLLM returns a BadRequestError:

docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network=host --ipc=host \
  --shm-size 64g \
  -v "$HOME/models/gpt-oss-120b:/model" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "$HOME/tiktoken_encodings:/tiktoken_encodings" \
  -e TIKTOKEN_ENCODINGS_BASE=/tiktoken_encodings \
  vllm-node \
  bash -c -i "
  vllm serve \
    --served-model-name openai/gpt-oss-120b \
    --host 0.0.0.0 --port 8000 \
    --gpu-memory-utilization 0.7 \
    --load-format fastsafetensors \
    --reasoning-parser=openai_gptoss \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser openai "
{
  "error": {
    "message": "gpt-oss has a special branch for parsing reasoning in non-streaming mode. This method shouldn't be used.",
    "type": "BadRequestError",
    "param": null,
    "code": 400
  }
}

When I run the same Docker image without that switch, gpt-oss-120b is served normally and llama-benchy records the following performance:

llama-benchy (0.1.1)
Date: 2026-01-23 08:06:36
Benchmarking model: openai/gpt-oss-120b at http://spark-a1ab.local:8000/v1
Loading text from cache: /home/adg/.cache/llama-benchy/f88f98465dba5c34bf03e8a31393fea9.txt
Total tokens available in text corpus: 192160
Warming up...
Warmup (User only) complete. Delta: 8 tokens (Server: 29, Local: 21)
Warmup (System+Empty) complete. Delta: 13 tokens (Server: 34, Local: 21)
Measuring latency using mode: generation...
Average latency (generation): 21.78 ms
Running test: pp=2048, tg=32, depth=0
Running test: pp=2048, tg=32, depth=4096
Running test: pp=2048, tg=32, depth=8192
Running test: pp=2048, tg=32, depth=16384
Running test: pp=2048, tg=32, depth=32768

| model               |            test |               t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:--------------------|----------------:|------------------:|----------------:|----------------:|----------------:|
| openai/gpt-oss-120b |          pp2048 |  44671.30 ± 53.85 |    66.65 ± 0.16 |    44.87 ± 0.16 |    66.69 ± 0.15 |
| openai/gpt-oss-120b |            tg32 |     108.82 ± 0.28 |                 |                 |                 |
| openai/gpt-oss-120b |  pp2048 @ d4096 | 30216.38 ± 102.26 |   223.14 ± 2.03 |   201.35 ± 2.03 |   223.18 ± 2.03 |
| openai/gpt-oss-120b |    tg32 @ d4096 |      89.76 ± 0.46 |                 |                 |                 |
| openai/gpt-oss-120b |  pp2048 @ d8192 |  24934.57 ± 75.09 |   424.99 ± 0.82 |   403.20 ± 0.82 |   425.04 ± 0.83 |
| openai/gpt-oss-120b |    tg32 @ d8192 |      76.39 ± 0.03 |                 |                 |                 |
| openai/gpt-oss-120b | pp2048 @ d16384 |  18704.14 ± 59.24 |   989.74 ± 8.26 |   967.96 ± 8.26 |   989.79 ± 8.27 |
| openai/gpt-oss-120b |   tg32 @ d16384 |      59.16 ± 0.19 |                 |                 |                 |
| openai/gpt-oss-120b | pp2048 @ d32768 |  12543.98 ± 44.95 | 2755.51 ± 11.09 | 2733.73 ± 11.09 | 2755.59 ± 11.08 |
| openai/gpt-oss-120b |   tg32 @ d32768 |      40.64 ± 0.05 |                 |                 |                 |

llama-benchy (0.1.1)
date: 2026-01-23 08:06:36 | latency mode: generation
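
As a sanity check on the table (this is my reading of the columns, not llama-benchy's documented definition): est_ppt appears to be ttfr minus the 21.78 ms average generation latency measured above, and the numbers agree to within rounding on every row:

```python
# Sanity-checking the derived column: est_ppt looks like ttfr minus the
# average single-token generation latency measured in the warmup phase.
# (ttfr_ms, reported_est_ppt_ms) pairs copied from the benchmark output.
LATENCY_MS = 21.78
rows = [
    ("pp2048",           66.65,   44.87),
    ("pp2048 @ d4096",  223.14,  201.35),
    ("pp2048 @ d8192",  424.99,  403.20),
    ("pp2048 @ d16384", 989.74,  967.96),
    ("pp2048 @ d32768", 2755.51, 2733.73),
]

def est_ppt(ttfr_ms, latency_ms=LATENCY_MS):
    """Estimated prompt-processing time: time-to-first-response minus
    the measured per-token generation latency (my interpretation)."""
    return ttfr_ms - latency_ms

for name, ttfr, reported in rows:
    # agree to within rounding of the printed column
    assert abs(est_ppt(ttfr) - reported) < 0.02, name
```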

In the absence of a reasoning parser, the model behaves like a "standard" language model, treating everything it generates, chain-of-thought included, as the main response content. Arguably the chain-of-thought tokens are indeed part of the generated tokens, so I regard the benchmark above as representative of the effective inference performance achievable on our Sparks.
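
For context on what the parser would change: gpt-oss emits harmony-formatted output with separate analysis (chain-of-thought) and final channels, and the reasoning parser's job is to route the former into reasoning_content instead of content. A toy sketch of that routing, using the harmony channel markers as I understand them (the real token stream has more framing than this):

```python
import re

# Toy harmony-style transcript; marker names follow the published gpt-oss
# harmony format as I understand it -- this is illustrative only.
raw = ("<|channel|>analysis<|message|>User asks 2+2; that is 4.<|end|>"
       "<|start|>assistant<|channel|>final<|message|>2 + 2 = 4.")

def split_channels(text):
    """Map each harmony channel name to its concatenated message payloads."""
    out = {}
    pattern = r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|end\|>|<\|start\|>|$)"
    for channel, body in re.findall(pattern, text, flags=re.S):
        out[channel] = out.get(channel, "") + body
    return out

parsed = split_channels(raw)
# Without a reasoning parser, both payloads land in message.content;
# with one, only the "final" channel should reach content.
reasoning, final = parsed.get("analysis"), parsed.get("final")
```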

That said, I would like to have the reasoning-parser part in good order. Can someone enlighten me as to why I am getting the 400 BadRequestError when running vllm-node with that switch, please?

Thank you! :-)

I am investigating whether this may be a manifestation of vLLM issue #26480, which is tentatively fixed on master: https://github.com/vllm-project/vllm/issues/26480

When did you last build the container?
I just ran with yesterday's build, and everything works fine.

BTW, with this Docker build, you don’t have to map tiktoken encodings - they are embedded in the container, so you can drop the volume and the environment variable (it is also set automatically).

Thank you for your reply and the heads up about tiktoken encodings, well noted and reflected in the updated launch script.

I last built the container 3 days ago.
I have tried other Docker images too, but the --reasoning-parser switch results in erratic behaviour. More specifically, with this container and others based on transformers 4.xx I get a 400 error, while with the latest vLLM and transformers 5 the API server returns a 500 Internal Server Error instead. The diagnostic message is the same:

"gpt-oss has a special branch for parsing reasoning in non-streaming mode. This method shouldn't be used."

I welcome your thoughts! Many thanks :-)

Just rebuild the container again - looks like there was a bug in vLLM that has been fixed since.

./build-and-copy.sh -t vllm-node-20260123-whl --use-wheels --pre-flashinfer --rebuild-vllm --rebuild-deps -c

Then run:

docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network host \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-node-20260123-whl \
  bash -c -i "vllm serve \
    openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8888 \
    --enable-auto-tool-choice \
    --tool-call-parser=openai \
    --reasoning-parser=openai_gptoss \
    --gpu-memory-utilization 0.70 \
    --enable-prefix-caching \
    --load-format fastsafetensors"

I’ve just tested it and it works just fine.

I had to add --no-cache to the docker build to fix a manylinux platform tag issue:

$ ./build-and-copy.sh -t vllm-node-20260123-whl --use-wheels --pre-flashinfer --rebuild-vllm --rebuild-deps
Using pre-built vLLM wheels (mode: nightly)
Setting CACHEBUST_DEPS…
Setting CACHEBUST_VLLM…
Using pre-release FlashInfer…
Building image with command: docker build -t vllm-node-20260123-whl -f Dockerfile.wheels --build-arg CACHEBUST_DEPS=1769202636 --build-arg CACHEBUST_VLLM=1769202636 --build-arg TRITON_REF=v3.5.1 --build-arg VLLM_REF=main --build-arg BUILD_JOBS=16 --build-arg FLASHINFER_PRE=--pre .
[+] Building 2.1s (13/19) docker:default
=> [internal] load build definition from Dockerfile.wheels 0.0s
=> => transferring dockerfile: 4.08kB 0.0s
=> resolve image config for docker-image://docker.io/docker/dockerfile:1.6 0.4s
=> CACHED docker-image://docker.io/docker/dockerfile:1.6@sha256:ac85f380a63b13dfcefa89046420e1781752bab202122f8f50032edf31be0021 0.0s
=> [internal] load metadata for docker.io/nvidia/cuda:13.1.0-devel-ubuntu24.04 0.7s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [stage-0 1/13] FROM docker.io/nvidia/cuda:13.1.0-devel-ubuntu24.04@sha256:7f32ae6e575abb29f2dacf6c75fe94a262bb48dcc5196ac833ced59d9fde8107 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 82B 0.0s
=> CACHED [stage-0 2/13] RUN apt update && apt upgrade -y && apt install -y --allow-change-held-packages --no-install-recommends python3 python3-pip python3-dev vim curl git wget jq libcudnn9-cuda-13 libnccl-dev libnccl2 libibverbs1 libibverbs-dev rdma-core 0.0s
=> CACHED [stage-0 3/13] WORKDIR /workspace/vllm 0.0s
=> CACHED [stage-0 4/13] RUN mkdir -p tiktoken_encodings && wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken" && wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.wind 0.0s
=> CACHED [stage-0 5/13] COPY fastsafetensors.patch . 0.0s
=> CACHED [stage-0 6/13] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv uv pip install -U fastsafetensors 0.0s
=> ERROR [stage-0 7/13] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv if [ "0" = "1" ]; then export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//') && uv pip install -U http 0.8s

[stage-0 7/13] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv if [ "0" = "1" ]; then export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//') && uv pip install -U https://github.com
/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu130-cp38-abi3-manylinux_2_35_aarch64.whl --torch-backend=auto; else uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130; fi:
0.317 Using Python 3.12.3 environment at: /usr
0.762 × No solution found when resolving dependencies:
0.762 ╰─▶ Because only vllm==0.14.0rc2.dev290+g586a57ad7.cu130 is available and
0.762 vllm==0.14.0rc2.dev290+g586a57ad7.cu130 has no wheels with a matching
0.762 platform tag (e.g., manylinux_2_39_aarch64), we can conclude that all
0.762 versions of vllm cannot be used.
0.762 And because you require vllm, we can conclude that your requirements
0.762 are unsatisfiable.
0.762
0.762 hint: vllm was requested with a pre-release marker (e.g., all of:
0.762 vllm<0.14.0rc2.dev290+g586a57ad7.cu130
0.762 vllm>0.14.0rc2.dev290+g586a57ad7.cu130
0.762 ), but pre-releases weren't enabled (try: --prerelease=allow)
0.762
0.762 hint: vllm was found on https://wheels.vllm.ai/nightly/cu130, but not
0.762 at the requested version (all of:
0.762 vllm<0.14.0rc2.dev290+g586a57ad7.cu130
0.762 vllm>0.14.0rc2.dev290+g586a57ad7.cu130
0.762 ). A compatible version may be available on a subsequent index (e.g.,
0.762 Simple index). By default, uv will only consider versions
0.762 that are published on the first index that contains a given package, to
0.762 avoid dependency confusion attacks. If all indexes are equally trusted,
0.762 use --index-strategy unsafe-best-match to consider all versions from
0.762 all indexes, regardless of the order in which they were defined.
0.762
0.762 hint: Wheels are available for vllm
0.762 (v0.14.0rc2.dev290+g586a57ad7.cu130) on the following platform:
0.762 manylinux_2_35_x86_64

Yeah, I was getting the same error, but now everything installs without any issues. No changes needed.

Thank you for your kind assistance on this matter. I am sorry it took me a few days to follow up; I spent them taking a sneak peek down the rabbit hole to work out the underlying reasons for the observed faults.

It turns out that there was no issue with the Docker image per se. I rebuilt it a dozen times, exploring the combinatorial space of command-line options, only to collect a range of glitches that puzzled me. I am sharing the story here in the hope it saves time for other community members.

When I first rebuilt the Docker image, vLLM failed to start, complaining about inconsistencies between flashinfer versions. Fine. I tried again with caching disabled, then switched from nightly builds to releases, then to the latest transformers, and combinations thereof. Still nothing: vLLM crashed while loading chunks of safetensors (JSONDecodeError). At that point I had to conclude the issue lay elsewhere.
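
In hindsight, that JSONDecodeError was a strong hint: a safetensors file starts with an 8-byte little-endian header length followed by that many bytes of JSON metadata, so a corrupted or truncated shard blows up exactly at the JSON parse. A minimal reader (not vLLM's actual loader) showing where that happens:

```python
import json
import struct

def read_safetensors_header(path):
    """Return the JSON metadata at the head of a safetensors file.

    The format begins with an unsigned 64-bit little-endian header size,
    followed by that many bytes of JSON. On a truncated or corrupted file,
    json.loads() is typically the first thing to fail -- hence the
    JSONDecodeError seen during model loading.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))
```

Running this over each shard of a cached model is a cheap smoke test before blaming the serving stack.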

I eventually ran an integrity check on my cached openai/gpt-oss-120b, and it failed miserably:

hf cache verify openai/gpt-oss-120b --revision b5c939de8f754692c1647ca79fbf85e8c1e70f8a

I then re-downloaded the affected files with:

hf download openai/gpt-oss-120b --revision b5c939de8f754692c1647ca79fbf85e8c1e70f8a --force-download

It strikes me that 'hf download openai/gpt-oss-120b' had initially reported the download as complete without alerting me to, let alone recovering from, the data corruption. It strikes me also that removing the --reasoning-parser switch from the 'vllm serve' command line allowed vLLM to stomach the corrupted model without complaint. The only hint, aside from the JSONDecodeError, was an erroneously resolved architecture (independently reported on GitHub too: https://github.com/vllm-project/vllm/issues/32857). This probably also explains the 109 t/s recorded with --reasoning-parser disabled, but maybe I will dig deeper into that when I have more time.
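
In case it helps others: the manual equivalent of such an integrity check is streaming each cached file through SHA-256 and comparing the digest with the "sha256:" LFS oid shown on the model's file page on the Hub. A sketch (the cache path in the comment is hypothetical):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Hypothetical usage: compare against the LFS oid from the HF file page, e.g.
# sha256_of(os.path.expanduser(
#     "~/.cache/huggingface/hub/.../model-00001-of-00015.safetensors"))
```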

Thanks again!

