Support for openai_gptoss reasoning parser in vLLM, and its impact on the effective inference performance on Spark

Hello! I am tinkering with vLLM and gpt-oss-120b, using @eugr's community build https://github.com/eugr/spark-vllm-docker/ (latest) and https://github.com/eugr/llama-benchy for this purpose.

When I run the Docker image with the --reasoning-parser=openai_gptoss switch, vLLM returns a BadRequestError:

docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network=host --ipc=host \
  --shm-size 64g \
  -v "$HOME/models/gpt-oss-120b:/model" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "$HOME/tiktoken_encodings:/tiktoken_encodings" \
  -e TIKTOKEN_ENCODINGS_BASE=/tiktoken_encodings \
  vllm-node \
  bash -c -i "
  vllm serve \
    --served-model-name openai/gpt-oss-120b \
    --host 0.0.0.0 --port 8000 \
    --gpu-memory-utilization 0.7 \
    --load-format fastsafetensors \
    --reasoning-parser=openai_gptoss \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser openai "
{
  "error": {
    "message": "gpt-oss has a special branch for parsing reasoning in non-streaming mode. This method shouldn't be used.",
    "type": "BadRequestError",
    "param": null,
    "code": 400
  }
}

When I run the same Docker image without that switch, gpt-oss-120b is served normally and llama-benchy records the following performance:

llama-benchy (0.1.1)
Date: 2026-01-23 08:06:36
Benchmarking model: openai/gpt-oss-120b at http://spark-a1ab.local:8000/v1
Loading text from cache: /home/adg/.cache/llama-benchy/f88f98465dba5c34bf03e8a31393fea9.txt
Total tokens available in text corpus: 192160
Warming up...
Warmup (User only) complete. Delta: 8 tokens (Server: 29, Local: 21)
Warmup (System+Empty) complete. Delta: 13 tokens (Server: 34, Local: 21)
Measuring latency using mode: generation...
Average latency (generation): 21.78 ms
Running test: pp=2048, tg=32, depth=0
Running test: pp=2048, tg=32, depth=4096
Running test: pp=2048, tg=32, depth=8192
Running test: pp=2048, tg=32, depth=16384
Running test: pp=2048, tg=32, depth=32768

| model               |            test |               t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:--------------------|----------------:|------------------:|----------------:|----------------:|----------------:|
| openai/gpt-oss-120b |          pp2048 |  44671.30 ± 53.85 |    66.65 ± 0.16 |    44.87 ± 0.16 |    66.69 ± 0.15 |
| openai/gpt-oss-120b |            tg32 |     108.82 ± 0.28 |                 |                 |                 |
| openai/gpt-oss-120b |  pp2048 @ d4096 | 30216.38 ± 102.26 |   223.14 ± 2.03 |   201.35 ± 2.03 |   223.18 ± 2.03 |
| openai/gpt-oss-120b |    tg32 @ d4096 |      89.76 ± 0.46 |                 |                 |                 |
| openai/gpt-oss-120b |  pp2048 @ d8192 |  24934.57 ± 75.09 |   424.99 ± 0.82 |   403.20 ± 0.82 |   425.04 ± 0.83 |
| openai/gpt-oss-120b |    tg32 @ d8192 |      76.39 ± 0.03 |                 |                 |                 |
| openai/gpt-oss-120b | pp2048 @ d16384 |  18704.14 ± 59.24 |   989.74 ± 8.26 |   967.96 ± 8.26 |   989.79 ± 8.27 |
| openai/gpt-oss-120b |   tg32 @ d16384 |      59.16 ± 0.19 |                 |                 |                 |
| openai/gpt-oss-120b | pp2048 @ d32768 |  12543.98 ± 44.95 | 2755.51 ± 11.09 | 2733.73 ± 11.09 | 2755.59 ± 11.08 |
| openai/gpt-oss-120b |   tg32 @ d32768 |      40.64 ± 0.05 |                 |                 |                 |

llama-benchy (0.1.1)
date: 2026-01-23 08:06:36 | latency mode: generation
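
As a sanity check on the table (this is my reading of the columns, not llama-benchy's documented definition): est_ppt appears to be ttfr minus the 21.78 ms average generation latency measured above, and the numbers agree to within rounding on every row:

```python
# Sanity-checking the derived column: est_ppt looks like ttfr minus the
# average single-token generation latency measured in the warmup phase.
# (ttfr_ms, reported_est_ppt_ms) pairs copied from the benchmark output.
LATENCY_MS = 21.78
rows = [
    ("pp2048",           66.65,   44.87),
    ("pp2048 @ d4096",  223.14,  201.35),
    ("pp2048 @ d8192",  424.99,  403.20),
    ("pp2048 @ d16384", 989.74,  967.96),
    ("pp2048 @ d32768", 2755.51, 2733.73),
]

def est_ppt(ttfr_ms, latency_ms=LATENCY_MS):
    """Estimated prompt-processing time: time-to-first-response minus
    the measured per-token generation latency (my interpretation)."""
    return ttfr_ms - latency_ms

for name, ttfr, reported in rows:
    # agree to within rounding of the printed column
    assert abs(est_ppt(ttfr) - reported) < 0.02, name
```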

In the absence of a reasoning parser, the model behaves like a "standard" language model, treating everything it generates, chain-of-thought included, as the main response content. Arguably the chain-of-thought tokens are indeed part of the generated tokens, so I regard the benchmark above as representative of the effective inference performance achievable on our Sparks.
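
For context on what the parser would change: gpt-oss emits harmony-formatted output with separate analysis (chain-of-thought) and final channels, and the reasoning parser's job is to route the former into reasoning_content instead of content. A toy sketch of that routing, using the harmony channel markers as I understand them (the real token stream has more framing than this):

```python
import re

# Toy harmony-style transcript; marker names follow the published gpt-oss
# harmony format as I understand it -- this is illustrative only.
raw = ("<|channel|>analysis<|message|>User asks 2+2; that is 4.<|end|>"
       "<|start|>assistant<|channel|>final<|message|>2 + 2 = 4.")

def split_channels(text):
    """Map each harmony channel name to its concatenated message payloads."""
    out = {}
    pattern = r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|end\|>|<\|start\|>|$)"
    for channel, body in re.findall(pattern, text, flags=re.S):
        out[channel] = out.get(channel, "") + body
    return out

parsed = split_channels(raw)
# Without a reasoning parser, both payloads land in message.content;
# with one, only the "final" channel should reach content.
reasoning, final = parsed.get("analysis"), parsed.get("final")
```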

That said, I would like to have the reasoning-parser part in good order. Can someone enlighten me as to why I am getting the 400 BadRequestError when running vllm-node with that switch, please?

Thank you! :-)

I am investigating whether this may be a manifestation of vLLM issue #26480, which is tentatively fixed on master: https://github.com/vllm-project/vllm/issues/26480

When did you last build the container?
I just ran with yesterday's build, and everything works fine.

BTW, with this Docker build, you don’t have to map tiktoken encodings - they are embedded in the container, so you can drop the volume and the environment variable (it is also set automatically).

Thank you for your reply and the heads up about tiktoken encodings, well noted and reflected in the updated launch script.

I last built the container 3 days ago.
I have tried other Docker images too, but the --reasoning-parser switch results in erratic behaviour. More specifically, with this container and others based on transformers 4.xx I get a 400 error, while with the latest vLLM and transformers 5 the API server returns a 500 Internal Server Error instead. The diagnostic message is the same:

"gpt-oss has a special branch for parsing reasoning in non-streaming mode. This method shouldn't be used."

I welcome your thoughts! Many thanks :-)

Just rebuild the container again - looks like there was a bug in vLLM that has been fixed since.

./build-and-copy.sh -t vllm-node-20260123-whl --use-wheels --pre-flashinfer --rebuild-vllm --rebuild-deps -c

Then run:

docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network host \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-node-20260123-whl \
  bash -c -i "vllm serve \
    openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8888 \
    --enable-auto-tool-choice \
    --tool-call-parser=openai \
    --reasoning-parser=openai_gptoss \
    --gpu-memory-utilization 0.70 \
    --enable-prefix-caching \
    --load-format fastsafetensors"

I’ve just tested it and it works just fine.

I had to add --no-cache to the docker build to fix a manylinux platform tag issue:

$ ./build-and-copy.sh -t vllm-node-20260123-whl --use-wheels --pre-flashinfer --rebuild-vllm --rebuild-deps
Using pre-built vLLM wheels (mode: nightly)
Setting CACHEBUST_DEPS…
Setting CACHEBUST_VLLM…
Using pre-release FlashInfer…
Building image with command: docker build -t vllm-node-20260123-whl -f Dockerfile.wheels --build-arg CACHEBUST_DEPS=1769202636 --build-arg CACHEBUST_VLLM=1769202636 --build-arg TRITON_REF=v3.5.1 --build-arg VLLM_REF=main --build-arg BUILD_JOBS=16 --build-arg FLASHINFER_PRE=--pre .
[+] Building 2.1s (13/19) docker:default
=> [internal] load build definition from Dockerfile.wheels 0.0s
=> => transferring dockerfile: 4.08kB 0.0s
=> resolve image config for docker-image://docker.io/docker/dockerfile:1.6 0.4s
=> CACHED docker-image://docker.io/docker/dockerfile:1.6@sha256:ac85f380a63b13dfcefa89046420e1781752bab202122f8f50032edf31be0021 0.0s
=> [internal] load metadata for docker.io/nvidia/cuda:13.1.0-devel-ubuntu24.04 0.7s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [stage-0 1/13] FROM docker.io/nvidia/cuda:13.1.0-devel-ubuntu24.04@sha256:7f32ae6e575abb29f2dacf6c75fe94a262bb48dcc5196ac833ced59d9fde8107 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 82B 0.0s
=> CACHED [stage-0 2/13] RUN apt update && apt upgrade -y && apt install -y --allow-change-held-packages --no-install-recommends python3 python3-pip python3-dev vim curl git wget jq libcudnn9-cuda-13 libnccl-dev libnccl2 libibverbs1 libibverbs-dev rdma-core 0.0s
=> CACHED [stage-0 3/13] WORKDIR /workspace/vllm 0.0s
=> CACHED [stage-0 4/13] RUN mkdir -p tiktoken_encodings && wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken" && wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.wind 0.0s
=> CACHED [stage-0 5/13] COPY fastsafetensors.patch . 0.0s
=> CACHED [stage-0 6/13] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv uv pip install -U fastsafetensors 0.0s
=> ERROR [stage-0 7/13] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv if [ "0" = "1" ]; then export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//') && uv pip install -U http 0.8s

[stage-0 7/13] RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv if [ "0" = "1" ]; then export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//') && uv pip install -U https://github.com
/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu130-cp38-abi3-manylinux_2_35_aarch64.whl --torch-backend=auto; else uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130; fi:
0.317 Using Python 3.12.3 environment at: /usr
0.762 × No solution found when resolving dependencies:
0.762 ╰─▶ Because only vllm==0.14.0rc2.dev290+g586a57ad7.cu130 is available and
0.762 vllm==0.14.0rc2.dev290+g586a57ad7.cu130 has no wheels with a matching
0.762 platform tag (e.g., manylinux_2_39_aarch64), we can conclude that all
0.762 versions of vllm cannot be used.
0.762 And because you require vllm, we can conclude that your requirements
0.762 are unsatisfiable.
0.762
0.762 hint: vllm was requested with a pre-release marker (e.g., all of:
0.762 vllm<0.14.0rc2.dev290+g586a57ad7.cu130
0.762 vllm>0.14.0rc2.dev290+g586a57ad7.cu130
0.762 ), but pre-releases weren't enabled (try: --prerelease=allow)
0.762
0.762 hint: vllm was found on https://wheels.vllm.ai/nightly/cu130, but not
0.762 at the requested version (all of:
0.762 vllm<0.14.0rc2.dev290+g586a57ad7.cu130
0.762 vllm>0.14.0rc2.dev290+g586a57ad7.cu130
0.762 ). A compatible version may be available on a subsequent index (e.g.,
0.762 Simple index). By default, uv will only consider versions
0.762 that are published on the first index that contains a given package, to
0.762 avoid dependency confusion attacks. If all indexes are equally trusted,
0.762 use --index-strategy unsafe-best-match to consider all versions from
0.762 all indexes, regardless of the order in which they were defined.
0.762
0.762 hint: Wheels are available for vllm
0.762 (v0.14.0rc2.dev290+g586a57ad7.cu130) on the following platform:
0.762 manylinux_2_35_x86_64

Yeah, I was getting the same error, but now everything installs without any issues. No changes needed.

Thank you for your kind assistance on this matter. I am sorry it took me a few days to follow up; I spent them taking a sneak peek down the rabbit hole to work out the underlying reasons for the observed faults.

It turns out that there was no issue with the Docker image per se. I rebuilt it a dozen times, exploring the combinatorial space of command-line options, only to collect a range of glitches that puzzled me. I am sharing the story here in the hope it saves time for other community members.

When I first rebuilt the Docker image, vLLM failed to start, complaining about inconsistencies between flashinfer versions. Fine. I tried again with caching disabled, then switched from nightly builds to releases, then to the latest transformers, and combinations thereof. Still nothing: vLLM crashed while loading chunks of safetensors (JSONDecodeError). At that point I had to conclude the issue lay elsewhere.
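
In hindsight, that JSONDecodeError was a strong hint: a safetensors file starts with an 8-byte little-endian header length followed by that many bytes of JSON metadata, so a corrupted or truncated shard blows up exactly at the JSON parse. A minimal reader (not vLLM's actual loader) showing where that happens:

```python
import json
import struct

def read_safetensors_header(path):
    """Return the JSON metadata at the head of a safetensors file.

    The format begins with an unsigned 64-bit little-endian header size,
    followed by that many bytes of JSON. On a truncated or corrupted file,
    json.loads() is typically the first thing to fail -- hence the
    JSONDecodeError seen during model loading.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))
```

Running this over each shard of a cached model is a cheap smoke test before blaming the serving stack.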

I eventually ran an integrity check on my cached openai/gpt-oss-120b, and it failed miserably:

hf cache verify openai/gpt-oss-120b --revision b5c939de8f754692c1647ca79fbf85e8c1e70f8a

I then re-downloaded the affected files with:

hf download openai/gpt-oss-120b --revision b5c939de8f754692c1647ca79fbf85e8c1e70f8a --force-download

It strikes me that 'hf download openai/gpt-oss-120b' had initially reported the download as complete without alerting me to, let alone recovering from, the data corruption. It strikes me also that removing the --reasoning-parser switch from the 'vllm serve' command line allowed vLLM to stomach the corrupted model without complaint. The only hint, aside from the JSONDecodeError, was an erroneously resolved architecture (independently reported on GitHub too: https://github.com/vllm-project/vllm/issues/32857). This probably also explains the 109 t/s recorded with --reasoning-parser disabled, but maybe I will dig deeper into that when I have more time.
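
In case it helps others: the manual equivalent of such an integrity check is streaming each cached file through SHA-256 and comparing the digest with the "sha256:" LFS oid shown on the model's file page on the Hub. A sketch (the cache path in the comment is hypothetical):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Hypothetical usage: compare against the LFS oid from the HF file page, e.g.
# sha256_of(os.path.expanduser(
#     "~/.cache/huggingface/hub/.../model-00001-of-00015.safetensors"))
```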

Thanks again!

