Issue with running gpt-oss-120b in vLLM

First, I downloaded the thor_vllm_container image from here: Link

Then I ran the container:

sudo docker run --runtime=nvidia \
--gpus all \
-it \
--rm \
--network=host \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
thor_vllm_container:25.08-py3-base

Once inside the container
vllm serve openai/gpt-oss-120b

But this is not working; I get a lot of errors. Any suggestions?

Hi,

Thanks for reporting this.

We also see some errors when running the gpt-oss-120b model on the vLLM container.
Will check it further and provide more info to you later.

Thanks

Thanks!

Hi,

There is a compatibility issue, and we will check it further.

But the model can work correctly with ollama.
You can try it if ollama is an option for you:

$ sudo docker run -it --rm --runtime nvidia ghcr.io/nvidia-ai-iot/ollama:r38.2.arm64-sbsa-cu130-24.04
# ollama run gpt-oss:120b
>>> Which number is larger, 9.11 or 9.8?
Thinking...
The user asks: Which number is larger, 9.11 or 9.8? Straightforward: 9.8 > 9.11. But note that 9.11 could be interpreted as a date (September 11) but here 
it's numeric. So answer: 9.8 is larger. Could add explanation about decimal comparison. Provide clear answer.
...done thinking.

9.8 is the larger number.  

When comparing decimals, you look at the digits from left to right:

- Both numbers start with a 9 in the units place.
- In the tenths place, 9.8 has an 8, while 9.11 has a 1.

Since 8 > 1, the number 9.8 is greater than 9.11.

>>> Send a message (/? for help)
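If you prefer calling ollama over HTTP rather than the interactive prompt, ollama also exposes a REST API on its default port 11434. A minimal sketch, assuming the same `gpt-oss:120b` model pulled above and the default port:

```shell
# Hedged sketch: query the ollama HTTP API (default port 11434).
# "stream": false returns a single JSON object instead of streamed chunks.
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:120b",
  "prompt": "Which number is larger, 9.11 or 9.8?",
  "stream": false
}'
```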

Thanks.

Yes, I tried with ollama, but I think vLLM would be better to get more tokens per second.

Hi! Is there any tutorial that can help me build vLLM from source myself? I’d like to test the model Qwen3-Next-80B-A3B-Instruct-AWQ-4bit. However, the version of vLLM in this container is not compatible. I think I could use the PyTorch container (nvcr.io/nvidia/pytorch 25.08-py3 or 25.09-py3) to build vLLM from source, but that is a lot of trouble for me.

Try it this way. I tested with Qwen2.5-VL-7B-Instruct-quantized.w4a16 and it’s working, but you can try with Qwen3…

1) First, download the image from this link

2) Run the container

sudo docker run --runtime=nvidia \
  --gpus all \
  -it \
  --rm \
  --network=host \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  thor_vllm_container:25.08-py3-base

3) Inside the container

python -m vllm.entrypoints.openai.api_server \
  --model RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w4a16 \
  --quantization compressed-tensors \
  --host 0.0.0.0 \
  --gpu-memory-utilization 0.24
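Once the server is up, you can sanity-check it with a minimal Python client against the OpenAI-compatible endpoint. This is just a sketch; the host, port, and model name below match the command above, so adjust them if yours differ:

```python
# Minimal client sketch for the OpenAI-compatible server started above.
# Host, port, and model name are assumptions; adjust to your setup.
import json
import urllib.request


def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://0.0.0.0:8000"):
    """Assemble the URL and JSON body for a /v1/chat/completions call."""
    url = f"{base_url}/v1/chat/completions"
    body = {"model": model,
            "messages": [{"role": "user", "content": prompt}]}
    return url, json.dumps(body).encode("utf-8")


def chat(model: str, prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    url, data = build_chat_request(model, prompt)
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        out = json.loads(resp.read())
    return out["choices"][0]["message"]["content"]


# Example (requires the server from step 3 to be running):
# print(chat("RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w4a16",
#            "Describe Jetson Thor in one sentence."))
```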

Is the version of vLLM in this container 0.9.2+4ef1e343.nv25.8.post1.cu130? I’ve tried it before, but this Qwen3 model requires building vLLM from the newest source code (vllm-0.11.0rc2) for it to work. Btw, thank you so much.


Hi,

gpt-oss-120b and gpt-oss-20b depend on vLLM 0.10.1: openai/gpt-oss-120b · Hugging Face
However, the tritonserver:25.08-vllm-python-py3 container uses vLLM v0.9.2, which doesn’t support the gpt-oss model yet.

# vllm -v
INFO 10-01 06:07:10 [__init__.py:244] Automatically detected platform cuda.
0.9.2+4ef1e343.nv25.8.post1.cu130

Please wait for our new vLLM container release or use ollama as a temporary solution.
Thanks.

Hi,

Thanks for your patience.

Below are the steps to run gpt-oss-20b with our new vLLM container.
(Tested with the 20b model, but 120b is expected to work as well.)

For the gpt-oss models, you will need the workaround (WAR) for Harmony encoding mentioned in this link.

$ sudo docker run -it --rm nvcr.io/nvidia/vllm:25.09-py3
# mkdir /etc/encodings
# wget https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -O /etc/encodings/cl100k_base.tiktoken
# wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken -O /etc/encodings/o200k_base.tiktoken
# export TIKTOKEN_ENCODINGS_BASE=/etc/encodings
# vllm serve openai/gpt-oss-20b
...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [gpu_model_runner.py:1953] Starting to load model openai/gpt-oss-20b...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [cuda.py:323] Using Triton backend on V1 engine.
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(EngineCore_0 pid=1222) INFO 10-02 05:50:40 [weight_utils.py:296] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:01<00:03,  1.51s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:03<00:01,  1.69s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00,  1.74s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00,  1.71s/it]
(EngineCore_0 pid=1222)
(EngineCore_0 pid=1222) INFO 10-02 05:50:46 [default_loader.py:262] Loading weights took 5.27 seconds
(EngineCore_0 pid=1222) WARNING 10-02 05:50:46 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore_0 pid=1222) INFO 10-02 05:50:48 [gpu_model_runner.py:2007] Model loading took 13.7193 GiB and 9.283907 seconds
(EngineCore_0 pid=1222) INFO 10-02 05:50:52 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/ac91ec61b3/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=1222) INFO 10-02 05:50:52 [backends.py:559] Dynamo bytecode transform time: 3.40 s
(EngineCore_0 pid=1222) INFO 10-02 05:50:54 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 1.612 s
(EngineCore_0 pid=1222) INFO 10-02 05:50:54 [monitor.py:34] torch.compile takes 3.40 s in total
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [gpu_worker.py:276] Available KV cache memory: 94.61 GiB
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [kv_cache_utils.py:1013] GPU KV cache size: 2,066,752 tokens
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [kv_cache_utils.py:1017] Maximum concurrency for 131,072 tokens per request: 31.02x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████| 83/83 [00:11<00:00,  7.41it/s]
(EngineCore_0 pid=1222) INFO 10-02 05:51:10 [gpu_model_runner.py:2708] Graph capturing finished in 12 secs, took 0.96 GiB
(EngineCore_0 pid=1222) INFO 10-02 05:51:10 [core.py:214] init engine (profile, create kv cache, warmup model) took 21.89 seconds
(APIServer pid=1150) INFO 10-02 05:51:13 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 258345
(APIServer pid=1150) INFO 10-02 05:51:13 [api_server.py:1611] Supported_tasks: ['generate']
(APIServer pid=1150) WARNING 10-02 05:51:14 [serving_responses.py:137] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=1150) INFO 10-02 05:51:15 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:8000
...

Thanks.

Thanks for the solution! I’ll check it out and let you know

Thanks, I tried that and it works on my Thor for gpt-oss-20b as well as for gpt-oss-120b.

The context seems to be quite small and I see no parameter controlling this. How can it be adjusted?

Can you share the way you start gpt-oss on Thor?

In my case, I can start vLLM, but I keep getting an empty string as output.

Here’s one way to run it. Note that I saved o200k_base.tiktoken to ~/.cache/huggingface/hub/harmony/ beforehand:

docker run --name vllm --rm -it --network host \
  --runtime=nvidia --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=16g \
  -e VLLM_USE_V1=1 -e VLLM_WORKER_MULTIPROC=0 \
  -e TIKTOKEN_ENCODINGS_BASE="/root/.cache/huggingface/hub/harmony" \
  -e TIKTOKEN_RS_CACHE_DIR="/root/.cache/huggingface/hub/harmony" \
  -v "$HOME/.cache:/root/.cache" \
  nvcr.io/nvidia/vllm:25.09-py3 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model "openai/gpt-oss-20b" \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 512 \
    --max-num-seqs 2 \
    --gpu-memory-utilization 0.25 \
    --kv-cache-dtype=auto

From a second terminal on Thor, run something like the following:

curl -X 'POST' 'http://127.0.0.1:8000/v1/chat/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
"model": "openai/gpt-oss-20b",
"messages": [{"role":"user", "content": "What are Chihuahuas famous for?"}]
}' | jq

Hi,

You can run the model with the command below:

vllm serve openai/gpt-oss-20b

Thanks.


I did exactly what @AastaLLL suggested in his post. The system was set up according to the quickstart guide for Thor. You can start with the 20b version of the model and then try the 120b version. Both worked, but my question about how to adjust the context size is still open…

thanks for your reply!

Compared with the log you provided, there is an additional line in my environment:

[rank0]:W1009 06:44:02.915000 244 torch/_inductor/utils.py:1545] [0/0] Not enough SMs to use max_autotune_gemm mode

You can try “--max-seq-len 32000 --max-model-len 32000” on vllm serve.
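For the context size specifically, the flag vLLM documents is --max-model-len (with --max-num-seqs bounding concurrent sequences so the KV cache fits in memory). A sketch for the 20b model; the values here are assumptions sized for Thor, not tested settings:

```shell
# Hedged sketch: serve with a 32k-token context window.
# --max-model-len sets the context length; --max-num-seqs caps
# how many sequences run concurrently against the KV cache.
vllm serve openai/gpt-oss-20b \
  --max-model-len 32000 \
  --max-num-seqs 4
```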

My mistake~

I was trying to read the content during thinking mode, so it is always None until the thinking is finished.
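For what it’s worth, when streaming from a reasoning model, each chunk typically carries the thinking in a separate reasoning_content field while content stays None, so a client that only reads content sees nothing until the final answer starts. A hedged sketch of collecting both fields (the exact field name depends on your vLLM version and reasoning-parser settings):

```python
# Hedged sketch: split streamed deltas from a reasoning model into the
# thinking text and the final answer. Each delta dict is assumed to carry
# either `reasoning_content` (thinking) or `content` (answer); the other
# field is None, as observed in the thread above.
def split_stream(deltas):
    """deltas: iterable of dicts shaped like the `delta` objects in
    streamed chat.completion.chunk events. Returns (thinking, answer)."""
    thinking, answer = [], []
    for d in deltas:
        if d.get("reasoning_content"):
            thinking.append(d["reasoning_content"])
        if d.get("content"):
            answer.append(d["content"])
    return "".join(thinking), "".join(answer)


# Example with the shape such chunks typically have:
chunks = [
    {"reasoning_content": "User asks 2+2. ", "content": None},
    {"reasoning_content": "Answer is 4.", "content": None},
    {"reasoning_content": None, "content": "4"},
]
print(split_stream(chunks))  # ('User asks 2+2. Answer is 4.', '4')
```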

Hi,

Just want to double-confirm that you can get the expected results after the thinking is finished, right?
Thanks.