Issue with running gpt-oss-120b in vLLM

First, I downloaded the thor_vllm_container from here: Link

Then I ran the container:

sudo docker run --runtime=nvidia \
--gpus all \
-it \
--rm \
--network=host \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
thor_vllm_container:25.08-py3-base

Once inside the container:
vllm serve openai/gpt-oss-120b
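
For reference, when the server does start, vLLM exposes an OpenAI-compatible API on port 8000 by default, so a minimal check from another shell would look something like this (just a sketch, assuming the default host/port):

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}]}'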

But this is not working; I got a lot of errors. Any suggestions?

Hi,

Thanks for reporting this.

We also see some errors when running the gpt-oss-120b model on the vLLM container.
Will check it further and provide more info to you later.

Thanks

thanks

Hi,

There is a compatibility issue, and we will check it further.

But the model can work correctly with ollama.
You can try it if ollama is an option for you:

$ sudo docker run -it --rm --runtime nvidia ghcr.io/nvidia-ai-iot/ollama:r38.2.arm64-sbsa-cu130-24.04
# ollama run gpt-oss:120b
>>> Which number is larger, 9.11 or 9.8?
Thinking...
The user asks: Which number is larger, 9.11 or 9.8? Straightforward: 9.8 > 9.11. But note that 9.11 could be interpreted as a date (September 11) but here 
it's numeric. So answer: 9.8 is larger. Could add explanation about decimal comparison. Provide clear answer.
...done thinking.

9.8 is the larger number.  

When comparing decimals, you look at the digits from left to right:

- Both numbers start with a 9 in the units place.
- In the tenths place, 9.8 has an 8, while 9.11 has a 1.

Since 8 > 1, the number 9.8 is greater than 9.11.

>>> Send a message (/? for help)
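
If you need API access instead of the interactive prompt (for example, to script a comparison with vLLM), ollama also exposes a REST API on port 11434; a minimal sketch with default settings:

curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:120b",
  "prompt": "Which number is larger, 9.11 or 9.8?",
  "stream": false
}'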

Thanks.

Yes, I tried with ollama, but I think vLLM would be better for getting more tokens per second.
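
For what it's worth, a rough tokens-per-second number on the ollama side can be read from the --verbose flag, which prints an eval rate after each response:

# ollama run gpt-oss:120b --verbose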

Hi! Is there any tutorial that can help me build vLLM from source code myself? I'd like to test the model Qwen3-Next-80B-A3B-Instruct-AWQ-4bit. However, the version of vLLM in this container is not compatible. I think I can use the PyTorch container (nvcr.io/nvidia/pytorch 25.08-py3 or 25.09-py3) to build vLLM from source code, but it has been a lot of trouble for me.

Try it this way. I tested with Qwen2.5-VL-7B-Instruct-quantized.w4a16 and it's working, but you can try with Qwen3…

1) First, download the image from this link.

2) Then run the container:

sudo docker run --runtime=nvidia \
--gpus all \
-it \
--rm \
--network=host \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
thor_vllm_container:25.08-py3-base

3) Inside the container, start the server:

python -m vllm.entrypoints.openai.api_server \
--model RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w4a16 \
--quantization compressed-tensors \
--host 0.0.0.0 \
--gpu-memory-utilization 0.24
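
Once it is up, you can sanity-check that the model loaded by listing the served models (assuming the default port 8000):

curl http://0.0.0.0:8000/v1/models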

Is the version of vLLM in this container 0.9.2+4ef1e343.nv25.8.post1.cu130? I've tried it before, but this Qwen3 model requires building vLLM from the newest source code (vllm-0.11.0rc2) for it to work. Btw, thank you so much.
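
The build flow I plan to try roughly follows vLLM's docs for reusing an existing PyTorch install inside the nvcr.io/nvidia/pytorch container (not verified on Thor yet, and the checkout may need to be pinned to the 0.11.0rc2 tag):

sudo docker run --runtime=nvidia --gpus all -it --rm --network=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/pytorch:25.08-py3

# inside the container
git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py          # drop pinned torch requirements so the container's torch is reused
pip install -r requirements/build.txt
pip install --no-build-isolation -e .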
