First, I downloaded the thor_vllm_container from here: Link
Then I ran the container:
sudo docker run --runtime=nvidia \
--gpus all \
-it \
--rm \
--network=host \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
thor_vllm_container:25.08-py3-base
Once inside the container:
vllm serve openai/gpt-oss-120b
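For reference, once the server starts, vLLM serves an OpenAI-compatible API on port 8000 by default, so my plan was to sanity-check it with:
curl http://localhost:8000/v1/models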
But the serve command is not working; I got a lot of errors. Any suggestions?
Hi,
Thanks for reporting this.
We also see some errors when running the gpt-oss-120b model on the vLLM container.
Will check it further and provide more info to you later.
Thanks
Hi,
There is a compatibility issue, and we will check it further.
But the model works correctly with Ollama.
You can try it if Ollama is an option for you:
$ sudo docker run -it --rm --runtime nvidia ghcr.io/nvidia-ai-iot/ollama:r38.2.arm64-sbsa-cu130-24.04
# ollama run gpt-oss:120b
>>> Which number is larger, 9.11 or 9.8?
Thinking...
The user asks: Which number is larger, 9.11 or 9.8? Straightforward: 9.8 > 9.11. But note that 9.11 could be interpreted as a date (September 11) but here
it's numeric. So answer: 9.8 is larger. Could add explanation about decimal comparison. Provide clear answer.
...done thinking.
9.8 is the larger number.
When comparing decimals, you look at the digits from left to right:
- Both numbers start with a 9 in the units place.
- In the tenths place, 9.8 has an 8, while 9.11 has a 1.
Since 8 > 1, the number 9.8 is greater than 9.11.
>>> Send a message (/? for help)
Thanks.
Yes, I tried with Ollama, but I think vLLM would be better for getting more tokens per second.
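In the meantime, Ollama can at least print throughput numbers for comparison: run it in verbose mode and look at the eval rate line, which reports tokens per second:
ollama run gpt-oss:120b --verbose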
Hi! Is there any tutorial that can help me build vLLM from source code myself? I'd like to test the model Qwen3-Next-80B-A3B-Instruct-AWQ-4bit; however, the version of vLLM in this container is not compatible. I think I can use the PyTorch container (nvcr.io/nvidia/pytorch:25.08-py3 or 25.09-py3) to build vLLM from source, but it has been a lot of trouble for me.
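What I had in mind is roughly the following, based on vLLM's documented flow for building against an existing PyTorch install (use_existing_torch.py drops the pinned torch requirements so the container's PyTorch is reused; the v0.11.0rc2 tag is my assumption for the version I need):
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.11.0rc2
python use_existing_torch.py
pip install -r requirements/build.txt
pip install --no-build-isolation -e .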
Check this way. I tried with Qwen2.5-VL-7B-Instruct-quantized.w4a16
and it's working, but you can try with Qwen3…
1) First, download the image from this link
2) Run the container:
sudo docker run --runtime=nvidia \
--gpus all \
-it \
--rm \
--network=host \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
thor_vllm_container:25.08-py3-base
3) Inside the container:
python -m vllm.entrypoints.openai.api_server \
--model RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w4a16 \
--quantization compressed-tensors \
--host 0.0.0.0 \
--gpu-memory-utilization 0.24
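Once it's up, you can verify it with a request against the OpenAI-compatible endpoint (a minimal check; I'm assuming the default port 8000 since the command above doesn't change it):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w4a16", "messages": [{"role": "user", "content": "Hello"}]}'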
Is the version of vLLM in this container 0.9.2+4ef1e343.nv25.8.post1.cu130? I've tried it before, but this Qwen3 model requires building vLLM from the newest source code (vllm-0.11.0rc2) to work. Btw, thank you so much.
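For reference, I checked the shipped version inside the container with:
python -c "import vllm; print(vllm.__version__)"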