There is a compatibility issue, and we will check it further.
However, the model works correctly with ollama. You can try it if ollama is an option for you:
$ sudo docker run -it --rm --runtime nvidia ghcr.io/nvidia-ai-iot/ollama:r38.2.arm64-sbsa-cu130-24.04
# ollama run gpt-oss:120b
>>> Which number is larger, 9.11 or 9.8?
Thinking...
The user asks: Which number is larger, 9.11 or 9.8? Straightforward: 9.8 > 9.11. But note that 9.11 could be interpreted as a date (September 11) but here
it's numeric. So answer: 9.8 is larger. Could add explanation about decimal comparison. Provide clear answer.
...done thinking.
9.8 is the larger number.
When comparing decimals, you look at the digits from left to right:
- Both numbers start with a 9 in the units place.
- In the tenths place, 9.8 has an 8, while 9.11 has a 1.
Since 8 > 1, the number 9.8 is greater than 9.11.
>>> Send a message (/? for help)
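If you would rather script against the model than use the interactive prompt, ollama also exposes a REST API. A minimal sketch, assuming the ollama server is running inside the container and you publish its default port 11434 (the docker run command above does not map any ports):
$ sudo docker run -it --rm --runtime nvidia -p 11434:11434 ghcr.io/nvidia-ai-iot/ollama:r38.2.arm64-sbsa-cu130-24.04
$ curl http://localhost:11434/api/generate -d '{
    "model": "gpt-oss:120b",
    "prompt": "Which number is larger, 9.11 or 9.8?",
    "stream": false
  }'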
Hi! Is there any tutorial that can help me build vLLM from source code myself? I'd like to test the model Qwen3-Next-80B-A3B-Instruct-AWQ-4bit, but the version of vLLM in this container is not compatible. I think I could use the PyTorch container (nvcr.io/nvidia/pytorch 25.08-py3 or 25.09-py3) to build vLLM from source, but that is a lot of trouble for me.
Is the version of vLLM in this container 0.9.2+4ef1e343.nv25.8.post1.cu130? I've tried it before, but this Qwen3 model requires building vLLM from the newest source code (vllm-0.11.0rc2) to work. Btw, thank you so much.
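What I have in mind is roughly the following build flow. This is just a sketch, not a verified recipe; I'm not sure how the dependency handling (for example, keeping the container's existing PyTorch via vLLM's use_existing_torch.py helper) behaves on the Thor/SBSA platform or across vLLM versions:
$ sudo docker run -it --rm --runtime nvidia nvcr.io/nvidia/pytorch:25.09-py3
# git clone https://github.com/vllm-project/vllm.git
# cd vllm
# pip install -e .    # builds vLLM from source; may take a long time on the device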
gpt-oss-120b and gpt-oss-20b depend on vLLM 0.10.1: openai/gpt-oss-120b · Hugging Face
However, the tritonserver:25.08-vllm-python-py3 container uses vLLM v0.9.2, which doesn't support the gpt-oss model yet.
Below are the steps to run gpt-oss-20b with our new vLLM container.
(Tested with the 20b model, but the 120b model is expected to work as well.)
For the gpt-oss model, you will need the WAR (workaround) mentioned in this link on Harmony encoding.
$ sudo docker run -it --rm nvcr.io/nvidia/vllm:25.09-py3
# mkdir /etc/encodings
# wget https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -O /etc/encodings/cl100k_base.tiktoken
# wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken -O /etc/encodings/o200k_base.tiktoken
# export TIKTOKEN_ENCODINGS_BASE=/etc/encodings
# vllm serve openai/gpt-oss-20b
...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [gpu_model_runner.py:1953] Starting to load model openai/gpt-oss-20b...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [cuda.py:323] Using Triton backend on V1 engine.
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(EngineCore_0 pid=1222) INFO 10-02 05:50:40 [weight_utils.py:296] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:01<00:03, 1.51s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:03<00:01, 1.69s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.74s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.71s/it]
(EngineCore_0 pid=1222)
(EngineCore_0 pid=1222) INFO 10-02 05:50:46 [default_loader.py:262] Loading weights took 5.27 seconds
(EngineCore_0 pid=1222) WARNING 10-02 05:50:46 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore_0 pid=1222) INFO 10-02 05:50:48 [gpu_model_runner.py:2007] Model loading took 13.7193 GiB and 9.283907 seconds
(EngineCore_0 pid=1222) INFO 10-02 05:50:52 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/ac91ec61b3/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=1222) INFO 10-02 05:50:52 [backends.py:559] Dynamo bytecode transform time: 3.40 s
(EngineCore_0 pid=1222) INFO 10-02 05:50:54 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 1.612 s
(EngineCore_0 pid=1222) INFO 10-02 05:50:54 [monitor.py:34] torch.compile takes 3.40 s in total
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [gpu_worker.py:276] Available KV cache memory: 94.61 GiB
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [kv_cache_utils.py:1013] GPU KV cache size: 2,066,752 tokens
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [kv_cache_utils.py:1017] Maximum concurrency for 131,072 tokens per request: 31.02x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 83/83 [00:11<00:00, 7.41it/s]
(EngineCore_0 pid=1222) INFO 10-02 05:51:10 [gpu_model_runner.py:2708] Graph capturing finished in 12 secs, took 0.96 GiB
(EngineCore_0 pid=1222) INFO 10-02 05:51:10 [core.py:214] init engine (profile, create kv cache, warmup model) took 21.89 seconds
(APIServer pid=1150) INFO 10-02 05:51:13 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 258345
(APIServer pid=1150) INFO 10-02 05:51:13 [api_server.py:1611] Supported_tasks: ['generate']
(APIServer pid=1150) WARNING 10-02 05:51:14 [serving_responses.py:137] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=1150) INFO 10-02 05:51:15 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:8000
...
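Once the API server reports it is listening on http://0.0.0.0:8000, you can send a request through vLLM's OpenAI-compatible endpoint. A minimal sketch (assuming you query from inside the container, or that you added a port mapping such as -p 8000:8000 to the docker run command above):
$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "openai/gpt-oss-20b",
          "messages": [{"role": "user", "content": "Which number is larger, 9.11 or 9.8?"}]
        }'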
I did exactly what @AastaLLL suggested in his post. The system was set up according to the quickstart guide for Thor. You can start with the 20b version of the model and then try the 120b version. Both worked, but my question of how to adjust the context size is still open…