Just to update the thread!
SUCCESS: The two wheel packages you posted worked. (Now I'll try with the actual parameters I wanted to set.)
Things I tried that did not work:
- Built your Triton fork and then rebuilt vLLM against it
- Updated vLLM inside the Docker image mentioned above, using the same example
- Built main of both Triton and vLLM (as of today)
- Deleted the triton_kernels package
I attempted over two dozen different combinations without any success; the builds looked roughly like the sketch below. Even pinning different commit hashes in both Triton and vLLM didn't yield any results.
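For reference, a rough sketch of what the source-build attempts looked like (the fork URL, branch, and checkout paths here are placeholders, not the exact ones I used):

$ git clone https://github.com/<fork-owner>/triton.git     # fork URL is a placeholder
$ cd triton
$ uv pip install -e .                                       # build Triton from source
$ cd ../vllm                                                # local vLLM checkout (path is a placeholder)
$ uv pip install -e . --no-build-isolation                  # rebuild vLLM against the locally built Triton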
$ uv run vllm serve openai/gpt-oss-120b
INFO 10-27 19:12:37 [__init__.py:225] Automatically detected platform cuda.
(APIServer pid=242516) INFO 10-27 19:12:40 [api_server.py:1886] vLLM API server version 0.12.0
(APIServer pid=242516) INFO 10-27 19:12:40 [utils.py:243] non-default args: {'model_tag': 'openai/gpt-oss-120b', 'model': 'openai/gpt-oss-120b'}
(APIServer pid=242516) INFO 10-27 19:12:41 [model.py:663] Resolved architecture: GptOssForCausalLM
Parse safetensors files: 100%|██████████| 15/15 [00:00<00:00, 15.65it/s]
(APIServer pid=242516) INFO 10-27 19:12:42 [model.py:1751] Using max model len 131072
(APIServer pid=242516) INFO 10-27 19:12:43 [scheduler.py:225] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=242516) INFO 10-27 19:12:43 [config.py:274] Overriding max cuda graph capture size to 992 for performance.
INFO 10-27 19:12:45 [__init__.py:225] Automatically detected platform cuda.
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:46 [core.py:730] Waiting for init message from front-end.
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:46 [core.py:97] Initializing a V1 LLM engine (v0.12.0) with config: model='openai/gpt-oss-120b', speculative_config=None, tokenizer='openai/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=openai/gpt-oss-120b, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [992, 976, 960, 944, 928, 912, 896, 880, 864, 848, 832, 816, 800, 784, 768, 752, 736, 720, 704, 688, 672, 656, 640, 624, 608, 592, 576, 560, 544, 528, 512, 496, 480, 464, 448, 432, 416, 400, 384, 368, 352, 336, 320, 304, 288, 272, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_capture_size': 992, 'local_cache_dir': None}
(EngineCore_DP0 pid=242581) /home/wade/.vllm/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning:
(EngineCore_DP0 pid=242581) Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore_DP0 pid=242581) Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore_DP0 pid=242581) (8.0) - (12.0)
(EngineCore_DP0 pid=242581)
(EngineCore_DP0 pid=242581) warnings.warn(
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:52 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:52 [gpu_model_runner.py:2860] Starting to load model openai/gpt-oss-120b...
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:53 [cuda.py:398] Using Triton backend on V1 engine.
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:53 [mxfp4.py:126] Using Marlin backend
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:56 [weight_utils.py:419] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/15 [00:25<06:03, 25.95s/it]
Loading safetensors checkpoint shards: 13% Completed | 2/15 [00:50<05:25, 25.05s/it]
Loading safetensors checkpoint shards: 20% Completed | 3/15 [01:17<05:09, 25.78s/it]
Loading safetensors checkpoint shards: 27% Completed | 4/15 [01:43<04:45, 25.99s/it]
Loading safetensors checkpoint shards: 33% Completed | 5/15 [02:09<04:21, 26.14s/it]
Loading safetensors checkpoint shards: 40% Completed | 6/15 [02:35<03:54, 26.07s/it]
Loading safetensors checkpoint shards: 47% Completed | 7/15 [03:00<03:24, 25.61s/it]
Loading safetensors checkpoint shards: 53% Completed | 8/15 [03:19<02:44, 23.52s/it]
Loading safetensors checkpoint shards: 60% Completed | 9/15 [03:37<02:10, 21.83s/it]
Loading safetensors checkpoint shards: 67% Completed | 10/15 [04:04<01:56, 23.34s/it]
Loading safetensors checkpoint shards: 73% Completed | 11/15 [04:29<01:35, 23.79s/it]
Loading safetensors checkpoint shards: 80% Completed | 12/15 [04:52<01:11, 23.82s/it]
Loading safetensors checkpoint shards: 87% Completed | 13/15 [05:19<00:49, 24.68s/it]
Loading safetensors checkpoint shards: 93% Completed | 14/15 [05:37<00:22, 22.65s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [06:01<00:00, 23.12s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [06:01<00:00, 24.12s/it]
(EngineCore_DP0 pid=242581)
(EngineCore_DP0 pid=242581) INFO 10-27 19:18:58 [default_loader.py:314] Loading weights took 361.90 seconds
(EngineCore_DP0 pid=242581) WARNING 10-27 19:18:58 [marlin_utils_fp4.py:204] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:04 [gpu_model_runner.py:2921] Model loading took 65.9651 GiB and 371.103032 seconds
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:08 [backends.py:609] Using cache directory: /home/wade/.cache/vllm/torch_compile_cache/ad8c474a62/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:08 [backends.py:623] Dynamo bytecode transform time: 3.83 s
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:11 [backends.py:207] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.066 s
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:12 [monitor.py:34] torch.compile takes 6.89 s in total
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:13 [gpu_worker.py:316] Available KV cache memory: 38.44 GiB
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:13 [kv_cache_utils.py:1201] GPU KV cache size: 559,776 tokens
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:13 [kv_cache_utils.py:1206] Maximum concurrency for 131,072 tokens per request: 8.40x
(EngineCore_DP0 pid=242581) 2025-10-27 19:19:18,317 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=242581) 2025-10-27 19:19:19,048 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 81/81 [00:12<00:00,  6.57it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:03<00:00, 10.66it/s]
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:35 [gpu_model_runner.py:3848] Graph capturing finished in 16 secs, took 0.90 GiB
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:35 [core.py:243] init engine (profile, create kv cache, warmup model) took 31.43 seconds
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:37 [gc_utils.py:40] GC Debug Config. enabled:False,top_objects:-1
(APIServer pid=242516) INFO 10-27 19:19:37 [loggers.py:209] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 69973
(APIServer pid=242516) INFO 10-27 19:19:39 [api_server.py:1639] Supported tasks: ['generate']
(APIServer pid=242516) WARNING 10-27 19:19:40 [serving_responses.py:169] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=242516) INFO 10-27 19:19:41 [api_server.py:1955] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:38] Available routes are:
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=242516) INFO: Started server process [242516]
(APIServer pid=242516) INFO: Waiting for application startup.
(APIServer pid=242516) INFO: Application startup complete.
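
With the server up, here is a quick sanity check against the OpenAI-compatible endpoint (a minimal sketch; it assumes the default port 8000 shown in the log and no API key configured):

$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "openai/gpt-oss-120b",
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "max_tokens": 64
        }'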