Run VLLM in Spark

You have to uninstall triton_kernels. In my case:

rm -rf /home/spark/ray_test/.venv/lib/python3.12/site-packages/triton_kernels
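If you'd rather not rm -rf inside site-packages by hand, uninstalling the package should have the same effect (the import check afterwards is just to confirm it's gone):

uv pip uninstall triton-kernels
python -c "import triton_kernels"   # should now fail with ModuleNotFoundError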

I was using these wheels, which I built myself. I know the issue comes from the CMakeLists.txt; it should be fixed in the next release of vLLM (it is built from main): Release vllm Β· johnnynunez/pypi Β· GitHub

I made similar modifications based on this: GitHub - yvbbrjdr/triton at spark
It should be fixed in Triton's main branch.

I had this issue:

triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error

But after I uninstalled the kernels by running uv pip uninstall triton-kernels and then compiled Triton from the main branch, I'm getting the same error as the poster above.

I did uninstall with:

uv pip uninstall triton-kernels

I’m trying your triton repo now; and will report back.

Validated it was deleted:

ls .vllm/lib/python3.12/site-packages/triton_kernel
ls: cannot access '.vllm/lib/python3.12/site-packages/triton_kernel': No such file or directory

Just to update the thread!

SUCCESS: The two wheel packages you posted worked. (Now I'll try with the actual parameters I wanted to set.)

Things I tried that did not work:

  • Built your Triton fork and then rebuilt vLLM
  • Updated vLLM in the Docker image above with that build
  • Built main of both Triton and vLLM (as of today)
  • Deleted triton_kernels

I attempted over two dozen different combinations without any success. Even pinning different commit hashes for both Triton and vLLM didn't yield any results.

$ uv run vllm serve openai/gpt-oss-120b
INFO 10-27 19:12:37 [__init__.py:225] Automatically detected platform cuda.
(APIServer pid=242516) INFO 10-27 19:12:40 [api_server.py:1886] vLLM API server version 0.12.0
(APIServer pid=242516) INFO 10-27 19:12:40 [utils.py:243] non-default args: {'model_tag': 'openai/gpt-oss-120b', 'model': 'openai/gpt-oss-120b'}
(APIServer pid=242516) INFO 10-27 19:12:41 [model.py:663] Resolved architecture: GptOssForCausalLM
Parse safetensors files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15/15 [00:00<00:00, 15.65it/s]
(APIServer pid=242516) INFO 10-27 19:12:42 [model.py:1751] Using max model len 131072
(APIServer pid=242516) INFO 10-27 19:12:43 [scheduler.py:225] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=242516) INFO 10-27 19:12:43 [config.py:274] Overriding max cuda graph capture size to 992 for performance.
INFO 10-27 19:12:45 [__init__.py:225] Automatically detected platform cuda.
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:46 [core.py:730] Waiting for init message from front-end.
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:46 [core.py:97] Initializing a V1 LLM engine (v0.12.0) with config: model='openai/gpt-oss-120b', speculative_config=None, tokenizer='openai/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=openai/gpt-oss-120b, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [992, 976, 960, 944, 928, 912, 896, 880, 864, 848, 832, 816, 800, 784, 768, 752, 736, 720, 704, 688, 672, 656, 640, 624, 608, 592, 576, 560, 544, 528, 512, 496, 480, 464, 448, 432, 416, 400, 384, 368, 352, 336, 320, 304, 288, 272, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_capture_size': 992, 'local_cache_dir': None}
(EngineCore_DP0 pid=242581) /home/wade/.vllm/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning: 
(EngineCore_DP0 pid=242581)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore_DP0 pid=242581)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore_DP0 pid=242581)     (8.0) - (12.0)
(EngineCore_DP0 pid=242581)     
(EngineCore_DP0 pid=242581)   warnings.warn(
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:52 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:52 [gpu_model_runner.py:2860] Starting to load model openai/gpt-oss-120b...
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:53 [cuda.py:398] Using Triton backend on V1 engine.
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:53 [mxfp4.py:126] Using Marlin backend
(EngineCore_DP0 pid=242581) INFO 10-27 19:12:56 [weight_utils.py:419] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:25<06:03, 25.95s/it]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:50<05:25, 25.05s/it]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [01:17<05:09, 25.78s/it]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [01:43<04:45, 25.99s/it]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [02:09<04:21, 26.14s/it]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [02:35<03:54, 26.07s/it]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [03:00<03:24, 25.61s/it]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [03:19<02:44, 23.52s/it]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [03:37<02:10, 21.83s/it]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [04:04<01:56, 23.34s/it]
Loading safetensors checkpoint shards:  73% Completed | 11/15 [04:29<01:35, 23.79s/it]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [04:52<01:11, 23.82s/it]
Loading safetensors checkpoint shards:  87% Completed | 13/15 [05:19<00:49, 24.68s/it]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [05:37<00:22, 22.65s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [06:01<00:00, 23.12s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [06:01<00:00, 24.12s/it]
(EngineCore_DP0 pid=242581) 
(EngineCore_DP0 pid=242581) INFO 10-27 19:18:58 [default_loader.py:314] Loading weights took 361.90 seconds
(EngineCore_DP0 pid=242581) WARNING 10-27 19:18:58 [marlin_utils_fp4.py:204] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:04 [gpu_model_runner.py:2921] Model loading took 65.9651 GiB and 371.103032 seconds
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:08 [backends.py:609] Using cache directory: /home/wade/.cache/vllm/torch_compile_cache/ad8c474a62/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:08 [backends.py:623] Dynamo bytecode transform time: 3.83 s
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:11 [backends.py:207] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.066 s
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:12 [monitor.py:34] torch.compile takes 6.89 s in total
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:13 [gpu_worker.py:316] Available KV cache memory: 38.44 GiB
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:13 [kv_cache_utils.py:1201] GPU KV cache size: 559,776 tokens
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:13 [kv_cache_utils.py:1206] Maximum concurrency for 131,072 tokens per request: 8.40x
(EngineCore_DP0 pid=242581) 2025-10-27 19:19:18,317 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=242581) 2025-10-27 19:19:19,048 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 81/81 [00:12<00:00,  6.57it/s]
Capturing CUDA graphs (decode, FULL): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 35/35 [00:03<00:00, 10.66it/s]
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:35 [gpu_model_runner.py:3848] Graph capturing finished in 16 secs, took 0.90 GiB
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:35 [core.py:243] init engine (profile, create kv cache, warmup model) took 31.43 seconds
(EngineCore_DP0 pid=242581) INFO 10-27 19:19:37 [gc_utils.py:40] GC Debug Config. enabled:False,top_objects:-1
(APIServer pid=242516) INFO 10-27 19:19:37 [loggers.py:209] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 69973
(APIServer pid=242516) INFO 10-27 19:19:39 [api_server.py:1639] Supported tasks: ['generate']
(APIServer pid=242516) WARNING 10-27 19:19:40 [serving_responses.py:169] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=242516) INFO 10-27 19:19:41 [api_server.py:1955] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:38] Available routes are:
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=242516) INFO 10-27 19:19:41 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=242516) INFO:     Started server process [242516]
(APIServer pid=242516) INFO:     Waiting for application startup.
(APIServer pid=242516) INFO:     Application startup complete.

Did you build on Spark or on some other machine targeting Spark? I’m just wondering why none of our builds work with gpt-oss-120b and yours does.

I find this interesting:

[marlin_utils_fp4.py:204] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.

GB10 supports FP4 natively, so why is that?
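For what it's worth, the warning earlier in the log shows this PyTorch build only claims support up to capability 12.0 while GB10 reports 12.1, so my guess (pure speculation) is that the Marlin fallback is keyed off the reported compute capability rather than the actual hardware. You can check what PyTorch sees with:

python -c "import torch; print(torch.cuda.get_device_capability())"   # GB10 reports (12, 1)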

Yeah, I was able to run gpt-oss using his build as well. It runs much slower than via llama.cpp, though: I'm getting ~35 t/s inference on vLLM and 56 t/s on llama.cpp.

But I couldn't run Qwen3-VL with his build; I'm getting a lot of errors like this:

(EngineCore_DP0 pid=16210) [rank0]:W1028 00:14:08.419000 16210 torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(
(EngineCore_DP0 pid=16210) [rank0]:W1028 00:14:08.419000 16210 torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]                                         ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=16210) [rank0]:W1028 00:14:08.419000 16210 torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/home/eugr/vllm-0.12/.venv/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 419, in generate_ttir

It works with my build just fine, though. I guess I'll stick with mine, as I don't really need vLLM for gpt-oss-120b for now.

We’re on it and will get back.


[quote=β€œeugr, post:5, topic:348862”]
Triton
[/quote]

Compiling this thing is really slow.

Yeah, it takes forever on this hardware…

So, a couple of updates:

  1. It looks like gpt-oss-120b works with the new Triton compiled from the main branch as of this afternoon (it wasn't even compiling this morning). I still had to remove the kernels; I tried installing the kernels from the main branch, and they fail with errors.
  2. My patch is still needed.
  3. IMPORTANT: the initial instructions have the order slightly wrong. You need to set these two variables BEFORE compiling vLLM, otherwise CMake won't compile the FP4 kernels (see the sketch below the exports):
export TORCH_CUDA_ARCH_LIST=12.1a # Spark 12.1, 12.0f, 12.1a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
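
To make the ordering concrete, here is a minimal sketch of what I mean. The Triton step is whatever build you're using, the checkout path is just an example, and only the exports and uv commands come from this thread:

export TORCH_CUDA_ARCH_LIST=12.1a # Spark 12.1, 12.0f, 12.1a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# (build and install Triton from its main branch here, per the Triton README)

uv pip uninstall triton-kernels   # still required, see point 1

cd ~/vllm                         # assumed checkout location
uv pip install --no-build-isolation -e . --prerelease=allow   # CMake now builds the FP4 kernels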

Having said that, gpt-oss-120b performance is meh in vLLM: I'm getting 37 tokens/s with vLLM, while with llama.cpp I get 56 tokens/s!

May I know how you installed llama.cpp on DGX-Spark?

Compiled from source: llama.cpp/docs/build.md at master Β· ggml-org/llama.cpp Β· GitHub

Just follow the instructions there; it's very easy on Spark.
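
Roughly, from memory (the exact options are in that build.md, so treat this as a sketch rather than the canonical steps):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j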

The instructions given by @johnny_nv worked, with one minor modification. I ran into this error:

(.vllm) root@spark-bec1:~/vllm# uv pip install --no-build-isolation -e .
Using Python 3.12.3 environment at: /root/.vllm
  Γ— No solution found when resolving dependencies:
  ╰─▢ Because there is no version of apache-tvm-ffi==0.1.0b15 and flashinfer-python==0.4.1 depends on apache-tvm-ffi==0.1.0b15, we can conclude that flashinfer-python==0.4.1 cannot be used.
      And because vllm==0.11.1rc6.dev68+g55011aef2.d20251103.cu130 depends on flashinfer-python==0.4.1, we can conclude that vllm==0.11.1rc6.dev68+g55011aef2.d20251103.cu130 cannot be used.
      And because only vllm==0.11.1rc6.dev68+g55011aef2.d20251103.cu130 is available and you require vllm, we can conclude that your requirements are unsatisfiable.

      hint: `apache-tvm-ffi` was requested with a pre-release marker (e.g., apache-tvm-ffi==0.1.0b15), but pre-releases weren't enabled (try: `--prerelease=allow`)

So I had to run uv pip install --no-build-isolation with the --prerelease=allow flag:

uv pip install --no-build-isolation -e . --prerelease=allow

Yes, this is currently broken in the main branch.

@johnny_nv @eugr I am now running into the following errors:

DEBUG /root/vllm/.deps/vllm-flash-attn-src/csrc/cutlass/include/cutlass/platform/platform.h:623:33: warning: β€˜ulonglong4’ is deprecated: use ulonglong4_16a or ulonglong4_32a [-Wdeprecated-declarations]
DEBUG   623 | struct alignment_of<ulonglong4> {
DEBUG       |                                 ^
DEBUG /usr/local/cuda/include/vector_types.h:551:113: note: declared here
DEBUG   551 | typedef __device_builtin__ struct ulonglong4 __VECTOR_TYPE_DEPRECATED__("use ulonglong4_16a or ulonglong4_32a") ulonglong4;
DEBUG       |                                                                                                                 ^~~~~~~~~~
DEBUG /root/vllm/.deps/vllm-flash-attn-src/csrc/cutlass/include/cutlass/platform/platform.h:627:33: warning: β€˜double4’ is deprecated: use double4_16a or double4_32a [-Wdeprecated-declarations]
DEBUG   627 | struct alignment_of<double4> {
DEBUG       |                                 ^
DEBUG /usr/local/cuda/include/vector_types.h:561:104: note: declared here
DEBUG   561 | typedef __device_builtin__ struct double4 __VECTOR_TYPE_DEPRECATED__("use double4_16a or double4_32a") double4;
DEBUG       |                                                                                                        ^~~~~~~

And the top command shows the following output:

Is there something I am missing? I am going to try: Run VLLM in Spark - #10 by eugr and see if it will work with the patch!

That did not seem to work. The only thing that worked was:

VLLM_USE_PRECOMPILED=1 uv pip install --editable . --prerelease=allow --torch-backend cu129

I also found that uv pip install --editable . --prerelease=allow works, but it takes a very long time.

Yes, compilation takes about an hour and generates quite a few warnings, but it will actually work.
If you install a pre-compiled one, it will not work (at least it wasn’t working as of last week).


vllm-qwen3 | ==========
vllm-qwen3 | == vLLM ==
vllm-qwen3 | ==========
vllm-qwen3 |
vllm-qwen3 | NVIDIA Release 25.09 (build 214638690)
vllm-qwen3 | vLLM Version 0.10.1.1+381074ae
vllm-qwen3 | Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
vllm-qwen3 | Copyright (c) 2014-2024 Facebook Inc.
vllm-qwen3 | Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
vllm-qwen3 | Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
vllm-qwen3 | Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
vllm-qwen3 | Copyright (c) 2011-2013 NYU (Clement Farabet)
vllm-qwen3 | Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
vllm-qwen3 | Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
vllm-qwen3 | Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
vllm-qwen3 | Copyright (c) 2015 Google Inc.
vllm-qwen3 | Copyright (c) 2015 Yangqing Jia
vllm-qwen3 | Copyright (c) 2013-2016 The Caffe contributors
vllm-qwen3 | All rights reserved.
vllm-qwen3 |
vllm-qwen3 | Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
vllm-qwen3 |
vllm-qwen3 | GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
vllm-qwen3 | (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
vllm-qwen3 | and the Product-Specific Terms for NVIDIA AI Products
vllm-qwen3 | (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).
vllm-qwen3 |
vllm-qwen3 | /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
vllm-qwen3 |   import pynvml  # type: ignore[import]
vllm-qwen3 | Traceback (most recent call last):
vllm-qwen3 |   File "<frozen runpy>", line 198, in _run_module_as_main
vllm-qwen3 |   File "<frozen runpy>", line 88, in _run_code
vllm-qwen3 |   File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 42, in <module>
vllm-qwen3 |     from vllm.engine.arg_utils import AsyncEngineArgs
vllm-qwen3 |   File "/workspace/vllm/vllm/engine/arg_utils.py", line 83, in <module>
vllm-qwen3 |     from vllm.reasoning import ReasoningParserManager
vllm-qwen3 |   File "/workspace/vllm/vllm/reasoning/__init__.py", line 5, in <module>
vllm-qwen3 |     from .basic_parsers import BaseThinkingReasoningParser
vllm-qwen3 |   File "/workspace/vllm/vllm/reasoning/basic_parsers.py", line 7, in <module>
vllm-qwen3 |     from vllm.entrypoints.openai.protocol import (
vllm-qwen3 |   File "/workspace/vllm/vllm/entrypoints/openai/protocol.py", line 54, in <module>
vllm-qwen3 |     from vllm.utils.serial_utils import (
vllm-qwen3 |   File "/workspace/vllm/vllm/utils/serial_utils.py", line 12, in <module>
vllm-qwen3 |     from vllm import PoolingRequestOutput
vllm-qwen3 | ImportError: cannot import name 'PoolingRequestOutput' from 'vllm' (unknown location)

I followed the steps but I get an import error. Any ideas?

If the Docker image builds without errors, then it's likely related to options you pass to vLLM that pull in a dependency which isn't satisfied.

I'm looking to run the most recent version of vLLM inside a container on Spark and have been following this thread. It looks like we have a solution for getting vLLM working directly on Spark (no container) and a solution that hacks the existing NVIDIA container (though I'm not 100% clear on the steps for that).

Is it possible to build a clean working container from, say, nvidia/cuda:13.0.0-devel-ubuntu22.04? Anyone got example Dockerfiles?

Note that I'm interested in running this the same way as the official vLLM container, so that I can interact with vLLM inside the container via the API.
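
For concreteness, here is roughly the kind of Dockerfile I'm imagining. It's an untested sketch: the base image, Python packaging, and torch backend are guesses, and the build steps are just the bare-metal steps from earlier in this thread moved into a container:

# UNTESTED sketch -- base image, Python setup, and torch backend are assumptions
FROM nvidia/cuda:13.0.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y git python3 python3-venv python3-pip && \
    rm -rf /var/lib/apt/lists/* && pip3 install uv

# must be exported before vLLM is compiled, so CMake builds the FP4 kernels
ENV TORCH_CUDA_ARCH_LIST=12.1a
ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
ENV VIRTUAL_ENV=/opt/venv
ENV PATH=/opt/venv/bin:$PATH

RUN uv venv /opt/venv
WORKDIR /opt
RUN git clone https://github.com/vllm-project/vllm.git
WORKDIR /opt/vllm

# install build prerequisites (including torch) first, then compile vLLM itself;
# expect this layer to take an hour or more on Spark
RUN uv pip install torch cmake ninja packaging setuptools setuptools-scm wheel --torch-backend cu129 && \
    uv pip install --no-build-isolation -e . --prerelease=allow

EXPOSE 8000
ENTRYPOINT ["vllm", "serve"]

The idea would be to run it like the official container, e.g. docker run --gpus all -p 8000:8000 <image> openai/gpt-oss-120b, but again, I haven't verified any of this on Spark.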