HOW-TO: Run Qwen3-Coder-Next on Spark

Nope, reverting the fastsafetensors patch didn’t help either. It looks like a bug in the custom Triton code used by this model that only manifests when running in a Ray environment, and possibly only on DGX Spark. This code also gets executed regardless of the attention or MoE backend.

I’ll probably open an issue in vLLM for that if I don’t forget - can’t spend any more time on this model now…

BTW, just merged that PR. We will work on populating the recipes; right now there are only a few of them there.

unsloth has a new dynamic one:

I did a quick run (single Spark):

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | pp2048 | 2441.67 ± 0.00 | 930.05 ± 0.00 | 838.77 ± 0.00 | 930.15 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | tg128 | 32.07 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_pp @ d4096 | 2216.34 ± 0.00 | 1939.37 ± 0.00 | 1848.09 ± 0.00 | 1939.47 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_tg @ d4096 | 31.81 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | pp2048 @ d4096 | 1759.44 ± 0.00 | 1255.29 ± 0.00 | 1164.01 ± 0.00 | 1255.38 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | tg128 @ d4096 | 31.46 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_pp @ d8192 | 2432.24 ± 0.00 | 3459.38 ± 0.00 | 3368.09 ± 0.00 | 3459.48 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_tg @ d8192 | 31.15 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | pp2048 @ d8192 | 2260.20 ± 0.00 | 997.40 ± 0.00 | 906.12 ± 0.00 | 997.48 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | tg128 @ d8192 | 30.82 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_pp @ d16384 | 2436.46 ± 0.00 | 6815.80 ± 0.00 | 6724.51 ± 0.00 | 6815.86 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | ctx_tg @ d16384 | 30.11 ± 0.00 | | | |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | pp2048 @ d16384 | 1926.10 ± 0.00 | 1154.57 ± 0.00 | 1063.29 ± 0.00 | 1154.65 ± 0.00 |
| unsloth/Qwen3-Coder-Next-FP8-Dynamic | tg128 @ d16384 | 29.91 ± 0.00 | | | |
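
For anyone reading these columns for the first time, here is a small sketch of how they appear to relate. It assumes (my reading of the numbers, not something stated by the tool's author in this thread) that pp2048 means 2048 prompt tokens, that ctx_pp @ d4096 means 4096 context tokens, and that t/s for prefill rows is tokens divided by est_ppt. The helper name `prefill_rate` is made up for illustration.

```python
def prefill_rate(prompt_tokens: int, est_ppt_ms: float) -> float:
    """Prefill throughput in tokens/s from the estimated prompt processing time."""
    return prompt_tokens / (est_ppt_ms / 1000.0)


# (test, assumed token count, ttfr ms, est_ppt ms, reported t/s) copied from the table.
rows = [
    ("pp2048", 2048, 930.05, 838.77, 2441.67),
    ("ctx_pp @ d4096", 4096, 1939.37, 1848.09, 2216.34),
]

for test, tokens, ttfr_ms, est_ppt_ms, reported in rows:
    rate = prefill_rate(tokens, est_ppt_ms)
    # The gap between ttfr and est_ppt looks like a fixed per-request overhead.
    overhead_ms = ttfr_ms - est_ppt_ms
    print(f"{test}: computed {rate:.2f} t/s, reported {reported} t/s, "
          f"overhead {overhead_ms:.2f} ms")
```

If that reading is right, the ttfr column is est_ppt plus a roughly constant latency, so est_ppt is the better column to compare across builds.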

Interesting, it performs slower than the official FP8 version.

I’ve been testing Qwen3-Coder-Next and it works really well overall. In particular, OpenClaw has been very useful — on a single node it honestly feels like it flies.

It would be very interesting to see how it performs on two nodes and how it scales compared to a single Spark setup. If anyone has already tested it in a multi-node configuration, I’d be curious to hear about the results or setup details.

Thanks for posting this one, I’m interested in testing out the model quality. I’m seeing similar performance, but here are the results up to 100K context. I’m using your eugr/spark-vllm-docker build from GitHub (Docker configuration for running vLLM on dual DGX Sparks), rebuilt today:

| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 | 3396.60 ± 76.40 | 684.18 ± 13.45 | 603.26 ± 13.45 | 684.30 ± 13.43 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 | 43.98 ± 0.15 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d4096 | 3217.89 ± 119.05 | 1355.59 ± 48.42 | 1274.67 ± 48.42 | 1355.73 ± 48.39 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d4096 | 43.31 ± 0.04 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d4096 | 2580.88 ± 44.90 | 874.69 ± 13.93 | 793.77 ± 13.93 | 874.80 ± 13.94 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d4096 | 42.90 ± 0.16 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d8192 | 3532.87 ± 27.19 | 2399.85 ± 17.79 | 2318.93 ± 17.79 | 2400.00 ± 17.81 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d8192 | 42.45 ± 0.02 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d8192 | 3013.17 ± 133.09 | 761.96 ± 30.81 | 681.04 ± 30.81 | 762.10 ± 30.85 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d8192 | 42.10 ± 0.04 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d16384 | 3391.03 ± 2.93 | 4912.50 ± 4.17 | 4831.58 ± 4.17 | 4912.65 ± 4.16 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d16384 | 40.80 ± 0.07 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d16384 | 2846.79 ± 46.02 | 800.51 ± 11.50 | 719.59 ± 11.50 | 800.61 ± 11.49 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d16384 | 38.28 ± 2.93 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d32768 | 3137.26 ± 13.34 | 10525.78 ± 44.39 | 10444.86 ± 44.39 | 10525.91 ± 44.39 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d32768 | 37.96 ± 0.06 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d32768 | 1973.59 ± 466.17 | 1193.09 ± 315.29 | 1112.17 ± 315.29 | 1193.20 ± 315.28 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d32768 | 37.52 ± 0.05 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d65535 | 2754.67 ± 5.44 | 23871.52 ± 46.98 | 23790.60 ± 46.98 | 23871.65 ± 46.98 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d65535 | 33.37 ± 0.10 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d65535 | 1592.82 ± 16.47 | 1366.82 ± 13.21 | 1285.91 ± 13.21 | 1366.92 ± 13.23 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d65535 | 33.14 ± 0.11 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_pp @ d100000 | 2410.39 ± 5.73 | 41568.30 ± 98.69 | 41487.38 ± 98.69 | 41568.49 ± 98.66 |
| Qwen/Qwen3-Coder-Next-FP8 | ctx_tg @ d100000 | 29.63 ± 0.06 | | | |
| Qwen/Qwen3-Coder-Next-FP8 | pp2048 @ d100000 | 1189.18 ± 21.12 | 1803.66 ± 30.22 | 1722.74 ± 30.22 | 1803.77 ± 30.21 |
| Qwen/Qwen3-Coder-Next-FP8 | tg32 @ d100000 | 29.41 ± 0.10 | | | |

llama-benchy (0.1.1)
date: 2026-02-05 01:03:20 | latency mode: generation

Thanks for the post and the GitHub repo for the vLLM container. I got this model working on a single Spark machine. How do I measure performance in tokens/s? The server logs show different tokens/s figures for a task I gave it. Also, does anyone know the average tokens/s that Claude Code with Opus gets through the API?
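
On the "how do I measure tokens/s" part, one way to think about it is to record when the request was sent and when each streamed token arrived, then derive the usual metrics from those timestamps. The sketch below only does the arithmetic; `StreamStats` and `summarize_stream` are hypothetical names, and collecting the timestamps from a streaming client is left out. Note that a streamed chunk is not guaranteed to be exactly one token, so treat the result as an approximation.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    ttft_ms: float           # time to first token
    tpot_ms: float           # mean time per output token, excluding the first
    decode_tok_s: float      # generation speed once tokens are flowing
    end_to_end_tok_s: float  # tokens divided by total wall time, prefill included


def summarize_stream(request_start: float, token_times: list[float]) -> StreamStats:
    """Derive latency and throughput from arrival timestamps given in seconds."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute a decode rate")
    n = len(token_times)
    ttft = token_times[0] - request_start
    tpot = (token_times[-1] - token_times[0]) / (n - 1)
    total = token_times[-1] - request_start
    return StreamStats(ttft * 1000.0, tpot * 1000.0, 1.0 / tpot, n / total)


# Synthetic example: first token after 0.5 s, then one token every 25 ms.
stats = summarize_stream(0.0, [0.5 + 0.025 * i for i in range(5)])
print(stats)
```

The server log figures are averages over a logging interval that can include idle time, which is one reason they may disagree with a per-request measurement like this.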

FYI: I submitted a bug to the vLLM team: "[Bug]: Qwen3-Coder-Next fails with Triton allocator error on DGX Spark cluster (GB10, sm121)", issue #33857 in vllm-project/vllm on GitHub.

Looks great @eugr. Good work.

Is it possible to add --load-format to the list of possible overrides in recipes?

I can never get fastsafetensors to work. Is there something I am missing there?

I always get the UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True error

Also, I owe you a beer. The --eth-if and --ib-if options saved my life. I have another subnet between my PC and the Sparks and couldn’t get anything to load. Once I figured out I could plug those variables in, it was a huge weight off my shoulders. Appreciate it!

I’m going to try to cluster my Threadripper PC with 2x 5090 with the 2x Sparks. It only has a 100Gb ConnectX-5 though, so I am not sure if it has the juice.

Does the model load? This message is normal and expected on Spark as it doesn’t support GDS. Even without GDS, fastsafetensors are much faster.

Yeah, it’s a good idea, can you open an issue in the tracker, so we don’t forget?

Successful requests: 1
Failed requests: 0
Benchmark duration (s): 5.91
Total input tokens: 12
Total generated tokens: 119
Request throughput (req/s): 0.17
Output token throughput (tok/s): 20.15
Peak output token throughput (tok/s): 38.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 22.18
---------------Time to First Token----------------
Mean TTFT (ms): 2588.46
Median TTFT (ms): 2588.46
P99 TTFT (ms): 2588.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 28.11
Median TPOT (ms): 28.11
P99 TPOT (ms): 28.11
---------------Inter-token Latency----------------
Mean ITL (ms): 28.11
Median ITL (ms): 26.73
P99 ITL (ms): 79.64

Hi, today I built vLLM from the main branch and tested the non-quantized version of qwen3-coder-next on a dual Spark setup. I got the following throughput results. Compared to the benchmark results shared earlier, my output token throughput is significantly lower. Do you think there might be an issue with my setup, or is this simply due to the precision difference?
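
A quick sanity check on the figures in that report may help with the comparison. The sketch below uses only numbers from the benchmark output (119 generated tokens, 5.91 s duration, 2588.46 ms TTFT, 28.11 ms TPOT) and shows that the headline output throughput includes the time to first token, while the steady decode rate is noticeably higher. Whether the earlier tg figures in this thread are decode-only is my assumption.

```python
generated_tokens = 119
duration_s = 5.91
ttft_ms = 2588.46
tpot_ms = 28.11

# Headline number: all generated tokens over the whole request, prefill included.
output_tok_s = generated_tokens / duration_s

# Steady-state decode rate once the first token has arrived.
decode_tok_s = 1000.0 / tpot_ms

# The request time can be rebuilt from TTFT plus one TPOT per remaining token.
reconstructed_ms = ttft_ms + (generated_tokens - 1) * tpot_ms
ttft_share = ttft_ms / reconstructed_ms

print(f"output throughput: {output_tok_s:.2f} tok/s")
print(f"decode rate:       {decode_tok_s:.2f} tok/s")
print(f"rebuilt duration:  {reconstructed_ms / 1000.0:.2f} s")
print(f"TTFT share:        {ttft_share:.0%}")
```

With a 12-token prompt, a TTFT of about 2.6 s is itself unusually long, so the low headline figure is as much a first-token latency problem as a decode speed problem.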

Can you post vllm serve command here?

vllm serve /workspace/Model/Qwen3-Coder-Next \
  --host 0.0.0.0 --port 8000 \
  --distributed-executor-backend ray \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --max-model-len 262144 \
  --max-num-seqs 100

"Failed to import Triton kernels. Please make sure your Triton version is compatible. Error: cannot import name 'SparseMatrix' from 'triton_kernels.tensor' (/usr/local/lib/python3.12/dist-packages/triton_kernels/tensor.py)."
This is the first Triton kernel failure message that appears when running the model.
I’m using Triton 3.5.1, PyTorch 2.9.1, and vLLM from the main branch.

Haven’t tried that one, but given that our benchmarks are for the FP8 version, BF16 would be about 2x slower. Since we are getting ~43 t/s on a single machine with the FP8 one, your clustered setup should give at least 30 t/s.

However, I’m surprised that it works, as the FP8 version fails in the cluster. It probably skips Triton altogether for BF16.

The Triton message you are getting is because vLLM moved to Triton 3.6.0, so you need to build with that one. They also moved to PyTorch 2.10.
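
Before rebuilding, it can be worth confirming which versions are actually installed in the environment that runs vLLM. This is a minimal sketch using only the standard library; the function name `installed_versions` is made up, and the distribution names are assumptions (some builds ship Triton under a different distribution name, in which case it shows up as missing here).

```python
from importlib import metadata


def installed_versions(packages=("vllm", "torch", "triton")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Run it inside the container or virtualenv that launches `vllm serve`, since the host Python may report different packages.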

As usual, the easiest way to get an optimal cluster configuration is to use our community Docker build, eugr/spark-vllm-docker on GitHub (Docker configuration for running vLLM on dual DGX Sparks).

I built PyTorch 2.10.0 from source about 2–3 days ago, and while the model loads fine, inference keeps crashing or hanging afterward. Do you know if this has been fixed recently? Also, as you mentioned, with PyTorch 2.10.0, Qwen3-Coder-Next does not load on the cluster.

For context, I didn’t install vLLM from a wheel—I cloned the repo and built it from source.

I now use the NGC PyTorch container as my base; it includes torch 2.10 with Spark support. The FP8 model works fine on it, but only on a single machine. I don’t remember any crashes.

This is actually interesting, let me try with some of my older builds that use pytorch 2.9.1.

I’ve got Qwen/Qwen3-Coder-Next-FP8 running successfully on a 2-node DGX Spark
Ray cluster (TP=2) with vLLM v0.15.1 — without --enforce-eager.

The Triton allocator error (RuntimeError: Kernel requires a runtime memory
allocation, but no allocator was set) is caused by Triton 3.5.1’s _allocator
being a ContextVar. When you call triton.set_allocator(), the value is only
set in the current context — it doesn’t propagate to Ray worker threads.

The fix is to monkey-patch NullAllocator.__call__ directly at the class level,
which works across all threads regardless of ContextVar state. Create these
two files in your site-packages/ on both nodes:

_triton_alloc_setup.py:

  try:
      import triton.runtime._allocation as _alloc
      import torch

      _alloc.NullAllocator.__call__ = staticmethod(
          lambda size, alignment, stream:
              torch.cuda.caching_allocator_alloc(size, stream=stream))
  except Exception:
      pass

_triton_alloc_setup.pth:

  import _triton_alloc_setup

The .pth file ensures the patch is applied automatically during Python site
initialization, before vLLM or Ray starts.

You can verify it’s working with:
python -c "import triton.runtime._allocation as a; print(a.NullAllocator.__call__)"
It should print <function <lambda> …> (the patched lambda) instead of the original NullAllocator.__call__ method.

Benchmark result (single request, 1024 input tokens, 128 output tokens):

Output token throughput (tok/s): 50.64
Mean TTFT (ms): 157.96
Mean TPOT (ms): 18.66

  ============ Serving Benchmark Result ============                            
  Successful requests:                     1                                    
  Failed requests:                         0                                    
  Benchmark duration (s):                  2.53                                 
  Total input tokens:                      1024                                 
  Total generated tokens:                  128                                  
  Request throughput (req/s):              0.40                                 
  Output token throughput (tok/s):         50.64                            
  Peak output token throughput (tok/s):    54.00                            
  Peak concurrent requests:                1.00                             
  Total token throughput (tok/s):          455.77                           
  ---------------Time to First Token----------------                        
  Mean TTFT (ms):                          157.96                           
  Median TTFT (ms):                        157.96                           
  P99 TTFT (ms):                           157.96                           
  -----Time per Output Token (excl. 1st token)------                        
  Mean TPOT (ms):                          18.66                            
  Median TPOT (ms):                        18.66                            
  P99 TPOT (ms):                           18.66                            
  ---------------Inter-token Latency----------------                        
  Mean ITL (ms):                           18.66                            
  Median ITL (ms):                         18.52                            
  P99 ITL (ms):                            20.69                            
  ==================================================                        

thanks.

I ran this on an ASUS GX10:

❯ ./launch-cluster.sh --solo \
exec vllm serve Qwen/Qwen3-Coder-Next-FP8 \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder \
        --gpu-memory-utilization 0.8 \
        --host 0.0.0.0 --port 8888 \
        --load-format fastsafetensors \
        --attention-backend flashinfer \
        --enable-prefix-caching
Solo mode enabled. Skipping node detection.
Head Node: 127.0.0.1
Worker Nodes:
Container Name: vllm_node
Image Name: vllm-node
Action: exec
Starting Head Node on 127.0.0.1...
f9be5655dd7d9c159a39729e4974a696be7ed2899360f76c785cfe35da872b1e
Solo mode active: Skipping Ray cluster readiness check.
Executing command on head node: vllm serve Qwen/Qwen3-Coder-Next-FP8 --enable-auto-tool-choice --tool-call-parser qwen3_coder --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8888 --load-format fastsafetensors --attention-backend flashinfer --enable-prefix-caching
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287]
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287]        █     █     █▄   ▄█
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.16.0rc2.dev126+gb96f7314b.d20260212
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287]   █▄█▀ █     █     █     █  model   Qwen/Qwen3-Coder-Next-FP8
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287]
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:223] non-default args: {'model_tag': 'Qwen/Qwen3-Coder-Next-FP8', 'host': '0.0.0.0', 'port': 8888, 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'Qwen/Qwen3-Coder-Next-FP8', 'load_format': 'fastsafetensors', 'attention_backend': 'flashinfer', 'gpu_memory_utilization': 0.8, 'enable_prefix_caching': True}
(APIServer pid=116) WARNING 02-12 08:20:02 [envs.py:1625] Unknown vLLM environment variable detected: VLLM_BASE_DIR
(APIServer pid=116) INFO 02-12 08:20:04 [model.py:531] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=116) INFO 02-12 08:20:04 [model.py:1555] Using max model len 262144
(APIServer pid=116) INFO 02-12 08:20:05 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=116) WARNING 02-12 08:20:05 [config.py:337] Mamba cache mode is set to 'align' for Qwen3NextForCausalLM by default when prefix caching is enabled
(APIServer pid=116) INFO 02-12 08:20:05 [config.py:361] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=116) INFO 02-12 08:20:05 [config.py:504] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=116) INFO 02-12 08:20:05 [config.py:535] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=116) WARNING 02-12 08:20:05 [vllm.py:689] Async scheduling is not compatible with prefix caching for Mamba models and will be disabled.
(APIServer pid=116) INFO 02-12 08:20:05 [vllm.py:698] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=256) INFO 02-12 08:20:12 [core.py:97] Initializing a V1 LLM engine (v0.16.0rc2.dev126+gb96f7314b.d20260212) with config: model='Qwen/Qwen3-Coder-Next-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Coder-Next-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-Coder-Next-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=256) INFO 02-12 08:20:12 [parallel_state.py:1246] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://28.0.0.1:53493 backend=nccl
[W212 08:20:22.751247172 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3
(EngineCore_DP0 pid=256) INFO 02-12 08:20:22 [parallel_state.py:1474] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=256) INFO 02-12 08:20:23 [gpu_model_runner.py:4124] Starting to load model Qwen/Qwen3-Coder-Next-FP8...
(EngineCore_DP0 pid=256) INFO 02-12 08:20:33 [fp8.py:338] Using TRITON Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'BATCHED_DEEPGEMM', 'TRITON', 'BATCHED_TRITON', 'MARLIN', 'XPU'].
(EngineCore_DP0 pid=256) INFO 02-12 08:20:34 [cuda.py:331] Using AttentionBackendEnum.FLASHINFER backend.
Loading safetensors using Fastsafetensor loader:   0% Completed | 0/40 [00:00<?, ?it/s]
(EngineCore_DP0 pid=256) /usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=256)   warnings.warn(
Loading safetensors using Fastsafetensor loader:   2% Completed | 1/40 [00:02<01:42,  2.62s/it]
Loading safetensors using Fastsafetensor loader:   5% Completed | 2/40 [00:04<01:19,  2.10s/it]
Loading safetensors using Fastsafetensor loader:   8% Completed | 3/40 [00:06<01:14,  2.02s/it]
Loading safetensors using Fastsafetensor loader:  10% Completed | 4/40 [00:08<01:09,  1.92s/it]
Loading safetensors using Fastsafetensor loader:  12% Completed | 5/40 [00:09<01:04,  1.86s/it]
Loading safetensors using Fastsafetensor loader:  15% Completed | 6/40 [00:11<01:01,  1.82s/it]
Loading safetensors using Fastsafetensor loader:  18% Completed | 7/40 [00:13<01:01,  1.86s/it]
Loading safetensors using Fastsafetensor loader:  20% Completed | 8/40 [00:15<00:58,  1.83s/it]
Loading safetensors using Fastsafetensor loader:  22% Completed | 9/40 [00:16<00:55,  1.80s/it]
Loading safetensors using Fastsafetensor loader:  25% Completed | 10/40 [00:18<00:51,  1.72s/it]
Loading safetensors using Fastsafetensor loader:  28% Completed | 11/40 [00:20<00:48,  1.67s/it]
Loading safetensors using Fastsafetensor loader:  30% Completed | 12/40 [00:22<00:49,  1.76s/it]
Loading safetensors using Fastsafetensor loader:  32% Completed | 13/40 [00:23<00:46,  1.70s/it]
Loading safetensors using Fastsafetensor loader:  35% Completed | 14/40 [00:25<00:44,  1.70s/it]
Loading safetensors using Fastsafetensor loader:  38% Completed | 15/40 [00:26<00:42,  1.70s/it]
Loading safetensors using Fastsafetensor loader:  40% Completed | 16/40 [00:28<00:40,  1.70s/it]
Loading safetensors using Fastsafetensor loader:  42% Completed | 17/40 [00:30<00:40,  1.78s/it]
Loading safetensors using Fastsafetensor loader:  45% Completed | 18/40 [00:32<00:39,  1.79s/it]
Loading safetensors using Fastsafetensor loader:  48% Completed | 19/40 [00:34<00:37,  1.78s/it]
Loading safetensors using Fastsafetensor loader:  50% Completed | 20/40 [00:35<00:35,  1.76s/it]
Loading safetensors using Fastsafetensor loader:  52% Completed | 21/40 [00:37<00:33,  1.75s/it]
Loading safetensors using Fastsafetensor loader:  55% Completed | 22/40 [00:39<00:31,  1.75s/it]
Loading safetensors using Fastsafetensor loader:  57% Completed | 23/40 [00:41<00:30,  1.78s/it]
Loading safetensors using Fastsafetensor loader:  60% Completed | 24/40 [00:43<00:28,  1.77s/it]
Loading safetensors using Fastsafetensor loader:  62% Completed | 25/40 [00:44<00:25,  1.71s/it]
Loading safetensors using Fastsafetensor loader:  65% Completed | 26/40 [00:46<00:24,  1.73s/it]
Loading safetensors using Fastsafetensor loader:  68% Completed | 27/40 [00:47<00:21,  1.68s/it]
Loading safetensors using Fastsafetensor loader:  70% Completed | 28/40 [00:49<00:20,  1.71s/it]
Loading safetensors using Fastsafetensor loader:  72% Completed | 29/40 [00:51<00:19,  1.80s/it]
Loading safetensors using Fastsafetensor loader:  75% Completed | 30/40 [00:53<00:17,  1.78s/it]
Loading safetensors using Fastsafetensor loader:  78% Completed | 31/40 [00:55<00:15,  1.77s/it]
Loading safetensors using Fastsafetensor loader:  80% Completed | 32/40 [00:56<00:14,  1.78s/it]
Loading safetensors using Fastsafetensor loader:  82% Completed | 33/40 [00:58<00:12,  1.72s/it]
Loading safetensors using Fastsafetensor loader:  85% Completed | 34/40 [01:00<00:10,  1.68s/it]
Loading safetensors using Fastsafetensor loader:  88% Completed | 35/40 [01:01<00:08,  1.65s/it]
Loading safetensors using Fastsafetensor loader:  90% Completed | 36/40 [01:03<00:07,  1.78s/it]
Loading safetensors using Fastsafetensor loader:  92% Completed | 37/40 [01:05<00:05,  1.79s/it]
Loading safetensors using Fastsafetensor loader:  95% Completed | 38/40 [01:07<00:03,  1.78s/it]
Loading safetensors using Fastsafetensor loader:  98% Completed | 39/40 [01:09<00:01,  1.74s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 40/40 [01:10<00:00,  1.72s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 40/40 [01:10<00:00,  1.77s/it]
(EngineCore_DP0 pid=256)
(EngineCore_DP0 pid=256) INFO 02-12 08:21:49 [default_loader.py:293] Loading weights took 70.73 seconds
(EngineCore_DP0 pid=256) INFO 02-12 08:21:49 [fp8.py:495] Using MoEPrepareAndFinalizeNoEP
(EngineCore_DP0 pid=256) INFO 02-12 08:21:49 [gpu_model_runner.py:4221] Model loading took 74.89 GiB memory and 85.793719 seconds
(EngineCore_DP0 pid=256) INFO 02-12 08:21:56 [backends.py:918] Using cache directory: /root/.cache/vllm/torch_compile_cache/124adb2cb9/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=256) INFO 02-12 08:21:56 [backends.py:978] Dynamo bytecode transform time: 5.81 s
(EngineCore_DP0 pid=256) WARNING 02-12 08:21:58 [fused_moe.py:1089] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=512,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=256) INFO 02-12 08:22:22 [backends.py:267] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 25.071 s
(EngineCore_DP0 pid=256) INFO 02-12 08:22:22 [monitor.py:34] torch.compile takes 30.88 s in total
(EngineCore_DP0 pid=256) INFO 02-12 08:22:24 [gpu_worker.py:375] Available KV cache memory: 16.74 GiB
(EngineCore_DP0 pid=256) INFO 02-12 08:22:24 [kv_cache_utils.py:1308] GPU KV cache size: 182,784 tokens
(EngineCore_DP0 pid=256) INFO 02-12 08:22:24 [kv_cache_utils.py:1313] Maximum concurrency for 262,144 tokens per request: 2.75x
(EngineCore_DP0 pid=256) 2026-02-12 08:22:25,688 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=256) 2026-02-12 08:22:33,177 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████| 51/51 [02:21<00:00,  2.77s/it]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████| 35/35 [02:46<00:00,  4.76s/it]
(EngineCore_DP0 pid=256) INFO 02-12 08:27:45 [gpu_model_runner.py:5247] Graph capturing finished in 313 secs, took 0.69 GiB
(EngineCore_DP0 pid=256) INFO 02-12 08:27:46 [core.py:278] init engine (profile, create kv cache, warmup model) took 356.60 seconds
(EngineCore_DP0 pid=256) INFO 02-12 08:27:49 [vllm.py:698] Asynchronous scheduling is disabled.
(APIServer pid=116) INFO 02-12 08:27:50 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=116) INFO 02-12 08:27:50 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=116) WARNING 02-12 08:27:50 [model.py:1356] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 40, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=116) INFO 02-12 08:27:50 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=116) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=116) [2026-02-12 08:27:50] WARNING _http.py:779: Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=116) INFO 02-12 08:27:50 [serving.py:188] Warming up chat template processing...
(APIServer pid=116) INFO 02-12 08:27:54 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=116) INFO 02-12 08:27:54 [serving.py:213] Chat template warmup completed in 4036.3ms
(APIServer pid=116) INFO 02-12 08:27:55 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=116) INFO 02-12 08:27:55 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8888
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:38] Available routes are:
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=116) INFO:     Started server process [116]
(APIServer pid=116) INFO:     Waiting for application startup.
(APIServer pid=116) INFO:     Application startup complete.

But I only got about 2 tok/s:

❯ docker exec -it vllm_node \
vllm bench serve \
  --backend openai-chat \
  --base-url http://127.0.0.1:8888 \
  --endpoint /v1/chat/completions \
  --model "Qwen/Qwen3-Coder-Next-FP8" \
  --dataset-name random \
  --random-input-len 64 \
  --random-output-len 256 \
  --num-prompts 200 \
  --max-concurrency 8
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0xf33337a3c360>, trust_remote_code=False, seed=0, num_prompts=200, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=64, random_output_len=256, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='openai-chat', base_url='http://127.0.0.1:8888', host='127.0.0.1', port=8000, endpoint='/v1/chat/completions', header=None, max_concurrency=8, model='Qwen/Qwen3-Coder-Next-FP8', input_len=None, output_len=None, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=0, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics=None, metric_percentiles='99', goodput=None, request_id_prefix='bench-dc3001e4-', top_p=None, top_k=None, min_p=None, temperature=None, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, ramp_up_strategy=None, 
ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False)
INFO 02-12 08:36:15 [datasets.py:607] Sampling input_len from [64, 64] and output_len from [256, 256]
WARNING: vllm bench serve no longer sets temperature==0 (greedy) in requests by default. The default will be determined on the server side and can be model/API specific. For the old behavior, include --temperature=0.
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 8
  0%|                                                                                                                         | 0/200 [00:00<?, ?it/s]
(APIServer pid=116) INFO:     Application startup complete.
(APIServer pid=116) INFO:     127.0.0.1:46196 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=116) INFO:     127.0.0.1:46196 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO:     127.0.0.1:46200 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO:     127.0.0.1:46210 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO:     127.0.0.1:46224 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO:     127.0.0.1:46240 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO:     127.0.0.1:46256 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO:     127.0.0.1:46258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO:     127.0.0.1:46266 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO 02-12 08:37:05 [loggers.py:259] Engine 000: Avg prompt throughput: 14.4 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:37:15 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:37:25 [loggers.py:259] Engine 000: Avg prompt throughput: 43.2 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:37:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:37:45 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:37:55 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:05 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:15 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:25 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:45 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:55 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:39:05 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:39:15 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:39:25 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:39:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:39:45 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
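
For anyone who wants a number instead of eyeballing the stats lines above, here is a minimal sketch that pulls the "Avg generation throughput" values out of a pasted log and averages them. The regex is written against the line format shown in this log only, and the function name `summarize_generation` is made up for illustration; it is not part of vLLM.

```python
import re

# Pattern for the periodic stats line as it appears in the log above.
# Assumption: other vLLM versions may word this line differently.
STATS_RE = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs"
)


def summarize_generation(log_text, min_running=1):
    """Average the generation throughput over samples with enough running requests."""
    # Drop line breaks so lines hard-wrapped by a terminal or forum paste
    # ("Ru" + "nning") still match.
    joined = log_text.replace("\n", "")
    rates = [
        float(m.group("gen"))
        for m in STATS_RE.finditer(joined)
        if int(m.group("running")) >= min_running
    ]
    if not rates:
        return None
    return {
        "samples": len(rates),
        "mean_tps": sum(rates) / len(rates),
        "min_tps": min(rates),
        "max_tps": max(rates),
    }
```

Calling `summarize_generation(text, min_running=8)` keeps only the samples taken while all 8 benchmark requests were running, which leaves out the warm-up lines at the start.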

I can confirm similar issues on my GX10 as well. I freshly built the container today from a recent GitHub checkout and only get 2-3 t/s.

I also noticed the following warnings during startup and use. Is this an issue?

(EngineCore_DP0 pid=256) WARNING 02-12 08:21:58 [fused_moe.py:1089] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=512,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json

(EngineCore_DP0 pid=318) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1154: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, …] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, …].
(EngineCore_DP0 pid=318)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=318) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, …] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, …].
(EngineCore_DP0 pid=318)   return fn(*contiguous_args, **contiguous_kwargs)
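
To check whether any tuned MoE config exists for the device at all, a rough sketch like the following can parse the path out of the "Using default MoE config" warning and list the JSON files that mention the same device name. It relies only on the directory and filename layout visible in that warning; the helper names `missing_moe_config` and `configs_for_device` are hypothetical and not part of vLLM.

```python
import re
from pathlib import Path

# Extract the path from the "Config file not found at ..." warning text.
CONFIG_RE = re.compile(r"Config file not found at (?P<path>\S+\.json)")


def missing_moe_config(warning_line):
    """Return (path, fields) parsed from the warning, or None if it does not match."""
    m = CONFIG_RE.search(warning_line)
    if m is None:
        return None
    path = Path(m.group("path"))
    # The filename is a comma-separated list of key=value pairs. block_shape
    # contains a comma itself, so split only on commas followed by "key=".
    parts = re.split(r",(?=[A-Za-z_]+=)", path.stem)
    fields = dict(part.split("=", 1) for part in parts)
    return path, fields


def configs_for_device(config_dir, device_name):
    """List config files in config_dir whose name mentions the given device."""
    pattern = f"*device_name={device_name}*.json"
    return sorted(p.name for p in Path(config_dir).glob(pattern))
```

Running `missing_moe_config` on the warning above and passing `path.parent` and `fields["device_name"]` to `configs_for_device` shows which configs, if any, are present for that device name in the container.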

Currently rebuilding the mxfp4 for gpt-oss to see if that now also has performance issues.

Edit: gpt-oss-120b still runs fine as before. :)

Rebuilding the Docker container with the following options works for me:

./build-and-copy.sh --pre-flashinfer --rebuild-deps --rebuild-vllm --vllm-ref v0.15.1

But I still got one error:

ERROR 02-12 11:11:34 [gpt_oss_triton_kernels_moe.py:34] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton_kernels.routing'

Maybe it is better to use the same vLLM version that ships in the official vLLM Docker image, 26.01-py3?

vLLM Version 0.13.0+faa43dbf