Qwen3.5-35B-A3B optimizations on single Spark

This topic splits off discussion of the Qwen3.5 35B-A3B release from the thread "Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 - patches + quick-start + benchmark)" (see post #228 by fususu).

User @fususu has applied similar optimizations to the smaller MoE in Qwen3.5 to good effect.

A GitHub repo has been released generalizing the improvements from the larger Qwen3.5-122B-A10B thread to the smaller MoE model: phuongncn/asus-gx10-qwen35-speed-hack ("4-5x faster Qwen3.5 on ASUS GX10 / DGX Spark - Hybrid INT4+FP8 + MTP via one shell script"), https://github.com/phuongncn/asus-gx10-qwen35-speed-hack

On Spark, we can use either the int4-AutoRound or the official FP8 checkpoint. There is some suggestion that FP8 may be higher quality for this model; int4 is, of course, faster. Quality comparisons would be valuable, since so far this is mostly “vibes” and opinion.

Current experiments apply similar hybrid layers to the int4 as well as MTP. Throughput for int4fp8 is already peaking over 100 tok/s!

However, I believe that starting there but using z-lab/Qwen3.5-35B-A3B-DFlash (on Hugging Face) as the drafter instead of the built-in MTP could make this the fastest model on Spark Arena.

Edit: Current real-world optimal combination for me is the hybrid int4fp8 checkpoint with built-in MTP set to 2 positions. However, if your work is coding, the DFlash drafted model may eclipse this (with caveats, see below).

My gentle feedback and wishlist for this:

  1. A menu option to simply apply/generate the hybrid model from existing separate int4 and FP8 checkpoints and stop. This is a lower barrier for those of us with already validated patched vllm-qwen35 Docker images who may not want to rebuild our container and want to control settings for serving the model.
  2. Consider refactoring the monolithic convenience script into a few separate scripts called from a common CLI script.
  3. Allow manual use of separate scripts to accomplish parts of this workflow with separate or experimental tooling.

For everyone coming from albond’s other thread: this is a stripped-down script that only builds the hybrid checkpoint, which you can then test with your existing vllm-qwen35-v2 Docker image.

#!/bin/bash

# Ensure REPO_DIR is set (this is to the 122B hybrid repo)
REPO_DIR="$HOME/src/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4"

# ── Setup Default Paths ──────────────────────────────────────────────
DEFAULT_INT4="$HOME/models/Qwen3.5-35B-A3B-int4-AutoRound"
DEFAULT_FP8_NAME="Qwen/Qwen3.5-35B-A3B-FP8"
DEFAULT_FP8_PATH="$HOME/models/Qwen3.5-35B-A3B-FP8"
DEFAULT_OUT="Qwen3.5-35B-A3B-int4fp8"

echo "=== Configure Local Hybrid Model Build ==="
echo ""

# ── Gather Local Paths ───────────────────────────────────────────────
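# Read into a temporary variable so that pressing Enter keeps the default
read -p "Path to the 122B-A10B hybrid repo by albond [$REPO_DIR]: " REPO_INPUT
REPO_DIR=${REPO_INPUT:-$REPO_DIR}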
read -p "INT4 model path [$DEFAULT_INT4]: " INT4_DIR
INT4_DIR=${INT4_DIR:-$DEFAULT_INT4}
[ ! -d "$INT4_DIR" ] && { echo "Error: INT4 directory not found at $INT4_DIR"; exit 1; }

read -p "FP8 model NAME [$DEFAULT_FP8_NAME]: " FP8_MODELNAME
FP8_MODELNAME=${FP8_MODELNAME:-$DEFAULT_FP8_NAME}
# [ ! -d "$FP8_MODELNAME" ] && { echo "Error: FP8 directory not found at $FP8_MODELNAME"; exit 1; }

echo ""
read -p "Output folder name in ~/models/ [$DEFAULT_OUT]: " OUT_NAME
OUT_NAME=${OUT_NAME:-$DEFAULT_OUT}
HYBRID_OUT="$HOME/models/$OUT_NAME"

echo ""
echo "INT4 source : $INT4_DIR"
echo "FP8 source  : $FP8_MODELNAME"
echo "Output dir  : $HYBRID_OUT"
echo "────────────────────────────────────────────────────────────────"

# ── Check if output already exists ───────────────────────────────────
if [ -f "$HYBRID_OUT/model.safetensors.index.json" ]; then
    echo "Warning: $OUT_NAME already exists."
    read -p "Rebuild (delete and rebuild)? (y/N): " DO_REBUILD
    if [[ "$DO_REBUILD" =~ ^[Yy]$ ]]; then
        rm -rf "$HYBRID_OUT"
    else
        echo "Skipping build - keeping $OUT_NAME"
        exit 0
    fi
fi

# ── Memory flush ─────────────────────────────────────────────────────
read -p "Drop system memory cache before build? (Recommended) [Y/n]: " _DO_DROP
_DO_DROP=${_DO_DROP:-Y}
if [[ "$_DO_DROP" =~ ^[Yy]$ ]]; then
    sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' && echo "Memory cache flushed" || echo "Could not flush cache (continuing anyway)"
fi

# ── Setup venv ───────────────────────────────────────────────────────
cd "$REPO_DIR" || { echo "Error: REPO_DIR not found"; exit 1; }
if [ ! -d .venv ]; then python3 -m venv .venv; fi
# shellcheck disable=SC1091

# UNCOMMENT IF YOU WANT TO SET UP A VENV FOR THIS
# source .venv/bin/activate
# pip install -q -U pip
# pip install -q torch numpy safetensors huggingface_hub

# ── Build Hybrid ─────────────────────────────────────────────────────
echo "Building hybrid checkpoint (may take 20-60 minutes)..."
python "$REPO_DIR/patches/01-hybrid-int4-fp8/build-hybrid-checkpoint.py" \
    --gptq-dir "$INT4_DIR" \
    --fp8-repo "$FP8_MODELNAME" \
    --output "$HYBRID_OUT" \
    --force || { echo "Error: hybrid build failed"; exit 1; }
echo "Hybrid checkpoint done: $HYBRID_OUT"

# ── MTP speculative weights ──────────────────────────────────────────
echo ""
read -p "Add MTP speculative decoding weights? (Y/n): " DO_MTP
DO_MTP=${DO_MTP:-Y}

if [[ "$DO_MTP" =~ ^[Yy]$ ]]; then
    if [ -f "$INT4_DIR/model_extra_tensors.safetensors" ]; then
        echo "Adding MTP weights from local INT4 source..."
        python "$REPO_DIR/patches/02-mtp-speculative/add-mtp-weights.py" \
            --source "$INT4_DIR" \
            --target "$HYBRID_OUT"
        echo "MTP done"
    else
        echo "INT4 has no model_extra_tensors - extracting from FP8 source..."
        read -p "FP8 model path [$DEFAULT_FP8_PATH]: " FP8_DIR
        # Fall back to the default path when the prompt is left empty
        FP8_DIR=${FP8_DIR:-$DEFAULT_FP8_PATH}
        python "$REPO_DIR/patches/02-mtp-speculative/add-mtp-from-fp8.py" \
            --fp8-repo "$FP8_DIR" \
            --target "$HYBRID_OUT"
        echo "MTP done (from FP8 source)"
    fi
else
    echo "Skipping MTP"
fi

I have successfully built my 35B-A3B-int4fp8 hybrid.

Using the hybrid model built with the above script, I get the following synthetic (optimistic) benchmark results, ranging from 104 to 125 tok/s (updated benchmark script):

╔══════════════════════════════════════════════════════╗
║  Benchmark: Qwen3.5-35B-A3B-int4fp8  -  2026-04-12 22:14
╚══════════════════════════════════════════════════════

  Warm-up... done

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   2.31s = 110.8 tok/s
  [Code      ]   402 tokens in   3.28s = 122.2 tok/s
  [JSON      ]  1024 tokens in   8.51s = 120.3 tok/s
  [Math      ]    32 tokens in    .30s = 104.5 tok/s
  [LongCode  ]  2048 tokens in  16.47s = 124.3 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   2.34s = 109.4 tok/s
  [Code      ]   402 tokens in   3.28s = 122.4 tok/s
  [JSON      ]  1024 tokens in   8.45s = 121.1 tok/s
  [Math      ]    32 tokens in    .30s = 105.2 tok/s
  [LongCode  ]  2048 tokens in  16.33s = 125.3 tok/s

── Concurrent (4 parallel requests) ───────────────────────────
  Sending 4 requests simultaneously, measuring total throughput...

  [req1 ]  1024 tokens = 40.0 tok/s (end-to-end)
  [req2 ]  1024 tokens = 39.5 tok/s (end-to-end)
  [req3 ]  1024 tokens = 40.7 tok/s (end-to-end)
  [req4 ]  1024 tokens = 39.6 tok/s (end-to-end)

  Total: 4096 tokens in 25.92s
  Total throughput: 157.9 tok/s (4 requests completed)
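
The aggregate figure is just total generated tokens divided by wall-clock time, which is why it sits far above the roughly 40 tok/s that each individual request sees. A minimal sketch of that arithmetic follows; `tok_per_s` is a hypothetical helper, not part of the benchmark script.

```shell
# tok_per_s TOKENS SECONDS -> tokens per second, one decimal place
# (hypothetical helper for checking the benchmark arithmetic by hand)
tok_per_s() {
    awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", t / s }'
}

tok_per_s 4096 25.92    # aggregate over the 4 concurrent requests
tok_per_s 2048 16.47    # sequential LongCode, run 1
```

The first call lands within about 0.1 tok/s of the reported 157.9; presumably the script times or rounds slightly differently.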

In real-world non-coding tasks I see low-80s tok/s sustained over generations longer than 5k tokens, which is still excellent.

Startup script for the hybrid (includes a cache clear, you may want to change the port or served model name):

#!/bin/bash
docker rm vllm-qwen35
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
docker run -it --name vllm-qwen35 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  --gpus all --net=host --ipc=host \
  -v ~/models:/models \
  vllm-qwen35-v2 \
  serve /models/Qwen3.5-35B-A3B-int4fp8 \
  --served-model-name /models/Qwen3.5-35B-A3B-int4fp8 \
  --max-model-len 262144 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.85 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --attention-backend FLASHINFER \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-num-seqs 256
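
If you want to try other draft lengths or switch drafters, it can be convenient to generate the `--speculative-config` JSON instead of hand-editing it. A small sketch, assuming nothing beyond the JSON shapes already used in this thread; `spec_config` is a hypothetical helper and does not escape quotes inside its arguments.

```shell
# spec_config METHOD NUM_TOKENS [DRAFT_MODEL_PATH]
# Prints a JSON string suitable for vLLM's --speculative-config flag.
# Hypothetical helper: arguments are inserted verbatim, with no JSON escaping.
spec_config() {
    local method="$1" n="$2" model="${3:-}"
    if [ -n "$model" ]; then
        printf '{"method":"%s","model":"%s","num_speculative_tokens":%d}\n' \
            "$method" "$model" "$n"
    else
        printf '{"method":"%s","num_speculative_tokens":%d}\n' "$method" "$n"
    fi
}

spec_config mtp 2
```

In a startup script this would be used as `--speculative-config "$(spec_config mtp 2)"`.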

Here is a llama-benchy run with the hybrid int4fp8. Remember, llama-benchy does not measure MTP tokens so this is not real throughput. It does, however, give a good sense of how concurrency scales.

Peak throughput is around concurrency=32 with MTP=2

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| Q3.5-35B-A3B-hybrid | pp1024 (c1) | 5577.14 ± 64.78 | 5577.14 ± 64.78 | | | 231.40 ± 2.14 | 183.81 ± 2.14 | 231.44 ± 2.15 |
| Q3.5-35B-A3B-hybrid | tg128 (c1) | 42.93 ± 0.22 | 42.93 ± 0.22 | 43.67 ± 0.47 | 43.67 ± 0.47 | | | |
| Q3.5-35B-A3B-hybrid | pp1024 (c4) | 3431.85 ± 2020.07 | 921.61 ± 548.15 | | | 2211.42 ± 1875.16 | 2163.83 ± 1875.16 | 2211.46 ± 1875.16 |
| Q3.5-35B-A3B-hybrid | tg128 (c4) | 57.50 ± 3.31 | 15.15 ± 0.54 | 64.00 ± 0.00 | 16.42 ± 0.86 | | | |
| Q3.5-35B-A3B-hybrid | pp1024 (c16) | 5391.46 ± 278.49 | 347.08 ± 22.85 | | | 3013.56 ± 192.80 | 2965.97 ± 192.80 | 3013.58 ± 192.80 |
| Q3.5-35B-A3B-hybrid | tg128 (c16) | 66.18 ± 0.30 | 4.36 ± 0.14 | 80.67 ± 0.94 | 6.31 ± 1.85 | | | |
| Q3.5-35B-A3B-hybrid | pp1024 (c32) | 5594.91 ± 183.68 | 177.11 ± 6.92 | | | 5843.22 ± 226.35 | 5795.62 ± 226.35 | 5843.23 ± 226.34 |
| Q3.5-35B-A3B-hybrid | tg128 (c32) | 69.71 ± 0.12 | 2.29 ± 0.08 | 96.67 ± 0.94 | 4.20 ± 1.64 | | | |
| Q3.5-35B-A3B-hybrid | pp1024 (c64) | 5265.48 ± 381.70 | 88.50 ± 16.83 | | | 11896.91 ± 1495.40 | 11849.31 ± 1495.40 | 11896.92 ± 1495.39 |
| Q3.5-35B-A3B-hybrid | tg128 (c64) | 65.68 ± 0.32 | 1.19 ± 0.05 | 128.00 ± 0.00 | 2.84 ± 1.60 | | | |
| Q3.5-35B-A3B-hybrid | pp1024 (c96) | 5271.95 ± 67.80 | 69.67 ± 21.37 | | | 15713.68 ± 3341.88 | 15666.09 ± 3341.88 | 15713.69 ± 3341.88 |
| Q3.5-35B-A3B-hybrid | tg128 (c96) | 63.94 ± 0.27 | 0.78 ± 0.04 | 153.00 ± 7.87 | 2.27 ± 1.35 | | | |
| Q3.5-35B-A3B-hybrid | pp1024 (c128) | 5154.63 ± 25.67 | 59.72 ± 24.14 | | | 19353.00 ± 5722.75 | 19305.41 ± 5722.75 | 19353.01 ± 5722.75 |
| Q3.5-35B-A3B-hybrid | tg128 (c128) | 63.33 ± 0.06 | 0.59 ± 0.03 | 158.67 ± 8.18 | 1.93 ± 1.06 | | | |
| Q3.5-35B-A3B-hybrid | pp1024 (c160) | 4678.03 ± 44.41 | 51.86 ± 24.20 | | | 23330.21 ± 8187.29 | 23282.62 ± 8187.29 | 23330.22 ± 8187.28 |
| Q3.5-35B-A3B-hybrid | tg128 (c160) | 62.28 ± 0.48 | 0.46 ± 0.03 | 174.00 ± 8.52 | 1.74 ± 1.07 | | | |
| Q3.5-35B-A3B-hybrid | pp1024 (c196) | 4622.91 ± 3.66 | 46.87 ± 25.18 | | | 27190.66 ± 10860.25 | 27143.07 ± 10860.25 | 27190.67 ± 10860.25 |
| Q3.5-35B-A3B-hybrid | tg128 (c196) | 62.58 ± 0.22 | 0.38 ± 0.03 | 196.00 ± 0.00 | 1.60 ± 1.05 | | | |

DFlash cannot currently be run with the hybrid model, because the current patch to allow the hybrid to run with vLLM is pinned to 0.19.0. DFlash drafting support is now live in nightly and 0.19.rc1. So this uses a clean build of @eugr’s spark-vllm-docker with TF5 support.

Model: 35B-A3B standard Intel int4-AutoRound.
Required flags and caveats:

  • DFlash requires Flash_Attention presently. Flashinfer is not supported. This cuts down your KV cache efficiency and harms overall throughput.
  • Cannot quantize the KV cache (incompatible with Flash_Attention)
  • --max-num-batched-tokens 32768 required so the drafter has room to draft
  • IMPORTANT: These numbers are very high, but in real-world professional use I see lower actual throughput with DFlash than with MTP=2 today. Try both on your actual data!

╔══════════════════════════════════════════════════════╗
║  Qwen3.5-35B-A3B-int4-AutoRound DFlash Benchmark: test
║  Sun Apr 12 06:40:22 PM CDT 2026
╚══════════════════════════════════════════════════════

── Run 1/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 2.18s = 117.4 tok/s (prompt: 23)
  [Code] 490 tokens in 3.22s = 152.1 tok/s (prompt: 30)
  [JSON] 1024 tokens in 6.15s = 166.5 tok/s (prompt: 48)
  [Math] 64 tokens in .48s = 133.3 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 12.32s = 166.2 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 2.09s = 122.4 tok/s (prompt: 23)
  [Code] 494 tokens in 3.23s = 152.9 tok/s (prompt: 30)
  [JSON] 1024 tokens in 7.91s = 129.4 tok/s (prompt: 48)
  [Math] 64 tokens in .44s = 145.4 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 12.12s = 168.9 tok/s (prompt: 37)

=== Done ===

Improvements on the horizon

  • Efficiency and throughput would substantially benefit if DFlash worked with Flashinfer
  • In theory, once the hybrid checkpoint can be used with DFlash, that should add a separate, multiplicative gain of roughly 5-8%.

My startup script, which maps in ~/models so the hybrid model can be referenced in place:

#!/bin/bash
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/models:/models" \
~/containers/spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 --solo \
  --apply-mod ~/containers/spark-vllm-docker/mods/fix-qwen3.5-autoround \
  --apply-mod ~/containers/spark-vllm-docker/mods/fix-qwen3.5-chat-template \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve /models/Qwen3.5-35B-A3B-int4-AutoRound \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --dtype auto \
  --attention-backend flash_attn \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code \
  --chat-template unsloth.jinja \
  --max-num-seqs 128 \
  --max-num-batched-tokens 32768 \
  --speculative-config '{"method":"dflash","model":"/models/Qwen3.5-35B-A3B-DFlash","num_speculative_tokens":15}'

DFlash with int4 AutoRound starts slower but actually scales higher with most of the gains by c=24 or c=32. This is with DFlash predicting 15 tokens.

If your workload is well suited to have DFlash accepting at least 2 tokens on average, this will win on concurrent workloads even with the current constraints requiring flash_attention and standard checkpoint.

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| /models/Qwen3.5-35B-A3B-int4-AutoRound | pp1024 (c1) | 5827.80 ± 1033.77 | 5827.80 ± 1033.77 | | | 268.46 ± 36.96 | 182.44 ± 36.96 | 268.51 ± 36.99 |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | tg128 (c1) | 24.59 ± 0.39 | 24.59 ± 0.39 | 25.33 ± 0.47 | 25.33 ± 0.47 | | | |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | pp1024 (c4) | 4294.15 ± 1211.40 | 1255.99 ± 400.73 | | | 1021.78 ± 377.54 | 935.76 ± 377.54 | 1021.82 ± 377.54 |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | tg128 (c4) | 47.33 ± 2.01 | 13.11 ± 0.48 | 54.33 ± 1.25 | 14.33 ± 1.11 | | | |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | pp1024 (c16) | 5647.82 ± 343.87 | 367.03 ± 25.50 | | | 2891.76 ± 186.98 | 2805.73 ± 186.98 | 2891.77 ± 186.98 |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | tg128 (c16) | 78.39 ± 0.62 | 5.45 ± 0.23 | 96.00 ± 0.00 | 7.21 ± 1.54 | | | |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | pp1024 (c24) | 6153.77 ± 15.07 | 264.24 ± 6.47 | | | 3967.12 ± 89.33 | 3881.10 ± 89.33 | 3967.14 ± 89.33 |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | tg128 (c24) | 83.16 ± 0.48 | 4.01 ± 0.27 | 109.00 ± 1.41 | 6.18 ± 2.17 | | | |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | pp1024 (c32) | 5917.61 ± 160.90 | 196.04 ± 4.37 | | | 5316.80 ± 109.79 | 5230.78 ± 109.79 | 5316.81 ± 109.79 |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | tg128 (c32) | 84.43 ± 0.91 | 3.04 ± 0.22 | 118.67 ± 2.87 | 5.36 ± 2.31 | | | |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | pp1024 (c48) | 5950.76 ± 34.49 | 134.27 ± 16.20 | | | 7795.91 ± 638.17 | 7709.89 ± 638.17 | 7795.92 ± 638.16 |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | tg128 (c48) | 76.37 ± 0.27 | 2.01 ± 0.16 | 133.33 ± 0.47 | 4.24 ± 2.01 | | | |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | pp1024 (c64) | 2143.66 ± 59.81 | 102.16 ± 17.63 | | | 10457.93 ± 2687.19 | 10371.91 ± 2687.19 | 10457.95 ± 2687.19 |
| /models/Qwen3.5-35B-A3B-int4-AutoRound | tg128 (c64) | 70.30 ± 1.01 | 1.60 ± 0.78 | 133.00 ± 3.74 | 3.67 ± 2.97 | | | |
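
To make the "at least 2 accepted tokens" rule of thumb concrete, here is a rough sketch of the arithmetic. It rests on an assumption the thread does not verify: that the tg128 figures in the llama-benchy tables count roughly one token per decode step, so real throughput is about that figure times one plus the mean number of accepted draft tokens. The helper name `effective_tps` and the acceptance values are illustrative, not measured.

```shell
# effective_tps STEPS_PER_S MEAN_ACCEPTED
# In speculative decoding each verification step yields the accepted draft
# tokens plus one token from the target model, so:
#   tokens/s = steps/s * (1 + mean accepted draft tokens)
# ASSUMPTION: the llama-benchy tg128 value approximates steps/s.
effective_tps() {
    awk -v s="$1" -v a="$2" 'BEGIN { printf "%.1f\n", s * (1 + a) }'
}

# c32 figures from the two llama-benchy tables, with assumed acceptance counts
effective_tps 84.43 2   # DFlash, assuming 2 accepted tokens per step
effective_tps 69.71 2   # MTP=2 at its ceiling of 2 accepted tokens per step
```

Under these assumptions, DFlash accepting 2 tokens per step at c32 already exceeds the best case for MTP=2. At c1, where the DFlash table starts from 24.59 versus 42.93, it would need a considerably higher acceptance count to catch up.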

In my testing, the hybrid and pure int4-AutoRound are equivalent in accuracy and general length of response on complex document analysis.

Next, we will evaluate the pure FP8 model with both drafting engines.

Thanks for the detailed wishlist, and for kicking off this thread!

On points 1–3: Done. I’ve refactored the monolithic script into separate modules: scripts/install.sh, start-server.sh, benchmark.sh, build-hybrid.sh, common.sh

Option 7 (“Build checkpoint”) now lets you build the hybrid or FP8+MTP checkpoint without touching Docker or the server. Your stripped-down script from post 3 maps cleanly to scripts/build-hybrid.sh; advanced users can call it directly.

benchmark.sh also accepts a port arg now, so it works standalone against any OpenAI-compatible endpoint: bash scripts/benchmark.sh 8000

On native FP8 quality: I ran both variants to address Stefan’s concern. Single-stream FP8 is ~70 tok/s vs ~112 tok/s for the hybrid, but the surprise is concurrent throughput: 4 parallel requests hit 185 tok/s total on FP8 vs 158 tok/s on the hybrid. So FP8 actually wins for multi-user scenarios.

Updated repo: https://github.com/phuongncn/asus-gx10-qwen35-speed-hack

Thanks, I’ll go back and test these with concurrency when I have the chance.

I am at least as interested in how many tokens FP8 takes to get to the same place as int4-AutoRound. On some benchmarks both reach the same answer, but FP8 gets there with fewer tokens.

This is not as rigorous, but I have lengthy prompts which generate many thousands of tokens in response.
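
One low-effort way to put a number on verbosity is to record the completion token count of each response (OpenAI-compatible servers report it as `usage.completion_tokens`) and compare the mean per checkpoint over the same prompts. A minimal sketch; `mean_tokens` is a hypothetical helper and the counts below are made-up placeholders, not measurements.

```shell
# mean_tokens: read one completion-token count per line on stdin, print the mean
# (hypothetical helper; prints nothing if no lines are given)
mean_tokens() {
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.1f\n", sum / n }'
}

# Placeholder counts for the same three prompts on two checkpoints
printf '%s\n' 100 200 300 | mean_tokens   # checkpoint A
printf '%s\n' 90 180 270 | mean_tokens    # checkpoint B
```

A lower mean for answers of equal quality would indicate the less verbose checkpoint, which matters because wall-clock time per task depends on token count as well as tok/s.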

I quite like the new benchmark script.

Here is FP8 with built-in MTP=2, running inside the hybrid container (because the LM head going to int8 should still be a small gain even on the FP8 model):

═══ Benchmark ═══
[✓] Model: /models/Qwen3.5-35B-A3B-FP8

╔══════════════════════════════════════════════════════╗
║  Benchmark: Qwen3.5-35B-A3B-FP8  -  2026-04-12 21:27
╚══════════════════════════════════════════════════════

  Warm-up... done

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   3.42s = 74.7 tok/s
  [Code      ]   512 tokens in   6.33s = 80.8 tok/s
  [JSON      ]  1024 tokens in  13.21s = 77.4 tok/s
  [Math      ]    32 tokens in    .47s = 67.9 tok/s
  [LongCode  ]  2048 tokens in  24.97s = 81.9 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   3.42s = 74.6 tok/s
  [Code      ]   512 tokens in   6.34s = 80.6 tok/s
  [JSON      ]  1024 tokens in  13.25s = 77.2 tok/s
  [Math      ]    32 tokens in    .46s = 68.5 tok/s
  [LongCode  ]  2048 tokens in  25.03s = 81.8 tok/s

── Concurrent (4 parallel requests) ───────────────────────────
  Sending 4 requests simultaneously, measuring total throughput...

  [req1 ]  1024 tokens = 34.5 tok/s (end-to-end)
  [req2 ]  1024 tokens = 34.5 tok/s (end-to-end)
  [req3 ]  1024 tokens = 34.5 tok/s (end-to-end)
  [req4 ]  1024 tokens = 34.6 tok/s (end-to-end)

  Total: 4096 tokens in 29.67s
  Total throughput: 138.0 tok/s (4 requests completed)

Edit: For actual document analysis, I see 57 tok/s (compared to 80 tok/s with the hybrid model with MTP=2). Similar results.

Startup script using the vllm-qwen35-v2 image

#!/bin/bash
# -e "--default-chat-template-kwargs '{"enable_thinking": false}'"
docker rm vllm-qwen35
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
docker run -it --name vllm-qwen35 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  --gpus all --net=host --ipc=host \
  -v ~/models:/models \
  vllm-qwen35-v2 \
  serve /models/Qwen3.5-35B-A3B-FP8 \
  --served-model-name /models/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 262144 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.85 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --attention-backend FLASHINFER \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-num-seqs 256

I tried both of the available 35B models but get the same error:

WARNING 04-13 02:37:02 [argparse_utils.py:191] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=1) INFO 04-13 02:37:02 [utils.py:299] 
(APIServer pid=1) INFO 04-13 02:37:02 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-13 02:37:02 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.0
(APIServer pid=1) INFO 04-13 02:37:02 [utils.py:299]   █▄█▀ █     █     █     █  model   /local_models/qwen35-35b-fp8-mtp
(APIServer pid=1) INFO 04-13 02:37:02 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-13 02:37:02 [utils.py:299] 
(APIServer pid=1) INFO 04-13 02:37:02 [utils.py:233] non-default args: {'model_tag': '/local_models/qwen35-35b-fp8-mtp', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/local_models/qwen35-35b-fp8-mtp', 'max_model_len': 131072, 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.92, 'language_model_only': True, 'speculative_config': {'method': 'qwen3_next_mtp', 'num_speculative_tokens': 2}}
(APIServer pid=1) WARNING 04-13 02:37:02 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_UF_EAGER_ALLREDUCE
(APIServer pid=1) INFO 04-13 02:37:07 [model.py:549] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=1) INFO 04-13 02:37:07 [model.py:1678] Using max model len 131072
(APIServer pid=1) WARNING 04-13 02:37:07 [speculative.py:368] method `qwen3_next_mtp` is deprecated and replaced with mtp.
(APIServer pid=1) INFO 04-13 02:37:11 [model.py:549] Resolved architecture: Qwen3_5MoeMTP
(APIServer pid=1) INFO 04-13 02:37:11 [model.py:1678] Using max model len 262144
(APIServer pid=1) WARNING 04-13 02:37:11 [speculative.py:512] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:435: UserWarning: 
(APIServer pid=1)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(APIServer pid=1)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(APIServer pid=1)     (8.0) - (12.0)
(APIServer pid=1)     
(APIServer pid=1)   queued_call()
(APIServer pid=1) INFO 04-13 02:37:12 [config.py:281] Setting attention block size to 1072 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 04-13 02:37:12 [config.py:312] Padding mamba page size by 0.75% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 04-13 02:37:12 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 04-13 02:37:12 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=1) INFO 04-13 02:37:12 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=188) INFO 04-13 02:37:17 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='/local_models/qwen35-35b-fp8-mtp', speculative_config=SpeculativeConfig(method='mtp', model='/local_models/qwen35-35b-fp8-mtp', num_spec_tokens=2), tokenizer='/local_models/qwen35-35b-fp8-mtp', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/local_models/qwen35-35b-fp8-mtp, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 
'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=188) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:435: UserWarning: 
(EngineCore pid=188)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore pid=188)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore pid=188)     (8.0) - (12.0)
(EngineCore pid=188)     
(EngineCore pid=188)   queued_call()
(EngineCore pid=188) INFO 04-13 02:37:18 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=188) INFO 04-13 02:37:18 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.5.35:48121 backend=nccl
(EngineCore pid=188) INFO 04-13 02:37:18 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=188) WARNING 04-13 02:37:19 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=188) INFO 04-13 02:37:19 [gpu_model_runner.py:4735] Starting to load model /local_models/qwen35-35b-fp8-mtp...
(EngineCore pid=188) INFO 04-13 02:37:19 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=188) INFO 04-13 02:37:19 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=188) INFO 04-13 02:37:19 [__init__.py:261] Selected CutlassFP8ScaledMMLinearKernel for Fp8LinearMethod
(EngineCore pid=188) INFO 04-13 02:37:19 [gdn_linear_attn.py:147] Using Triton/FLA GDN prefill kernel
(EngineCore pid=188) INFO 04-13 02:37:21 [fp8.py:396] Using TRITON Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'TRITON', 'MARLIN', 'BATCHED_DEEPGEMM', 'BATCHED_TRITON', 'XPU'].
(EngineCore pid=188) INFO 04-13 02:37:22 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=188) INFO 04-13 02:37:22 [flash_attn.py:596] Using FlashAttention version 2
(EngineCore pid=188) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=188) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [00:05<01:05,  5.08s/it]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:10<01:05,  5.42s/it]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:18<01:12,  6.59s/it]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:23<00:58,  5.86s/it]
Loading safetensors checkpoint shards:  36% Completed | 5/14 [00:29<00:52,  5.79s/it]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:34<00:45,  5.65s/it]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:39<00:38,  5.44s/it]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:44<00:31,  5.32s/it]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [01:18<01:11, 14.26s/it]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [01:24<00:46, 11.65s/it]
Loading safetensors checkpoint shards:  79% Completed | 11/14 [01:30<00:29,  9.97s/it]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [01:37<00:18,  9.14s/it]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [02:00<00:13, 13.13s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [02:02<00:00,  9.93s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [02:02<00:00,  8.75s/it]
(EngineCore pid=188) 
(EngineCore pid=188) INFO 04-13 02:39:29 [default_loader.py:384] Loading weights took 122.58 seconds
(EngineCore pid=188) INFO 04-13 02:39:29 [fp8.py:560] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=188) INFO 04-13 02:39:29 [gpu_model_runner.py:4759] Loading drafter model...
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:00<00:01, 11.68it/s]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:00<00:00, 11.55it/s]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:00<00:00, 11.58it/s]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:00<00:00, 11.75it/s]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:16<00:00, 11.75it/s]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:18<00:20,  4.01s/it]
Loading safetensors checkpoint shards:  79% Completed | 11/14 [00:18<00:07,  2.50s/it]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:19<00:01,  1.75s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:27<00:00,  2.95s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:27<00:00,  1.95s/it]
(EngineCore pid=188) 
(EngineCore pid=188) INFO 04-13 02:39:57 [default_loader.py:384] Loading weights took 27.37 seconds
(EngineCore pid=188) INFO 04-13 02:39:57 [eagle.py:1376] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=188) INFO 04-13 02:39:57 [eagle.py:1432] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore pid=188) INFO 04-13 02:39:57 [gpu_model_runner.py:4820] Model loading took 34.18 GiB memory and 157.231340 seconds
(EngineCore pid=188) INFO 04-13 02:40:06 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/2a9aff733a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=188) INFO 04-13 02:40:06 [backends.py:1111] Dynamo bytecode transform time: 8.71 s
(EngineCore pid=188) [rank0]:W0413 02:40:08.913000 188 torch/_inductor/utils.py:1679] Not enough SMs to use max_autotune_gemm mode
(EngineCore pid=188) INFO 04-13 02:40:12 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=188) INFO 04-13 02:41:51 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 103.88 s
(EngineCore pid=188) INFO 04-13 02:41:53 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/9ff78e8d7a0504a92bb3d98862d6d0c171b249edd524a58aeb6d9ea2a550e9d0/rank_0_0/model
(EngineCore pid=188) INFO 04-13 02:41:53 [monitor.py:48] torch.compile took 116.11 s in total
(EngineCore pid=188) Process EngineCore:
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] EngineCore failed to start.
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     super().__init__(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     self.model_runner.profile_run()
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5782, in profile_run
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5474, in _dummy_run
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     outputs = self.model(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]               ^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self.runnable(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self._call_impl(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return forward_call(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 691, in forward
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     hidden_states = self.language_model.model(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 603, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     output = self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self.fn(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 500, in forward
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     def forward(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 211, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self.optimized_call(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     raise e
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self._call_impl(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return forward_call(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "<eval_with_key>.231", line 330, in forward
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, s18, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_scale_inv_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_scale_inv_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_ = None
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self.runnable(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/piecewise_backend.py", line 367, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return range_entry.runnable(*args)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self._compiled_fn(*args)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return fn(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1148, in forward
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return compiled_fn(full_args)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357, in runtime_wrapper
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     all_outs = call_func_at_runtime_with_args(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 134, in call_func_at_runtime_with_args
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     out = normalize_as_list(f(args))
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]                             ^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1962, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self.compiled_fn(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531, in wrapper
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return compiled_fn(runtime_args)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 729, in inner_fn
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     outs = compiled_fn(args)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 638, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self.current_callable(inputs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 3220, in run
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     out = model(new_inputs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]           ^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/tmp/torchinductor_root/uv/cuv4tqzh434ucxieeg2q7wolbunkpbawuojnokscd5uk25eut3q4.py", line 659, in call
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     torch.ops._C.cutlass_scaled_mm.default(buf8, buf3, reinterpret_tensor(arg5_1, (2048, 12288), (1, 2048), 0), reinterpret_tensor(buf4, (s18, 16), (1, s18), 0), reinterpret_tensor(arg6_1, (16, 96), (1, 16), 0), None)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 819, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]     return self._op(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] RuntimeError: Error Internal
(EngineCore pid=188) Traceback (most recent call last):
(EngineCore pid=188)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=188)     self.run()
(EngineCore pid=188)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=188)     self._target(*self._args, **self._kwargs)
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
(EngineCore pid=188)     raise e
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=188)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=188)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=188)     return func(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=188)     super().__init__(
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=188)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=188)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=188)     return func(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
(EngineCore pid=188)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=188)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=188)     return self.collective_rpc("determine_available_memory")
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore pid=188)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=188)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=188)     return func(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=188)     return func(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(EngineCore pid=188)     self.model_runner.profile_run()
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5782, in profile_run
(EngineCore pid=188)     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=188)                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=188)     return func(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5474, in _dummy_run
(EngineCore pid=188)     outputs = self.model(
(EngineCore pid=188)               ^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=188)     return self.runnable(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=188)     return self._call_impl(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=188)     return forward_call(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 691, in forward
(EngineCore pid=188)     hidden_states = self.language_model.model(
(EngineCore pid=188)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 603, in __call__
(EngineCore pid=188)     output = self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=188)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(EngineCore pid=188)     return self.fn(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 500, in forward
(EngineCore pid=188)     def forward(
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 211, in __call__
(EngineCore pid=188)     return self.optimized_call(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore pid=188)     return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore pid=188)     raise e
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore pid=188)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=188)     return self._call_impl(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=188)     return forward_call(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "<eval_with_key>.231", line 330, in forward
(EngineCore pid=188)     submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, s18, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_scale_inv_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_scale_inv_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_ = None
(EngineCore pid=188)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=188)     return self.runnable(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/piecewise_backend.py", line 367, in __call__
(EngineCore pid=188)     return range_entry.runnable(*args)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
(EngineCore pid=188)     return self._compiled_fn(*args)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(EngineCore pid=188)     return fn(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1148, in forward
(EngineCore pid=188)     return compiled_fn(full_args)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357, in runtime_wrapper
(EngineCore pid=188)     all_outs = call_func_at_runtime_with_args(
(EngineCore pid=188)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 134, in call_func_at_runtime_with_args
(EngineCore pid=188)     out = normalize_as_list(f(args))
(EngineCore pid=188)                             ^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1962, in __call__
(EngineCore pid=188)     return self.compiled_fn(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531, in wrapper
(EngineCore pid=188)     return compiled_fn(runtime_args)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 729, in inner_fn
(EngineCore pid=188)     outs = compiled_fn(args)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 638, in __call__
(EngineCore pid=188)     return self.current_callable(inputs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 3220, in run
(EngineCore pid=188)     out = model(new_inputs)
(EngineCore pid=188)           ^^^^^^^^^^^^^^^^^
(EngineCore pid=188)   File "/tmp/torchinductor_root/uv/cuv4tqzh434ucxieeg2q7wolbunkpbawuojnokscd5uk25eut3q4.py", line 659, in call
(EngineCore pid=188)     torch.ops._C.cutlass_scaled_mm.default(buf8, buf3, reinterpret_tensor(arg5_1, (2048, 12288), (1, 2048), 0), reinterpret_tensor(buf4, (s18, 16), (1, s18), 0), reinterpret_tensor(arg6_1, (16, 96), (1, 16), 0), None)
(EngineCore pid=188)   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 819, in __call__
(EngineCore pid=188)     return self._op(*args, **kwargs)
(EngineCore pid=188)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) RuntimeError: Error Internal
[rank0]:[W413 02:41:54.732014588 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
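The final APIServer `RuntimeError` is only a wrapper ("See root cause above"); the actual failure is the last exception line printed by the EngineCore process, here `RuntimeError: Error Internal`. A minimal sketch for pulling those lines out of a saved log, assuming every line keeps the `(EngineCore pid=N)` prefix seen in this paste. The helper name `engine_core_errors` is hypothetical and not part of vLLM:

```python
import re

# Matches e.g. "(EngineCore pid=188) RuntimeError: Error Internal".
# The prefix format is copied from the log above; everything else is illustrative.
_ENGINE_ERROR = re.compile(
    r"^\(EngineCore pid=\d+\)\s+([\w.]+(?:Error|Exception): .*)$"
)


def engine_core_errors(log_text: str) -> list[str]:
    """Return the exception summary lines emitted by EngineCore, in log order."""
    errors = []
    for line in log_text.splitlines():
        match = _ENGINE_ERROR.match(line)
        if match:
            errors.append(match.group(1))
    return errors


sample_log = "\n".join([
    "(EngineCore pid=188)     return self._op(*args, **kwargs)",
    "(EngineCore pid=188) RuntimeError: Error Internal",
    "(APIServer pid=1) RuntimeError: Engine core initialization failed.",
])
print(engine_core_errors(sample_log))
```

Feeding it the output of `docker logs` for the container narrows a long traceback down to the line worth quoting in a bug report.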

tarani@gx10-4bd2:~/playbook/asus-gx10-qwen35-speed-hack$ ./vllm.sh

═══ Environment Check ═══
[✓] Docker: 29.2.1
[!] Could not read GPU memory info ([N/A])

=== vLLM Manager for ASUS GX10 ===
  1. First-time setup  → [122B / 35B Hybrid / 35B FP8+MTP / Custom / Both]
  2. Select model and start server
  3. Stop server
  4. View logs
  5. Run benchmark
  6. Rebuild Docker image (--no-cache)
  7. Build checkpoint    → [Hybrid INT4+FP8 / Native FP8+MTP]

Select (1-7): 2

═══ Scanning models ═══

=== Available models ===

  ── Hybrid small (35B / 27B / custom) ───────────────────────
  1. Hybrid (small)             ✓ qwen35-35b-fp8-mtp, qwen35-35b-hybrid-int4fp8

  ── 122B-A10B (higher quality) ──────────────────────────────
  2. Qwen3.5-122B-A10B Hybrid v2  [~51 tok/s]  ★ — needs build

  3. Enter model ID / path manually

Select model (1-3): 1

=== Select 35B Hybrid version ===
  1. qwen35-35b-fp8-mtp
  2. qwen35-35b-hybrid-int4fp8

Select (1-2): 1
[✓] Using: /home/tarani/models/qwen35-35b-fp8-mtp
[✓] Detected FP8-native model — using fp8 quantization mode
[✓] Model: /home/tarani/models/qwen35-35b-fp8-mtp

═══ Server configuration ═══
(Press Enter to use default values)

Port [8000]: 
Context length [131072]: 
Max model len [131072]: 
GPU memory utilization [0.92]: 
MTP speculative tokens [2]: 
Thinking mode (yes/no) [yes]: 
Vision encoder (yes/no) [no — saves RAM]: 
Tensor parallel size [1]: 

=== Configuration ===
  Model        : /home/tarani/models/qwen35-35b-fp8-mtp
  Port         : 8000
  Context      : 131072
  GPU mem util : 0.92
  MTP tokens   : 2
  Thinking     : yes
  Vision       : no
  TP size      : 1

Continue? (Y/n): 

═══ Preparing system ═══
Drop system memory cache before starting? (Recommended to free RAM for the model) [Y/n]: 
[✓] Memory cache flushed

═══ Starting Docker ═══
[!] vllm-sm121 not found — falling back to vllm/vllm-openai:latest (may fail on GB10)

Docker command:
docker run -d     --name vllm-qwen35     --gpus all     --net=host     --ipc=host     --shm-size=16g     -v /home/tarani/.cache/huggingface:/root/.cache/huggingface     -v /home/tarani/models:/local_models     -e HF_TOKEN=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     -e VLLM_UF_EAGER_ALLREDUCE=1     vllm/vllm-openai:latest      --model /local_models/qwen35-35b-fp8-mtp  --port 8000 --host 0.0.0.0 --max-model-len 131072 --gpu-memory-utilization 0.92 --tensor-parallel-size 1 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --language-model-only

Unable to find image 'vllm/vllm-openai:latest' locally

latest: Pulling from vllm/vllm-openai
d22e0eb57170: Pull complete 
5136ae4e75ff: Pull complete 
adaf80c23aee: Pull complete 
20a2b233d194: Pull complete 
df8f172ee39a: Pull complete 
6f760a753c5e: Pull complete 
d98306796149: Pull complete 
891990496470: Pull complete 
2ccaca59ba90: Pull complete 
f73d146a8209: Pull complete 
a2e605f89388: Pull complete 
454afc5971c2: Pull complete 
98329712c916: Pull complete 
b3144730d151: Pull complete 
600942fa6af9: Pull complete 
50726bd30a22: Pull complete 
6e8af4fd0a07: Pull complete 
712f972d1ad9: Pull complete 
0315e9ab9839: Pull complete 
7aa2d0bb5c94: Pull complete 
a1e92976b4cc: Pull complete 
14b90358ddb3: Pull complete 
edea9c79e022: Pull complete 
2ea2730b4ab4: Pull complete 
34a8a12a9c59: Pull complete 
d15babedbbdd: Pull complete 
57d2f2576db9: Pull complete 
Digest: sha256:d9a5c1c1614c959fde8d2a4d68449db184572528a6055afdd0caf1e66fb51504
Status: Downloaded newer image for vllm/vllm-openai:latest
315eb194f371da2c1add1dfc57afb30645df23fb59c041d87b52f90e665d526a
[✓] Container started: vllm-qwen35

═══ Waiting for model to load ═══
[!] First time: ~13 min | Subsequent: ~5-7 min
[!] Ctrl+C to exit (container keeps running in background)

  [05:05] Loading...
[✗] Container crashed. Check logs:

Thanks for your feedback. I’m looking into it right now.

FP8 with DFlash. DFlash does extremely well on synthetic benchmarks and code, but again, try it on your own data: if the longer draft predictions don’t get accepted, it ends up a net loss.

On my real-world prompts, it again loses to MTP in single-threaded use. I am seeing an average of around 40 tok/s with DFlash (though, again, this is presently held back by flash_attn and a container without the LM-head optimization), compared to 57 tok/s with MTP, across 5k+ token generations.

I still think this has a lot of potential and is worth revisiting once DFlash works with flashinfer attention and an optimized container.
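Why a 15-token draft can lose to a 2-token MTP draft is easier to see with numbers. The sketch below uses the standard speculative-decoding estimate of tokens emitted per verification step, assuming every drafted token is accepted independently with probability `alpha` (real acceptance is not i.i.d.). The `step_cost` values are made up for illustration and are not measurements from this thread:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and the first rejection discards the rest of the draft.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def relative_throughput(alpha: float, k: int, step_cost: float) -> float:
    """Throughput relative to plain decoding (one token per unit of cost).

    step_cost is the cost of one draft+verify step in units of a plain decode
    step. It is a hypothetical input here and has to be measured in practice.
    """
    return expected_tokens_per_step(alpha, k) / step_cost


# Illustrative costs only: a short draft is cheap, a long draft is not.
for alpha in (0.5, 0.7, 0.9):
    short_draft = relative_throughput(alpha, k=2, step_cost=1.2)
    long_draft = relative_throughput(alpha, k=15, step_cost=2.0)
    print(f"alpha={alpha}: k=2 -> {short_draft:.2f}x, k=15 -> {long_draft:.2f}x")
```

With these assumed costs the long draft only pulls ahead at high acceptance, which fits the observation that DFlash does well on code and synthetic benchmarks but not on every prompt mix.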

╔══════════════════════════════════════════════════════╗
║  Benchmark: Qwen3.5-35B-A3B-FP8  —  2026-04-12 21:48
╚══════════════════════════════════════════════════════╝

  Warm-up... done

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   3.18s = 80.5 tok/s
  [Code      ]   467 tokens in   4.77s = 97.8 tok/s
  [JSON      ]  1024 tokens in  10.24s = 100.0 tok/s
  [Math      ]    32 tokens in    .38s = 82.4 tok/s
  [LongCode  ]  2048 tokens in  16.38s = 124.9 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   3.13s = 81.6 tok/s
  [Code      ]   467 tokens in   4.77s = 97.7 tok/s
  [JSON      ]  1024 tokens in  10.25s = 99.8 tok/s
  [Math      ]    32 tokens in    .38s = 82.4 tok/s
  [LongCode  ]  2048 tokens in  16.39s = 124.8 tok/s

── Concurrent (4 parallel requests) ───────────────────────────
  Sending 4 requests simultaneously, measuring total throughput...

  [req1 ]  1024 tokens = 65.9 tok/s (end-to-end)
  [req2 ]  1024 tokens = 64.9 tok/s (end-to-end)
  [req3 ]  1024 tokens = 65.9 tok/s (end-to-end)
  [req4 ]  1024 tokens = 65.9 tok/s (end-to-end)

  Total: 4096 tokens in 15.78s
  Total throughput: 259.4 tok/s (4 requests completed)

Startup script for this one:

#!/bin/bash
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/models:/models" \
~/containers/spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 --solo \
  --apply-mod ~/containers/spark-vllm-docker/mods/fix-qwen3.5-autoround \
  --apply-mod ~/containers/spark-vllm-docker/mods/fix-qwen3.5-chat-template \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve /models/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --dtype auto \
  --attention-backend flash_attn \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code \
  --chat-template unsloth.jinja \
  --max-num-seqs 96 \
  --max-num-batched-tokens 32768 \
  --speculative-config '{"method":"dflash","model":"/models/Qwen3.5-35B-A3B-DFlash","num_speculative_tokens":15}'
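The `--speculative-config` value is a JSON string, and hand-quoting it inside a shell script is an easy place to slip. A small sketch that generates the flag, using only the keys that appear in this thread (`method`, `model`, `num_speculative_tokens`); the helper name `speculative_config_arg` is hypothetical:

```python
import json
import shlex


def speculative_config_arg(method, num_speculative_tokens, model=None):
    """Build a shell-safe --speculative-config flag from the keys used in this thread."""
    config = {"method": method}
    if model is not None:
        # Only external drafters (e.g. DFlash) need a separate model path.
        config["model"] = model
    config["num_speculative_tokens"] = num_speculative_tokens
    payload = json.dumps(config, separators=(",", ":"))
    return "--speculative-config " + shlex.quote(payload)


# External DFlash drafter, as in the startup script above.
print(speculative_config_arg("dflash", 15, model="/models/Qwen3.5-35B-A3B-DFlash"))
# Built-in MTP with 2 positions, as in the vllm.sh Docker command.
print(speculative_config_arg("qwen3_next_mtp", 2))
```

Switching between the DFlash and MTP setups then comes down to changing one function call rather than editing quoted JSON by hand.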

Here is the issue:
═══ Starting Docker ═══
[!] vllm-sm121 not found — falling back to vllm/vllm-openai:latest (may fail on GB10)
Did you start with option 1 to install? Building vllm-qwen35-v2 or vllm-sm121 takes around 40-60 min.
Have you tried rebuilding the image?
I’ll add an environment check to the script.
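The crash appears to come from the launcher falling back to `vllm/vllm-openai:latest` when the locally built image is missing. A minimal sketch of a pre-flight check that stops instead of falling back, assuming the image names mentioned in this thread (`vllm-qwen35-v2`, `vllm-sm121`). The helpers `image_exists` and `pick_image` are hypothetical and not part of `vllm.sh`:

```python
import subprocess


def image_exists(name: str) -> bool:
    """True if `docker image inspect NAME` succeeds, i.e. the image is present locally."""
    try:
        result = subprocess.run(
            ["docker", "image", "inspect", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The docker CLI itself is not installed or not on PATH.
        return False
    return result.returncode == 0


def pick_image(candidates, exists=image_exists):
    """Return the first locally available image, or None if none has been built."""
    for name in candidates:
        if exists(name):
            return name
    return None
```

Calling `pick_image(["vllm-qwen35-v2", "vllm-sm121"])` before `docker run` and aborting on `None` would surface the missing build immediately, rather than after a long pull of an image that may not work on GB10.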

I’ve updated the script. If everything went well, the environment check should look like the output below, and the server should then start without crashing. You can use option 6 to rebuild vllm-qwen35-v2/vllm-sm121.

═══ Environment Check ═══
[✓] Docker: 29.1.3
[✓] GPU: NVIDIA GB10
[✓] Docker images: vllm-qwen35-v2 ✓  vllm-sm121 ✓

=== vLLM Manager for ASUS GX10 ===
  1. First-time setup  → [122B / 35B Hybrid / 35B FP8+MTP / Custom / Both]
  2. Select model and start server
  3. Stop server
  4. View logs
  5. Run benchmark
  6. Rebuild Docker image (--no-cache)
  7. Build checkpoint    → [Hybrid INT4+FP8 / Native FP8+MTP]

Select (1-7): 2

I already have /models/qwen35-122b-hybrid-int4fp8 running after a successful ./install.sh.

Qwen/Qwen3.5-35B-A3B-FP8 just finished downloading

Which of your new scripts do I need to run? I don’t want to accidentally break anything ;)


@fususu this was much more user friendly than I was expecting – nicely done!

Answer: Just follow the menus in ./vllm.sh.

  7. Build checkpoint → [Hybrid INT4+FP8 / Native FP8+MTP]
═══ Benchmark ═══
[✓] Model: qwen/qwen3.5-35b-fp8-mtp

╔══════════════════════════════════════════════════════╗
║  Benchmark: qwen3.5-35b-fp8-mtp  —  2026-04-13 18:34
╚══════════════════════════════════════════════════════╝

  Warm-up... done

── Sequential (1 request) ──────────────────────────────
  Run 1/2:
  [Q&A       ]   256 tokens in   3.57s = 71.5 tok/s
  [Code      ]   490 tokens in   6.52s = 75.1 tok/s
  [JSON      ]  1024 tokens in  13.63s = 75.0 tok/s
  [Math      ]    32 tokens in    .49s = 65.0 tok/s
  [LongCode  ]  2048 tokens in  26.23s = 78.0 tok/s

  Run 2/2:
  [Q&A       ]   256 tokens in   3.57s = 71.6 tok/s
  [Code      ]   490 tokens in   6.54s = 74.8 tok/s
  [JSON      ]  1024 tokens in  13.60s = 75.2 tok/s
  [Math      ]    32 tokens in    .49s = 65.0 tok/s
  [LongCode  ]  2048 tokens in  26.04s = 78.6 tok/s

── Concurrent (4 parallel requests) ───────────────────────────
  Sending 4 requests simultaneously, measuring total throughput...

  [req1 ]  1024 tokens = 32.3 tok/s (end-to-end)
  [req2 ]  1024 tokens = 32.3 tok/s (end-to-end)
  [req3 ]  1024 tokens = 32.7 tok/s (end-to-end)
  [req4 ]  1024 tokens = 32.6 tok/s (end-to-end)

  Total: 4096 tokens in 31.68s
  Total throughput: 129.2 tok/s (4 requests completed)
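To compare benchmark pastes like the ones in this thread without eyeballing them, the sequential lines can be parsed and averaged per category. This is a rough sketch that assumes the `[Name] N tokens in T.TTs` line format shown above. The function name `mean_tok_per_s` is hypothetical, and results from different containers or machines are of course not directly comparable:

```python
import re
from collections import defaultdict

# Matches e.g. "[Math      ]    32 tokens in    .38s = 82.4 tok/s".
# Concurrent lines ("1024 tokens = 65.9 tok/s") are skipped on purpose.
_SEQ_LINE = re.compile(
    r"\[(?P<name>[^\]]+?)\s*\]\s+(?P<tokens>\d+) tokens in\s+(?P<secs>\d*\.\d+|\d+)s"
)


def mean_tok_per_s(log_text: str) -> dict[str, float]:
    """Token-weighted mean tok/s per category across all sequential runs."""
    totals = defaultdict(lambda: [0, 0.0])  # name -> [tokens, seconds]
    for match in _SEQ_LINE.finditer(log_text):
        entry = totals[match["name"]]
        entry[0] += int(match["tokens"])
        entry[1] += float(match["secs"])
    return {name: tokens / secs for name, (tokens, secs) in totals.items()}


sample = """
  [LongCode  ]  2048 tokens in  26.23s = 78.0 tok/s
  [LongCode  ]  2048 tokens in  26.04s = 78.6 tok/s
  [req1 ]  1024 tokens = 32.3 tok/s (end-to-end)
"""
for name, rate in mean_tok_per_s(sample).items():
    print(f"{name}: {rate:.1f} tok/s")
```

Summing tokens and seconds before dividing weights long generations more heavily than a plain average of the per-line tok/s figures would.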

Any similar work for Gemma 4?

Here is a Python script to fix the weights problem described here: training bug in Qwen3.5 35B A3B model

Before:

  • model.language_model.layers.36.linear_attn.conv1d.weight | σ = 0.1019 | ⚠️ HIGH SIGMA
  • model.language_model.layers.37.linear_attn.conv1d.weight | σ = 0.1024 | ⚠️ HIGH SIGMA

After:

  • model.language_model.layers.36.linear_attn.conv1d.weight | σ = 0.0629 | ✅ NORMAL
  • model.language_model.layers.37.linear_attn.conv1d.weight | σ = 0.0632 | ✅ NORMAL

fix_qwen_fp8.py

import os
from glob import glob

import torch
from safetensors import safe_open
from safetensors.torch import load_file, save_file

# Directory containing the checkpoint's *.safetensors shards
model_path = "./"
shards = sorted(glob(os.path.join(model_path, "*.safetensors")))

# Approximate scale factor that moves sigma from ~0.102 down to ~0.063
scale_factor = 0.063 / 0.102

targets = [
    "model.language_model.layers.36.linear_attn.conv1d.weight",
    "model.language_model.layers.37.linear_attn.conv1d.weight",
]

print(f"Starting repair on {len(shards)} shards...")

# NOTE: shards are overwritten in place, so run this only once per checkpoint;
# a second run would scale the same tensors again.
for shard in shards:
    # Read the header metadata so it can be written back unchanged
    with safe_open(shard, framework="pt") as f:
        metadata = f.metadata()

    weights = load_file(shard)
    modified = False

    for name in targets:
        if name in weights:
            print(f"Applying fix to: {name} in {shard}")
            # Do the math in float32, then cast back to the tensor's original dtype
            orig_dtype = weights[name].dtype
            weights[name] = (weights[name].to(torch.float32) * scale_factor).to(orig_dtype)
            modified = True

    if modified:
        # Overwrite the shard with the repaired version, keeping its metadata
        save_file(weights, shard, metadata=metadata)
        print(f"Saved repaired shard: {shard}")

print("\nRepair complete. You can now restart vLLM.")

Pop it in models/qwen35-35b-fp8-mtp and run it with python3 fix_qwen_fp8.py
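The before/after σ values are what a plain rescale predicts: multiplying a tensor by a constant multiplies its standard deviation by the same constant. A quick sanity check in pure Python, using a synthetic Gaussian list as a stand-in for the real weights (an assumption for illustration only, no model data is loaded):

```python
import random
import statistics

scale_factor = 0.063 / 0.102  # same factor as fix_qwen_fp8.py

# Synthetic stand-in for a conv1d weight tensor; NOT real model data.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.1019) for _ in range(10_000)]
scaled = [w * scale_factor for w in weights]

sigma_before = statistics.pstdev(weights)
sigma_after = statistics.pstdev(scaled)

# Scaling by a constant scales sigma by that constant (up to float rounding).
assert abs(sigma_after - scale_factor * sigma_before) < 1e-12

# Reported "Before" values from the post and what the rescale predicts.
for reported in (0.1019, 0.1024):
    print(f"{reported:.4f} -> {reported * scale_factor:.4f}")
```

The predicted values round to 0.0629 and 0.0632, matching the "After" report. It also shows why running the fix twice would overshoot: each run multiplies σ by roughly 0.62 again.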