Qwen3.5-122B-A10B on single Spark: 38.4 tok/s (patches + benchmark included)

@Albond bravo on the performance, sir. You have lapped what we have so far with NVFP4. We do have NVFP4 working on flashinfer_cutlass with a few recent PRs (FlashInfer 0.6.7; vLLM PRs 37725 and 38126, merged to head and part of 0.19, I think), but even with all that I’m getting around ~26-28 tok/s with MTP k=3. Genuinely annoying that NVIDIA’s marquee datatype isn’t being prioritized more highly, but regardless, nicely done!

Unfortunately, the patch seems to be fragile. There have been recent updates to inc.py and some related infrastructure within the last 2-5 days, so depending on when you last checked out, it may not apply cleanly or in full.

I rebuilt the community Docker after this patch was apparently made, manually applied the changes to the current code, and generated a new patch.

For those who want to use the community Docker, put these files in a new mod directory, /mods/enable-hybrid-int4fp8:

run.sh

#!/bin/bash
# Enable hybrid FP8 and Int4 quantized models to load properly.

set -e
MOD_DIR="$(dirname "$0")"
QUANT_DIR="/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization"

echo "[patch-hybrid-int4fp8-loading] Applying patch..."

# Apply patches with --forward (skip if already applied)
patch --forward --batch -p1 -d "$QUANT_DIR" < inc.py.patch || {
    echo "[patch-hybrid-int4fp8-loading] inc.py.patch already applied or failed"
}

echo "[patch-hybrid-int4fp8-loading] Done."

Updated patch (I had to add hf_config as a kwarg to maybe_update_config due to an upstream change), named
inc.py.patch

--- a/inc.py	2026-04-05 16:48:47.260943608 -0500
+++ b/inc.py	2026-04-05 17:27:27.262535630 -0500
@@ -8,6 +8,9 @@
 import torch
 from torch.nn.parameter import Parameter

+from safetensors.torch import _TYPES as _SAFETENSORS_TO_TORCH_DTYPE
+from transformers import PretrainedConfig
+
 from vllm.logger import init_logger
 from vllm.model_executor.layers.linear import (
     LinearBase,
@@ -18,6 +21,7 @@
     QuantizationConfig,
     QuantizationMethods,
 )
+from vllm.model_executor.layers.quantization.fp8 import Fp8Config, Fp8LinearMethod
 from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
 from vllm.model_executor.parameter import (
     GroupQuantScaleParameter,
@@ -26,6 +30,7 @@
 )
 from vllm.platforms import current_platform
 from vllm.scalar_type import scalar_types
+from vllm.transformers_utils.config import get_safetensors_params_metadata

 if TYPE_CHECKING:
     from vllm.model_executor.models.utils import WeightsMapper
@@ -97,6 +102,10 @@
         self.backend = backend
         self.pack_factor = Fraction(32, weight_bits)

+        # Hybrid INT4+FP8: populated by maybe_update_config
+        self.fp8_config: Fp8Config | None = None
+        self.fp8_layers: set[str] = set()
+
     def __repr__(self) -> str:
         return (
             f"INCConfig(weight_bits={self.weight_bits}, "
@@ -232,6 +241,61 @@
             )
         if self.extra_config is not None:
             self.extra_config = hf_to_vllm_mapper.apply_dict(self.extra_config)
+        if self.fp8_layers:
+            self.fp8_layers = set(
+                hf_to_vllm_mapper.apply_list(list(self.fp8_layers))
+            )
+
+    def maybe_update_config(self, model_name: str,
+                            hf_config: PretrainedConfig | None = None,
+                            revision: str | None = None
+                            ):
+        """Detect FP8 layers in hybrid INT4+FP8 checkpoints."""
+        metadata = get_safetensors_params_metadata(model_name, revision=revision)
+        fp8_weights: dict[str, dict[str, Any]] = {}
+        for param_name, info in metadata.items():
+            dtype_str = info.get("dtype", None)
+            if dtype_str is None:
+                continue
+            torch_dtype = _SAFETENSORS_TO_TORCH_DTYPE.get(dtype_str)
+            if torch_dtype == torch.float8_e4m3fn and param_name.endswith(".weight"):
+                scale_name = param_name.replace(".weight", ".weight_scale_inv")
+                if scale_name in metadata:
+                    fp8_weights[param_name] = info
+
+        if not fp8_weights:
+            return
+
+        # Infer block size from first FP8 weight + scale pair
+        block_size = None
+        for param_name, info in fp8_weights.items():
+            scale_name = param_name.replace(".weight", ".weight_scale_inv")
+            scale_info = metadata[scale_name]
+            w_shape = info.get("shape", [])
+            s_shape = scale_info.get("shape", [])
+            if len(w_shape) == 2 and len(s_shape) == 2:
+                block_size = [
+                    w_shape[0] // s_shape[0],
+                    w_shape[1] // s_shape[1],
+                ]
+                break
+
+        if block_size is None:
+            return
+
+        self.fp8_config = Fp8Config(
+            is_checkpoint_fp8_serialized=True,
+            activation_scheme="dynamic",
+            weight_block_size=block_size,
+        )
+        self.fp8_layers = {
+            name.rsplit(".weight", 1)[0] for name in fp8_weights
+        }
+        logger.info(
+            "Hybrid INT4+FP8: detected %d FP8 dense layers (block_size=%s)",
+            len(self.fp8_layers),
+            block_size,
+        )

     def apply_awq_quant_layer(self, layer, prefix: str, backend: str = "auto"):
         from vllm.model_executor.layers.fused_moe import FusedMoE
@@ -319,6 +383,26 @@
                 return AWQLinearMethod(quant_args)
         return None

+    def _is_layer_fp8(self, prefix: str) -> bool:
+        """Check if layer should use FP8 in hybrid checkpoint."""
+        if not self.fp8_layers:
+            return False
+        if prefix in self.fp8_layers:
+            return True
+        # Fused module matching
+        fused_mapping = getattr(self, "packed_modules_mapping", {})
+        proj_name = prefix.split(".")[-1]
+        if proj_name in fused_mapping:
+            shard_prefixes = [
+                prefix.replace(proj_name, shard)
+                for shard in fused_mapping[proj_name]
+            ]
+            return all(
+                any(fp8_layer in sp for fp8_layer in self.fp8_layers)
+                for sp in shard_prefixes
+            )
+        return any(fp8_layer in prefix for fp8_layer in self.fp8_layers)
+
     def apply_gptq_quant_layer(self, layer, prefix: str, backend: str = "auto"):
         from vllm.model_executor.layers.fused_moe import FusedMoE
         from vllm.model_executor.layers.quantization.utils.marlin_utils import (
@@ -328,6 +412,18 @@

         weight_bits, group_size, sym = self.get_layer_config(layer, prefix)
         if not self.check_quantized(weight_bits):
+            # Hybrid INT4+FP8: dispatch FP8 for dense layers
+            fp8_match = self._is_layer_fp8(prefix) if self.fp8_config else False
+            if "shared_expert" in prefix or "linear_attn" in prefix:
+                logger.info(
+                    "INC GPTQ dispatch: prefix=%s, bits=%d, fp8_match=%s, "
+                    "fp8_config=%s, layer_type=%s",
+                    prefix, weight_bits, fp8_match,
+                    self.fp8_config is not None,
+                    type(layer).__name__,
+                )
+            if self.fp8_config and fp8_match:
+                return Fp8LinearMethod(self.fp8_config)
             if isinstance(layer, (LinearBase, ParallelLMHead)):
                 return UnquantizedLinearMethod()
             else:
@@ -443,6 +539,9 @@
                 if (
                     layer_name == prefix or layer_name == f"model.{prefix}"
                 ) and self.extra_config[layer_name].get("bits", 16) >= 16:
+                    # Hybrid INT4+FP8: FP8 layers override unquantized
+                    if self.fp8_config and self._is_layer_fp8(prefix):
+                        return Fp8LinearMethod(self.fp8_config)
                     return UnquantizedLinearMethod()
         if current_platform.is_xpu():
             return self.apply_xpu_w4a16_quant_layer(layer, prefix)

Then

  1. Change the backend explicitly to flashinfer (--attention-backend FLASHINFER)
  2. Apply the mod
  3. Start the community Docker with a custom script mapping the directory where the hybrid model lives. Mine is below; tweak as needed:
#!/bin/bash
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/models:/models" \
~/containers/spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 --solo -d \
  --apply-mod ~/containers/spark-vllm-docker/mods/fix-qwen3.5-autoround \
  --apply-mod ~/containers/spark-vllm-docker/mods/fix-qwen3.5-chat-template \
  --apply-mod ~/containers/spark-vllm-docker/mods/enable-hybrid-int4fp8 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve /models/Qwen3.5-122B-A10B-hybrid-int4fp8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.88 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --attention-backend FLASHINFER \
  --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --chat-template unsloth.jinja \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'
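Once the server is up, a quick way to sanity-check throughput is to time a completion and divide the usage.completion_tokens count by wall-clock time. A sketch (the URL and model path are assumed to match the launch command above; adjust as needed):

```python
import json
import time
import urllib.request

def tok_per_s(response_body: dict, elapsed_s: float) -> float:
    """Throughput from an OpenAI-compatible /v1/completions response."""
    return response_body["usage"]["completion_tokens"] / elapsed_s

# Live call against the server started above (uncomment and adjust):
# req = urllib.request.Request(
#     "http://localhost:8000/v1/completions",
#     data=json.dumps({
#         "model": "/models/Qwen3.5-122B-A10B-hybrid-int4fp8",
#         "prompt": "Explain KV caching in one paragraph.",
#         "max_tokens": 256,
#     }).encode(),
#     headers={"Content-Type": "application/json"},
# )
# t0 = time.time()
# body = json.load(urllib.request.urlopen(req))
# print(f"{tok_per_s(body, time.time() - t0):.1f} tok/s")

# Offline example with a canned response:
sample = {"usage": {"completion_tokens": 256}}
print(f"{tok_per_s(sample, 6.78):.1f} tok/s")  # → 37.8 tok/s
```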

The vLLM startup log has some verbose output that doesn’t look 100% happy, but it runs. For document analysis I’m seeing around 32-35 tok/s, versus a prior baseline of more like 22 tok/s.

Edit: For vision-document analysis tasks at concurrency = 3 I am seeing over 70 tok/s overall. Scales decently.

I’m a non-technical person, so perhaps I just misunderstood something that should’ve been obvious (totally possible), but I thought the whole value proposition of the GB10 Spark et al. devices was their fundamental compatibility with NVFP4 quantization. Painful realization, indeed.

Not quite ;). MTP works with Ray, but it doesn’t work with --no-ray.


Thank you, and I found one more good param: concurrency=3.

I tried all the ideas written in this thread.
The only thing that worked was:

./run-recipe.sh qwen3.5-122b-int4-autoround --solo --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens": 2}'

num_speculative_tokens: 2 was the sweet spot, at about 32 tok/s.
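For what it’s worth, here’s a toy model of why there’s a sweet spot at all. It’s my own simplification, assuming each draft token is accepted independently with probability p and each adds a fixed fractional verification cost c; real acceptance rates are content-dependent, so treat the numbers as illustrative only:

```python
def expected_tokens_per_step(k: int, p: float) -> float:
    # One guaranteed token from the target model, plus draft token i
    # survives only if all i preceding draft tokens were accepted:
    # E = p^0 + p^1 + ... + p^k
    return sum(p ** i for i in range(k + 1))

def projected_tok_s(base_tok_s: float, k: int, p: float, c: float = 0.15) -> float:
    # Crude cost model: each draft token adds a fractional cost c per step.
    return base_tok_s * expected_tokens_per_step(k, p) / (1 + c * k)

# With an assumed 28 tok/s baseline and p = 0.55, throughput peaks at k = 2
# under these made-up constants, then declines as extra drafts get rejected.
for k in (1, 2, 3):
    print(k, round(projected_tok_s(28.0, k, p=0.55), 1))
```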

All the other ideas written here either led to errors or garbage output, I think maybe due to version mismatches and the like.

Cheers for the work and insights to start with. After loads of tinkering, and with a bit of help from Grok, I’ve got the whole thing up and running. It’s certainly faster than the original, but sadly at the expense of noticeably worse output quality. And looking at my benchmarks, that makes perfect sense: it spits out considerably fewer tokens, like a small model.

Hey, thanks for the feedback! We can’t reproduce the quality drop on our end. Quick question: did you build from docker/Dockerfile.hybrid (Step 3 in the README), or did you set things up manually? And do you see any "weight_scale_inv not found" warnings on startup?

I wanted a more neutral way of finding out whether MTP was working for my cluster of 2, so I ran a few benchmarks, and I’ll be removing that MTP line from my config :). I also did some tests with Pytorch and kv-cache-dtype fp8_e4m3, and pp2048 performance drops abruptly.

Intel/Qwen3.5-122B-A10B-int4-AutoRound - t/s (total)

| Test Case | Ray with MTP=2 | Ray with MTP=1 | Ray without MTP | Pytorch without MTP |
|---|---|---|---|---|
| pp2048 (c1) | 3317.04 ± 9.37 | 3294.61 ± 7.25 | 3329.11 ± 34.05 | 3382.14 ± 18.94 |
| tg128 (c1) | 19.89 ± 0.07 | 25.60 ± 0.14 | 33.31 ± 0.09 | 42.89 ± 0.07 |
| pp2048 (c2) | 3658.02 ± 25.77 | 3497.00 ± 179.11 | 3588.72 ± 232.21 | 3617.41 ± 161.51 |
| tg128 (c2) | 34.51 ± 0.43 | 41.91 ± 3.68 | 53.08 ± 0.48 | 69.75 ± 5.42 |
| pp2048 (c5) | 3546.24 ± 115.83 | 3660.88 ± 5.86 | 3768.92 ± 16.64 | 3705.95 ± 55.25 |
| tg128 (c5) | 40.27 ± 1.60 | 51.99 ± 0.37 | 90.42 ± 0.70 | 82.71 ± 0.82 |
| ctx_pp @ d4096 (c1) | 3572.91 ± 62.99 | 3585.31 ± 12.49 | 3440.15 ± 7.80 | 3380.80 ± 5.58 |
| ctx_tg @ d4096 (c1) | 19.59 ± 0.23 | 24.26 ± 0.18 | 39.65 ± 0.11 | 42.59 ± 0.04 |
| pp2048 @ d4096 (c1) | 1229.80 ± 1.40 | 1146.21 ± 1.19 | 1699.20 ± 4.60 | 1669.83 ± 1.15 |
| tg128 @ d4096 (c1) | 19.60 ± 0.17 | 23.72 ± 0.13 | 39.48 ± 0.06 | 42.53 ± 0.03 |
| ctx_pp @ d4096 (c2) | 3455.81 ± 316.27 | 3540.93 ± 5.75 | 3585.29 ± 7.25 | 3438.64 ± 74.02 |
| ctx_tg @ d4096 (c2) | 32.22 ± 1.84 | 30.66 ± 0.16 | 58.72 ± 0.27 | 56.55 ± 3.60 |
| pp2048 @ d4096 (c2) | 1212.63 ± 1.48 | 1208.62 ± 20.37 | 1767.75 ± 0.77 | 1668.88 ± 1.58 |
| tg128 @ d4096 (c2) | 24.65 ± 1.07 | 37.28 ± 4.98 | 58.13 ± 0.21 | 53.40 ± 0.05 |
| ctx_pp @ d4096 (c5) | 3667.17 ± 9.59 | 3628.84 ± 36.05 | 3779.05 ± 44.08 | 3736.85 ± 39.02 |
| ctx_tg @ d4096 (c5) | 35.27 ± 0.23 | 37.06 ± 2.31 | 67.54 ± 2.59 | 66.01 ± 4.38 |
| pp2048 @ d4096 (c5) | 1199.02 ± 9.51 | 1196.53 ± 6.78 | 1876.41 ± 0.71 | 1825.57 ± 0.58 |
| tg128 @ d4096 (c5) | 26.04 ± 0.87 | 30.86 ± 1.55 | 70.78 ± 3.51 | 60.56 ± 0.16 |
| ctx_pp @ d8192 (c1) | 3183.07 ± 590.09 | 3517.21 ± 1.33 | 3695.17 ± 3.50 | 3644.84 ± 6.42 |
| ctx_tg @ d8192 (c1) | 19.39 ± 0.24 | 23.09 ± 0.02 | 39.17 ± 0.13 | 42.36 ± 0.05 |
| pp2048 @ d8192 (c1) | 699.28 ± 0.73 | 1107.27 ± 1.78 | 1649.75 ± 3.90 | 1617.04 ± 4.57 |
| tg128 @ d8192 (c1) | 19.13 ± 0.42 | 22.36 ± 0.20 | 39.12 ± 0.06 | 42.13 ± 0.04 |
| ctx_pp @ d8192 (c2) | 3583.20 ± 6.36 | 3527.72 ± 45.84 | 3684.65 ± 2.34 | 3649.65 ± 1.78 |
| ctx_tg @ d8192 (c2) | 32.76 ± 0.19 | 27.16 ± 2.02 | 50.72 ± 0.16 | 43.62 ± 0.04 |
| pp2048 @ d8192 (c2) | 688.09 ± 4.26 | 1139.88 ± 1.09 | 1719.05 ± 5.53 | 1650.66 ± 35.59 |
| tg128 @ d8192 (c2) | 19.66 ± 0.07 | 28.76 ± 0.20 | 57.33 ± 0.34 | 55.06 ± 3.97 |
| ctx_pp @ d8192 (c5) | 3535.07 ± 17.36 | 3574.11 ± 5.97 | 3746.01 ± 1.67 | 3717.39 ± 4.71 |
| ctx_tg @ d8192 (c5) | 23.92 ± 1.87 | 24.71 ± 0.31 | 44.79 ± 0.04 | 42.45 ± 0.09 |
| pp2048 @ d8192 (c5) | 687.84 ± 1.82 | 1175.30 ± 10.60 | 1826.52 ± 0.88 | 1807.76 ± 17.09 |
| tg128 @ d8192 (c5) | 18.62 ± 0.17 | 31.08 ± 1.74 | 68.47 ± 0.09 | 63.04 ± 2.67 |
| ctx_pp @ d16384 (c1) | 3348.71 ± 14.53 | 3471.21 ± 3.48 | 3542.47 ± 7.45 | 3518.67 ± 4.83 |
| ctx_tg @ d16384 (c1) | 19.65 ± 0.11 | 20.72 ± 0.04 | 38.65 ± 0.21 | 41.63 ± 0.02 |
| pp2048 @ d16384 (c1) | 632.94 ± 1.71 | 1044.84 ± 4.13 | 1560.91 ± 0.47 | 1553.10 ± 2.85 |
| tg128 @ d16384 (c1) | 19.21 ± 0.38 | 20.15 ± 0.04 | 38.53 ± 0.04 | 41.45 ± 0.07 |
| ctx_pp @ d16384 (c2) | 3315.85 ± 7.17 | 3432.45 ± 4.11 | 3625.22 ± 2.30 | 3594.17 ± 4.37 |
| ctx_tg @ d16384 (c2) | 15.11 ± 0.03 | 17.80 ± 0.04 | 44.38 ± 0.00 | 46.01 ± 0.16 |
| pp2048 @ d16384 (c2) | 625.46 ± 0.90 | 1071.93 ± 1.08 | 1651.27 ± 5.45 | 1561.94 ± 1.88 |
| tg128 @ d16384 (c2) | 18.68 ± 0.13 | 26.55 ± 0.19 | 56.30 ± 0.19 | 50.82 ± 0.05 |
| ctx_pp @ d16384 (c5) | 3290.43 ± 6.79 | 3419.07 ± 5.05 | 3628.80 ± 1.01 | 3600.10 ± 4.36 |
| ctx_tg @ d16384 (c5) | 12.93 ± 0.06 | 13.98 ± 0.03 | 28.77 ± 0.02 | 29.06 ± 0.04 |
| pp2048 @ d16384 (c5) | 620.31 ± 0.85 | 1088.04 ± 1.70 | 1734.56 ± 2.53 | 1692.23 ± 6.31 |
| tg128 @ d16384 (c5) | 17.50 ± 0.07 | 27.78 ± 0.05 | 65.86 ± 0.31 | 56.44 ± 0.23 |
| ctx_pp @ d32768 (c1) | 2971.03 ± 7.66 | 3190.56 ± 5.58 | 3341.74 ± 5.88 | 3322.73 ± 3.74 |
| ctx_tg @ d32768 (c1) | 19.12 ± 0.14 | 17.34 ± 0.01 | 37.33 ± 0.23 | 40.22 ± 0.06 |
| pp2048 @ d32768 (c1) | 528.87 ± 1.24 | 944.71 ± 0.92 | 1481.41 ± 0.35 | 1458.13 ± 3.93 |
| tg128 @ d32768 (c1) | 19.28 ± 0.26 | 17.03 ± 0.03 | 37.26 ± 0.18 | 40.20 ± 0.04 |
| ctx_pp @ d32768 (c2) | 2927.73 ± 3.58 | 3179.57 ± 1.49 | 3350.58 ± 1.27 | 3333.83 ± 3.37 |
| ctx_tg @ d32768 (c2) | 8.78 ± 0.04 | 10.31 ± 0.09 | 20.64 ± 0.02 | 20.98 ± 0.01 |
| pp2048 @ d32768 (c2) | 522.93 ± 2.32 | 967.78 ± 1.31 | 1530.86 ± 1.85 | 1464.34 ± 4.26 |
| tg128 @ d32768 (c2) | 17.04 ± 0.06 | 23.40 ± 0.34 | 54.01 ± 0.12 | 48.28 ± 0.04 |
| ctx_pp @ d32768 (c5) | 2903.15 ± 19.24 | 3128.81 ± 12.60 | 3321.27 ± 4.73 | 3309.36 ± 1.67 |
| ctx_tg @ d32768 (c5) | 6.61 ± 0.05 | 7.16 ± 0.03 | 14.14 ± 0.02 | 14.20 ± 0.01 |
| pp2048 @ d32768 (c5) | 226.11 ± 46.99 | 976.83 ± 2.02 | 1606.94 ± 1.54 | 1577.42 ± 4.87 |
| tg128 @ d32768 (c5) | 7.37 ± 1.12 | 23.99 ± 0.17 | 62.33 ± 0.05 | 52.77 ± 0.17 |

llama-benchy (0.3.5)
date: 2026-04-06 00:00:00 | latency mode: api

P.S.
This is more to compare the different runs than to check raw performance. The same server is running 2 more Qwen3 VLMs, Qdrant, Docling and Langflow. I didn’t want to take anything down during the tests.
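For reading the table: the ± values are presumably the sample standard deviation across repeated runs, i.e. something like:

```python
from statistics import mean, stdev

def summarize(samples: list[float]) -> str:
    """Format repeated throughput samples as 'mean ± std' (sample std, n-1)."""
    return f"{mean(samples):.2f} ± {stdev(samples):.2f}"

# Made-up samples that illustrate the table's cell format:
print(summarize([33.22, 33.31, 33.40]))  # → 33.31 ± 0.09
```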


For reference on my ASUS Ascent GX10 - solo:

  • LM Studio with Unsloth Q4_K_M - 262144 context, KV Q8_0:

Just to add some context on the quality side: in my setup, the AutoRound INT4 + FP8/INT8 hybrid gives almost the exact same quality as the original BF16. Plus, my current speed is now hitting around 50 tok/s (huge thanks to everyone in this thread for the ideas and insights!).

Standard Q4_K_M was never really an option for me from the start. You lose up to 4% in quality, which in practice means more logic breakdowns and refusals on complex tasks. To get a fair comparison in terms of quality, you’d really need to pit my setup against Q8_0.

The main difference is that AutoRound INT4 isn’t just blindly squashing weights to 4 bits. It’s an activation-aware algorithm that runs a calibration dataset to adjust the weights, ensuring the final output matches the original 16-bit model as closely as possible.
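To illustrate the difference, here is the naive round-to-nearest INT4 baseline with per-group scales, as a toy sketch (this is what AutoRound improves on, not AutoRound itself, and the weights are made up):

```python
def quantize_rtn_int4(weights, group_size=4):
    """Symmetric round-to-nearest INT4 (levels -8..7) with one scale per
    group. This is the naive baseline; AutoRound additionally *learns*
    small per-weight rounding adjustments against calibration activations
    so the dequantized output tracks the 16-bit model more closely."""
    out = []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid div-by-zero
        out.extend(max(-8, min(7, round(w / scale))) * scale for w in group)
    return out

original = [0.31, -0.12, 0.05, 0.44, 1.20, -0.90, 0.02, 0.70]
restored = quantize_rtn_int4(original)
worst = max(abs(a - b) for a, b in zip(original, restored))
print(f"worst-case round-trip error: {worst:.3f}")
```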

How much context are you getting with this setup? I only got 32k context at 38 tok/s, which is too low for my openclaw setup despite the added speed.

Yes, in v1, the context window is limited to 32k tokens. I am currently working on v2, which will support a 256k context window — the recommended configuration for Qwen3.5 122B. I expect v2 to run comfortably at around 45–50 tokens/second using AutoRound INT4 with an FP8/INT8 hybrid quantization approach. Longer context lengths beyond that may affect the stability of Qwen3.5 122B.


Looking forward to it!


I’m not sure what went wrong for you, but - using my mod above on the community Docker built a few days ago - I am seeing no significant quality difference relative to the native int4. I have a couple of prompts that exhibit quirky behavior, and they show the same quirks with the hybrid. It’s definitely not like a small or different model.

Also with my start script you’ll get full 256k context.

Nice! I spent a while fighting the inc.py patch on the latest vLLM wheels, but your patch fixed it. Got the hybrid + MTP model running and was able to run an A/B test with the provided benchmark:

vLLM: 0.19.1rc1.dev46+gc5e3454e5.d20260406

Qwen3.5-122B-A10B-hybrid-int4fp8 with MTP=1

── Run 1/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 6.78s = 37.7 tok/s (prompt: 23)
  [Code] 502 tokens in 12.76s = 39.3 tok/s (prompt: 30)
  [JSON] 1024 tokens in 25.89s = 39.5 tok/s (prompt: 48)
  [Math] 64 tokens in 1.77s = 36.1 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 51.06s = 40.1 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 6.78s = 37.7 tok/s (prompt: 23)
  [Code] 512 tokens in 13.01s = 39.3 tok/s (prompt: 30)
  [JSON] 1024 tokens in 26.22s = 39.0 tok/s (prompt: 48)
  [Math] 64 tokens in 1.76s = 36.3 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 51.17s = 40.0 tok/s (prompt: 37)

=== Done ===

Qwen3.5-122B-A10B-int4-AutoRound

── Run 1/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 8.87s = 28.8 tok/s (prompt: 23)
  [Code] 512 tokens in 17.71s = 28.9 tok/s (prompt: 30)
  [JSON] 1024 tokens in 35.82s = 28.5 tok/s (prompt: 48)
  [Math] 64 tokens in 2.31s = 27.7 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 70.88s = 28.8 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 8.91s = 28.7 tok/s (prompt: 23)
  [Code] 502 tokens in 17.44s = 28.7 tok/s (prompt: 30)
  [JSON] 1024 tokens in 35.51s = 28.8 tok/s (prompt: 48)
  [Math] 64 tokens in 2.32s = 27.5 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 71.05s = 28.8 tok/s (prompt: 37)

=== Done ===

Glad it helped! I’ve been really happy with this setup. Though I haven’t done exhaustive benchmarks, my usual work is with postgraduate-level text using specialized vocabulary, which is why I tend to get a little less out of MTP.

Is that comparison with MTP=1 on the original int4-AutoRound model, or no MTP?

The bottom one is just vanilla int4-AutoRound, no MTP. I haven’t benchmarked the base model with MTP or tweaked the speculative tokens yet, but I’m pretty happy with this model at 38-40 tok/s.