Qwen3.5-397B-A17B running on dual Spark! But I have a concern.

Can you post output of ibdev2netdev and ifconfig -a?

Make sure you don’t use the same subnet on both “twins” of the same physical interface, and note that you only need to assign IPs to one pair of twins (the one that is up).

You also don’t need all four in NCCL_IB_HCA, only the active ones: export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1

Have you run NCCL tests? Can you post the results?
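For a quick two-node check, something along these lines should work (just a sketch: it assumes nccl-tests is built under ./build, OpenMPI is installed, and passwordless SSH between the nodes; the IPs and interface below are placeholders):

```bash
# Adjust these to your environment before running
NODE1=192.168.1.10        # first node's IP on the QSFP link
NODE2=192.168.1.11        # second node's IP on the QSFP link
IFACE=enp1s0f1np1         # interface backing the active HCA

export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_SOCKET_IFNAME=${IFACE}

# One rank per node, one GPU per rank; sweep message sizes from 8 B to 1 GiB
mpirun -np 2 -H ${NODE1}:1,${NODE2}:1 \
  -x NCCL_IB_HCA -x NCCL_SOCKET_IFNAME \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```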

Thanks eugr. Here’s what I’ve got:

ib_write_bw — RDMA works at 108 Gbps bare metal:

RDMA_Write BW Test
 Dual-port       : OFF          Device         : rocep1s0f1
 Number of qps   : 4            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm

 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      20000            108.30             107.93              0.205867
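For reference, the invocation was roughly this (flags inferred from the header above; server side on node2, client on node1, and the exact run may have differed slightly):

```bash
# Node2 (server): listen on the RoCE device, rdma_cm connection setup, 4 QPs
ib_write_bw -d rocep1s0f1 -R -q 4 -n 20000 --report_gbits

# Node1 (client): connect to node2 over the QSFP link
ib_write_bw -d rocep1s0f1 -R -q 4 -n 20000 --report_gbits 192.168.101.11
```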

Node1 ibdev2netdev:

rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)

Node1 ifconfig (QSFP interfaces only):

enp1s0f1np1: mtu 9000
        inet 192.168.101.10  netmask 255.255.255.0

enP2p1s0f1np1: mtu 9000
        inet 169.254.20.13  netmask 255.255.0.0 (link-local only)

enp1s0f0np0: mtu 1500 (no IP)
enP2p1s0f0np0: mtu 1500 (no IP)

Node2 ibdev2netdev:

rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)

Node2 ifconfig (QSFP interfaces only):

enp1s0f1np1: mtu 9000
        inet 192.168.101.11  netmask 255.255.255.0

enP2p1s0f1np1: mtu 9000
        inet 169.254.173.60  netmask 255.255.0.0 (link-local only)

enp1s0f0np0: mtu 1500 (no IP)
enP2p1s0f0np0: mtu 1500 (no IP)

Netplan on both nodes — cleaned up per your networking guide. Single /24 IP on enp1s0f1np1, twin enP2p1s0f1np1 at MTU 9000 with link-local only. No stale /30 addresses.
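For anyone replicating this, a minimal netplan sketch matching that layout could look like the following (file name and exact keys are my guess, shown with node1's address; adjust per node):

```bash
# Hypothetical /etc/netplan/99-qsfp.yaml for node1, matching the description above
cat <<'EOF' | sudo tee /etc/netplan/99-qsfp.yaml
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      mtu: 9000
      addresses: [192.168.101.10/24]
    enP2p1s0f1np1:
      mtu: 9000
      link-local: [ipv4]
EOF
sudo netplan apply
```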

Problem: RoCE works perfectly bare metal (108 Gbps), but NCCL allReduce crashes inside the vllm-node-tf5 container when launching with the recipe (./run-recipe.sh qwen3.5-397b-int4-autoround -n 192.168.101.10,192.168.101.11 --eth-if enp1s0f1np1). Same crash with my manual launch-cluster.sh without NCCL_IB_DISABLE=1. Error is RuntimeError: NCCL error: unhandled system error on the first allReduce in PyNcclCommunicator init. Only works with NCCL_IB_DISABLE=1, which gives ~9 tok/s.

Single cable in the rightmost QSFP port. What am I missing for RDMA to work inside the container?

RoCE is working now — unplugging the second cable fixed the NCCL crash. ib_write_bw confirms 108 Gbps. Thanks for pointing me to the networking guide.

New issue: the recipe OOMs during engine startup. Model loads fine at 97.79 GiB (text-only with --language-model-only), CUDA graphs compile, but it dies during Ray compiled DAG init. Tried --gpu-memory-utilization 108 and 105, both fail. Available memory shows 108.4 GiB before requesting. With my old manual launch-cluster.sh setup (older vLLM build, NCCL_IB_DISABLE=1), 105GB worked fine. Is there extra memory overhead in the TF5 nightly that I need to account for? What gpu-memory-utilization value should I use with this recipe?

Quick benchy on dual node Asus GX10, latest firmware and latest build (3/24/2026):

| model   |            test |             t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:--------|----------------:|----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| qwen    |          pp2048 | 1635.28 ± 14.89 |              |  1162.88 ± 11.48 |  1159.95 ± 11.48 |  1162.94 ± 11.48 |
| qwen    |            tg32 |    29.54 ± 0.01 | 30.00 ± 0.00 |                  |                  |                  |
| qwen    |  ctx_pp @ d4096 | 1872.59 ± 16.81 |              |   2024.38 ± 7.86 |   2021.45 ± 7.86 |   2024.45 ± 7.85 |
| qwen    |  ctx_tg @ d4096 |    29.54 ± 0.02 | 30.00 ± 0.00 |                  |                  |                  |
| qwen    |  pp2048 @ d4096 |   645.30 ± 4.92 |              |  3176.83 ± 24.09 |  3173.90 ± 24.09 |  3176.91 ± 24.08 |
| qwen    |    tg32 @ d4096 |    29.53 ± 0.05 | 30.00 ± 0.00 |                  |                  |                  |
| qwen    |  ctx_pp @ d8192 |  1835.90 ± 5.87 |              |  4117.84 ± 14.54 |  4114.91 ± 14.54 |  4117.90 ± 14.53 |
| qwen    |  ctx_tg @ d8192 |    29.51 ± 0.07 | 30.00 ± 0.00 |                  |                  |                  |
| qwen    |  pp2048 @ d8192 |   661.26 ± 8.37 |              |  3100.55 ± 38.92 |  3097.62 ± 38.92 |  3100.61 ± 38.93 |
| qwen    |    tg32 @ d8192 |    29.45 ± 0.04 | 30.00 ± 0.00 |                  |                  |                  |
| qwen    | ctx_pp @ d16384 |  1776.17 ± 2.63 |              |  8389.18 ± 38.40 |  8386.25 ± 38.40 |  8389.26 ± 38.41 |
| qwen    | ctx_tg @ d16384 |    29.37 ± 0.03 | 30.00 ± 0.00 |                  |                  |                  |
| qwen    | pp2048 @ d16384 |  773.84 ± 65.16 |              | 2667.49 ± 214.07 | 2664.56 ± 214.07 | 2667.58 ± 214.09 |
| qwen    |   tg32 @ d16384 |    29.37 ± 0.07 | 30.00 ± 0.00 |                  |                  |                  |

RoCE is working now (NCCL_IB_DISABLE=0, deleted stale .env, single cable). Went from 9 tok/s (TCP) to 13.7 tok/s (RoCE). But community benchmarks show 27+. gpu-memory-utilization 105, max-model-len 65536, language-model-only. What else could explain the 2x gap?

You can use the recipe in the repo with the --no-ray flag - it will bypass Ray and use the native backend with less overhead.

Shut down both nodes, unplug the cables (and maybe the power bricks too), wait a little, then plug everything back in and start up again - you may be experiencing a well-known issue where the CPU/GPU get stuck at lower frequencies.
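If you want to verify before and after the power cycle, something like this should show whether the clocks are stuck low (assumes the usual nvidia-smi and cpupower tools are installed):

```bash
# Current vs. max GPU clocks
nvidia-smi -q -d CLOCK
# Current CPU frequency
cpupower frequency-info | grep -i "current cpu frequency"
```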

Really appreciate this forum and thread. I was trying to do this the hard way by myself. With y’all’s help and advice, I got the 397B running with full context on two clustered Asus Ascent GB10s. I was really struggling with the last mile: I couldn’t reserve enough RAM for the GPU memory utilization @ 112 GB. The big win was changing the default swappiness=60 to swappiness=10. That seemed to get me over the finish line (after doing everything else recommended in this thread). Thanks again to everyone publishing all their learnings!
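For anyone following along, the swappiness change itself is just (assuming a stock Ubuntu sysctl layout):

```bash
# Apply immediately
sudo sysctl vm.swappiness=10
# Persist across reboots
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf
```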

Next up… think I’ll take one of the heretic versions of the 397B @ BF16 and run AutoRound on it to get it to this nice INT4 size.

I just tried this after rebuilding vllm-node and vllm-node-tf5:

$ ./run-recipe.sh qwen3.5-397b-int4-autoround.yaml --no-ray
Recipe: Qwen3.5-397B-INT4-Autoround
EXPERIMENTAL recipe for Qwen3.5-397B-INT4-Autoround (please refer to README for details! Use with --no-ray parameter!)

Using cluster nodes from .env: 192.168.0.164, 192.168.0.180

=== Launching ===
Container: vllm-node-tf5
Mods: mods/fix-qwen3.5-autoround, mods/fix-qwen3.5-chat-template, mods/gpu-mem-util-gb
Cluster: 2 nodes

Using launch script: /tmp/tmpmb8_ff3k.sh
Detected Local IP: 192.168.0.164 (192.168.0.164/24)
Head Node: 192.168.0.164
Worker Nodes: 192.168.0.180
Container Name: vllm_node
Image Name: vllm-node-tf5
Action: exec
Checking SSH connectivity to worker nodes…
SSH to 192.168.0.180: OK
Starting Head Node on 192.168.0.164…
32f0a7dbf8a0dc4888df526faf65d7600d7dfda83986af1c56c2b37efc6d6a76
Starting Worker Node on 192.168.0.180…
988fb939945342fa0c8f75c743db7755d61fa340ea0168a6e522fc9227a1d946
Applying modifications to cluster nodes…
Applying mod ‘fix-qwen3.5-autoround’ to 192.168.0.164…
Copying directory content to container…
Successfully copied 4.1kB to vllm_node:/workspace/mods/fix-qwen3.5-autoround/
Running patch script on 192.168.0.164…
patching file transformers/modeling_rope_utils.py
Hunk #1 FAILED at 648.
1 out of 1 hunk FAILED – saving rejects to file transformers/modeling_rope_utils.py.rej
Error: Patch script failed on 192.168.0.164

Stopping cluster…
Stopping head node (192.168.0.164)…
Stopping worker node (192.168.0.180)…
Cluster stopped.


This thread has been awesome. I’ve been running 397B for five days now. It’s been stable, sitting there with 0.6 gigs of free RAM on one of the nodes and it hasn’t crashed yet.

If I had known that different manufacturers have different amounts of free VRAM available, I would have gotten another EdgeExpert instead of the Gigabyte. Gigabyte gives us 2 GB less available VRAM because of some BIOS setting or something that cannot be changed. So it’s touch and go, but it has been sitting there and running; at this point I am afraid to disturb it lol.

Please make sure that you rebuild vllm-node-tf5 with --tf5 argument:

./build-and-copy.sh -t vllm-node-tf5 --tf5 -c

Your error tells me that the container was built without --tf5 argument.

Same with the Asus GB10.
2 GB less than my node 1, which is an Nvidia Spark.
As the boards are more or less the same, I would be happy having both systems on the same BIOS settings, except for fan control, which is laid out differently.
But nevertheless I am happy running the 397B AutoRound rock stable for days under heavy load.

My Asus Ascent nodes have 115 GiB of RAM available; only the head node is at about 112 GiB. Do you guys have similar resources available?

happypatrick/Qwen3.5-397B-A17B-heretic-int4-AutoRound on Hugging Face. Uses less VRAM! Also, not as smart. But it’s a fun one.

Heretic with this model size sounds crazy. When you say it’s not as smart, is it very noticeable?

Having the same error, but with two nodes (did you set different IPs on the p1/P2 twins?)

Tried johnny_nv’s native vLLM 0.18.1 build (post #149) on dual DGX Spark with Qwen3.5-397B.

Setup

- Dual DGX Spark (GB10, 122 GiB unified memory each), 200GbE QSFP

- vLLM 0.18.1 from johnnynunez/vllm + FlashInfer 0.6.7 from johnnynunez/flashinfer

- Built natively (no Docker) with `uv`, Python 3.12, PyTorch 2.11.0+cu130

Multi-node TP=2 without Docker

`vllm serve --nnodes 2` does **not** work for tensor parallelism in 0.18.1. The v1 engine treats `--nnodes` as data parallelism only. For multi-node TP you need:

```bash
torchrun --nnodes=2 --nproc-per-node=1 --node-rank=$RANK \
  --master-addr=$MASTER_IP --master-port=29500 \
  -m vllm.entrypoints.openai.api_server \
  --distributed-executor-backend external_launcher \
  --tensor-parallel-size 2 ...
```

Also required:

- `export GLOO_SOCKET_IFNAME=` — without this, gloo tries localhost and fails

- `export NCCL_SOCKET_IFNAME=` + `NCCL_IB_HCA=` as usual
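Concretely, my environment ended up with something like this (the interface and HCA names are just examples taken from earlier in this thread; use whatever ibdev2netdev reports on your nodes):

```bash
export GLOO_SOCKET_IFNAME=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export NCCL_IB_HCA=rocep1s0f1
```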

NVFP4 397B does NOT fit on dual Spark

`nvidia/Qwen3.5-397B-A17B-NVFP4` (251 GB, 11 shards × 25 GB) OOM-kills during shard loading at every utilization level tested (0.90, 0.80, 0.75), even with `--language-model-only` and reduced context. The 25 GB shards are too large for the loading pipeline on unified memory. NVIDIA’s TP=4 recommendation is correct — this model needs 4 GPUs.

The int4-AutoRound version (199 GB, 41 shards × 5 GB) loads fine in ~6.5 min.

`--gpu-memory-utilization-gb` cannot be cleanly ported to 0.18.1

Adding a field to CacheConfig’s pydantic dataclass breaks VllmConfig initialization — `cache_config` becomes `None` during `__post_init__` due to pydantic v2 field ordering. Workaround: patch `request_memory()` in `v1/worker/utils.py` to read from a `VLLM_GPU_MEMORY_GB` env var instead. This handles the startup memory validation but does NOT affect KV cache block calculation.
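With that patch in place, the override is just an environment variable at launch time (my own knob, not an upstream vLLM setting; the value here is illustrative):

```bash
# Read by the patched request_memory(); stock vLLM ignores this variable
export VLLM_GPU_MEMORY_GB=105
```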

v1 engine KV cache bug with hybrid Mamba models

With Qwen3.5-397B (hybrid Mamba + attention + linear attention), the v1 engine computes `num_gpu_blocks=0` even with ~10 GiB available for KV cache. A second code path then correctly computes 326-332 blocks, but only if `--num-gpu-blocks-override` is set. Without the override flag, the model gets stuck at 4 blocks (minimum default).

This appears to be a bug in how the v1 engine profiles memory for hybrid architectures during the initial profiling pass.
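A sketch of the workaround, appended to the torchrun command from above (the block count is what the second code path reported on my setup; tune it to your own headroom):

```bash
torchrun --nnodes=2 --nproc-per-node=1 --node-rank=$RANK \
  --master-addr=$MASTER_IP --master-port=29500 \
  -m vllm.entrypoints.openai.api_server \
  --distributed-executor-backend external_launcher \
  --tensor-parallel-size 2 \
  --num-gpu-blocks-override 326 ...
```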

v1 engine is ~500x slower than v0 for this model

Even with everything correctly configured (332 KV blocks, proper NCCL, both nodes active, GPU at 96% utilization), inference runs at **3.2 tok/s prefill** vs ~1500 tok/s on eugr’s 0.17.x Docker build. Decode was similarly broken (~0.1 tok/s vs 30 tok/s).

Note: johnny_nv’s single-node TP=1 setup with the 122B Qwen3.5 (same hybrid Mamba architecture) works fine on 0.18.1. So the v1 engine handles hybrid Mamba models — the regression appears specific to **multi-node TP with external_launcher**, possibly combined with Marlin (int4-AutoRound) quantization kernels. The v0 engine in 0.17.x handles multi-node TP for this model correctly via its `--no-ray` NCCL path.

spark-vllm-docker builds are from the main vLLM branch (0.18.x) and all these models work just fine. So it looks like something is wrong with your build.

For most users I strongly recommend using nightly builds from spark-vllm-docker - they go through a test pipeline that tests a few popular models (including Qwen3.5 for regressions). Next nightly run will include Johnny’s patches - I’m testing them locally now, so far so good.

I don’t think v0 even supports these architectures anymore, but in any case, v1 has been the default engine since last year at least.

Case in point, using my latest (unpublished) build with Johnny’s PRs - these PRs will be included in the next nightly run (if it doesn’t fail):

./run-recipe.sh -t vllm-node-20260330-nvfp4-cudnn-tf5 recipes/qwen3.5-122b-int4-autoround.yaml --port 8888 --served-model-name coder-250k --no-ray
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299]
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.1rc1.dev254+g494636b29.d20260330
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299]   █▄█▀ █     █     █     █  model   Intel/Qwen3.5-122B-A10B-int4-AutoRound
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299]
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:233] non-default args: {'model_tag': 'Intel/Qwen3.5-122B-A10B-int4-AutoRound', 'chat_template': 'unsloth.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'port': 8888, 'model': 'Intel/Qwen3.5-122B-A10B-int4-AutoRound', 'trust_remote_code': True, 'max_model_len': 262144, 'served_model_name': ['coder-250k'], 'load_format': 'fastsafetensors', 'reasoning_parser': 'qwen3', 'master_addr': '192.168.24.104', 'nnodes': 2, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.7, 'enable_prefix_caching': True, 'max_num_batched_tokens': 8192}
(APIServer pid=69) WARNING 03-30 22:35:40 [envs.py:1749] Unknown vLLM environment variable detected: VLLM_BASE_DIR
(APIServer pid=69) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=69) INFO 03-30 22:35:41 [model.py:549] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=69) INFO 03-30 22:35:41 [model.py:1679] Using max model len 262144
(APIServer pid=69) INFO 03-30 22:35:41 [arg_utils.py:1719] Inferred data_parallel_rank 0 from node_rank 0
(APIServer pid=69) INFO 03-30 22:35:41 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=69) WARNING 03-30 22:35:41 [config.py:253] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=69) INFO 03-30 22:35:41 [config.py:273] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=69) INFO 03-30 22:35:41 [vllm.py:789] Asynchronous scheduling is enabled.
(APIServer pid=69) `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=69) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
/usr/local/lib/python3.12/dist-packages/torch/compiler/__init__.py:148: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
  return torch._dynamo.allow_in_graph(fn)
(EngineCore pid=124) INFO 03-30 22:35:58 [core.py:105] Initializing a V1 LLM engine (v0.18.1rc1.dev254+g494636b29.d20260330) with config: model='Intel/Qwen3.5-122B-A10B-int4-AutoRound', speculative_config=None, tokenizer='Intel/Qwen3.5-122B-A10B-int4-AutoRound', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=inc, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=coder-250k, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=124) WARNING 03-30 22:35:58 [multiproc_executor.py:1014] Reducing Torch parallelism from 20 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=124) INFO 03-30 22:35:58 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=192.168.24.104, mq_connect_ip=192.168.24.104 (local), world_size=2, local_world_size=1
/usr/local/lib/python3.12/dist-packages/torch/compiler/__init__.py:148: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
  return torch._dynamo.allow_in_graph(fn)
`Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(Worker pid=171) INFO 03-30 22:36:03 [parallel_state.py:1400] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://192.168.24.104:29501 backend=nccl
(Worker pid=171) INFO 03-30 22:36:09 [pynccl.py:111] vLLM is using nccl==2.29.7
(Worker pid=171) WARNING 03-30 22:36:10 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.1 not supported, communicator is not available.
(Worker pid=171) INFO 03-30 22:36:10 [parallel_state.py:1716] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=171) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(Worker pid=171) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(Worker_TP0 pid=171) INFO 03-30 22:36:20 [gpu_model_runner.py:4737] Starting to load model Intel/Qwen3.5-122B-A10B-int4-AutoRound...
(Worker_TP0 pid=171) INFO 03-30 22:36:20 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP0 pid=171) INFO 03-30 22:36:20 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP0 pid=171) INFO 03-30 22:36:20 [gptq_marlin.py:382] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0 pid=171) INFO 03-30 22:36:20 [gdn_linear_attn.py:147] Using Triton/FLA GDN prefill kernel
(Worker_TP0 pid=171) INFO 03-30 22:36:21 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker_TP0 pid=171) INFO 03-30 22:36:21 [flash_attn.py:607] Using FlashAttention version 2
Loading safetensors using Fastsafetensor loader:   0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors using Fastsafetensor loader:  12% Completed | 1/8 [00:06<00:48,  6.95s/it]
Loading safetensors using Fastsafetensor loader:  25% Completed | 2/8 [00:13<00:40,  6.76s/it]
Loading safetensors using Fastsafetensor loader:  38% Completed | 3/8 [00:20<00:34,  6.93s/it]
Loading safetensors using Fastsafetensor loader:  50% Completed | 4/8 [00:26<00:26,  6.65s/it]
Loading safetensors using Fastsafetensor loader:  62% Completed | 5/8 [00:34<00:20,  6.86s/it]
Loading safetensors using Fastsafetensor loader:  75% Completed | 6/8 [00:40<00:13,  6.55s/it]
Loading safetensors using Fastsafetensor loader:  88% Completed | 7/8 [00:43<00:05,  5.36s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:43<00:00,  3.90s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:43<00:00,  5.48s/it]
(Worker_TP0 pid=171)
(Worker_TP0 pid=171) INFO 03-30 22:37:06 [default_loader.py:384] Loading weights took 43.80 seconds
(Worker_TP0 pid=171) INFO 03-30 22:37:08 [gpu_model_runner.py:4822] Model loading took 31.47 GiB memory and 47.181670 seconds
(Worker_TP0 pid=171) INFO 03-30 22:37:08 [interface.py:586] Setting attention block size to 2096 tokens to ensure that attention page size is >= mamba page size.
(Worker_TP0 pid=171) INFO 03-30 22:37:08 [interface.py:610] Padding mamba page size by 0.58% to ensure that mamba page size and attention page size are exactly equal.
(Worker_TP0 pid=171) INFO 03-30 22:37:08 [gpu_model_runner.py:5761] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1412: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171)   allow_in_graph(einops.rearrange)
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1414: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171)   allow_in_graph(einops.reduce)
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1417: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171)   allow_in_graph(einops.repeat)  # available since einops 0.2.0
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1420: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171)   allow_in_graph(einops.einsum)  # available since einops 0.5.0
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1423: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171)   allow_in_graph(einops.pack)  # available since einops 0.6.0
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1426: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171)   allow_in_graph(einops.unpack)  # available since einops 0.6.0
(Worker_TP0 pid=171) INFO 03-30 22:37:20 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/3ce4635d9f/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=171) INFO 03-30 22:37:20 [backends.py:1111] Dynamo bytecode transform time: 5.39 s
(EngineCore pid=124) INFO 03-30 22:38:09 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=171) INFO 03-30 22:38:32 [backends.py:390] Compiling a graph for compile range (1, 8192) takes 72.03 s
(Worker_TP0 pid=171) INFO 03-30 22:38:37 [backends.py:895] collected artifacts: 49 entries, 39 artifacts, 192096968 bytes total
(Worker_TP0 pid=171) INFO 03-30 22:38:37 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/841de02c07eaef4369f98f98168acac541deb81e0d029b6009864039cc79cfb5/rank_0_0/model
(Worker_TP0 pid=171) INFO 03-30 22:38:37 [monitor.py:48] torch.compile took 82.85 s in total
(EngineCore pid=124) INFO 03-30 22:39:09 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=171) INFO 03-30 22:40:06 [monitor.py:76] Initial profiling/warmup run took 89.13 s
(EngineCore pid=124) INFO 03-30 22:40:09 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=171) INFO 03-30 22:40:11 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(Worker_TP0 pid=171) INFO 03-30 22:40:12 [gpu_model_runner.py:5884] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(Worker_TP0 pid=171) INFO 03-30 22:40:45 [gpu_model_runner.py:5963] Estimated CUDA graph memory: 1.53 GiB total
(Worker_TP0 pid=171) INFO 03-30 22:40:45 [gpu_worker.py:436] Available KV cache memory: 46.8 GiB
(Worker_TP0 pid=171) INFO 03-30 22:40:45 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.7000 to 0.7126 to maintain the same effective KV cache size.
(EngineCore pid=124) INFO 03-30 22:40:45 [kv_cache_utils.py:1319] GPU KV cache size: 1,018,656 tokens
(EngineCore pid=124) INFO 03-30 22:40:45 [kv_cache_utils.py:1324] Maximum concurrency for 262,144 tokens per request: 14.73x
(Worker_TP0 pid=171) 2026-03-30 22:40:48,462 - INFO - autotuner.py:455 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP0 pid=171) 2026-03-30 22:40:49,052 - INFO - autotuner.py:464 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:06<00:00,  8.33it/s]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:03<00:00,  8.91it/s]
(Worker_TP0 pid=171) INFO 03-30 22:41:00 [gpu_model_runner.py:6052] Graph capturing finished in 12 secs, took 0.83 GiB
(Worker_TP0 pid=171) INFO 03-30 22:41:00 [gpu_worker.py:597] CUDA graph pool memory: 0.83 GiB (actual), 1.53 GiB (estimated), difference: 0.7 GiB (83.8%).
(EngineCore pid=124) INFO 03-30 22:41:00 [core.py:283] init engine (profile, create kv cache, warmup model) took 232.16 seconds
(EngineCore pid=124) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=124) `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore pid=124) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=124) INFO 03-30 22:41:12 [vllm.py:789] Asynchronous scheduling is enabled.
(APIServer pid=69) INFO 03-30 22:41:12 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=69) INFO 03-30 22:41:13 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=69) WARNING 03-30 22:41:13 [model.py:1436] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=69) INFO 03-30 22:41:13 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=69) INFO 03-30 22:41:20 [base.py:231] Multi-modal warmup completed in 6.375s
(APIServer pid=69) INFO 03-30 22:41:20 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8888
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:37] Available routes are:
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=69) INFO:     Started server process [69]
(APIServer pid=69) INFO:     Waiting for application startup.
(APIServer pid=69) INFO:     Application startup complete.
| model   |            test |             t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:--------|----------------:|----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 | 3496.38 ± 116.70 | | 594.66 ± 19.99 | 586.70 ± 19.99 | 594.83 ± 19.95 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 | 45.22 ± 0.20 | 46.69 ± 0.21 | | | |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 @ d4096 | 3497.13 ± 368.29 | | 1786.36 ± 202.64 | 1778.40 ± 202.64 | 1786.47 ± 202.59 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d4096 | 44.87 ± 0.02 | 46.32 ± 0.01 | | | |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 @ d8192 | 3688.09 ± 39.16 | | 2785.05 ± 29.71 | 2777.09 ± 29.71 | 2785.16 ± 29.71 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d8192 | 45.85 ± 2.13 | 47.35 ± 2.20 | | | |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 @ d16384 | 3666.54 ± 15.81 | | 5035.41 ± 21.75 | 5027.45 ± 21.75 | 5035.58 ± 21.82 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d16384 | 44.83 ± 1.54 | 46.30 ± 1.59 | | | |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 @ d32078 | 3384.99 ± 14.77 | | 10089.92 ± 44.19 | 10081.96 ± 44.19 | 10090.00 ± 44.21 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d32078 | 43.94 ± 1.52 | 45.37 ± 1.57 | | | |

llama-benchy (0.3.5)
date: 2026-03-30 15:52:18 | latency mode: api | pp basis: ttfr

There is a slight improvement in that recipe compared to the previous one.