This was pretty much fully vibe-coded by Opus 4.7, so if something doesn’t work 100%, let me know and I will get mad at Claude for you and fix the problem. Hopefully this is actually as big an improvement as I think it is. I haven’t run many tests with it yet, but I would rather get it out to the public even if it only works 50%.
╭───────────────────────────────── ⚡ llama-benchy Throughput Benchmark ──────────────────────────────────╮
│ /models/qwen3.6-27b-fp8 │
│ pp=[2048] tg=[128] depth=[0, 4096, 8192] concurrency=[1, 2, 4] runs=3 latency=generation │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✓ Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 0:14:08
llama-benchy 0.3.7
Estimated latency: 223.1 ms
llama-benchy Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Test ┃ c ┃ pp t/s ┃ tg t/s ┃ TTFT (ms) ┃ Total (ms) ┃ Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0 │ c1 │ 1,617 │ 23.4 │ 1,392 │ 6,637 │ 2048+128 │
│ pp2048 tg128 @ d0 │ c2 │ 1,484 │ 36.7 │ 2,507 │ 8,439 │ 2048+128 │
│ pp2048 tg128 @ d0 │ c4 │ 853 │ 55.1 │ 8,757 │ 16,062 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c1 │ 1,186 │ 22.3 │ 5,004 │ 10,529 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c2 │ 696 │ 33.2 │ 16,793 │ 22,993 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c4 │ 423 │ 59.1 │ 55,102 │ 61,909 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c1 │ 487 │ 23.9 │ 19,065 │ 24,189 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c2 │ 432 │ 39.5 │ 46,385 │ 51,847 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c4 │ 414 │ 43.0 │ 97,902 │ 105,207 │ 2048+128 │
└──────────────────────────┴───────┴─────────────┴─────────────┴──────────────┴─────────────┴─────────────┘
With FlashQLA ^
╭───────────────────────────────── ⚡ llama-benchy Throughput Benchmark ──────────────────────────────────╮
│ /models/qwen3.6-27b-fp8 │
│ pp=[2048] tg=[128] depth=[0, 4096, 8192] concurrency=[1, 2, 4] runs=3 latency=generation │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✓ Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 0:14:49
llama-benchy 0.3.7
Estimated latency: 220.3 ms
llama-benchy Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Test ┃ c ┃ pp t/s ┃ tg t/s ┃ TTFT (ms) ┃ Total (ms) ┃ Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0 │ c1 │ 1,529 │ 21.9 │ 1,455 │ 7,079 │ 2048+128 │
│ pp2048 tg128 @ d0 │ c2 │ 746 │ 33.6 │ 5,029 │ 10,820 │ 2048+128 │
│ pp2048 tg128 @ d0 │ c4 │ 950 │ 53.5 │ 8,984 │ 16,345 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c1 │ 1,136 │ 19.0 │ 5,251 │ 11,778 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c2 │ 791 │ 26.8 │ 14,249 │ 22,538 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c4 │ 409 │ 45.7 │ 59,517 │ 68,983 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c1 │ 659 │ 14.6 │ 14,517 │ 23,047 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c2 │ 583 │ 26.1 │ 35,676 │ 44,573 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c4 │ 362 │ 27.5 │ 102,900 │ 115,563 │ 2048+128 │
└──────────────────────────┴───────┴─────────────┴─────────────┴──────────────┴─────────────┴─────────────┘
Without FlashQLA ^
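To make the two tables easier to compare, here is a small sketch that computes the token-generation speedup (tg t/s with FlashQLA divided by tg t/s without) for each depth and concurrency. The numbers are copied from the tables above; the dictionary layout and the `speedups` helper are just illustrative names.

```python
# tg t/s (token generation throughput) copied from the two tables above,
# keyed by (depth, concurrency).
with_fqla = {
    (0, 1): 23.4, (0, 2): 36.7, (0, 4): 55.1,
    (4096, 1): 22.3, (4096, 2): 33.2, (4096, 4): 59.1,
    (8192, 1): 23.9, (8192, 2): 39.5, (8192, 4): 43.0,
}
without_fqla = {
    (0, 1): 21.9, (0, 2): 33.6, (0, 4): 53.5,
    (4096, 1): 19.0, (4096, 2): 26.8, (4096, 4): 45.7,
    (8192, 1): 14.6, (8192, 2): 26.1, (8192, 4): 27.5,
}


def speedups(new, old):
    """Return the new/old ratio per test case, rounded to 2 decimals."""
    return {key: round(new[key] / old[key], 2) for key in new}


tg_speedup = speedups(with_fqla, without_fqla)

for (depth, conc), ratio in sorted(tg_speedup.items()):
    print(f"d{depth:<5} c{conc}: {ratio:.2f}x")
```

On these runs the tg ratio grows with depth, from roughly 1.03x to 1.09x at d0 up to roughly 1.5x to 1.6x at d8192. The pp t/s column is mixed (faster in some rows, slower in others), so this sketch only looks at tg t/s.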
Noice, does this work with dflash?
Yup, as I understand it, there shouldn’t really be any compatibility issues.
I get an error
=== Applying FlashQLA mod ===
[flashqla] installing flash_qla and TileLang dep
Successfully uninstalled tilelang-0.1.9
ERROR: pip’s dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.20.1rc1.dev91+g841ddd54f.d20260429.cu132 requires tilelang==0.1.9, but you have tilelang 0.1.8 which is incompatible.
Successfully installed tilelang-0.1.8
WARNING: Running pip as the ‘root’ user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead:
ERROR: Directory ‘.’ is not installable. Neither ‘setup.py’ nor ‘pyproject.toml’ found.
ERROR: pattern ‘init_block’ expected 1 match, found 0 in gdn_linear_attn.py
After I changed tilelang to 0.1.9 in the mod files:
=== Applying FlashQLA mod ===
[FlashQLA] bypassing arch gate; tilelang reports compute=12.1
[flashqla] flash_qla already installed; skipping pip
ERROR: pattern 'init_block' expected 1 match, found 0 in gdn_linear_attn.py
Error: Patch script failed on 169.254.42.28
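The `expected 1 match, found 0` message suggests the patch script insists that each anchor pattern appears exactly once in `gdn_linear_attn.py` and aborts otherwise, which is what you would see if the installed vLLM source differs from the version the mod was written against. A minimal sketch of that kind of check follows; `apply_unique_patch` and the sample source strings are hypothetical and are not the mod's actual code.

```python
import re


def apply_unique_patch(source: str, name: str, pattern: str, replacement: str) -> str:
    """Replace `pattern` in `source`, but only if it matches exactly once."""
    found = len(re.findall(pattern, source))
    if found != 1:
        # Refuse to patch when the anchor is missing or ambiguous.
        raise ValueError(f"pattern '{name}' expected 1 match, found {found}")
    # A callable replacement keeps the replacement text literal.
    return re.sub(pattern, lambda _m: replacement, source, count=1)


# Hypothetical file contents: one that has the anchor, one that does not.
old_source = "def init_block(self):\n    pass\n"
new_source = "def setup_block(self):\n    pass\n"

patched = apply_unique_patch(
    old_source, "init_block", r"def init_block\(", "def init_block_patched("
)

try:
    apply_unique_patch(
        new_source, "init_block", r"def init_block\(", "def init_block_patched("
    )
    message = ""
except ValueError as err:
    message = str(err)

print(message)
```

Failing loudly like this is usually preferable to silently patching the wrong place, but it also means any upstream refactor of the target file breaks the mod until the patterns are updated.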
Which model are you trying to use it with? Could you share your recipe?
Qwen/Qwen3.6-27B-FP8
name: Qwen3.6-27B
description: "vLLM serving Qwen3.6-27B in FP8 with MTP speculative decoding, 262K context, tool calling"
model: Qwen/Qwen3.6-27B-FP8
cluster_only: true
container: vllm-node
mods:
  - mods/vllm-tune
  - mods/flashqla
build_args:
  - --tf5
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384
  max_num_seqs: 10
env:
  HF_TOKEN: hf_..V
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
command: |
  vllm serve Qwen/Qwen3.6-27B-FP8 \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --enable-prefix-caching \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format fastsafetensors \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    --generation-config auto \
    -tp {tensor_parallel}
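For anyone reading the recipe, the `{...}` placeholders in `command` are presumably filled from the `defaults` section. How the launcher actually does the substitution is not shown in this thread, so the sketch below just uses Python's `str.format` as a stand-in; `render_command` is a hypothetical helper and the template is a shortened copy of the command above.

```python
# Values copied from the `defaults` section of the recipe above.
defaults = {
    "port": 8000,
    "host": "0.0.0.0",
    "tensor_parallel": 2,
    "gpu_memory_utilization": 0.7,
    "max_model_len": 262144,
    "max_num_batched_tokens": 16384,
    "max_num_seqs": 10,
}

# Shortened version of the recipe's command, keeping only templated flags.
command_template = (
    "vllm serve Qwen/Qwen3.6-27B-FP8 "
    "--max-model-len {max_model_len} "
    "--max-num-seqs {max_num_seqs} "
    "--gpu-memory-utilization {gpu_memory_utilization} "
    "--port {port} --host {host} "
    "--max-num-batched-tokens {max_num_batched_tokens} "
    "-tp {tensor_parallel}"
)


def render_command(template, values, overrides=None):
    """Fill the placeholders; `overrides` wins over the recipe defaults."""
    merged = {**values, **(overrides or {})}
    return template.format(**merged)


rendered = render_command(command_template, defaults)
shorter = render_command(command_template, defaults, {"max_model_len": 32768})
print(rendered)
```

Overriding a single value, as in `shorter`, is a quick way to test the mod with a smaller context window before going back to the full 262144.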
You might want to edit that and take your Hugging Face token out. The recipe looks fine; I can try it out later today when I’m home from work and my Spark is free again.
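A quick way to avoid pasting a token by accident is to run the recipe or log through a small redaction step first. This sketch assumes tokens look like `hf_` followed by letters and digits; `redact_hf_tokens` is a hypothetical helper and the sample token is made up.

```python
import re

# Assumption: Hugging Face tokens start with "hf_" followed by alphanumerics.
HF_TOKEN_RE = re.compile(r"hf_[A-Za-z0-9]{8,}")


def redact_hf_tokens(text: str) -> str:
    """Replace anything that looks like an HF token with a placeholder."""
    return HF_TOKEN_RE.sub("hf_<redacted>", text)


# Made-up token, for illustration only.
sample = "env:\n  HF_TOKEN: hf_EXAMPLEexample0123456789\n"
cleaned = redact_hf_tokens(sample)
print(cleaned)
```

A token that has already been posted publicly should still be revoked and regenerated, since editing the message does not undo the exposure.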
Oops, fixed it :)
I also build vLLM with this PR applied:
./build-and-copy.sh -t vllm-node --apply-vllm-pr 40898 -c
You fixed it? Or am I misunderstanding? I am using 40898 as well.
By “fixed it” I meant that I removed the token from the message.
Awesome job! Same results for me. I did something similar to see if we’d get to the same result. On GB10s, I switched from Gemm v1 → v2 in the FlashQLA suite, made a few other modifications, and got pretty bad results.
I can hopefully take a look tonight at what’s going on. Sorry for taking so long; my Spark has been hard at work the past couple of days. Clearly I need a second one :P
Should work nicely for y’all now with the latest push, let me know if you have any other issues!
(Worker_TP0 pid=203) WARNING 05-04 15:53:02 [gdn_linear_attn.py:825] from flash_qla import chunk_gated_delta_rule as _fqla_kernel
(Worker_TP0 pid=203) WARNING 05-04 15:53:02 [gdn_linear_attn.py:825] ModuleNotFoundError: No module named 'flash_qla'
(Worker_TP0 pid=203) INFO 05-04 15:53:03 [monitor.py:81] Initial profiling/warmup run took 35.23 s
(Worker_TP0 pid=203) INFO 05-04 15:53:04 [backends.py:1070] Using cache directory: /root/.cache/vllm/torch_compile_cache/7e0a23c9a8/rank_0_0/eagle_head for vLLM's torch.compile
(Worker_TP0 pid=203) INFO 05-04 15:53:04 [backends.py:1129] Dynamo bytecode transform time: 1.33 s
(Worker_TP0 pid=203) INFO 05-04 15:53:19 [backends.py:392] Compiling a graph for compile range (1, 16384) takes 13.83 s
(Worker_TP0 pid=203) INFO 05-04 15:53:21 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/bbb37983419ec65a634d6891dc7736619beb8514ba5c437b4dc272aa9594c475/rank_0_0/model
(Worker_TP0 pid=203) INFO 05-04 15:53:21 [monitor.py:53] torch.compile took 17.86 s in total
(Worker_TP0 pid=203) INFO 05-04 15:53:22 [monitor.py:81] Initial profiling/warmup run took 0.97 s
(EngineCore pid=156) INFO 05-04 15:53:30 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=203) WARNING 05-04 15:53:34 [kv_cache_utils.py:1152] Add 2 padding layers, may waste at most 4.17% KV cache memory
(Worker_TP0 pid=203) WARNING 05-04 15:53:34 [kv_cache_utils.py:1152] Add 4 padding layers, may waste at most 25.00% KV cache memory
(Worker_TP0 pid=203) INFO 05-04 15:53:34 [gpu_model_runner.py:5970] Profiling CUDA graph memory: PIECEWISE=19 (largest=323), FULL=10 (largest=170)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] Traceback (most recent call last):
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] output = func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 378, in determine_available_memory
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5999, in profile_cudagraph_memory
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] self._warmup_and_capture(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6159, in _warmup_and_capture
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] self._dummy_run(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5544, in _dummy_run
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] outputs = self.model(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return self.runnable(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return forward_call(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 695, in forward
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] hidden_states = self.language_model.model(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 520, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return self.fn(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 495, in forward
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] def forward(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 215, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return self.optimized_call(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "<string>", line 287, in execution_fn
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "<string>", line 5, in __vllm_inlined_submods__1
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return self._op(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 1173, in gdn_attention_core
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] self._forward_core(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 1048, in _forward_core
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ) = self.chunk_gated_delta_rule(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return forward_call(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py", line 136, in forward
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return self._forward_method(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 239, in forward_flashqla
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] return _flashqla_chunk_gated_delta_rule(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 137, in _flashqla_chunk_gated_delta_rule
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] from flash_qla import chunk_gated_delta_rule as _fqla_kernel
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ModuleNotFoundError: No module named 'flash_qla'
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] EngineCore failed to start.
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] return func(*args, **kwargs)
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] super().__init__(
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] return func(*args, **kwargs)
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] return self.collective_rpc("determine_available_memory")
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 403, in collective_rpc
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] return future if non_block else future.result()
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] ^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 90, in result
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] return super().result()
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] ^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] return self.__get_result()
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] raise self._exception
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] response = self.aggregate(self.get_response())
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 390, in get_response
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] raise RuntimeError(
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] RuntimeError: Worker failed with error 'No module named 'flash_qla'', please check the stack trace above for the root cause
Thanks for your update!
I’ve got vLLM running now, but I’m getting a new error.
That means run.sh isn’t running in your environment for some reason. What command are you using to launch?
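In the traceback above, the `flash_qla` import sits inside `_flashqla_chunk_gated_delta_rule`, so a missing package only surfaces once that function is first called during CUDA graph profiling, well after the model has loaded. A quick pre-flight check, run inside the container (and on each node if you use more than one), can catch it earlier. This is only a sketch using the standard library; `missing_modules` is a hypothetical helper, and checking `tilelang` as well is my assumption based on the install log earlier in the thread.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules the FlashQLA mod appears to need, per the logs in this thread.
missing = missing_modules(["flash_qla", "tilelang"])

if missing:
    print("Missing before launch:", ", ".join(missing))
else:
    print("flash_qla and tilelang are importable")
```

`find_spec` only locates the module without importing it, so the check is cheap and does not trigger any kernel compilation.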
full log -
LOG.txt (179.2 KB)
Go ahead and pull again
Thanks, eugr’s setup starts now. With Qwen/Qwen3.6-27B-FP8, dflash=16, tp=2, no-ray:
My results with FlashQLA are no different from those without FlashQLA. :(
You will only really see a difference at long context.