FlashQLA

Pretty much fully vibe coded by Opus 4.7, so if something doesn’t work 100%, let me know and I will get mad at Claude for you and fix the problem. Hopefully this is actually as big of an improvement as I think it is. I haven’t run many tests with it yet, but I would rather get this out to the public even if it only works 50%.

╭───────────────────────────────── ⚡ llama-benchy Throughput Benchmark ──────────────────────────────────╮
│ /models/qwen3.6-27b-fp8                                                                                 │
│ pp=[2048]  tg=[128]  depth=[0, 4096, 8192]  concurrency=[1, 2, 4]  runs=3  latency=generation           │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  ✓ Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 0:14:08

  llama-benchy 0.3.7
  Estimated latency: 223.1 ms

                                           llama-benchy Results                                            
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Test                     ┃   c   ┃      pp t/s ┃      tg t/s ┃    TTFT (ms) ┃  Total (ms) ┃      Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0        │  c1   │       1,617 │        23.4 │        1,392 │       6,637 │    2048+128 │
│ pp2048 tg128 @ d0        │  c2   │       1,484 │        36.7 │        2,507 │       8,439 │    2048+128 │
│ pp2048 tg128 @ d0        │  c4   │         853 │        55.1 │        8,757 │      16,062 │    2048+128 │
│ pp2048 tg128 @ d4096     │  c1   │       1,186 │        22.3 │        5,004 │      10,529 │    2048+128 │
│ pp2048 tg128 @ d4096     │  c2   │         696 │        33.2 │       16,793 │      22,993 │    2048+128 │
│ pp2048 tg128 @ d4096     │  c4   │         423 │        59.1 │       55,102 │      61,909 │    2048+128 │
│ pp2048 tg128 @ d8192     │  c1   │         487 │        23.9 │       19,065 │      24,189 │    2048+128 │
│ pp2048 tg128 @ d8192     │  c2   │         432 │        39.5 │       46,385 │      51,847 │    2048+128 │
│ pp2048 tg128 @ d8192     │  c4   │         414 │        43.0 │       97,902 │     105,207 │    2048+128 │
└──────────────────────────┴───────┴─────────────┴─────────────┴──────────────┴─────────────┴─────────────┘

With FlashQLA ^

╭───────────────────────────────── ⚡ llama-benchy Throughput Benchmark ──────────────────────────────────╮
│ /models/qwen3.6-27b-fp8                                                                                 │
│ pp=[2048]  tg=[128]  depth=[0, 4096, 8192]  concurrency=[1, 2, 4]  runs=3  latency=generation           │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  ✓ Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 0:14:49

  llama-benchy 0.3.7
  Estimated latency: 220.3 ms

                                           llama-benchy Results                                            
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Test                     ┃   c   ┃      pp t/s ┃      tg t/s ┃    TTFT (ms) ┃  Total (ms) ┃      Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0        │  c1   │       1,529 │        21.9 │        1,455 │       7,079 │    2048+128 │
│ pp2048 tg128 @ d0        │  c2   │         746 │        33.6 │        5,029 │      10,820 │    2048+128 │
│ pp2048 tg128 @ d0        │  c4   │         950 │        53.5 │        8,984 │      16,345 │    2048+128 │
│ pp2048 tg128 @ d4096     │  c1   │       1,136 │        19.0 │        5,251 │      11,778 │    2048+128 │
│ pp2048 tg128 @ d4096     │  c2   │         791 │        26.8 │       14,249 │      22,538 │    2048+128 │
│ pp2048 tg128 @ d4096     │  c4   │         409 │        45.7 │       59,517 │      68,983 │    2048+128 │
│ pp2048 tg128 @ d8192     │  c1   │         659 │        14.6 │       14,517 │      23,047 │    2048+128 │
│ pp2048 tg128 @ d8192     │  c2   │         583 │        26.1 │       35,676 │      44,573 │    2048+128 │
│ pp2048 tg128 @ d8192     │  c4   │         362 │        27.5 │      102,900 │     115,563 │    2048+128 │
└──────────────────────────┴───────┴─────────────┴─────────────┴──────────────┴─────────────┴─────────────┘

Without FlashQLA ^
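To make the comparison easier to read, here is a minimal sketch that computes the relative change in decode throughput (tg t/s) between the two runs. The numbers are copied from the two tables above; the dictionary and function names are made up for illustration.

```python
# tg t/s copied from the two llama-benchy tables, keyed by (depth, concurrency).
with_fqla = {
    (0, 1): 23.4, (0, 2): 36.7, (0, 4): 55.1,
    (4096, 1): 22.3, (4096, 2): 33.2, (4096, 4): 59.1,
    (8192, 1): 23.9, (8192, 2): 39.5, (8192, 4): 43.0,
}
without_fqla = {
    (0, 1): 21.9, (0, 2): 33.6, (0, 4): 53.5,
    (4096, 1): 19.0, (4096, 2): 26.8, (4096, 4): 45.7,
    (8192, 1): 14.6, (8192, 2): 26.1, (8192, 4): 27.5,
}


def percent_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


def tg_gains() -> dict:
    """Decode-throughput gain of the FlashQLA run over the baseline, per row."""
    return {
        key: round(percent_change(with_fqla[key], without_fqla[key]), 1)
        for key in with_fqla
    }


for (depth, conc), gain in tg_gains().items():
    print(f"d{depth:<5} c{conc}: {gain:+.1f}% tg t/s")
```

On these numbers the decode gain is in the single digits at depth 0 and above 50% at depth 8192. The prefill (pp t/s) column is mixed across rows, so this sketch deliberately covers decode only, and three runs per cell is a small sample.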

Noice, does this work with dflash?

Yup, shouldn’t really have any compatibility issues the way I understand it.

I get an error

=== Applying FlashQLA mod ===
[flashqla] installing flash_qla and TileLang dep
Successfully uninstalled tilelang-0.1.9
ERROR: pip’s dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.20.1rc1.dev91+g841ddd54f.d20260429.cu132 requires tilelang==0.1.9, but you have tilelang 0.1.8 which is incompatible.
Successfully installed tilelang-0.1.8
WARNING: Running pip as the ‘root’ user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: 
ERROR: Directory ‘.’ is not installable. Neither ‘setup.py’ nor ‘pyproject.toml’ found.
ERROR: pattern ‘init_block’ expected 1 match, found 0 in gdn_linear_attn.py
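The pip output above shows the installer downgrading tilelang to 0.1.8 while this vLLM build pins tilelang==0.1.9. A quick way to see what is actually installed inside the container is to ask `importlib.metadata`. This is only a sketch: the helper names are hypothetical, and the pin is the one quoted in the log.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def pin_satisfied(package: str, required: str, lookup=installed_version) -> bool:
    """True if the installed version equals the exact pin (e.g. tilelang==0.1.9)."""
    return lookup(package) == required


if __name__ == "__main__":
    # The pin below is taken from the pip error message in the log.
    print("tilelang installed:", installed_version("tilelang"))
    print("matches vLLM pin 0.1.9:", pin_satisfied("tilelang", "0.1.9"))
```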

After updating the files to tilelang-0.1.9:

=== Applying FlashQLA mod ===
[FlashQLA] bypassing arch gate; tilelang reports compute=12.1
[flashqla] flash_qla already installed; skipping pip
ERROR: pattern 'init_block' expected 1 match, found 0 in gdn_linear_attn.py
Error: Patch script failed on 169.254.42.28
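The "expected 1 match, found 0" message reads like a guarded text patch: the script looks for an anchor it labels `init_block` in gdn_linear_attn.py and refuses to continue unless it finds it exactly once, which can happen when the file differs from the version the patch was written against. The mod's real script is not shown in this thread, so the following is only a hypothetical sketch of that kind of guard; the function name and the regex are invented.

```python
import re


def apply_exactly_once(source: str, pattern: str, replacement: str, label: str) -> str:
    """Replace `pattern` in `source`, but only if it matches exactly once."""
    count = len(list(re.finditer(pattern, source)))
    if count != 1:
        # Failing loudly beats silently patching the wrong place (or nothing).
        raise ValueError(f"pattern '{label}' expected 1 match, found {count}")
    # A callable replacement avoids backslash-escape surprises in the new text.
    return re.sub(pattern, lambda _m: replacement, source, count=1)


if __name__ == "__main__":
    original = "def init_block():\n    pass\n"
    patched = apply_exactly_once(
        original, r"def init_block\(\):", "def init_block():  # patched", "init_block"
    )
    print(patched)
```

If the target file changes upstream, the anchor no longer matches and the guard stops the patch, which is consistent with the error above.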

Which model are you trying to use it with? Could you share your recipe?

Qwen/Qwen3.6-27B-FP8
name: Qwen3.6-27B
description: "vLLM serving Qwen3.6-27B in FP8 with MTP speculative decoding, 262K context, tool calling"
model: Qwen/Qwen3.6-27B-FP8

cluster_only: true

container: vllm-node

mods:
  - mods/vllm-tune
  - mods/flashqla

build_args:
  - --tf5

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384
  max_num_seqs: 10

env:
  HF_TOKEN: hf_..V
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

command: |
  vllm serve Qwen/Qwen3.6-27B-FP8 \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --enable-prefix-caching \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format fastsafetensors \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    --generation-config auto \
    -tp {tensor_parallel}

Might want to edit that and take your Hugging Face token out. It looks fine; I can try it out later today when I’m home from work and my Spark is free again.

Oops, fixed it :)

I also build vLLM with this PR applied:

./build-and-copy.sh -t vllm-node --apply-vllm-pr 40898 -c

You fixed it? Or am I misunderstanding? I am using 40898 as well.

“Fixed it” = removed the token from the message.

Awesome job! Same results for me. I did something similar to see if we’d get to the same result. On GB10s, I switched from GEMM v1 → v2 in the FlashQLA suite, made a few other modifications, and got pretty bad results.

I can hopefully take a look tonight at what’s going on. Sorry for taking so long; my Spark has been hard at work the past couple of days. Clearly I need a second one :P

Should work nicely for y’all now with the latest push, let me know if you have any other issues!

(Worker_TP0 pid=203) WARNING 05-04 15:53:02 [gdn_linear_attn.py:825]     from flash_qla import chunk_gated_delta_rule as _fqla_kernel
(Worker_TP0 pid=203) WARNING 05-04 15:53:02 [gdn_linear_attn.py:825] ModuleNotFoundError: No module named 'flash_qla'
(Worker_TP0 pid=203) INFO 05-04 15:53:03 [monitor.py:81] Initial profiling/warmup run took 35.23 s
(Worker_TP0 pid=203) INFO 05-04 15:53:04 [backends.py:1070] Using cache directory: /root/.cache/vllm/torch_compile_cache/7e0a23c9a8/rank_0_0/eagle_head for vLLM's torch.compile
(Worker_TP0 pid=203) INFO 05-04 15:53:04 [backends.py:1129] Dynamo bytecode transform time: 1.33 s
(Worker_TP0 pid=203) INFO 05-04 15:53:19 [backends.py:392] Compiling a graph for compile range (1, 16384) takes 13.83 s
(Worker_TP0 pid=203) INFO 05-04 15:53:21 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/bbb37983419ec65a634d6891dc7736619beb8514ba5c437b4dc272aa9594c475/rank_0_0/model
(Worker_TP0 pid=203) INFO 05-04 15:53:21 [monitor.py:53] torch.compile took 17.86 s in total
(Worker_TP0 pid=203) INFO 05-04 15:53:22 [monitor.py:81] Initial profiling/warmup run took 0.97 s
(EngineCore pid=156) INFO 05-04 15:53:30 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=203) WARNING 05-04 15:53:34 [kv_cache_utils.py:1152] Add 2 padding layers, may waste at most 4.17% KV cache memory
(Worker_TP0 pid=203) WARNING 05-04 15:53:34 [kv_cache_utils.py:1152] Add 4 padding layers, may waste at most 25.00% KV cache memory
(Worker_TP0 pid=203) INFO 05-04 15:53:34 [gpu_model_runner.py:5970] Profiling CUDA graph memory: PIECEWISE=19 (largest=323), FULL=10 (largest=170)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] Traceback (most recent call last):
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     output = func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 378, in determine_available_memory
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5999, in profile_cudagraph_memory
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     self._warmup_and_capture(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6159, in _warmup_and_capture
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     self._dummy_run(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5544, in _dummy_run
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     outputs = self.model(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]               ^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self.runnable(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 695, in forward
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     hidden_states = self.language_model.model(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 520, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self.fn(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 495, in forward
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     def forward(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 215, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self.optimized_call(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "<string>", line 287, in execution_fn
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "<string>", line 5, in __vllm_inlined_submods__1
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self._op(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 1173, in gdn_attention_core
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     self._forward_core(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 1048, in _forward_core
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     ) = self.chunk_gated_delta_rule(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py", line 136, in forward
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self._forward_method(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 239, in forward_flashqla
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return _flashqla_chunk_gated_delta_rule(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 137, in _flashqla_chunk_gated_delta_rule
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     from flash_qla import chunk_gated_delta_rule as _fqla_kernel
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ModuleNotFoundError: No module named 'flash_qla'
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] Traceback (most recent call last):
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     output = func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 378, in determine_available_memory
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5999, in profile_cudagraph_memory
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     self._warmup_and_capture(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6159, in _warmup_and_capture
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     self._dummy_run(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5544, in _dummy_run
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     outputs = self.model(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]               ^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self.runnable(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 695, in forward
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     hidden_states = self.language_model.model(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 520, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self.fn(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 495, in forward
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     def forward(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 215, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self.optimized_call(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "<string>", line 287, in execution_fn
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "<string>", line 5, in __vllm_inlined_submods__1
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1269, in __call__
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self._op(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 1173, in gdn_attention_core
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     self._forward_core(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 1048, in _forward_core
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     ) = self.chunk_gated_delta_rule(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py", line 136, in forward
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return self._forward_method(*args, **kwargs)
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 239, in forward_flashqla
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     return _flashqla_chunk_gated_delta_rule(
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/gdn_linear_attn.py", line 137, in _flashqla_chunk_gated_delta_rule
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]     from flash_qla import chunk_gated_delta_rule as _fqla_kernel
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962] ModuleNotFoundError: No module named 'flash_qla'
(Worker_TP0 pid=203) ERROR 05-04 15:53:40 [multiproc_executor.py:962]
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] EngineCore failed to start.
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     super().__init__(
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 403, in collective_rpc
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     return future if non_block else future.result()
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]                                     ^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 90, in result
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     return super().result()
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]            ^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     return self.__get_result()
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     raise self._exception
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     response = self.aggregate(self.get_response())
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 390, in get_response
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136]     raise RuntimeError(
(EngineCore pid=156) ERROR 05-04 15:53:40 [core.py:1136] RuntimeError: Worker failed with error 'No module named 'flash_qla'', please check the stack trace above for the root cause

Thanks for your update!
I’ve got vLLM running now, but I’m getting a new error.

That means run.sh isn’t running in your environment for some reason. What command are you using to launch?
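One way to confirm this from inside the container is to check whether the interpreter that vLLM uses can see the module at all. A minimal sketch using only the standard library (the module names come from the log above; the helper name is made up):

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """True if the top-level module `name` is importable by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run this with the same Python that launches `vllm serve`.
    print("interpreter:", sys.executable)
    for mod in ("flash_qla", "tilelang"):
        status = "found" if module_available(mod) else "MISSING"
        print(f"{mod}: {status}")
```

If `flash_qla` shows as missing here, the install step of the mod never ran in that environment, which matches the `ModuleNotFoundError` in the traceback.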

full log -

LOG.txt (179.2 KB)

Go ahead and pull again

Thanks, eugr’s build now starts, but with Qwen/Qwen3.6-27B-FP8, dflash=16, tp=2, no-ray:

My results with FlashQLA are no different from those without FlashQLA. :(

You will only really see a difference at long context.