Qwen introduces FlashQLA - high-performance linear attention kernels built on TileLang

Hello people,
have you seen:

Learn more:

📖 Blog: https://qwen.ai/blog?id=flashqla

💻 Code: https://github.com/QwenLM/FlashQLA

Apparently it is 2x faster than FlashInfer.
Has anybody already tested it with eugr's vLLM Spark container, or will this even be relevant for us?
Thanks.

Cc @Albond maybe something to look into? :)

@eugr maybe interesting for you as well? :D

Requirements: SM90, CUDA 12.8+, PyTorch 2.8+.

GB10s are SM121, sorry y’all.

But we also run SM89 Marlin kernels for some stuff, so is this a hard requirement? :D

The requirements say “SM90 or above”, so it isn’t specific to SM90. In fact, there’s good reason to believe this will be helpful to us: Hopper-era optimizations for FP8 deliver basically identical scaling on GB10.
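
If anyone wants to sanity-check their own box against the minimums quoted in this thread, here is a minimal sketch. The helper names (`parse_version`, `meets_stated_minimums`) are made up for illustration and are not part of FlashQLA. Passing the check only means the stated minimums are met; it says nothing about whether the kernels actually build or run on GB10.

```python
import re

# Minimums as quoted in this thread: SM90 or above, CUDA 12.8+, PyTorch 2.8+.
MIN_SM = (9, 0)
MIN_CUDA = (12, 8)
MIN_TORCH = (2, 8)


def parse_version(text):
    """Return the leading (major, minor) of a version string such as '2.8.0+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_stated_minimums(sm, cuda_version, torch_version):
    """Compare a device and toolchain against the minimums quoted above.

    Hypothetical helper for illustration only, not a FlashQLA API.
    """
    return {
        "sm": tuple(sm) >= MIN_SM,
        "cuda": parse_version(cuda_version) >= MIN_CUDA,
        "torch": parse_version(torch_version) >= MIN_TORCH,
    }


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed, nothing to check.")
    else:
        if torch.cuda.is_available():
            print(meets_stated_minimums(
                torch.cuda.get_device_capability(),  # (major, minor) tuple
                torch.version.cuda or "0.0",          # None on non-CUDA builds
                torch.__version__,
            ))
        else:
            print("No CUDA device visible.")
```

With illustrative inputs, `meets_stated_minimums((12, 1), "13.0", "2.9.0")` passes all three checks, while an SM89 card such as `(8, 9)` fails the `sm` check. The comparison is a plain tuple ordering, which is what "SM90 or above" would imply if it is taken literally.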

Looks interesting, I’ll definitely check it out when I have more time for this.

I wonder how big your todo-list is :D Every other thread I see you mentioned with a request to check something out. Take care of yourself.

lol, pretty big. I’m currently wrapping up some unrelated projects, so can’t be as active with the project and forums as I would like to, but I will have more time in the coming weeks :)

Thanks, bookmarked it. Unfortunately my DGX Spark is tied up with a fine-tuning run for the next week, so I’ll take a proper look once it’s free. I’ll try to review it on my Mac in the meantime, though I’m not sure that’ll be enough to tell whether it’s worth integrating into the DGX Spark setup.

The repo tells a different story than the blog, good to know.