Running GLM-5.1-FP8 on RTX PRO 6000

GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (compute capability 12.0)
OS: Ubuntu 22.04 x86_64

I ran GLM-5.1-FP8 following the GLM-5 and GLM-5.1 Series Usage guide from vLLM Recipes. Whether I use Docker or install vLLM from source, this error always occurs:

ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, use_per_head_quant_scales=False, attn_type=AttentionType.DECODER). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
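For what it's worth, the "Reasons" dict can be read as a per-backend gate. Below is a toy sketch of that logic, not vLLM's actual selector code; the rules are transcribed from the rejection reasons in the error text above, and reading "Hopper and Blackwell" as SM 9.x/10.x is my interpretation:

```python
# Toy sketch of the per-backend gating visible in the error's "Reasons" dict.
# NOT vLLM's real selector; rules are transcribed from the error message above.

def mla_backend_check(name, compute_capability, use_sparse):
    """Return (supported, reasons) for a backend on a given device."""
    reasons = []
    if use_sparse and name != "FLASHMLA_SPARSE":
        # Every non-sparse backend is rejected with "sparse not supported".
        reasons.append("sparse not supported")
    if name == "FLASHMLA_SPARSE" and compute_capability[0] not in (9, 10):
        # "FlashMLA Sparse is only supported on Hopper and Blackwell devices"
        # appears to mean SM 9.x (Hopper) and SM 10.x (datacenter Blackwell);
        # SM 12.0 (RTX PRO 6000) falls outside that range.
        reasons.append("compute capability not supported")
    return (not reasons, reasons)

for backend in ("FLASH_ATTN_MLA", "FLASHMLA", "TRITON_MLA", "FLASHMLA_SPARSE"):
    ok, why = mla_backend_check(backend, (12, 0), use_sparse=True)
    print(backend, ok, why)
```

Under this reading, every backend is rejected for an sm_120 device once `use_sparse=True`, which matches the error.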

I want to know whether the RTX PRO 6000 supports GLM-5.1-FP8.

The RTX PRO 6000 should have compute capability 12.0. It is branded Blackwell, but it is the workstation/consumer variant of the architecture rather than the datacenter one, so in terms of supported features it sits closer to consumer Ampere or Ada Lovelace than to the datacenter Blackwell parts.
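You can confirm what the driver actually reports for the card with nvidia-smi (assuming a reasonably recent driver that supports the `compute_cap` query field; the RTX PRO 6000 Blackwell should report 12.0):

```shell
# Print the name and compute capability of each visible GPU.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```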

The error message says FlashMLA Sparse is only supported on Hopper and (datacenter) Blackwell devices, so I would assume it is not compatible.

But it depends on the exact feature needed. Have you asked the GLM authors?

There are a few recent posts reporting the same thing, so this is probably still the case.

No, I think GLM-5 is meant to be a general-purpose model; the H200 runs GLM-5.1-FP8 fine. I don't know how to modify the GLM-5.1-FP8 model itself either, so I want to try adjusting vLLM parameters to see whether the RTX PRO 6000 can run GLM-5 normally.
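If you do want to experiment with vLLM parameters, the attention backend can be pinned via an environment variable. A hedged sketch (the backend name is taken from the error message, the model path is a placeholder, and since every backend was already rejected for this configuration, forcing one will most likely surface the same incompatibility):

```shell
# Force a specific MLA backend; on an sm_120 device this is expected
# to fail with the same capability check as in the error above.
VLLM_ATTENTION_BACKEND=FLASHMLA_SPARSE vllm serve <path-to-GLM-5.1-FP8>
```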

Thanks, I think so.