GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition, compute capability 12.0
OS: Ubuntu 22.04 x86_64
I ran GLM-5.1-FP8 following the "GLM-5 and GLM-5.1 Series Usage" guide in vLLM Recipes. Whether I use Docker or install vLLM from source, this error always occurs:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, use_per_head_quant_scales=False, attn_type=AttentionType.DECODER). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
I want to know whether the RTX PRO 6000 supports GLM-5.1-FP8.
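For reference, here is a minimal sketch of the capability check implied by the rejection reasons above. The supported-capability sets are my assumptions read off the error text (Hopper is 9.0; datacenter Blackwell is 10.0; the RTX PRO 6000 Blackwell is 12.0), not vLLM's actual backend tables:

```python
# Hedged sketch, NOT vLLM's real selection code: map each sparse-MLA
# backend to the compute capabilities the error message suggests it
# accepts, then check a device's capability against them.

SPARSE_MLA_BACKENDS = {
    # Error text: "FlashMLA Sparse is only supported on Hopper and
    # Blackwell devices" -- assumed to mean sm_90 and sm_100, which
    # would exclude the RTX PRO 6000 at sm_120.
    "FLASHMLA_SPARSE": {(9, 0), (10, 0)},
}

def supported_backends(capability):
    """Return the backends whose assumed capability set contains `capability`."""
    return [name for name, caps in SPARSE_MLA_BACKENDS.items()
            if capability in caps]

print(supported_backends((12, 0)))  # RTX PRO 6000 Blackwell -> []
print(supported_backends((9, 0)))   # H100 (Hopper) -> ['FLASHMLA_SPARSE']
```

If this reading is right, every sparse-capable MLA backend rejects sm_120, which would match the "compute capability not supported" entries in the error.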