Running GLM-5.1-FP8 on RTX PRO 6000

GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (compute capability 12.0)
OS: Ubuntu 22.04 x86_64

I ran GLM-5.1-FP8 following the GLM-5 and GLM-5.1 Series Usage guide from vLLM Recipes. Whether I use Docker or install vLLM from source, this error always occurs:

ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, use_per_head_quant_scales=False, attn_type=AttentionType.DECODER). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
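For what it's worth, the "Reasons" dict can be read as a per-backend gate. Below is a toy sketch of that logic, not vLLM's actual selector code; the rules are transcribed from the rejection reasons in the error text above, and reading "Hopper and Blackwell" as SM 9.x/10.x is my interpretation:

```python
# Toy sketch of the per-backend gating visible in the error's "Reasons" dict.
# NOT vLLM's real selector; rules are transcribed from the error message above.

def mla_backend_check(name, compute_capability, use_sparse):
    """Return (supported, reasons) for a backend on a given device."""
    reasons = []
    if use_sparse and name != "FLASHMLA_SPARSE":
        # Every non-sparse backend is rejected with "sparse not supported".
        reasons.append("sparse not supported")
    if name == "FLASHMLA_SPARSE" and compute_capability[0] not in (9, 10):
        # "FlashMLA Sparse is only supported on Hopper and Blackwell devices"
        # appears to mean SM 9.x (Hopper) and SM 10.x (datacenter Blackwell);
        # SM 12.0 (RTX PRO 6000) falls outside that range.
        reasons.append("compute capability not supported")
    return (not reasons, reasons)

for backend in ("FLASH_ATTN_MLA", "FLASHMLA", "TRITON_MLA", "FLASHMLA_SPARSE"):
    ok, why = mla_backend_check(backend, (12, 0), use_sparse=True)
    print(backend, ok, why)
```

Under this reading, every backend is rejected for an sm_120 device once `use_sparse=True`, which matches the error.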

I want to know whether the RTX PRO 6000 supports GLM-5.1-FP8.

The RTX PRO 6000 should have compute capability 12.0. It is branded Blackwell, but it is the workstation/consumer variant of the architecture rather than the datacenter one, so in terms of supported features it sits closer to consumer Ampere or Ada Lovelace than to the datacenter Blackwell parts.
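You can confirm what the driver actually reports for the card with nvidia-smi (assuming a reasonably recent driver that supports the `compute_cap` query field; the RTX PRO 6000 Blackwell should report 12.0):

```shell
# Print the name and compute capability of each visible GPU.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```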

The error message says FlashMLA Sparse is only supported on Hopper and (datacenter) Blackwell devices, so I would assume it is not compatible.

But it depends on the exact feature needed. Have you asked the GLM authors?

There are a few recent posts reporting the same thing, so this is probably still the case.

No, I think GLM-5 is meant to be a general-purpose model; the H200 runs GLM-5.1-FP8 fine. I don't know how to modify the GLM-5.1-FP8 model itself either, so I want to try adjusting vLLM parameters to see whether the RTX PRO 6000 can run GLM-5 normally.
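If you do want to experiment with vLLM parameters, the attention backend can be pinned via an environment variable. A hedged sketch (the backend name is taken from the error message, the model path is a placeholder, and since every backend was already rejected for this configuration, forcing one will most likely surface the same incompatibility):

```shell
# Force a specific MLA backend; on an sm_120 device this is expected
# to fail with the same capability check as in the error above.
VLLM_ATTENTION_BACKEND=FLASHMLA_SPARSE vllm serve <path-to-GLM-5.1-FP8>
```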

Thanks, I think so.