DGX Spark - INT8 AWQ (W8A16) completely broken on DGX Spark (GB10 Blackwell) - anyone got this working?

a.fehlhauer · May 25, 2026, 11:59pm

Hey all,

I’ve been banging my head against this for hours. Running a Qwen3.6-27B AWQ INT8 model (cyankiwi/Qwen3.6-27B-AWQ-BF16-INT8, compressed-tensors format) on a DGX Spark (GB10 Blackwell, SM_120) with vLLM 0.21.0 and it’s completely impossible to get it running.

THE PROBLEM

The only kernel that can handle W8A16 INT8 on vLLM is conch-triton-kernels (v1.3 by Stack AV). Every other kernel rejects it:

Marlin: “Quant type (uint8) not supported, supported types are: [ScalarType.uint4]”
Exllama: “only supports float16 activations”
AllSpark: “Zero points currently not supported”

So conch-triton-kernels is installed, vLLM picks it up (Using ConchLinearKernel for CompressedTensorsWNA16), model loads fine (34.44 GiB), and then it crashes with:

torch.AcceleratorError: CUDA error: an illegal memory access was encountered

Crash location: conch/ops/quantization/gemm.py:164 in mixed_precision_gemm

WHAT I’VE TRIED (everything fails with same error)

–enforce-eager (no torch.compile, no CUDA graphs) → Same crash
–kv-cache-dtype fp8_e4m3 → Same crash
–kv-cache-dtype auto (bf16 KV) → Same crash
CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 → Same crash
TRITON_NUM_STAGES=1 → Same crash
All of the above combined → Same crash
–gpu-memory-utilization from 0.85 to 0.90 → Same crash
–max-num-batched-tokens 80k and 131k → Same crash
Clearing Triton cache → Same crash

MY THEORY

DGX Spark (GB10) uses unified memory (CPU+GPU share the same 128GB). Triton JIT-compiled kernels assume discrete GPU memory (standard CUDA device pointers). On unified memory, the generated PTX uses TMA descriptors / L2 residency hints that don’t work with cudaMallocManaged regions which causes the illegal memory access.

This matches Triton issue #9348 (make_block_ptr + TMA on SM90+) but there’s no fix available. GB10 is SM_120 (Blackwell) so likely same class of issue.

SETUP

Hardware: NVIDIA DGX Spark (GB10 Blackwell, SM_120, 128GB unified memory, aarch64)
Driver: 580.159
CUDA: 12.8
vLLM: 0.21.0
PyTorch: 2.11.0
conch-triton-kernels: 1.3
Model: cyankiwi/Qwen3.6-27B-AWQ-BF16-INT8 (compressed-tensors, W8A16)

WHAT WORKS FINE ON SAME HARDWARE

BF16 unquantized Qwen3.6-27B runs perfectly for 13+ hours
All standard vLLM features (prefix caching, chunked prefill, MTP speculative decoding)

QUESTIONS

Has anyone successfully run W8A16 INT8 quantized models on DGX Spark / GB10 Blackwell (or any unified memory system)?
Is there an alternative to conch-triton-kernels for INT8 W8A16? (Machete? CUTLASS direct?)
Is this a known limitation of unified memory (GB10/GH200) with Triton kernels?
Would INT4 AWQ (Marlin kernel) work fine since Marlin is pure CUDA, no Triton?

Any help appreciated. It feels like INT8 quantization is just completely unsupported on DGX Spark right now, which is wild given this is NVIDIA’s own hardware.

joshua.dale.warner · May 26, 2026, 3:50am

Could you elaborate on the benefits of W8A16/INT8 AWQ related to other quant architectures?

Genuinely puzzled. I have a 3090 and on Ampere, it is valid to run INT8. However, since Hopper, and certainly on Blackwell/GB10, FP8 has essentially replaced INT8. The FP8 pathways are well established and very performant. On GB10 you will see basically a clean halving of memory use and doubling of throughput with FP8. It is fairly straightfoward to make a FP8 quant even if there is not one, but Qwen provided an official Qwen3.6-27B-FP8: Qwen/Qwen3.6-27B-FP8 · Hugging Face. Please consider trying that one.

I am not sure if INT8 development/kernels are seriously considered as development effort appears to have moved to targeting FP8. I doubt this is a hard limitation, but more of a “we have better options and people should use those for 8-bit”.

Meanwhile, regarding your 4th point, you are correct. INT4 is alive and well on GB10. In particular, Intel’s Autoround (W4A16) quants using Marlin and a particular key environment flag are, today, slightly more performant than even NVFP4 on GB10 - though NVFP4 is catching up.

whpthomas · May 27, 2026, 2:58am

You can make you own in about 6 hours on a single GB10

auto-round \
    --model "Qwen/Qwen3.6-27B" \
    --dataset "github-code-clean" \
    --nsamples "256" \
    --iters "400" \
    --seqlen "2048" \
    --device "cuda" \
    --scheme "W8A16" \
    --format "auto_round" \
    --output_dir "./auto-round" \
    --enable_torch_compile

Then run it with the latest spark vllm docker vllm-node-tf5 – its pretty slow but multi-modal and accurate

#!/bin/bash

docker container remove vllm-qwen36-27b
docker run -it --name vllm-qwen36-27b \
    --gpus all --net=host --ipc=host \
    -v ~/auto-round:/auto-round \
    vllm-node-tf5 \
    bash -c -i "vllm serve /auto-round/Qwen3.6-27B-w8g128 \
    --served-model-name qwen/qwen3.6-27b \
    --max-model-len 196608 \
    --gpu-memory-utilization 0.65 \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 16 \
    --dtype bfloat16 \
    --kv-cache-dtype fp8_e4m3 \
    --port 8000 \
    --host 0.0.0.0 \
    --load-format instanttensor \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --speculative-config '{\"method\": \"mtp\", \"num_speculative_tokens\": 3}' \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --chat-template /auto-round/qwen3.6-enhanced.jinja \
    --reasoning-parser qwen3 \
    --generation-config auto \
    --override-generation-config '{\"temperature\": 0.7, \"top_p\": 0.8, \"top_k\": 20, \"presence_penalty\": 0.0, \"repetition_penalty\": 1.0}'"

Topic		Replies	Views
Qwen3.5-397B-A17B + DGX Spark (duo) DGX Spark / GB10 Projects	56	5367	April 13, 2026
Qwen3.6-27B AWQ INT4 on DGX Spark (GB10) — only 1.8-4.9 tok/s decode with 285k token prompt, how to improve? DGX Spark / GB10	4	121	May 27, 2026
Bf16 LoRA Fine-Tuning of Qwen3.5-35B-A3B on DGX Spark — No Quantization Required DGX Spark / GB10 Projects training , ai-model-training	5	857	April 6, 2026
NVFP4 quantization of a 100B-class Llama on 2× DGX Spark — lessons + open questions DGX Spark / GB10 llama	5	326	May 15, 2026
Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB DGX Spark / GB10 Projects	44	10460	April 9, 2026
DGX Spark + Qwen3-Next-80B: Proven Performance, But Missing Clear Path to NIM, TensorRT-LLM & Web UIs DGX Spark / GB10 cuda , nim , llama	16	4407	March 6, 2026
FP4 on DGX Spark — Why It Doesn't Scale Like You'd Expect DGX Spark / GB10	214	6004	March 27, 2026
Some new development work for Qwen3 on the Spark DGX Spark / GB10	5	807	February 3, 2026
Qwen3.5-122B-A10B on single Spark: 15 → 21.5 tok/s with hybrid GPTQ-INT4 + FP8 dense layers (https://github.com/rmstxrx/vllm-hybrid-quant) DGX Spark / GB10 cuda	9	735	March 20, 2026
Qwen3.5-397B-A17B-int4-AutoRound - 4 x db10 node - updated results 37 - 94 tok/s DGX Spark / GB10 clustering , spark	26	1836	April 28, 2026

DGX Spark - INT8 AWQ (W8A16) completely broken on DGX Spark (GB10 Blackwell) - anyone got this working?

Related topics