DGX Spark - INT8 AWQ (W8A16) completely broken on DGX Spark (GB10 Blackwell) - anyone got this working?

Hey all,

I’ve been banging my head against this for hours. Running a Qwen3.6-27B AWQ INT8 model (cyankiwi/Qwen3.6-27B-AWQ-BF16-INT8, compressed-tensors format) on a DGX Spark (GB10 Blackwell, SM_120) with vLLM 0.21.0 and it’s completely impossible to get it running.

THE PROBLEM

The only kernel that can handle W8A16 INT8 on vLLM is conch-triton-kernels (v1.3 by Stack AV). Every other kernel rejects it:

  • Marlin: “Quant type (uint8) not supported, supported types are: [ScalarType.uint4]”

  • Exllama: “only supports float16 activations”

  • AllSpark: “Zero points currently not supported”

So conch-triton-kernels is installed, vLLM picks it up (Using ConchLinearKernel for CompressedTensorsWNA16), model loads fine (34.44 GiB), and then it crashes with:

torch.AcceleratorError: CUDA error: an illegal memory access was encountered

Crash location: conch/ops/quantization/gemm.py:164 in mixed_precision_gemm

WHAT I’VE TRIED (everything fails with same error)

  • –enforce-eager (no torch.compile, no CUDA graphs) → Same crash

  • –kv-cache-dtype fp8_e4m3 → Same crash

  • –kv-cache-dtype auto (bf16 KV) → Same crash

  • CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 → Same crash

  • TRITON_NUM_STAGES=1 → Same crash

  • All of the above combined → Same crash

  • –gpu-memory-utilization from 0.85 to 0.90 → Same crash

  • –max-num-batched-tokens 80k and 131k → Same crash

  • Clearing Triton cache → Same crash

MY THEORY

DGX Spark (GB10) uses unified memory (CPU+GPU share the same 128GB). Triton JIT-compiled kernels assume discrete GPU memory (standard CUDA device pointers). On unified memory, the generated PTX uses TMA descriptors / L2 residency hints that don’t work with cudaMallocManaged regions which causes the illegal memory access.

This matches Triton issue #9348 (make_block_ptr + TMA on SM90+) but there’s no fix available. GB10 is SM_120 (Blackwell) so likely same class of issue.

SETUP

  • Hardware: NVIDIA DGX Spark (GB10 Blackwell, SM_120, 128GB unified memory, aarch64)

  • Driver: 580.159

  • CUDA: 12.8

  • vLLM: 0.21.0

  • PyTorch: 2.11.0

  • conch-triton-kernels: 1.3

  • Model: cyankiwi/Qwen3.6-27B-AWQ-BF16-INT8 (compressed-tensors, W8A16)

WHAT WORKS FINE ON SAME HARDWARE

  • BF16 unquantized Qwen3.6-27B runs perfectly for 13+ hours

  • All standard vLLM features (prefix caching, chunked prefill, MTP speculative decoding)

QUESTIONS

  1. Has anyone successfully run W8A16 INT8 quantized models on DGX Spark / GB10 Blackwell (or any unified memory system)?

  2. Is there an alternative to conch-triton-kernels for INT8 W8A16? (Machete? CUTLASS direct?)

  3. Is this a known limitation of unified memory (GB10/GH200) with Triton kernels?

  4. Would INT4 AWQ (Marlin kernel) work fine since Marlin is pure CUDA, no Triton?

Any help appreciated. It feels like INT8 quantization is just completely unsupported on DGX Spark right now, which is wild given this is NVIDIA’s own hardware.

Could you elaborate on the benefits of W8A16/INT8 AWQ related to other quant architectures?

Genuinely puzzled. I have a 3090 and on Ampere, it is valid to run INT8. However, since Hopper, and certainly on Blackwell/GB10, FP8 has essentially replaced INT8. The FP8 pathways are well established and very performant. On GB10 you will see basically a clean halving of memory use and doubling of throughput with FP8. It is fairly straightfoward to make a FP8 quant even if there is not one, but Qwen provided an official Qwen3.6-27B-FP8: Qwen/Qwen3.6-27B-FP8 · Hugging Face. Please consider trying that one.

I am not sure if INT8 development/kernels are seriously considered as development effort appears to have moved to targeting FP8. I doubt this is a hard limitation, but more of a “we have better options and people should use those for 8-bit”.

Meanwhile, regarding your 4th point, you are correct. INT4 is alive and well on GB10. In particular, Intel’s Autoround (W4A16) quants using Marlin and a particular key environment flag are, today, slightly more performant than even NVFP4 on GB10 - though NVFP4 is catching up.

You can make you own in about 6 hours on a single GB10

auto-round \
    --model "Qwen/Qwen3.6-27B" \
    --dataset "github-code-clean" \
    --nsamples "256" \
    --iters "400" \
    --seqlen "2048" \
    --device "cuda" \
    --scheme "W8A16" \
    --format "auto_round" \
    --output_dir "./auto-round" \
    --enable_torch_compile

Then run it with the latest spark vllm docker vllm-node-tf5 – its pretty slow but multi-modal and accurate

#!/bin/bash

docker container remove vllm-qwen36-27b
docker run -it --name vllm-qwen36-27b \
    --gpus all --net=host --ipc=host \
    -v ~/auto-round:/auto-round \
    vllm-node-tf5 \
    bash -c -i "vllm serve /auto-round/Qwen3.6-27B-w8g128 \
    --served-model-name qwen/qwen3.6-27b \
    --max-model-len 196608 \
    --gpu-memory-utilization 0.65 \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 16 \
    --dtype bfloat16 \
    --kv-cache-dtype fp8_e4m3 \
    --port 8000 \
    --host 0.0.0.0 \
    --load-format instanttensor \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --speculative-config '{\"method\": \"mtp\", \"num_speculative_tokens\": 3}' \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --chat-template /auto-round/qwen3.6-enhanced.jinja \
    --reasoning-parser qwen3 \
    --generation-config auto \
    --override-generation-config '{\"temperature\": 0.7, \"top_p\": 0.8, \"top_k\": 20, \"presence_penalty\": 0.0, \"repetition_penalty\": 1.0}'"