flash3
February 3, 2026, 10:12am
1
DGX Spark GB10 (vLLM 0.15, GLM-4.7-Flash)
580.95.05 (CUDA 13.0): heavy artifacts
580.126.09 (CUDA 13.0): fewer artifacts, appearing later
590.x: n/a (not yet available)
Hi everyone,
I’m seeing a reproducible issue on DGX Spark GB10 (Grace-Blackwell, unified memory) where the artifact rate increases significantly with growing context length, eventually leading to tooling failures (function/tool calls breaking or producing malformed outputs).
Setup
Hardware: DGX Spark GB10 (ARM / Grace-Blackwell, 128 GB unified memory)
Inference: vLLM 0.15
Model: GLM-4.7-Flash
Client / Tooling: Claude Code (tool calls / structured outputs), ~17k-token prefill and growing
Drivers: NVIDIA 580
(590 not yet available for ARM at the time of testing)
Observed behavior
At short to moderate context lengths, outputs are stable.
As the context grows:
Hallucinated tokens and formatting artifacts increase.
Structured outputs (JSON / tool calls) become unstable.
Tooling eventually fails entirely (invalid schemas, truncated or corrupted tool calls).
The degradation appears gradual and context-length dependent, not an immediate failure.
GPU memory is not obviously exhausted; this does not look like a simple OOM issue.
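To put numbers on the drift rather than eyeballing it, one option is to validate every raw tool-call payload the model emits and track the failure rate by context length. A minimal sketch (the helper names and the two-key schema check are my own illustration, not part of the actual setup):

```python
import json

def tool_call_ok(raw: str, required_keys=("name", "arguments")) -> bool:
    """Return True if a raw tool-call string parses as JSON and carries
    the keys a tool call minimally needs; malformed output returns False."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)

def artifact_rate(samples):
    """samples: iterable of (context_len, raw_tool_call) pairs.
    Returns {context_len: fraction of malformed calls}, so degradation
    can be plotted against context length."""
    buckets = {}
    for ctx_len, raw in samples:
        ok, total = buckets.get(ctx_len, (0, 0))
        buckets[ctx_len] = (ok + tool_call_ok(raw), total + 1)
    return {k: 1 - ok / total for k, (ok, total) in buckets.items()}
```

Feeding this the raw (pre-parser) outputs should show the rate climbing with context length if the behavior above is real rather than a client-side parsing quirk.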
What we’ve ruled out
Prompt structure issues (tested minimal and verbose variants).
Client-side parsing errors (validated raw model outputs).
Obvious vLLM misconfiguration (KV cache sizes, batch size, etc.).
An almost identical mirror system (32-core Ryzen with an RTX PRO 6000) works perfectly (differences: x86, no unified memory, driver 590 instead of 580).
Open questions
Is this a known issue with long-context handling on GB10 / unified memory, especially under vLLM?
Could this be related to driver 580 limitations on ARM (e.g. FP8 / FP6 / KV-cache behavior), potentially fixed in 590?
Has anyone observed context-dependent output corruption or tooling instability specifically on Grace-Blackwell?
Are there recommended vLLM settings or workarounds for GB10 (e.g. KV cache layout, paging behavior, precision choices)?
Any pointers, similar experiences, or low-level insights (driver, kernel, vLLM internals) would be very helpful.
Thanks!
You may be suffering from “context poisoning,” where a large amount of context can corrupt your LLM output for various reasons. This is probably not hardware-dependent; have you tested your workload on different hardware?
eugr
February 3, 2026, 6:26pm
3
Or on a different model. Current GLM-4.7-flash implementations suffer from many issues.
This is not hardware specific. Expanding context from 4k-8k to 128k-1M necessitates solutions like RoPE scaling, LongRoPE, YaRN, etc.
Unfortunately these approaches tend to break down over extremely long contexts, and as a result you will see degraded responses, increased hallucinations and other errors.
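For context on why these schemes strain at the far end, here is the basic math: RoPE assigns each head-dimension pair a rotation frequency, and position-interpolation-style extension (the idea that NTK-aware scaling and YaRN refine per-frequency) squeezes positions so a longer context maps into the trained angle range. A hedged sketch of plain RoPE frequencies plus linear interpolation, not GLM-4.7's actual configuration:

```python
def rope_inv_freq(dim: int, base: float = 10000.0):
    """Standard RoPE inverse frequencies: theta_i = base^(-2i/dim)
    for each of the dim//2 rotation pairs."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def interpolated_angle(pos: int, inv_freq: float, scale: float = 1.0):
    """Linear position interpolation: divide the position by `scale`
    so a context `scale`x longer than trained stays inside the trained
    angle range. YaRN instead scales each frequency band differently,
    which preserves high-frequency (local) detail better."""
    return (pos / scale) * inv_freq
```

The trade-off the post describes follows from this: the larger the scale factor, the more positional resolution is compressed away, and quality degrades at the extremes.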
flash3
February 4, 2026, 10:42am
5
Same model, same vLLM: the RTX PRO 6000 works, the DGX Spark drifts. So it could be the model, but in the end the RTX PRO handles it.
And the 580 driver update on the DGX reduced it a bit, so it points directly to the DGX/driver.
eugr
February 4, 2026, 5:48pm
6
Which model quant are you using?
What vllm container?
While sm120/sm121 are pretty much the same except for unified memory, some FP8/FP4 paths that work with sm120 don’t get compiled on sm121, especially if you use official vLLM docker.
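One quick sanity check along these lines is whether the build's TORCH_CUDA_ARCH_LIST even covers the GB10's compute capability (sm121, i.e. 12.1, if I have that right); an arch list targeting only 12.0a would skip sm121-specific kernel compilation. A hypothetical helper, not a vLLM API:

```python
def arch_list_covers(arch_list: str, target_sm: str) -> bool:
    """Check whether a TORCH_CUDA_ARCH_LIST value (e.g. "12.0a;12.1a")
    includes a target compute capability (sm121 -> "12.1").
    Suffixes like 'a' (arch-specific feature builds) are stripped first.
    At runtime, torch.cuda.get_device_capability() gives the target."""
    entries = {e.strip().rstrip("af") for e in arch_list.replace(";", " ").split()}
    return target_sm in entries
```

If the wheel was built with only "12.0a", some FP8/FP4 paths would fall back (or fail) on sm121 even though they work on the sm120 RTX PRO 6000, which would fit the symptom split between the two machines.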
flash3
February 4, 2026, 6:43pm
7
GLM-4.7-Flash:
Source: unsloth/GLM-4.7-Flash-FP8-Dynamic (FP8 dynamically quantized)
Model: 31B MoE (30B-A3B)
vLLM: 0.15 built from source on NVIDIA 26.01 (CUDA 13.1)
vllm:25.12-py3 or vllm:26.01-py3 (as seen below)
Dockerfile.glm:
FROM nvcr.io/nvidia/vllm:26.01-py3
RUN apt-get update -qq && apt-get install -y wget && \
    wget -qO /tmp/cuda-keyring.deb \
        https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-keyring_1.1-1_all.deb && \
    dpkg -i /tmp/cuda-keyring.deb && apt-get update -qq && \
    apt-get install -y libcusparse-dev-13-1 libcusolver-dev-13-1 libcufft-dev-13-1
RUN pip install cmake ninja setuptools-scm
RUN git clone --depth 1 --branch v0.15.0 …vllm-project/vllm.git /tmp/vllm-build
WORKDIR /tmp/vllm-build
RUN python use_existing_torch.py && pip install -r requirements/build.txt
RUN pip uninstall -y vllm
ENV TORCH_CUDA_ARCH_LIST="12.0a"
RUN pip install --no-build-isolation .
RUN pip install --no-deps "transformers>=5.0" "compressed-tensors>=0.13"
COPY patch_transformers.py /tmp/
RUN python3 /tmp/patch_transformers.py
start.glm47_flash:
podman run -d --replace --name vllm-glm47-flash \
    --device nvidia.com/gpu=all \
    -e VLLM_MLA_DISABLE=1 \
    -e VLLM_DISABLED_KERNELS=CutlassFP8ScaledMMLinearKernel \
    localhost/vllm-glm \
    bash -c "python3 /data/tensordata/patch_streaming.py && \
        vllm serve /data/tensordata/GLM-4.7-Flash-FP8 \
            --served-model-name glm-4.7-flash \
            --max-model-len 100000 \
            --kv-cache-dtype fp8 \
            --tool-call-parser glm47 \
            --enable-auto-tool-choice"
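For anyone wanting to reproduce this without Claude Code in the loop: the drift can be exercised against the served endpoint with a plain OpenAI-style chat/completions request that defines one tool and pads the prompt toward the failing length. A minimal payload builder (the `read_file` tool and the crude word-padding are placeholders of mine):

```python
import json

def build_tool_request(model: str, messages, padding_tokens: int = 0):
    """Build an OpenAI-compatible /v1/chat/completions payload with one
    dummy tool. `padding_tokens` prepends filler words to push the
    context toward the length where the drift reportedly starts."""
    if padding_tokens:
        filler = " ".join(["pad"] * padding_tokens)
        messages = [{"role": "user", "content": filler}] + list(messages)
    return {
        "model": model,
        "messages": messages,
        "tools": [{
            "type": "function",
            "function": {
                "name": "read_file",  # placeholder tool definition
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }],
        "tool_choice": "auto",
    }
```

POSTing this at increasing `padding_tokens` and checking whether the returned `tool_calls` stay well-formed would isolate the server side from any client behavior.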
Important notice: I had some problems posting the GitHub repo URI here; it gets replaced on the fly with a promo slogan… magic.
eugr
February 4, 2026, 6:47pm
8
flash3
February 4, 2026, 8:24pm
10
Still drifting. After ~17k tokens or somewhat more, tooling breaks, and after even more tokens there is only garbage. It’s the 30B Flash model, and it is only one DGX.
It could also be your settings for GLM 4.7 Flash. What’s your temp and top_p? temp should be 0.9 and top_p = 0.95 for GLM 4.7 Flash. Anything less and it produces garbage output. Thinking disabled also helps.
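For concreteness, the settings suggested above can be applied as extra fields on the OpenAI-compatible request. A hedged sketch; in particular, routing the thinking switch through `chat_template_kwargs` is an assumption on my part and depends on the vLLM version and chat template in use:

```python
def glm_flash_sampling(payload: dict) -> dict:
    """Overlay the suggested GLM-4.7-Flash sampling settings onto an
    OpenAI-compatible request payload, leaving the input dict untouched."""
    out = dict(payload)
    out.update({
        "temperature": 0.9,   # values below this reportedly degrade output
        "top_p": 0.95,
        # Assumption: disabling thinking via chat_template_kwargs; verify
        # the exact switch for your vLLM build and chat template.
        "chat_template_kwargs": {"enable_thinking": False},
    })
    return out
```

Since the original poster reports the same config working on the RTX PRO 6000, this mainly helps rule sampling out as a variable rather than explain the machine-specific drift.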
flash3
February 4, 2026, 8:43pm
12
Why does it work on the RTX PRO 6000 with the same config?
Not sure. I run 65-131k context on Flash and it works pretty well. Without the temp, top_p, and thinking-disabled tweaks, it was terrible. It has now become one of the most usable models for coding right now.