Increasing artifact rate with growing context on DGX Spark (GLM-4.7-Flash)

Driver summary on DGX Spark GB10 (vLLM 0.15, GLM-4.7-Flash):

  • 580.95.05 (CUDA 13.0): heavy artifacts
  • 580.126.09 (CUDA 13.0): fewer artifacts, appearing later
  • 590.x: not yet available

Hi everyone,

I’m seeing a reproducible issue on DGX Spark GB10 (Grace-Blackwell, unified memory) where the artifact rate increases significantly with growing context length, eventually leading to tooling failures (function/tool calls breaking or producing malformed outputs).

Setup

  • Hardware: DGX Spark GB10 (ARM / Grace-Blackwell, 128 GB unified memory)

  • Inference: vLLM 0.15

  • Model: GLM-4.7-Flash

  • Client / Tooling: Claude Code (tool calls / structured outputs), ~17k-token prefill and growing

  • Drivers: NVIDIA 580
    (590 not yet available for ARM at the time of testing)

Observed behavior

  • At short to moderate context lengths, outputs are stable.

  • As the context grows:

    • Hallucinated tokens and formatting artifacts increase.

    • Structured outputs (JSON / tool calls) become unstable.

    • Tooling eventually fails entirely (invalid schemas, truncated or corrupted tool calls).

  • The degradation appears gradual and context-length dependent, not an immediate failure.

  • GPU memory is not obviously exhausted; this does not look like a simple OOM issue.
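One way to quantify the drift is to validate the raw tool-call arguments at each context length and track the failure fraction. A minimal sketch (function names are hypothetical, not tied to any vLLM or Claude Code API):

```python
import json

def tool_call_is_valid(raw: str) -> bool:
    """Does a raw tool-call arguments string parse as a JSON object?"""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict)

def artifact_rate(samples):
    """Fraction of sampled tool-call payloads that failed to parse."""
    if not samples:
        return 0.0
    bad = sum(1 for s in samples if not tool_call_is_valid(s))
    return bad / len(samples)

good = '{"path": "src/main.py", "content": "print(1)"}'
truncated = '{"path": "src/main.py", "cont'  # cut off mid-key, as seen at long context
print(artifact_rate([good, truncated]))      # 0.5
```

Plotting this rate against prefill length across driver versions would make the "gradual, context-length dependent" claim directly comparable between the Spark and the RTX PRO box.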

What we’ve ruled out

  • Prompt structure issues (tested minimal and verbose variants).

  • Client-side parsing errors (validated raw model outputs).

  • Obvious vLLM misconfiguration (KV cache sizes, batch size, etc.).

  • An almost identical mirror system (32-core Ryzen, RTX PRO 6000) works perfectly (differences: x86, no unified memory, driver 590 instead of 580).

Open questions

  1. Is this a known issue with long-context handling on GB10 / unified memory, especially under vLLM?

  2. Could this be related to driver 580 limitations on ARM (e.g. FP8 / FP6 / KV-cache behavior), potentially fixed in 590?

  3. Has anyone observed context-dependent output corruption or tooling instability specifically on Grace-Blackwell?

  4. Are there recommended vLLM settings or workarounds for GB10 (e.g. KV cache layout, paging behavior, precision choices)?

Any pointers, similar experiences, or low-level insights (driver, kernel, vLLM internals) would be very helpful.

Thanks!

You may be suffering from “context poisoning,” where a large amount of context can corrupt your LLM output for various reasons. This is probably not hardware-dependent; have you tested your workload on different hardware?

Or on a different model. Current GLM-4.7-flash implementations suffer from many issues.

This is not hardware specific. Expanding context from 4k–8k to 128k–1M necessitates solutions like RoPE scaling, LongRoPE, YaRN, etc.

Unfortunately these approaches tend to break down over extremely long contexts, and as a result you will see degraded responses, increased hallucinations and other errors.
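For context, these extension schemes are usually declared in the model config. A hypothetical YaRN-style entry as it typically appears in a Hugging Face config (exact field names vary by model family; this is not GLM-4.7-specific):

```python
# Hypothetical illustration of a YaRN context-extension config entry.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # scale factor on the trained window
    "original_max_position_embeddings": 32768,  # window the model was trained on
}

effective_window = int(32768 * rope_scaling["factor"])
print(effective_window)  # 131072 -- quality tends to degrade toward this limit
```

The degradation typically gets worse the further the effective position exceeds `original_max_position_embeddings`, which is consistent with a gradual, context-length-dependent failure.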

Same model, same vLLM: the RTX PRO 6000 works, the DGX Spark drifts. So it could be the model, but ultimately the RTX PRO handles it.

And the 580 driver update on the DGX reduced it a bit, so it points directly at the DGX/driver.

Which model quant are you using?

What vllm container?

While sm120/sm121 are pretty much the same except for unified memory, some FP8/FP4 paths that work with sm120 don’t get compiled for sm121, especially if you use the official vLLM Docker images.

GLM-4.7-Flash:

  • Source: unsloth/GLM-4.7-Flash-FP8-Dynamic (FP8 dynamically quantized)
  • Model: 31B MoE (30B-A3B)
  • vLLM: 0.15 built from source on NVIDIA 26.01 (CUDA 13.1)
  • Container base: vllm:25.12-py3 or vllm:26.01-py3 (as seen below)

Dockerfile.glm:

FROM nvcr.io/nvidia/vllm:26.01-py3

RUN apt-get update -qq && apt-get install -y wget && \
    wget -qO /tmp/cuda-keyring.deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-keyring_1.1-1_all.deb && \
    dpkg -i /tmp/cuda-keyring.deb && apt-get update -qq && \
    apt-get install -y libcusparse-dev-13-1 libcusolver-dev-13-1 libcufft-dev-13-1

RUN pip install cmake ninja setuptools-scm
RUN git clone --depth 1 --branch v0.15.0 …vllm-project/vllm.git /tmp/vllm-build

WORKDIR /tmp/vllm-build
RUN python use_existing_torch.py && pip install -r requirements/build.txt
RUN pip uninstall -y vllm

ENV TORCH_CUDA_ARCH_LIST="12.0a"
RUN pip install --no-build-isolation .

RUN pip install --no-deps "transformers>=5.0" "compressed-tensors>=0.13"
COPY patch_transformers.py /tmp/
RUN python3 /tmp/patch_transformers.py

start.glm47_flash:

podman run -d --replace --name vllm-glm47-flash \
  --device nvidia.com/gpu=all \
  -e VLLM_MLA_DISABLE=1 \
  -e VLLM_DISABLED_KERNELS=CutlassFP8ScaledMMLinearKernel \
  localhost/vllm-glm \
  bash -c "python3 /data/tensordata/patch_streaming.py && \
    vllm serve /data/tensordata/GLM-4.7-Flash-FP8 \
      --served-model-name glm-4.7-flash \
      --max-model-len 100000 \
      --kv-cache-dtype fp8 \
      --tool-call-parser glm47 \
      --enable-auto-tool-choice"

Important notice: I had problems posting the GitHub repo URI here; it gets replaced on the fly with a promo slogan… magic.

Is it on Spark?
If so, Spark arch is 12.1a.
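A quick sanity check for whether a `TORCH_CUDA_ARCH_LIST` setting actually covers the device, in plain string logic (helper names are hypothetical). The Dockerfile above builds with "12.0a", while GB10 reports compute capability 12.1, so arch-specific kernels could silently be missing:

```python
def normalize(arch: str) -> str:
    # "12.0a" -> "12.0", "12.1+PTX" -> "12.1"
    return arch.split("+")[0].strip().rstrip("a")

def covers(arch_list: str, device_cc: str) -> bool:
    """Is the device's compute capability among the requested build archs?"""
    return device_cc in {normalize(a) for a in arch_list.split(";") if a.strip()}

print(covers("12.0a", "12.1"))  # False: sm_121 kernels missing from the build
print(covers("12.1a", "12.1"))  # True
```

On a live system, `torch.cuda.get_device_capability(0)` would give the actual device value to feed into such a check.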

Can you try with our Docker setup instead, and follow the guidance there? GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

I’ll rebuild. Let’s see.


Still drifting. After ~17k tokens or a bit more, tooling breaks, and after even more tokens it’s only garbage. It’s the 30B Flash model on a single DGX.

It could also be your settings for GLM 4.7 Flash. What are your temp and top_p? temp should be 0.9 and top_p should be 0.95 for GLM 4.7 Flash; with anything less it produces garbage output. Disabling thinking also helps.
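Those suggested sampling settings, expressed as an OpenAI-compatible request body. The model name and the thinking-disable knob are assumptions here; the exact chat-template argument for disabling thinking varies by model and server version:

```python
# Sketch of the suggested GLM-4.7-Flash sampling settings.
# "chat_template_kwargs"/"enable_thinking" is an assumption -- check your
# model's chat template for the actual thinking-disable mechanism.
request_body = {
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "..."}],
    "temperature": 0.9,
    "top_p": 0.95,
    "chat_template_kwargs": {"enable_thinking": False},
}
print(request_body["temperature"], request_body["top_p"])  # 0.9 0.95
```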

Why does it work on the RTX PRO 6000 with the same config?

Not sure. I run 65–131k context on Flash and it works pretty well. Without the temp, top_p, and thinking-disabled tweaks, it was terrible. It has become one of the most usable models for coding right now.