Increasing artifact rate with growing context on DGX Spark (GLM-4.7-Flash)

Driver summary on DGX Spark GB10 (vLLM 0.15, GLM-4.7-Flash):

  • 580.95.05 (CUDA 13.0): heavy artifacts
  • 580.126.09 (CUDA 13.0): fewer artifacts, appearing later
  • 590.x: not yet available

Hi everyone,

I’m seeing a reproducible issue on DGX Spark GB10 (Grace-Blackwell, unified memory) where the artifact rate increases significantly with growing context length, eventually leading to tooling failures (function/tool calls breaking or producing malformed outputs).

Setup

  • Hardware: DGX Spark GB10 (ARM / Grace-Blackwell, 128 GB unified memory)

  • Inference: vLLM 0.15

  • Model: GLM-4.7-Flash

  • Client / Tooling: Claude Code (tool calls / structured outputs), ~17k-token prefill and growing

  • Drivers: NVIDIA 580
    (590 not yet available for ARM at the time of testing)

Observed behavior

  • At short to moderate context lengths, outputs are stable.

  • As the context grows:

    • Hallucinated tokens and formatting artifacts increase.

    • Structured outputs (JSON / tool calls) become unstable.

    • Tooling eventually fails entirely (invalid schemas, truncated or corrupted tool calls).

  • The degradation appears gradual and context-length dependent, not an immediate failure.

  • GPU memory is not obviously exhausted; this does not look like a simple OOM issue.
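One way to quantify the drift is to validate the raw tool-call arguments at each context length and track the failure fraction. A minimal sketch (function names are hypothetical, not tied to any vLLM or Claude Code API):

```python
import json

def tool_call_is_valid(raw: str) -> bool:
    """Does a raw tool-call arguments string parse as a JSON object?"""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict)

def artifact_rate(samples):
    """Fraction of sampled tool-call payloads that failed to parse."""
    if not samples:
        return 0.0
    bad = sum(1 for s in samples if not tool_call_is_valid(s))
    return bad / len(samples)

good = '{"path": "src/main.py", "content": "print(1)"}'
truncated = '{"path": "src/main.py", "cont'  # cut off mid-key, as seen at long context
print(artifact_rate([good, truncated]))      # 0.5
```

Plotting this rate against prefill length across driver versions would make the "gradual, context-length dependent" claim directly comparable between the Spark and the RTX PRO box.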

What we’ve ruled out

  • Prompt structure issues (tested minimal and verbose variants).

  • Client-side parsing errors (validated raw model outputs).

  • Obvious vLLM misconfiguration (KV cache sizes, batch size, etc.).

  • An almost identical mirror system (32-core Ryzen, RTX PRO 6000) works perfectly (differences: x86, no unified memory, driver 590 instead of 580).

Open questions

  1. Is this a known issue with long-context handling on GB10 / unified memory, especially under vLLM?

  2. Could this be related to driver 580 limitations on ARM (e.g. FP8 / FP6 / KV-cache behavior), potentially fixed in 590?

  3. Has anyone observed context-dependent output corruption or tooling instability specifically on Grace-Blackwell?

  4. Are there recommended vLLM settings or workarounds for GB10 (e.g. KV cache layout, paging behavior, precision choices)?

Any pointers, similar experiences, or low-level insights (driver, kernel, vLLM internals) would be very helpful.

Thanks!

You may be suffering from “context poisoning,” where a large amount of context can corrupt your LLM output for various reasons. This is probably not hardware-dependent; have you tested your workload on different hardware?

Or on a different model. Current GLM-4.7-flash implementations suffer from many issues.

This is not hardware specific. Expanding context from 4k–8k to 128k–1M necessitates solutions like RoPE scaling, LongRoPE, YaRN, etc.

Unfortunately these approaches tend to break down over extremely long contexts, and as a result you will see degraded responses, increased hallucinations and other errors.
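For context, these extension schemes are usually declared in the model config. A hypothetical YaRN-style entry as it typically appears in a Hugging Face config (exact field names vary by model family; this is not GLM-4.7-specific):

```python
# Hypothetical illustration of a YaRN context-extension config entry.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # scale factor on the trained window
    "original_max_position_embeddings": 32768,  # window the model was trained on
}

effective_window = int(32768 * rope_scaling["factor"])
print(effective_window)  # 131072 -- quality tends to degrade toward this limit
```

The degradation typically gets worse the further the effective position exceeds `original_max_position_embeddings`, which is consistent with a gradual, context-length-dependent failure.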

Same model, same vLLM: the RTX PRO 6000 works, the DGX Spark drifts. So it could be the model, but ultimately the RTX PRO handles it.

And the 580 driver update on the DGX reduced it a bit, so it points directly at the DGX/driver.

Which model quant are you using?

What vllm container?

While sm120/sm121 are pretty much the same except for unified memory, some FP8/FP4 paths that work with sm120 don’t get compiled for sm121, especially if you use the official vLLM Docker images.

GLM-4.7-Flash:

  • Source: unsloth/GLM-4.7-Flash-FP8-Dynamic (FP8 dynamically quantized)
  • Model: 31B MoE (30B-A3B)
  • vLLM: 0.15 built from source on NVIDIA 26.01 (CUDA 13.1)
  • Container base: vllm:25.12-py3 or vllm:26.01-py3 (as seen below)

Dockerfile.glm:

FROM nvcr.io/nvidia/vllm:26.01-py3

RUN apt-get update -qq && apt-get install -y wget && \
    wget -qO /tmp/cuda-keyring.deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-keyring_1.1-1_all.deb && \
    dpkg -i /tmp/cuda-keyring.deb && apt-get update -qq && \
    apt-get install -y libcusparse-dev-13-1 libcusolver-dev-13-1 libcufft-dev-13-1

RUN pip install cmake ninja setuptools-scm
RUN git clone --depth 1 --branch v0.15.0 …vllm-project/vllm.git /tmp/vllm-build

WORKDIR /tmp/vllm-build
RUN python use_existing_torch.py && pip install -r requirements/build.txt
RUN pip uninstall -y vllm

ENV TORCH_CUDA_ARCH_LIST="12.0a"
RUN pip install --no-build-isolation .

RUN pip install --no-deps "transformers>=5.0" "compressed-tensors>=0.13"
COPY patch_transformers.py /tmp/
RUN python3 /tmp/patch_transformers.py

start.glm47_flash:

podman run -d --replace --name vllm-glm47-flash \
  --device nvidia.com/gpu=all \
  -e VLLM_MLA_DISABLE=1 \
  -e VLLM_DISABLED_KERNELS=CutlassFP8ScaledMMLinearKernel \
  localhost/vllm-glm \
  bash -c "python3 /data/tensordata/patch_streaming.py && \
    vllm serve /data/tensordata/GLM-4.7-Flash-FP8 \
      --served-model-name glm-4.7-flash \
      --max-model-len 100000 \
      --kv-cache-dtype fp8 \
      --tool-call-parser glm47 \
      --enable-auto-tool-choice"

Important notice: I had problems posting the GitHub repo URI here; it gets replaced on the fly with a promo slogan… magic.

Is it on Spark?
If so, Spark arch is 12.1a.
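A quick sanity check for whether a `TORCH_CUDA_ARCH_LIST` setting actually covers the device, in plain string logic (helper names are hypothetical). The Dockerfile above builds with "12.0a", while GB10 reports compute capability 12.1, so arch-specific kernels could silently be missing:

```python
def normalize(arch: str) -> str:
    # "12.0a" -> "12.0", "12.1+PTX" -> "12.1"
    return arch.split("+")[0].strip().rstrip("a")

def covers(arch_list: str, device_cc: str) -> bool:
    """Is the device's compute capability among the requested build archs?"""
    return device_cc in {normalize(a) for a in arch_list.split(";") if a.strip()}

print(covers("12.0a", "12.1"))  # False: sm_121 kernels missing from the build
print(covers("12.1a", "12.1"))  # True
```

On a live system, `torch.cuda.get_device_capability(0)` would give the actual device value to feed into such a check.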

Can you try with our Docker setup instead, and follow the guidance there? GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

I’ll rebuild. Let’s see.


Still drifting. After ~17k tokens or a bit more, tooling breaks, and after even more tokens it’s only garbage. It’s the 30B Flash model on a single DGX.

It could also be your settings for GLM 4.7 Flash. What are your temp and top_p? temp should be 0.9 and top_p should be 0.95 for GLM 4.7 Flash; with anything less it produces garbage output. Disabling thinking also helps.
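Those suggested sampling settings, expressed as an OpenAI-compatible request body. The model name and the thinking-disable knob are assumptions here; the exact chat-template argument for disabling thinking varies by model and server version:

```python
# Sketch of the suggested GLM-4.7-Flash sampling settings.
# "chat_template_kwargs"/"enable_thinking" is an assumption -- check your
# model's chat template for the actual thinking-disable mechanism.
request_body = {
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "..."}],
    "temperature": 0.9,
    "top_p": 0.95,
    "chat_template_kwargs": {"enable_thinking": False},
}
print(request_body["temperature"], request_body["top_p"])  # 0.9 0.95
```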

Why does it work on the RTX PRO 6000 with the same config?

Not sure. I run 65–131k context on Flash and it works pretty well. Without the temp, top_p, and thinking-disabled tweaks, it was terrible. It has become one of the most usable models for coding right now.