DGX Spark: 13 → 49 tok/s with Qwen3.5-35B — Native SM121 Kernel Build Guide

TL;DR: The DGX Spark (GB10, SM121) ships with vLLM builds that lack native Blackwell kernels due to a CMake architecture guard bug. A multi-stage Docker build that compiles SM121 kernels from the v0.17.0rc1 source and injects only the .so files into the stock image takes throughput from 13.3 to 48.6 tok/s — a 3.65x improvement with Qwen3.5-35B-A3B (FP8). No model changes, no driver changes, no hardware mods.


The Problem

The DGX Spark achieves ~13 tok/s with Qwen3.5-35B-A3B using the stock vllm/vllm-openai:cu130-nightly image (v0.17.0rc1, built March 6, 2026). Community members report 50 tok/s on the same hardware. The gap is not hardware — it’s missing native kernels.

vLLM’s CMake build system uses cuda_archs_loose_intersection with a "12.0f" (family) pattern to decide which Blackwell kernels to compile. This pattern is meant to match all SM12x architectures, but the pre-built Docker images don’t compile for SM121 at all — the stock image has zero Blackwell cubins.

Why You Can’t Just Rebuild

We went through 8 build iterations to find the working approach. Here’s what fails and why:

Attempt: pip install --no-build-isolation . with TORCH_CUDA_ARCH_LIST="12.1"

Fails on NVFP4 kernels: ptxas error: Instruction 'cvt with .e2m1x2' not supported on .target 'sm_121'. The SM121 (GB10) lacks the microscaling instructions that SM120 (datacenter Blackwell) has. The CMake guard incorrectly includes SM121 in NVFP4 compilation.

Attempt: Patch the NVFP4 guard and rebuild the full vLLM package

Compiles successfully, but breaks Qwen3.5 model loading. pip install . resolves the entire dependency tree, potentially changing the transformers library version. The stock image has a carefully curated transformers that recognizes qwen3_5_moe — rebuilding loses this.

Attempt: Use community images (hellohal2064)

Crash loop — built for a different model (Qwen3-Next-80B), different CUDA version (13.1 vs 13.0), incompatible entrypoint. Not a drop-in.

The Solution: Multi-Stage .so Injection

The key insight: only the compiled C extensions need to change. The stock image’s Python code, model support, and dependency versions are correct. We just need native Blackwell cubins in the .so files.

Step 1: Get the right source


git clone https://github.com/vllm-project/vllm.git
cd vllm
# Check out the EXACT commit matching your base image
git checkout e68de8adc  # v0.17.0rc1

Verify your base image version: docker run --rm --entrypoint python3 YOUR_IMAGE -c "import vllm; print(vllm.__version__)"
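
Matching the source checkout to the image can be partly automated. The sketch below maps a vllm.__version__ string to a git ref. It assumes setuptools-scm style version strings (nightly builds embed the commit as "+g<short-hash>") and assumes releases and release candidates are tagged "v<version>"; the function name source_ref_for is made up for this example.

```python
import re


def source_ref_for(version: str) -> str:
    """Map a vllm.__version__ string to a git ref to check out.

    Assumption: dev/nightly versions look like
    '0.18.3.dev17+gce884756f.d20260331' (setuptools-scm style), and
    release/rc versions are tagged as 'v<version>' in the repository.
    """
    match = re.search(r"\+g([0-9a-f]{7,40})", version)
    if match:
        return match.group(1)   # dev build: the exact short commit hash
    return f"v{version}"        # release or rc: assumed tag name


print(source_ref_for("0.18.3.dev17+gce884756f.d20260331"))
print(source_ref_for("0.17.0rc1"))
```

Feed it the string printed by the docker command above, then pass the result to git checkout. If the tag does not exist in your clone, fall back to looking up the commit by hand.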

Step 2: Patch NVFP4 for SM121

One line change in CMakeLists.txt. Find the NVFP4 section (~line 651):


# BEFORE:
cuda_archs_loose_intersection(FP4_ARCHS "12.0f" "${CUDA_ARCHS}")
# AFTER:
cuda_archs_loose_intersection(FP4_ARCHS "12.0a" "${CUDA_ARCHS}")

Do the same for both the VERSION_GREATER_EQUAL 13.0 and else() branches. This excludes SM121 from NVFP4 compilation (SM121 hardware doesn’t support it) while keeping all other Blackwell kernels (scaled_mm, MoE, MLA, attention).
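
To make the effect of the suffix change concrete, here is a toy model of the matching behaviour described above: an "f" (family) pattern selects every architecture with the same major version, while an "a" pattern selects only the exact architecture. This is not a port of vLLM's CMake logic, only an illustration of the rule as this post states it; parse_arch and guard_selects are hypothetical names.

```python
def parse_arch(spec: str) -> tuple[int, int, str]:
    """Split '12.0f' into (12, 0, 'f'); a bare '12.1' has an empty suffix."""
    suffix = spec[-1] if spec[-1] in "af" else ""
    major, minor = spec.rstrip("af").split(".")
    return int(major), int(minor), suffix


def guard_selects(pattern: str, target: str) -> bool:
    """Toy model: does a guard pattern enable kernels for a build target?"""
    p_major, p_minor, p_suffix = parse_arch(pattern)
    t_major, t_minor, _ = parse_arch(target)
    if p_suffix == "f":
        # Family pattern: any architecture with the same major version.
        return p_major == t_major
    # 'a' suffix or bare version: exact architecture only.
    return (p_major, p_minor) == (t_major, t_minor)


for pattern in ("12.0f", "12.0a"):
    print(pattern, "selects 12.1:", guard_selects(pattern, "12.1"))
```

Under this model the original "12.0f" guard pulls a 12.1 build into NVFP4 compilation, and the patched "12.0a" guard leaves it out.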

If your vLLM version also has cmake/external_projects/qutlass.cmake, apply the same change there.
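
If you prefer to script the patch rather than edit by hand, something like the following could work. It is a minimal sketch that assumes the guard lines look like the BEFORE example above (possibly with other architectures in the same quoted list); it only touches lines that set FP4_ARCHS. The helper name patch_fp4_guard is hypothetical, and you should check the diff before building.

```python
import re

# Match '12.0f' inside the first quoted list of an FP4_ARCHS guard line.
FP4_LINE = re.compile(
    r'^([ \t]*cuda_archs_loose_intersection\(FP4_ARCHS\s+"[^"]*)12\.0f',
    re.MULTILINE,
)


def patch_fp4_guard(cmake_text: str) -> tuple[str, int]:
    """Return (patched_text, number_of_guard_lines_changed)."""
    return FP4_LINE.subn(r"\g<1>12.0a", cmake_text)


# Usage, run from the vllm source root:
#   from pathlib import Path
#   path = Path("CMakeLists.txt")
#   text, changed = patch_fp4_guard(path.read_text())
#   path.write_text(text)
#   print(f"patched {changed} FP4_ARCHS guard line(s)")
```

For the v0.17.0rc1 layout described above you would expect two changed lines, one per branch. A count of zero means the file does not match the assumed pattern and needs a manual look.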

Step 3: Multi-stage Dockerfile (save as Dockerfile.sm121-inject in the vllm source root)


# Stage 1: Compile SM121 kernels
FROM vllm/vllm-openai:cu130-nightly AS builder
RUN apt-get update && apt-get install -y git ninja-build && \
    rm -rf /var/lib/apt/lists/* && pip install "cmake>=3.26"
RUN ln -sf /usr/local/cuda-13.0/targets/sbsa-linux/lib/libnvrtc.so.13 \
    /usr/local/cuda/lib64/libnvrtc.so
COPY . /tmp/vllm-source/
WORKDIR /tmp/vllm-source
ENV TORCH_CUDA_ARCH_LIST="12.1"
ENV MAX_JOBS=4
ENV VLLM_TARGET_DEVICE=cuda
ENV CMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
ENV CUDA_HOME=/usr/local/cuda
ENV CPATH="/usr/local/lib/python3.12/dist-packages/nvidia/cu13/include"
ENV LIBRARY_PATH="/usr/local/lib/python3.12/dist-packages/nvidia/cu13/lib"
SHELL ["/bin/bash", "-c"]
RUN set -o pipefail && pip install --no-build-isolation . 2>&1 | tee /tmp/build.log

# Stage 2: Inject ONLY the .so files into the pristine stock image
FROM vllm/vllm-openai:cu130-nightly
COPY --from=builder /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so \
    /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so
COPY --from=builder /usr/local/lib/python3.12/dist-packages/vllm/_moe_C.abi3.so \
    /usr/local/lib/python3.12/dist-packages/vllm/_moe_C.abi3.so

Step 4: Build (~90 min on ARM64)


docker build -f Dockerfile.sm121-inject -t vllm-custom:sm121-inject .

Step 5: Verify


# Check for Blackwell cubins
docker run --rm --entrypoint bash vllm-custom:sm121-inject -c \
    'cuobjdump -lelf /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so | grep -c sm_120'
# Should show 50+ cubins (ours: 52)

# Verify version matches stock
docker run --rm --entrypoint python3 vllm-custom:sm121-inject -c \
    'import vllm; print(vllm.__version__)'
# Should match your base image version exactly
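
If you want a breakdown per architecture instead of a single grep count, the cuobjdump listing can be tallied with a few lines of Python. This sketch assumes each line of `cuobjdump -lelf` output names one embedded ELF file ending in ".sm_<NNN>.cubin"; the sample file names in the usage note are invented for illustration.

```python
import re
from collections import Counter


def count_cubins(cuobjdump_output: str) -> Counter:
    """Count embedded cubins per SM target in `cuobjdump -lelf` output."""
    counts = Counter()
    for line in cuobjdump_output.splitlines():
        # Pick out the target from names such as 'foo.3.sm_120.cubin'.
        match = re.search(r"(sm_\d+[af]?)\.cubin", line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Usage: pipe the cuobjdump output into a file (or capture it with subprocess) and call count_cubins(text). On the injected image you would expect the sm_120 entry to be in the same range as the grep count above, and on the stock image to be absent.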

Step 6: Launch


docker run -d --name qwen35 --gpus all --ipc host --shm-size 64gb \
    -p 8000:8000 \
    -e VLLM_FLASHINFER_MOE_BACKEND=latency \
    -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
    -v /path/to/huggingface/cache:/root/.cache/huggingface \
    vllm-custom:sm121-inject \
    Qwen/Qwen3.5-35B-A3B \
    --served-model-name qwen3.5-35b \
    --port 8000 --host 0.0.0.0 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.65 \
    --quantization fp8 --kv-cache-dtype fp8
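
Once the server is up, a rough single-request throughput number can be taken against the OpenAI-compatible endpoint. The sketch below is one possible way to do that and is not necessarily how the numbers in this post were measured. It assumes the port and served model name from the launch command above, uses a made-up prompt, and divides the completion token count reported in the response's usage field by wall-clock time, so prefill time is included.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(base_url: str = "http://localhost:8000",
            model: str = "qwen3.5-35b",
            max_tokens: int = 600) -> float:
    """Send one completion request and return end-to-end tok/s."""
    body = json.dumps({
        "model": model,
        # Hypothetical prompt; any prompt that yields a long answer works.
        "prompt": "Explain how a mixture-of-experts layer routes tokens.",
        "max_tokens": max_tokens,
    }).encode()
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    elapsed = time.perf_counter() - start
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)
```

Call measure() a few times after the model has warmed up and compare the values; a dedicated benchmark tool will separate prefill from decode more carefully.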

Results

| Config | Single-Request | Improvement |
|--------|---------------|-------------|
| Stock image (no native kernels) | 13.3 tok/s | baseline |
| SM121-inject (52 Blackwell cubins) | 48.6 tok/s | 3.65x |

Consistent across 3 runs (48.7, 48.6, 48.6). Qwen3.5-35B-A3B, FP8, max_tokens=600.
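
For anyone checking the arithmetic, the speedup and run-to-run spread follow directly from the figures above. The variable names are just for this example.

```python
from statistics import mean

stock_tok_s = 13.3               # stock image, single request
runs_tok_s = [48.7, 48.6, 48.6]  # three runs on the injected image
reported_tok_s = 48.6            # figure used in the table

speedup = reported_tok_s / stock_tok_s
spread = max(runs_tok_s) - min(runs_tok_s)

print(f"speedup: {speedup:.2f}x")
print(f"mean of runs: {mean(runs_tok_s):.1f} tok/s, spread: {spread:.1f} tok/s")
```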

Key Learnings

  1. SM121 ≠ SM120 for microscaling. SM121 (GB10/DGX Spark) lacks the cvt.e2m1x2 instruction that SM120 (datacenter Blackwell) has. NVFP4 kernels will never compile for SM121 — this is a hardware limitation, not a software bug.

  2. Don’t rebuild the full package. pip install . resolves the entire dependency tree and can break model support. The multi-stage .so injection preserves the stock image’s Python code and dependencies while replacing only the compiled kernels.

  3. VLLM_TEST_FORCE_FP8_MARLIN=1 is essential. Without this, vLLM may select a CUTLASS scaled_mm path that produces NaN logits on SM121. Marlin FP8 is stable and fast.

  4. The cubins say sm_120, not sm_121. Despite setting TORCH_CUDA_ARCH_LIST="12.1", the v0.17.0rc1 CMake produces sm_120 cubins. This is fine — SM121 runs sm_120 code natively via forward compatibility. All community builds (hellohal2064, namake-taro, sesmanovic) also target sm_120.

  5. CPATH and LIBRARY_PATH are required. The stock vLLM Docker image uses pip-installed CUDA packages. Headers like cusparse.h are at /usr/local/lib/python3.12/dist-packages/nvidia/cu13/include, not the standard CUDA path.
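
Point 4 relies on CUDA's binary compatibility rule: a cubin built for compute capability X.y runs on devices of capability X.z with z >= y, while architecture-specific targets (the "a" suffix) run only on the exact architecture they were built for. The sketch below encodes that rule. Treating family-specific "f" targets like plain targets within the same major version is a simplifying assumption, and the function names are hypothetical.

```python
def parse_target(target: str) -> tuple[int, int, str]:
    """'sm_120' -> (12, 0, ''); 'sm_121a' -> (12, 1, 'a')."""
    body = target.removeprefix("sm_")
    suffix = body[-1] if body[-1] in "af" else ""
    digits = body.rstrip("af")
    # The last digit is the minor version, the rest is the major version.
    return int(digits[:-1]), int(digits[-1]), suffix


def cubin_runs_on(cubin: str, device: str) -> bool:
    """Can a cubin built for `cubin` execute on a GPU of type `device`?"""
    c_major, c_minor, c_suffix = parse_target(cubin)
    d_major, d_minor, _ = parse_target(device)
    if c_suffix == "a":
        # Architecture-specific code: exact match only.
        return (c_major, c_minor) == (d_major, d_minor)
    # Plain (and, by assumption, family) code: same major, minor not lower.
    return c_major == d_major and d_minor >= c_minor


print(cubin_runs_on("sm_120", "sm_121"))
print(cubin_runs_on("sm_120a", "sm_121"))
```

This is why sm_120 cubins showing up in the cuobjdump listing is expected on a GB10, and why an sm_120a-only kernel would not load there.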

Environment

  • NVIDIA DGX Spark (GB10, SM121, 128GB LPDDR5x unified memory)

  • Driver 580.142, CUDA 13.0

  • Base image: vllm/vllm-openai:cu130-nightly (v0.17.0rc1, March 6 2026)

  • Model: Qwen/Qwen3.5-35B-A3B (FP8 on-the-fly quantization)

What’s Next

  • Testing SM120-targeted build (TORCH_CUDA_ARCH_LIST="12.0", zero patches, all kernels compile) for comparison

  • Concurrent throughput benchmarks (8-18 simultaneous requests)

  • Investigating the marlin_utils_fp8.py runtime check that still warns about missing native FP8 (the kernels are there, the Python check doesn’t know it)

  • Async scheduling and prefix caching on the new build

Happy to answer questions. The full lab notebook with every failed attempt and fix is available if anyone wants the gory details.

But how does this compare to using the community vLLM image for Spark, eugr/spark-vllm-docker on GitHub (Docker configuration for running vLLM on dual DGX Sparks)?

Great question. I haven’t tested eugr’s Docker config directly, but looking at the repo, the key difference is the approach to getting native kernels into the image.

Community images like eugr’s (and hellohal2064’s, which I did test) do a full vLLM rebuild — they compile everything from source and ship a complete replacement image. That works, but we ran into two problems with that path:

  1. Dependency drift: pip install . resolves the entire dependency tree, which can change the transformers version. The stock image has a carefully curated transformers that recognizes qwen3_5_moe. Our full rebuild loaded Qwen3-0.6B instead of Qwen3.5-35B-A3B because of this.

  2. Version coupling: the community image pins to a specific vLLM version. When the base image updates, you need a new community build.

The multi-stage .so injection approach avoids both problems: build the kernel .so files from source, then COPY only those two files (_C.abi3.so + _moe_C.abi3.so) into the pristine stock image. The stock image’s Python code, model support, and dependency versions stay untouched. You get native Blackwell cubins with zero risk to the application layer.

The pattern works with any base image — you just match the source commit to the base image version (docker run --rm --entrypoint python3 IMAGE -c "import vllm; print(vllm.__version__)") and rebuild. When vLLM updates, the same Dockerfile works — just change the base image tag.

Performance-wise, I’d expect similar throughput since both approaches end up with native sm_120 cubins. The difference is in maintainability and risk profile, not raw speed.

Would be curious to hear if anyone has benchmarked eugr’s image on Qwen3.5-35B — if the numbers match (~49 tok/s single-request), that confirms the native kernels are the key variable regardless of how you get them in.

From Spark-Arena, it looks like Qwen3.5-30B-A3B FP8 benches tg128/c1 at 50.75 tok/s in a single setup.

I also run this model for multiple openclaws. I don’t think anything else under 100GB can touch it.

Not quite.

  1. You can build two versions of the image: one with transformers 4.x (the default for vLLM for now) and one with transformers 5.x (build with the --tf5 flag). If you use the recipe launcher or something like sparkrun, it will build the proper image automatically.

  2. The community Docker setup assumes that you build the image yourself, but you don’t have to build from source. The nightly CI/CD builds vLLM and Flashinfer wheels from main* every night and then runs the resulting image(s) through a regression test pipeline, where they are tested on a bunch of models, both solo and in a cluster, and checked for errors or performance regressions. If it passes, the wheels are published on GitHub and can then be used for very fast image rebuilds on the user side. So it’s always the latest tested version from main.

This is the latest performance run with today’s nightly build, BTW (for a single request, zero context):

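| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|-------|------|-----|----------|-----------|--------------|---------------|
| Qwen/Qwen3.5-35B-A3B-FP8 | pp2048 | 4240.97 ± 1660.94 | | 621.23 ± 332.13 | 613.22 ± 332.13 | 621.42 ± 332.18 |
| Qwen/Qwen3.5-35B-A3B-FP8 | tg32 | 52.85 ± 0.04 | 54.57 ± 0.04 | | | |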

llama-benchy (0.3.5)
date: 2026-03-31 19:38:56 | latency mode: api

vLLM version 0.18.3.dev17+gce884756f.d20260331

Running it is as easy as:

git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./run-recipe.sh recipes/qwen3.5-35b-a3b-fp8.yaml --setup  --solo

* - sometimes we include PRs that solve issues but have not been merged yet.

Hello teachers,

I’ve encountered an issue and would appreciate your assistance. I am using spark-vllm-docker to build an image and run the Qwen3.5-35B-A3B model on a single node. I already have the model weights locally.

Below is the command I used to run the service:

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v /data/models/Qwen3.5-35B-A3B:/model" ./launch-cluster.sh --solo exec \
    vllm serve \
    /model \
    --port 8000 --host 0.0.0.0 \
    --gpu-memory-utilization 0.7 \
    --served-model-name Qwen3.5-35B-A3B \
    --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice \
    --reasoning-parser qwen3 \
    --load-format fastsafetensors

However, when I performed a performance test in this environment using the following command:

evalscope perf --parallel 1 --number 8 --url http://127.0.0.1:8000/v1/chat/completions --model Qwen3.5-35B-A3B --log-every-n-query 5 --connect-timeout 6000 --read-timeout 6000 --max-tokens 2048 --min-tokens 2048 --api openai --dataset speed_benchmark

The results seem unusually low to me. Could you please help me verify if this behavior is normal? If not, could you point me in the right direction for troubleshooting?

Speed Benchmark Results:

| Prompt Tokens | Speed (tokens/s) | GPU Memory (GB) |
|---------------|------------------|-----------------|
| 2 | 22.59 | 0.0 |
| 12288 | 20.52 | 0.0 |
| 28672 | 18.23 | 0.0 |
| 61440 | 14.18 | 0.0 |

Try unplugging power cables and re-plugging them. Your spark might be experiencing the throttling bug.

You should really be careful with letting the LLM do too much of the work for you. @eugr has properly answered some key items, but I do want to point out that this write-up contains a number of incorrect statements that are continuously propagated by LLMs (e.g. “SM120 (datacenter Blackwell)”) or otherwise show decisions that are based on strict adherence to an objective without taking a step back to review.

  1. SM120 is not datacenter Blackwell. LLMs seem to think that because they make reasonable assumptions about version progression, but the Blackwell compute versions are all over the place – so NVIDIA has successfully tricked most LLMs.

  2. You started with a month old nightly version. If you’re going to be using a nightly version, it’s best if it’s more recent in this space.

  3. Why are you using on the fly FP8 quantization when you could be using an upstream provided FP8 model? That doesn’t make much sense.

  4. If you’ve already spent 90min to build it, why re-inject the .so files into the stock image? You can just use what you built.

  5. The attempt to use community images (plural) included a single image attempt for a one-off image (same version as you were looking at?) but did not include the prevailing community images.

I think there is value in diving into these things and building for yourself and troubleshooting problems and I really like the community element of everyone sharing. I am just pointing out some of these things because we all read/learn from these things (and so do LLMs!)

As a tip, you should periodically have the LLM take a step back and ask big picture question about what it’s been doing and why. Ask it for a harsh critique of its own work – you can be surprised how harsh it can be towards itself when it doesn’t know who the author was ;-) That won’t solve factual problems (it doesn’t know what it doesn’t know) – but you can refine your LLM driven work quite a bit by challenging it either with your own logic/knowledge or by forcing it to do that to itself.

Good point. I admit, I missed it, as I kinda skimmed over the writeup as it was obviously LLM-generated for the most part.

When it comes to leading-edge stuff like Spark, LLMs (or even Google) cannot be trusted.

Even some Flashinfer maintainers are not immune, there is an LLM-propagated “fact” going around that states that sm120 has 228KB shared memory (like datacenter Blackwell), while sm121 has only 99KB. This is incorrect, as both architectures are pretty much the same (sans unified memory on sm121) and both have the same 99KB shared memory. Which can be easily proven by looking at the source, in this case, CUDA developer documentation.

However, LLMs are happily spreading this misinformation, and I’ve just caught it in one of the pending PRs to flashinfer that separated sm120 and sm121 smem handling based on this “fact”. It’s been corrected since, so all is good.

Hi eugr,

Your repository is fantastic. My DGX has gotten faster, but I’m not able to reach your numbers. In your experience, is there something you think I might have missed or misconfigured before using it?

I’m using the latest main and your instructions.

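| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|-------|------|-----|----------|-----------|--------------|---------------|
| Qwen/Qwen3.5-35B-A3B-FP8 | pp2048 | 2039.66 ± 500.07 | | 1085.45 ± 319.28 | 1082.63 ± 319.28 | 1085.65 ± 319.33 |
| Qwen/Qwen3.5-35B-A3B-FP8 | tg32 | 36.63 ± 0.57 | 37.82 ± 0.60 | | | |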

Thanks in advance.

You may be experiencing a well-known power delivery bug. Shut down your Spark, unplug the USB-C power cable from the unit (and the brick from the wall, just in case), wait a bit, then plug everything back and test again.

Wow… it worked!

Thanks for your quick response. I’m a little embarrassed that I didn’t realize something so simple.

Weeks ago I was struggling with the firmware on my Spark (Asus edition) because it was having power issues, but once that was fixed, I stopped thinking about it.

Thank you so much!

Well, it’s not very obvious unless you’ve experienced it before.