Anyone having luck with Deepseek V4 Flash on Dual Sparks?

LuckyChap · May 7, 2026, 10:05pm

I am reading really good things about Deepseek V4 Flash and was wondering:

Anyone had any luck running this at acceptable speeds (DFlash, etc) on dual DGX Sparks, if so, what are the tokens per secon?
How are you feeling about this model versus Qwen 3.6 27B (which many people are raving about)?

raphael.amorim · May 21, 2026, 4:03am

Please check this thread DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers - #51 by kimbona.dy

j0n · May 22, 2026, 12:10am

You can read about my exploits with DeepSeek V4 Flash here.

j0n · May 22, 2026, 12:56am

@LuckyChap I should perhaps add that there is a known issue at the moment with multiple concurrent requests with large-ish contexts which @jasl asked me to file here: [Bug]: Decode slowdown with concurrent large context requests · Issue #8 · jasl/vllm · GitHub

My understanding is that he’s working on improving this as we speak, essentially trying some tricky stuff to balance the scheduling of simultaneous prefill and decode.

However per the exact pinned versions on my blog everything worked very well with just one request running at a time.

Having said that, right now I am seeing

May 22 00:46:55 spark-dbac kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359

when I launch vLLM across the cluster. I am still investigating what changed here; I think perhaps the recent NVIDIA updates I installed may be using more RAM than they were previously?

vr8vr8 · May 22, 2026, 11:07am

I have no issues running already. 48h non stop, full load, with c=4 with 300k context no issues.

billbrock1234 · May 22, 2026, 1:21pm

j0n:

@LuckyChap I should perhaps add that there is a known issue at the moment with multiple concurrent requests with large-ish contexts which @jasl9187 asked me to file here: [Bug]: Decode slowdown with concurrent large context requests · Issue #8 · jasl/vllm · GitHub

My understanding is that he’s working on improving this as we speak, essentially trying some tricky stuff to balance the scheduling of simultaneous prefill and decode.

However per the exact pinned versions on my blog everything worked very well with just one request running at a time.

Having said that, right now I am seeing
May 22 00:46:55 spark-dbac kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
when I launch vLLM across the cluster. I am still investigating what changed here; I think perhaps the recent NVIDIA updates I installed may be using more RAM than they were previously?

We had to lower the max_concurrent_children to 1 because of this issue. Deepseek 4 was giving a NV_ERR_NO_MEMORY error. Thanks for this info. Was not sure what was causing this issue. Hope they ge tit fixed soon.

voktolom · May 22, 2026, 3:27pm

Share your recipe, you have very good results. Also, specify which version of VLLM you are using. Thank you.

vr8vr8 · May 22, 2026, 9:22pm

HuggingFace model to download (optional, for --download-model)

model: deepseek-ai/DeepSeek-V4-Flash

Container image to use

container: vllm-node-dsv4:latest

Can only be run in a cluster

cluster_only: true

Custom vLLM build for DeepSeek V4 SM12x support

build_args:

--vllm-repo
GitHub - jasl/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs · GitHub
--vllm-ref
codex/ds4-sm120-min-enable
--rebuild-vllm

Default settings (can be overridden via CLI)

defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 2
pipeline_parallel: 1
gpu_memory_utilization: 0.88
max_model_len: 300000
max_num_batched_tokens: 16384
max_num_seqs: 8
block_size: 256

Environment variables — keep HF fully offline since the model is bind-mounted

env:
TORCH_CUDA_ARCH_LIST: 12.1a
VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
VLLM_TRITON_MLA_SPARSE: 1
FLASHINFER_DISABLE_VERSION_CHECK: 1
TILELANG_CLEANUP_TEMP_FILES: 1
DG_JIT_USE_NVRTC: 0
DG_JIT_NVCC_COMPILER: /usr/local/cuda/bin/nvcc
DG_JIT_PRINT_COMPILER_COMMAND: 1
NCCL_IB_DISABLE: 0
NCCL_DEBUG: WARN

mounts:

~/models/:/models/

The vLLM serve command template

command: |
vllm serve /models/deepseek-ai/DeepSeek-V4-Flash
–served-model-name deepseek-ai/DeepSeek-V4-Flash
–host {host}
–port {port}
–trust-remote-code
–tensor-parallel-size {tensor_parallel}
–pipeline-parallel-size {pipeline_parallel}
–kv-cache-dtype fp8
–block-size {block_size}
–enable-prefix-caching
–max-model-len {max_model_len}
–max-num-seqs {max_num_seqs}
–max-num-batched-tokens {max_num_batched_tokens}
–gpu-memory-utilization {gpu_memory_utilization}
–distributed-executor-backend mp
–compilation-config ‘{{“cudagraph_mode”:“FULL_AND_PIECEWISE”,“custom_ops”:[“all”]}}’
–tokenizer-mode deepseek_v4
–tool-call-parser deepseek_v4
–enable-auto-tool-choice
–reasoning-parser deepseek_v4
–reasoning-config ‘{{“reasoning_parser”:“deepseek_v4”,“reasoning_start_str”:“”,“reasoning_end_str”:“”}}’
–default-chat-template-kwargs ‘{{“thinking”:true}}’
–enable-expert-parallel
–load-format safetensors

vr8vr8 · May 22, 2026, 9:25pm

Just don’t use MTP as tool calling will be big issue. cant say llm as it’s running some instances now that i don’t want to stop :D but i used original recipe from DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers - #19 by arthurdroz

dashtotherock · May 23, 2026, 4:35am

Thank you so much, I followed your article, and the model runs finally!

j0n · May 23, 2026, 6:41am

With how fast all the repos are moving and all the minefields to avoid, you’ve done well to get it working! Glad the blog post helped!

serapis · May 23, 2026, 7:10am

I run with MTP=2 and tool calling seems fine on my end. 384K context window.

dashtotherock · June 3, 2026, 7:14pm

I got it running for a week, no issue. I have to manually stopped to try some other options.

jordan.mymail · June 4, 2026, 1:02am

I’ve got it up and running. Rock solid so far. I was using Qwen3.5-122B-A10B (iirc) and it was buggy - would crash under load. The one thing I am working through is thinking budget (it likes to think - A LOT). I’m currently working on a --logits-processors plugin to see if I can integrate a thinking budget that it will actually respect.

Topic		Replies	Views
Deepseek v4 Flash on 2 Nodes DGX Spark / GB10 Projects deepseek	71	7058	June 15, 2026
DeepSeek V4 Flash: Bringing Frontier AI to the Home DGX Spark / GB10 deepseek	11	4268	May 17, 2026
DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers DGX Spark / GB10 deepseek	260	22351	July 15, 2026
DeepSeek-V4-Flash on 4× DGX Spark via vLLM (jasl fork, TP=4, RDMA, MTP) — 49–54 tok/s single-stream, full recipe + the traps DGX Spark / GB10 Projects deepseek	3	761	June 19, 2026
[GUIDE] DeepSeek-V4-Flash on 2× DGX Spark (GB10) — Reproducible vLLM Serving Recipe up to 1M Token Context DGX Spark / GB10 Projects deepseek	2	730	June 30, 2026
DeepSeek v4 Flash (Aiden Recipe from Reddit) - 1M token session operational, Cuda 12.1 tailored for DGX Spark GB10 DGX Spark / GB10 deepseek	388	20379	July 20, 2026
Deepseek V4 released DGX Spark / GB10 deepseek	143	17591	May 18, 2026
DeepSeek v4 Flash (IQ2XXS) on a single GB10! DGX Spark / GB10 Projects llm , llama , deepseek	13	4671	July 2, 2026
Official NVidia optimized DeepSeek-V4-Flash models? DGX Spark / GB10 deepseek	28	2087	July 11, 2026
Instructions for running Deepseek-v4-flash with DSpark using Eugr's repo DGX Spark / GB10 Projects deepseek	10	1182	July 16, 2026