DeepSeek V4 Flash: Bringing Frontier AI to the Home

j0n · May 17, 2026, 2:13pm

@davwu If I configure the recipe with the maximum context size (1048576 tokens) and tell vLLM to use 0.9 of my RAM (--gpu-memory-utilization=0.9), it seems to run stably and offers just short of 4 concurrent full-length requests:

(APIServer pid=84) INFO 05-17 13:42:09 [model.py:1697] Using max model len 1048576
[snip]
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_model_runner.py:6246] Estimated CUDA graph memory: 0.70 GiB total
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_worker.py:462] Available KV cache memory: 27.93 GiB
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_worker.py:477] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9000 is equivalent to --gpu-memory-utilization=0.8943 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9057. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=157) INFO 05-17 13:46:35 [kv_cache_utils.py:1710] GPU KV cache size: 4,093,302 tokens
(EngineCore pid=157) INFO 05-17 13:46:35 [kv_cache_utils.py:1711] Maximum concurrency for 1,048,576 tokens per request: 3.90x
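
For anyone checking the numbers, here is a minimal sketch of the arithmetic behind the last few log lines. It assumes the reported concurrency is simply the KV cache capacity divided by the per-request context length, and that the suggested 0.9057 is the requested utilization plus the gap to the effective one. Both assumptions match the figures in the log, but I have not confirmed them against the vLLM source. The variable names are mine.

```python
# Figures copied from the log lines above.
kv_cache_tokens = 4_093_302   # "GPU KV cache size"
max_model_len = 1_048_576     # "Using max model len"

# Assumption: concurrency = cache capacity / tokens per full-length request.
max_concurrency = kv_cache_tokens / max_model_len
print(f"{max_concurrency:.2f}x")  # 3.90x

# Assumption: CUDA graph profiling costs the gap between the requested and
# effective utilization, so adding that gap back restores the old KV cache size.
requested_util = 0.9000
effective_util = 0.8943
compensated_util = requested_util + (requested_util - effective_util)
print(f"{compensated_util:.4f}")  # 0.9057
```

So "just short of 4" means the cache holds about 3.9 requests that each fill the whole 1M-token window. Shorter requests would fit in proportionally larger numbers.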

Here they are while running the Inspect Evals tool in this configuration:

It seems to be stable, although it’s non-trivial to actually exercise such a large context!

Let me know if you’d like me to run any other tests. Thanks for the question. I need to update my blog post on a couple of technical details.
