DeepSeek V4 Flash: Bringing Frontier AI to the Home

j0n · May 17, 2026, 11:18am

Building on the amazing work of @eugr / @eugr_nv , @arthurdroz , @jasl and everyone else in this community, I wrote this blog post:
DeepSeek V4 Flash: Bringing Frontier AI to the Home

Hope it’s useful!

davwu · May 17, 2026, 12:47pm

Can you comfortablely get 1M context with TP=2? The recipe you referred to was using 200K max_model_len.

jasl · May 17, 2026, 1:15pm

I believe GB10 *2 max context would be around 512K, memory-limited, wish GB10 successor can give us 256GB+ memory, and at least RTX Pro 6000 class DIE size

j0n · May 17, 2026, 1:19pm

@davwu The startup log says:

(EngineCore pid=158) INFO 05-17 13:06:07 [kv_cache_utils.py:1710] GPU KV cache size: 820,713 tokens
(EngineCore pid=158) INFO 05-17 13:06:07 [kv_cache_utils.py:1711] Maximum concurrency for 200,000 tokens per request: 4.10x

so this is looking good for a 800K token, four request cache. I’ll try it out. Actually if I increase the RAM usage to 0.9 it looks like it supports the full 1M context. Checking.

j0n · May 17, 2026, 2:13pm

@davwu If I configure the recipe with the maximum context size (1048576) and tell vLLM to use 0.9 of my RAM then it seems to run stably, and offers just short of 4 concurrent requests:

(APIServer pid=84) INFO 05-17 13:42:09 [model.py:1697] Using max model len 1048576
[snip]
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_model_runner.py:6246] Estimated CUDA graph memory: 0.70 GiB total
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_worker.py:462] Available KV cache memory: 27.93 GiB
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_worker.py:477] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9000 is equivalent to --gpu-memory-utilization=0.8943 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9057. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=157) INFO 05-17 13:46:35 [kv_cache_utils.py:1710] GPU KV cache size: 4,093,302 tokens
(EngineCore pid=157) INFO 05-17 13:46:35 [kv_cache_utils.py:1711] Maximum concurrency for 1,048,576 tokens per request: 3.90x

Here they are while running the Inspect Evals tool in this configuration:

It seems to be stable, although it’s non-trivial to actually exercise such a large context!

Let me know if you’d like me to run any other tests. Thanks for the question, I need to update my blog post on a couple of technical details.

p33zy · May 17, 2026, 2:25pm

@jasl do you think supporting long contexts will be eventually supported? Is the slow prefill something that can be developed away or is there something about the deepseek architecture that makes that really difficult on spark?

jasl · May 17, 2026, 2:31pm

I’ll try my best.

The major challenge is that Spark 5070-grade design and the limited memory bandwidth.
Luckily, profiling metrics show we still have room to optimize, so it’s possible, but we need time to make it true.

I super wish next gen we could have an Apple Ultra class SoC.

My equipment is Macbook Pro as daily console, RTX Pro 6000 * 2 workstation running the DPSK LLM and heavy workloads, and one single Spark running agents and verious small LLM for multiple purpose.

j0n · May 17, 2026, 3:04pm

Slight nuance to my response. It works stably up to 3 concurrent requests (66 tps overall, 22 tps per request). When I try 4 concurrent Inspect Evals connections, the performance fluctuates terribly:

We have clearly reached a resource bottleneck. I’d be interested in any theories @jasl? I checked for ARM CPU and QSFP112 saturation and it doesn’t seem to be those.

jasl · May 17, 2026, 3:07pm

Decode is memory bandwidth-limited.

marco.palaferri · May 17, 2026, 3:43pm

I managed to run DeepSeek V4 Flash locally on a single GX10/Spark using a llama.cpp-based setup rather than vLLM.

My working configuration is based on antirez/ds4 with the DeepSeek-V4-Flash GGUF quantized build.

I run it as a user systemd service exposing an OpenAI-compatible endpoin then connect Zoo.

In my case I also enabled a persistent KV cache directory, because for agentic coding workflows the first big bottleneck is not only raw generation speed but repeated context ingestion.

The model is usable locally, but in my tests the real limitation is context management. With one interactive coding-agent workload it is workable; pushing multiple concurrent long-context requests quickly exposes instability / throughput fluctuation, which matches what you are seeing with 4 concurrent Inspect Evals connections.

My impression so far:

single-user local agentic use: feasible

large context: feasible but needs careful KV/cache/context strategy

multi-request concurrency: becomes the bottleneck quickly

raw GPU utilization alone does not explain everything; the scheduler/KV/context pattern matters a lot

For comparison, I also tested Qwen3.6-35B PrismaQuant on the same machine with vLLM, FlashInfer/FP8 KV/MTP variants. That setup is much easier to serve with vLLM, but DeepSeek V4 Flash required the more custom llama.cpp/antirez path to become practical on one Spark/GX10.

So yes, DeepSeek V4 Flash can run locally on a single machine, but I would currently treat it more as a serious single-user local frontier-AI.

j0n · May 17, 2026, 10:14pm

Thanks @marco.palaferri. In response to your and @jasl’s comments I’ve updated the blog to highlight this scaling behavior that we’ve seen. I also corrected it to say that both TP and EP are used.

ivaldez · May 17, 2026, 11:04pm

With max_model_len: 512K on dual sparks:

GPU KV cache size: 2,457,221 tokens

Maximum concurrency for 524,288 tokens per request: 4.69x

Benchmark- PP is very slow as others reported.

╭─────────────────────────────────────────────────────────────────────────────────── ⚡ llama-benchy Throughput Benchmark ───────────────────────────────────────────────────────────────────────────────────╮
│ deepseek-ai/DeepSeek-V4-Flash                                                                                                                                                                              │
│ pp=[2048]  tg=[128]  depth=[0, 4096, 8192]  concurrency=[1, 2, 4]  runs=3  latency=generation                                                                                                              │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  ✓ Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 0:17:44

  llama-benchy 0.3.7
  Estimated latency: 200.0 ms

                                                                                             llama-benchy Results                                                                                             
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test                                              ┃      c       ┃                   pp t/s ┃                   tg t/s ┃                 TTFT (ms) ┃                Total (ms) ┃                    Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0                                 │      c1      │                      527 │                     35.1 │                     4,087 │                     7,537 │                  2048+128 │
│ pp2048 tg128 @ d0                                 │      c2      │                      492 │                     47.9 │                     8,319 │                    12,973 │                  2048+128 │
│ pp2048 tg128 @ d0                                 │      c4      │                      451 │                     37.6 │                    16,508 │                    28,087 │                  2048+128 │
│ pp2048 tg128 @ d4096                              │      c1      │                      425 │                     33.6 │                    14,645 │                    18,255 │                  2048+128 │
│ pp2048 tg128 @ d4096                              │      c2      │                      409 │                     16.6 │                    24,847 │                    32,129 │                  2048+128 │
│ pp2048 tg128 @ d4096                              │      c4      │                      389 │                      9.4 │                    46,696 │                    62,932 │                  2048+128 │
│ pp2048 tg128 @ d8192                              │      c1      │                      394 │                     32.4 │                    26,219 │                    29,964 │                  2048+128 │
│ pp2048 tg128 @ d8192                              │      c2      │                      368 │                     15.8 │                    50,054 │                    57,140 │                  2048+128 │
│ pp2048 tg128 @ d8192                              │      c4      │                      362 │                      6.3 │                    78,915 │                   105,744 │                  2048+128 │
└───────────────────────────────────────────────────┴──────────────┴──────────────────────────┴──────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────┘

Tweaked recipe I used:

recipe_version: "1"
name: DeepSeek-V4-Flash
description: DeepSeek V4 Flash FP8 on dual DGX Spark TP=2 with PR 41834 SM12x support
model: deepseek-ai/DeepSeek-V4-Flash
container: vllm-node-dsv4
cluster_only: true

build_args:
  - --vllm-repo
  - https://github.com/jasl/vllm.git
  - --vllm-ref
  - b1c97ff068b23858a4759394f8d6c858d822c957
  - --rebuild-vllm

mods: []

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  pipeline_parallel: 1
  gpu_memory_utilization: 0.9
  max_model_len: 512K
  max_num_batched_tokens: 8K
  max_num_seqs: 4
  block_size: 256
  served_model_name: deepseek-v4-flash

env:
  TORCH_CUDA_ARCH_LIST: 12.1a
  VLLM_TRITON_MLA_SPARSE: 1
  FLASHINFER_DISABLE_VERSION_CHECK: 1
  TILELANG_CLEANUP_TEMP_FILES: 1
  DG_JIT_USE_NVRTC: 0
  DG_JIT_NVCC_COMPILER: /usr/local/cuda/bin/nvcc
  DG_JIT_PRINT_COMPILER_COMMAND: 1
  NCCL_IB_DISABLE: 0
  NCCL_DEBUG: WARN
  VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE: 4
  OMP_NUM_THREADS: 4

command: |
  vllm serve deepseek-ai/DeepSeek-V4-Flash \
      --served-model-name {served_model_name} \
      --host {host} \
      --port {port} \
      --trust-remote-code \
      --tensor-parallel-size {tensor_parallel} \
      --pipeline-parallel-size {pipeline_parallel} \
      --kv-cache-dtype fp8 \
      --block-size {block_size} \
      --enable-prefix-caching \
      --max-model-len {max_model_len} \
      --max-num-seqs {max_num_seqs} \
      --max-num-batched-tokens {max_num_batched_tokens} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      --distributed-executor-backend mp \
      --compilation-config '{{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}}' \
      --speculative-config '{{"method":"mtp","num_speculative_tokens":2}}' \
      --tokenizer-mode deepseek_v4 \
      --tool-call-parser deepseek_v4 \
      --enable-auto-tool-choice \
      --reasoning-parser deepseek_v4 \
      --reasoning-config '{{"reasoning_parser":"deepseek_v4","reasoning_start_str":"<think>","reasoning_end_str":"</think>"}}' \
      --default-chat-template-kwargs '{{"thinking":true}}' \
      --load-format instanttensor \
      --enable-chunked-prefill

Topic		Replies	Views
Anyone having luck with Deepseek V4 Flash on Dual Sparks? DGX Spark / GB10 deepseek	13	1169	June 4, 2026
Deepseek v4 Flash on 2 Nodes DGX Spark / GB10 Projects deepseek	57	4687	June 8, 2026
DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers DGX Spark / GB10 deepseek	198	11117	June 8, 2026
DeepSeek v4 Flash (Aiden Recipe from Reddit) - 1M token session operational, Cuda 12.1 tailored for DGX Spark GB10 DGX Spark / GB10 deepseek	29	1834	June 8, 2026
Deepseek V4 released DGX Spark / GB10 deepseek	143	15284	May 18, 2026
DeepSeek v4 Flash (IQ2XXS) on a single GB10! DGX Spark / GB10 Projects llm , llama , deepseek	8	2974	June 5, 2026
Fully custom CUDA-native Deepseek 4 Flash optimized for 1x Spark! antirez/ds4 DGX Spark / GB10 Projects gaming , llama , deepseek	65	5352	May 30, 2026
DeepSeekV4-Flash hybrid quant, 1x DGX Spark: antirez's optimized 128 GB MLX recipe ported to vLLM for GB10 DGX Spark / GB10 Projects deepseek	18	1685	May 11, 2026
DeepSeek V4 Flash MXFP4 proof-of-life on a single GB10/GX10 DGX Spark / GB10 cuda , kernel , deepseek	4	1238	May 8, 2026
Does Qwen3.5-35B-A3B on GB10 leave a lot of performance on the table? DGX Spark / GB10 agentic-ai	40	5827	March 16, 2026

DeepSeek V4 Flash: Bringing Frontier AI to the Home

Related topics