Building on the amazing work of @eugr / @eugr_nv , @arthurdroz , @jasl and everyone else in this community, I wrote this blog post:
DeepSeek V4 Flash: Bringing Frontier AI to the Home
Hope itโs useful!
Building on the amazing work of @eugr / @eugr_nv , @arthurdroz , @jasl and everyone else in this community, I wrote this blog post:
DeepSeek V4 Flash: Bringing Frontier AI to the Home
Hope itโs useful!
Can you comfortablely get 1M context with TP=2? The recipe you referred to was using 200K max_model_len.
I believe GB10 *2 max context would be around 512K, memory-limited, wish GB10 successor can give us 256GB+ memory, and at least RTX Pro 6000 class DIE size
@davwu The startup log says:
(EngineCore pid=158) INFO 05-17 13:06:07 [kv_cache_utils.py:1710] GPU KV cache size: 820,713 tokens
(EngineCore pid=158) INFO 05-17 13:06:07 [kv_cache_utils.py:1711] Maximum concurrency for 200,000 tokens per request: 4.10x
so this is looking good for a 800K token, four request cache. Iโll try it out. Actually if I increase the RAM usage to 0.9 it looks like it supports the full 1M context. Checking.
@davwu If I configure the recipe with the maximum context size (1048576) and tell vLLM to use 0.9 of my RAM then it seems to run stably, and offers just short of 4 concurrent requests:
(APIServer pid=84) INFO 05-17 13:42:09 [model.py:1697] Using max model len 1048576
[snip]
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_model_runner.py:6246] Estimated CUDA graph memory: 0.70 GiB total
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_worker.py:462] Available KV cache memory: 27.93 GiB
(Worker_TP0_EP0 pid=207) INFO 05-17 13:46:34 [gpu_worker.py:477] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9000 is equivalent to --gpu-memory-utilization=0.8943 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9057. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=157) INFO 05-17 13:46:35 [kv_cache_utils.py:1710] GPU KV cache size: 4,093,302 tokens
(EngineCore pid=157) INFO 05-17 13:46:35 [kv_cache_utils.py:1711] Maximum concurrency for 1,048,576 tokens per request: 3.90x
Here they are while running the Inspect Evals tool in this configuration:
It seems to be stable, although itโs non-trivial to actually exercise such a large context!
Let me know if youโd like me to run any other tests. Thanks for the question, I need to update my blog post on a couple of technical details.
@jasl do you think supporting long contexts will be eventually supported? Is the slow prefill something that can be developed away or is there something about the deepseek architecture that makes that really difficult on spark?
Iโll try my best.
The major challenge is that Spark 5070-grade design and the limited memory bandwidth.
Luckily, profiling metrics show we still have room to optimize, so itโs possible, but we need time to make it true.
I super wish next gen we could have an Apple Ultra class SoC.
My equipment is Macbook Pro as daily console, RTX Pro 6000 * 2 workstation running the DPSK LLM and heavy workloads, and one single Spark running agents and verious small LLM for multiple purpose.
Slight nuance to my response. It works stably up to 3 concurrent requests (66 tps overall, 22 tps per request). When I try 4 concurrent Inspect Evals connections, the performance fluctuates terribly:
We have clearly reached a resource bottleneck. Iโd be interested in any theories @jasl? I checked for ARM CPU and QSFP112 saturation and it doesnโt seem to be those.
Decode is memory bandwidth-limited.
I managed to run DeepSeek V4 Flash locally on a single GX10/Spark using a llama.cpp-based setup rather than vLLM.
My working configuration is based on antirez/ds4 with the DeepSeek-V4-Flash GGUF quantized build.
I run it as a user systemd service exposing an OpenAI-compatible endpoin then connect Zoo.
In my case I also enabled a persistent KV cache directory, because for agentic coding workflows the first big bottleneck is not only raw generation speed but repeated context ingestion.
The model is usable locally, but in my tests the real limitation is context management. With one interactive coding-agent workload it is workable; pushing multiple concurrent long-context requests quickly exposes instability / throughput fluctuation, which matches what you are seeing with 4 concurrent Inspect Evals connections.
My impression so far:
single-user local agentic use: feasible
large context: feasible but needs careful KV/cache/context strategy
multi-request concurrency: becomes the bottleneck quickly
raw GPU utilization alone does not explain everything; the scheduler/KV/context pattern matters a lot
For comparison, I also tested Qwen3.6-35B PrismaQuant on the same machine with vLLM, FlashInfer/FP8 KV/MTP variants. That setup is much easier to serve with vLLM, but DeepSeek V4 Flash required the more custom llama.cpp/antirez path to become practical on one Spark/GX10.
So yes, DeepSeek V4 Flash can run locally on a single machine, but I would currently treat it more as a serious single-user local frontier-AI.
Thanks @marco.palaferri. In response to your and @jaslโs comments Iโve updated the blog to highlight this scaling behavior that weโve seen. I also corrected it to say that both TP and EP are used.
With max_model_len: 512K on dual sparks:
GPU KV cache size: 2,457,221 tokens
Maximum concurrency for 524,288 tokens per request: 4.69x
Benchmark- PP is very slow as others reported.
โญโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โก llama-benchy Throughput Benchmark โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ deepseek-ai/DeepSeek-V4-Flash โ
โ pp=[2048] tg=[128] depth=[0, 4096, 8192] concurrency=[1, 2, 4] runs=3 latency=generation โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โ Complete โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 27/27 0:17:44
llama-benchy 0.3.7
Estimated latency: 200.0 ms
llama-benchy Results
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโณโโโโโโโโโโโโโโโณโโโโโโโโโโโโโโโโโโโโโโโโโโโณโโโโโโโโโโโโโโโโโโโโโโโโโโโณโโโโโโโโโโโโโโโโโโโโโโโโโโโโณโโโโโโโโโโโโโโโโโโโโโโโโโโโโณโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Test โ c โ pp t/s โ tg t/s โ TTFT (ms) โ Total (ms) โ Tokens โ
โกโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฉ
โ pp2048 tg128 @ d0 โ c1 โ 527 โ 35.1 โ 4,087 โ 7,537 โ 2048+128 โ
โ pp2048 tg128 @ d0 โ c2 โ 492 โ 47.9 โ 8,319 โ 12,973 โ 2048+128 โ
โ pp2048 tg128 @ d0 โ c4 โ 451 โ 37.6 โ 16,508 โ 28,087 โ 2048+128 โ
โ pp2048 tg128 @ d4096 โ c1 โ 425 โ 33.6 โ 14,645 โ 18,255 โ 2048+128 โ
โ pp2048 tg128 @ d4096 โ c2 โ 409 โ 16.6 โ 24,847 โ 32,129 โ 2048+128 โ
โ pp2048 tg128 @ d4096 โ c4 โ 389 โ 9.4 โ 46,696 โ 62,932 โ 2048+128 โ
โ pp2048 tg128 @ d8192 โ c1 โ 394 โ 32.4 โ 26,219 โ 29,964 โ 2048+128 โ
โ pp2048 tg128 @ d8192 โ c2 โ 368 โ 15.8 โ 50,054 โ 57,140 โ 2048+128 โ
โ pp2048 tg128 @ d8192 โ c4 โ 362 โ 6.3 โ 78,915 โ 105,744 โ 2048+128 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Tweaked recipe I used:
recipe_version: "1"
name: DeepSeek-V4-Flash
description: DeepSeek V4 Flash FP8 on dual DGX Spark TP=2 with PR 41834 SM12x support
model: deepseek-ai/DeepSeek-V4-Flash
container: vllm-node-dsv4
cluster_only: true
build_args:
- --vllm-repo
- https://github.com/jasl/vllm.git
- --vllm-ref
- b1c97ff068b23858a4759394f8d6c858d822c957
- --rebuild-vllm
mods: []
defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 2
pipeline_parallel: 1
gpu_memory_utilization: 0.9
max_model_len: 512K
max_num_batched_tokens: 8K
max_num_seqs: 4
block_size: 256
served_model_name: deepseek-v4-flash
env:
TORCH_CUDA_ARCH_LIST: 12.1a
VLLM_TRITON_MLA_SPARSE: 1
FLASHINFER_DISABLE_VERSION_CHECK: 1
TILELANG_CLEANUP_TEMP_FILES: 1
DG_JIT_USE_NVRTC: 0
DG_JIT_NVCC_COMPILER: /usr/local/cuda/bin/nvcc
DG_JIT_PRINT_COMPILER_COMMAND: 1
NCCL_IB_DISABLE: 0
NCCL_DEBUG: WARN
VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE: 4
OMP_NUM_THREADS: 4
command: |
vllm serve deepseek-ai/DeepSeek-V4-Flash \
--served-model-name {served_model_name} \
--host {host} \
--port {port} \
--trust-remote-code \
--tensor-parallel-size {tensor_parallel} \
--pipeline-parallel-size {pipeline_parallel} \
--kv-cache-dtype fp8 \
--block-size {block_size} \
--enable-prefix-caching \
--max-model-len {max_model_len} \
--max-num-seqs {max_num_seqs} \
--max-num-batched-tokens {max_num_batched_tokens} \
--gpu-memory-utilization {gpu_memory_utilization} \
--distributed-executor-backend mp \
--compilation-config '{{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}}' \
--speculative-config '{{"method":"mtp","num_speculative_tokens":2}}' \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--reasoning-config '{{"reasoning_parser":"deepseek_v4","reasoning_start_str":"<think>","reasoning_end_str":"</think>"}}' \
--default-chat-template-kwargs '{{"thinking":true}}' \
--load-format instanttensor \
--enable-chunked-prefill