DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

Could you please guide me on how to properly build the image? I’m a bit confused. =(

git clone https://github.com/eugr/spark-vllm-docker
cd spark-vllm-docker
curl -L https://gist.githubusercontent.com/mlow/fc04615043b4cb9938e7be5719aa6aca/raw/d581c5db04d5ac534fe978ed371b289107dae231/deepseek-v4-flash.yaml -o recipes/deepseek-v4-flash.yaml
./run-recipe.py deepseek-v4-flash --no-ray -n <node-a-IP>,<node-b-IP> --setup

You can play with the flags to run-recipe.py to force rebuild the image and such

What did the update to VLLM do today to the recipe if anything ?

Thank you for the website! this is very well documented and detailed. 40t/s is already a good start to consider setup and running. Is this stable to run with continues agentic work now?

yes and no, it’s performance is great now but for some reason it likes to hold onto old context and it fills up the GPU KV cache quite fast with MTP enabled.

I see, good for testing but need to wait more time to address these issues.

a watcher to occasionally restart the server and good context management might hold you over until it’s fixed

There’s definitely something buggy with how the kv cache is kept and prefix cache is sometimes invalidated for this model in vllm, but apart from that I’m really enjoying it.

I rebuilt the image today and for some reason got a massive kv cache boost, up to 4x concurrency now at 300k context from 1.9x before. This model doesn’t suffer from context rot like Minimax M2.7 does and stays effective at 200k context and beyond, I went to about 270k yesterday without issue. It’s also very thorough and builds good plans and executes them well. It will stay as my daily driver for now, but hoping Minimax M3 might replace it next week.

How is it comparing to minimax overall? How does the code quality and most importantly general world knowledge compare to minimax? Really looking for a model to run as my daily on dual sparks that can actually handle AI model training questions and code as well as general questions well.

can you publish image ?

I also noticed model instability and random crashes.

I couldn’t put my finger on what causes these issues given they seem to occur at random.

This is most likely related to the changes in @jasl9187 's PR. Judging by the PR history, there were a lot of edits made yesterday and throughout the week.

We can actually build the image straight from the official community base image by applying just that single PR.

./build-and-copy.sh -t vllm-node-220-1-41834-ds4 --apply-vllm-pr 41834 --rebuild-vllm --cleanup -c

I haven’t had any crashes yet, the only clear issue I have is very occasionally I get a tool call error with output in Opencode like this:

     <|DSML|tool                                                                                                                                                                                                                           
                                                                                                                                                                                                                                             
     _calls>                                                                                                                                                                                                                                 
     <|DSML|invoke name="read">                                                                                                                                                                                                            
     <|DSML|parameter name="offset" string="false">1340</|DSML|parameter>                                                                                                                                                                
     <|DSML|parameter name="filePath" string="true">path to file redacted</|DSML|parameter>                                                                                                
     <|DSML|parameter name="limit" string="false">120</|DSML|parameter>                                                                                                                                                                  
     </|DSML|invoke>                                                                                                                                                                                                                       
     </|DSML|tool_calls>  

It doesn’t happen often enough to be really annoying though.


I run these command step by step with --force-rebuild in last command,and encounter this error. Any clue about what and why? I really want to experience 40t/s,but by now every try failed,which drived me crazy.
I will really appreciate if someone can give me an end-to-end solution to deploy the amazing solutions you are talking about ~

Try doing it this way:

  1. docker builder prune

  2. ./build-and-copy.sh -t vllm-node-220-1-41834-ds4 --apply-vllm-pr 41834 --rebuild-vllm --cleanup -c

  3. VLLM_SPARK_EXTRA_DOCKER_ARGS=“-v $HOME/DATA/hf/models/:/models” ./run-recipe.py deepseek-v4-flash --no-ray

And then use the recipe to launch it (just tweak it slightly to fit your setup).

recipe
recipe_version: "1"
name: DeepSeek-V4-Flash
description: DeepSeek V4 Flash FP8 on dual DGX Spark TP=2 with PR 41834 SM12x support
model: deepseek-ai/DeepSeek-V4-Flash
container: vllm-node-220-1-41834-ds4
cluster_only: true

build_args:
  - --apply-vllm-pr
  - "41834"
  - --rebuild-vllm

mods:
#  - mods/fix-ds4-gpu-cache
  - mods/drop-caches

defaults:
  port: 8888
  host: 0.0.0.0
  tensor_parallel: 2
  pipeline_parallel: 1
  gpu_memory_utilization: 0.90
  max_model_len: 262144
  max_num_batched_tokens: 6144  # 8192 #16384  # 4192
  max_num_seqs: 8
  block_size: 256
  served_model_name: my-ds4

env:
  TORCH_CUDA_ARCH_LIST: 12.1a
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
  VLLM_TRITON_MLA_SPARSE: 1
  FLASHINFER_DISABLE_VERSION_CHECK: 1
  TILELANG_CLEANUP_TEMP_FILES: 1
  DG_JIT_USE_NVRTC: 0
  DG_JIT_NVCC_COMPILER: /usr/local/cuda/bin/nvcc
  DG_JIT_PRINT_COMPILER_COMMAND: 1
  NCCL_IB_DISABLE: 0
  NCCL_DEBUG: WARN
  OMP_NUM_THREADS: 4

command: |
  vllm serve \
      --model /models/deepseek-ai/DeepSeek-V4-Flash \
      --served-model-name {served_model_name} \
      --host {host} \
      --port {port} \
      --trust-remote-code \
      --tensor-parallel-size {tensor_parallel} \
      --pipeline-parallel-size {pipeline_parallel} \
      --kv-cache-dtype fp8 \
      --block-size {block_size} \
      --enable-prefix-caching \
      --max-model-len {max_model_len} \
      --max-num-seqs {max_num_seqs} \
      --max-num-batched-tokens {max_num_batched_tokens} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      --distributed-executor-backend mp \
      --compilation-config '{{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}}' \
      --tokenizer-mode deepseek_v4 \
      --tool-call-parser deepseek_v4 \
      --enable-auto-tool-choice \
      --reasoning-parser deepseek_v4 \
      --reasoning-config '{{"reasoning_parser":"deepseek_v4","reasoning_start_str":"<think>","reasoning_end_str":"</think>"}}' \
      --default-chat-template-kwargs '{{"thinking":true}}' \
      --load-format safetensors


#      --speculative-config '{{"method":"mtp","num_speculative_tokens":2}}' \

Thanks. I’m trying this new version. The build is on process, probably about 1.5 hours. God bless me, wish I could make it done tonight.

I also catch this error sometimes.

>
<|DSML|invoke name="bash">
<|DSML|parameter name="description" string="true">Check binary dates and debug</|DSML|parameter>
<|DSML|parameter name="command" string="true">ls -la /workspace/cgraph2dot && stat /workspace/cgraph2dot</|DSML|parameter>
</|DSML|invoke>
</|DSML|tool_calls>

This is a known issue with nvidia’s cutlass library on 4.5.x, due to a race condition.

You can fix the issue by additng this as the last line in spark-vllm-docker/Dockerfile:

RUN uv pip install --force-reinstall --no-deps nvidia-cutlass-dsl-libs-cu13==4.5.2

Thanks to your work I have DeepSeek v4 Flash up and running today on my 2-Spark Ray cluster. Very impressive - top scores in tool-call comparing to my existing champions (Nemo 3 Super, Qwen 3.5 122B, Mistral 4 Small 119B), very very solid speed (~37 t/s stable), excellent drafter. Running with 1M context and MTP, big bench - totally sold.

PS: started building the image and wheels, then found out your pre-made image from 2 days ago, pulled and voila. Easy as pie. Thank you!

It works. Thank you~