Anyone having luck with Deepseek V4 Flash on Dual Sparks?

I am reading really good things about Deepseek V4 Flash and was wondering:

  1. Has anyone had any luck running this at acceptable speeds (DFlash, etc.) on dual DGX Sparks? If so, what are the tokens per second?
  2. How are you feeling about this model versus Qwen 3.6 27B (which many people are raving about)?

Please check this thread: DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers - #51 by kimbona.dy

You can read about my exploits with DeepSeek V4 Flash here.

@LuckyChap I should perhaps add that there is a known issue at the moment with multiple concurrent requests with large-ish contexts, which @jasl asked me to file here: [Bug]: Decode slowdown with concurrent large context requests · Issue #8 · jasl/vllm · GitHub

My understanding is that he’s working on improving this as we speak, essentially trying some tricky stuff to balance the scheduling of simultaneous prefill and decode.

However, with the exact pinned versions on my blog, everything worked very well with just one request running at a time.
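If you want to stay on the "one request at a time" path from the client side until the scheduling issue is sorted, a gate like the one below is one way to do it. This is only a minimal sketch: `SingleFlight` and `fake_request` are names I made up for illustration, and `fake_request` stands in for whatever HTTP call you actually make to the server.

```python
import asyncio


class SingleFlight:
    """Client-side gate that lets at most `limit` requests run at once."""

    def __init__(self, limit: int = 1) -> None:
        self._sem = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0  # highest concurrency actually observed

    async def run(self, make_request):
        # Wait for a free slot, then run the request while holding it.
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await make_request()
            finally:
                self.in_flight -= 1


async def fake_request(i: int) -> str:
    # Hypothetical stand-in for a real call to the inference server.
    await asyncio.sleep(0.01)
    return f"response {i}"


async def main():
    gate = SingleFlight(limit=1)
    results = await asyncio.gather(
        *(gate.run(lambda i=i: fake_request(i)) for i in range(4))
    )
    return list(results), gate.peak


if __name__ == "__main__":
    results, peak = asyncio.run(main())
    print(results, "peak concurrency:", peak)
```

Raising `limit` later is a one-line change once concurrent large-context requests behave.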

Having said that, right now I am seeing

May 22 00:46:55 spark-dbac kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359

when I launch vLLM across the cluster. I am still investigating what changed here; I think perhaps the recent NVIDIA updates I installed may be using more RAM than they were previously?
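To compare memory headroom before and after the updates, it may help to record `MemAvailable` on each node just before launching. The sketch below parses `/proc/meminfo`-style text; the function names and the sample values are mine and purely illustrative, and I am only assuming that this number is relevant to the NVRM error.

```python
# Made-up sample in the /proc/meminfo format, for illustration only.
SAMPLE = """\
MemTotal:       125829120 kB
MemFree:         52428800 kB
MemAvailable:   104857600 kB
HugePages_Total:       0
"""


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: value_in_KiB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name: <integer> [unit]" lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(text: str) -> float:
    """Return MemAvailable converted from KiB to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 ** 2)


print(f"MemAvailable: {available_gib(SAMPLE):.1f} GiB")
```

On a real node you would pass in the contents of `/proc/meminfo` instead of `SAMPLE`, for example `available_gib(open("/proc/meminfo").read())`.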

I have had no issues running it: 48 hours non-stop at full load, with c=4 and 300k context.

We had to lower max_concurrent_children to 1 because of this issue. DeepSeek V4 was giving an NV_ERR_NO_MEMORY error. Thanks for this info; I was not sure what was causing it. Hope they get it fixed soon.

Please share your recipe; you have very good results. Also, please specify which version of vLLM you are using. Thank you.

# HuggingFace model to download (optional, for --download-model)
model: deepseek-ai/DeepSeek-V4-Flash

# Container image to use
container: vllm-node-dsv4:latest

# Can only be run in a cluster
cluster_only: true

# Custom vLLM build for DeepSeek V4 SM12x support
build_args:

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  pipeline_parallel: 1
  gpu_memory_utilization: 0.88
  max_model_len: 300000
  max_num_batched_tokens: 16384
  max_num_seqs: 8
  block_size: 256

# Environment variables - keep HF fully offline since the model is bind-mounted
env:
  TORCH_CUDA_ARCH_LIST: 12.1a
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
  VLLM_TRITON_MLA_SPARSE: 1
  FLASHINFER_DISABLE_VERSION_CHECK: 1
  TILELANG_CLEANUP_TEMP_FILES: 1
  DG_JIT_USE_NVRTC: 0
  DG_JIT_NVCC_COMPILER: /usr/local/cuda/bin/nvcc
  DG_JIT_PRINT_COMPILER_COMMAND: 1
  NCCL_IB_DISABLE: 0
  NCCL_DEBUG: WARN

mounts:
  - ~/models/:/models/

# The vLLM serve command template
command: |
  vllm serve /models/deepseek-ai/DeepSeek-V4-Flash \
    --served-model-name deepseek-ai/DeepSeek-V4-Flash \
    --host {host} \
    --port {port} \
    --trust-remote-code \
    --tensor-parallel-size {tensor_parallel} \
    --pipeline-parallel-size {pipeline_parallel} \
    --kv-cache-dtype fp8 \
    --block-size {block_size} \
    --enable-prefix-caching \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --distributed-executor-backend mp \
    --compilation-config '{{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}}' \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --reasoning-config '{{"reasoning_parser":"deepseek_v4","reasoning_start_str":"","reasoning_end_str":""}}' \
    --default-chat-template-kwargs '{{"thinking":true}}' \
    --enable-expert-parallel \
    --load-format safetensors
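
A note on the braces in the command template, since they trip people up: single braces such as {port} are placeholders filled from the defaults, while doubled braces in the JSON arguments are escapes that come out as single braces. The sketch below shows that behaviour using Python's `str.format`. I am assuming the recipe launcher substitutes values in this style; `render`, `TEMPLATE`, and `DEFAULTS` are illustrative names, and the template is shortened.

```python
from typing import Optional

# Shortened version of the command template, for illustration only.
TEMPLATE = (
    "vllm serve /models/deepseek-ai/DeepSeek-V4-Flash "
    "--port {port} "
    "--tensor-parallel-size {tensor_parallel} "
    "--default-chat-template-kwargs '{{\"thinking\":true}}'"
)

DEFAULTS = {"port": 8000, "tensor_parallel": 2}


def render(template: str, defaults: dict, overrides: Optional[dict] = None) -> str:
    """Fill {name} placeholders; '{{' and '}}' become literal braces."""
    values = {**defaults, **(overrides or {})}
    return template.format(**values)


print(render(TEMPLATE, DEFAULTS))
print(render(TEMPLATE, DEFAULTS, {"port": 9000}))
```

If you hand-edit the JSON arguments and forget to double a brace, this style of substitution fails with a `KeyError` or `ValueError` rather than passing the text through.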

Just don’t use MTP, as tool calling will be a big issue. I can’t say which vLLM version, as it’s running some instances right now that I don’t want to stop :D, but I used the original recipe from DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers - #19 by arthurdroz

Thank you so much, I followed your article, and the model finally runs!

With how fast all the repos are moving and all the minefields to avoid, you’ve done well to get it working! Glad the blog post helped!

I run with MTP=2 and tool calling seems fine on my end. 384K context window.

I had it running for a week with no issues. I had to stop it manually to try some other options.

I’ve got it up and running. Rock solid so far. I was using Qwen3.5-122B-A10B (iirc) and it was buggy - would crash under load. The one thing I am working through is thinking budget (it likes to think - A LOT). I’m currently working on a --logits-processors plugin to see if I can integrate a thinking budget that it will actually respect.
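
For anyone curious what the thinking-budget idea boils down to, here is the core logic as a plain-Python sketch. It is not the vLLM logits-processor interface, just the decision a plugin would need to make at each step: once the model has spent `budget` tokens inside an open thinking span, mask everything except the end-of-thinking token. The function name and the token ids are hypothetical, and it assumes the start-of-thinking marker is visible in the token history it is given.

```python
import math


def apply_thinking_budget(tokens, logits, start_id, end_id, budget):
    """Return logits, forcing `end_id` once the thinking budget is used up.

    tokens: token ids seen so far (assumed to include the thinking-start marker)
    logits: list of floats, one per vocabulary entry, for the next token
    """
    # Locate the most recent start-of-thinking marker, if any.
    try:
        last_start = len(tokens) - 1 - tokens[::-1].index(start_id)
    except ValueError:
        return logits  # thinking never started

    thinking = tokens[last_start + 1:]
    if end_id in thinking:
        return logits  # thinking span already closed
    if len(thinking) < budget:
        return logits  # still within budget

    # Budget exhausted: only the end-of-thinking token remains possible.
    forced = [-math.inf] * len(logits)
    forced[end_id] = 0.0
    return forced


# Toy vocabulary of 5 tokens: id 1 opens thinking, id 2 closes it.
flat = [0.0] * 5
within = apply_thinking_budget([1, 3, 4], flat, start_id=1, end_id=2, budget=3)
over = apply_thinking_budget([1, 3, 4, 3], flat, start_id=1, end_id=2, budget=3)
print(within == flat, over.index(max(over)))
```

A hard cut like this stops the thinking abruptly; a softer variant could add a growing bias to the end token as the budget runs out, but I have not tested either against this model.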