I am reading really good things about DeepSeek V4 Flash and was wondering:
Has anyone had any luck running this at acceptable speeds (DFlash, etc.) on dual DGX Sparks? If so, what tokens per second are you getting?
How are you feeling about this model versus Qwen 3.6 27B (which many people are raving about)?
j0n
May 22, 2026, 12:10am
3
You can read about my exploits with DeepSeek V4 Flash here.
j0n
May 22, 2026, 12:56am
4
@LuckyChap I should perhaps add that there is a known issue at the moment with multiple concurrent requests with large-ish contexts, which @jasl asked me to file here: [Bug]: Decode slowdown with concurrent large context requests · Issue #8 · jasl/vllm · GitHub
My understanding is that he's working on improving this as we speak, essentially trying some tricky stuff to balance the scheduling of simultaneous prefill and decode.
However, with the exact pinned versions on my blog, everything worked very well with just one request running at a time.
Having said that, right now I am seeing
May 22 00:46:55 spark-dbac kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
when I launch vLLM across the cluster. I am still investigating what changed here; I think perhaps the recent NVIDIA updates I installed may be using more RAM than they were previously?
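For anyone chasing the same NV_ERR_NO_MEMORY message, a rough way to see whether something else has started eating RAM is to compare `MemAvailable` against the share of memory vLLM is asked to reserve. This is only a sketch: it assumes that on the Spark's unified memory the `gpu_memory_utilization` fraction is effectively taken out of the same pool that `/proc/meminfo` reports, and the helper names are made up for illustration.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def headroom_gib(meminfo: dict[str, int], utilization: float) -> float:
    """GiB of MemAvailable left after reserving `utilization` * MemTotal.

    A negative result means the requested fraction no longer fits in what
    is currently available (assumption: both draw from one unified pool).
    """
    requested_kib = meminfo["MemTotal"] * utilization
    return (meminfo["MemAvailable"] - requested_kib) / (1024 ** 2)
```

To use it on a node, read the file and pass in whatever fraction your launch config uses, e.g. `headroom_gib(parse_meminfo(open("/proc/meminfo").read()), 0.88)`. Running it before and after a system update gives a quick before/after comparison.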
vr8vr8
May 22, 2026, 11:07am
5
I have no issues running it. 48h non-stop under full load, with c=4 and 300k context.
j0n:
@LuckyChap I should perhaps add that there is a known issue at the moment with multiple concurrent requests with large-ish contexts, which @jasl9187 asked me to file here: [Bug]: Decode slowdown with concurrent large context requests · Issue #8 · jasl/vllm · GitHub
My understanding is that he's working on improving this as we speak, essentially trying some tricky stuff to balance the scheduling of simultaneous prefill and decode.
However, with the exact pinned versions on my blog, everything worked very well with just one request running at a time.
Having said that, right now I am seeing
May 22 00:46:55 spark-dbac kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
when I launch vLLM across the cluster. I am still investigating what changed here; I think perhaps the recent NVIDIA updates I installed may be using more RAM than they were previously?
We had to lower max_concurrent_children to 1 because of this issue. DeepSeek V4 was giving an NV_ERR_NO_MEMORY error. Thanks for this info; I was not sure what was causing it. Hope they get it fixed soon.
Please share your recipe, you have very good results. Also, specify which version of vLLM you are using. Thank you.
# HuggingFace model to download (optional, for --download-model)
model: deepseek-ai/DeepSeek-V4-Flash

# Container image to use
container: vllm-node-dsv4:latest

# Can only be run in a cluster
cluster_only: true

# Custom vLLM build for DeepSeek V4 SM12x support
build_args:

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  pipeline_parallel: 1
  gpu_memory_utilization: 0.88
  max_model_len: 300000
  max_num_batched_tokens: 16384
  max_num_seqs: 8
  block_size: 256

# Environment variables: keep HF fully offline since the model is bind-mounted
env:
  TORCH_CUDA_ARCH_LIST: 12.1a
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
  VLLM_TRITON_MLA_SPARSE: 1
  FLASHINFER_DISABLE_VERSION_CHECK: 1
  TILELANG_CLEANUP_TEMP_FILES: 1
  DG_JIT_USE_NVRTC: 0
  DG_JIT_NVCC_COMPILER: /usr/local/cuda/bin/nvcc
  DG_JIT_PRINT_COMPILER_COMMAND: 1
  NCCL_IB_DISABLE: 0
  NCCL_DEBUG: WARN

mounts:

# The vLLM serve command template
command: |
  vllm serve /models/deepseek-ai/DeepSeek-V4-Flash \
    --served-model-name deepseek-ai/DeepSeek-V4-Flash \
    --host {host} \
    --port {port} \
    --trust-remote-code \
    --tensor-parallel-size {tensor_parallel} \
    --pipeline-parallel-size {pipeline_parallel} \
    --kv-cache-dtype fp8 \
    --block-size {block_size} \
    --enable-prefix-caching \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --distributed-executor-backend mp \
    --compilation-config '{{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}}' \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --reasoning-config '{{"reasoning_parser":"deepseek_v4","reasoning_start_str":"","reasoning_end_str":""}}' \
    --default-chat-template-kwargs '{{"thinking":true}}' \
    --enable-expert-parallel \
    --load-format safetensors
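In case the doubled braces in the command template look odd: they read like Python `str.format` escaping, where `{host}` is a placeholder and `{{ ... }}` comes out as a literal `{ ... }`. The sketch below assumes the launcher substitutes the `defaults` block that way, which I have not verified; `render` and the shortened template are purely illustrative and not part of the actual tooling.

```python
# Shortened, hypothetical copy of the recipe's command template.
TEMPLATE = (
    "vllm serve /models/deepseek-ai/DeepSeek-V4-Flash "
    "--host {host} --port {port} "
    "--tensor-parallel-size {tensor_parallel} "
    "--default-chat-template-kwargs '{{\"thinking\":true}}'"
)

# Values taken from the recipe's `defaults` block.
DEFAULTS = {"host": "0.0.0.0", "port": 8000, "tensor_parallel": 2}


def render(template: str, defaults: dict, **overrides) -> str:
    """Fill placeholders from defaults, letting CLI-style overrides win."""
    values = {**defaults, **overrides}
    # str.format replaces {name} and turns {{ / }} into literal braces.
    return template.format(**values)


if __name__ == "__main__":
    print(render(TEMPLATE, DEFAULTS, port=8001))
```

If the launcher works like this, forgetting to double the braces in a JSON argument would make the substitution fail on the JSON itself, which is one reason to keep the `{{ }}` form exactly as written in the recipe.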
Just don't use MTP, as tool calling will be a big issue. I can't say which vLLM version, as it's running some instances right now that I don't want to stop :D, but I used the original recipe from: DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark - TP=2, MTP, 200K ctx, recipe + numbers - #19 by arthurdroz
Thank you so much, I followed your article, and the model runs finally!
j0n
May 23, 2026, 6:41am
11
With how fast all the repos are moving and all the minefields to avoid, you've done well to get it working! Glad the blog post helped!
I run with MTP=2 and tool calling seems fine on my end. 384K context window.
I had it running for a week with no issues. I had to stop it manually to try some other options.
I've got it up and running. Rock solid so far. I was using Qwen3.5-122B-A10B (iirc) and it was buggy - would crash under load. The one thing I am working through is thinking budget (it likes to think - A LOT). I'm currently working on a --logits-processors plugin to see if I can integrate a thinking budget that it will actually respect.
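For anyone curious what a thinking budget at the logits level might look like, here is the core idea as a minimal sketch: once the model has produced a set number of tokens without closing its reasoning, mask everything except the end-of-thinking token. It deliberately uses plain Python lists rather than vLLM's actual logits-processor interface, and `END_THINK_ID` is a made-up token id, so treat it as an illustration of the logic and not as the plugin described above.

```python
import math

END_THINK_ID = 2  # hypothetical id of the end-of-thinking token


def apply_thinking_budget(
    logits: list[float],
    generated_ids: list[int],
    budget: int,
    end_think_id: int = END_THINK_ID,
) -> list[float]:
    """Force the end-of-thinking token once `budget` tokens have been emitted.

    Assumes `generated_ids` holds only the tokens produced since thinking
    started. Logits are returned untouched while under budget or after
    the reasoning block has already been closed.
    """
    if end_think_id in generated_ids:
        return logits  # thinking already finished, nothing to enforce
    if len(generated_ids) < budget:
        return logits  # still within budget
    # Over budget: every token except end_think_id becomes impossible.
    forced = [-math.inf] * len(logits)
    forced[end_think_id] = 0.0
    return forced
```

A hard cut like this can stop the reasoning mid-sentence, so a real plugin might prefer to gradually boost the end-of-thinking logit as the budget runs out instead of masking everything at once.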