PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM

I’m still getting crashes. Tested with FlashInfer built from source from main during my nightly build.

1 Like

As expected; see the comments on the PR.

I’m just cloning this and installing from source:

johnnynunez/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs

then adding flashinfer nightlies on top

So, the only way to avoid the crash currently is:

  1. Rebuild both vLLM and Flashinfer with CUTLASS PR applied;
  2. or use CUDA_LAUNCH_BLOCKING=1?

EDIT: looks like the CUTLASS fix is not enough, since there is an issue with that PR as well. I guess we’ll just have to wait a little bit longer.

If the tile shape mismatch is responsible for the crashing during long decode sequences, wouldn’t the same behavior be noticed with vllm_cutlass since it’s ultimately a cutlass bug?

There are likely a good number of folks using eugr’s Docker build, and there don’t seem to be reports of it. I haven’t tested it in the same way (GPQA long-context runs), but would that be helpful?

It doesn’t seem worth trying the most recent commits to johnny’s PR if an accuracy test is going to hit an illegal-instruction exception halfway through.

I’m also noticing that in your recent nvfp4 recipes you’re only setting the MoE backend, not setting the GEMM backend to cutlass to match and avoid FlashInfer:

"VLLM_NVFP4_GEMM_BACKEND": env_with_choices(
    "VLLM_NVFP4_GEMM_BACKEND",
    None,
    [
        "flashinfer-cudnn",
        "flashinfer-trtllm",
        "flashinfer-cutlass",
        "cutlass",
        "marlin",
    ],
),

I’m going to try --moe-backend cutlass and env VLLM_NVFP4_GEMM_BACKEND=cutlass and see if the crashing persists.
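For reference, the choices list quoted above behaves roughly like this. This is a minimal sketch for illustration only; vLLM's real `env_with_choices` lives in `vllm/envs.py` and differs in detail:

```python
import os

# Allowed backend values, copied from the vLLM snippet above.
NVFP4_GEMM_CHOICES = [
    "flashinfer-cudnn",
    "flashinfer-trtllm",
    "flashinfer-cutlass",
    "cutlass",
    "marlin",
]

def get_nvfp4_gemm_backend(default=None):
    """Read VLLM_NVFP4_GEMM_BACKEND and validate it against the allowed set.

    Hypothetical helper mirroring env_with_choices; not vLLM's actual code.
    """
    value = os.environ.get("VLLM_NVFP4_GEMM_BACKEND", default)
    if value is not None and value not in NVFP4_GEMM_CHOICES:
        raise ValueError(
            f"invalid VLLM_NVFP4_GEMM_BACKEND={value!r}, "
            f"expected one of {NVFP4_GEMM_CHOICES}")
    return value

os.environ["VLLM_NVFP4_GEMM_BACKEND"] = "cutlass"
print(get_nvfp4_gemm_backend())  # cutlass
```

The default is `None`, so leaving the variable unset lets vLLM pick a backend itself; setting it pins the choice, which is what forcing `cutlass` does.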

vLLM also provides a build-time env var, VLLM_CUTLASS_SRC_DIR, to point the build at a CUTLASS source directory, which can be useful if the bugfix winds start blowing in the CUTLASS direction.

Edit:

the choo choo train continues [NVIDIA] fix(jit): enable GDC for CUTLASS fused MoE PDL — prevent random crashes on SM12x by johnnynunez · Pull Request #2913 · flashinfer-ai/flashinfer

2 Likes

Run vLLM on Thor & Spark

Step-by-step guide to building and running vLLM with FlashInfer on NVIDIA Thor (SM110) and Spark (SM121) platforms.

1. Install uv

Clear any stale cache, then install the uv package manager:

sudo rm -rf ~/.cache/
sudo apt install ccache
curl -LsSf https://astral.sh/uv/install.sh | sh

2. Create a Virtual Environment

sudo apt install python3-dev
uv venv .vllm --python 3.12
source .vllm/bin/activate

3. Install PyTorch

uv pip install --force-reinstall torch torchvision

4. Build and Install vLLM

Note: The build must include vllm-project/vllm#38423, so use the fork below.

git clone --recursive https://github.com/johnnynunez/vllm.git
cd vllm

export VLLM_VERSION=0.18.1
export TORCH_CUDA_ARCH_LIST=12.1a
export USE_CUDNN=1
export VERBOSE=1
export CUDA_HOME=/usr/local/cuda
export PATH="${CUDA_HOME}/bin:$PATH"
export SETUPTOOLS_SCM_PRETEND_VERSION="${VLLM_VERSION}"
export DG_JIT_USE_NVRTC=1  # DeepGEMM NVRTC support — up to 10x compilation speedup

python3 use_existing_torch.py || echo "Skipping use_existing_torch.py"

uv pip install -r requirements/build.txt -v
python3 -m setuptools_scm

# Constrain parallelism on aarch64 to avoid OOM during compilation
ARCH=$(uname -i)
if [ "${ARCH}" = "aarch64" ]; then
    export NVCC_THREADS=1
    export CUDA_NVCC_FLAGS="-Xcudafe --threads=1"
    export MAKEFLAGS='-j2'
    export MAX_JOBS=2  # keep in line with the -j2 limits above
    export CMAKE_BUILD_PARALLEL_LEVEL=$MAX_JOBS
    export NINJAFLAGS='-j2'
fi

uv build --wheel --no-build-isolation -v --out-dir ./wheels .
uv pip install ./wheels/vllm*.whl

cd /opt/vllm
uv pip install compressed-tensors

5. Uninstall Pre-built FlashInfer Packages

Remove any pre-compiled FlashInfer packages to avoid conflicts with the editable install:

uv pip uninstall flashinfer-cubin flashinfer-python

6. Install FlashInfer from Source

Note: The build must include flashinfer-ai/flashinfer#2913, so use the fork below.

sudo rm -rf ~/.cache/
git clone --recursive https://github.com/johnnynunez/flashinfer.git
cd flashinfer
uv pip install --force-reinstall --no-build-isolation -e .

7. Export Environment Variables

Set the CUDA architecture target and related paths. Use 12.1a for Spark or 11.0a for Thor:

export TORCH_CUDA_ARCH_LIST=12.1a  # Spark: 12.1a — Thor: 11.0a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export CUDA_HOME=/usr/local/cuda
export CPATH=$CUDA_HOME/include:${CPATH}
export C_INCLUDE_PATH=$CUDA_HOME/include:${C_INCLUDE_PATH}
export CPLUS_INCLUDE_PATH=$CUDA_HOME/include:${CPLUS_INCLUDE_PATH}

# Recommended on Jetson platforms
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib:${LD_LIBRARY_PATH}
export LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib:${LIBRARY_PATH}

8. Clear Memory

Drop filesystem caches to free up memory before serving:

sync
sudo sysctl -w vm.drop_caches=3

9. Serve the Model (Speculative Decoding with MTP)

Launch vLLM with Qwen3.5-122B using 3 speculative tokens via MTP:

vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
    --port 9000 \
    --max-num-seqs 2 \
    --tensor-parallel-size 1 \
    --max-model-len 131072 \
    --trust-remote-code \
    --gpu-memory-utilization 0.80 \
    --kv-cache-dtype fp8 \
    --speculative_config '{"method":"mtp","num_speculative_tokens":3}'

10. Run a Stress Test (Separate Terminal)

In another terminal with the .vllm environment activated, run the following script; it sends one ~100K-token request, then 2 concurrent requests, then 10 rapid sequential ones:

python3 -c "
import requests, time, sys, concurrent.futures

MODEL = 'Sehyo/Qwen3.5-122B-A10B-NVFP4'
PORT = 9000

# ~100K tokens — safely under 131072 - 1024 = 130048 limit
parts = []
for i in range(3000):
    parts.append(f'Section {i}: The quick brown fox jumps over the lazy dog. Technology advances rapidly in quantum computing and distributed systems. ')
prompt = 'Write a comprehensive analysis: ' + ' '.join(parts)
print(f'Approx words: {len(prompt.split())}')
sys.stdout.flush()

def send_request(idx):
    t0 = time.time()
    try:
        r = requests.post(f'http://localhost:{PORT}/v1/completions', json={
            'model': MODEL,
            'prompt': prompt,
            'max_tokens': 1024,
            'temperature': 0.7,
        }, timeout=600)
        elapsed = time.time() - t0
        if r.status_code == 200:
            data = r.json()
            text = data['choices'][0]['text']
            usage = data.get('usage', {})
            return f'[{idx}] OK - {len(text)}ch, prompt={usage.get(\"prompt_tokens\",\"?\")}, gen={usage.get(\"completion_tokens\",\"?\")}, {elapsed:.1f}s'
        else:
            err = r.json().get('error',{}).get('message','')[:200]
            return f'[{idx}] FAIL ({r.status_code}): {err}'
    except Exception as e:
        elapsed = time.time() - t0
        return f'[{idx}] CRASH - {type(e).__name__}: {e} ({elapsed:.1f}s)'

# Phase 1: Single ~100K token request
print('=== Phase 1: Single ~100K token request ===')
sys.stdout.flush()
print(send_request(1)); sys.stdout.flush()

# Phase 2: 2 concurrent
print('=== Phase 2: 2 concurrent ===')
sys.stdout.flush()
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futs = [pool.submit(send_request, i) for i in range(2, 4)]
    for f in concurrent.futures.as_completed(futs):
        print(f.result()); sys.stdout.flush()

# Phase 3: 10 rapid
print('=== Phase 3: 10 rapid sequential ===')
sys.stdout.flush()
for i in range(4, 14):
    r = send_request(i)
    print(r); sys.stdout.flush()
    if 'CRASH' in r: break

print('Done.')
" 2>&1
12 Likes

Is it intended that flashinfer is by default building for 120f and not 121a?

I usually set FLASHINFER_CUDA_ARCH_LIST="12.1a", but that is omitted in your post.

You can add it, but that only matters when you build cubins or the JIT cache (i.e., when kernels or C++ code are compiled and aarch64 appears in the wheel name)… With flashinfer-python (-none in the wheel name) you are only building a wrapper.
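The wheel-name distinction can be checked mechanically from the platform tag at the end of the filename. A small sketch; the filenames below are illustrative examples, not exact release artifacts:

```python
def wheel_kind(filename):
    """Classify a wheel as a pure-Python wrapper or a compiled binary wheel.

    Wheel filenames end in <python>-<abi>-<platform>.whl; a platform tag of
    'any' means no compiled extensions (a pure-Python wrapper), while an
    arch-specific tag such as 'linux_aarch64' means compiled binaries.
    """
    platform_tag = filename.removesuffix(".whl").rsplit("-", 1)[-1]
    return "wrapper" if platform_tag == "any" else "binary"

# Hypothetical example filenames:
print(wheel_kind("flashinfer_python-0.5.0-py3-none-any.whl"))                 # wrapper
print(wheel_kind("flashinfer_cubin-0.5.0-cp312-cp312-linux_aarch64.whl"))     # binary
```

So TORCH_CUDA_ARCH_LIST / FLASHINFER_CUDA_ARCH_LIST only influence builds that produce the arch-tagged wheels, not the pure wrapper.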

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--gpu-memory-utilization 0.7 \
--max-model-len 262144 \
--max-num-seqs 10 \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8000 \
--enable-auto-tool-choice \
--load-format fastsafetensors \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_v3 \
--mamba_ssm_cache_dtype float32

It is working with Nemotron Super too.

3 Likes

Sounds promising. I installed your forks of vLLM and FlashInfer and am running the same GPQA test against them.

Still the same FlashInfer autotuner dump at the start, but it’s been running for 30 minutes so far.

That will be fixed in 0.6.8 with cubins. For now it compiles all the necessary kernels (JIT), so the first run takes some minutes…

So the real root cause was GDC not being enabled for the kernel compilation?

vLLM uses PDL all the time now, for both Hopper and Blackwell.

PDL (Programmatic Dependent Launch) is a feature of Hopper and newer GPUs that allows one kernel to signal to the GPU scheduler that the next kernel in the stream can start executing before the current kernel finishes. It’s an optimization that reduces kernel launch latency by overlapping execution.

The problem is: if the PDL signaling has a bug or timing issue, the dependent kernel might start reading data that the first kernel hasn’t finished writing yet, or it might be launched with corrupted launch parameters. This creates a race condition:

  • Random crashes (timing-dependent)
  • cudaErrorIllegalInstruction (corrupted kernel state or bad instruction dispatch)
  • Worse under higher load (MTP + 128K context = more kernels overlapping = more chances for the race to hit)
  • 32K context works fine (less pressure, fewer kernel launches, race window is smaller)
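As a loose analogy only (Python threads, not CUDA kernels), the failure mode and the GDC-style fix look like this: a dependent consumer that starts before the producer has finished writing shared state would see a timing-dependent result, and gating it on an explicit completion signal (a `threading.Event` here, standing in for the GDC gate) makes it deterministic:

```python
import threading

buffer = []
done = threading.Event()

def producer():
    # "First kernel": writes 1000 items to shared state.
    for i in range(1000):
        buffer.append(i)
    done.set()  # completion signal, analogous to GDC gating dependent launch

def consumer(results):
    # "Dependent kernel": without this wait, len(buffer) at read time
    # would be timing-dependent -- exactly the race described above.
    done.wait()
    results.append(len(buffer))

results = []
t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer, args=(results,))
t_cons.start()  # dependent side launched first, as with PDL overlap
t_prod.start()
t_prod.join()
t_cons.join()
print(results[0])  # 1000: consumer only reads after the producer signals
```

The analogy also shows why more load widens the race window: more overlapping producer/consumer pairs means more chances for an ungated read to land mid-write.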

There was some related issue in the past… SM120 CUTLASS FP4 GEMM: missing GDC compile flags cause PDL race condition — output tile corruption under concurrency · Issue #2708 · flashinfer-ai/flashinfer · GitHub but the flags were not activated for JIT and MoE either.

It works now because it’s a collection of fixes:

  • DGX Spark smx code logic
  • CUTLASS v4.4.2
  • GDC flags in FlashInfer
  • mxfp4 and nvfp4 GEMM in FlashInfer/CUTLASS
  • and more…

Now we can start seeing performance PRs :)

6 Likes

Ok, I believe this one.

So, I have my NVIDIA baseball hat back on: we’ve moved from the slow lane (Marlin) to the fast lane (FlashInfer CUTLASS), but we’re still driving a slow car.

At least the car doesn’t break down now though :)

How fast can the car go though?

Stay tuned?

machine pushing hard, is today the day?

2 Likes

We have some PRs ready to improve performance, so good news.

9 Likes

Hi, thanks for the great work you are all doing. I am still a newbie in this home LLM-serving world, but I’m trying to catch up now that “everybody” can play. I just acquired a Spark and just started playing with it via eugr’s script. I am a robotics engineer and have played around with a lot of different ML, but like I said, I’m still an LLM newbie and would like to join in. So if I can do any testing or anything else to help, please let me know and I will be ready.

1 Like

I am from the Physical AI side, so I understand you! I did this in my free time. I have been a member of the community for the last 3 years, so I am always happy to help.

12 Likes

Looks promising, @johnny_nv, thank you. I will wait for these changes to land in @eugr’s setup.

1 Like