PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM

I’m still getting crashes. Tested with FlashInfer built from source from main during my nightly build.

1 Like

As expected; see the comments on the PR.

I’m just cloning this and installing from source:

johnnynunez/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs

then adding flashinfer nightlies on top

So, the only way to avoid the crash currently is:

  1. Rebuild both vLLM and Flashinfer with CUTLASS PR applied;
  2. or use CUDA_LAUNCH_BLOCKING=1?

EDIT: looks like the CUTLASS fix is not enough, since there is an issue with that PR as well. I guess we’ll just have to wait a little bit longer.

If the tile shape mismatch is responsible for the crashing during long decode sequences, wouldn’t the same behavior be noticed with vllm_cutlass since it’s ultimately a cutlass bug?

There are likely a good number of folks using eugr’s Docker build, and there don’t seem to be reports of it. I haven’t tested it in the same way (GPQA long-context runs), but would that be helpful?

It doesn’t seem worth trying the most recent commits to johnny’s PR if an accuracy test is going to hit an illegal-instruction exception halfway through.

I’m also noticing that in your recent nvfp4 recipes you’re only setting the MoE backend, not setting the GEMM backend to cutlass to match and avoid FlashInfer:

"VLLM_NVFP4_GEMM_BACKEND": env_with_choices(
    "VLLM_NVFP4_GEMM_BACKEND",
    None,
    [
        "flashinfer-cudnn",
        "flashinfer-trtllm",
        "flashinfer-cutlass",
        "cutlass",
        "marlin",
    ],
),

I’m going to try --moe-backend cutlass and env VLLM_NVFP4_GEMM_BACKEND=cutlass and see if the crashing persists.
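For reference, the choices list quoted above behaves roughly like this. This is a minimal sketch for illustration only; vLLM's real `env_with_choices` lives in `vllm/envs.py` and differs in detail:

```python
import os

# Allowed backend values, copied from the vLLM snippet above.
NVFP4_GEMM_CHOICES = [
    "flashinfer-cudnn",
    "flashinfer-trtllm",
    "flashinfer-cutlass",
    "cutlass",
    "marlin",
]

def get_nvfp4_gemm_backend(default=None):
    """Read VLLM_NVFP4_GEMM_BACKEND and validate it against the allowed set.

    Hypothetical helper mirroring env_with_choices; not vLLM's actual code.
    """
    value = os.environ.get("VLLM_NVFP4_GEMM_BACKEND", default)
    if value is not None and value not in NVFP4_GEMM_CHOICES:
        raise ValueError(
            f"invalid VLLM_NVFP4_GEMM_BACKEND={value!r}, "
            f"expected one of {NVFP4_GEMM_CHOICES}")
    return value

os.environ["VLLM_NVFP4_GEMM_BACKEND"] = "cutlass"
print(get_nvfp4_gemm_backend())  # cutlass
```

The default is `None`, so leaving the variable unset lets vLLM pick a backend itself; setting it pins the choice, which is what forcing `cutlass` does.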

vLLM also provides a build-time env var, VLLM_CUTLASS_SRC_DIR, to point the build at a CUTLASS source directory, which can be useful if the bugfix winds start blowing in the CUTLASS direction.

Edit:

the choo choo train continues [NVIDIA] fix(jit): enable GDC for CUTLASS fused MoE PDL — prevent random crashes on SM12x by johnnynunez · Pull Request #2913 · flashinfer-ai/flashinfer

2 Likes

Run vLLM on Thor & Spark

Step-by-step guide to building and running vLLM with FlashInfer on NVIDIA Thor (SM110) and Spark (SM121) platforms.

1. Install uv

Clear any stale cache, then install the uv package manager:

sudo rm -rf ~/.cache/
sudo apt install ccache
curl -LsSf https://astral.sh/uv/install.sh | sh

2. Create a Virtual Environment

sudo apt install python3-dev
uv venv .vllm --python 3.12
source .vllm/bin/activate

3. Install PyTorch

uv pip install --force-reinstall torch torchvision

4. Build and Install vLLM

Note: The build must include vllm-project/vllm#38423, so use the fork below.

git clone --recursive https://github.com/johnnynunez/vllm.git
cd vllm

export VLLM_VERSION=0.18.1
export TORCH_CUDA_ARCH_LIST=12.1a
export USE_CUDNN=1
export VERBOSE=1
export CUDA_HOME=/usr/local/cuda
export PATH="${CUDA_HOME}/bin:$PATH"
export SETUPTOOLS_SCM_PRETEND_VERSION="${VLLM_VERSION}"
export DG_JIT_USE_NVRTC=1  # DeepGEMM NVRTC support — up to 10x compilation speedup

python3 use_existing_torch.py || echo "Skipping use_existing_torch.py"

uv pip install -r requirements/build.txt -v
python3 -m setuptools_scm

# Constrain parallelism on aarch64 to avoid OOM during compilation
ARCH=$(uname -i)
if [ "${ARCH}" = "aarch64" ]; then
    export NVCC_THREADS=1
    export CUDA_NVCC_FLAGS="-Xcudafe --threads=1"
    export MAKEFLAGS='-j2'
    export MAX_JOBS=2  # keep in line with the -j2 limits above
    export CMAKE_BUILD_PARALLEL_LEVEL=$MAX_JOBS
    export NINJAFLAGS='-j2'
fi

uv build --wheel --no-build-isolation -v --out-dir ./wheels .
uv pip install ./wheels/vllm*.whl

cd /opt/vllm
uv pip install compressed-tensors

5. Uninstall Pre-built FlashInfer Packages

Remove any pre-compiled FlashInfer packages to avoid conflicts with the editable install:

uv pip uninstall flashinfer-cubin flashinfer-python

6. Install FlashInfer from Source

Note: The build must include flashinfer-ai/flashinfer#2913, so use the fork below.

sudo rm -rf ~/.cache/
git clone --recursive https://github.com/johnnynunez/flashinfer.git
cd flashinfer
uv pip install --force-reinstall --no-build-isolation -e .

7. Export Environment Variables

Set the CUDA architecture target and related paths. Use 12.1a for Spark or 11.0a for Thor:

export TORCH_CUDA_ARCH_LIST=12.1a  # Spark: 12.1a — Thor: 11.0a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export CUDA_HOME=/usr/local/cuda
export CPATH=$CUDA_HOME/include:${CPATH}
export C_INCLUDE_PATH=$CUDA_HOME/include:${C_INCLUDE_PATH}
export CPLUS_INCLUDE_PATH=$CUDA_HOME/include:${CPLUS_INCLUDE_PATH}

# Recommended on Jetson platforms
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib:${LD_LIBRARY_PATH}
export LIBRARY_PATH=$CUDA_HOME/lib64:$CUDA_HOME/lib:${LIBRARY_PATH}

8. Clear Memory

Drop filesystem caches to free up memory before serving:

sync
sudo sysctl -w vm.drop_caches=3

9. Serve the Model (Speculative Decoding with MTP)

Launch vLLM with Qwen3.5-122B using 3 speculative tokens via MTP:

vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
    --port 9000 \
    --max-num-seqs 2 \
    --tensor-parallel-size 1 \
    --max-model-len 131072 \
    --trust-remote-code \
    --gpu-memory-utilization 0.80 \
    --kv-cache-dtype fp8 \
    --speculative_config '{"method":"mtp","num_speculative_tokens":3}'

10. Run a Stress Test (Separate Terminal)

In another terminal with the .vllm environment activated, run the following script; it sends one ~100K-token request, then 2 concurrent requests, then 10 rapid sequential ones:

python3 -c "
import requests, time, sys, concurrent.futures

MODEL = 'Sehyo/Qwen3.5-122B-A10B-NVFP4'
PORT = 9000

# ~100K tokens — safely under 131072 - 1024 = 130048 limit
parts = []
for i in range(3000):
    parts.append(f'Section {i}: The quick brown fox jumps over the lazy dog. Technology advances rapidly in quantum computing and distributed systems. ')
prompt = 'Write a comprehensive analysis: ' + ' '.join(parts)
print(f'Approx words: {len(prompt.split())}')
sys.stdout.flush()

def send_request(idx):
    t0 = time.time()
    try:
        r = requests.post(f'http://localhost:{PORT}/v1/completions', json={
            'model': MODEL,
            'prompt': prompt,
            'max_tokens': 1024,
            'temperature': 0.7,
        }, timeout=600)
        elapsed = time.time() - t0
        if r.status_code == 200:
            data = r.json()
            text = data['choices'][0]['text']
            usage = data.get('usage', {})
            return f'[{idx}] OK - {len(text)}ch, prompt={usage.get(\"prompt_tokens\",\"?\")}, gen={usage.get(\"completion_tokens\",\"?\")}, {elapsed:.1f}s'
        else:
            err = r.json().get('error',{}).get('message','')[:200]
            return f'[{idx}] FAIL ({r.status_code}): {err}'
    except Exception as e:
        elapsed = time.time() - t0
        return f'[{idx}] CRASH - {type(e).__name__}: {e} ({elapsed:.1f}s)'

# Phase 1: Single ~100K token request
print('=== Phase 1: Single ~100K token request ===')
sys.stdout.flush()
print(send_request(1)); sys.stdout.flush()

# Phase 2: 2 concurrent
print('=== Phase 2: 2 concurrent ===')
sys.stdout.flush()
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futs = [pool.submit(send_request, i) for i in range(2, 4)]
    for f in concurrent.futures.as_completed(futs):
        print(f.result()); sys.stdout.flush()

# Phase 3: 10 rapid
print('=== Phase 3: 10 rapid sequential ===')
sys.stdout.flush()
for i in range(4, 14):
    r = send_request(i)
    print(r); sys.stdout.flush()
    if 'CRASH' in r: break

print('Done.')
" 2>&1
12 Likes

Is it intended that flashinfer is by default building for 120f and not 121a?

I usually set FLASHINFER_CUDA_ARCH_LIST="12.1a", but that is omitted in your post.

You can add it, but that only matters when you build cubins or the JIT cache (i.e., when kernels or C++ code are compiled and aarch64 appears in the wheel name)… With flashinfer-python (-none in the wheel name) you are only building a wrapper.
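The wheel-name distinction can be checked mechanically from the platform tag at the end of the filename. A small sketch; the filenames below are illustrative examples, not exact release artifacts:

```python
def wheel_kind(filename):
    """Classify a wheel as a pure-Python wrapper or a compiled binary wheel.

    Wheel filenames end in <python>-<abi>-<platform>.whl; a platform tag of
    'any' means no compiled extensions (a pure-Python wrapper), while an
    arch-specific tag such as 'linux_aarch64' means compiled binaries.
    """
    platform_tag = filename.removesuffix(".whl").rsplit("-", 1)[-1]
    return "wrapper" if platform_tag == "any" else "binary"

# Hypothetical example filenames:
print(wheel_kind("flashinfer_python-0.5.0-py3-none-any.whl"))                 # wrapper
print(wheel_kind("flashinfer_cubin-0.5.0-cp312-cp312-linux_aarch64.whl"))     # binary
```

So TORCH_CUDA_ARCH_LIST / FLASHINFER_CUDA_ARCH_LIST only influence builds that produce the arch-tagged wheels, not the pure wrapper.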

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--gpu-memory-utilization 0.7 \
--max-model-len 262144 \
--max-num-seqs 10 \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8000 \
--enable-auto-tool-choice \
--load-format fastsafetensors \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_v3 \
--mamba_ssm_cache_dtype float32

It is working with Nemotron Super too.

3 Likes

Sounds promising. I installed your forks of vLLM and FlashInfer and am running the same GPQA test against them.

Still the same FlashInfer autotuner dump at the start, but it’s been running for 30 minutes so far.

That will be fixed in 0.6.8 with cubins. For now it compiles all the necessary kernels (JIT), so the first run takes some minutes…

So the real root cause was GDC not being enabled for the kernel compilation?

vLLM uses PDL all the time now, for both Hopper and Blackwell.

PDL (Programmatic Dependent Launch) is a feature of Hopper and newer GPUs that allows one kernel to signal to the GPU scheduler that the next kernel in the stream can start executing before the current kernel finishes. It’s an optimization that reduces kernel launch latency by overlapping execution.

The problem is: if the PDL signaling has a bug or timing issue, the dependent kernel might start reading data that the first kernel hasn’t finished writing yet, or it might be launched with corrupted launch parameters. This creates a race condition:

  • Random crashes (timing-dependent)
  • cudaErrorIllegalInstruction (corrupted kernel state or bad instruction dispatch)
  • Worse under higher load (MTP + 128K context = more kernels overlapping = more chances for the race to hit)
  • 32K context works fine (less pressure, fewer kernel launches, race window is smaller)
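As a loose analogy only (Python threads, not CUDA kernels), the failure mode and the GDC-style fix look like this: a dependent consumer that starts before the producer has finished writing shared state would see a timing-dependent result, and gating it on an explicit completion signal (a `threading.Event` here, standing in for the GDC gate) makes it deterministic:

```python
import threading

buffer = []
done = threading.Event()

def producer():
    # "First kernel": writes 1000 items to shared state.
    for i in range(1000):
        buffer.append(i)
    done.set()  # completion signal, analogous to GDC gating dependent launch

def consumer(results):
    # "Dependent kernel": without this wait, len(buffer) at read time
    # would be timing-dependent -- exactly the race described above.
    done.wait()
    results.append(len(buffer))

results = []
t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer, args=(results,))
t_cons.start()  # dependent side launched first, as with PDL overlap
t_prod.start()
t_prod.join()
t_cons.join()
print(results[0])  # 1000: consumer only reads after the producer signals
```

The analogy also shows why more load widens the race window: more overlapping producer/consumer pairs means more chances for an ungated read to land mid-write.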

There was some related issue in the past… SM120 CUTLASS FP4 GEMM: missing GDC compile flags cause PDL race condition — output tile corruption under concurrency · Issue #2708 · flashinfer-ai/flashinfer · GitHub but the flags were not activated for JIT and MoE either.

It works now because it’s a collection of fixes:

  • DGX Spark smx code logic
  • CUTLASS v4.4.2
  • GDC flags in FlashInfer
  • mxfp4 and nvfp4 GEMM in FlashInfer/CUTLASS
  • and more…

Now we can start seeing performance PRs :)

6 Likes

Ok, I believe this one.

So, I have my NVIDIA baseball hat back on: we’ve moved from the slow lane (Marlin) to the fast lane (FlashInfer CUTLASS), but we’re still driving a slow car.

At least the car doesn’t break down now though :)

How fast can the car go though?

Stay tuned?

machine pushing hard, is today the day?

2 Likes

We have some PRs ready to improve performance, so good news.

9 Likes

Hi, thanks for the great work you are all doing. I am still a newbie in this home LLM-serving world, but I’m trying to catch up now that “everybody” can play. I just acquired a Spark and just started playing with it via eugr’s script. I am a robotics engineer and have played around with a lot of different ML, but like I said, I’m still an LLM newbie and would like to join in. So if I can do any testing or anything else to help, please let me know and I will be ready.

1 Like

I am from the Physical AI side, so I understand you! I did this in my free time. I have been a member of the community for the last 3 years, so I am always happy to help.

12 Likes

Looks promising, @johnny_nv, thank you. I will wait for these changes to land in @eugr’s setup.

1 Like