Run VLLM in Spark

it is coming next week

In the meantime, I’ve updated my container build to include building Triton from source and also switched to a multi-stage build with cache, so it takes less than 5 minutes to build new version of VLLM in subsequent builds. The first build will still take ~30 minutes, though. The final image size is down to 23.9GB (deployed) - not as small as NVIDIA one (which is about 15G), but still better than it was (36GB).

I may trim it down a bit more in the future, but the main goal was to improve the compilation times at this point.

@johnny_nv Great news. Thanks! Looking forward!

Is there some place we can test the WIP?

Is the repo that makes the container open for community contributions?

I’ve added a convenience script to start cluster on all nodes with autodiscovery option to the repo. Nothing too fancy, wanted to keep it barebones, but useful enough.

I’ll merge the pull requests after we resolve some questions there.


I see this pkg

vllm 0.13.0 is out

Nice. I see they included cu130 wheels, but I don’t think I miss anything compiling vllm from main, as usual. I’ll check it out though.

I accidentally took the time to read the changelog before posting here…

so Johnny beat me posting!

I’ve finally got to test the new cu130 wheels for vLLM and they seem to work well, but only the nightly version for now. The release one seems to link a wrong cudart library.

I’ve updated my Docker repository to include an option of installing vLLM from nightly wheels (and release when it gets fixed) instead of building from the source.

I’ve applied the same fastsafetensors patch, so it will work with the cluster just fine.

Strikes a good balance between the compilation speed and having the most recent build of vLLM. I’m not sure when exactly they build the nightly wheels, but at least once a day, so unless you need the most recent commit available, this is probably the best option.

You can use the new build by utilizing --use-wheels argument in build-and-copy.sh helper script - see README for more details.

I’ve tested it on a few models - so far so good.

You can also now specify --pre-flashinfer to build with pre-release flashinfer wheels (for both compile from source and install from wheels options).

Hello,

From pypi is cuda 12. you must to specify which wheel with the url.

pip install https://github.com/vllm-project/vllm/releases/download/v0.13.0/vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl

Many, many thanks. New container works great.

Oh, I was trying to use https://wheels.vllm.ai/cu130 which apparently doesn’t exist, so it pulled the one from PyPI I guess. I’ll pull it from GitHub then.

UPDATE: release wheels are now supported in the Docker build, but I would still recommend installing nightly ones, as they pull updated dependencies (I decided not to upgrade dependencies for release wheels at this point, other than flashinfer if pre-release is chosen).

I’ve added ability to use transformers v5 in Docker builds to support models like GLM 4.6V that require transformers 5.0.0rc0 or higher.

I suggest keeping this build image separate from the regular one in case of incompatibilities. It seems to work much better with other models now, but there may be cases when it breaks something.

Example usage

To build (using fast wheels build here) and distribute the image to your cluster node, you can use this command. You can name your image differently, I just use vllm-node-whl-tf5 here.

./build-and-copy.sh -t vllm-node-whl-tf5 --use-wheels --pre-tf --pre-flashinfer -c

Then, to run the model on all nodes of the cluster, you can use the new convenience script on the head node:

./launch-cluster.sh  \
        -t vllm-node-whl-tf5 \
        exec vllm serve zai-org/GLM-4.6V-FP8 \
        --tool-call-parser glm45 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        --allowed-local-media-path / \
        --mm-encoder-tp-mode data \
        -tp 2 \
        --gpu-memory-utilization 0.7 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000 \
        --load-format fastsafetensors

It will autodiscover your interface configuration, start cluster, and if everything is properly configured, launch the model. When you quit the process, it will shut down the cluster automatically.

Any improvements in performance for gpt-oss-120b?

No, gpt-oss-120b was the first thing I tested :) I knew that it won’t help, though.

I’m looking if I can backport improvements from sglang:spark (they are still not in the main sglang though).

I was able to incorporate Triton changes, but they won’t be used until Flashinfer MXFP4 path is restored on vLLM side. The changes sglang guy made there are not as straightforward, though. We’ll see.

I started writing patches for vllm to get the sm121a supported natively for mxfp4/nvfp4 on gpt-oss-120b, but didn’t complete the effort. There was some meaningful progress, but I ran out of steam. Here’s as far as I got:


diff --git a/vllm/envs.py b/vllm/envs.py
index 2f8158d88..1a24b0645 100755
--- a/vllm/envs.py
+++ b/vllm/envs.py
@@ -215,6 +215,7 @@ if TYPE_CHECKING:
     VLLM_HAS_FLASHINFER_CUBIN: bool = False
     VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: bool = False
     VLLM_USE_FLASHINFER_MOE_MXFP4_BF16: bool = False
+    VLLM_ALLOW_SM12X_MXFP4: bool = False
     VLLM_ROCM_FP8_MFMA_PAGE_ATTN: bool = False
     VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS: bool = False
     VLLM_ALLREDUCE_USE_SYMM_MEM: bool = True
@@ -1218,6 +1219,10 @@ environment_variables: dict[str, Callable[[], Any]] = {
     "VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8": lambda: bool(
         int(os.getenv("VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8", "0"))
     ),
+    # If set to 1, allow SM12X (DGX Spark) to try MXFP4
+    "VLLM_ALLOW_SM12X_MXFP4": lambda: bool(
+        int(os.getenv("VLLM_ALLOW_SM12X_MXFP4", "0"))
+    ),
     # If set to 1, use the FlashInfer CUTLASS backend for
     # MXFP8 (activation) x MXFP4 (weight) MoE.
     # This is separate from the TRTLLMGEN path controlled by
diff --git a/vllm/model_executor/layers/quantization/mxfp4.py b/vllm/model_executor/layers/quantization/mxfp4.py
index e96e87d15..dbc6622ff 100644
--- a/vllm/model_executor/layers/quantization/mxfp4.py
+++ b/vllm/model_executor/layers/quantization/mxfp4.py
@@ -87,13 +87,17 @@ def get_mxfp4_backend_with_lora() -> Mxfp4Backend:
         return Mxfp4Backend.NONE
 
     # If FlashInfer is not available, try either Marlin or Triton
+    cap = current_platform.get_device_capability()
+    # DGX Spark / GB10 reports SM12.x (e.g. (12, 1)).
+    is_sm12x = cap[0] == 12 and envs.VLLM_ALLOW_SM12X_MXFP4
     triton_kernels_supported = (
         has_triton_kernels()
         and is_torch_equal_or_newer("2.8.0")
         # NOTE: triton_kernels are only confirmed to work on SM90 and SM100
         # SM110 fails with this error: https://github.com/vllm-project/vllm/issues/29317
-        # SM120 needs this fix: https://github.com/triton-lang/triton/pull/8498
-        and (9, 0) <= current_platform.get_device_capability() < (11, 0)
+        # SM120/SM12x needs this fix: https://github.com/triton-lang/triton/pull/8498
+        # experimentally enabled for SM12x rather than hard-excluding.
+        and (((9, 0) <= cap < (11, 0)) or is_sm12x)
     )
     if envs.VLLM_MXFP4_USE_MARLIN or not triton_kernels_supported:
         logger.info_once("[get_mxfp4_backend_with_lora] Using Marlin backend")
@@ -110,6 +114,9 @@ def get_mxfp4_backend(with_lora_support: bool) -> Mxfp4Backend:
         return get_mxfp4_backend_with_lora()
 
     if current_platform.is_cuda():
+        cap = current_platform.get_device_capability()
+        # DGX Spark / GB10 reports SM12.x (e.g. (12, 1)).
+        is_sm12x = cap[0] == 12 and envs.VLLM_ALLOW_SM12X_MXFP4
         if (
             current_platform.is_device_capability(90)
             and has_flashinfer()
@@ -118,19 +125,19 @@ def get_mxfp4_backend(with_lora_support: bool) -> Mxfp4Backend:
             logger.info_once("Using FlashInfer MXFP4 BF16 backend for SM90")
             return Mxfp4Backend.SM90_FI_MXFP4_BF16
         elif (
-            current_platform.is_device_capability_family(100)
+            (current_platform.is_device_capability_family(100) or is_sm12x)
             and has_flashinfer()
             and envs.VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS
         ):
-            logger.info_once("Using FlashInfer MXFP4 MXFP8 CUTLASS backend for SM100")
+            logger.info_once("Using FlashInfer MXFP4 MXFP8 CUTLASS backend for SM100/SM12X")
             return Mxfp4Backend.SM100_FI_MXFP4_MXFP8_CUTLASS
         elif (
-            current_platform.is_device_capability_family(100)
+            (current_platform.is_device_capability_family(100) or is_sm12x)
             and has_flashinfer()
             and envs.VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8
         ):
             return Mxfp4Backend.SM100_FI_MXFP4_MXFP8_TRTLLM
-        elif current_platform.is_device_capability_family(100) and has_flashinfer():
+        elif (current_platform.is_device_capability_family(100) or is_sm12x) and has_flashinfer():
             logger.info_once(
                 "Using FlashInfer MXFP4 BF16 backend for SM100, "
                 "For faster performance on SM100, consider setting "
@@ -141,6 +148,7 @@ def get_mxfp4_backend(with_lora_support: bool) -> Mxfp4Backend:
         elif (
             current_platform.is_device_capability_family(100)
             or current_platform.is_device_capability(90)
+            or is_sm12x
         ) and not has_flashinfer():
             logger.warning_once(
                 "MXFP4 MoE is enabled on Hopper/Blackwell but FlashInfer "

I had saw that there are the vllm builds with cu130 and nightly torch. I was going to continue with it but never got around to it.

I had started to upstream some of my changes to you. I saw your feedback and it’s on my todo list to answer! Maybe I’ll pick it up again later.

I’ve had nightly torch in my internal builds for a while, but it didn’t give any performance boost so far, however resulted in quite a few crashes, so I decided not to push it to the repository yet. I’ll keep monitoring it, maybe introduce a switch, similar to pre-release flashinfer and transformers.

As for VLLM patching, it may be a bit more involved than re-enabling flashinfer path.

Flashinfer MXFP4 path was enabled by default in the beginning, but it was very buggy, so they rolled it back to Marlin kernel a month ago or so. One hope is that whatever they did at sglang together with their patches to triton, may do the trick. Maybe we should combine our efforts.

My assumption was that Marlin brought stability, but Cutter is the path forward for the speed (as the other Blackwell chips seem to be running from there).

The patch above gets Cutter to at least try to load a kernel, but I didn’t finish the evaluation of which kernel should be loaded. I was going to make patches to pytorch next but stopped.

OK, I managed to make it run with modified Triton and use Triton backend instead of Marlin, but it gives the same performance, so doesn’t really make any sense to enable.

Naive approach of just enabling Flashinfer like you’ve done above failed, predictably.