Has Anyone Tried Getting ~1.5×+ Speedup on DGX/Spark/Grace/Blackwell using this?

Hi everyone,

I recently found this GitHub repository with a configuration aimed at optimizing NVIDIA DGX, Spark, Grace, and Blackwell setups to improve performance:

🔗 https://github.com/GuigsEvt/dgx_spark_config

The original Reddit thread discussing these optimizations mentions roughly a ~1.5× speedup in some compute workloads after rebuilding the stack to take full advantage of Blackwell GPU features and Grace CPU support.

Here’s the Reddit thread with the technical breakdown on those claimed performance improvements:
🔗 https://www.reddit.com/r/LocalLLaMA/comments/1p7ddv3/optimising_nvidias_dgx_spark_grace_blackwell_15/?tl=de

Has anyone here already tried applying this config and seen the reported acceleration in practice? I’d love to hear about your experience — what worked, what didn’t, and any practical tips you can share.

Looking forward to your insights!

1 Like

I tried (you can see my replies in that Reddit thread) and didn’t get any performance improvements compared to a normal build. Also, his torchaudio was broken, and I was only able to run it after removing it.

Having said that, this is all a moving target, so I won’t be surprised if there are some improvements now.

One word of caution though - if you build Triton from main, it breaks compatibility with vLLM, and some models won’t work due to the missing matmul_ogs module. There is an open ticket on the vLLM side, but no movement, probably because those breaking changes won’t make it into the next Triton release (3.6.0).
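If you want to check up front whether the kernels module vLLM expects is importable before launching, here is a stdlib-only sketch. The module path `triton_kernels.matmul_ogs` is inferred from this thread; adjust it if your layout differs.

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` can be imported, without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. triton_kernels) is missing.
        return False

# Before launching vLLM, verify the kernels module it expects is present.
# (Module path assumed from this thread; adjust if your build differs.)
if not module_available("triton_kernels.matmul_ogs"):
    print("triton_kernels.matmul_ogs missing: MoE models (e.g. gpt-oss) may fail")
```

Running this in the environment you plan to serve from catches the breakage before a model load fails halfway through startup.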

I’m already spending way too much time on this, so it would be great if someone else could test it. A bit more participation from NVIDIA here would also be welcome.

@johnny_nv - as someone who is closely working with pytorch/vllm/triton folks, can you have a look?

2 Likes

I’ll take a stab at it later today

You can work around this and still run Triton from main by not installing the triton_kernels package from your build, letting vLLM fall back to the Triton 3.5.0 “third-party” pinned kernels it already bundles (which include matmul_ogs).

Then it complains about kernel incompatibility and doesn’t load them anyway.
To be fair, I haven’t seen any performance difference with or without triton_kernels installed, though I haven’t tested extensively. The current Triton main branch doesn’t seem to give any advantage either; I’ve just tried it (without the kernels package).

Hello Magnus81,

I tested the configuration from the GitHub repo and compared it against the PyPI wheel from the repo’s requirements.txt (torch==2.9.0+cu130). From this, I can confirm the 1.5x speedup.

Along with this, I tested the latest NGC PyTorch container (25.11-py3) against the same PyPI wheel and can report virtually the same 1.5x speedup.

The base PyPI wheel isn’t built with support for SM 12.1 (Blackwell), which is what the optimized build seeks to fix. However, NVIDIA has already added Blackwell support to the NGC PyTorch container.

In the case of a standalone virtual environment without Docker, this configuration will achieve the 1.5x speedup. That said, we recommend using the NGC image, as a simple docker pull gets you the same level of performance with far less hassle.
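For reference, pulling and running the NGC image is a one-liner each. The tag is the one discussed in this thread; the flags are the usual ones for GPU containers.

```shell
# Pull the NGC PyTorch image mentioned above (~19 GB uncompressed)
docker pull nvcr.io/nvidia/pytorch:25.11-py3

# Run it with GPU access; --ipc=host avoids the shared-memory limits
# that PyTorch data loaders often hit inside containers
docker run --gpus all --ipc=host -it --rm nvcr.io/nvidia/pytorch:25.11-py3
```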

Let me know if you have any further questions or concerns!

2 Likes

Would you recommend using it as a base for a custom vllm build as compared to nvidia/cuda:13.1.0-devel-ubuntu24.04?

1 Like

Yes. The NGC PyTorch image already includes most foundational Python libraries with optimized dependencies. With the CUDA image, you will need to install and manage the Python dependencies yourself. It’s always good to start with the “closest matching” base image :)

2 Likes

I’ll give it a try. It is quite big though, 19GB uncompressed.

In my experience you end up having to pull in a bunch of stuff anyway, since the JIT compile step in vLLM requires basically everything in the CUDA image (sorry for the super technical terminology lol)

Not if you don’t pull unnecessary dependencies. I think I tried this once and ended up with a 40GB container. I’ll build it the “dumb” way first to see if I can get any real improvement in vLLM, and if so, try to cherry-pick stuff for the final image.

1 Like

Well, when I build vLLM on top of the PyTorch NGC image (25.11-py3), it fails to load gpt-oss-120b with this error:

(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_config_module.py", line 356, in __getattr__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     config = self._config[name]
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]              ~~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] KeyError: 'assume_32bit_indexing'
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     super().__init__(
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 109, in __init__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 240, in _initialize_kv_caches
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 121, in decorate_context
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 328, in determine_available_memory
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     self.model_runner.profile_run()
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4544, in profile_run
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]                                         ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 121, in decorate_context
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4268, in _dummy_run
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     outputs = self.model(
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]               ^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 220, in __call__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1777, in _wrapped_call_impl
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1788, in _call_impl
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 722, in forward
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return self.model(input_ids, positions, intermediate_tensors, inputs_embeds)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 504, in __call__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     with (
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_config_module.py", line 666, in __enter__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     prior[key] = config.__getattr__(key)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]                  ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_config_module.py", line 388, in __getattr__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     raise AttributeError(f"{self.__name__}.{name} does not exist") from e
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] AttributeError: torch._inductor.config.assume_32bit_indexing does not exist
(The process then dies re-raising the same traceback.)

This doesn’t happen with PyTorch installed from the cu130 wheels.

Hello cyuen1,

I’m relatively new to PyTorch. Did I understand this correctly that the acceleration is already built into the PyTorch container (25.11-py3), or do I still need to do something special?

I understand your post to mean that the acceleration is essentially already there in the container, but not when running without Docker—did I get that right? I interpreted the GitHub posts to mean that something would also need to be done on the base machine for it to work.

As mentioned, I’m not (yet) very experienced with PyTorch.

Interestingly enough, I’ve observed a similar boost in performance when using cu129 wheels of PyTorch. For example, when running ComfyUI with cu129 pytorch (like in the playbook), it performed roughly 1.5x faster than cu130 pytorch wheels. This repo / new pytorch image seems to bring it back to cu129 levels. Any ideas why that would happen, given that cu129 wheels won’t have been compiled with sm121 support?

Yes, you can use the PyTorch Docker container without extra effort. If you want the same on the host OS, you would need to install PyTorch from the repo referenced in the original post.

1 Like

Yes this is correct. You may achieve the same acceleration by just pulling and running the official NGC PyTorch Docker container. No other steps required.

OK, thanks a lot!

diff --git a/vllm/compilation/decorators.py b/vllm/compilation/decorators.py
index 40bde97ac..3e30e3447 100644
--- a/vllm/compilation/decorators.py
+++ b/vllm/compilation/decorators.py
@@ -498,7 +498,8 @@ def _support_torch_compile(
         # Prepare inductor config patches
         # assume_32bit_indexing is only available in torch 2.10.0.dev+
         inductor_config_patches = {}
-        if is_torch_equal_or_newer("2.10.0.dev"):
+        ic = torch._inductor.config
+        if "assume_32bit_indexing" in getattr(ic, "_config", {}):
             inductor_config_patches["assume_32bit_indexing"] = True

         with (

This patch will get you past the indexing AttributeError.
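For anyone reading the diff out of context: the idea is to feature-detect the config key instead of gating on the torch version string. A minimal stdlib sketch of the same check, using a stand-in object (the `_config` dict is a private detail of torch’s config module, as probed in the diff above):

```python
def config_has_key(config_module, key: str) -> bool:
    # torch's _config_module objects raise AttributeError from __getattr__
    # for unknown keys, so hasattr() isn't a clean probe on older torch;
    # checking the (private) _config dict directly is what the patch does.
    return key in getattr(config_module, "_config", {})

class FakeInductorConfig:  # stand-in for torch._inductor.config
    _config = {"assume_32bit_indexing": True}

# Mirror the patched vLLM logic: only request the knob when it exists.
inductor_config_patches = {}
if config_has_key(FakeInductorConfig(), "assume_32bit_indexing"):
    inductor_config_patches["assume_32bit_indexing"] = True
```

Feature detection like this survives both older stable wheels and nightly builds where the knob has been renamed or removed, at the cost of depending on a private attribute.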

1 Like

Thanks, I’ll try it later, but these PyTorch improvements don’t seem to make any difference when it comes to vLLM workloads.