Has Anyone Tried Getting ~1.5×+ Speedup on DGX/Spark/Grace/Blackwell using this?

Hi everyone,

I recently found this GitHub repository with a configuration aimed at optimizing NVIDIA DGX, Spark, Grace, and Blackwell setups to improve performance:

🔗 https://github.com/GuigsEvt/dgx_spark_config

The original Reddit thread discussing these optimizations mentions roughly a ~1.5× speedup in some compute workloads after rebuilding the stack to take full advantage of Blackwell GPU features and Grace CPU support.

Here’s the Reddit thread with the technical breakdown on those claimed performance improvements:
🔗 https://www.reddit.com/r/LocalLLaMA/comments/1p7ddv3/optimising_nvidias_dgx_spark_grace_blackwell_15/?tl=de

Has anyone here already tried applying this config and seen the reported acceleration in practice? I’d love to hear about your experience — what worked, what didn’t, and any practical tips you can share.

Looking forward to your insights!

1 Like

I tried (you can see my replies in that Reddit thread) and didn’t get any performance improvements compared to a normal build. Also, his torchaudio was broken, and I was only able to run it after removing it.

Having said that, this is all a moving target, so I won’t be surprised if there are some improvements now.

One word of caution though - if you build Triton from main, it breaks compatibility with vLLM, and some models won’t work due to the missing matmul_ogs module. There is an open ticket on the vLLM side, but no movement, probably because those breaking changes won’t make it into the next Triton release (3.6.0).
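If you want to check up front whether the kernels module vLLM expects is importable before launching, here is a stdlib-only sketch. The module path `triton_kernels.matmul_ogs` is inferred from this thread; adjust it if your layout differs.

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` can be imported, without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. triton_kernels) is missing.
        return False

# Before launching vLLM, verify the kernels module it expects is present.
# (Module path assumed from this thread; adjust if your build differs.)
if not module_available("triton_kernels.matmul_ogs"):
    print("triton_kernels.matmul_ogs missing: MoE models (e.g. gpt-oss) may fail")
```

Running this in the environment you plan to serve from catches the breakage before a model load fails halfway through startup.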

I’m already spending way too much time on this, so it would be great if someone else could test it. A bit more participation from NVIDIA here would also be welcome.

@johnny_nv - as someone who is closely working with pytorch/vllm/triton folks, can you have a look?

2 Likes

I’ll take a stab at it later today

You can work around this and still run Triton from main by not installing the triton_kernels package from your build, letting vLLM fall back to the Triton 3.5.0 “third-party” pinned kernels it already bundles (which include matmul_ogs).

Then it complains about kernel incompatibility and doesn’t load them anyway.
To be fair, I haven’t seen any performance difference with or without triton_kernels installed, though I haven’t tested extensively. The current Triton main branch doesn’t seem to give any advantage either; I’ve just tried it (without the kernels package).

Hello Magnus81,

I tested the configuration from the GitHub repo and compared it against the PyPI wheel from the repo’s requirements.txt (torch==2.9.0+cu130). From this, I can confirm the 1.5x speedup.

Along with this, I tested the latest NGC PyTorch container (25.11-py3) against the same PyPI wheel and can report virtually the same 1.5x speedup.

The base PyPI wheel isn’t built with support for SM 12.1 (Blackwell), which is what the optimized build seeks to fix. However, NVIDIA has already added Blackwell support to the NGC PyTorch container.

In the case of a standalone virtual environment without Docker, this configuration will achieve the 1.5x speedup. That said, we recommend using the NGC image, as a simple docker pull gets you the same level of performance with far less hassle.
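For reference, pulling and running the NGC image is a one-liner each. The tag is the one discussed in this thread; the flags are the usual ones for GPU containers.

```shell
# Pull the NGC PyTorch image mentioned above (~19 GB uncompressed)
docker pull nvcr.io/nvidia/pytorch:25.11-py3

# Run it with GPU access; --ipc=host avoids the shared-memory limits
# that PyTorch data loaders often hit inside containers
docker run --gpus all --ipc=host -it --rm nvcr.io/nvidia/pytorch:25.11-py3
```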

Let me know if you have any further questions or concerns!

2 Likes

Would you recommend using it as a base for a custom vllm build as compared to nvidia/cuda:13.1.0-devel-ubuntu24.04?

1 Like

Yes. The NGC PyTorch image already includes most foundational Python libraries with optimized dependencies. With the CUDA image, you will need to install and manage the Python dependencies yourself. It’s always good to start with the “closest matching” base image :)

2 Likes

I’ll give it a try. It is quite big though, 19GB uncompressed.

In my experience you end up having to pull in a bunch of stuff anyway, since the JIT compile step in vLLM requires basically everything in the CUDA image (sorry for the super technical terminology lol)

Not if you don’t pull unnecessary dependencies. I think I tried this once and ended up with a 40GB container. I’ll build it the “dumb” way first to see if I can get any real improvement in vLLM, and if so, try to cherry-pick stuff for the final image.

1 Like

Well, when I build vLLM on top of the PyTorch NGC image (25.11-py3), it fails to load gpt-oss-120b with this error:

(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_config_module.py", line 356, in __getattr__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     config = self._config[name]
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]              ~~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] KeyError: 'assume_32bit_indexing'
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     super().__init__(
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 109, in __init__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 240, in _initialize_kv_caches
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 121, in decorate_context
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 328, in determine_available_memory
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     self.model_runner.profile_run()
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4544, in profile_run
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]                                         ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 121, in decorate_context
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4268, in _dummy_run
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     outputs = self.model(
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]               ^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 220, in __call__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1777, in _wrapped_call_impl
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1788, in _call_impl
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 722, in forward
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     return self.model(input_ids, positions, intermediate_tensors, inputs_embeds)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 504, in __call__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     with (
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_config_module.py", line 666, in __enter__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     prior[key] = config.__getattr__(key)
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]                  ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_config_module.py", line 388, in __getattr__
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866]     raise AttributeError(f"{self.__name__}.{name} does not exist") from e
(EngineCore_DP0 pid=13022) ERROR 12-18 05:30:40 [core.py:866] AttributeError: torch._inductor.config.assume_32bit_indexing does not exist
(The process then dies re-raising the same traceback.)

This doesn’t happen with PyTorch installed from the cu130 wheels.

Hello cyuen1,

I’m relatively new to PyTorch. Did I understand this correctly that the acceleration is already built into the PyTorch container (25.11-py3), or do I still need to do something special?

I understand your post to mean that the acceleration is essentially already there in the container, but not when running without Docker—did I get that right? I interpreted the GitHub posts to mean that something would also need to be done on the base machine for it to work.

As mentioned, I’m not (yet) very experienced with PyTorch.

Interestingly enough, I’ve observed a similar boost in performance when using cu129 wheels of PyTorch. For example, when running ComfyUI with cu129 pytorch (like in the playbook), it performed roughly 1.5x faster than cu130 pytorch wheels. This repo / new pytorch image seems to bring it back to cu129 levels. Any ideas why that would happen, given that cu129 wheels won’t have been compiled with sm121 support?

Yes, you can use the PyTorch Docker container without extra effort. If you want the same on the host OS, you would need to install PyTorch from the repo referenced in the original post.

1 Like

Yes this is correct. You may achieve the same acceleration by just pulling and running the official NGC PyTorch Docker container. No other steps required.

OK, thanks a lot!

diff --git a/vllm/compilation/decorators.py b/vllm/compilation/decorators.py
index 40bde97ac..3e30e3447 100644
--- a/vllm/compilation/decorators.py
+++ b/vllm/compilation/decorators.py
@@ -498,7 +498,8 @@ def _support_torch_compile(
         # Prepare inductor config patches
         # assume_32bit_indexing is only available in torch 2.10.0.dev+
         inductor_config_patches = {}
-        if is_torch_equal_or_newer("2.10.0.dev"):
+        ic = torch._inductor.config
+        if "assume_32bit_indexing" in getattr(ic, "_config", {}):
             inductor_config_patches["assume_32bit_indexing"] = True

         with (

This patch will get you past the indexing AttributeError.
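For anyone reading the diff out of context: the idea is to feature-detect the config key instead of gating on the torch version string. A minimal stdlib sketch of the same check, using a stand-in object (the `_config` dict is a private detail of torch’s config module, as probed in the diff above):

```python
def config_has_key(config_module, key: str) -> bool:
    # torch's _config_module objects raise AttributeError from __getattr__
    # for unknown keys, so hasattr() isn't a clean probe on older torch;
    # checking the (private) _config dict directly is what the patch does.
    return key in getattr(config_module, "_config", {})

class FakeInductorConfig:  # stand-in for torch._inductor.config
    _config = {"assume_32bit_indexing": True}

# Mirror the patched vLLM logic: only request the knob when it exists.
inductor_config_patches = {}
if config_has_key(FakeInductorConfig(), "assume_32bit_indexing"):
    inductor_config_patches["assume_32bit_indexing"] = True
```

Feature detection like this survives both older stable wheels and nightly builds where the knob has been renamed or removed, at the cost of depending on a private attribute.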

1 Like

Thanks, I’ll try it later, but these PyTorch improvements don’t seem to make any difference when it comes to vLLM workloads.