These instructions installed flashinfer-python 0.3.1.post1, which is not compatible with vLLM 0.11, and the build process ends with the errors below. Any suggestions, please?
Using Python 3.12.3 environment at: /opt/vllm_test/.vllm
Resolved 11 packages in 575ms
Uninstalled 1 package in 4ms
Installed 6 packages in 73ms
 + build==1.3.0
 + cmake==4.1.2
 + pyproject-hooks==1.2.0
 - setuptools==70.2.0
 + setuptools==79.0.1
 + setuptools-scm==9.2.2
 + wheel==0.45.1
Using Python 3.12.3 environment at: /opt/vllm_test/.vllm
× No solution found when resolving dependencies:
╰─▶ Because there is no version of apache-tvm-ffi==0.1.0b15 and flashinfer-python==0.4.1 depends on apache-tvm-ffi==0.1.0b15, we can conclude that flashinfer-python==0.4.1 cannot be used.
And because vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 depends on flashinfer-python==0.4.1, we can conclude that vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 cannot be used.
And because only vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 is available and you require vllm, we can conclude that your requirements are unsatisfiable.
hint: apache-tvm-ffi was requested with a pre-release marker (e.g., apache-tvm-ffi==0.1.0b15), but pre-releases weren’t enabled (try: --prerelease=allow)
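The hint in the resolver output is the key part: apache-tvm-ffi==0.1.0b15 is a pre-release version, and uv excludes pre-releases from resolution unless they are explicitly allowed. A quick way to confirm why that pin gets skipped (using the `packaging` library, which pip/uv-based environments normally already have):

```python
from packaging.version import Version

# "0.1.0b15" contains a "b" (beta) segment, so PEP 440-aware resolvers
# classify it as a pre-release and skip it by default.
v = Version("0.1.0b15")
print(v.is_prerelease)  # True
```

Following the resolver's own suggestion, re-running the install with `--prerelease=allow` should let uv select that build.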
@changtimwu I’ve been using sleep mode and just kept vLLM running, and I haven’t had to drop_caches, but I’m only running vLLM locally on my Thor with very little traffic and with --gpu-memory-utilization 0.25.
I’ve been trying for over a month now to get this working “properly”. I was able to compile it, as well as xformers, flash-attention, and flashinfer, but almost none of the models run stably, or with the same attention backend. E.g., gpt-oss does not work with any VLLM_ATTENTION_BACKEND other than triton. Is that normal?
The same goes for any model that requires flash-attention 3 sinks.
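For completeness, this is how I pin the backend before launching. The exact backend names vary between vLLM versions (TRITON_ATTN here is an assumption; your build may expect a different spelling), so check what your version accepts:

```python
import os

# Force the Triton attention backend via environment variable; vLLM reads
# VLLM_ATTENTION_BACKEND at startup. The value "TRITON_ATTN" is an assumption
# for recent builds — older versions used different backend names.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN"
print(os.environ["VLLM_ATTENTION_BACKEND"])
```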
Thanks for this description. Since Qwen3-VL-32b-instruct also requires vLLM 0.11 (and thus cannot be run using the NGC vllm-25.10-py3 container), I tried running this model with this vLLM installation.
The model loads successfully and some test prompts gave good results, but the throughput on Thor with this model is only around 2 tokens/s. Do you have any suggestions for increasing performance?
The model I used is not MoE. I guess you mean Qwen/Qwen3-VL-30B-A3B-Instruct (which is MoE with only 3B active parameters)?
That one does run faster (~27 tokens/s), of course, but the quality of its results is significantly worse than with the 32B (non-MoE) model.