This instruction installed flashinfer-python 0.3.1.post1, which is not compatible with vllm 0.11. The build process ends with the errors below. Any suggestions, please?
Using Python 3.12.3 environment at: /opt/vllm_test/.vllm
Resolved 11 packages in 575ms
Uninstalled 1 package in 4ms
Installed 6 packages in 73ms
 + build==1.3.0
 + cmake==4.1.2
 + pyproject-hooks==1.2.0
 - setuptools==70.2.0
 + setuptools==79.0.1
 + setuptools-scm==9.2.2
 + wheel==0.45.1
Using Python 3.12.3 environment at: /opt/vllm_test/.vllm
× No solution found when resolving dependencies:
╰─▶ Because there is no version of apache-tvm-ffi==0.1.0b15 and flashinfer-python==0.4.1 depends on apache-tvm-ffi==0.1.0b15, we can conclude that flashinfer-python==0.4.1 cannot
be used.
And because vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 depends on flashinfer-python==0.4.1, we can conclude that vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 cannot
be used.
And because only vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 is available and you require vllm, we can conclude that your requirements are unsatisfiable.
hint: apache-tvm-ffi was requested with a pre-release marker (e.g., apache-tvm-ffi==0.1.0b15), but pre-releases weren’t enabled (try: --prerelease=allow)
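If I follow the hint, the workaround would be to allow pre-releases during resolution so that apache-tvm-ffi==0.1.0b15 can be picked up. Something like the sketch below, though the exact requirements to pass are a guess on my part:

```sh
# Allow pre-release versions so apache-tvm-ffi==0.1.0b15 can satisfy
# flashinfer-python==0.4.1's pin, then retry the vllm install.
uv pip install --prerelease=allow "apache-tvm-ffi==0.1.0b15" "flashinfer-python==0.4.1"
```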
@changtimwu I’ve been using sleep mode and just keeping vLLM running, and I haven’t had to drop_caches. But I’m only running vLLM locally on my Thor with very little traffic and with --gpu-memory-utilization 0.25.
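For reference, the launch I mean is roughly the following (the model name is a placeholder, and flag availability may vary between vLLM versions):

```sh
# Keep GPU memory usage low on the Thor and enable sleep mode so the engine
# can release resources while idle.
vllm serve <model> \
  --gpu-memory-utilization 0.25 \
  --enable-sleep-mode
```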
I’ve been trying for over a month now to get this working “properly”. I was able to compile it, as well as xformers, flash-attention and flashinfer, but almost none of the models run stably or with the same attention backend. For example, gpt-oss does not work with any VLLM_ATTENTION_BACKEND other than triton. Is that normal?
The same applies to any model that requires flash-attention 3 sinks.
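To be concrete, by forcing the backend I mean something like the following (the model is just an example, and the exact backend identifier varies between vLLM versions, so treat the value as illustrative):

```sh
# Force the Triton attention backend; as described above, gpt-oss fails for me
# with the other backends (FLASH_ATTN, FLASHINFER, XFORMERS, ...).
VLLM_ATTENTION_BACKEND=TRITON_ATTN vllm serve openai/gpt-oss-20b
```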