Run vLLM on Thor from the vLLM Repository

Steps to run vLLM on Thor:

  1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Create environment
sudo apt install python3-dev python3.12-dev
uv venv .vllm --python 3.12
source .vllm/bin/activate
  3. Install PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
  4. Install flashinfer, triton, and the vLLM wheel
uv pip install xgrammar triton flashinfer-python --prerelease=allow
uv pip install https://github.com/vllm-project/vllm/releases/download/v0.14.0/vllm-0.14.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl
  5. Export variables
export TORCH_CUDA_ARCH_LIST=11.0a # Thor
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  6. Clean memory
sudo sysctl -w vm.drop_caches=3
  7. Run gpt-oss-120b
mkdir -p tiktoken_encodings
wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
export TIKTOKEN_ENCODINGS_BASE=${PWD}/tiktoken_encodings
# MXFP8 activation for MoE: faster, but higher risk of accuracy loss.
export VLLM_USE_FLASHINFER_MXFP4_MOE=1
uv run vllm serve "openai/gpt-oss-120b" --async-scheduling --port 8000 --host 0.0.0.0 --trust_remote_code --swap-space 16 --max-model-len 32000 --tensor-parallel-size 1 --max-num-seqs 1024 --gpu-memory-utilization 0.7
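
Once the server is up, a quick way to sanity-check it is a request to the OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch (the model name and port match the serve command above; the network call is left commented out so the payload itself can be inspected without a running server):

```python
import json

# Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

body = json.dumps(payload)
print(body)

# To actually send it (requires the server started above):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```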

uv pip install xgrammar triton flashinfer-python

This instruction installed flashinfer-python 0.3.1.post1, which is not compatible with vLLM 0.11. The build process ends with the errors below. Any suggestions?

Using Python 3.12.3 environment at: /opt/vllm_test/.vllm
Resolved 11 packages in 575ms
Uninstalled 1 package in 4ms
Installed 6 packages in 73ms

  • build==1.3.0
  • cmake==4.1.2
  • pyproject-hooks==1.2.0
  • setuptools==70.2.0
  • setuptools==79.0.1
  • setuptools-scm==9.2.2
  • wheel==0.45.1
    Using Python 3.12.3 environment at: /opt/vllm_test/.vllm
    × No solution found when resolving dependencies:
    ╰─▶ Because there is no version of apache-tvm-ffi==0.1.0b15 and flashinfer-python==0.4.1 depends on apache-tvm-ffi==0.1.0b15, we can conclude that flashinfer-python==0.4.1 cannot
    be used.
    And because vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 depends on flashinfer-python==0.4.1, we can conclude that vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 cannot
    be used.
    And because only vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 is available and you require vllm, we can conclude that your requirements are unsatisfiable.

    hint: apache-tvm-ffi was requested with a pre-release marker (e.g., apache-tvm-ffi==0.1.0b15), but pre-releases weren’t enabled (try: --prerelease=allow)

uv pip install xgrammar triton flashinfer-python --prerelease=allow

Is it necessary to run sudo sysctl -w vm.drop_caches=3 before each new model load?

@changtimwu I’ve been using sleep mode and just keeping vLLM running, and haven’t had to drop caches, but I’m only running vLLM locally on my Thor with very little traffic and with --gpu-memory-utilization 0.25.
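
For anyone who wants to try the sleep-mode approach: the server has to be started with sleep mode enabled, and the sleep/wake endpoints are only exposed in dev mode. A command sketch (flag, env-var, and endpoint names as found in recent vLLM releases; check your version's docs before relying on them):

```shell
# Serve with sleep mode available; dev mode exposes the /sleep and /wake_up endpoints.
VLLM_SERVER_DEV_MODE=1 vllm serve openai/gpt-oss-120b --enable-sleep-mode --port 8000

# Release most GPU memory while idle, then restore it before the next request:
curl -X POST "http://localhost:8000/sleep?level=1"
curl -X POST "http://localhost:8000/wake_up"
```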


It appears the CUDA Toolkit is required: CUDA Toolkit 13.0 Update 2 Downloads | NVIDIA Developer

On a freshly installed Thor (vanilla Jetson Linux 38.2), I had to install the following to make the vLLM build succeed:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
sudo apt-get install python3-dev

I’ve been trying for over a month now to get this working “properly”. I was able to compile vLLM as well as xformers, flash-attention, and flashinfer, but almost none of the models run stably or with the same attention backend. For example, gpt-oss does not work with any VLLM_ATTENTION_BACKEND other than triton. Is that normal?

The same is true for any model that requires FlashAttention 3 sinks.

It is normal. It depends on the level of support in each framework: whether it has the model structure, the kernels, and so on.
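
When a model only works with one backend, it can be pinned explicitly via the environment. A sketch (the exact backend identifiers vary across vLLM versions, so treat these values as examples and check your build's documentation):

```shell
# Force the Triton attention backend (the one reported above to work for gpt-oss).
export VLLM_ATTENTION_BACKEND=TRITON_ATTN
# Other commonly listed values include FLASH_ATTN and FLASHINFER,
# depending on what your build supports on Thor.
```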

After completing the installation, I ran Qwen2.5-VL and EngineCore failed to start.

(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843]   File "/home/ls/Env/vllm/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843]     raise PTXASError(error)
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843] triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843] `ptxas` stderr:
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843] ptxas fatal   : Value 'sm_110a' is not defined for option 'gpu-name'
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843] 
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843] Repro command: /home/ls/Env/vllm/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_110a /tmp/tmpvepsp8ne.ptx -o /tmp/tmpvepsp8ne.ptx.o

Before running, make sure you have exported:

export TORCH_CUDA_ARCH_LIST=11.0a # Thor
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
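
The root cause of the `sm_110a` error is that Triton's bundled ptxas does not know Thor's architecture, while the CUDA 13 toolkit's ptxas does. A quick check (paths assume the default CUDA install location used throughout this thread):

```shell
# The CUDA 13 ptxas should report release 13.x and accept sm_110a;
# Triton's bundled copy (under site-packages/triton/backends/nvidia/bin) may not.
/usr/local/cuda/bin/ptxas --version

# Should point at the CUDA 13 ptxas above, not Triton's bundled one:
echo "$TRITON_PTXAS_PATH"
```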

I missed this step; the issue has been resolved. THX

Thanks for this description. As Qwen3-VL-32B-Instruct also requires vLLM 0.11 (and thus cannot be run using the NGC vllm-25.10-py3 container), I tried running this model using this vLLM installation.

I am using this command:

uv run vllm serve "Qwen/Qwen3-VL-32B-Instruct" --async-scheduling --port 8000 --host 0.0.0.0 --trust-remote-code --swap-space 16 --max-model-len 4096 --tensor-parallel-size 1 --max-num-seqs 2 --gpu-memory-utilization 0.95 --kv-cache-dtype fp8 --enable-prefix-caching --dtype half --enable-chunked-prefill

The model loads successfully and some test prompts gave good results, but the throughput on Thor with this model is only around 2 tokens/s. Do you have any suggestions for increasing performance?

It is because some kernels are needed to accelerate MoE.

The model I used is not MoE. I guess you mean Qwen/Qwen3-VL-30B-A3B-Instruct (which is MoE with only 3B active parameters)?
That one does run faster (~27 tokens/s), but the quality of its results is significantly worse than the 32B (dense) model.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.