Run vLLM on Thor from the vLLM Repository

Run vLLM on Thor

  1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Create environment
uv venv .vllm --python 3.12
source .vllm/bin/activate
  3. Install PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
  4. Install flashinfer and triton, then build vLLM from source
uv pip install xgrammar triton flashinfer-python --prerelease=allow
git clone --recursive https://github.com/vllm-project/vllm.git
cd vllm
python3 use_existing_torch.py
uv pip install -r requirements/build.txt
uv pip install --no-build-isolation -e .
  5. Export variables
export TORCH_CUDA_ARCH_LIST=11.0a # Thor
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  6. Clean memory
sudo sysctl -w vm.drop_caches=3
  7. Run gpt-oss-120b (a quick health check follows the serve command)
mkdir -p tiktoken_encodings
wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
export TIKTOKEN_ENCODINGS_BASE=${PWD}/tiktoken_encodings
# mxfp8 activation for MoE: faster, but higher risk of accuracy loss.
export VLLM_USE_FLASHINFER_MXFP4_MOE=1 
uv run vllm serve "openai/gpt-oss-120b" --async-scheduling --port 8000 --host 0.0.0.0 --trust_remote_code --swap-space 16 --max-model-len 32000 --tensor-parallel-size 1 --max-num-seqs 1024 --gpu-memory-utilization 0.7
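
Once the server is up, a quick health check from another shell (a minimal sketch, assuming the default port 8000 used above):

curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'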

uv pip install xgrammar triton flashinfer-python

This instruction installed flashinfer-python 0.3.1.post1, which is not compatible with vLLM 0.11. The build process ends with the errors below. Any suggestions, please?

Using Python 3.12.3 environment at: /opt/vllm_test/.vllm
Resolved 11 packages in 575ms
Uninstalled 1 package in 4ms
Installed 6 packages in 73ms

  • build==1.3.0
  • cmake==4.1.2
  • pyproject-hooks==1.2.0
  • setuptools==70.2.0
  • setuptools==79.0.1
  • setuptools-scm==9.2.2
  • wheel==0.45.1
    Using Python 3.12.3 environment at: /opt/vllm_test/.vllm
    × No solution found when resolving dependencies:
    ╰─▶ Because there is no version of apache-tvm-ffi==0.1.0b15 and flashinfer-python==0.4.1 depends on apache-tvm-ffi==0.1.0b15, we can conclude that flashinfer-python==0.4.1 cannot
    be used.
    And because vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 depends on flashinfer-python==0.4.1, we can conclude that vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 cannot
    be used.
    And because only vllm==0.11.1rc3.dev52+g356781693.d20251024.cu130 is available and you require vllm, we can conclude that your requirements are unsatisfiable.

    hint: apache-tvm-ffi was requested with a pre-release marker (e.g., apache-tvm-ffi==0.1.0b15), but pre-releases weren’t enabled (try: --prerelease=allow)

uv pip install xgrammar triton flashinfer-python --prerelease=allow
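
After re-running with --prerelease=allow, a quick way to confirm which builds were actually resolved (a sketch; the package installs as flashinfer-python and, as far as I know, imports as flashinfer):

uv pip show flashinfer-python
uv pip show vllm
python -c "import flashinfer; print(flashinfer.__version__)"  # assumes the module exposes __version__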

Is it necessary to run sudo sysctl -w vm.drop_caches=3 before each new model load?

@changtimwu I’ve been using sleep mode and just kept vLLM running, and I haven’t had to drop caches, but I’m only running vLLM locally on my Thor with very little traffic and with --gpu-memory-utilization 0.25.
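
For context, drop_caches=3 asks the kernel to drop the page cache plus reclaimable dentries and inodes, so it only frees cache memory and is mainly useful when a large model load needs a lot of free RAM. A minimal sketch of the sequence (syncing first so dirty pages are written back):

sync
sudo sysctl -w vm.drop_caches=3
free -h   # check how much memory was reclaimed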


It appears the CUDA Toolkit is required: CUDA Toolkit 13.0 Update 2 Downloads | NVIDIA Developer

For a freshly installed Thor (vanilla Jetson Linux 38.2), I had to install the following to make the vLLM build succeed:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
sudo apt-get install python3-dev
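
A quick way to confirm the toolkit is visible afterwards (assuming the default install prefix /usr/local/cuda):

export PATH=/usr/local/cuda/bin:$PATH
nvcc --version   # should report CUDA 13.0
which ptxas      # should resolve to /usr/local/cuda/bin/ptxas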

I’ve been trying for over a month to get this working “properly”. I was able to compile it, as well as xformers, flash-attention, and flashinfer, but almost none of the models run stably or with the same attention backend. For example, gpt-oss does not work with any VLLM_ATTENTION_BACKEND other than triton. Is that normal?

The same is true for any model that needs FlashAttention 3 sinks.

It is normal. It depends on what each framework supports: whether it has the model structure, the kernels, and so on.
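
For reference, a minimal sketch of pinning the attention backend per run; the exact backend identifiers differ between vLLM versions, so the value below is a placeholder and should be checked against what your build accepts:

export VLLM_ATTENTION_BACKEND=TRITON_ATTN   # placeholder name; check your vLLM version
vllm serve "openai/gpt-oss-120b" --max-model-len 32000 --gpu-memory-utilization 0.7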

After completing the installation, I ran Qwen2.5-VL and EngineCore failed to start.

(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843]   File "/home/ls/Env/vllm/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843]     raise PTXASError(error)
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843] triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843] `ptxas` stderr:
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843] ptxas fatal   : Value 'sm_110a' is not defined for option 'gpu-name'
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843] 
(EngineCore_DP0 pid=1939580) ERROR 11-03 18:27:57 [core.py:843] Repro command: /home/ls/Env/vllm/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_110a /tmp/tmpvepsp8ne.ptx -o /tmp/tmpvepsp8ne.ptx.o

Before running, export:

export TORCH_CUDA_ARCH_LIST=11.0a # Thor
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
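
To confirm the right ptxas is being picked up (sm_110a needs the CUDA 13 ptxas rather than the copy bundled with Triton, which is what the PTXAS error above points to):

which ptxas               # should be /usr/local/cuda/bin/ptxas
ptxas --version           # should report CUDA 13.x
echo $TRITON_PTXAS_PATH   # should point at the same binary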

I missed this step; the issue has been resolved. THX