Hi everyone,
I got TensorRT Edge-LLM running on Jetson Thor with the experimental OpenAI-compatible server and Qwen/Qwen3.5-4B. Sharing the steps that worked for me in case it helps others on Thor.
Clone the repository:
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive
Build the C++ runtime and Python bindings:
cd ~/TensorRT-Edge-LLM
mkdir -p build
cd build
cmake .. \
  -DTRT_PACKAGE_DIR=/usr \
  -DCUDA_CTK_VERSION=13.0 \
  -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
  -DEMBEDDED_TARGET=jetson-thor \
  -DENABLE_CUTE_DSL=ALL \
  -DBUILD_PYTHON_BINDINGS=ON
make -j$(nproc)
cd ..
Set PYTHONPATH for the high-level API and server (from repo root):
export PYTHONPATH="$PWD:$PWD/experimental${PYTHONPATH:+:$PYTHONPATH}"
I used uv with Python 3.12:
uv venv .tensorrt --python 3.12
source .tensorrt/bin/activate
Then install dependencies:
uv pip install -r requirements.txt
uv pip install pybind11 fastapi uvicorn openai
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
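Before starting the server, it can be worth confirming that the packages installed above, and the `experimental` package exposed through PYTHONPATH, actually resolve from the active environment. Here is a minimal sketch using only the standard library. `check_modules` is a hypothetical helper name of mine, and the module list just mirrors the install and PYTHONPATH steps above:

```python
import importlib.util

# Top-level modules taken from the install and PYTHONPATH steps above.
REQUIRED = ["experimental", "torch", "fastapi", "uvicorn", "openai", "pybind11"]


def check_modules(names):
    """Return {module_name: True/False} for whether each module can be located."""
    found = {}
    for name in names:
        try:
            found[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialised packages.
            found[name] = False
    return found


# Print a one-line status per module.
for name, ok in check_modules(REQUIRED).items():
    print(f"{name:<14}{'ok' if ok else 'MISSING'}")
```

If `experimental` shows as MISSING, the PYTHONPATH export was probably run from somewhere other than the repo root or in a different shell.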
Run the server (from the repo root):
python -m experimental.server \
  --model Qwen/Qwen3.5-4B \
  --port 8000
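The server may take a while before it starts accepting connections, so a script that launches the client right away can fail with a connection error. A small sketch that waits for the port to open, using only the standard library. `wait_for_port` and its default timeout values are my own assumptions, not part of the repo. An open port only shows that the HTTP server is listening, not necessarily that the model has finished loading:

```python
import socket
import time


def wait_for_port(host="localhost", port=8000, timeout=300.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, calling `wait_for_port()` at the top of client.py and exiting if it returns False avoids a confusing traceback when the server is not up yet.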
Streaming client (client.py):
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

stream = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Be concise and accurate."
        },
        {
            "role": "user",
            "content": "What can you tell me about quantum computing?"
        }
    ],
    stream=True,
    max_tokens=2000,
    extra_body={
        "chat_template_kwargs": {
            "enable_thinking": False
        }
    }
)

# Print each token as it arrives
for chunk in stream:
    content = chunk.choices[0].delta.content or ""
    print(content, end="", flush=True)
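If you also want the full reply as a string (for logging or post-processing), the print loop can be wrapped in a small helper. This is a sketch, not something from the repo: `collect_stream` is a hypothetical name. It also skips chunks with an empty `choices` list, which some OpenAI-compatible servers can emit. I have not checked whether this server does.

```python
def collect_stream(stream, echo=True):
    """Print deltas as they arrive (optional) and return the concatenated text."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            # Skip chunks that carry no choices (e.g. usage-only chunks).
            continue
        # delta.content can be None on role-only or final chunks.
        piece = chunk.choices[0].delta.content or ""
        if echo:
            print(piece, end="", flush=True)
        parts.append(piece)
    if echo:
        print()  # finish the line after the last token
    return "".join(parts)
```

Since the stream can only be iterated once, use `text = collect_stream(stream)` in place of the `for` loop in client.py, not after it.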
For Qwen/Qwen3.5-4B on Jetson Thor, batch size 1 is currently the most stable configuration, especially with CuTe DSL kernels enabled.
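If several threads in one client process share the server, one way to stay within batch size 1 is to let only one request be in flight at a time. This is a sketch under that assumption: I am treating batch size 1 as "one request at a time", and `OneAtATime` is a hypothetical wrapper of mine, not part of the repo or the openai package:

```python
import threading


class OneAtATime:
    """Wrap a callable so that only one call runs at any moment."""

    def __init__(self, fn):
        self._fn = fn
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # Other threads block here until the current call returns.
        with self._lock:
            return self._fn(*args, **kwargs)
```

For example, `create = OneAtATime(client.chat.completions.create)` gives a drop-in callable for non-streaming requests. With `stream=True` the lock is released as soon as the stream object is returned, so the code that consumes the stream would need to hold the lock as well.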