TensorRT Edge-LLM on Jetson Thor, OpenAI-compatible server, and streaming client

Hi everyone,

I got TensorRT Edge-LLM running on Jetson Thor with the experimental OpenAI-compatible server and Qwen/Qwen3.5-4B. Sharing the steps that worked for me in case it helps others on Thor.

Clone the repository:

git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive

Build the C++ runtime and Python bindings (the first cd assumes you cloned into your home directory; adjust the path if not):

cd ~/TensorRT-Edge-LLM

mkdir -p build
cd build
cmake .. \
  -DTRT_PACKAGE_DIR=/usr \
  -DCUDA_CTK_VERSION=13.0 \
  -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
  -DEMBEDDED_TARGET=jetson-thor \
  -DENABLE_CUTE_DSL=ALL \
  -DBUILD_PYTHON_BINDINGS=ON
make -j$(nproc)
cd ..

Set PYTHONPATH for the high-level API and server (from repo root):

export PYTHONPATH=$PWD:$PWD/experimental:$PYTHONPATH

I used uv with Python 3.12:

uv venv .tensorrt --python 3.12
source .tensorrt/bin/activate

Then install dependencies:

uv pip install -r requirements.txt
uv pip install pybind11 fastapi uvicorn openai
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
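
Before starting the server, it can save time to confirm that the packages from the install step are actually visible in the active venv. Below is a minimal sketch using only the standard library. The helper name missing_modules is my own and is not part of TensorRT Edge-LLM. The module list is taken from the pip commands above.

```python
import importlib.util


def missing_modules(names):
    """Return the entries of `names` that cannot be found on the current sys.path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialized modules;
            # treat those as missing for the purpose of this check.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Top-level module names matching the packages installed above.
    wanted = ["pybind11", "fastapi", "uvicorn", "openai", "torch"]
    gone = missing_modules(wanted)
    print("missing:", ", ".join(gone) if gone else "none")
```

If anything is reported as missing, the usual cause is that the venv was not activated in the current shell.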

Run the server (from the repo root):

python -m experimental.server \
  --model Qwen/Qwen3.5-4B \
  --port 8000
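
The server may take a while before it starts accepting requests, so a client started right away can fail with a connection error. Here is a small sketch that waits until the port accepts TCP connections. It only checks that something is listening on the port, not that the model has finished loading. The helper name wait_for_port and its default timings are my own choices.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the command above, something like wait_for_port("localhost", 8000, timeout=300.0) before creating the OpenAI client should be enough. The 300 second value is a guess and not a measured load time.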

Streaming client (client.py):

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

stream = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Be concise and accurate."
        },
        {
            "role": "user",
            "content": "What can you tell me about quantum computing?"
        }
    ],
    stream=True,
    max_tokens=2000,
    extra_body={
        "chat_template_kwargs": {
            "enable_thinking": False
        }
    }
)

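# Print each piece of text as it arrives
for chunk in stream:
    # Some chunks can arrive with an empty choices list; skip them
    if not chunk.choices:
        continue
    content = chunk.choices[0].delta.content or ""
    print(content, end="", flush=True)
print()  # end with a newline once the stream is finished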
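
If you want the full reply as a string plus a rough latency number instead of printing as you go, the loop can be wrapped in a helper. This is a sketch: collect_stream and fake_chunk are hypothetical names of mine. The timer starts when the helper begins reading the stream, not when the request was sent, so treat the number as approximate. fake_chunk only mimics the chunk.choices[0].delta.content shape used above so the helper can be tried without a running server.

```python
import time
from types import SimpleNamespace


def collect_stream(chunks):
    """Join streamed delta text; return (text, seconds_until_first_text or None)."""
    start = time.perf_counter()
    first_text_after = None
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices or no text.
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if not piece:
            continue
        if first_text_after is None:
            first_text_after = time.perf_counter() - start
        parts.append(piece)
    return "".join(parts), first_text_after


def fake_chunk(text):
    """Hypothetical stand-in for a streamed chunk, for offline testing only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [fake_chunk(None), fake_chunk("Hello"), fake_chunk(", "), fake_chunk("Thor")]
    text, first = collect_stream(demo)
    print(text)
```

With the real client, passing the stream object from client.chat.completions.create(..., stream=True) in place of the demo list should work the same way.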

For Qwen/Qwen3.5-4B on Jetson Thor, batch size 1 is currently the most stable configuration, especially with CuTe DSL kernels enabled.

Thanks for sharing this.