Run SGLang on Thor

Run SGLang on Thor & Spark

  1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Create environment
uv venv .sglang --python 3.12
source .sglang/bin/activate
sudo apt install python3-dev python3.12-dev
  3. Export variables
export TORCH_CUDA_ARCH_LIST=11.0a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  4. Install SGLang
uv pip install sgl-kernel --prerelease=allow --index-url https://docs.sglang.ai/whl/cu130/
uv pip install sglang --prerelease=allow
uv pip install --force-reinstall torch torchvision torchaudio triton --index-url https://download.pytorch.org/whl/cu130
uv pip install flashinfer-python
  5. Clean memory
sudo sysctl -w vm.drop_caches=3
  6. Run gpt-oss-120b NVFP4
mkdir -p ~/tiktoken_encodings
wget -O ~/tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O ~/tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --host 0.0.0.0 --port 30000 --reasoning-parser gpt-oss --tool-call-parser gpt-oss
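Once the server reports it is ready, you can sanity-check it over the OpenAI-compatible API. A minimal sketch using only the standard library — the host, port, and model name match the launch command above, and `build_chat_request` is just a hypothetical helper for illustration:

```python
import json
from urllib.request import Request, urlopen

def build_chat_request(base_url: str, model: str, prompt: str) -> Request:
    """Build an OpenAI-compatible /v1/chat/completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:30000", "openai/gpt-oss-120b", "Hello!")
# Uncomment once the server launched above is running:
# with urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```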

Hello, thanks @johnny_nv, it was useful.

Installation Steps for sgl-kernel on Jetson Thor 🚀

Install cmake

ARCH=$(uname -m)
wget https://cmake.org/files/v3.31/cmake-3.31.1-linux-${ARCH}.tar.gz
tar -xzf cmake-3.31.1-linux-${ARCH}.tar.gz
sudo mv cmake-3.31.1-linux-${ARCH} /opt/cmake
export PATH=/opt/cmake/bin:$PATH

Install the required system dependencies:

sudo apt-get install -y libnuma-dev

Set the essential environment variables for CUDA, Triton, and the build:

export TORCH_CUDA_ARCH_LIST=11.0a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CMAKE_BUILD_PARALLEL_LEVEL=1
export MAX_JOBS=4
export CPLUS_INCLUDE_PATH=/usr/local/cuda-13.0/targets/sbsa-linux/include/cccl

Navigate to the sgl-kernel source directory and use uv build to compile the library:

cd sgl-kernel

uv build --wheel --no-build-isolation . --out-dir "./wheels" \
  --config-settings=cmake.args="-G;Ninja" \
  --config-settings=cmake.define.TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}" \
  --config-settings=cmake.define.CUDA_VERSION="13.0" \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_BF16=1 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_FP8=1 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_FP4=1 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_FA3=0 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_SM90A=0 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_SM100A=1 \
  --config-settings=cmake.define.ENABLE_BELOW_SM90=OFF \
  --config-settings=cmake.define.CMAKE_POLICY_VERSION_MINIMUM=3.5

Run the sglang server:

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000  --mem-fraction 0.6  --attention-backend triton
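Besides the OpenAI-compatible routes, the running server also exposes SGLang's native /generate endpoint (visible in the log below as `POST /generate`). A rough sketch of calling it; `build_generate_request` is a hypothetical helper, and the sampling values are arbitrary:

```python
import json
from urllib.request import Request, urlopen

def build_generate_request(base_url: str, prompt: str) -> Request:
    """Build a POST request for SGLang's native /generate endpoint."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": 64, "temperature": 0.6},
    }
    return Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("http://localhost:30000", "The capital of France is")
# Uncomment once the server launched above is running:
# with urlopen(req) as resp:
#     print(json.loads(resp.read())["text"])
```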

Output

[2025-10-25 21:23:47] Using default HuggingFace chat template with detected content format: string
[2025-10-25 21:23:51] INFO trace.py:48: opentelemetry package is not installed, tracing disabled
[2025-10-25 21:23:51] INFO trace.py:48: opentelemetry package is not installed, tracing disabled
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-25 21:23:54] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-10-25 21:23:54] Init torch distributed ends. mem usage=0.00 GB
[2025-10-25 21:23:54] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-10-25 21:23:55] Load weight begin. avail mem=116.85 GB
[2025-10-25 21:23:56] Using model weights format ['*.safetensors']
Ignored error while writing commit hash to /home/jetson/vllm_models/hub/models--meta-llama--Llama-3.1-8B-Instruct/refs/main: [Errno 13] Permission denied: '/home/jetson/vllm_models/hub/models--meta-llama--Llama-3.1-8B-Instruct/refs/main'.
[2025-10-25 21:23:57] Ignored error while writing commit hash to /home/jetson/vllm_models/hub/models--meta-llama--Llama-3.1-8B-Instruct/refs/main: [Errno 13] Permission denied: '/home/jetson/vllm_models/hub/models--meta-llama--Llama-3.1-8B-Instruct/refs/main'.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.33it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:03<00:04,  2.20s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:07<00:02,  2.78s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  3.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.72s/it]

[2025-10-25 21:24:08] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=101.17 GB, mem usage=15.68 GB.
[2025-10-25 21:24:08] Using KV cache dtype: torch.bfloat16
[2025-10-25 21:24:10] KV Cache is allocated. #tokens: 445695, K size: 27.20 GB, V size: 27.20 GB
[2025-10-25 21:24:10] Memory pool end. avail mem=45.01 GB
[2025-10-25 21:24:10] Capture cuda graph begin. This can take up to several minutes. avail mem=44.83 GB
[2025-10-25 21:24:10] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=256 avail_mem=44.37 GB):   0%|                                              | 0/36 [00:00<?, ?it/s][2025-10-25 21:24:11] MOE_A2A_BACKEND is not initialized, using default backend
Capturing batches (bs=1 avail_mem=43.70 GB): 100%|███████████████████████████████████████| 36/36 [00:09<00:00,  3.75it/s]
[2025-10-25 21:24:20] Capture cuda graph end. Time elapsed: 10.15 s. mem usage=1.12 GB. avail mem=43.70 GB.
[2025-10-25 21:24:21] max_total_num_tokens=445695, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=131072, available_gpu_mem=43.61 GB
[2025-10-25 21:24:22] INFO:     Started server process [150349]
[2025-10-25 21:24:22] INFO:     Waiting for application startup.
[2025-10-25 21:24:22] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.9}
[2025-10-25 21:24:22] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.9}
[2025-10-25 21:24:22] INFO:     Application startup complete.
[2025-10-25 21:24:22] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-10-25 21:24:23] INFO:     127.0.0.1:54984 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-25 21:24:23] Prefill batch [1], #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-10-25 21:24:27] INFO:     127.0.0.1:54986 - "POST /generate HTTP/1.1" 200 OK
[2025-10-25 21:24:27] The server is fired up and ready to roll!
[2025-10-25 21:25:01] INFO:     127.0.0.1:42242 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-10-25 21:25:01] Prefill batch [10], #new-seq: 1, #new-token: 54, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-10-25 21:25:04] Decode batch [43], #running-req: 1, #token: 88, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.93, #queue-req: 0, 
[2025-10-25 21:25:08] Decode batch [83], #running-req: 1, #token: 128, token usage: 0.00, cuda graph: True, gen throughput (token/s): 11.03, #queue-req: 0, 
[2025-10-25 21:25:11] Decode batch [123], #running-req: 1, #token: 168, token usage: 0.00, cuda graph: True, gen throughput (token/s): 11.00, #queue-req: 0, 
[2025-10-25 21:25:15] Decode batch [163], #running-req: 1, #token: 208, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.97, #queue-req: 0, 
[2025-10-25 21:25:19] Decode batch [203], #running-req: 1, #token: 248, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.98, #queue-req: 0, 
[2025-10-25 21:25:22] Decode batch [243], #running-req: 1, #token: 288, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.98, #queue-req: 0, 
[2025-10-25 21:25:26] Decode batch [283], #running-req: 1, #token: 328, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.97, #queue-req: 0, 
[2025-10-25 21:25:29] Decode batch [323], #running-req: 1, #token: 368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.95, #queue-req: 0, 
[2025-10-25 21:25:33] Decode batch [363], #running-req: 1, #token: 408, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.96, #queue-req: 0, 
[2025-10-25 21:25:37] Decode batch [403], #running-req: 1, #token: 448, token usage: 0.00, cuda graph: True, gen throughput (token/s): 11.00, #queue-req: 0, 
[2025-10-25 21:25:40] Decode batch [443], #running-req: 1, #token: 488, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.96, #queue-req: 0, 

SGLANG Released: cu130 kernels can be downloaded here
https://github.com/sgl-project/whl/blob/gh-pages/cu130/sgl-kernel/index.html


I installed the sgl_kernel cu130 wheel and ran:

python -m sglang.launch_server --model /root/zl/Qwen3-VL-8B-Instruct

but got this error:

    from sgl_kernel import gelu_and_mul, silu_and_mul
  File "/usr/local/lib/python3.12/dist-packages/sgl_kernel/__init__.py", line 196, in <module>
    common_ops = _load_architecture_specific_ops()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sgl_kernel/__init__.py", line 191, in _load_architecture_specific_ops
    raise ImportError(error_msg)
ImportError:
[sgl_kernel] CRITICAL: Could not load any common_ops library!

Attempted locations:
1. Architecture-specific pattern: /usr/local/lib/python3.12/dist-packages/sgl_kernel/sm100/common_ops.* - found files: ['/usr/local/lib/python3.12/dist-packages/sgl_kernel/sm100/common_ops.abi3.so']
2. Fallback pattern: /usr/local/lib/python3.12/dist-packages/sgl_kernel/common_ops.* - found files: []
3. Standard Python import: common_ops - failed

GPU Info:
- Compute capability: 110
- Expected variant: SM110 (precise math for compatibility)

Please ensure sgl_kernel is properly installed with:
pip install --upgrade sgl_kernel

Error details from previous import attempts:
- ImportError: /usr/local/lib/python3.12/dist-packages/sgl_kernel/sm100/common_ops.abi3.so: undefined symbol: _ZNK3c106SymInt22maybe_as_int_slow_pathEv
- ModuleNotFoundError: No module named 'common_ops'

Did you install from this? Release v0.3.16.post4 · sgl-project/whl · GitHub

Did you install PyTorch 2.9.0 cu130?

Did you export your environment variables correctly?
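The undefined-symbol error above usually means the sgl_kernel wheel was built against a different torch ABI than the one installed. Before comparing against the wheel tag, a quick way to collect the installed versions (`report` is just a hypothetical helper):

```python
import importlib.metadata as md

def report(pkg: str) -> str:
    """Return 'pkg==version', or note that the distribution is missing."""
    try:
        return f"{pkg}=={md.version(pkg)}"
    except md.PackageNotFoundError:
        return f"{pkg} not installed"

# Print the packages most relevant to the ABI mismatch.
for pkg in ("torch", "sgl-kernel", "sglang", "triton"):
    print(report(pkg))
```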

I used the NGC PyTorch image 25.09 as the base image and installed sglang from source:

git clone --recursive https://github.com/sgl-project/sglang.git
cd sglang 
pip install -e "python[cu130]" 
pip install sgl_kernel-0.3.16.post4+cu130-cp310-abi3-manylinux2014_aarch64.whl

export TORCH_CUDA_ARCH_LIST=11.0a  # Thor
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH


python -m sglang.launch_server --model Qwen3-VL-8B

python env:

setuptools                 79.0.1
sgl-kernel                 0.3.16.post4
sglang                     0.5.4                     /home/nvidia/zl/sglang/python
six                        1.16.0
sniffio                    1.3.1
sortedcontainers           2.4.0
soundfile                  0.13.1
soupsieve                  2.8
soxr                       0.5.0.post1
stack-data                 0.6.3
starlette                  0.49.1
sympy                      1.14.0
tabulate                   0.9.0
tensorboard                2.20.0
tensorboard-data-server    0.7.2
tensorrt                   10.13.3.9
terminado                  0.18.1
threadpoolctl              3.6.0
tiktoken                   0.12.0
timm                       1.0.16
tinycss2                   1.4.0
tokenizers                 0.22.1
torch                      2.9.0a0+50eac811a6.nv25.9
torch_memory_saver         0.0.9
torch_tensorrt             2.9.0a0
torchao                    0.9.0
torchprofile               0.0.4
torchvision                0.24.0a0+98f8b375
tornado                    6.5.2
tqdm                       4.67.1
traitlets                  5.14.3
transformer_engine         2.7.0+fedd9dd
transformers               4.57.1
types-python-dateutil      2.9.0.20250822
typing_extensions          4.15.0
typing-inspection          0.4.1
tzdata                     2025.2
uri-template               1.3.0
urllib3                    2.5.0
uv                         0.8.17
uvicorn                    0.38.0
uvloop                     0.22.1
wcwidth                    0.2.13
webcolors                  24.11.1
webencodings               0.5.1
websocket-client           1.8.0
Werkzeug                   3.1.3
wheel                      0.45.1
wrapt                      1.17.3
xdoctest                   1.0.2
xgrammar                   0.1.25
xxhash                     3.6.0
yarl                       1.22.0
zipp                       3.23.0


25.10 was released: SGLang | NVIDIA NGC


I followed the installation steps of the initial post, but when I run a model with the sglang.launch_server module, it terminates with:
"RuntimeError: No accelerator (CUDA, XPU, HPU) is available."


I don’t know sglang but this might help.

env | grep -i cuda

If the following values aren't returned, set them and add them to ~/.bashrc:

CUDA_HOME=/usr/local/cuda
LIBRARY_PATH=/usr/local/cuda-13.0/targets/sbsa-linux/lib
PATH=$PATH:/usr/local/cuda/bin

I have the same problem.

      raise RuntimeError("No accelerator (CUDA, XPU, HPU) is available.")
RuntimeError: No accelerator (CUDA, XPU, HPU) is available.
  1. Did you install pytorch 2.9.1 cu130?
  2. Did you build it with that version?

I used the installation procedure exactly described in the initial post of this thread and the way it installs torch. The version that was installed this way is: 2.9.0+cu130

uv pip install sgl-kernel --prerelease=allow --index-url https://docs.sglang.ai/whl/cu130/
uv pip install sglang --prerelease=allow 
uv pip install --force-reinstall torch torchvision torchaudio triton --index-url https://download.pytorch.org/whl/cu130
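For the "No accelerator" error, it can help to first confirm what the installed torch actually sees before digging into CUDA paths. A small diagnostic sketch (`check_accelerator` is a hypothetical helper, not part of sglang):

```python
import importlib.util

def check_accelerator() -> str:
    """Report whether torch is importable and sees a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if not torch.cuda.is_available():
        return f"torch {torch.__version__} found, but no CUDA device visible"
    return f"torch {torch.__version__}, device: {torch.cuda.get_device_name(0)}"

print(check_accelerator())
```

If this reports no visible CUDA device while nvidia-smi (or tegrastats on Jetson) works, the installed torch wheel is likely a CPU-only or mismatched build.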

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.