Hello, thanks @johnny_nv, it was useful.
Installation Steps for sgl-kernel on Jetson Thor 🚀
Install CMake:
ARCH=$(uname -m)
wget https://cmake.org/files/v3.31/cmake-3.31.1-linux-${ARCH}.tar.gz
tar -xzf cmake-3.31.1-linux-${ARCH}.tar.gz
sudo mv cmake-3.31.1-linux-${ARCH} /opt/cmake
export PATH=/opt/cmake/bin:$PATH
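As a quick sanity check, the per-architecture download URL can be previewed before fetching (a small sketch; on Jetson Thor `uname -m` reports aarch64):

```shell
# Preview the download URL that the steps above will fetch.
# On Jetson Thor, uname -m reports aarch64, so the Linux aarch64 tarball is used.
ARCH=$(uname -m)
URL="https://cmake.org/files/v3.31/cmake-3.31.1-linux-${ARCH}.tar.gz"
echo "$URL"
```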
Install the required system dependencies:
sudo apt-get install -y libnuma-dev
Set the essential environment variables for the CUDA, Triton, and build processes:
export TORCH_CUDA_ARCH_LIST=11.0a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CMAKE_BUILD_PARALLEL_LEVEL=1
export MAX_JOBS=4
export CPLUS_INCLUDE_PATH=/usr/local/cuda-13.0/targets/sbsa-linux/include/cccl
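Before kicking off the build, it can help to confirm these variables are actually set in the current shell (a minimal check; the variable list just mirrors the exports above):

```shell
# Print each build-related variable, flagging any that are unset.
for v in TORCH_CUDA_ARCH_LIST TRITON_PTXAS_PATH CMAKE_BUILD_PARALLEL_LEVEL MAX_JOBS CPLUS_INCLUDE_PATH; do
  printf '%s=%s\n' "$v" "$(printenv "$v" || echo '<unset>')"
done
```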
Navigate to the sgl-kernel source directory and use `uv build` to compile the library:
cd sgl-kernel
uv build --wheel --no-build-isolation . --out-dir "./wheels" \
  --config-settings=cmake.args="-G;Ninja" \
  --config-settings=cmake.define.TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}" \
  --config-settings=cmake.define.CUDA_VERSION="13.0" \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_BF16=1 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_FP8=1 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_FP4=1 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_FA3=0 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_SM90A=0 \
  --config-settings=cmake.define.SGL_KERNEL_ENABLE_SM100A=1 \
  --config-settings=cmake.define.ENABLE_BELOW_SM90=OFF \
  --config-settings=cmake.define.CMAKE_POLICY_VERSION_MINIMUM=3.5
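Assuming the build succeeds, the wheel lands under ./wheels. A hedged install-and-import check (the exact wheel filename depends on the sgl-kernel version and platform tag, hence the glob):

```shell
# Install the freshly built wheel and confirm the extension module imports.
WHEEL=$(ls ./wheels/sgl_kernel-*.whl 2>/dev/null | head -n1)
if [ -n "$WHEEL" ]; then
  pip install --no-deps "$WHEEL"
  python3 -c "import sgl_kernel; print('sgl-kernel import OK')"
else
  echo "no wheel found in ./wheels (build may not have completed)"
fi
```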
Run the SGLang server:
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000 --mem-fraction 0.6 --attention-backend triton
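Once the server reports it is ready, it exposes an OpenAI-compatible endpoint on the configured host/port. A minimal smoke-test request (the model name must match --model-path; the fallback message just keeps the command harmless if the server is not up yet):

```shell
# Build and locally validate a chat-completions payload, then send it to the server.
PAYLOAD='{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hello from Jetson Thor"}],"max_tokens":32}'
echo "$PAYLOAD" | python3 -m json.tool >/dev/null && echo "payload is valid JSON"
curl -s --max-time 10 http://localhost:30000/v1/chat/completions \
  -H 'Content-Type: application/json' -d "$PAYLOAD" || echo "server not reachable yet"
```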
Output
[2025-10-25 21:23:47] Using default HuggingFace chat template with detected content format: string
[2025-10-25 21:23:51] INFO trace.py:48: opentelemetry package is not installed, tracing disabled
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-25 21:23:54] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-10-25 21:23:54] Init torch distributed ends. mem usage=0.00 GB
[2025-10-25 21:23:54] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-10-25 21:23:55] Load weight begin. avail mem=116.85 GB
[2025-10-25 21:23:56] Using model weights format ['*.safetensors']
Ignored error while writing commit hash to /home/jetson/vllm_models/hub/models--meta-llama--Llama-3.1-8B-Instruct/refs/main: [Errno 13] Permission denied: '/home/jetson/vllm_models/hub/models--meta-llama--Llama-3.1-8B-Instruct/refs/main'.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.33it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:03<00:04, 2.20s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:07<00:02, 2.78s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00, 3.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00, 2.72s/it]
[2025-10-25 21:24:08] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=101.17 GB, mem usage=15.68 GB.
[2025-10-25 21:24:08] Using KV cache dtype: torch.bfloat16
[2025-10-25 21:24:10] KV Cache is allocated. #tokens: 445695, K size: 27.20 GB, V size: 27.20 GB
[2025-10-25 21:24:10] Memory pool end. avail mem=45.01 GB
[2025-10-25 21:24:10] Capture cuda graph begin. This can take up to several minutes. avail mem=44.83 GB
[2025-10-25 21:24:10] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=256 avail_mem=44.37 GB): 0%| | 0/36 [00:00<?, ?it/s][2025-10-25 21:24:11] MOE_A2A_BACKEND is not initialized, using default backend
Capturing batches (bs=1 avail_mem=43.70 GB): 100%|███████████████████████████████████████| 36/36 [00:09<00:00, 3.75it/s]
[2025-10-25 21:24:20] Capture cuda graph end. Time elapsed: 10.15 s. mem usage=1.12 GB. avail mem=43.70 GB.
[2025-10-25 21:24:21] max_total_num_tokens=445695, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2048, context_len=131072, available_gpu_mem=43.61 GB
[2025-10-25 21:24:22] INFO: Started server process [150349]
[2025-10-25 21:24:22] INFO: Waiting for application startup.
[2025-10-25 21:24:22] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.9}
[2025-10-25 21:24:22] INFO: Application startup complete.
[2025-10-25 21:24:22] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-10-25 21:24:23] INFO: 127.0.0.1:54984 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-25 21:24:23] Prefill batch [1], #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-25 21:24:27] INFO: 127.0.0.1:54986 - "POST /generate HTTP/1.1" 200 OK
[2025-10-25 21:24:27] The server is fired up and ready to roll!
[2025-10-25 21:25:01] INFO: 127.0.0.1:42242 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-10-25 21:25:01] Prefill batch [10], #new-seq: 1, #new-token: 54, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-25 21:25:04] Decode batch [43], #running-req: 1, #token: 88, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.93, #queue-req: 0,
[2025-10-25 21:25:08] Decode batch [83], #running-req: 1, #token: 128, token usage: 0.00, cuda graph: True, gen throughput (token/s): 11.03, #queue-req: 0,
[2025-10-25 21:25:11] Decode batch [123], #running-req: 1, #token: 168, token usage: 0.00, cuda graph: True, gen throughput (token/s): 11.00, #queue-req: 0,
[2025-10-25 21:25:15] Decode batch [163], #running-req: 1, #token: 208, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.97, #queue-req: 0,
[2025-10-25 21:25:19] Decode batch [203], #running-req: 1, #token: 248, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.98, #queue-req: 0,
[2025-10-25 21:25:22] Decode batch [243], #running-req: 1, #token: 288, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.98, #queue-req: 0,
[2025-10-25 21:25:26] Decode batch [283], #running-req: 1, #token: 328, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.97, #queue-req: 0,
[2025-10-25 21:25:29] Decode batch [323], #running-req: 1, #token: 368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.95, #queue-req: 0,
[2025-10-25 21:25:33] Decode batch [363], #running-req: 1, #token: 408, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.96, #queue-req: 0,
[2025-10-25 21:25:37] Decode batch [403], #running-req: 1, #token: 448, token usage: 0.00, cuda graph: True, gen throughput (token/s): 11.00, #queue-req: 0,
[2025-10-25 21:25:40] Decode batch [443], #running-req: 1, #token: 488, token usage: 0.00, cuda graph: True, gen throughput (token/s): 10.96, #queue-req: 0,