@johnny_nv I was able to get it working with models like Qwen and Nanonets, but I kept encountering this error with gpt-oss. Were you able to run gpt-oss-20b/120b?
================================ SGLang gpt-oss ===============================================
Scheduler hit an exception: Traceback (most recent call last):
File "sglang_spark/sglang/python/sglang/srt/managers/scheduler.py", line 2753, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "sglang_spark/sglang/python/sglang/srt/managers/scheduler.py", line 311, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^
File "sglang_spark/sglang/python/sglang/srt/managers/tp_worker.py", line 235, in __init__
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "sglang_spark/sglang/python/sglang/srt/model_executor/model_runner.py", line 312, in __init__
self.initialize(min_per_gpu_memory)
File "sglang_spark/sglang/python/sglang/srt/model_executor/model_runner.py", line 384, in initialize
self.load_model()
File "sglang_spark/sglang/python/sglang/srt/model_executor/model_runner.py", line 739, in load_model
self.model = get_model(
^^^^^^^^^^
File "sglang_spark/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
return loader.load_model(
^^^^^^^^^^^^^^^^^^
File "sglang_spark/sglang/python/sglang/srt/model_loader/loader.py", line 595, in load_model
self.load_weights_and_postprocess(
File "sglang_spark/sglang/python/sglang/srt/model_loader/loader.py", line 614, in load_weights_and_postprocess
quant_method.process_weights_after_loading(module)
File "sglang_spark/sglang/python/sglang/srt/layers/quantization/mxfp4.py", line 541, in process_weights_after_loading
from triton_kernels.matmul_ogs import FlexCtx, PrecisionConfig
File "sglang_spark/.sglang/lib/python3.12/site-packages/triton_kernels/matmul_ogs.py", line 15, in <module>
from .matmul_ogs_details._matmul_ogs import _compute_writeback_idx
File "sglang_spark/.sglang/lib/python3.12/site-packages/triton_kernels/matmul_ogs_details/_matmul_ogs.py", line 8, in <module>
from triton_kernels.numerics_details.flexpoint import float_to_flex, load_scale
File "sglang_spark/.sglang/lib/python3.12/site-packages/triton_kernels/numerics_details/flexpoint.py", line 55, in <module>
@tl.constexpr_function
^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'triton.language' has no attribute 'constexpr_function'
Received sigquit from a child process. It usually means the child failed.
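For what it's worth, the mismatch above (triton_kernels importing tl.constexpr_function, which the installed Triton doesn't provide) can be confirmed with a quick check before relaunching the server. This is just a diagnostic sketch; which Triton version restores the attribute depends on the triton_kernels wheel in your environment:

```python
import importlib.util


def check_triton_kernels_compat():
    """Report whether the installed triton exposes tl.constexpr_function,
    which the triton_kernels package imported by SGLang's mxfp4 path needs."""
    if importlib.util.find_spec("triton") is None:
        return "triton is not installed"
    import triton
    import triton.language as tl

    if hasattr(tl, "constexpr_function"):
        return f"OK: triton {triton.__version__} provides tl.constexpr_function"
    return (
        f"mismatch: triton {triton.__version__} has no tl.constexpr_function; "
        "triton_kernels expects a newer Triton than the one installed"
    )


if __name__ == "__main__":
    print(check_triton_kernels_compat())
```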
@johnny_nv That was very helpful for solving the triton_kernels issue, but now I have a new one. I did a bit of research, but I could not find a decent solution. If I downgrade transformers, I get the triton.language error; if I keep transformers 4.57.1, I get this error:
uv run python -m sglang.launch_server --model-path openai/gpt-oss-20b --host 0.0.0.0 --port 30000 --mem-fraction-static 0.8
sglang_spark/.sglang/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning:
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is
(8.0) - (12.0)
warnings.warn(
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "sglang_spark/sglang/python/sglang/launch_server.py", line 7, in <module>
from sglang.srt.server_args import prepare_server_args
File "sglang_spark/sglang/python/sglang/srt/server_args.py", line 61, in <module>
from sglang.srt.utils.hf_transformers_utils import check_gguf_file, get_config
File "sglang_spark/sglang/python/sglang/srt/utils/hf_transformers_utils.py", line 26, in <module>
from transformers import (
File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2317, in __getattr__
module = self._get_module(self._class_to_module[name])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2347, in _get_module
raise e
File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2345, in _get_module
return importlib.import_module("." + module_name, self.__name__)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/.local/share/uv/python/cpython-3.12.6-linux-aarch64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 40, in <module>
from .auto_factory import _LazyAutoMapping
File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 43, in <module>
from ...generation import GenerationMixin
ImportError: cannot import name 'GenerationMixin' from 'transformers.generation' (sglang_spark/.sglang/lib/python3.12/site-packages/transformers/generation/__init__.py). Did you mean: 'GenerationMode'?
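The ImportError above means the installed transformers no longer exports GenerationMixin from transformers.generation (the module layout changed in newer releases). A small diagnostic, assuming nothing beyond what the traceback shows, to see which side of the incompatibility a given environment is on:

```python
import importlib.util


def diagnose_transformers():
    """Check whether the installed transformers still exports GenerationMixin,
    which SGLang's hf_transformers_utils pulls in via transformers.models.auto."""
    if importlib.util.find_spec("transformers") is None:
        return "transformers is not installed"
    import transformers

    try:
        from transformers.generation import GenerationMixin  # noqa: F401
    except ImportError:
        return (
            f"transformers {transformers.__version__} no longer exports "
            "GenerationMixin from transformers.generation"
        )
    return f"OK: transformers {transformers.__version__} exports GenerationMixin"


if __name__ == "__main__":
    print(diagnose_transformers())
```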
The results are impressive: around 70 tokens/s on GPT-OSS 20B and 50 tokens/s on GPT-OSS 120B, which is state of the art so far.
Well, that’s a bold claim, considering that llama.cpp gives you 60 tokens/s on gpt-oss-120b and takes only a fraction of the time to start (even less with the 6.17 kernel).
It’s literally a one-liner in the terminal. I have not played with llama.cpp yet. I’m a little iffy about making any major system changes on the DGX Spark; I would rather run llama.cpp in Docker, but I haven’t been able to find an image yet. I may build my own.
Yes, it’s not designed for high concurrency, but it can handle parallel requests; you just need to specify the -np parameter, which sets the maximum number of requests it can handle in parallel. You can also enable the unified KV cache (--kv-unified) so the KV cache is shared across requests instead of allocating an np * kv_cache buffer. It has continuous batching enabled by default.
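Putting those flags together, a llama-server launch could look like this. The model path, slot count, and port are placeholders; the flag names match recent llama.cpp builds:

```shell
# Serve gpt-oss-120b with up to 4 parallel request slots (-np) and one
# unified KV cache shared across slots (--kv-unified) instead of
# allocating np separate per-slot caches. Continuous batching is on by default.
llama-server -m gpt-oss-120b.gguf -np 4 --kv-unified --host 0.0.0.0 --port 8080
```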