Run SGLang on DGX Spark


  1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Create environment
uv venv .sglang --python 3.12
source .sglang/bin/activate
sudo apt install python3-dev python3.12-dev
  3. Export variables
export TORCH_CUDA_ARCH_LIST=12.1a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  4. Install SGLang
uv pip install sgl-kernel --prerelease=allow --index-url https://docs.sglang.ai/whl/cu130/
uv pip install sglang --prerelease=allow 
uv pip install --force-reinstall torch torchvision torchaudio triton --index-url https://download.pytorch.org/whl/cu130
uv pip install flashinfer-python
  5. Clean memory
sudo sysctl -w vm.drop_caches=3
  6. Run gpt-oss-120b (MXFP4)
mkdir -p ~/tiktoken_encodings
wget -O ~/tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O ~/tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --host 0.0.0.0 --port 30000 --reasoning-parser gpt-oss --tool-call-parser gpt-oss
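Once the server is up, you can smoke-test the OpenAI-compatible endpoint it exposes on port 30000. This is a minimal sketch using only the Python standard library; the endpoint path and payload shape are assumed to follow the OpenAI chat-completions API:

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt):
    """Build a chat-completions request for an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_chat_request("http://localhost:30000", "openai/gpt-oss-120b", "Say hello.")
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            body = json.loads(resp.read())
            print(body["choices"][0]["message"]["content"])
    except OSError as exc:  # server not running or unreachable
        print(f"request failed: {exc}")
```

If the server is healthy you should get a short completion back; a connection error usually just means the launch command above has not finished loading the model yet.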

If the Triton backend fails for you, delete the installed triton_kernels package directory and compile/install Triton from main.


SGLang released: the cu130 kernels can be downloaded here: https://github.com/sgl-project/whl/blob/gh-pages/cu130/sgl-kernel/index.html


@johnny_nv I was able to get it working with models like Qwen and Nanonets, but I kept encountering this error with gpt-oss. Were you able to run gpt-oss-20b/120b?

================================ SGLang  gpt-oss ===============================================
Scheduler hit an exception: Traceback (most recent call last):
  File "sglang_spark/sglang/python/sglang/srt/managers/scheduler.py", line 2753, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "sglang_spark/sglang/python/sglang/srt/managers/scheduler.py", line 311, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^
  File "sglang_spark/sglang/python/sglang/srt/managers/tp_worker.py", line 235, in __init__
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "sglang_spark/sglang/python/sglang/srt/model_executor/model_runner.py", line 312, in __init__
    self.initialize(min_per_gpu_memory)
  File "sglang_spark/sglang/python/sglang/srt/model_executor/model_runner.py", line 384, in initialize
    self.load_model()
  File "sglang_spark/sglang/python/sglang/srt/model_executor/model_runner.py", line 739, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "sglang_spark/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "sglang_spark/sglang/python/sglang/srt/model_loader/loader.py", line 595, in load_model
    self.load_weights_and_postprocess(
  File "sglang_spark/sglang/python/sglang/srt/model_loader/loader.py", line 614, in load_weights_and_postprocess
    quant_method.process_weights_after_loading(module)
  File "sglang_spark/sglang/python/sglang/srt/layers/quantization/mxfp4.py", line 541, in process_weights_after_loading
    from triton_kernels.matmul_ogs import FlexCtx, PrecisionConfig
  File "sglang_spark/.sglang/lib/python3.12/site-packages/triton_kernels/matmul_ogs.py", line 15, in <module>
    from .matmul_ogs_details._matmul_ogs import _compute_writeback_idx
  File "sglang_spark/.sglang/lib/python3.12/site-packages/triton_kernels/matmul_ogs_details/_matmul_ogs.py", line 8, in <module>
    from triton_kernels.numerics_details.flexpoint import float_to_flex, load_scale
  File "sglang_spark/.sglang/lib/python3.12/site-packages/triton_kernels/numerics_details/flexpoint.py", line 55, in <module>
    @tl.constexpr_function
     ^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'triton.language' has no attribute 'constexpr_function'

Received sigquit from a child process. It usually means the child failed.

You have to uninstall triton_kernels.
In my case:

rm -rf /home/spark/ray_test/.venv/lib/python3.12/site-packages/triton_kernels

I made similar modifications based on this: GitHub - yvbbrjdr/triton at spark
It should be fixed in Triton's main branch.
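Before deleting anything, you can confirm whether the Triton on your path actually lacks the attribute the traceback complains about. A small diagnostic sketch (nothing here is SGLang-specific; it only inspects the installed triton package):

```python
import importlib.util

def has_constexpr_function():
    """Return True/False if Triton is importable, or None if absent/broken."""
    if importlib.util.find_spec("triton") is None:
        return None
    try:
        import triton.language as tl
    except Exception:  # treat a broken install as absent
        return None
    return hasattr(tl, "constexpr_function")

if __name__ == "__main__":
    result = has_constexpr_function()
    if result is None:
        print("Triton is not installed (or not importable) in this environment")
    elif result:
        print("triton.language.constexpr_function is available")
    else:
        print("constexpr_function missing: upgrade Triton or remove triton_kernels")
```

If this prints that the attribute is missing, you are on a Triton build older than the bundled triton_kernels expects, which matches the AttributeError above.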


@johnny_nv That was very helpful for solving the triton_kernels issue, but now I have a new one. I did a bit of research but could not find a decent solution. If I downgrade transformers, I get the triton.language error; if I keep transformers 4.57.1, I get this error:

uv run python -m sglang.launch_server --model-path openai/gpt-oss-20b --host 0.0.0.0 --port 30000 --mem-fraction-static 0.8
sglang_spark/.sglang/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning: 
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    
  warnings.warn(
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "sglang_spark/sglang/python/sglang/launch_server.py", line 7, in <module>
    from sglang.srt.server_args import prepare_server_args
  File "sglang_spark/sglang/python/sglang/srt/server_args.py", line 61, in <module>
    from sglang.srt.utils.hf_transformers_utils import check_gguf_file, get_config
  File "sglang_spark/sglang/python/sglang/srt/utils/hf_transformers_utils.py", line 26, in <module>
    from transformers import (
  File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2317, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2347, in _get_module
    raise e
  File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2345, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.local/share/uv/python/cpython-3.12.6-linux-aarch64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 40, in <module>
    from .auto_factory import _LazyAutoMapping
  File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 43, in <module>
    from ...generation import GenerationMixin
ImportError: cannot import name 'GenerationMixin' from 'transformers.generation' (sglang_spark/.sglang/lib/python3.12/site-packages/transformers/generation/__init__.py). Did you mean: 'GenerationMode'?
uv cache clean
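To narrow this down, it helps to confirm which transformers version actually resolves on your path and whether the failing symbol imports at all. A small diagnostic sketch (stdlib plus whatever transformers you have installed):

```python
import importlib.util

def check_transformers():
    """Report the installed transformers version and GenerationMixin importability."""
    if importlib.util.find_spec("transformers") is None:
        return "transformers is not installed"
    try:
        import transformers
    except Exception as exc:
        return f"importing transformers failed: {exc!r}"
    try:
        from transformers.generation import GenerationMixin  # noqa: F401
        importable = True
    except ImportError:
        importable = False
    return (
        f"transformers {transformers.__version__}, "
        f"GenerationMixin importable: {importable}"
    )

if __name__ == "__main__":
    print(check_transformers())
```

If the version printed is not the one you think you installed, a stale copy is shadowing it in the virtual environment; reinstalling into a fresh venv usually clears that up.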

Hello, I wonder how I can fix this problem.

Do I need to compile Triton from this repository?


The results are impressive: around 70 tokens/s on GPT-OSS 20B and 50 tokens/s on GPT-OSS 120B, which is state-of-the-art so far

Well, that’s a bold claim, considering that llama.cpp gives you 60 tokens/s on gpt-oss-120b and takes only a fraction of time to start (even less with 6.17 kernel).

It’s literally a one-liner in the terminal. I have not played with llama.cpp yet. I’m a little iffy about making any major system changes on the DGX Spark; I would rather run llama.cpp in Docker, but I haven’t been able to find an image yet. I may make my own.

You don’t need to make any major system changes to run llama.cpp. Dead simple to set up:

Install development tools:

sudo apt install clang cmake libcurl4-openssl-dev

Checkout llama.cpp

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Build:

cmake -B build -DGGML_CUDA=ON -DGGML_CURL=ON
cmake --build build --config Release -j 20

Then run, e.g.:

build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa 1 -ngl 999 -ub 2048 -b 2048 --jinja

My main problem with llama.cpp is that it’s focused on single-user, non-parallel requests.


Yes, it’s not designed for high concurrency, but it can handle parallel requests; you just need to specify the -np parameter to set the max number of requests it can handle in parallel. You can also enable a unified KV cache (--kv-unified), so the KV cache is shared across requests instead of allocating an np * kv_cache buffer. It has continuous batching enabled by default.

See docs for details: llama.cpp/tools/server at master · ggml-org/llama.cpp · GitHub

I don’t think anyone would be using Spark in high concurrency scenarios anyway.
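To exercise those flags, you could start llama-server with, say, -np 4 --kv-unified and fire a few requests concurrently. A sketch using only the Python standard library; the URL assumes llama-server's default port 8080 and an OpenAI-style chat endpoint:

```python
import concurrent.futures
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # llama-server default port

def ask(prompt):
    """Send one chat request; return the reply text or an error string."""
    req = urllib.request.Request(
        URL,
        data=json.dumps({"messages": [{"role": "user", "content": prompt}]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]
    except (OSError, KeyError, ValueError) as exc:
        return f"request failed: {exc!r}"

if __name__ == "__main__":
    # With -np 4 the server can work on these in parallel instead of queueing.
    prompts = [f"In one word, name a color. (request {i})" for i in range(4)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for reply in pool.map(ask, prompts):
            print(reply)
```

With -np 1 you would see the replies come back strictly one after another; with -np 4 the server can interleave them via continuous batching.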


Yes, you’re correct: the point is not to use the Spark itself for high concurrency, but to use it to design and test solutions for high-concurrency scenarios.

Hi,

When I follow your steps to install SGLang, something goes wrong.

I used this command to fix "Could NOT find NUMA (missing: NUMA_INCLUDE_DIRS NUMA_LIBRARIES)": sudo apt-get install libnuma-dev

Then I hit another error, but I can't understand where the problem is, so I uploaded a picture of what I did. Waiting for help, thanks!

It is not necessary to build from source anymore, since SGLang now distributes wheels for cu130.
Use Triton 3.5.1 and FlashInfer 0.5.2, which have some fixes for Spark.


Did they fix triton-kernels too? Do we need to install them?


Yes. You can test it. The same thing happened on GB300 in the past.

Should I install triton-kernels from the repo, or are they uploaded to PyPI?

Here is the fix: [chore] update torch version to 2.9 by FlamingoPg · Pull Request #12969 · sgl-project/sglang · GitHub