@johnny_nv I was able to get it working with models like Qwen and Nanonets, but I kept encountering this error with gpt-oss. Were you able to run gpt-oss-20b/120b?
================================ SGLang gpt-oss ===============================================
Scheduler hit an exception: Traceback (most recent call last):
File "sglang_spark/sglang/python/sglang/srt/managers/scheduler.py", line 2753, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "sglang_spark/sglang/python/sglang/srt/managers/scheduler.py", line 311, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^
File "sglang_spark/sglang/python/sglang/srt/managers/tp_worker.py", line 235, in __init__
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "sglang_spark/sglang/python/sglang/srt/model_executor/model_runner.py", line 312, in __init__
self.initialize(min_per_gpu_memory)
File "sglang_spark/sglang/python/sglang/srt/model_executor/model_runner.py", line 384, in initialize
self.load_model()
File "sglang_spark/sglang/python/sglang/srt/model_executor/model_runner.py", line 739, in load_model
self.model = get_model(
^^^^^^^^^^
File "sglang_spark/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
return loader.load_model(
^^^^^^^^^^^^^^^^^^
File "sglang_spark/sglang/python/sglang/srt/model_loader/loader.py", line 595, in load_model
self.load_weights_and_postprocess(
File "sglang_spark/sglang/python/sglang/srt/model_loader/loader.py", line 614, in load_weights_and_postprocess
quant_method.process_weights_after_loading(module)
File "sglang_spark/sglang/python/sglang/srt/layers/quantization/mxfp4.py", line 541, in process_weights_after_loading
from triton_kernels.matmul_ogs import FlexCtx, PrecisionConfig
File "sglang_spark/.sglang/lib/python3.12/site-packages/triton_kernels/matmul_ogs.py", line 15, in <module>
from .matmul_ogs_details._matmul_ogs import _compute_writeback_idx
File "sglang_spark/.sglang/lib/python3.12/site-packages/triton_kernels/matmul_ogs_details/_matmul_ogs.py", line 8, in <module>
from triton_kernels.numerics_details.flexpoint import float_to_flex, load_scale
File "sglang_spark/.sglang/lib/python3.12/site-packages/triton_kernels/numerics_details/flexpoint.py", line 55, in <module>
@tl.constexpr_function
^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'triton.language' has no attribute 'constexpr_function'
Received sigquit from a child process. It usually means the child failed.
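For what it's worth, the mismatch above (triton_kernels importing tl.constexpr_function, which the installed Triton doesn't provide) can be confirmed with a quick check before relaunching the server. This is just a diagnostic sketch; which Triton version restores the attribute depends on the triton_kernels wheel in your environment:

```python
import importlib.util


def check_triton_kernels_compat():
    """Report whether the installed triton exposes tl.constexpr_function,
    which the triton_kernels package imported by SGLang's mxfp4 path needs."""
    if importlib.util.find_spec("triton") is None:
        return "triton is not installed"
    import triton
    import triton.language as tl

    if hasattr(tl, "constexpr_function"):
        return f"OK: triton {triton.__version__} provides tl.constexpr_function"
    return (
        f"mismatch: triton {triton.__version__} has no tl.constexpr_function; "
        "triton_kernels expects a newer Triton than the one installed"
    )


if __name__ == "__main__":
    print(check_triton_kernels_compat())
```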
@johnny_nv That was very helpful for solving the triton_kernels issue, but now I have a new one. I did a bit of research, but I could not find a decent solution. If I downgrade transformers, I get the triton.language error; if I keep transformers 4.57.1, I get this error:
uv run python -m sglang.launch_server --model-path openai/gpt-oss-20b --host 0.0.0.0 --port 30000 --mem-fraction-static 0.8
sglang_spark/.sglang/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning:
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is
(8.0) - (12.0)
warnings.warn(
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "sglang_spark/sglang/python/sglang/launch_server.py", line 7, in <module>
from sglang.srt.server_args import prepare_server_args
File "sglang_spark/sglang/python/sglang/srt/server_args.py", line 61, in <module>
from sglang.srt.utils.hf_transformers_utils import check_gguf_file, get_config
File "sglang_spark/sglang/python/sglang/srt/utils/hf_transformers_utils.py", line 26, in <module>
from transformers import (
File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2317, in __getattr__
module = self._get_module(self._class_to_module[name])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2347, in _get_module
raise e
File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2345, in _get_module
return importlib.import_module("." + module_name, self.__name__)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/.local/share/uv/python/cpython-3.12.6-linux-aarch64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 40, in <module>
from .auto_factory import _LazyAutoMapping
File "sglang_spark/.sglang/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 43, in <module>
from ...generation import GenerationMixin
ImportError: cannot import name 'GenerationMixin' from 'transformers.generation' (sglang_spark/.sglang/lib/python3.12/site-packages/transformers/generation/__init__.py). Did you mean: 'GenerationMode'?
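The ImportError above means the installed transformers no longer exports GenerationMixin from transformers.generation (the module layout changed in newer releases). A small diagnostic, assuming nothing beyond what the traceback shows, to see which side of the incompatibility a given environment is on:

```python
import importlib.util


def diagnose_transformers():
    """Check whether the installed transformers still exports GenerationMixin,
    which SGLang's hf_transformers_utils pulls in via transformers.models.auto."""
    if importlib.util.find_spec("transformers") is None:
        return "transformers is not installed"
    import transformers

    try:
        from transformers.generation import GenerationMixin  # noqa: F401
    except ImportError:
        return (
            f"transformers {transformers.__version__} no longer exports "
            "GenerationMixin from transformers.generation"
        )
    return f"OK: transformers {transformers.__version__} exports GenerationMixin"


if __name__ == "__main__":
    print(diagnose_transformers())
```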
The results are impressive: around 70 tokens/s on GPT-OSS 20B and 50 tokens/s on GPT-OSS 120B, which is state of the art so far.
Well, that’s a bold claim, considering that llama.cpp gives you 60 tokens/s on gpt-oss-120b and takes only a fraction of the time to start (even less with the 6.17 kernel).
It’s literally a one-liner in the terminal. I have not played with llama.cpp yet. I’m a little iffy about making any major system changes on the DGX Spark; I would rather run llama.cpp in Docker, but I haven’t been able to find an image yet. I may build my own.
Yes, it’s not designed for high concurrency, but it can handle parallel requests; you just need to specify the -np parameter, which sets the maximum number of requests it can handle in parallel. You can also enable the unified KV cache (--kv-unified) so the KV cache is shared across requests instead of allocating an np * kv_cache buffer. It has continuous batching enabled by default.
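Putting those flags together, a llama-server launch could look like this. The model path, slot count, and port are placeholders; the flag names match recent llama.cpp builds:

```shell
# Serve gpt-oss-120b with up to 4 parallel request slots (-np) and one
# unified KV cache shared across slots (--kv-unified) instead of
# allocating np separate per-slot caches. Continuous batching is on by default.
llama-server -m gpt-oss-120b.gguf -np 4 --kv-unified --host 0.0.0.0 --port 8080
```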