Hi raphael.amorim:
Thank you for pointing out the token key issue - I've already fixed it. ^^
Additionally, when following your steps (without using GitHub) to test the Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 model, I encounter the following errors. Is there anything still missing?
Command 1: (.sglang) asus@gx10-f0df:~/Desktop/sglang$ python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --host 0.0.0.0 --port 30000 --reasoning-parser qwen3 --tool-call-parser qwen --attention-backend flashinfer --mem-fraction-static 0.8
Traceback (most recent call last):
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2680, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 320, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 248, in __init__
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 359, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 511, in initialize
    self.init_device_graphs()
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 2448, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 354, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Assertion error (/sgl-kernel/build/_deps/repo-deepgemm-src/csrc/apis/../jit_kernels/impls/../heuristics/../../utils/layout.hpp:56): Unknown recipe
Possible solutions:
- set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
- set --cuda-graph-max-bs to a smaller value (e.g., 16)
- disable torch compile by not using --enable-torch-compile
- disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
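For reference, these are the fallback variants of Command 1 I can try next; the flag values are taken directly from the hints above, not verified fixes:

```shell
# Retry with a smaller static memory fraction, as the error hint suggests
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
  --host 0.0.0.0 --port 30000 --reasoning-parser qwen3 --tool-call-parser qwen \
  --attention-backend flashinfer --mem-fraction-static 0.7

# Or cap the CUDA graph batch size
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
  --host 0.0.0.0 --port 30000 --reasoning-parser qwen3 --tool-call-parser qwen \
  --attention-backend flashinfer --cuda-graph-max-bs 16

# Last resort (not recommended, large performance loss): disable CUDA graph
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
  --host 0.0.0.0 --port 30000 --reasoning-parser qwen3 --tool-call-parser qwen \
  --attention-backend flashinfer --disable-cuda-graph
```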
Command 2: (.sglang) asus@gx10-f0df:~/Desktop/sglang$ python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --host 0.0.0.0 --port 30000 --reasoning-parser qwen3 --tool-call-parser qwen
/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning:
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
  warnings.warn(
[2025-12-17 07:57:16] WARNING common.py:1604: Failed to get GPU memory capacity from nvidia-smi, falling back to torch.cuda.mem_get_info().
[2025-12-17 07:57:18] WARNING server_args.py:1406: Attention backend not explicitly specified. Use trtllm_mha backend by default.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/launch_server.py", line 25, in <module>
    server_args = prepare_server_args(sys.argv[1:])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 4495, in prepare_server_args
    return ServerArgs.from_cli_args(raw_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 4033, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 294, in __init__
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 654, in __post_init__
    self._handle_attention_backend_compatibility()
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 1471, in _handle_attention_backend_compatibility
    raise ValueError(
ValueError: TRTLLM MHA backend is only supported on Blackwell GPUs (SM100). Please use a different backend.
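In case it helps narrow things down, both runs seem consistent with the capability check in the PyTorch warning above: the GB10 reports compute capability 12.1, while this wheel supports (8.0) - (12.0). Here is a tiny self-contained sketch of that range check; the helper name is my own illustration, not a PyTorch or sglang API:

```python
# Hypothetical illustration of the capability-range check implied by the
# UserWarning above; `is_supported_capability` is my own helper name,
# not a real PyTorch or sglang function.

def is_supported_capability(cap, min_cap=(8, 0), max_cap=(12, 0)):
    """Return True if a (major, minor) compute capability tuple is in range."""
    return min_cap <= cap <= max_cap

# GB10 reports capability 12.1, just outside the wheel's supported range:
print(is_supported_capability((12, 1)))  # False
print(is_supported_capability((12, 0)))  # True
print(is_supported_capability((8, 0)))   # True
```

So it looks like I may need a PyTorch build that supports SM 12.x before either attention backend can work on this GPU.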