Error When Validating Qwen3-32B-FP8 Performance Using sglang (Fp8LinearMethod AttributeError)

Hi NVIDIA team:

While validating the performance of Qwen/Qwen3-32B-FP8 by following the NVIDIA sglang SOP, we hit the error below.
Do you have any recommended approaches or solutions?

We noticed that lmsys.org has published related benchmark results, so we would also like to verify whether sglang can properly run the Qwen/Qwen3-32B-FP8 model. Please see the attached log file sglang_qwen3-32b-fp8_test.txt.

Thank you.

  • Error message:
    sglang_qwen3-32b-fp8_test.txt (20.1 KB)

    AttributeError: 'Fp8LinearMethod' object has no attribute 'embedding'

  • References (lmsys.org blog):
    Optimizing GPT-OSS on NVIDIA DGX Spark: Getting the Most Out of Your Spark | LMSYS Org

    NVIDIA DGX Spark Benchmarks - Google Sheets

    NVIDIA DGX Spark | sglang | qwen-3 | 32b | fp8 | 1 | 1,145.66 | 6.08 | 2048 | 2048

  • Steps

    1. Launch SGLang container for server mode
      docker run --gpus all -it --rm \
        -p 30000:30000 \
        -v /tmp:/tmp \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=" \
        lmsysorg/sglang:spark \
        bash

    2. Start the SGLang inference server for Qwen/Qwen3-32B-FP8
      python3 -m sglang.launch_server \
        --model-path Qwen/Qwen3-32B-FP8 \
        --host 0.0.0.0 \
        --port 30000 \
        --trust-remote-code \
        --tp 1 \
        --attention-backend flashinfer \
        --mem-fraction-static 0.75
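Once the server comes up, a quick sanity check is to POST to its OpenAI-compatible chat-completions route. A minimal sketch (the model name and port come from the launch command above; the `/v1/chat/completions` path is SGLang's standard OpenAI-compatible endpoint):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Qwen/Qwen3-32B-FP8", "Say hello in one word.")
print(json.dumps(payload))

# Send it once the server is healthy, e.g.:
#   curl -s http://localhost:30000/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d '<the JSON printed above>'
```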

Thanks for the details.

DGX Spark has not been validated against the LMSYS Org benchmark setup on our side.

The error you’re seeing appears to be specific to the upstream sglang FP8 implementation and its model integration. Issues like this should be raised directly with the LMSYS/sglang upstream project, ideally referencing their published results and container images.

Hi :
Thank you for your suggestion. I have asked about this in the LMSYS Org GitHub issue as follows. If this issue gets resolved, I will share an update as well. Thank you.

[Bug] Error When Validating Qwen3-32B-FP8 Performance Using sglang (Fp8LinearMethod AttributeError) on Nvidia DGX Spark · Issue #15301 · sgl-project/sglang

1 Like

Hi raphael.amorim:
Have you tried this method to run the Qwen/Qwen3-32B-FP8 model? I mainly run into the following error, which suggests the embedding path is not supported. Thank you.

AttributeError: 'Fp8LinearMethod' object has no attribute 'embedding'

First, please remove your private HF_TOKEN from the original post :D

Everything worked fine with Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

I saw you’ve opened [Bug] Error When Validating Qwen3-32B-FP8 Performance Using sglang (Fp8LinearMethod AttributeError) on Nvidia DGX Spark · Issue #15301 · sgl-project/sglang · GitHub

1 Like

Try to remove --attention-backend flashinfer and see if it works.

1 Like

Hi raphael.amorim:
Thank you for pointing out the token issue - I’ve already fixed it. ^^

Additionally, when following your steps (rather than the GitHub instructions) to test the Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 model, I encounter the following error. Is there any part that is still missing?

Command1: (.sglang) asus@gx10-f0df:~/Desktop/sglang$ python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --host 0.0.0.0 --port 30000 --reasoning-parser qwen3 --tool-call-parser qwen --attention-backend flashinfer --mem-fraction-static 0.8

Traceback (most recent call last):
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2680, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 320, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 248, in __init__
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 359, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 511, in initialize
    self.init_device_graphs()
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 2448, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 354, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Assertion error (/sgl-kernel/build/_deps/repo-deepgemm-src/csrc/apis/../jit_kernels/impls/../heuristics/../../utils/layout.hpp:56): Unknown recipe
Possible solutions:

  1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
  2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
  3. disable torch compile by not using --enable-torch-compile
  4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)

Command 2: (.sglang) asus@gx10-f0df:~/Desktop/sglang$ python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --host 0.0.0.0 --port 30000 --reasoning-parser qwen3 --tool-call-parser qwen

/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning:
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is
(8.0) - (12.0)

warnings.warn(
[2025-12-17 07:57:16] WARNING common.py:1604: Failed to get GPU memory capacity from nvidia-smi, falling back to torch.cuda.mem_get_info().
[2025-12-17 07:57:18] WARNING server_args.py:1406: Attention backend not explicitly specified. Use trtllm_mha backend by default.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/launch_server.py", line 25, in <module>
    server_args = prepare_server_args(sys.argv[1:])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 4495, in prepare_server_args
    return ServerArgs.from_cli_args(raw_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 4033, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 294, in __init__
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 654, in __post_init__
    self._handle_attention_backend_compatibility()
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 1471, in _handle_attention_backend_compatibility
    raise ValueError(
ValueError: TRTLLM MHA backend is only supported on Blackwell GPUs (SM100). Please use a different backend.

For command 1:

  1. exit the container
  2. run sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
  3. pass --mem-fraction-static 0.65
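For context on step 3: --mem-fraction-static controls the fraction of GPU-visible memory SGLang reserves up front for weights and KV cache. On a unified-memory machine like the Spark, the OS page cache eats into the same pool, which is why dropping caches and lowering the fraction helps. Rough arithmetic only (the 100 GB figure below is illustrative, not a measured number):

```python
def static_pool_gib(gpu_visible_gib: float, mem_fraction_static: float) -> float:
    """Approximate pool SGLang reserves up front (weights + KV cache)."""
    return gpu_visible_gib * mem_fraction_static

# Illustrative: DGX Spark has 128 GB of unified memory, but the GPU-visible
# share shrinks as the OS and page cache claim their part.
for frac in (0.8, 0.75, 0.65):
    print(f"{frac:.2f} -> {static_pool_gib(100.0, frac):.0f} GiB reserved")
```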

For the other model, Qwen3-32B, try removing --attention-backend flashinfer to see if it helps, like @eugr suggested.

1 Like

Hi @eugr and @raphael.amorim :
Following the NVIDIA GB10 SGLang approach, after removing --attention-backend flashinfer, we tested Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 and Qwen/Qwen3-32B-FP8 separately. Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 works normally, but Qwen/Qwen3-32B-FP8 still fails with:

'Fp8LinearMethod' object has no attribute 'embedding'

I noticed that both models use tensor types BF16 · F8_E4M3, so I am not sure whether the Qwen/Qwen3-32B-FP8 model is somehow special. Are there any recommended approaches to resolve this issue? Thank you.

For Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 Command:
python3 -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --host 0.0.0.0 --port 30000 --trust-remote-code --tp 1 --mem-fraction-static 0.75

For Qwen/Qwen3-32B-FP8 Command:
python3 -m sglang.launch_server --model-path Qwen/Qwen3-32B-FP8 --host 0.0.0.0 --port 30000 --trust-remote-code --tp 1 --mem-fraction-static 0.75
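The failure shape is easy to picture: the quantization method the FP8 checkpoint selects implements linear layers but no embedding path, and the dense 32B model apparently routes its embedding (or tied lm_head) lookup through that method. A toy reproduction with stand-in classes (not sglang's real ones) shows the same error:

```python
class Fp8LinearMethod:
    """Stand-in for an FP8 quantization method that only covers linear layers."""
    def apply(self, layer, x):
        return x  # placeholder for the FP8 matmul path

def lookup_embedding(quant_method, token_ids):
    # A model whose embedding/lm_head is routed through the quant method
    # fails here if the method lacks an embedding implementation.
    return quant_method.embedding(None, token_ids)

try:
    lookup_embedding(Fp8LinearMethod(), [1, 2, 3])
except AttributeError as e:
    print(e)  # same message shape as the error in the log
```

This is why the MoE 30B checkpoint can work while the dense 32B one fails even though both are tagged BF16 · F8_E4M3: what matters is which layers the loader routes through the FP8 method, not the tensor dtypes alone.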

1 Like

Do you have to use SGLang? It works just fine in vLLM, at least the VL version does.

I’m getting 7 t/s on a single Spark and 12 t/s on dual Sparks with this model Qwen/Qwen3-VL-32B-Instruct-FP8.

Hi @eugr :

Because there are some results online showing Qwen3-32B FP8 being tested with sglang, and LMSYS Org has also listed this in the following reference, I was curious why they were able to run it on NVIDIA DGX Spark while I’m unable to reproduce it.

I was wondering whether it’s possible that LMSYS Org was actually using the Qwen/Qwen3-VL-32B-Instruct-FP8 model instead of Qwen/Qwen3-32B-FP8. My goal is simply to verify whether the numbers are consistent, so I can understand which model should be used for future benchmark validation.
Thank you.

References (lmsys.org blog):
Optimizing GPT-OSS on NVIDIA DGX Spark: Getting the Most Out of Your Spark | LMSYS Org

NVIDIA DGX Spark Benchmarks - Google Sheets

I get slightly better results with vLLM than they posted. It looks like the only meaningful optimization there is for gpt-oss, which they managed to get to llama.cpp levels of performance, but they haven’t merged anything into the sglang main branch, and their Spark image is missing important fixes needed to run other models.

2 Likes

Hi @eugr :
Thank you for helping to verify these questions. I will first use vLLM to test the relevant models. We have also raised an issue on the sglang GitHub about what you mentioned: the related changes have not yet been merged into the sglang main branch, and the Spark image is missing fixes needed to run other models. If there is any response on GitHub, we will post an update here.
Thank you.

Let’s see if they respond. I made a comment in a related ticket about two weeks ago and heard nothing back, even though I tagged the dev who made that Spark image.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.