Error When Validating Qwen3-32B-FP8 Performance Using sglang (Fp8LinearMethod AttributeError)

Hi NVIDIA team:

While validating the performance of Qwen/Qwen3-32B-FP8 by following the NVIDIA sglang SOP, we hit the error below.
Do you have any recommended approaches or solutions?

We noticed that lmsys.org has published related benchmark results, so we would also like to verify whether sglang can properly run the Qwen/Qwen3-32B-FP8 model. Please see the attached log file sglang_qwen3-32b-fp8_test.txt.

Thank you.

  • Error message:
    sglang_qwen3-32b-fp8_test.txt (20.1 KB)

    AttributeError: 'Fp8LinearMethod' object has no attribute 'embedding'

  • References (lmsys.org blog):
    Optimizing GPT-OSS on NVIDIA DGX Spark: Getting the Most Out of Your Spark | LMSYS Org

    NVIDIA DGX Spark Benchmarks - Google Sheets

    NVIDIA DGX Spark | sglang | qwen-3 | 32b | fp8 | 1 | 1,145.66 | 6.08 | 2048 | 2048

  • Steps

    1. Launch SGLang container for server mode
      docker run --gpus all -it --rm \
        -p 30000:30000 \
        -v /tmp:/tmp \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=" \
        lmsysorg/sglang:spark \
        bash

    2. Start the SGLang inference server for Qwen/Qwen3-32B-FP8
      python3 -m sglang.launch_server \
        --model-path Qwen/Qwen3-32B-FP8 \
        --host 0.0.0.0 \
        --port 30000 \
        --trust-remote-code \
        --tp 1 \
        --attention-backend flashinfer \
        --mem-fraction-static 0.75
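Once the server comes up, a quick sanity check is to POST to its OpenAI-compatible chat-completions route. A minimal sketch (the model name and port come from the launch command above; the `/v1/chat/completions` path is SGLang's standard OpenAI-compatible endpoint):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Qwen/Qwen3-32B-FP8", "Say hello in one word.")
print(json.dumps(payload))

# Send it once the server is healthy, e.g.:
#   curl -s http://localhost:30000/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d '<the JSON printed above>'
```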

Thanks for the details.

DGX Spark has not been validated against the LMSYS Org benchmark setup on our side.

The error you’re seeing appears to be specific to the upstream sglang FP8 implementation and its model integration. Issues like this should be raised directly with the LMSYS/sglang upstream project, ideally referencing their published results and container images.

Hi :
Thank you for your suggestion. I have asked about this in the LMSYS Org GitHub issue as follows. If this issue gets resolved, I will share an update as well. Thank you.

[Bug] Error When Validating Qwen3-32B-FP8 Performance Using sglang (Fp8LinearMethod AttributeError) on Nvidia DGX Spark · Issue #15301 · sgl-project/sglang

1 Like

Hi raphael.amorim:
Have you tried this method to run the Qwen/Qwen3-32B-FP8 model? I mainly run into the following error, which suggests the embedding path is not supported. Thank you.

AttributeError: 'Fp8LinearMethod' object has no attribute 'embedding'

First, please remove your private HF_TOKEN from the original post :D

Everything worked fine with Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

I saw you’ve opened [Bug] Error When Validating Qwen3-32B-FP8 Performance Using sglang (Fp8LinearMethod AttributeError) on Nvidia DGX Spark · Issue #15301 · sgl-project/sglang · GitHub

1 Like

Try to remove --attention-backend flashinfer and see if it works.

1 Like

Hi raphael.amorim:
Thank you for pointing out the token issue - I’ve already fixed it. ^^

Additionally, when following your steps (rather than the GitHub instructions) to test the Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 model, I encounter the following error. Is there any part that is still missing?

Command1: (.sglang) asus@gx10-f0df:~/Desktop/sglang$ python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --host 0.0.0.0 --port 30000 --reasoning-parser qwen3 --tool-call-parser qwen --attention-backend flashinfer --mem-fraction-static 0.8

Traceback (most recent call last):
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2680, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 320, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 248, in __init__
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 359, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 511, in initialize
    self.init_device_graphs()
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 2448, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 354, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Assertion error (/sgl-kernel/build/_deps/repo-deepgemm-src/csrc/apis/../jit_kernels/impls/../heuristics/../../utils/layout.hpp:56): Unknown recipe
Possible solutions:

  1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
  2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
  3. disable torch compile by not using --enable-torch-compile
  4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)

Command 2: (.sglang) asus@gx10-f0df:~/Desktop/sglang$ python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --host 0.0.0.0 --port 30000 --reasoning-parser qwen3 --tool-call-parser qwen

/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning:
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is
(8.0) - (12.0)

warnings.warn(
[2025-12-17 07:57:16] WARNING common.py:1604: Failed to get GPU memory capacity from nvidia-smi, falling back to torch.cuda.mem_get_info().
[2025-12-17 07:57:18] WARNING server_args.py:1406: Attention backend not explicitly specified. Use trtllm_mha backend by default.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/launch_server.py", line 25, in <module>
    server_args = prepare_server_args(sys.argv[1:])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 4495, in prepare_server_args
    return ServerArgs.from_cli_args(raw_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 4033, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 294, in __init__
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 654, in __post_init__
    self._handle_attention_backend_compatibility()
  File "/home/asus/Desktop/sglang/.sglang/lib/python3.12/site-packages/sglang/srt/server_args.py", line 1471, in _handle_attention_backend_compatibility
    raise ValueError(
ValueError: TRTLLM MHA backend is only supported on Blackwell GPUs (SM100). Please use a different backend.

For command 1:

  1. exit the container
  2. run sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
  3. pass --mem-fraction-static 0.65
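For context on step 3: --mem-fraction-static controls the fraction of GPU-visible memory SGLang reserves up front for weights and KV cache. On a unified-memory machine like the Spark, the OS page cache eats into the same pool, which is why dropping caches and lowering the fraction helps. Rough arithmetic only (the 100 GB figure below is illustrative, not a measured number):

```python
def static_pool_gib(gpu_visible_gib: float, mem_fraction_static: float) -> float:
    """Approximate pool SGLang reserves up front (weights + KV cache)."""
    return gpu_visible_gib * mem_fraction_static

# Illustrative: DGX Spark has 128 GB of unified memory, but the GPU-visible
# share shrinks as the OS and page cache claim their part.
for frac in (0.8, 0.75, 0.65):
    print(f"{frac:.2f} -> {static_pool_gib(100.0, frac):.0f} GiB reserved")
```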

For the other model, Qwen3-32B, try removing --attention-backend flashinfer to see if it helps, like @eugr suggested.

1 Like

Hi @eugr and @raphael.amorim :
Following the NVIDIA GB10 SGLang approach, after removing --attention-backend flashinfer, we tested Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 and Qwen/Qwen3-32B-FP8 separately. Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 works normally, but Qwen/Qwen3-32B-FP8 still fails with:

'Fp8LinearMethod' object has no attribute 'embedding'

I noticed that both models use tensor types BF16 · F8_E4M3, so I am not sure whether the Qwen/Qwen3-32B-FP8 model is somehow special. Are there any recommended approaches to resolve this issue? Thank you.

For Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 Command:
python3 -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --host 0.0.0.0 --port 30000 --trust-remote-code --tp 1 --mem-fraction-static 0.75

For Qwen/Qwen3-32B-FP8 Command:
python3 -m sglang.launch_server --model-path Qwen/Qwen3-32B-FP8 --host 0.0.0.0 --port 30000 --trust-remote-code --tp 1 --mem-fraction-static 0.75
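The failure shape is easy to picture: the quantization method the FP8 checkpoint selects implements linear layers but no embedding path, and the dense 32B model apparently routes its embedding (or tied lm_head) lookup through that method. A toy reproduction with stand-in classes (not sglang's real ones) shows the same error:

```python
class Fp8LinearMethod:
    """Stand-in for an FP8 quantization method that only covers linear layers."""
    def apply(self, layer, x):
        return x  # placeholder for the FP8 matmul path

def lookup_embedding(quant_method, token_ids):
    # A model whose embedding/lm_head is routed through the quant method
    # fails here if the method lacks an embedding implementation.
    return quant_method.embedding(None, token_ids)

try:
    lookup_embedding(Fp8LinearMethod(), [1, 2, 3])
except AttributeError as e:
    print(e)  # same message shape as the error in the log
```

This is why the MoE 30B checkpoint can work while the dense 32B one fails even though both are tagged BF16 · F8_E4M3: what matters is which layers the loader routes through the FP8 method, not the tensor dtypes alone.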

1 Like

Do you have to use SGLang? It works just fine in vLLM, at least the VL version does.

I’m getting 7 t/s on a single Spark and 12 t/s on dual Sparks with this model Qwen/Qwen3-VL-32B-Instruct-FP8.

Hi @eugr :

Because there are some results online showing Qwen3-32B FP8 being tested with sglang, and LMSYS Org has also listed this in the following reference, I was curious why they were able to run it on NVIDIA DGX Spark while I’m unable to reproduce it.

I was wondering whether it’s possible that LMSYS Org was actually using the Qwen/Qwen3-VL-32B-Instruct-FP8 model instead of Qwen/Qwen3-32B-FP8. My goal is simply to verify whether the numbers are consistent, so I can understand which model should be used for future benchmark validation.
Thank you.

References (lmsys.org blog):
Optimizing GPT-OSS on NVIDIA DGX Spark: Getting the Most Out of Your Spark | LMSYS Org

NVIDIA DGX Spark Benchmarks - Google Sheets

I get slightly better results with vLLM than they posted. It looks like the only meaningful optimization there is for gpt-oss, which they managed to get to llama.cpp levels of performance, but they haven’t merged anything into the sglang main branch, and their Spark image is missing important fixes needed to run other models.

2 Likes

Hi @eugr :
Thank you for helping to verify these questions. I will first use vLLM to test the relevant models. We have also raised an issue on the sglang GitHub about what you mentioned: the related changes have not yet been merged into the sglang main branch, and the Spark image is missing fixes needed to run other models. If there is any response on GitHub, we will post an update here.
Thank you.

Let’s see if they respond. I made a comment in a related ticket about two weeks ago and heard nothing back, even though I tagged the dev who made that Spark image.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.