Running SGLang Diffusion Inference

Has anyone successfully run SGLang[diffusion] on Spark? I first tried to build a Docker image from scratch and install SGLang[diffusion] from source, but the server would not even start. On my second try, I pulled sglang:dev-arm64 from Docker Hub, but it also failed after receiving the HTTP call.

The existing SGLang for Inference playbook needs some modification in order to serve diffusion models like Z-Image Turbo.

docker pull lmsysorg/sglang:dev-arm64

docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --ipc=host \
  lmsysorg/sglang:dev-arm64 \
  sglang serve --model-path Tongyi-MAI/Z-Image-Turbo --port 30000
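Before sending a request, it helps to wait until the server is actually up. A minimal polling sketch follows; in practice the probe would be an HTTP GET against http://localhost:30000/health (assuming SGLang's usual health route), but the waiting logic is the same for any probe:

```python
import time

def wait_until_ready(probe, timeout=120.0, interval=2.0):
    """Poll `probe` (a callable returning True once the server answers)
    until it succeeds or the timeout elapses; return whether it succeeded."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

# Demo with a fake probe that succeeds on the third attempt; a real probe
# would catch connection errors from an HTTP request to the /health route.
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts), timeout=10, interval=0.01)
```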

After the server started, I sent this request:

curl http://localhost:30000/v1/images/generations \
  -o >(jq -r '.data[0].b64_json' | base64 --decode > example.png) \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Tongyi-MAI/Z-Image-Turbo",
    "prompt": "A cute baby sea otter",
    "n": 1,
    "size": "1024x1024",
    "response_format": "b64_json"
  }'
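If the bash process substitution (`-o >(...)`) is inconvenient, the `b64_json` field can also be decoded in a few lines of Python. A sketch, assuming the OpenAI-style response shape that `/v1/images/generations` returns:

```python
import base64

def save_generation(response_json, path):
    """Decode the first image from an OpenAI-style /v1/images/generations
    response (response_format="b64_json") and write it to disk.
    Returns the number of bytes written."""
    raw = base64.b64decode(response_json["data"][0]["b64_json"])
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# Demo with a synthetic payload; a real response would carry PNG bytes.
payload = b"not-really-a-png"
fake = {"data": [{"b64_json": base64.b64encode(payload).decode("ascii")}]}
written = save_generation(fake, "example.png")
```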

The SGLang server was not able to complete the request and displayed the following message:

=====================================================================

[2026-01-26 14:43:19] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[01-26 14:43:26] Sampling params:
width: 1024
height: 1024
num_frames: 1
prompt: A cute baby sea otter
neg_prompt: None
seed: 1024
infer_steps: 9
num_outputs_per_prompt: 1
guidance_scale: 0.0
embedded_guidance_scale: 6.0
n_tokens: None
flow_shift: None
image_path: None
save_output: True
output_file_path: outputs/81a5cd28-4792-4554-abb9-fbd5fb835ec7.jpg

[01-26 14:43:26] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage_primary', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[01-26 14:43:26] [InputValidationStage] started…
[01-26 14:43:26] [InputValidationStage] finished in 0.0001 seconds
[01-26 14:43:26] [TextEncodingStage] started…
[01-26 14:43:27] [TextEncodingStage] finished in 0.9637 seconds
[01-26 14:43:27] [ConditioningStage] started…
[01-26 14:43:27] [ConditioningStage] finished in 0.0000 seconds
[01-26 14:43:27] [TimestepPreparationStage] started…
[01-26 14:43:27] [TimestepPreparationStage] finished in 0.0009 seconds
[01-26 14:43:27] [LatentPreparationStage] started…
[01-26 14:43:27] [LatentPreparationStage] finished in 0.0051 seconds
[01-26 14:43:27] [DenoisingStage] started…
0%| | 0/9 [00:00<?, ?it/s]
[01-26 14:43:27] [DenoisingStage] Error during execution after 315.6838 ms: RMSNorm failed with error code no kernel image is available for execution on the device
Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/stages/base.py", line 200, in __call__
    result = self.forward(batch, server_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1020, in forward
    noise_pred = self._predict_noise_with_cfg(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1265, in _predict_noise_with_cfg
    noise_pred_cond = self._predict_noise(
                      ^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1211, in _predict_noise
    return current_model(
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/models/dits/zimage.py", line 620, in forward
    x = layer(x, x_freqs_cis, adaln_input)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/models/dits/zimage.py", line 262, in forward
    self.attention_norm1(x) * scale_msa,
    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/layers/custom_op.py", line 29, in forward
    return self._forward_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/layers/layernorm.py", line 88, in forward_cuda
    out = rmsnorm(x, self.weight.data, self.variance_epsilon)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sgl_kernel/elementwise.py", line 45, in rmsnorm
    torch.ops.sgl_kernel.rmsnorm.default(out, input, weight, eps, enable_pdl)
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 841, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
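For what it's worth, "no kernel image is available for execution on the device" usually means the compiled extension (here sgl_kernel) was not built for this GPU's compute architecture. A small sketch of the check, where the capability tuple and arch list are illustrative assumptions (on the actual machine you would read the real tuple from `torch.cuda.get_device_capability()`):

```python
def sm_tag(capability):
    """Format a compute-capability tuple as an sm tag, e.g. (12, 1) -> "sm_121"."""
    major, minor = capability
    return f"sm_{major}{minor}"

def kernel_covers(device_cap, compiled_archs):
    """True if the extension was compiled for this device's architecture;
    a mismatch surfaces at runtime as "no kernel image is available"."""
    return sm_tag(device_cap) in compiled_archs

# Hypothetical illustration: a device reporting capability (12, 1) against a
# binary built only for sm_90/sm_100 would fail exactly this check.
mismatch = not kernel_covers((12, 1), ["sm_90", "sm_100"])
```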

Hi @paulsc.liu.

The official container release from SGLang with Spark support (lmsysorg/sglang:spark) is from Oct 25, 2025 and doesn't include diffusion support, but 0.5.6 does (Flux2 in #14000, Z-Image in #14067). If you follow the tutorial they did for Nemotron 3 Nano, which they claim supports DGX Spark (SGLang Adds Day-0 Support for the Highly Efficient, Open Nemotron 3 Nano Hybrid MoE Model | LMSYS Org), they're using sglang==0.5.6.post2.dev7852+g8102e36b5, which should be very straightforward to rebase onto Release v0.5.6.
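As a sanity check that the dev build from that tutorial is at or above the 0.5.6 release (the first with the diffusion PRs above), you can compare the numeric release component of the version strings. A rough sketch; real code should use `packaging.version` instead of this regex:

```python
import re

def release_tuple(version_string):
    """Extract the numeric release (X.Y.Z) from a version string,
    ignoring post/dev/local-build segments like .post2.dev7852+g8102e36b5."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    return tuple(int(g) for g in m.groups())

dev_build = "0.5.6.post2.dev7852+g8102e36b5"
ok = release_tuple(dev_build) >= release_tuple("0.5.6")
```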

I'll try it later this week, but you can try it now; it might help.

Thanks for the tips. I think SGLang 0.5.8 might be a better candidate for testing: SGLang claims good performance on diffusion models in 0.5.8. I will try to build 0.5.8 on DGX Spark. My other goal is to get vLLM-Omni running so I can compare the performance of the two when serving diffusion models.

SGLang 0.5.8 release notes: https://github.com/sgl-project/sglang/releases/tag/v0.5.8

I also had some problems with vLLM-Omni: fa3.fwd does not work on Spark, so I had to exclude it during the build process. My previous attempt was a failure. I plan to get back to it after I get SGLang to run.

Some updates: I actually got vLLM-Omni working first. You do need to install it from source. I will write up the installation process after I am sure vLLM can serve diffusion models correctly. Life on the bleeding edge can really be interesting.

Initial testing has been successful. I was able to issue a curl command from a PC to send an image-generation prompt and receive the image back from vLLM, and I could see the DGX Spark GPU activated by vLLM-Omni to generate the picture :)
