pipe.to('cuda') is very slow: DGX Spark vs 3090 => [34s vs 1s]

Hi,

I am running a text-to-image model (SDXL) with diffusers on my DGX Spark, but it's very slow.

pipe.to('cuda') takes 34s, but only 1s on my 3090.

Below is the code:

from diffusers import StableDiffusionXLPipeline
import torch,os,sys

args = sys.argv
model_name = args[1] if len(args) > 1 else 'xl-base.safetensors'
base_dir = "/workspace"
base_model_path = f"{base_dir}/models/{model_name}"

pipe = StableDiffusionXLPipeline.from_single_file(
    base_model_path,
    torch_dtype=torch.float16,
    use_safetensors=True,
    local_files_only=True,
    original_config=f"{base_dir}/models/config/sd_xl_base.yaml",
    safety_checker=None, 
    cache_dir=f"{base_dir}/models/cache",
    use_memory_map=False,
    variant="fp16",
)

# this is very slow: it takes 34s here, but only 1s on the 3090
pipe.to('cuda')

6.14.0-1013-nvidia

NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0

print(torch.__version__)

2.9.1+cu130

This is directly from the HF issue, where it cut the same step from 20s down to 1s.

Right after creating the pipeline and before pipe.to("cuda"):

# after creating `pipe` but before pipe.to("cuda")

def _clone_module_params_buffers(module):
    for p in module.parameters():
        p.data = p.data.clone()
    for b in module.buffers():
        # some buffers (e.g. None) may not have .data
        if hasattr(b, "data"):
            b.data = b.data.clone()

# For SDXL, these are the big chunks
_clone_module_params_buffers(pipe.unet)
if hasattr(pipe, "text_encoder"):
    _clone_module_params_buffers(pipe.text_encoder)
if hasattr(pipe, "text_encoder_2"):
    _clone_module_params_buffers(pipe.text_encoder_2)
if hasattr(pipe, "vae"):
    _clone_module_params_buffers(pipe.vae)

pipe.to("cuda")
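A likely explanation for why the clone trick helps (my assumption; the HF issue doesn't spell it out): the loader memory-maps the .safetensors file, so the weights are only paged in from disk when pipe.to("cuda") first touches them, and .clone() forces that read to happen up front into regular RAM. A stdlib sketch of the "first touch pays the I/O cost" behaviour of a memory map:

```python
import mmap
import os
import tempfile

# Write a 4 MiB file to stand in for a .safetensors checkpoint.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(os.urandom(4 * 1024 * 1024))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Creating the mapping is near-instant: no data has been read yet.
    # This copy is the analogue of .clone(): it touches every byte,
    # forcing each page to actually be faulted in from the file.
    materialized = bytes(mm)
    mm.close()

os.unlink(path)
print(len(materialized))  # 4194304
```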

Make sure you’re reusing the pipeline instead of re-creating it for every image, and double-check your safetensors setup:

use_safetensors=True
use_memory_map=False
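A minimal sketch of the "load once, reuse" pattern for a script that generates many images. load_pipeline here is a hypothetical stand-in for the from_single_file call above, not a diffusers API:

```python
from functools import lru_cache

load_calls = 0

@lru_cache(maxsize=1)
def load_pipeline(model_path: str):
    """Stand-in for StableDiffusionXLPipeline.from_single_file(...).to('cuda')."""
    global load_calls
    load_calls += 1
    return object()  # pretend this is the pipeline

# Every image request reuses the same cached pipeline object,
# so the expensive load happens exactly once.
for _ in range(5):
    pipe = load_pipeline("/workspace/models/xl-base.safetensors")

print(load_calls)  # 1
```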

pip install --upgrade diffusers transformers accelerate safetensors

fastsafetensors>=0.1.10 dramatically improves load times for big LLMs on the Spark (vLLM went from ~9 minutes to ~24 seconds).
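For background on why safetensors-based loaders can be fast at all: per the safetensors format spec, every tensor's dtype, shape, and byte range lives in a small JSON header at the front of the file, so a loader can plan all its reads without touching the weights themselves. A stdlib sketch that builds and parses a minimal file in that layout:

```python
import json
import os
import struct
import tempfile

# Build a minimal file in the safetensors layout: an 8-byte
# little-endian header size, a JSON header, then raw tensor bytes.
header = {"weight": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]}}
header_bytes = json.dumps(header).encode("utf-8")
tensor_bytes = bytes(8)  # 2x2 fp16 tensor = 8 bytes of (zeroed) data

fd, path = tempfile.mkstemp(suffix=".safetensors")
os.close(fd)
with open(path, "wb") as f:
    f.write(struct.pack("<Q", len(header_bytes)))
    f.write(header_bytes)
    f.write(tensor_bytes)

# A loader only needs the first few KB to learn every tensor's dtype,
# shape, and byte range -- the weights themselves can be read lazily.
with open(path, "rb") as f:
    (header_size,) = struct.unpack("<Q", f.read(8))
    metadata = json.loads(f.read(header_size))

os.unlink(path)
print(metadata["weight"]["shape"])  # [2, 2]
```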
