This is directly from the HF issue that reported the 20s-to-1s improvement. Run it right after creating the pipeline and before calling `pipe.to("cuda")`:
```python
# after creating `pipe` but before pipe.to("cuda")
def _clone_module_params_buffers(module):
    for p in module.parameters():
        p.data = p.data.clone()
    for b in module.buffers():
        # guard in case a buffer lacks .data
        if hasattr(b, "data"):
            b.data = b.data.clone()

# For SDXL, these are the big chunks
_clone_module_params_buffers(pipe.unet)
if hasattr(pipe, "text_encoder"):
    _clone_module_params_buffers(pipe.text_encoder)
if hasattr(pipe, "text_encoder_2"):
    _clone_module_params_buffers(pipe.text_encoder_2)
if hasattr(pipe, "vae"):
    _clone_module_params_buffers(pipe.vae)

pipe.to("cuda")
```
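For intuition, the same helper can be exercised on any torch module. A minimal sketch (the `nn.BatchNorm1d` module and the pointer check are illustrative stand-ins, not from the issue) shows that `clone()` leaves every parameter and buffer in freshly allocated storage, which is what detaches the weights from a memory-mapped safetensors file:

```python
import torch.nn as nn

def _clone_module_params_buffers(module):
    for p in module.parameters():
        p.data = p.data.clone()
    for b in module.buffers():
        if hasattr(b, "data"):
            b.data = b.data.clone()

# BatchNorm1d has both parameters (weight, bias) and buffers (running stats)
m = nn.BatchNorm1d(4)
weight_ptr_before = m.weight.data.data_ptr()
mean_ptr_before = m.running_mean.data_ptr()

_clone_module_params_buffers(m)

# clone() allocates new memory while the old tensor is still alive,
# so both pointers are guaranteed to change
print(m.weight.data.data_ptr() != weight_ptr_before)
print(m.running_mean.data_ptr() != mean_ptr_before)
```

The pointer comparison is only a proxy here; in the real pipeline the effect is that the subsequent `.to("cuda")` copies from ordinary RAM instead of paging through the mmap.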
Finally, make sure you're reusing the pipeline instead of re-creating it for every image, and double-check your safetensors setup.