First batch after idle / between workloads is much slower (even with preloaded data)

Hi everyone, I’m seeing a consistent issue during GPU inference where the first batch becomes significantly slower after a short idle period, even when data is fully preloaded and H2D/D2H transfers are negligible. When I run workloads continuously (e.g., queueing datasets back-to-back), inference times stay stable (~13–17 ms), but if there’s a small gap between runs, the first batch of the next workload can spike dramatically (e.g., 13 ms → 160 ms), even with identical datasets and fixed batch sizes. This behavior is reproducible in both PyTorch and TensorRT, and profiling shows the delay occurs in the first GPU kernels (like GEMM), not in data loading. Using a queue and preloading reduces the issue, and it disappears entirely if execution remains continuous, which makes me suspect a GPU “cold start” effect after idling (e.g., power state changes, cuBLAS/cuDNN reinitialization, or kernel scheduling overhead). Is this expected behavior in PyTorch CUDA execution, and what’s the recommended way to mitigate it in production inference systems?

GPUs dynamically adjust their clock frequencies over a wide range to deliver the best performance while minimizing power draw. These adjustments can be made quickly, but they are not instantaneous, and they may have a hysteresis effect built in. Periods of inactivity may cause the GPU to drop to lower clocks and to transition its communication pipes to lower-power, lower-performance modes. The first batch after such a gap in GPU activity may therefore not run at maximum performance, e.g. because clocks are still being ramped up.
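One way to check whether the slowdown tracks the length of the idle gap is to time the first batch after gaps of different lengths. The following is only a minimal sketch: `first_batch_latency_vs_gap` and `run_batch` are hypothetical names, and `run_batch` is assumed to run one inference batch and block until the GPU has finished (in PyTorch, for example, by ending with `torch.cuda.synchronize()`).

```python
import time
from typing import Callable, Dict, List


def first_batch_latency_vs_gap(
    run_batch: Callable[[], None],
    gaps_s: List[float],
    warmup_iters: int = 20,
) -> Dict[float, float]:
    """Map each idle gap (seconds) to the latency (ms) of the first batch after it.

    run_batch is a hypothetical callable that executes one batch and only
    returns once the GPU work is complete; without that synchronization the
    timings below would measure launch overhead only.
    """
    results: Dict[float, float] = {}
    for gap in gaps_s:
        # Run back-to-back batches first so the device is in its busy state.
        for _ in range(warmup_iters):
            run_batch()
        # Idle for the gap under test.
        time.sleep(gap)
        # Time only the first batch after the gap.
        start = time.perf_counter()
        run_batch()
        results[gap] = (time.perf_counter() - start) * 1e3
    return results
```

If the latency of the first batch grows with the gap and then levels off, that would be consistent with the clock and power-state explanation above, although it does not by itself rule out other causes.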

You could try locking the GPU clocks and/or power state with nvidia-smi to see whether this eliminates the observed “slow batches after an activity gap”. Note that fixing the GPU clocks at the maximum performance setting can lead to noticeably higher power draw and energy usage for end-to-end processing, which may or may not be an issue in your production environment.
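As a rough sketch of such an experiment, the helper below builds the nvidia-smi invocations for enabling persistence mode (`-pm 1`), locking the GPU clocks (`-lgc min,max`), and resetting them (`-rgc`). The function names and the clock values are illustrative assumptions; clock locking typically requires administrator privileges and is not supported on every GPU, and valid frequencies for a given device can be listed with `nvidia-smi -q -d SUPPORTED_CLOCKS`. By default nothing is executed; the commands are only printed.

```python
import subprocess
from typing import List


def lock_clock_commands(gpu_index: int, min_mhz: int, max_mhz: int) -> List[List[str]]:
    """Commands to enable persistence mode and lock GPU clocks to a range."""
    gpu = str(gpu_index)
    return [
        ["nvidia-smi", "-i", gpu, "-pm", "1"],
        ["nvidia-smi", "-i", gpu, "-lgc", f"{min_mhz},{max_mhz}"],
    ]


def reset_clock_commands(gpu_index: int) -> List[List[str]]:
    """Command to return GPU clocks to their default dynamic behavior."""
    return [["nvidia-smi", "-i", str(gpu_index), "-rgc"]]


def run_all(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    rendered = []
    for cmd in commands:
        line = " ".join(cmd)
        rendered.append(line)
        print(line)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    # 1400 MHz is a placeholder, not a recommendation for any specific GPU.
    run_all(lock_clock_commands(0, 1400, 1400))
    run_all(reset_clock_commands(0))
```

Comparing the first-batch latency with and without the lock applied should show whether clock ramp-up accounts for the spike.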

Note that CPUs likewise adjust their clock frequencies dynamically, and inasmuch as processing each batch involves host-side activity, you might observe performance artifacts there as well.
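On Linux hosts, one quick thing to look at on the CPU side is the active frequency-scaling governor. The sketch below assumes the standard cpufreq sysfs layout; `read_cpu_governor` is a hypothetical helper, and it returns `None` on systems where that file is not available (other operating systems, some virtual machines or containers).

```python
from pathlib import Path
from typing import Optional


def read_cpu_governor(
    cpu: int = 0, sysfs_root: str = "/sys/devices/system/cpu"
) -> Optional[str]:
    """Return the cpufreq scaling governor for one CPU, or None if unavailable."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq" / "scaling_governor"
    try:
        return path.read_text().strip()
    except OSError:
        # File missing or unreadable: cpufreq is not exposed on this system.
        return None


if __name__ == "__main__":
    print(read_cpu_governor())
```

A power-saving governor on the host could contribute to slow first batches in the same way as the GPU clock behavior, so it may be worth recording alongside the GPU measurements.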