High Latency Variance During Inference

Problem

I have a ResNet that I want to run in a loop for a real-time application. During deployment I noticed that the inference time is very inconsistent: at first it only takes ~2 ms, but after a while it sometimes spikes up to ~12 ms. I tested it with PyTorch (Python) and ONNX Runtime (Python and C++) on multiple machines. It only happens on Windows PCs when using CUDA as the backend. I even saw the strange behavior that forward inference ran more smoothly while a training run was going on in the background.

Example
[screenshot attachment]

To Reproduce
I have already managed to narrow down the specific circumstances:

  • It only happens if I simultaneously load data from my hard drive.
  • It happened on multiple Windows systems when using CUDA as the execution provider, but not on Linux.
  • The latency comes from moving my tensor to the GPU with .to('cuda') and back to the CPU with .to('cpu').

I wrote a minimal example in python using pytorch:

from time import perf_counter_ns

import torch
import cv2 as cv
import matplotlib.pyplot as plt
import numpy as np

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)

model.eval()
model = model.to("cuda")
tensor = torch.Tensor(np.random.rand(1, 3, 224, 224).astype(np.float32))

torch.cuda.synchronize()

timestamps = []

for x in range(3000):
    start = perf_counter_ns()
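    # one H2D copy + forward pass + D2H copy; .to("cpu") blocks until the GPU is done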
    model(tensor.to("cuda")).to("cpu")
    d_t = perf_counter_ns() - start
    timestamps.append(d_t)
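    # concurrent read from disk, which is what triggers the latency spikes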
    cv.imread(r"path/to/some/image.png")

plt.plot(np.array(timestamps[1:]) * 1e-6)
plt.xlabel("Inference count")
plt.ylabel("Time [ms]")
plt.show()
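In case it helps, here is a sketch of the same loop with the three steps timed separately (reusing model, tensor and the cv.imread call from above; the extra synchronize() calls are only there so each measurement can be attributed to a single step):

h2d_times, fwd_times, d2h_times = [], [], []

with torch.no_grad():
    for x in range(3000):
        torch.cuda.synchronize()

        # host-to-device copy
        start = perf_counter_ns()
        gpu_tensor = tensor.to("cuda")
        torch.cuda.synchronize()
        h2d_times.append(perf_counter_ns() - start)

        # forward pass
        start = perf_counter_ns()
        out = model(gpu_tensor)
        torch.cuda.synchronize()
        fwd_times.append(perf_counter_ns() - start)

        # device-to-host copy (blocks until the result is on the CPU)
        start = perf_counter_ns()
        out_cpu = out.to("cpu")
        d2h_times.append(perf_counter_ns() - start)

        cv.imread(r"path/to/some/image.png")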

System Information

  • os: win11
  • gpu: rtx 4070 ti
  • python: 3.11.7
  • torch: 2.1.2+cu118
  • torchvision: 0.16.2+cu118

Any help is appreciated : )

I think you are likely to get better help by asking PyTorch questions on a PyTorch forum such as discuss.pytorch.org. There are NVIDIA experts on those forums.

I am not familiar with ResNet or PyTorch, and your best chance of resolving this would be to inquire with people familiar with those. From your description, candidates for possible root causes would appear to relate to bulk data copies: resource contention (e.g. on PCIe or system memory), possibly as a result of suboptimal machine configuration, or a specific buffering behavior in the operating system’s file system for which the OS may offer configuration knobs.
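Purely as a diagnostic idea, and with the caveat that I do not work with PyTorch myself, so take the sketch below as an illustration rather than a recommendation: a copy from pageable host memory is staged through a pinned buffer in system memory, which is the same resource that concurrent disk reads and the file cache are putting pressure on. If the input lives in page-locked (pinned) memory instead, that staging step disappears, and if the spikes shrink as a result, system-memory contention becomes a strong suspect.

import torch

# sketch: page-locked (pinned) input buffer, so the H2D copy does not go
# through a pageable staging copy in system memory
cpu_tensor = torch.rand(1, 3, 224, 224).pin_memory()

gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)  # async copy from pinned memory
out = model(gpu_tensor)          # 'model' as defined in the report above
result = out.to("cpu")           # D2H copy; blocks until the result is ready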

What kind of “hard drive” is being used here? I assume it is some kind of NVMe SSD with a PCIe gen4 interface (e.g. Samsung 980 PRO)?

Yes, I already posted my problem there last week (High Latency Variance During Inference - deployment - PyTorch Forums). Since I found out that the problem also exists with ONNX Runtime, I figured it might not be related to PyTorch at all and decided to post here. I also stumbled across this post (Inconsistent kernel execution times, and affected by Nsight Systems), which sounds similar.

I agree. I am using a Samsung SSD 870 EVO 4TB.

Looking at the specifications of that SSD, it would appear to use the slow SATA interface rather than the fast NVMe interface. The difference in transfer speed is something like 550 MB/sec for SATA versus 7 GB/sec for an NVMe on PCIe gen4. Seems like a red flag to me.
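A quick way to confirm what that drive actually delivers is to measure its sustained sequential read rate. A minimal Python sketch (the file path is a placeholder; pick a file considerably larger than system RAM, otherwise you measure the OS file cache rather than the drive):

from time import perf_counter

path = r"path/to/some/large/file.bin"  # placeholder
chunk = 64 * 1024 * 1024               # 64 MiB reads

start = perf_counter()
total = 0
with open(path, "rb", buffering=0) as f:
    while True:
        data = f.read(chunk)
        if not data:
            break
        total += len(data)
elapsed = perf_counter() - start

print(f"{total / elapsed / 1e6:.0f} MB/s over {total / 1e9:.1f} GB")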

I do not have hands-on experience as I am not interested in AI, but my understanding is that at least some people in the deep learning field even use multiple NVMe SSDs in a RAID0 configuration for best performance. Whether this really provides a noticeable performance boost and whether the risk (one drive fails → data loss) is tolerable in practice I cannot assess. It would be a topic for further research.