High Latency Variance During Inference


I have a ResNet that I want to run in a loop for a real-time application. During deployment I noticed that the inference time is very inconsistent: at first it takes only ~2 ms, but after some time it occasionally spikes to ~12 ms. I tested this with PyTorch (Python) and ONNX Runtime (Python and C++) on multiple machines. It only happens on Windows PCs when using CUDA as the backend. I even saw some strange behavior where forward inference ran more smoothly while a training job was running in the background.


To Reproduce
I already managed to narrow down the specific circumstances:

  • This only happens if I simultaneously load data from my hard drive.
  • It happened on multiple Windows systems when using CUDA as the execution provider, but not on Linux.
  • The latency comes from moving my tensor to the GPU with .to('cuda') and back to the host with .to('cpu').
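
The first condition above (disk reads running at the same time) could be reproduced in isolation with something like the following. This is only a sketch under assumptions: the helper names read_file_repeatedly and time_with_background_reads are made up for illustration, the temporary file is a stand-in for a large file on the drive in question, and the placeholder workload would be replaced by the actual model call.

```python
import os
import tempfile
import threading
from time import perf_counter_ns


def read_file_repeatedly(path, stop_event, chunk_size=1 << 20):
    """Read `path` in chunks, over and over, until `stop_event` is set."""
    total = 0
    while not stop_event.is_set():
        with open(path, "rb") as f:
            while not stop_event.is_set():
                chunk = f.read(chunk_size)
                if not chunk:
                    break  # end of file: reopen and start again
                total += len(chunk)
    return total


def time_with_background_reads(workload, path, iterations=100):
    """Time each `workload()` call (in ns) while another thread reads `path`."""
    stop_event = threading.Event()
    reader = threading.Thread(
        target=read_file_repeatedly, args=(path, stop_event), daemon=True
    )
    reader.start()
    timings = []
    try:
        for _ in range(iterations):
            start = perf_counter_ns()
            workload()
            timings.append(perf_counter_ns() - start)
    finally:
        # Always stop the reader, even if the workload raises.
        stop_event.set()
        reader.join()
    return timings


if __name__ == "__main__":
    # Hypothetical stand-in file. A small temp file will mostly be served
    # from the OS cache, so a real test would point at a large file on disk.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 << 20))
    try:
        # Placeholder workload; swap in the inference call being measured.
        timings = time_with_background_reads(lambda: sum(range(10_000)), tmp.name)
        print(f"{len(timings)} calls, max {max(timings) / 1e6:.3f} ms")
    finally:
        os.remove(tmp.name)
```

Running the same workload once with and once without the background reader would show whether the disk activity alone is enough to trigger the spikes.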

I wrote a minimal example in Python using PyTorch:

from time import perf_counter_ns

import torch
import cv2 as cv
import matplotlib.pyplot as plt
import numpy as np

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)

# Inference mode: fixes BatchNorm statistics and disables dropout
model = model.to("cuda").eval()
tensor = torch.from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))


timestamps = []

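for x in range(3000):
    start = perf_counter_ns()
    with torch.no_grad():
        # Host-to-device copy, forward pass, device-to-host copy.
        # The copy back to the CPU waits for the GPU work to finish.
        output = model(tensor.to("cuda")).to("cpu")
    d_t = perf_counter_ns() - start
    timestamps.append(d_t)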

# Skip the first iteration (warm-up) and convert ns to ms
plt.plot(np.array(timestamps[1:]) * 1e-6)
plt.xlabel("Inference count")
plt.ylabel("Time [ms]")
plt.show()
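
To put numbers on the spikes rather than reading them off the plot, the collected timings could be summarized roughly like this. It is a sketch, not part of the original script: summarize_latencies is a hypothetical helper, and the threshold of 3x the median for counting a call as a spike is an arbitrary choice.

```python
import statistics


def summarize_latencies(timings_ns, spike_factor=3.0):
    """Summarize per-call latencies given in nanoseconds.

    A call counts as a spike when it takes longer than `spike_factor`
    times the median (the default of 3 is an arbitrary illustration).
    """
    ms = sorted(t / 1e6 for t in timings_ns)
    n = len(ms)
    median = statistics.median(ms)
    # Nearest-rank 99th percentile: index ceil(0.99 * n) - 1, in integer math
    p99 = ms[(99 * n + 99) // 100 - 1]
    spikes = sum(1 for t in ms if t > spike_factor * median)
    return {"median_ms": median, "p99_ms": p99, "max_ms": ms[-1], "spikes": spikes}


if __name__ == "__main__":
    # Made-up sample: mostly ~2 ms calls with a single ~12 ms outlier
    sample = [2_000_000] * 99 + [12_000_000]
    print(summarize_latencies(sample))
```

With the script above, it would be called as summarize_latencies(timestamps[1:]) to leave out the warm-up iteration.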

System Information

  • os: win11
  • gpu: rtx 4070 ti
  • python: 3.11.7
  • torch: 2.1.2+cu118
  • torchvision: 0.16.2+cu118

Any help is appreciated : )

I think you are likely to get better help by asking pytorch questions on a pytorch forum such as discuss.pytorch.org. There are NVIDIA experts on those forums.

I am not familiar with resnet and pytorch and your best chance of resolving this would be to inquire with people familiar with those. From your description, candidates for possible root causes would appear to relate to bulk data copies, such as resource contention (e.g. PCIe, system memory), possibly as a result of suboptimal machine configuration, or a specific buffering behavior in the operating system’s file system for which the OS may offer configuration knobs.

What kind of “hard drive” is being used here? I assume it is some kind of NVMe SSD with a PCIe gen4 interface (e.g. Samsung 980 PRO)?

Yes, I already posted my problem there last week (High Latency Variance During Inference - deployment - PyTorch Forums). Since I found that the problem also exists when using ONNX Runtime, I figured it might not be related to PyTorch at all and decided to post here. I also stumbled across this post (Inconsistent kernel execution times, and affected by Nsight Systems), which sounds similar.

I agree. I am using a Samsung SSD 870 EVO 4TB.

Looking at the specifications of that SSD, it appears to use the slower SATA interface rather than the faster NVMe interface. The difference in transfer speed is something like 550 MB/sec for SATA versus 7 GB/sec for an NVMe drive on PCIe gen4. That seems like a red flag to me.
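
As a back-of-the-envelope illustration of that gap, the time to read a file at each rate can be estimated as below. The two rates are the approximate peak figures quoted above; the 100 MB file size is a made-up example, and sustained real-world throughput will usually be lower than these peaks.

```python
# Approximate peak sequential transfer rates (figures quoted above)
SATA_BYTES_PER_S = 550e6       # ~550 MB/s
NVME_GEN4_BYTES_PER_S = 7e9    # ~7 GB/s


def transfer_time_ms(num_bytes, bytes_per_s):
    """Ideal time in milliseconds to move `num_bytes` at a constant rate."""
    return num_bytes / bytes_per_s * 1e3


if __name__ == "__main__":
    size = 100e6  # hypothetical 100 MB file
    sata = transfer_time_ms(size, SATA_BYTES_PER_S)
    nvme = transfer_time_ms(size, NVME_GEN4_BYTES_PER_S)
    print(f"SATA: {sata:.1f} ms, NVMe: {nvme:.1f} ms, ratio: {sata / nvme:.1f}x")
```

Under these assumptions the same read takes roughly 182 ms over SATA versus roughly 14 ms over NVMe, a factor of about 12.7, so each read keeps the drive busy for far longer on the SATA device.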

I do not have hands-on experience as I am not interested in AI, but my understanding is that at least some people in the deep learning field even use multiple NVMe SSDs in a RAID0 configuration for best performance. Whether this really provides a noticeable performance boost and whether the risk (one drive fails → data loss) is tolerable in practice I cannot assess. It would be a topic for further research.