Inference time becomes longer when doing non-continuous fp16 or int8 inference

Sorry, we couldn’t find the inference script you’re using in the 7z file you’ve shared. Please provide us with the inference script and a sample so we can reproduce the issue.

Thank you.

profiler.7z.001 (10 MB)

profiler.7z.002 (10 MB)

profiler.7z.003 (10 MB)

profiler.7z.004 (4.7 MB)

Hi @spolisetty

Please download profiler.7z.001 - profiler.7z.004, uncompress them, and then run:
cd profiler/build
./profiler

Hi,

Sorry for the delayed response. Are you still facing this issue?

Hi, I have the same problem:

import time

# sess is an onnxruntime InferenceSession using the CUDA execution provider,
# and feed is the input dict passed to sess.run().
for i in range(N):
    start_time = time.time()
    pred_onnx = sess.run(None, feed)
    time_diff = (time.time() - start_time) * 1000
    print("execution time: ", time_diff)
    time.sleep(1)

If I don’t have the time.sleep(1), the inference is consistently at 3-5 ms, but if I have the time.sleep(1) here, the inference time becomes 10-20 ms!

This is ONNX Runtime running on CUDA. I believe it’s not model dependent; you can pick any model and run it repeatedly with some sleep in between to see the effect of “non-continuity”.
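For example, here is a rough, self-contained sketch of what I mean (the model path “model.onnx” and the 1x3x224x224 float input are just placeholders; any model run with the CUDA execution provider should show the same pattern):

import time
import numpy as np
import onnxruntime as ort

# Placeholder model and dummy input; substitute your own model and input shape.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name
feed = {input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}

for i in range(50):
    start = time.time()
    sess.run(None, feed)
    print("execution time (ms):", (time.time() - start) * 1000)
    time.sleep(1)  # comment this line out and the times stay at a few ms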

@spolisetty
Yes.
It would be very helpful if you could share more about the “warm-up” mechanism, so that we can check whether we can avoid the “cold-down” and “warm-up again” effects.
I guess it would also benefit a lot of other people.
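For example, one workaround we are considering (just a sketch, not something we have verified) is a background thread that issues dummy inferences while the real workload is idle, so the GPU never “cools down”:

import threading
import time

# Sketch of a "keep-warm" loop: sess and feed are assumed to be the same
# onnxruntime session and dummy input used for the real workload.
def keep_warm(sess, feed, stop_event, interval=0.05):
    while not stop_event.is_set():
        sess.run(None, feed)   # dummy inference to keep the GPU active
        time.sleep(interval)

stop_event = threading.Event()
threading.Thread(target=keep_warm, args=(sess, feed, stop_event), daemon=True).start()
# ... run real inferences as usual ...
# stop_event.set()  # stop the keep-warm thread on shutdown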

Hi, do you have any insight or a solution for this? The inference time increases drastically from 4 ms to 30 ms when there’s a 100 ms sleep in between. I believe this is a very common problem with CUDA.

Hi, are you still around?

Hi, do you have any advice on this? Thanks

Hi @spolisetty, we still cannot solve this problem. Would you mind sharing your advice on this?
Thank you

Hi, it’s me again. May I know if you are still checking on this?

Thanks

Hi, are you guys still around? Thanks

Hi @hnamletran,

Could you please create a post with clear issue details and a repro script/ONNX model for better debugging?

Thank you.

Hi, I have posted a new topic here, thanks
https://forums.developer.nvidia.com/t/first-inference-after-a-pause-is-always-long/200950

Hi @hnamletran, were you able to find any solution for this one? I am facing the same issue with a hifigan fp16 ONNX model. Inference after a pause takes 10x longer than continuous inference.

Any updates?

I have the same problem with YOLOv5 (PyTorch), CUDA 11.6, cuDNN 8.3.2 and a .pt model.
With my RTX A2000 I get 8 ms inference when I deliver images in a loop.
When I insert a pause of 1 s between inferences, the time goes up to ~90 ms.
This is independent of whether I use FP16 or FP32.
The problem does not occur with .onnx models.
Is there any solution for that?
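For reference, this is roughly how I measure it (just a sketch; loading via torch.hub and the 640x640 dummy input are placeholders for my actual pipeline), with torch.cuda.synchronize() so asynchronous kernel launches don’t skew the numbers:

import time
import torch

# Placeholder loading; in my real code the YOLOv5 .pt weights are loaded differently.
model = torch.hub.load("ultralytics/yolov5", "custom", path="yolov5s.pt")
model.cuda().eval()
img = torch.rand(1, 3, 640, 640, device="cuda")

with torch.no_grad():
    for i in range(20):
        torch.cuda.synchronize()
        start = time.time()
        model(img)
        torch.cuda.synchronize()  # wait for the GPU so the timing is accurate
        print("inference time (ms):", (time.time() - start) * 1000)
        time.sleep(1.0)  # removing this pause brings the time back to ~8 ms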