Inference time becomes longer when doing non-continuous fp16 or int8 inference

Hi,

Sorry for the delayed response. Are you still facing this issue?

Hi, I have the same problem:

import time

# N, sess and feed are defined earlier (an onnxruntime InferenceSession and its input dict)
for i in range(N):
    start_time = time.time()
    pred_onnx = sess.run(None, feed)
    time_diff = (time.time() - start_time) * 1000  # milliseconds
    print("execution time: ", time_diff)
    time.sleep(1)
If I don't have the time.sleep(1), the inference is consistently 3-5 ms, but with the time.sleep(1) in place, the inference time jumps to 10-20 ms!
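For what it's worth, one way to check whether the GPU is simply clocking down during the sleep is to log the SM clock next to each timing. This is only a rough diagnostic sketch (it assumes the same sess, feed and N as the snippet above, and needs the pynvml package installed):

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

for i in range(N):
    start_time = time.time()
    pred_onnx = sess.run(None, feed)
    elapsed_ms = (time.time() - start_time) * 1000
    # Read the current SM clock right after the run; if it drops during the
    # sleep and recovers only after a slow first inference, the GPU is idling down.
    sm_clock_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
    print("execution time (ms):", elapsed_ms, "| SM clock (MHz):", sm_clock_mhz)
    time.sleep(1)

pynvml.nvmlShutdown()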

This is ONNX Runtime with the CUDA execution provider. I believe it is not model-dependent; you can pick any model and repeatedly run it with some sleep in between to see the effect of "non-continuity".

@spolisetty
Yes.
It would be very helpful if you could share more about the "warm-up" mechanism, so that we can check whether we can avoid the "cool-down" and "warm-up again" effects.
I guess it would also benefit a lot of other people.
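To be concrete about what I am asking: by "warm-up" I mean something like the sketch below, where a handful of untimed dummy runs are issued before the timed loop (same sess, feed and N as in the earlier snippet). This removes the one-time start-up cost, but it is not obvious to me whether it can also prevent the slowdown that comes back after every pause:

WARMUP_RUNS = 10  # arbitrary number of untimed runs

# Untimed warm-up inferences (CUDA context creation, kernel/algorithm
# selection, memory pool growth all happen here instead of in the timed loop).
for _ in range(WARMUP_RUNS):
    sess.run(None, feed)

# Timed loop as before; the open question is whether the pause between
# iterations undoes the warm-up.
for i in range(N):
    start_time = time.time()
    pred_onnx = sess.run(None, feed)
    print("execution time (ms):", (time.time() - start_time) * 1000)
    time.sleep(1)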

Hi, do you have any insight or a solution for this? The inference time increases drastically from 4 ms to 30 ms when there is a 100 ms sleep in between; I believe this is a very common problem with CUDA.
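The only workaround I can think of so far is to never let the GPU go fully idle, e.g. by issuing throwaway inferences from a background thread between real requests. This is just an idea, not something I have verified, and it obviously burns GPU cycles and can contend with the real requests:

import threading
import time

stop_keepalive = threading.Event()

def keep_warm(session, dummy_feed, interval_s=0.02):
    # Run throwaway inferences while idle so the GPU never sits unused
    # long enough to clock down or release resources.
    while not stop_keepalive.is_set():
        session.run(None, dummy_feed)
        time.sleep(interval_s)

warm_thread = threading.Thread(target=keep_warm, args=(sess, feed), daemon=True)
warm_thread.start()

# ... latency-sensitive requests are served here, with arbitrary gaps ...

stop_keepalive.set()
warm_thread.join()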

Hi, are you still around?

Hi, do you have any advice on this? Thanks

Hi @spolisetty, we still cannot solve this problem. Would you mind sharing your advice on this?
Thank you

Hi, it's me again. May I know if you are still checking on this?

Thanks

Hi, are you guys still around? Thanks

Hi @hnamletran,

Could you please create a new post with clear issue details and a repro script/ONNX model for better debugging?

Thank you.

Hi, I have posted a new topic here, thanks
https://forums.developer.nvidia.com/t/first-inference-after-a-pause-is-always-long/200950

Hi @hnamletran, were you able to find any solution for this one? I am facing the same issue with a HiFi-GAN fp16 ONNX model. Inference after a pause takes 10x longer than continuous inference.