Inference time becomes longer when doing “non-continuous” fp16 or int8 inference.
- “non-continuous” inference means performing pre-processing and a cudaMemcpyHostToDevice for a different input before every single inference call.
- “continuous” inference means performing pre-processing and cudaMemcpyHostToDevice only once for the same input, and then running inference back-to-back N=1000 times (a minimal sketch of both timing loops is shown after this list).
- The issue cannot be reproduced when doing “non-continuous” fp32 inference.
- The issue cannot be reproduced when doing “continuous” fp16 or int8 inference.
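Below is a minimal sketch of how the two modes are timed, assuming an already-built TensorRT engine and execution context with a single input and a single output binding. The names (`context`, `dInput`, `dOutput`, `inputBytes`) and the timing helpers are placeholders for my real application code, not the exact code used for this report; pre-processing is omitted for brevity.

```cpp
#include <chrono>
#include <iostream>
#include <vector>

#include <cuda_runtime.h>
#include <NvInfer.h>

// Runs N inferences in either "continuous" or "non-continuous" mode and
// prints the average time of the enqueueV2() call only.
double benchmark(nvinfer1::IExecutionContext* context,
                 void* dInput, void* dOutput,
                 const std::vector<float>& hostInput,
                 size_t inputBytes, bool continuous, int N = 1000)
{
    void* bindings[] = {dInput, dOutput};
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    double totalMs = 0.0;

    if (continuous) {
        // "continuous": one H2D copy of the same input, then N back-to-back runs.
        cudaMemcpyAsync(dInput, hostInput.data(), inputBytes,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i)
            context->enqueueV2(bindings, stream, nullptr);
        cudaStreamSynchronize(stream);
        totalMs = std::chrono::duration<double, std::milli>(
                      std::chrono::steady_clock::now() - t0).count();
    } else {
        // "non-continuous": pre-processing (omitted here) and an H2D copy are
        // repeated before every inference call; only the inference is timed.
        for (int i = 0; i < N; ++i) {
            cudaMemcpyAsync(dInput, hostInput.data(), inputBytes,
                            cudaMemcpyHostToDevice, stream);
            cudaStreamSynchronize(stream);
            auto t0 = std::chrono::steady_clock::now();
            context->enqueueV2(bindings, stream, nullptr);
            cudaStreamSynchronize(stream);
            totalMs += std::chrono::duration<double, std::milli>(
                           std::chrono::steady_clock::now() - t0).count();
        }
    }

    cudaStreamDestroy(stream);
    std::cout << (continuous ? "continuous" : "non-continuous")
              << " avg inference time: " << totalMs / N << " ms\n";
    return totalMs / N;
}
```

With fp32 engines both modes report similar averages; with fp16 or int8 engines the "non-continuous" loop reports noticeably longer per-inference times than the "continuous" loop.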
I think the issue is not correlated with a specific model.
Please let me know if you cannot reproduce the issue.
TensorRT Version: 7.1.3-1
CUDA Version: 10.2
cuDNN Version: 8.0