Description
I’m running an ONNX model converted to FP16 with TensorRT for a classification task, and I perform 100 consecutive inferences. In about 4-5 of the 100 runs, the inference time spikes unexpectedly: a run normally takes around 0.5 ms, but during these spikes it rises to 1-1.5 ms. The input image is identical for every inference, so this inconsistency is puzzling.
Is this a normal situation or is there something I should be concerned about?
Environment
- TensorRT Version: 7.2.1.6
- GPU Type: NVIDIA GeForce RTX 3080 Ti Laptop GPU
- NVIDIA Driver Version: 560.94
- CUDA Version: 11.1.0 (cuda_11.1.0_456.43_win10)
- cuDNN Version: 8.0.5.39 (cudnn-11.1-windows-x64-v8.0.5.39)
- Operating System + Version: Windows 11 Pro
Inference Code:
// Set the input binding dimensions for this batch
context.setBindingDimensions(0, Dims4(batchSize, Depth, nImageHeight, nImageWidth));
// Copy the input image from host to device asynchronously on m_Stream
cudaMemcpyAsync(m_buffer[m_nInputIndex], input, batchSize * Depth * nImageHeight * nImageWidth * sizeof(float), cudaMemcpyHostToDevice, m_Stream);
// Launch inference on the same stream
context.enqueue(batchSize, m_buffer, m_Stream, nullptr);
// Copy the class scores back to the host asynchronously
cudaMemcpyAsync(output, m_buffer[m_nOutputIndex], batchSize * m_nClassNum * sizeof(float), cudaMemcpyDeviceToHost, m_Stream);
// Block until all queued work on m_Stream has completed
cudaStreamSynchronize(m_Stream);
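For context, the per-inference latency is measured host-side around the synchronized call above. The loop below is a minimal sketch of that measurement, not the exact code: runInference() is a hypothetical wrapper around the setBindingDimensions / cudaMemcpyAsync / enqueue / sync sequence, and the warm-up count and std::chrono timing are assumptions.

#include <chrono>
#include <cstdio>

void runInference();  // hypothetical wrapper around the inference code shown above

void timeInference(int numWarmup, int numRuns)
{
    // Warm-up runs so CUDA context setup and first-call overheads are not measured
    for (int i = 0; i < numWarmup; ++i)
        runInference();

    for (int i = 0; i < numRuns; ++i)
    {
        auto start = std::chrono::high_resolution_clock::now();
        runInference();  // includes cudaStreamSynchronize, so GPU work has finished here
        auto end = std::chrono::high_resolution_clock::now();

        double ms = std::chrono::duration<double, std::milli>(end - start).count();
        std::printf("run %3d: %.3f ms\n", i, ms);
    }
}

Logging every run like this is how I see the 4-5 outliers; they do not appear to land at regular intervals.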
Has anyone experienced similar issues, or does anyone have insight into what might be causing this inconsistency in inference time?