Inference Speed Spikes When Running FP16 Converted ONNX Model with TensorRT

Description

I’m running an ONNX classification model converted to FP16 with TensorRT. Over 100 consecutive inferences on the same input image, roughly 4-5 of them unexpectedly spike: a typical inference takes around 0.5 ms, but during a spike it rises to 1-1.5 ms. Since the input never changes, this inconsistency is puzzling.
Is this a normal situation or is there something I should be concerned about?

Environment

  • TensorRT Version: 7.2.1.6
  • GPU Type: 3080Ti Laptop
  • Nvidia Driver Version: 560.94
  • CUDA Version: cuda_11.1.0_456.43_win10
  • CUDNN Version: cudnn-11.1-windows-x64-v8.0.5.39
  • Operating System + Version: Windows 11 Pro

Inference Code:

// Set the input binding to the actual dimensions for this run.
context.setBindingDimensions(0, Dims4(batchSize, Depth, nImageHeight, nImageWidth));
// Copy the input image to the device, launch inference, and copy the class scores back,
// all asynchronously on m_Stream.
cudaMemcpyAsync(m_buffer[m_nInputIndex], input, batchSize * Depth * nImageHeight * nImageWidth * sizeof(float), cudaMemcpyHostToDevice, m_Stream);
context.enqueue(batchSize, m_buffer, m_Stream, nullptr);
cudaMemcpyAsync(output, m_buffer[m_nOutputIndex], batchSize * m_nClassNum * sizeof(float), cudaMemcpyDeviceToHost, m_Stream);
// Block until all work queued on the stream has finished before reading the output.
cudaStreamSynchronize(m_Stream);
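
A simplified sketch of a wall-clock timing loop for the block above (runInference() stands in for the five lines shown and is illustrative, not the exact code used):

#include <chrono>
#include <cstdio>

// runInference() stands in for the block above: setBindingDimensions, the two
// asynchronous copies, enqueue, and the final cudaStreamSynchronize.
for (int i = 0; i < 100; ++i)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    runInference();
    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("inference %d: %.3f ms\n", i, ms);
}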

Has anyone experienced similar issues, or does anyone have insight into what might be causing this inconsistency in inference time?

The inconsistency in inference time while running your ONNX model using TensorRT could stem from several factors:

  1. Unstable Tactic Selection: TensorRT times candidate kernels ("tactics") while building the engine and keeps the ones it measures as fastest. If those measurements are noisy, or the chosen kernel is fast on average but occasionally slow, you can see intermittent spikes at inference time. A per-layer profiling sketch that can help pinpoint which layers slow down follows this list.
  2. Performance Regressions: certain model/GPU-architecture combinations have known performance regressions in specific TensorRT releases, which can hurt inference times.
  3. Engine Building Time: some models take significantly longer to build in certain TensorRT versions. This mostly affects startup, but if the engine is built or deserialized right before the timed runs, the first inferences can appear slower than steady state.
  4. Data-Dependent Shape Convolutions: convolutions on tensors whose shapes are only fixed at runtime may fall back to slower kernels, which can contribute to inconsistent inference times.
  5. Additional Tactics for Evaluation: newer TensorRT versions evaluate more tactics; this mostly increases build time, but it can also change which kernels are selected and therefore how consistently the engine runs.
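
As mentioned in point 1, per-layer profiling can show whether a particular layer (and therefore a particular tactic) is responsible for the slow runs. A minimal sketch, assuming the same context and bindings as in the question; LayerProfiler is an illustrative name, and the synchronous execute() path is used because older TensorRT releases do not report profiling data for enqueue():

#include <iostream>
#include "NvInfer.h"

// Illustrative profiler: TensorRT calls reportLayerTime() once per layer after each run.
struct LayerProfiler : public nvinfer1::IProfiler
{
    void reportLayerTime(const char* layerName, float ms) noexcept override
    {
        std::cout << layerName << ": " << ms << " ms\n";
    }
};

LayerProfiler profiler;
context.setProfiler(&profiler);
// Run synchronously while profiling; compare the per-layer times of a normal run
// against those of a spiked run to see where the extra time goes.
context.execute(batchSize, m_buffer);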

To narrow down the cause of these spikes, check whether the extra time is spent inside the GPU execution itself or in the surrounding host code, profile per layer (as sketched above) to see which parts of the network slow down, and then adjust the engine or builder configuration accordingly; a CUDA-event timing sketch for the first check follows below.
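
One way to do the first check is to time only the GPU work with CUDA events after a short warm-up, separating device execution time from host-side overhead. A minimal sketch, assuming the stream and buffers are set up as in the question (the warm-up and iteration counts are arbitrary):

#include <cuda_runtime.h>
#include <cstdio>

// Warm up so the measured runs are not skewed by one-time initialization.
for (int i = 0; i < 10; ++i)
{
    context.enqueue(batchSize, m_buffer, m_Stream, nullptr);
}
cudaStreamSynchronize(m_Stream);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int i = 0; i < 100; ++i)
{
    cudaEventRecord(start, m_Stream);   // marks the start of the GPU work on the stream
    context.enqueue(batchSize, m_buffer, m_Stream, nullptr);
    cudaEventRecord(stop, m_Stream);    // marks the end of the GPU work on the stream
    cudaEventSynchronize(stop);

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);
    printf("run %d: %.3f ms on the GPU\n", i, gpuMs);
}

cudaEventDestroy(start);
cudaEventDestroy(stop);

If the event-based timings stay flat while the end-to-end measurements still spike, the extra time is being spent outside the enqueued GPU work (copies, synchronization, host scheduling) rather than inside the TensorRT execution itself.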