When using TensorRT for model inference, does QPS (Queries Per Second) stability significantly affect prediction response time (RT)?

Description

Scenario 1:

  • QPS is consistently stable at 300.
  • 99th percentile response time (99RT) is 3ms.
  • Batch size is 32.

Scenario 2:

  • QPS fluctuates between 100 and 300.
  • 99th percentile response time (99RT) is 10ms.
  • Batch size is 32.

Why is the 99th percentile response time (99RT) higher in Scenario 2, where QPS fluctuates at or below 300, than in Scenario 1, where QPS holds steady at 300?
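
For reference, here is a minimal sketch of how the 99RT figures could be computed, so both scenarios can be compared with the same method. It assumes per-request latencies are collected in milliseconds on the client side. The nearest-rank percentile definition and the names `percentile` and `summarize` are illustrative assumptions, not part of TensorRT or of the actual serving stack used here.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no latency samples collected")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to avoid float rounding drift.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Hypothetical helper: summarize per-request latencies (milliseconds)."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
        "max": max(latencies_ms),
    }


if __name__ == "__main__":
    # Synthetic data for illustration only, not measurements from this issue.
    synthetic = [3.0] * 98 + [10.0] * 2
    print(summarize(synthetic))
```

Reporting p50 and max next to p99 can help show whether the higher 99RT comes from a few slow requests or from a shift in the whole latency distribution.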

Environment

TensorRT Version: 8.6
GPU Type: NVIDIA L20
Nvidia Driver Version: 535.161.08
CUDA Version: 12.1
CUDNN Version: 8.9.7
Operating System + Version: Linux
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):