When using TensorRT's C++ API for inference on an RTX 3060 GPU, processing the first image is very slow

Description

When using TensorRT's C++ API for inference on an RTX 3060 GPU, processing the first image is very slow, while subsequent images run much faster.

Environment

TensorRT Version: 8.2.5
GPU Type: RTX 3060
Nvidia Driver Version: 512.96
CUDA Version: 11.5
CUDNN Version: 8.2.1
Operating System + Version: Windows 10 Professional
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

// DMA the input to the GPU, execute the batch asynchronously, and DMA it back:
CHECK(cudaMemcpyAsync(buffers[InputIndex], inputData, batchSize * input_size, cudaMemcpyHostToDevice, stream));
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(scores, buffers[ScoreIndex], batchSize * scores_size, cudaMemcpyDeviceToHost, stream));
CHECK(cudaMemcpyAsync(boxes, buffers[BoxesIndex], batchSize * boxes_size, cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);
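
For reference, CHECK is not defined in the snippet above; it is assumed here to be the usual CUDA error-checking macro in the style of the TensorRT samples. A minimal version under that assumption:

#include <cuda_runtime_api.h>
#include <cstdlib>
#include <iostream>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(status)                                              \
    do {                                                           \
        cudaError_t ret = (status);                                \
        if (ret != cudaSuccess) {                                  \
            std::cerr << "CUDA error " << static_cast<int>(ret)    \
                      << ": " << cudaGetErrorString(ret) << "\n";  \
            std::abort();                                          \
        }                                                          \
    } while (0)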

Timing the context.enqueue(batchSize, buffers, stream, nullptr) call gives the following results:

  • First inference: 132593.766 ms
  • Second inference: 148.9625 ms
  • Third inference: 119 ms
  • Fourth inference onward: about 52 ms each
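
This pattern (an extremely slow first call followed by a fast steady state) is consistent with one-time setup work, such as CUDA context creation and kernel loading, being paid on the first enqueue. A minimal sketch of a timing harness that runs one untimed warm-up pass before measuring, assuming a hypothetical helper timeInference built around the same context, buffers, and stream as in the snippet above:

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <chrono>
#include <iostream>

// Hypothetical helper (not from the original post): one warm-up inference
// absorbs one-time initialization cost, then steady-state calls are timed.
void timeInference(nvinfer1::IExecutionContext& context, int batchSize,
                   void** buffers, cudaStream_t stream, int runs)
{
    // Warm-up pass, deliberately excluded from the measurement.
    context.enqueue(batchSize, buffers, stream, nullptr);
    cudaStreamSynchronize(stream);

    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::high_resolution_clock::now();
        context.enqueue(batchSize, buffers, stream, nullptr);
        cudaStreamSynchronize(stream);
        auto t1 = std::chrono::high_resolution_clock::now();
        std::cout << "run " << i << ": "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count()
                  << " ms\n";
    }
}

With a harness like this, only the warm-up call would show the one-time cost, and the reported per-image numbers would reflect the ~52 ms steady state observed from the fourth inference onward.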

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered