I have been using TensorRT in my work for a while, and inference time has always been stable.
However, on some computers running the same code, TensorRT inference speed is highly unstable.
For example, an image that should normally be detected in 30 ms can take 120 ms on the next detection with the same model. After collecting per-module timing statistics, I found that the time difference comes from the cudaMemcpyAsync() call in cudaMemcpyDeviceToHost mode.
After some conjecture and experimental verification, I found the key factor: the graphics card driver version. All the computers that show this instability have one thing in common: their driver versions are all above 512. The driver versions of the other computers are below 500, e.g. 471.41 and 472.88.
I downgraded one computer's graphics driver to 472, and detection became stable, so my guess appears to be correct.
So why is the TensorRT inference time related to the graphics card driver version?
I also found something that improves stability on the 512 driver: I ran "nvidia-smi -lgc <clock>" (lock GPU clocks) from cmd, then ran my TensorRT exe, and it became more stable. If the locked frequency is the GPU's maximum frequency, timing is very stable; if it is below the maximum, the time still jumps occasionally. In roughly a dozen tests, one case like the one shown in the figure below occurred.
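For anyone trying to reproduce this, the clock-locking commands were roughly as follows (run from an elevated cmd; the actual clock value to pass depends on your GPU):

```shell
:: List the clock frequencies the GPU supports.
nvidia-smi -q -d SUPPORTED_CLOCKS

:: Lock the GPU core clock (use the maximum supported value for best stability).
nvidia-smi -lgc <maxClock>

:: Reset (unlock) the GPU clocks when done.
nvidia-smi -rgc
```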
TensorRT Version : 18.104.22.168/22.214.171.124
GPU Type : 3060/2060
Nvidia Driver Version : 471.41 / 472.88 (stable) ; 516.59 / 512.xx (unstable)
CUDA Version : 11.1
CUDNN Version : 126.96.36.199
Operating System + Version : Windows 10 21H1
Python Version (if applicable) : None
TensorFlow Version (if applicable) : None
PyTorch Version (if applicable) : None
Baremetal or Container (if container which image + tag) : None