Why does CUDA function latency increase rapidly as the number of threads increases?

I was testing a program on a Tesla T4 card. The software configuration was CUDA 10.1, TensorRT 5.1.5, and cuDNN 7.5 on Ubuntu 16.04. The driver version was 418.76.
The program had 20 video processing threads, and each thread ran three different small models serially. When I started only 10 threads, the program ran almost in real time, at about 40 ms per frame, and GPU-Util was about 40%. As the number of threads increased, the time per frame rose rapidly, reaching about 400 ms with 20 threads. In that case GPU-Util was about 40% at the start and dropped to about 20% as the program kept running.
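To put those numbers in perspective, here is a rough back-of-the-envelope calculation of the aggregate throughput. It assumes each thread handles one frame at a time and every frame runs all three models exactly once. The function and variable names are only for illustration and are not part of my program.

```python
def inferences_per_second(threads, models_per_frame, frame_time_ms):
    """Aggregate model executions per second across all threads.

    Assumes every thread finishes one frame (all models, run serially)
    every frame_time_ms milliseconds.
    """
    return threads * models_per_frame * 1000.0 / frame_time_ms


MODELS_PER_FRAME = 3  # three small models per frame, as described above

# Measured frame times from the two runs described above.
at_10_threads = inferences_per_second(10, MODELS_PER_FRAME, 40.0)
at_20_threads = inferences_per_second(20, MODELS_PER_FRAME, 400.0)

print("10 threads:", at_10_threads, "inferences/s")
print("20 threads:", at_20_threads, "inferences/s")
print("throughput ratio (10 vs 20 threads):", at_10_threads / at_20_threads)
```

Under these assumptions, total throughput does not just level off when I go from 10 to 20 threads. It falls to roughly a fifth, which seems consistent with GPU-Util going down rather than up.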
I tried to profile it with Nsight Systems. The CUDA stream report showed that the TensorRT layer functions had a short running time but a very long latency (for example, 10 us of running time and 30 ms of latency). Why does the latency increase so rapidly? What can I do to prevent this?