There is a difference in inference speed in TensorRT 8


I upgraded from TensorRT 7 to TensorRT 8.
In my code, after initialization (the first inference happens there), inference is performed twice in a row.
When inferring twice in a row, the first inference is very slow; it seems to take more than twice as long as the second.
In TensorRT 7, both inferences ran at the same speed.
I see the same symptom on both a Jetson Nano and a PC.
Is there a difference between TensorRT 7 and 8 that would explain this?


TensorRT Version:
GPU Type: Jetson nano, gtx1060
Nvidia Driver Version:
CUDA Version: 10.2
CUDNN Version: 8.1(PC), 8.2(Jetson nano)
Operating System + Version: Ubuntu 18.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

If not already shared, please share the model, script, profiler output, and performance numbers so that we can help you better.
Alternatively, you can try running your model with the trtexec command.

While measuring model performance, make sure you consider only the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the link below for more details:


Hi, thank you for your reply.
I measured each step and found that the first cudaMemcpyHostToDevice call was taking a long time.

void inference(float* input){
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    cudaMemcpyAsync(cuda_buffers[0], input, input0_size * sizeof(float), cudaMemcpyHostToDevice, cuda_stream);
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    context->enqueueV2(cuda_buffers, cuda_stream, nullptr);
    high_resolution_clock::time_point t3 = high_resolution_clock::now();
    // output0, output1: host-side output buffers (assumed declared elsewhere)
    cudaMemcpyAsync(output0, cuda_buffers[1], output0_size * sizeof(float), cudaMemcpyDeviceToHost, cuda_stream);
    high_resolution_clock::time_point t4 = high_resolution_clock::now();
    cudaMemcpyAsync(output1, cuda_buffers[2], output1_size * sizeof(float), cudaMemcpyDeviceToHost, cuda_stream);
    high_resolution_clock::time_point t5 = high_resolution_clock::now();

    auto time1 = duration_cast<microseconds>(t2 - t1).count();
    auto time2 = duration_cast<microseconds>(t3 - t2).count();
    auto time3 = duration_cast<microseconds>(t4 - t3).count();
    auto time4 = duration_cast<microseconds>(t5 - t4).count();
    printf("cudaMemcpyHostToDevice : %lld micro sec\n", time1);
    printf("context->enqueueV2 : %lld micro sec\n", time2);
    printf("cudaMemcpyDeviceToHost: %lld micro sec\n", time3);
    printf("cudaMemcpyDeviceToHost: %lld micro sec\n", time4);
}

cudaMemcpyHostToDevice : 5903 micro sec
context->enqueueV2 : 586 micro sec
cudaMemcpyDeviceToHost : 4269 micro sec
cudaMemcpyDeviceToHost : 68 micro sec
cudaMemcpyHostToDevice : 58 micro sec
context->enqueueV2 : 463 micro sec
cudaMemcpyDeviceToHost : 4511 micro sec
cudaMemcpyDeviceToHost : 70 micro sec

It runs like this:

for(int i = 0; i < 2; ++i){
    // do preprocess
    // run inference() as above
    // do postprocess
}
If I run the for loop again after waiting about 15 seconds or more, the first cudaMemcpyHostToDevice is very slow again.
But if I run the for loop again within about 15 seconds, it is very fast.
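One caveat with the numbers above: cudaMemcpyAsync and enqueueV2 are asynchronous, so host-side clocks mostly measure the time to enqueue the work (plus any hidden synchronization, such as the first touch of a pageable host buffer), not the GPU work itself. A sketch of GPU-side timing with CUDA events, reusing cuda_buffers, input0_size, and cuda_stream from the code above (this requires a CUDA-capable device and is illustrative, not a drop-in fix):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Sketch: time the host-to-device copy on the GPU timeline with CUDA events.
// cuda_buffers, input0_size, and cuda_stream are assumed to exist as in the
// inference() function above.
void timed_h2d_copy(float* input) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, cuda_stream);              // marks start on the stream
    cudaMemcpyAsync(cuda_buffers[0], input,
                    input0_size * sizeof(float),
                    cudaMemcpyHostToDevice, cuda_stream);
    cudaEventRecord(stop, cuda_stream);               // marks end on the stream

    cudaEventSynchronize(stop);                       // wait until the copy finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);           // elapsed GPU time in ms
    printf("cudaMemcpyHostToDevice : %.0f micro sec\n", ms * 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

The same pattern (event before, event after, synchronize, elapsed time) applies to enqueueV2 and the device-to-host copies.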


I believe this is expected behaviour; the very first run usually spends extra time on setup.
If you still consider this an issue, could you please collect an Nsight Systems profile so that we can take a better look?
Also, please share a minimal repro so we can try it on our end.
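One common way to hide this first-run cost is to issue a dummy inference before timed runs, and to allocate host buffers as pinned (page-locked) memory, which speeds up transfers and lets cudaMemcpyAsync actually run asynchronously. A minimal sketch, assuming the inference() function and cuda_stream from the code above (buffer sizes and names here are illustrative):

```cpp
#include <cuda_runtime.h>

// Sketch: pinned host buffers plus a warm-up pass.
// inference() and cuda_stream are assumed to be the ones defined above.
float* input_host = nullptr;

void setup(size_t input_elems) {
    // Pinned memory: faster H2D/D2H copies, and required for cudaMemcpyAsync
    // to overlap with computation instead of falling back to a synchronous copy.
    cudaHostAlloc(&input_host, input_elems * sizeof(float), cudaHostAllocDefault);
}

void warmup(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        inference(input_host);            // dummy pass: loads kernels, warms caches,
    }                                     // and raises GPU clocks from idle
    cudaStreamSynchronize(cuda_stream);   // ensure warm-up work has completed
}
```

Note that on an idle GPU (as after your 15-second pause) the clocks drop, so the first operation after a pause can look slow regardless of TensorRT version; locking the clocks (e.g. with nvidia-smi or jetson_clocks) is another way to check this.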

Thank you.