Execution time difference of execute() and enqueue()

So I changed the sampleOnnxMNIST sample so that it reads a video file and processes frames in a loop. And everything was OK.
But when I changed code from sync:

auto istart = chrono::high_resolution_clock::now();
bool status = context->execute(mParams.batchSize, buffers.getDeviceBindings().data());
chrono::duration<double> i_time_span = chrono::duration_cast<chrono::duration<double>>(chrono::high_resolution_clock::now() - istart);
// count is frame index:
cout << "Inference " << count << ": " << i_time_span.count() << " secs\n";

to async:

auto istart = chrono::high_resolution_clock::now();
// cudaStream_t cstream;
bool status = context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), cstream, nullptr);
chrono::duration<double> i_time_span = chrono::duration_cast<chrono::duration<double>>(chrono::high_resolution_clock::now() - istart);
cout << "Inference " << count << ": " << i_time_span.count() << " secs\n";

I got literally the same execution time. CUDA events give the same results as chrono does.
I checked on a PC with a GTX 1063 and on a Jetson Nano — execute() and enqueue() take the same time on both.
I expected them to be different.
Am I doing something wrong?
Or is it normal when only one inference is happening?


Async copy will impact the response time if there is a significant amount of data movement between host and device.
Could you please share the nvidia profiler output as well so we can help better?

Meanwhile, please refer to the link below:


How can I view/create it?

I swapped copyInputToDeviceAsync() and copyOutputToHostAsync() back to the sync versions, keeping the asynchronous enqueue().
Total time per frame didn’t change.
Also I put CUDA events right before and after execute() and enqueue(), so copying data won't affect the measurement:

cudaEventCreate(&cstart);
cudaEventCreate(&cend);
cudaEventRecord(cstart, cstream);
bool status = context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), cstream, nullptr);
cudaEventRecord(cend, cstream);
cudaEventSynchronize(cend); // wait until the stream reaches cend, otherwise the elapsed time is invalid
float totalTime;
cudaEventElapsedTime(&totalTime, cstart, cend);

And I’m still getting the same ~42 ms for both execute() and enqueue().

So I checked the materials you gave and found that there are examples of one-task-on-multiple-streams only for plain CUDA without TensorRT. The TensorRT examples with multiple CUDA streams only run multiple inferences (on multiple frames) at once.
I assume that inference on one image can’t be split across multiple streams, am I right?
Also, as I understand it, enqueue() waits until inference is completed: “Asynchronously execute inference on a batch.” Just like execute() does.
So is there any way to:

  • just send signal to start inference (only start, not complete)
  • do other things (prepare next frame for example) at the same time
  • after that check if inference is completed
  • ? Just like OpenVINO does: “void InferenceEngine::InferRequest::StartAsync(): Start inference of specified input(s) in asynchronous mode. Note: It returns immediately. Inference starts also immediately.” and “StatusCode InferenceEngine::InferRequest::Wait(): Waits for the result to become available. Blocks until specified millis_timeout has elapsed or the result becomes available, whichever comes first.”


    You can use the Nvidia Visual Profiler. Please refer to the link below for more details:


    Here it is: https://drive.google.com/open?id=1d3rLmhRlN7zOlUI1AI2tgm9EVxyCaxJ6
    If I understood correctly what I had to do.

    This almost tripped me up yesterday, but make sure you’re also using pinned memory.

    You can only overlap computation with memory copies by using page-locked memory, via


    Of course!

    You would call enqueue() on the stream (this is non-blocking) and then schedule the data transfer back via ::cudaMemcpyAsync.

    To check if the inference is completed on that stream, simply invoke ::cudaStreamQuery.

    You can read about this function here:

    So I need to allocate host memory using cudaMallocHost() instead of something like this:

    float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer(mParams.inputTensorNames[0]));

    Do I understand this correctly?

    You’re saying enqueue() should return immediately? Is this correct?
    I checked my code again to be sure.
    And realized that I had never measured the time of the enqueue() call alone (without copying data) with chrono::. I had only tried CUDA events for that (and I suppose they gave the time of the full execution, not just the enqueue() launch call itself).
    Measured the time again (now with chrono) and got 0.6 ms for the enqueue() call on the GTX 1063!
    Thanks for pointing that out.
    Now I removed the bufferManager method and copied the output manually:

    context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), cstream, nullptr);
    auto start = chrono::high_resolution_clock::now();
    cudaMemcpyAsync(hostOutput0, deviceOutput0, outputSize0,  // the exact buffer names got lost here, these are placeholders
    	cudaMemcpyDeviceToHost, cstream);
    auto end1 = chrono::high_resolution_clock::now();
    cudaMemcpyAsync(hostOutput1, deviceOutput1, outputSize1,
    	cudaMemcpyDeviceToHost, cstream);
    auto end2 = chrono::high_resolution_clock::now();

    But somehow copying the output data takes a lot of time!
    The second cudaMemcpyAsync() (end1 to end2) takes only 0.065 ms.
    But the first cudaMemcpyAsync() (start to end1) takes >3 ms for some reason.
    Feels like cudaMemcpyAsync() waits until execution is completed. Is that so?

    You understand correctly, but keep in mind that I’m not sure of the implementation of the buffer class you’re using there. It could very well be using cudaMallocHost or cudaHostAlloc.

    The thing to take away is that overlapping copies with kernel execution requires page-locked host memory, which you get via cudaMallocHost (or cudaHostAlloc).

    It should, in theory. Most things are non-blocking. I just know that execute synchronizes with the default stream before it returns.

    Typically, you only see speedups when you’re doing things in parallel. If you’re only doing a single inference batch, you likely won’t see any speedup.

    Instead, copy-and-execute is useful when you’re performing multiple inference batches at once. For example, you can copy over the second batch to run your inferences over while you’re still calculating the inferences for your first batch.

    This kind of stuff can help you fully saturate your GPU as most of these inference routines seem to have a theoretical utilization of like 50%.

    Also, I checked the implementation and didn’t find either of those two. Only malloc for the host and cudaMalloc for the device.

    I was curious and did almost the same today, but I only prepared the next batch while inference was running, without copying it to the input. And it almost doubled FPS on the Jetson Nano: from 82-83 ms/frame to 45 ms/frame.
    So, as I understand it now, I can also copy to the input buffer while inference is still running? That’s nice.
    I haven’t tried cudaMallocHost/cudaHostAlloc yet; I’ll try them tomorrow.