TensorRT engine inference faster when execute is called more often

Hi there!

I am currently running inference on YOLOv5 engines for my thesis project. However, I noticed an interesting quirk that I don't quite understand.

My rough pipeline looks like this:

cudaMemcpy(inputBuffer, inputArray, input_size, cudaMemcpyHostToDevice);                  // copy preprocessed input host -> device
context->executeV2(buffers);                                                              // synchronous inference
cudaMemcpy(cpu_output.data(), outputBuffer, output_buffer_size, cudaMemcpyDeviceToHost);  // copy detections device -> host

and it works quite well. However, when I time how long the three steps take (execute being by far the longest), I noticed that the time executeV2 takes depends directly on the number of calls I make per second!

For example, for YOLOv5s on a Jetson AGX, calling executeV2 100 times per second leads to an average inference time of 7 ms. But when I lower the rate to 30 calls per second, execute suddenly takes >14 ms per call. The pre- and postprocessing (including the cudaMemcpy calls) take roughly the same amount of time in both cases. How can I get execute to always run at the highest possible speed?

Any help would be appreciated. I am working on JetPack 4.6 with CUDA 10.2 and TensorRT 8.0.0.1.

By what mechanism do you adjust the rate of calls per second? How do you measure “execute”?

How do you measure “execute”?

I am working with ROS, so I can measure time like this:

ros::Time t1 = ros::Time::now();
context->executeV2(buffers);
uint64_t diff_ns = ros::Time::now().toNSec() - t1.toNSec(); // elapsed time in nanoseconds
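
For cross-checking, I could also time the device-side work with CUDA events, roughly like this (a sketch only, not my actual code; it switches from executeV2 to the asynchronous enqueueV2 on an explicit stream so the events bracket only the GPU work, and error checking is omitted):

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);                // timestamp before the inference kernels
context->enqueueV2(buffers, stream, nullptr);  // enqueue inference on the stream
cudaEventRecord(stop, stream);                 // timestamp after the inference kernels
cudaEventSynchronize(stop);                    // wait for the GPU to reach the stop event

float gpu_ms = 0.0f;
cudaEventElapsedTime(&gpu_ms, start, stop);    // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaStreamDestroy(stream);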

By what mechanism do you adjust the rate of calls per second?

That is a bit more complicated to explain, but at its core the ROS main loop spins at a user-defined rate. In every loop iteration, the entire pipeline is executed: preprocessing (capturing an input image with OpenCV), executing the model, and postprocessing. By adjusting the loop rate I can adjust how often execute is called. The pre- and postprocessing steps take <2 ms each, so I can vary the loop rate quite a lot before they become a limiting factor. Also, if a single pipeline iteration takes too long, individual loop iterations are simply skipped, though that rarely happens.
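
Roughly, the loop looks like this (a simplified sketch; preprocess() and postprocess() are placeholders for my actual functions, not real API calls):

ros::Rate loop_rate(30);   // user-defined rate, e.g. 30 Hz or 100 Hz
while (ros::ok())
{
    preprocess(frame, inputArray);   // OpenCV capture + preprocessing, <2 ms
    cudaMemcpy(inputBuffer, inputArray, input_size, cudaMemcpyHostToDevice);
    context->executeV2(buffers);     // the call whose duration changes with the rate
    cudaMemcpy(cpu_output.data(), outputBuffer, output_buffer_size, cudaMemcpyDeviceToHost);
    postprocess(cpu_output);         // postprocessing, <2 ms
    ros::spinOnce();
    loop_rate.sleep();               // sleeps away the rest of the current time slice
}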

A reasonable hypothesis would be that the execution times you measure are an artifact of the rate-limiting mechanism. I don't use YOLO and know nothing about it.

Have you asked about this in the relevant YOLO forum / mailing list / GitHub discussion? That's where I would expect YOLO experts to congregate.

I am not sure what you mean by rate-limiting mechanism. To my understanding, the execution time of executeV2 should depend only on the available hardware capabilities, especially when repeatedly using the same input.

Imagine code somewhat like this:

prepareInput(buffers);
double avg_time = 0.0;
for(int i = 0; i < 1000; i++)
{
   Time t1 = Time.now();
   context->executeV2(buffers);
   double diffMS = Time.now().toMSec() - t1.toMSec(); // calculate computation time in milliseconds
   avg_time += diffMS;
   sleep(20 - diffMS); // sleep for the remaining time to keep a loop period of 20 ms
}
cout << "Average Time over 1000 iterations (ms): " << avg_time/1000 << endl;

To my understanding, avg_time should always be the same, irrespective of whether I use sleep(20 - diffMS) or sleep(50 - diffMS). But my measurements show that avg_time is nearly twice as high when using the latter. diffMS is always smaller than 20 ms.

But thank you for the tip, I will ask on YOLO's GitHub as well.

What I meant by rate-limiting mechanism:

An ordinary program executes at no particular number of loop iterations per unit of time; it runs however fast the code can execute on the given hardware platform. In your environment, you can apparently choose specific rates, e.g. 30 times per second or 100 times per second, whereas the code might execute at a "natural" rate of, say, 112.7 times per second. So there is a mechanism that "brakes" the execution from the natural rate down to the desired user-specified rate. Maybe after one invocation finishes it busy-waits or nano-sleeps until the next time slice starts.
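
Schematically, I mean a pattern like this (a generic illustration, not necessarily how ROS implements it; do_one_iteration() is a placeholder for your whole pipeline):

#include <chrono>
#include <thread>

void do_one_iteration();   // placeholder for the per-frame work

// Generic fixed-rate loop: do the work, then "brake" by sleeping until the
// start of the next time slice.
void run_at_fixed_rate(double hz)
{
    using clock = std::chrono::steady_clock;
    const auto period = std::chrono::duration_cast<clock::duration>(
        std::chrono::duration<double>(1.0 / hz));
    auto next = clock::now();
    while (true)
    {
        do_one_iteration();                    // the actual work, e.g. your pipeline
        next += period;
        std::this_thread::sleep_until(next);   // idle until the next slice begins
    }
}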

So one hypothesis is that whatever mechanism is used interferes with the timing, creating a false impression that the kernel executes in an amount of time that differs depending on the “frame” rate, when in fact it doesn’t.