When running multiple inferences and benchmarking the time of each one, does the first inference take longer than the others?


I am using the Coral Dev Board to do inference. When running multiple inferences, the first one is always slower because the time to transfer the tensors to the embedded TPU is measured along with the classification/inference time.

I want to know if the same happens with embedded GPUs.
Where can I find an example?



It depends on what kind of memory you use.

For Jetson, the physical memory is shared by the CPU and GPU.
As a result, you can use unified or page-locked memory without copying tensors to the processor.

A basic inference sample can be found in our TensorRT folder:




Hello @AastaLLL

When you say physical memory, you mean the DRAM (off-chip), right? The 8 GB one.

But why does the first inference take longer than the other ones, then? I am only measuring the inference time.



It’s known that the GPU takes longer on the first launch because of initialization:

As a result, we usually reserve some warm-up time before benchmarking.
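The warm-up idea can be sketched in plain Python (framework-agnostic; the `Model` class and its lazy setup are hypothetical stand-ins for a real engine whose first call pays a one-time GPU/CUDA initialization cost):

```python
import time

class Model:
    """Toy model whose first call pays a one-time setup cost,
    mimicking GPU context initialization on the first launch."""
    def __init__(self):
        self._initialized = False

    def run_inference(self, x):
        if not self._initialized:
            time.sleep(0.05)   # simulated one-time initialization
            self._initialized = True
        return x * 2           # stand-in for the real computation

def benchmark(model, x, warmup=3, iters=10):
    # Warm-up runs: absorb the initialization cost, excluded from timing.
    for _ in range(warmup):
        model.run_inference(x)
    # Timed runs: measure steady-state latency only.
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        model.run_inference(x)
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

model = Model()
avg = benchmark(model, 21)
print(f"steady-state latency: {avg * 1e6:.1f} us")
```

Without the warm-up loop, the first timed iteration would include the initialization cost and inflate the average; with it, every timed run measures only the steady-state inference.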

