Hello,
I have encountered a puzzling issue while benchmarking the inference time of a TensorRT model running on a Jetson Orin Nano. I converted the model with TensorFlow's TF-TRT converter (`tf.experimental.tensorrt.Converter`) and ran it with the following Python code:
```python
import tensorflow as tf
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import matplotlib.pyplot as plt
import time

saved_model_dir = "tensorRT-model-FP16"
model = tf.saved_model.load(saved_model_dir)
infer = model.signatures['serving_default']

# Random input matching the model's expected shape
image = np.random.random((1, 1200, 1920, 1)).astype(np.float32)
image = tf.convert_to_tensor(image)

# Warmup phase
for i in range(100):
    outputs = infer(inputs=image)

# Benchmark phase: time each call with CUDA events
benchmark_runs = 1000
start_event = cuda.Event()
end_event = cuda.Event()
timings = []
for i in range(benchmark_runs):
    # time.sleep(0.01)  # Uncommenting this changes the behavior
    start_event.record()
    outputs = infer(inputs=image)
    end_event.record()
    end_event.synchronize()
    start_event.synchronize()
    elapsed_time = start_event.time_till(end_event)  # milliseconds
    timings.append(elapsed_time)

print("Average inference time (ms):", np.mean(timings))
plt.plot(np.arange(len(timings)), timings)
plt.savefig("diagram.png")
```
The issue arises when I uncomment the `time.sleep(0.01)` line in the benchmarking loop:

- Without `time.sleep(0.01)`: the average inference time (`np.mean(timings)`) is approximately 9 ms.
- With `time.sleep(0.01)`: the average inference time drops drastically to about 0.5 ms.
I plotted the per-iteration timings for both cases; the diagrams are attached below:

- With `time.sleep(0.01)`: [timing plot]
- Without `time.sleep(0.01)`: [timing plot]
An additional interesting observation: when I optimize the model with `tf.experimental.tensorrt.Converter` and pass real data through `converter.build(input_fn=my_input_fn)` instead of random images, the execution-timing plot (without `time.sleep(0.01)`) looks like this: [timing plot]
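For context, the conversion itself follows the standard TF-TRT flow; a rough sketch is below (the input SavedModel path and the exact `input_fn` are simplified placeholders, not my exact script):

```python
import numpy as np
import tensorflow as tf

# Sketch of the TF-TRT conversion (paths and input_fn are placeholders).
params = tf.experimental.tensorrt.ConversionParams(precision_mode="FP16")
converter = tf.experimental.tensorrt.Converter(
    input_saved_model_dir="saved-model",   # original (non-TRT) SavedModel
    conversion_params=params,
)
converter.convert()

def my_input_fn():
    # Yields tuples of input tensors; here random data with the model's input shape.
    # In the "real data" experiment this yields actual images instead.
    yield (tf.constant(np.random.random((1, 1200, 1920, 1)).astype(np.float32)),)

converter.build(input_fn=my_input_fn)      # pre-builds the TensorRT engines
converter.save("tensorRT-model-FP16")
```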
Why does adding a `time.sleep` reduce the reported inference time? Is this an artifact of the CUDA event timing mechanism, or does it relate to TensorRT's execution pipeline and synchronization? Could it also be a thermal issue on my Jetson device or a hardware limitation? Interestingly, when I run the same script on a 1080 Ti GPU, the execution time stays consistent regardless of whether the sleep is there. Any insights or recommendations for achieving accurate timing measurements would be greatly appreciated.
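To make the synchronization part of the question concrete, this is the kind of variant I am wondering about: wall-clock timing that explicitly waits for the GPU work by pulling an output back to the host each iteration (just a sketch reusing `infer` and `image` from the script above, not something I have validated yet):

```python
import time
import numpy as np

def benchmark_host_side(infer, image, runs=1000):
    """Wall-clock timing; copying an output to the host blocks until the GPU is done."""
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        outputs = infer(inputs=image)
        _ = next(iter(outputs.values())).numpy()  # forces completion of this call
        timings.append((time.perf_counter() - t0) * 1000.0)  # milliseconds
    return np.mean(timings)
```

The idea is that `.numpy()` forces the result back to the host, so the measured interval should include all of the GPU work for that call.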
Note: I created the TensorRT model using `tf.experimental.tensorrt.Converter` with `precision_mode` set to FP16. However, when I benchmark with Nvidia's own (standalone) TensorRT, it consistently runs in 17.5 milliseconds.
You can download the model from this link:
tensorRT-model-FP16.zip (331.0 KB)