Question about inference using TensorRT API


I used Nsight Systems to profile an AI model inference program that uses the TensorRT API on both the Orin Drive platform and an x86 machine equipped with an RTX 2080. The result on Orin Drive is shown in Figure 1, and the result on the x86 machine in Figure 2.

Fig.1 Result on Orin Drive

Fig.2 Result on x86 machine

From my understanding, TensorRT launches kernels asynchronously. However, Figure 1 suggests the behavior is effectively synchronous: on the GPU side, kernel executions are separated by gaps that line up with the CUDA kernel-launch API calls on the CPU side. In contrast, the result on the x86 machine looks much more reasonable.
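For context, the asynchronous pattern I expected looks roughly like the following (a minimal sketch assuming the TensorRT 8-era `enqueueV2` API; `context`, `bindings`, and the buffer pointers are placeholders, not names from my actual program):

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>

// Sketch: asynchronous TensorRT inference on a CUDA stream.
// 'context' is an nvinfer1::IExecutionContext*; 'bindings' holds
// pre-allocated device pointers. All calls before the final
// synchronize return to the CPU immediately after *queuing* work.
void inferAsync(nvinfer1::IExecutionContext* context, void** bindings,
                const void* hostIn, void* devIn, void* devOut, void* hostOut,
                size_t inBytes, size_t outBytes, cudaStream_t stream)
{
    // Async H2D copy: queued on the stream, returns immediately.
    cudaMemcpyAsync(devIn, hostIn, inBytes, cudaMemcpyHostToDevice, stream);

    // enqueueV2 only *enqueues* the network's kernels; each individual
    // kernel launch still costs CPU time inside this call.
    context->enqueueV2(bindings, stream, nullptr);

    // Async D2H copy of the result.
    cudaMemcpyAsync(hostOut, devOut, outBytes, cudaMemcpyDeviceToHost, stream);

    // Block only once, after everything has been queued.
    cudaStreamSynchronize(stream);
}
```

Even with this fully asynchronous pattern, if the CPU cannot enqueue launches faster than the GPU retires them, the GPU timeline would show exactly the gaps seen in Figure 1.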

Furthermore, upon closer examination of the timeline, I observed the kernel behavior on Orin Drive shown in Figure 3.

Fig.3 Kernel on Orin Drive

Based on Figure 3, the kernel-launch API call on the CPU side ends later than the kernel's execution ends on the GPU side. This sequence of events is perplexing, since the kernel should only start executing after the launch API has submitted it.

One possible explanation for this phenomenon could be that the Orin Drive CPU is significantly slower than the x86 machine's CPU. The maximum CPU frequency on Orin Drive is 2009 MHz, whereas the x86 machine reaches a maximum of 3501 MHz.

Does CPU performance really have such a significant impact on the kernel-launch API? From Figure 1, inference latency appears to be bound by kernel-launch latency on the CPU side rather than by kernel execution on the GPU.
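If CPU-side launch overhead is indeed the limit, one commonly suggested mitigation is CUDA Graphs: the whole sequence of launches is captured once and then replayed with a single `cudaGraphLaunch`, amortizing the per-kernel launch cost. A hedged sketch, assuming a capture-compatible engine and the same placeholder `context`/`bindings` as above (and the pre-CUDA-12 `cudaGraphInstantiate` signature):

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>

// Sketch: capture one TensorRT enqueue into a CUDA graph, then replay it.
// Replaying costs one cudaGraphLaunch instead of one CPU launch per kernel.
cudaGraphExec_t captureInference(nvinfer1::IExecutionContext* context,
                                 void** bindings, cudaStream_t stream)
{
    // Warm-up enqueue before capture, as NVIDIA recommends.
    context->enqueueV2(bindings, stream, nullptr);
    cudaStreamSynchronize(stream);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    context->enqueueV2(bindings, stream, nullptr);  // recorded, not executed
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphDestroy(graph);
    return graphExec;
}

// Per inference, a single launch then submits the entire network:
//   cudaGraphLaunch(graphExec, stream);
//   cudaStreamSynchronize(stream);
```

As a quick experiment without code changes, `trtexec` also exposes a `--useCudaGraph` flag, which might show whether launch overhead is the bottleneck on Orin Drive.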

Could there be any other causes for the behavior described above? If so, I would appreciate any pointers.