TensorRT on PX2 is sometimes very slow

I use a neural network to process lidar point clouds for object detection. It is implemented with TensorRT on the PX2. The normal processing time for one frame of point clouds is 30 ms. However, sometimes the processing time for one frame is 3 s right after my module starts, which is 100 times slower.
I have to restart my module; then it works fine again at 30 ms. This happens with roughly 50% probability.
My guess is that when the processing time is 3 s, the GPU is not being used and the work is running on the CPU instead, which would explain the 100x slowdown.
How can I check which TensorRT function returns an error?

Thanks a lot

Dear @mpescho,
Does this 30 ms represent just the inference of the DNN? Are you using DW or TRT APIs to run your DNN? The inference portion of TensorRT uses CUDA kernels for the DNN layers and is expected to run on the GPU only. Can you share some more information about the workflow in your application so we can understand it better?

Dear Siva, thanks for your suggestion. I have tested it again. It seems that the inference takes 25 ms; the issue is that NMS takes about 3 s.
Thanks. I'll continue to debug it.

Dear @mpescho,
Did you write a CUDA kernel to implement NMS, or is it implemented on the CPU? If it runs on the CPU, it is possible that other running CPU processes are affecting the NMS timing. Could you check in top whether any other processes are running when the code path enters the NMS operation?
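For reference, a typical CPU-side NMS is the greedy O(n²) algorithm below: sort detections by score, then keep a box only if its IoU with every already-kept box is below a threshold. This is a generic sketch, not the poster's actual implementation; on a few thousand boxes it should run in milliseconds, so multi-second timings usually point to CPU contention or an accidental re-run per box:

```cpp
#include <algorithm>
#include <vector>

struct Box { float x1, y1, x2, y2, score; };

// Intersection-over-union of two axis-aligned boxes.
float iou(const Box& a, const Box& b) {
    float ix = std::max(0.f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    float iy = std::max(0.f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    float inter = ix * iy;
    float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (areaA + areaB - inter);
}

// Greedy NMS: keep the highest-scoring box, drop any box whose IoU with
// an already-kept box exceeds the threshold.
std::vector<Box> nmsCpu(std::vector<Box> boxes, float iouThresh) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& b : boxes) {
        bool keep = true;
        for (const Box& k : kept) {
            if (iou(b, k) > iouThresh) { keep = false; break; }
        }
        if (keep) kept.push_back(b);
    }
    return kept;
}
```

If this CPU version is still too slow under load, the usual fix on PX2 is to move the IoU tests into a CUDA kernel so NMS stays on the GPU with the rest of the pipeline.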