How to run a TensorRT-based deep learning model in real time?

I optimized my deep learning model with TensorRT. A C++ interface runs inference on images with the optimized model on a Jetson TX2. This interface provides 60 FPS on average, but it is not stable: inference rates range between 50 and 160 FPS. I need to run this system in real time on an RTOS patched Jetson TX2.

So what are your thoughts on real-time inference with TensorRT? Is it possible to develop a real-time inference system with TensorRT, and if so, how?

I have tried setting high priorities on the process and its threads to get preemption. I expect approximately the same FPS on every inference, i.e. I need a deterministic inference time, but the system does not behave deterministically. Maybe TensorRT is not suitable for real time.
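
For reference, this is roughly the kind of priority setup I tried (a minimal sketch using SCHED_FIFO; the helper name and priority value are just illustrative, not my exact code):

[code]
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Raise the calling thread to a real-time FIFO priority.
// The priority value is illustrative; the process needs CAP_SYS_NICE
// (or root) for this call to succeed.
static bool setRealtimePriority(int priority)   // e.g. 80, from 1..99
{
    sched_param param{};
    param.sched_priority = priority;
    int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
    if (rc != 0)
    {
        std::fprintf(stderr, "pthread_setschedparam failed: %d\n", rc);
        return false;
    }
    return true;
}
[/code]

I call something like this from the inference thread before the inference loop starts.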

Thanks.

Hello,

TRT inference times are expected to be fairly deterministic, even when running on a non-realtime OS. Can you provide more details on your “RTOS patched Jetson TX2”? Also, any details on your inference workflow/infrastructure, or even a reproducible example, would help us debug.

Hello,

Thank you for your fast response. I’m sorry, I should have said “real-time patched” instead of “RTOS patched”. To patch the Jetson TX2, I used kozyilmaz’s guide [1]. After patching, the Jetson successfully passed the real-time tests.

You said “even when running on non-realtime OS”, but when running on the non-real-time Jetson the FPS results are also non-deterministic. And the non-real-time Jetson TX2 provides higher FPS than the real-time-patched Jetson.

My inference infrastructure is a modified version of sampleUffMNIST from the TensorRT samples [2], and this figure [3] shows my model’s layers. The only difference is that I use a resize layer instead of a flatten layer because of the ops TensorRT supports.
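
To check the jitter, I time each call roughly like this (a simplified sketch assuming an already-built engine, an IExecutionContext named context, and device bindings in buffers; not my exact code):

[code]
#include <NvInfer.h>
#include <algorithm>
#include <chrono>
#include <cstdio>

// Time 'iterations' consecutive inferences and print min/max/avg latency.
// 'context' is the IExecutionContext of the built engine, 'buffers' holds
// the device pointers for the input/output bindings, and 'batchSize'
// matches the batch size the engine was built for.
void benchmark(nvinfer1::IExecutionContext* context, void** buffers,
               int batchSize, int iterations)
{
    double minMs = 1e9, maxMs = 0.0, totalMs = 0.0;
    for (int i = 0; i < iterations; ++i)
    {
        auto start = std::chrono::high_resolution_clock::now();
        if (!context->execute(batchSize, buffers))   // blocking inference
            std::fprintf(stderr, "inference %d failed\n", i);
        auto end = std::chrono::high_resolution_clock::now();
        double ms = std::chrono::duration<double, std::milli>(end - start).count();
        minMs = std::min(minMs, ms);
        maxMs = std::max(maxMs, ms);
        totalMs += ms;
    }
    std::printf("min %.2f ms  max %.2f ms  avg %.2f ms\n",
                minMs, maxMs, totalMs / iterations);
}
[/code]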

Thanks…

[1]. [url]https://github.com/kozyilmaz/nvidia-jetson-rt/blob/master/docs/README.03-realtime.md[/url]
[2]. [url]https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#mnist_uff_sample[/url]
[3]. [url]https://pasteboard.co/I752Wby.png[/url]

I just found the “Persistent Threads” topic [1]. Concurrent RT says:

"The use of the persistent threads style can improve determinism significantly,
making modest sized workloads viable for such applications. 
The persistent threads model avoids these determinism problems by launching a CUDA kernel only once,
at the start of the application, and causing it to run until the application ends."

But I cannot find any examples of persistent threading with TensorRT on the Jetson TX2. Has anyone tried out this method? A bare-bones sketch of what I mean is below the reference.

[1]. https://www.concurrent-rt.com/wp-content/uploads/2016/09/Improving-Real-Time-Performance-With-CUDA-Persistent-Threads.pdf
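
For reference, this is the pattern I understand by a “persistent kernel” (a toy sketch of the general idea from the paper, not integrated with TensorRT; the flag-based handshake and the placeholder work are just illustrative):

[code]
#include <cuda_runtime.h>
#include <cstdio>

// Toy persistent kernel: launched once with a single block, it waits until
// the host posts work, processes one "frame", clears the flag, and exits
// when 'quit' is set. Real uses need careful fencing and work distribution.
__global__ void persistentKernel(volatile int* workFlag, volatile int* quit,
                                 float* data, int n)
{
    while (true)
    {
        if (threadIdx.x == 0)
            while (!*workFlag && !*quit) { }   // busy-wait for the host
        __syncthreads();
        if (*quit)
            return;

        for (int i = threadIdx.x; i < n; i += blockDim.x)
            data[i] += 1.0f;                   // placeholder per-frame work

        __syncthreads();
        __threadfence_system();
        if (threadIdx.x == 0)
            *workFlag = 0;                     // tell the host we are done
    }
}

int main()
{
    const int n = 1024;
    volatile int *hWork, *hQuit;               // host views of the flags
    int *dWork, *dQuit;                        // device views of the flags
    float* dData;
    cudaHostAlloc((void**)&hWork, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void**)&hQuit, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&dWork, (int*)hWork, 0);
    cudaHostGetDevicePointer((void**)&dQuit, (int*)hQuit, 0);
    cudaMalloc(&dData, n * sizeof(float));
    cudaMemset(dData, 0, n * sizeof(float));
    *hWork = 0;
    *hQuit = 0;

    persistentKernel<<<1, 256>>>(dWork, dQuit, dData, n);  // launched once

    for (int frame = 0; frame < 100; ++frame)
    {
        *hWork = 1;                            // post one frame of work
        while (*hWork) { }                     // spin until the kernel clears it
    }

    *hQuit = 1;                                // ask the kernel to exit
    cudaDeviceSynchronize();

    float result;
    cudaMemcpy(&result, dData, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("data[0] after 100 frames: %f\n", result);
    cudaFree(dData);
    cudaFreeHost((int*)hWork);
    cudaFreeHost((int*)hQuit);
    return 0;
}
[/code]

What I cannot figure out is how to drive TensorRT’s own kernels this way, since the engine launches its kernels internally.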

I read this comment on NVIDIA DevTalk: “If your code uses floating-point atomics, results may differ from run to run because floating-point operations are generally not associative, and the order in which data enters a computation (e.g. a sum) is non-deterministic when atomics are used.”

I used the FP16 precision type when optimizing my model with TensorRT. Is it possible to get deterministic output when using FP16 precision? Any thoughts?
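
To convince myself that the quoted effect is real, I tried a tiny toy along those lines (plain CUDA, nothing to do with TensorRT): summing the same values with float atomicAdd. The totals can differ slightly from run to run:

[code]
#include <cuda_runtime.h>
#include <cstdio>

// Each thread atomically adds its value to a single accumulator.
// Because floating-point addition is not associative and the order of
// atomics is not defined, the final sum can differ slightly between runs.
__global__ void atomicSum(const float* values, float* sum, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(sum, values[idx]);
}

int main()
{
    const int n = 1 << 20;
    float* hValues = new float[n];
    for (int i = 0; i < n; ++i)
        hValues[i] = (i % 2 ? 1.0f : 1e-7f);   // mix of large and tiny values

    float *dValues, *dSum;
    cudaMalloc(&dValues, n * sizeof(float));
    cudaMalloc(&dSum, sizeof(float));
    cudaMemcpy(dValues, hValues, n * sizeof(float), cudaMemcpyHostToDevice);

    for (int run = 0; run < 5; ++run)
    {
        cudaMemset(dSum, 0, sizeof(float));
        atomicSum<<<(n + 255) / 256, 256>>>(dValues, dSum, n);
        float sum;
        cudaMemcpy(&sum, dSum, sizeof(float), cudaMemcpyDeviceToHost);
        std::printf("run %d: sum = %.6f\n", run, sum);  // may vary per run
    }

    cudaFree(dValues);
    cudaFree(dSum);
    delete[] hValues;
    return 0;
}
[/code]

Of course I don’t know whether TensorRT’s FP16 kernels actually use atomics internally; this only illustrates the point the comment makes.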