Inference slow using nvInfer and TensorRT directly on PX2

Hi,

I have integrated jetson-inference on the Drive PX2 (GitHub: GitHub - dusty-nv/jetson-inference: Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson). I have modified a sample to detect multiple objects using the jetson-inference library.

I call a function Detect() which is defined in detectNet.cpp. Inside Detect(), 'execute' is used (see line 513 of https://github.com/dusty-nv/jetson-inference/blob/master/detectNet.cpp).

I used std::chrono::high_resolution_clock to measure the time taken by the 'execute' call, which in turn gives me the time my network takes to detect the objects in a picture.
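For reference, the measurement wraps the execute() call roughly like this (a minimal sketch, not my exact code; 'context' and 'buffers' stand for the TensorRT execution context and the CUDA buffers that tensorNet.cpp sets up):

#include <chrono>
#include <iostream>
#include <NvInfer.h>   // nvinfer1::IExecutionContext (TensorRT 3.x API)

// Time a single synchronous execute() call.
// 'context' and 'buffers' are assumed to be created elsewhere
// (engine deserialization and CUDA buffer allocation as in tensorNet.cpp).
void timedExecute(nvinfer1::IExecutionContext* context, void** buffers, int batchSize)
{
    auto start = std::chrono::high_resolution_clock::now();

    // Synchronous inference on the bound input/output buffers.
    context->execute(batchSize, buffers);

    auto end = std::chrono::high_resolution_clock::now();
    auto ms  = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "execute() took " << ms << " ms" << std::endl;
}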

My initial results on the host PC with FP32 precision were between 50-60 ms per image.

After a few weeks, I ran the same time measurement on the same 'execute' call again and saw that the detection now takes between 150-170 ms, roughly three times slower than the original measurement on my host PC.

After some digging, I found that the 'execute' call comes from NvInfer.h. I used a Caffe model and prototxt file to create a TensorRT FP32 engine. detectNet::Create builds the engine using tensorNet.cpp, which is provided in the GitHub repository.
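As I understand tensorNet.cpp, the FP32 engine is built through the TensorRT 3.x Caffe-parser path, roughly like the sketch below (the output blob names, batch size and workspace size here are placeholders, not the exact values from the repository):

#include <NvInfer.h>
#include <NvCaffeParser.h>

// Sketch of an FP32 engine build from a Caffe prototxt + caffemodel
// with the TensorRT 3.x API, similar to what tensorNet.cpp does.
nvinfer1::ICudaEngine* buildEngine(nvinfer1::ILogger& logger,
                                   const char* prototxt, const char* model)
{
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
    nvinfer1::INetworkDefinition* network = builder->createNetwork();

    nvcaffeparser1::ICaffeParser* parser = nvcaffeparser1::createCaffeParser();
    const nvcaffeparser1::IBlobNameToTensor* blobs =
        parser->parse(prototxt, model, *network, nvinfer1::DataType::kFLOAT);

    // Mark the detection outputs (names depend on the network).
    network->markOutput(*blobs->find("coverage"));
    network->markOutput(*blobs->find("bboxes"));

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(16 << 20);   // 16 MB workspace (placeholder)

    nvinfer1::ICudaEngine* engine = builder->buildCudaEngine(*network);

    network->destroy();
    parser->destroy();
    builder->destroy();
    return engine;
}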

I came across 'nvprof', but it gives me an error on execution.

I can't figure out why the detection became three times slower.

Any help would be appreciated.

Thanks

Dear mayank,
Very strange. Has the time increased on another machine (your colleague's) too? Please double-check the input image size and the TensorRT/cuDNN versions.
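To rule out a mismatch between the installed packages and what the binary actually uses, a small check of the compile-time and runtime versions can help (a sketch, assuming the version macros in NvInfer.h and cudnn.h are available in your releases):

#include <cstdio>
#include <NvInfer.h>   // NV_TENSORRT_MAJOR / MINOR / PATCH
#include <cudnn.h>     // CUDNN_MAJOR / MINOR / PATCHLEVEL

int main()
{
    // Compile-time versions of the headers the app was built against.
    std::printf("TensorRT (NvInfer) headers: %d.%d.%d\n",
                NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH);
    std::printf("cuDNN headers: %d.%d.%d\n",
                CUDNN_MAJOR, CUDNN_MINOR, CUDNN_PATCHLEVEL);

    // Runtime version of the cuDNN library actually loaded.
    std::printf("cuDNN runtime: %zu\n", cudnnGetVersion());
    return 0;
}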

Hi Siva,

At the moment only I have the right libraries installed, so my colleagues cannot check it.

We checked it on the PX2, and it gave the same result as my host PC. That was the reason I checked it again.

The input size I use is 1920x1208, with TensorRT 3.0 and NvInfer 4.0.

cuDNN should be 7.1 for CUDA 9.0.

Regards
Mayank

Dear mayank.mahajan,
Please check dpkg -l | grep TensorRT on your machine. It should report something like the following:

ii  libnvinfer-dev                                           4.0.2-1+cuda9.0                              amd64        TensorRT development libraries and headers
ii  libnvinfer-samples                                       4.0.2-1+cuda9.0                              amd64        TensorRT samples and documentation
ii  libnvinfer4                                              4.0.2-1+cuda9.0                              amd64        TensorRT runtime libraries
ii  python3-libnvinfer                                       4.0.2-1+cuda9.0                              amd64        Python 3 bindings for TensorRT
ii  python3-libnvinfer-dev                                   4.0.2-1+cuda9.0                              amd64        Python 3 development package for TensorRT
ii  python3-libnvinfer-doc                                   4.0.2-1+cuda9.0                              amd64        Documention and samples of python bindings for TensorRT
ii  tensorrt                                                 3.0.2-1+cuda9.0                              amd64        Meta package of TensorRT
ii  uff-converter-tf                                         4.0.2-1+cuda9.0                              amd64        UFF converter for TensorRT pack

Source: https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt_302/tensorrt-install-guide/index.html

It could be an issue with the state of your host machine. You may have to reinstall and set up your machine appropriately again. Do you notice a change in timings on the Drive PX2 as well?

Hi Siva,

I have cross-checked the versions. They show exactly the same output as in your comment.

We checked the time on the Drive PX2 and noticed that it takes 150 ms. Then I cross-checked on the host machine again, and it also showed 150 ms.

I found it very strange.

In my opinion, the time per detection should be around 50 ms, as measured before and as also seen with the DriveWorks SDK.
INT8 inference takes around 110 ms, which is also a lot.

Any ideas where it can go wrong?

Regards
Mayank

Dear Mayank,
Does that mean the time on the Drive PX2 has also increased? If your host PC's GPU and the Drive PX2 dGPU have similar computational power, we can expect them to have similar inference timings. Please run deviceQuery from the CUDA samples to check the GPU configuration on both the host and the Drive PX2.
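If running the full deviceQuery sample is inconvenient, a minimal equivalent with the CUDA runtime API prints the same key figures (a sketch, not the sample itself):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int i = 0; i < count; ++i)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        // Name, compute capability, SM count and clock rate give a
        // rough idea of each GPU's compute power.
        std::printf("GPU %d: %s, SM %d.%d, %d multiprocessors, %.0f MHz\n",
                    i, prop.name, prop.major, prop.minor,
                    prop.multiProcessorCount, prop.clockRate / 1000.0);
    }
    return 0;
}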

Hi Siva,

This is the first time we have tested our object detection project on the Drive PX2, so I cannot comment on whether the time increased or decreased there.

Based on your comment, I guess that if my host PC showed an increase, it makes sense that the PX2 also showed 160 ms.

I can run deviceQuery on the Drive PX2 and on my host PC.

I will update you as soon as possible.

Regards
Mayank