I have integrated jetson-inference on the Drive PX2 (jetson-inference on GitHub: https://github.com/dusty-nv/jetson-inference). I modified a sample to detect multiple objects using the jetson-inference library.
I call the Detect function, which is defined in detectNet.cpp. Inside Detect, ‘execute’ is called (from line 513 on in https://github.com/dusty-nv/jetson-inference/blob/master/detectNet.cpp).
I used std::chrono::high_resolution_clock to measure how long the ‘execute’ call takes, which in turn gives me the time my network needs to detect the objects in one image.
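A measurement of this kind can be sketched as follows. This is only a minimal sketch, not the code from the sample: the helper names timeRunsMs and medianMs are hypothetical, and it assumes the timed call blocks until its result is ready. It uses std::chrono::steady_clock, which is guaranteed to be monotonic, discards a few warm-up runs so that one-time initialization does not skew the numbers, and reports the median over several runs instead of a single sample.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of jetson-inference): calls `fn` `warmup`
// times without timing it, then `runs` more times, and returns the duration
// of each timed call in milliseconds.
template <typename Fn>
std::vector<double> timeRunsMs(Fn&& fn, int warmup, int runs)
{
    assert(warmup >= 0 && runs >= 0);
    using Clock = std::chrono::steady_clock;  // monotonic clock

    for (int i = 0; i < warmup; ++i)
        fn();  // warm-up calls, not measured

    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = Clock::now();
        fn();  // assumed to return only once the work is done
        const auto t1 = Clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}

// Median of the samples; returns 0.0 for an empty vector.
double medianMs(std::vector<double> ms)
{
    if (ms.empty())
        return 0.0;
    std::sort(ms.begin(), ms.end());
    const std::size_t n = ms.size();
    return (n % 2 == 1) ? ms[n / 2] : 0.5 * (ms[n / 2 - 1] + ms[n / 2]);
}
```

In the detection code, the lambda passed to timeRunsMs would wrap the ‘execute’ call, for example with 3 warm-up runs and 20 timed runs on the same image.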
My initial results on the host PC with FP32 precision were between 50 and 60 ms per image.
A few weeks later, I ran the same time measurement on the ‘execute’ call again and saw that detection now takes between 150 and 170 ms, which is roughly 3 times slower than the original measurement on my host PC.
After some digging, I found that ‘execute’ is declared in NvInfer.h. I used a Caffe model and prototxt file to create a TensorRT FP32 engine. detectNet::create builds the engine using tensorNet.cpp, which is provided in the GitHub repository.
I came across ‘nvprof’, but it gives me an error on execution.
I can't figure out how the detection became roughly 3 times slower.
Any help would be appreciated.
Very strange. Did the time also increase on another machine (your colleague’s)? Double-check the input image size and the TRT and cuDNN versions.
At the moment, only I have the right libraries installed, so my colleagues cannot check it.
We checked it on the PX2, and it gave the same result as my host PC. That was the reason I checked it again.
The input size I use is 1920x1208, with TRT 3.0 and NvInfer 4.0.
cuDNN should be 7.1 for CUDA 9.0.
Check dpkg -l | grep TensorRT on your machine. It should report something like the following:
ii libnvinfer-dev 4.0.2-1+cuda9.0 amd64 TensorRT development libraries and headers
ii libnvinfer-samples 4.0.2-1+cuda9.0 amd64 TensorRT samples and documentation
ii libnvinfer4 4.0.2-1+cuda9.0 amd64 TensorRT runtime libraries
ii python3-libnvinfer 4.0.2-1+cuda9.0 amd64 Python 3 bindings for TensorRT
ii python3-libnvinfer-dev 4.0.2-1+cuda9.0 amd64 Python 3 development package for TensorRT
ii python3-libnvinfer-doc 4.0.2-1+cuda9.0 amd64 Documention and samples of python bindings for TensorRT
ii tensorrt 3.0.2-1+cuda9.0 amd64 Meta package of TensorRT
ii uff-converter-tf 4.0.2-1+cuda9.0 amd64 UFF converter for TensorRT pack
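To compare a listing like this between two machines programmatically, the name and version columns can be extracted and checked for consistency. The following is a minimal sketch under stated assumptions: parseDpkgList and versionsAgree are hypothetical helpers, and the input is assumed to be whitespace-separated `dpkg -l` lines of the form "ii name version arch description". Note that in the listing above the tensorrt metapackage is at 3.0.2 while the libnvinfer packages are at 4.0.2, so only packages from the same family should be compared with each other.

```cpp
#include <map>
#include <sstream>
#include <string>

// Package name -> installed version string.
using PackageVersions = std::map<std::string, std::string>;

// Hypothetical helper: parses `dpkg -l` style text and keeps the name and
// version of every line whose status column is "ii" (installed).
PackageVersions parseDpkgList(const std::string& text)
{
    PackageVersions versions;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string status, name, version;
        if ((fields >> status >> name >> version) && status == "ii")
            versions[name] = version;
    }
    return versions;
}

// Returns true when all packages whose name contains `key` share one version
// string. Also returns true when no package name contains `key`.
bool versionsAgree(const PackageVersions& versions, const std::string& key)
{
    const std::string* first = nullptr;
    for (const auto& entry : versions) {
        if (entry.first.find(key) == std::string::npos)
            continue;  // package is not in the family being checked
        if (first == nullptr)
            first = &entry.second;
        else if (*first != entry.second)
            return false;
    }
    return true;
}
```

For example, versionsAgree(parseDpkgList(listing), "nvinfer") checks that all libnvinfer packages on one machine report the same version; the two parsed maps from the host and the Drive PX2 can then be compared entry by entry.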
It could be an issue with the state of your host machine. You may have to reinstall and set up your machine again. Do you notice a change in timings on the Drive PX2 as well?
I have cross-checked the versions. The output is exactly the same as in your comment.
We checked the time on the Drive PX2 and noticed that detection takes 150 ms. I then cross-checked on the host machine again, and it also showed 150 ms.
I found it very strange.
In my opinion, the time per detection should be around 50 ms, as seen before and as also seen with the DW SDK.
INT8 inference takes around 110 ms, which is also a lot.
Any ideas where it could be going wrong?
Does that mean the time on the Drive PX2 also increased? If your host PC’s GPU and the Drive PX2 dGPU have similar computational power, we can expect them to have similar inference timings. Please run deviceQuery from the CUDA samples to check the GPU configuration on both the host and the Drive PX2.
This was the first time we tested the object detection project on the Drive PX2, so I cannot say whether the time increased or decreased there.
From your comment, I gather that if my host PC shows an increase, it makes sense that the PX2 also shows 160 ms.
I can run deviceQuery on the Drive PX2 and my host PC.
I will update you as soon as possible.