Does cuDNN 8.1 solve the performance problem when running with darknet yolov3?

First, I’m assuming that the patching is done to make PJ Reddie’s darknet-yolov3 work with it. There was about a 15% performance loss with 8.0. I saw mention of some bug fixes; does one of them restore the performance that darknet lost?

Cheers,
Kim Hendrikse

Hi,

There is a performance regression in cuDNN 8.0 on Nano.
And our internal team is still working on it.

But we don’t observe a similar regression on the Xavier NX board.
Do you also see the performance issue with cuDNN 8.0 on Xavier NX?

Thanks.

Well… I thought the original darknet from PJ Reddie didn’t compile against the newer cuDNN without a patch. I tried the patch a long time ago when it first came out, but darknet then didn’t start properly. I think it ran out of memory. At this stage I’m not sure which patch is the correct one to fix compilation of the original PJ Reddie darknet code in order to test that.

Do you have any idea what the latest recommended patch to the original PJ Reddie code is in order to get it compiling correctly against cuDNN 8.0? If so, I can re-test.

Cheers,
Kim

Hi,

Please check this comment for the details:

It should also work on JetPack 4.5.

Thanks.

Thanks! I’ll try it and report back.

Unfortunately, this doesn’t work. I applied the patch and rebuilt using

GPU=1
CUDNN=1
OPENCV=0
OPENMP=0
DEBUG=0

ARCH= -gencode arch=compute_72,code=sm_72 \
      -gencode arch=compute_72,code=[sm_72,compute_72]

But it loads really slowly and then dies before finishing, with:
67 conv 1024 3 x 3 / 1 19 x 19 x 512 → 19 x 19 x1024 3.407 BFLOPs
68 res 65 19 x 19 x1024 → 19 x 19 x1024
69 conv 512 1 x 1 / 1 19 x 19 x1024 → 19 x 19 x 512 0.379 BFLOPs
70 conv 1024 3 x 3 / 1 19 x 19 x 512 → 19 x 19 x1024 3.407 BFLOPs
71 res 68 19 x 19 x1024 → 19 x 19 x1024
72 conv 512 1 x 1 / 1 19 x 19 x1024 → 19 x 19 x 512 0.379 BFLOPs
73 conv 1024 3 x 3 / 1 19 x 19 x 512 → 19 x 19 x1024 3.407 BFLOPs
74 res 71 19 x 19 x1024 → 19 x 19 x1024
75 conv 512 1 x 1 / 1 19 x 19 x1024 → 19 x 19 x 512 0.379 BFLOPs
76 Killed

:-(

Kim Hendrikse

I’m trying to run the full yolov3 model with a 608x608 image size.

With memory usage as follows:

tegrastats

RAM 6659/7766MB (lfb 174x4MB) SWAP 31/3883MB (cached 0MB) CPU [53%@1420,8%@1420,0%@1420,0%@1420,16%@1420,2%@1420] EMC_FREQ 39%@1600 GR3D_FREQ 99%@1109 VIC_FREQ 0%@115 APE 150 MTS fg 0% bg 0% AO@26C GPU@27.5C PMIC@100C AUX@26C CPU@27.5C thermal@27.05C VDD_IN 7104/7104 VDD_CPU_GPU_CV 2409/2409 VDD_SOC 1304/1304
RAM 4475/7766MB (lfb 626x4MB) SWAP 31/3883MB (cached 0MB) CPU [30%@1420,0%@1420,32%@1420,21%@1420,0%@1420,10%@1420] EMC_FREQ 38%@1600 GR3D_FREQ 0%@1109 VIC_FREQ 0%@115 APE 150 MTS fg 0% bg 1% AO@26.5C GPU@27C PMIC@100C AUX@26C CPU@28C thermal@27.05C VDD_IN 6451/6777 VDD_CPU_GPU_CV 2041/2225 VDD_SOC 1223/1263
RAM 6706/7766MB (lfb 174x4MB) SWAP 31/3883MB (cached 0MB) CPU [8%@1420,4%@1420,50%@1420,0%@1420,15%@1420,2%@1420] EMC_FREQ 27%@1600 GR3D_FREQ 99%@1109 VIC_FREQ 0%@115 APE 150 MTS fg 0% bg 2% AO@26.5C GPU@27.5C PMIC@100C AUX@26C CPU@27.5C thermal@27.1C VDD_IN 5879/6478 VDD_CPU_GPU_CV 1714/2054 VDD_SOC 1141/1222

Hi,

You are right.
We can also reproduce this issue in our environment.

Sorry, we were testing with the YOLOv3 Tiny model before.
We will check this internally and share more information with you later.

Thanks.

Excellent! My new project connecting YOLO implementations to useful preventative security on Jetson platforms is getting closer to release. I hope it will be useful to a lot of people. I’m getting excited, but there’s still a little work left.

In the meantime, it will support AlexeyAB’s versions of yolov3 and yolov4 as well as PJ Reddie’s implementation of yolov3. Strangely, with identical model and weights, in my testing the yolov3 classifications from AlexeyAB’s implementation and PJ Reddie’s are very different. In my testing over the last two years of detecting humans for security purposes, PJ Reddie’s implementation generated fewer false positives than yolov4.
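For anyone wanting to reproduce that kind of comparison, a rough sketch of how I’d run the two builds side by side follows. The detect_* wrappers are hypothetical stand-ins for calls into the two darknet builds, not real APIs:

from collections import Counter

def detect_pjreddie(image_path):
    # Hypothetical wrapper around the PJ Reddie darknet build;
    # returns a list of (label, confidence) pairs.
    return []

def detect_alexeyab(image_path):
    # Hypothetical wrapper around the AlexeyAB darknet build.
    return []

def compare(image_paths, threshold=0.5):
    # Count per-class detections above a confidence threshold and
    # report any image where the two implementations disagree.
    for path in image_paths:
        a = Counter(l for l, c in detect_pjreddie(path) if c >= threshold)
        b = Counter(l for l, c in detect_alexeyab(path) if c >= threshold)
        if a != b:
            print(path, "pjreddie:", dict(a), "alexeyab:", dict(b))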

So there are likely still use cases where this is very important, and people may sometimes choose PJ Reddie’s implementation despite the slowdown. Fixing the cuDNN support for this code would really be appreciated, I think.

Thanks for looking into it.

Kim Hendrikse

Hi,

Thanks for the feedback.

We didn’t notice the accuracy difference between these two repositories.
It seems many of our users switched to the AlexeyAB version after the cuDNN v8 release, since it is kept updated.

We are still working on the YOLOv3 issue. Hope to give you feedback soon.

Thanks.

Thanks, that’s great. It’s understandable that people switch, as the accuracy was better. However, that should be put into context and perspective if your goal is to detect people for security purposes with low false positives. For example (anecdotally, from a low number of tests), when analysing one of my false positives from YOLOv4, I noticed that YOLOv4 correctly detected many more flower pots in the image than PJ Reddie’s YOLOv3 did. That will help improve the overall score of YOLOv4, but not in a manner that is relevant to people detection for security. Again anecdotally, it appears that YOLOv4 is a little more eager to see things, and as such it correctly sees more objects (perhaps non-people objects) than PJ Reddie’s YOLOv3, which all goes to increase its accuracy. It’s possible that this behavior also results in more correct matches of small people images (i.e. people far away), which increases the overall score.

However, if I had to choose between getting twice as many correct matches of people far away that would normally be missed, and getting a few extra false positives close up, then for security I would choose fewer false positives: far away is less of a threat, and false positives really mess up the whole process of monitoring.

I hope my explanation makes sense. It’s really an attempt to say that the overall accuracy score is not the only thing that matters when choosing an algorithm for security, and why the PJ Reddie algorithm is still very relevant in the security field.

In my upcoming software release I’ll be allowing the user to choose one of three algorithms, so they can choose (and test) for themselves which suits them better. In short, if the field of view is uncluttered and doesn’t contain much noise that can trigger false positives, then it’s quite viable to choose the faster algorithm. In very cluttered environments, where you do not want to make too many exclusion zones, choosing the algorithm with a lower incidence of false positives is likely a better choice.
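To make that concrete, the kind of per-detection filtering I have in mind looks roughly like this. It’s a sketch only; the detection format and the zone/size values are made up for illustration:

# Sketch: keep only person detections, drop tiny (far-away) boxes,
# and suppress anything inside a user-defined exclusion zone.
# The (label, confidence, (x, y, w, h)) format is made up for illustration.

EXCLUSION_ZONES = [(0, 0, 200, 150)]  # x, y, w, h rectangles to ignore
MIN_BOX_AREA = 32 * 32                # treat smaller boxes as "far away"

def in_zone(box, zone):
    bx, by, bw, bh = box
    zx, zy, zw, zh = zone
    cx, cy = bx + bw / 2, by + bh / 2  # centre of the detection box
    return zx <= cx <= zx + zw and zy <= cy <= zy + zh

def keep(label, confidence, box):
    if label != "person" or confidence < 0.5:
        return False
    if box[2] * box[3] < MIN_BOX_AREA:  # too small: far away, low threat
        return False
    return not any(in_zone(box, z) for z in EXCLUSION_ZONES)

The point is that a close-up false positive survives all of these filters and triggers an alert, while a genuine but distant detection may not, which is why the detector’s base rate of false positives matters more to me than its overall score.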

Hope this helps.
Kim Hendrikse

Hi,

Thanks for the comment.
We are still checking the OOM issue on YOLOv3.

But if DeepStream is acceptable, we can deploy the model with the TensorRT library.

/opt/nvidia/deepstream/deepstream-5.1/sources/objectDetector_Yolo/

Thanks.

It’s too much of a change from what I’m trying to do at the moment. Plus I have not done anything with DeepStream yet, so there will be a learning curve. For people using the yolov4 model it won’t affect them. For the other eventual users there will just be a slight performance cost at the moment, but it will still work.

Thanks anyway.

Kim

Hello,

I am also running pjreddie’s YOLOv3 and encountering the same OOM issue when cuDNN v8 or later is enabled. This fix would be greatly appreciated. Are there any updates on this?

Thank you

Hi,

We checked this issue, and it might be an implementation issue in darknet.
Would you mind running YOLOv3 inference with our DeepStream library instead?

We can run YOLOv3 without issue (<4 GB memory) on Xavier NX.
Please check below for the detailed instructions:

$ cd /opt/nvidia/deepstream/deepstream-5.1/sources/objectDetector_Yolo/
$ sudo cp [path/to/yolov3.weights] .
$ sudo cp [path/to/yolov3.cfg] .
$ sudo CUDA_VER=10.2 make -C nvdsinfer_custom_impl_Yolo/
$ deepstream-app -c deepstream_app_config_yoloV3.txt

Thanks.

This is interesting.

In my app, I’m providing a choice of implementation. This can be one of them. I’ll look when I get a chance.

In my app I expose a websocket server implemented with:

import asyncio
import websockets

# server_me is the handler coroutine defined elsewhere in the app; it
# receives images over the socket and dispatches them for detection.
start_server = websockets.serve(server_me, "0.0.0.0", 8765)

asyncio.get_event_loop().run_until_complete(start_server)
asyncio.get_event_loop().run_forever()

Do you know if your Python bindings are compatible with this approach? The websocket server will receive the images, pass them to your stream for processing, and then have control return to the websocket handler. It’s conceivable that the async Python handling in your bindings is not compatible with another approach to asynchronous processing, I don’t know.
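For what it’s worth, the pattern I’d probably try first is pushing the blocking inference call onto a thread executor, so the event loop keeps serving other connections. A minimal sketch, where infer() is a made-up stand-in for whatever your bindings actually expose:

import asyncio
import websockets

def infer(image_bytes):
    # Hypothetical blocking call: substitute whatever the DeepStream /
    # TensorRT Python bindings provide for synchronous inference.
    return b"detections"

async def server_me(websocket, path):
    loop = asyncio.get_event_loop()
    async for image in websocket:
        # Run the blocking call in a worker thread so the event loop
        # (and other websocket clients) keep running meanwhile.
        result = await loop.run_in_executor(None, infer, image)
        await websocket.send(result)

start_server = websockets.serve(server_me, "0.0.0.0", 8765)
asyncio.get_event_loop().run_until_complete(start_server)
asyncio.get_event_loop().run_forever()

Whether that actually plays well with your bindings is exactly what I’d need to test.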

Kim Hendrikse

Hi,

Do you have an RTSP input?
If yes, you can check the document below:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_ref_app_deepstream.html#source-group

For output data, please check below sample for some ideas:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_ref_app_test5.html

Thanks.

Thanks! I appreciate your tips. I will, however, delay looking into DeepStream until after I’ve released my project.

Kim Hendrikse