• Hardware Platform (Jetson / GPU) - Jetson Nano
• DeepStream Version - 5.1
• JetPack Version (valid for Jetson only) - 4.5.1
• TensorRT Version - 7.1.3
• Issue Type (questions, new requirements, bugs) - questions
I am attempting to augment my home security system using DeepStream on a Jetson Nano. My pipeline takes six cameras in over RTSP and runs the default example, which uses the resnet10 Caffe model. After playing with it for a while, I'm trying to figure out the best way to fit it into my use case, and I have a few somewhat scattered questions. My background may help some: I've been developing C++ software for 20+ years and have barely dipped my toe into the perception and machine learning realm. I can sling code with the best of them, but when it comes to tensors and models I'm a dunce.
The default resnet10 model is built for detecting Car, Bicycle, Person, and Roadsign. In my setup I definitely don't need Roadsign, and I'd love to add Animal, or even Cat, Dog, Squirrel, Possum, etc. if possible (not super important, though). Would I need to modify the resnet model somehow to remove Roadsign and add Animal? Or can I simply remove the things I don't need from the labels file? My ideal list would be Person, Car, Dog, Cat, Squirrel, Possum, Raccoon, but I'd settle for Person, Car, Animal. I assume that having a small-ish, narrow list will keep bad detections to a minimum. I recall running YOLO at some point and it thought my front sidewalk was a surfboard, and I have no need to detect surfboards in my yard.
My cameras are all capable of roughly 2K resolution and can simultaneously output a second stream at a lower resolution. I like having the full-size resolution at 15 fps on the main stream for my recording system (Blue Iris), but I've noticed that some models seem to want a very specific input resolution. Should I set that output resolution to something the model prefers, or to the highest possible? What about frame rate (and constant vs. variable), or color format? The Jetson Nano seems to be able to run this model on all six streams at full resolution at 15 fps without issue, but I'm uncertain whether that's best for the model or not.
While I could jump in and modify the deepstream-app, I'd love to use it as-is with my customized config files. Is there a sink or something I can use that will simply emit the bounding boxes, confidences, and classes detected, along with which stream they were detected on, over something like MQTT or another transport? If that doesn't exist, it seems like it would be super useful in the base app. If it's something I need to add myself, is the all_bbox_generated() callback the best place to start?
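Just to make that last question concrete, here is roughly what I imagine the callback doing. This is an untested sketch based on the stock deepstream_app_main.c sources and the NvDsObjectMeta structures, with a plain g_print standing in for whatever transport (MQTT or otherwise) would actually carry the message:

```c
/* Sketch only: would live in deepstream_app_main.c in place of the existing
 * all_bbox_generated() hook; names follow the stock DeepStream 5.1 sources. */
#include "deepstream_app.h"   /* AppCtx and the callback hook */
#include "gstnvdsmeta.h"      /* NvDsBatchMeta / NvDsFrameMeta / NvDsObjectMeta */

static void
all_bbox_generated (AppCtx *appCtx, GstBuffer *buf,
    NvDsBatchMeta *batch_meta, guint index)
{
  NvDsMetaList *l_frame, *l_obj;

  for (l_frame = batch_meta->frame_meta_list; l_frame; l_frame = l_frame->next) {
    NvDsFrameMeta *frame_meta = (NvDsFrameMeta *) l_frame->data;

    for (l_obj = frame_meta->obj_meta_list; l_obj; l_obj = l_obj->next) {
      NvDsObjectMeta *obj = (NvDsObjectMeta *) l_obj->data;

      /* One line per detection: stream, frame, label, confidence, bbox.
       * Replace the g_print with a publish to MQTT or any other transport. */
      g_print ("stream=%u frame=%d label=%s conf=%.2f bbox=%.0f,%.0f,%.0f,%.0f\n",
          frame_meta->source_id, frame_meta->frame_num,
          obj->obj_label, obj->confidence,
          obj->rect_params.left, obj->rect_params.top,
          obj->rect_params.width, obj->rect_params.height);
    }
  }
}
```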
Are the models typically trained only on color images? That is, when it's nighttime and all my cameras switch over to infrared, are the models now junk? Or do I need to run a different model when the cameras are in night mode?
This is probably the most difficult part, but if I have an object that's in my yard 24/7 that seems troublesome for the detector (it always thinks it's a person, but it's not), is there something I can do in the model to say "ignore this object" by giving it some pictures? Even screenshots from the exact camera?
I know this is a lot of questions on a lot of different subjects, but it seems like I'm not that far off from accomplishing the goal, which is just to turn off Blue Iris's junk motion detection and instead feed it start/stop events when people or cars are detected. Any help would be greatly appreciated.
The label file is generated according to how the model was trained, so editing it will not change what the model can detect. This resnet10 does not meet your requirement. You can refer to Overview — Transfer Learning Toolkit 3.0 documentation (nvidia.com). You may need to train the model with your own dataset to make it detect the things you want.
The model can only handle its designated input size. DeepStream does the resizing internally, so you can feed it the original resolution or any other resolution. From the DeepStream application's point of view, using a relatively smaller resolution can help improve app performance.
Retraining the model with your own dataset may improve the correctness of detection. Another possible approach is to check the confidence of each object you detect; if the confidence is relatively low, you can ignore the object in the app. https://docs.nvidia.com/metropolis/deepstream/sdk-api/Meta/_NvDsObjectMeta.html
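For example, a small filter like the one below could sit in the same loop that walks the object metadata; the 0.5 threshold is only an illustration and has to be tuned against the object that keeps being misdetected:

```c
#include <glib.h>
#include "nvdsmeta.h"   /* NvDsObjectMeta */

/* Illustrative threshold only; tune it per class / per camera.  Note that
 * with some nvinfer clustering modes the confidence field may not be
 * meaningful, so verify the values you actually get first. */
#define MIN_CONFIDENCE 0.5f

/* Return TRUE if a detection should be kept and forwarded downstream. */
static gboolean
keep_detection (const NvDsObjectMeta *obj)
{
  return obj->confidence >= MIN_CONFIDENCE;
}
```

In a callback like the sketch earlier in this thread, this check would go right before the detection is printed or published, e.g. `if (!keep_detection (obj)) continue;`.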
Wow thanks @Fiona.Chen for that awesome and detailed answer! I figured I’d have to delve into the training world to get what I really want. I’ll check out the nvmsgbroker stuff you linked, looks interesting. It’d be nice to get by without having to modify the apps if I can.