Surveillance System Use Case

I’m looking at possibly getting the Xavier NX to augment a home surveillance system and just wanted to validate the use case with some experts here.

My setup:
Currently 6 IP cameras of varying resolutions, most in the 1920x1080 to 2048x1536 range. Their frame rates are currently lowered to 6-15 fps, but I’d love to run them faster (e.g. around 30 fps). I’d hook up a dual-monitor setup to the Xavier and display the output from all six cameras, while performing object detection on at least four of the streams.

My main question is whether there are any obvious holes in this plan. A secondary question is one I can’t seem to find the answer to on the product page: what’s the length of the M.2 slot on the dev kit?

Can’t answer the first part, but I can answer the second… There are 2 M.2 slots on the NX devkit carrier board. There’s a 30mm (2230) E-key slot, which comes populated with a Realtek RTL8822CE BT/WiFi card but can certainly be used for a different E-key card. It has 1 PCIe lane, USB, a UART and I2S. The other M.2 slot is an 80mm (2280) M-key slot with 4 PCIe lanes, perfect for an NVMe SSD.

Oh, both slots will only accept single sided cards. There’s almost no clearance between the card and carrier board.

The full spec is here…

https://developer.download.nvidia.com/assets/embedded/secure/jetson/Xavier%20NX/Jetson_Xavier_NX_DevKit_Carrier_Board_Specification_v1.0.pdf

Probably answering different questions here, but I think this is interesting and relevant to your project. I’ve been running yolov3 on Nanos for about a year now at several sites. I’m gearing up for a new major open source software release of this later this year (I just have to finish some GUI configuration software to make that side easy). If the point of the software is to take the work off the person in the equation, as it should be, then you don’t need to visualize the detections and the frame rate is less relevant.

Some information that I’ve found:

On the nano, with the yolov3 input size set to 416x416, the detection time is 0.7s. I also bought an NX, which I haven’t tested yet, but I’ve tested yolov4 with the full input size of 608x608 and it has a detection time of 0.16s. Based on false positive rates, the full-sized yolov4 model appears to perform best, followed by the reduced-size yolov3 model.

In any case, once I’ve finished setting up my beta Xavier AGX I’ll start work on the NX version and let the forum know of preliminary results if there’s interest.

For information, the upcoming software release is intended to provide a full production-ready system, including an onboard reverse proxy, automatic Let’s Encrypt certificates, dynamic DNS, a read-only mounted OS and booting from an NVMe disk. The beta testing over the last year is revealing a very capable system :-)

Cheers,
Kim Hendrikse

Oh I see I didn’t mention it, the 0.16s was on a Xavier AGX system running the full yolov4 version from a server script after the timings stabilised. I’m guessing that will be between 0.2 and 0.3s on the NX. On the nano it was 0.9s.
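To put those timings in context, here is a rough back-of-the-envelope sketch in Python. It assumes (an assumption, not something stated in the posts) that a single detector handles one image at a time and visits the cameras round-robin, so each camera gets a fresh detection once every detection time multiplied by the number of cameras. The function name is made up for illustration. The Nano and AGX timings are the ones quoted above; the NX is left out because its figure is still a guess.

```python
def revisit_interval(detection_time_s: float, cameras: int) -> float:
    """Seconds between successive detections on the same camera.

    Assumes one detector processing one image at a time, round-robin.
    """
    return detection_time_s * cameras


# Detection times quoted in this thread (seconds per image).
timings = {"Nano, yolov4 608x608": 0.9, "Xavier AGX, yolov4 608x608": 0.16}

for name, seconds in timings.items():
    interval = revisit_interval(seconds, cameras=4)
    print(f"{name}: each of 4 cameras checked every {interval:.2f} s")
```

Under that assumption, four cameras on the Nano would each be checked roughly every 3.6 seconds, and on the AGX roughly every 0.64 seconds.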

The main point I wanted to make with reference to your post: getting your frame rates up is nice for the aesthetics of watching the video, but to do so you would likely need to compromise on the model you are using, and if you are alerting on the basis of its detections that translates into more false positives. For home security you would ideally focus on preventative security based on alerting, in which case quality of detection should take priority over frame rate for visualization.

Just my two cents,
Kim Hendrikse

Thanks for all that info! I am looking at YOLOv4 (but I did see YOLOv5 was just released https://blog.roboflow.ai/yolov5-is-here/). So my main reasoning for higher frame rates would be so the system would double both as the monitor and detector. I could definitely still run the substreams of the cameras at any rate, while still recording the full rate on my main camera server. Do you have a link to your open source software yet?

The open source software with the AI integration is not released yet. The original system that the AI triggers is, but I’m hesitant to share it yet, as it doesn’t do it justice yet: it needs a lot of work on documentation and examples.

If you want high frame rates, just use a desktop with an RTX 2080 Ti. I have one of those as well.

But you don’t really need high frame rates to monitor. 6 frames per second still serves that purpose. You could get higher frame rates by running with a 160x160 image and batch and subdivisions of “1”, but then the monitor and the detector won’t be working from the same detections.

Hi Kim. I’m curious about your system. Are you leveraging deepstream, or doing the inference in some other stack? It sounds like what you’ve made is intended to work as a headless solution, e.g. do something like trigger an API when a person is detected, rather than show the camera streams with boxes etc?

Kind of, but not quite and not ultimately. I’m adding an AI-based trigger to a home security system that I’ve already been developing for ten years. What I’ve done is come up with a JSON definition of a structure that describes camera streams, object categories, notifications, region-of-interest polygons, regions of explicit non-interest and a bunch of other parameters. Once configured, it is interpreted by a multi-threaded python script that consumes several MJPEG streams, takes the latest image from each, pushes it up a websocket connection to a “yolo-server” and receives JSON back. With this it checks whether any detections in the regions of interest fall within the parameter constraints, and then makes a REST call to my existing system.
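As a rough illustration of the region-of-interest check described above, here is a minimal Python sketch. It is not the actual software: the JSON key names, the detection format (`label`, `confidence`, `box`) and both function names are invented for this example, and using the bottom-centre of the bounding box as the test point is just one plausible choice.

```python
import json

# Hypothetical config: every key name here is invented for illustration.
CONFIG = json.loads("""
{
  "camera": "driveway",
  "categories": ["person", "car"],
  "min_confidence": 0.6,
  "region_of_interest": [[100, 200], [500, 200], [500, 470], [100, 470]]
}
""")


def point_in_polygon(x, y, polygon):
    """Ray-casting test: toggle on each edge crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def should_trigger(detection, config):
    """True if the detection is a wanted category, confident enough, and inside the region."""
    if detection["label"] not in config["categories"]:
        return False
    if detection["confidence"] < config["min_confidence"]:
        return False
    left, top, right, bottom = detection["box"]
    # Use the bottom-centre of the box as a rough "feet on the ground" point.
    foot_x, foot_y = (left + right) / 2, bottom
    return point_in_polygon(foot_x, foot_y, config["region_of_interest"])


example = {"label": "person", "confidence": 0.9, "box": [200, 250, 260, 400]}
print(should_trigger(example, CONFIG))
```

A region of explicit non-interest could be handled the same way, by rejecting a detection whose test point falls inside that polygon.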

My existing system then captures video and triggers the rules in an event engine, which can send outgoing alerts such as Pushover notifications and e-mail alerts containing links to the captured video snippets, control I/O ports, make outgoing HTTP calls, a lot of things actually.

So in essence I’m making the AI do the work for you. It’s very nice to show the video with the bounding boxes, and I have python tools for that, but ultimately home security should eliminate the need for a person to stare at a screen with bounding boxes. I want to implement all that as well, but it’s lower priority to me than making it functional and practical and getting the software released for people to use.

I’ve been using yolov3 at 416x416 pixels in nano-based systems on about 5 sites for about a year now and it works really well. I’ve spotted one intruder intent on theft just recently and it also managed to help identify a car involved in a car theft. In the past, without the AI as sensor input, I’ve captured intruders on video a number of times.

I’m focusing on the practical side of using the AI in real life. To that end I want to package up the integration of yolo (currently yolov4), an Apache reverse proxy, dynamic DNS, automatic Let’s Encrypt certificate renewal, video transcoding and a memory overlay file system on top of an OS running read-only from SSD. That’s how my current boxes are set up. In essence it’s just waiting for me to complete the ReactJS configuration screens before I can write the install scripts. As I’m brand new to ReactJS, that’s what is holding me up at the moment. Oh, and on the Xavier I also need to create the root pivot scripts in a manner that provides a safe fallback if the writeable partition of the disk ever gets corrupted, which can happen if the power fails while writing video. However, this is a different partition from the read-only partition that the OS runs on, so a fallback solution would be great, particularly as there’s no SD card in the AGX and a reset implies a full software reinstall, which is a bit brutal.

The yolov3-based system I’ve been using over the past year processes an image every 0.7s. The Jetson Xavier AGX that I’m testing with now takes 0.165s using the complete full yolov4 model. I also have a Xavier NX but haven’t tested that yet.

If I get a match I also do an additional test to be sure it’s ok. Over the past year I’ve gained a lot of insight as to what kind of things can cause false positives and how to deal with that.

The current configuration format also makes it easy to trigger on loitering, on crowds forming, or on loitering crowds forming.
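To make the loitering and crowd ideas concrete, here is a small Python sketch of how such triggers could be evaluated. This is only a guess at the general technique, not the configuration format or logic of the actual system: the function names, the thresholds and the idea of working from a list of sighting timestamps are all assumptions made for illustration.

```python
def is_loitering(sighting_times, min_dwell_s=30.0, max_gap_s=5.0):
    """True if sightings continue for at least min_dwell_s seconds
    with no gap longer than max_gap_s between consecutive sightings."""
    start = None
    previous = None
    for t in sorted(sighting_times):
        # A long gap means the person left; start timing a new visit.
        if previous is None or t - previous > max_gap_s:
            start = t
        previous = t
        if t - start >= min_dwell_s:
            return True
    return False


def is_crowd(people_in_region, min_people=3):
    """True if enough people are in the region at the same moment."""
    return people_in_region >= min_people


# Someone seen every 3 seconds for 33 seconds counts as loitering here.
print(is_loitering(range(0, 36, 3)))
# Someone who passes through, leaves, and comes back briefly does not.
print(is_loitering([0, 3, 6, 40, 43]))
```

A "loitering crowd" trigger could combine the two, for example by feeding `is_loitering` only the timestamps at which `is_crowd` was true.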

The main system that this triggers also supports a websocket server so other consumers can hang off this.

I serve up yolo detections via a websocket server that takes binary images thrown at it. This makes it easy to use a separate, more powerful GPU if you like. With a jetson nano I can support good detection over 4 cameras. But for myself I use an RTX 2080 Ti; I’m monitoring more than 15 cameras comfortably.

For 10 1/2 years I developed my security system for myself, so it’s already well developed and tested. I’d like to add the AI support and then see if other people can benefit from it as well.

In addition to AI-based sensing I have also developed LoRaWAN PIR-based sensors that I’ll be open sourcing as well. There can be some use cases, such as on large farms, where you may not be able to easily get cameras to the edge, so I guess there’s still some use for the older style sensors, but the AI has pretty much obsoleted it all in my experience over the last year.

Cheers,
Kim Hendrikse

RE: surveillance system. Some people are using DeepStream for smart DVRs. That’s probably the best performance, but GStreamer can be hard to work with. If you decide to go with that solution, expect to spend a lot of time learning about the framework itself (Pipelines, Bins, Elements, Pads, Caps, and so on). It’s very well designed, but it’s written in GObject C, so not very friendly unless you use something that generates the code for you (another option here).

There are also python bindings for DeepStream, but there you’re trading speed for ease of development. The more you do in python, the more it will hurt your app’s performance, especially any kind of I/O, so if you choose Python, I would recommend doing as much as you can with Nvidia elements, which are written in C and C++. All the necessary components are there to create a smart DVR. 30fps from 6 cameras is not a problem. You can probably do more like 16-32 sources depending on how efficient you make your pipeline (e.g. trackers can be used to avoid doing inference on every frame). Nvidia has a demo for the nano with 8 sources, iirc, and this is the technique used.
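To see why skipping frames helps so much, here is a quick Python calculation. It is only a sketch: the function name is made up, and it assumes an "interval" setting that means "number of frames skipped between inferences on each stream", with a tracker carrying the boxes across the skipped frames.

```python
def inferences_per_second(sources: int, fps: float, interval: int = 0) -> float:
    """Frames per second that actually reach the detector.

    `interval` is the number of frames skipped between inferences on each
    stream, so interval=0 means every frame and interval=4 means every 5th.
    """
    return sources * fps / (interval + 1)


print(inferences_per_second(6, 30))              # every frame from 6 cameras
print(inferences_per_second(6, 30, interval=4))  # tracker covers the skipped frames
```

With six cameras at 30 fps the detector would see 180 frames per second if it ran on every frame, but only 36 per second with an interval of 4, which is the kind of saving that lets a small board handle many sources.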

You can, of course, roll your own thing but you’ll end up having to re-implement much of what you get with GStreamer and DeepStream for “free”, so I wouldn’t recommend that.

That is exactly the direction that sounds right for me. I’ve been a C/C++ developer for about 16 years so it is right in my wheelhouse. Great to hear that I should be able to use the nano. Do you happen to have a link to that example with 8 sources?

You should feel right at home then. The sample in question is installed with DeepStream itself:

You’ll have to install the Debian package manually until it’s added to the online repositories.

The example I’m referring to is a config for the reference deepstream-app.

You’ll find the config at …

/opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app/source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_nano.txt

… once it’s installed.

You can also just use the reference app to do what you want as the config file setup makes it quite flexible. It’s also open source so you can modify it to your liking or use the plugins to write your own app (there are examples of this as well). Likewise many (but not all) of the gstreamer plugins themselves are open source (various licenses).

I just realized I gave you the link to a Nano example. While that will work, the Xavier (and recent GTX/RTX/Tesla) is capable of int8 precision. There is a 30 source example for Xavier in the same folder:

.../source30_1080p_dec_infer-resnet_tiled_display_int8.txt

Not sure if the NX can handle that one, since it doesn’t have the same amount of memory the regular Xavier does, so it may drop some frames. You may have to modify the config in that case (e.g. add a tracker and an interval to the inference element, or just remove some sources).
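For anyone who would rather script that kind of change than edit by hand, here is a hedged Python sketch using the standard library `configparser`. The text it edits is a cut-down stand-in, not the real sample file, and the section and key names (`[primary-gie]` with `interval`, `[tracker]` with `enable`) are written from memory of the deepstream-app sample configs, so treat them as assumptions and check them against the file on your own install. The function name is made up.

```python
import configparser
import io

# Cut-down stand-in for a deepstream-app config, not the real sample file.
# The section and key names are assumptions to check against your install.
SAMPLE = """\
[primary-gie]
enable=1
interval=0

[tracker]
enable=0
"""


def lighten_inference(text: str, interval: int) -> str:
    """Return the config text with the tracker enabled and an inference interval set."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(text)
    parser["primary-gie"]["interval"] = str(interval)
    parser["tracker"]["enable"] = "1"
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


print(lighten_inference(SAMPLE, interval=4))
```

One caveat: `configparser` drops comments when it writes a file back out, so on a real config it may be simpler to make these two changes in a text editor and keep the explanatory comments.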

DeepStream also supports the DLA accelerator of the Xavier, so you can try using that instead of the GPU to do inference (or potentially both at the same time). There are examples of much of this in the samples, and the reference manual for the plugins themselves is here.

If I can get away w/the nano I will. For $100 it’s a no brainer to try and at least get familiar with things. We got a Xavier AGX at work I might be able to do some testing on too if I can wrestle it away from my coworker.

I don’t have experience with resnet, I’m a big fan of yolov3 and now playing with yolov4.

You could handle 6 cameras sufficiently well on the nano if you used something like a 640x480 stream at around 6 frames per second, provided you had a source of MJPEG video supplied from outside the nano. Currently I use VLC to transcode the RTSP MJPEG to HTTP MJPEG. That works on the nano with 4 cameras, but each vlc process takes around 50MB of RAM and 4 is all it can handle. In that case I would choose yolov3 at 416x416 resolution in the cfg file. Yesterday I moved the vlc transcoding from a nano onto a raspberry pi, freeing up 200MB of RAM, just sufficient to run the full 608x608 yolov4 model. It does push the detection time up from 0.7s to 0.9s, but that’s not really a big deal. Again, however, for 6 cameras I would offload the vlc transcoding to another machine and then run yolov4 at 416x416 resolution. I noticed that I got more false positives for people on yolov4 at 416x416 than on yolov3 at 416x416. At the full 608x608 size both yolov3 and yolov4 are a little more resilient to false positives.

I’ll be trying an NX shortly. The Xavier AGX was sweet at 0.165s per detection with full yolov4 at 608x608.

The analytics required for home security are pretty basic. Usually, it’s: if a person is in an area where they shouldn’t be, then let me know.

@mdegans thanks again for the suggestion. Got the nano in hand and wow, right outta the box the deepstream SDK works nearly flawlessly on 4 streams to detect cars/people. Even at much higher resolution/framerate than I thought (4 cameras at 2688x1520, 15fps). This is all just with the sample app. For those following along, I copied the config he linked:

/opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app/source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_nano.txt

and just added my four sources as RTSP feeds. Very cool!
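For those following along, here is a small Python sketch that prints source sections in the style of the deepstream-app sample configs, one per RTSP camera. The camera addresses are made up, the function name is hypothetical, and the key names and the meaning of `type=4` are written from memory of the sample files, so compare against the comments in the config on your own install before relying on them.

```python
def rtsp_source_sections(uris):
    """Build one [sourceN] section per RTSP URI, in deepstream-app config style."""
    lines = []
    for index, uri in enumerate(uris):
        lines += [
            f"[source{index}]",
            "enable=1",
            "type=4",  # assumed to mean "RTSP" in the sample configs; verify locally
            f"uri={uri}",
            "num-sources=1",
            "",
        ]
    return "\n".join(lines)


# Hypothetical camera addresses, for illustration only.
cameras = [f"rtsp://192.168.1.{10 + n}:554/stream1" for n in range(4)]
print(rtsp_source_sections(cameras))
```

The printed sections would replace the source section(s) in a copy of the sample config; the tiled display rows and columns may also need adjusting to match the number of cameras.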

YW, Yeah, you can do a lot with a single Nano and the Deepstream examples are pretty well optimized, so they’re good places to start from. If you run into any issues or have any questions, there is a DeepStream board on this forum where you can communicate directly with the developers.

@cclaunch Have you been able to deploy yolov5 in deepstream? I am facing this issue. It would be really helpful if you can help in any way

I have not gotten there yet. That is on my radar though to try.

More is not always better. For example, yolov4 has a higher mAP score than yolov3. But this is averaged over 80 categories, whereas for surveillance it’s primarily the score for just people, or maybe also cars (which always match well), that matters. I’ve found that yolov4 has a tendency to want to see more objects than yolov3. This tendency is easy to see if you run it on a few images and compare with yolov3. It will correctly see more small flower pots, for example. It’s conceivable that by correctly seeing more small objects and objects that are not people you gain a higher overall accuracy, whilst the accuracy for just people goes down due to increased false positives. I’ve found in my long-term experiments that yolov3 at 608x608 appears to be slightly less likely to fire false positives on people than the newer yolov4 at 608x608. It’s one thing just looking at a running video and saying wow, that looks great, but if you are going to be woken up in the middle of the night then resilience to false positives is really important. I was testing yolov4 on a few beta sites but switched back to using yolov3 after getting too many false positives on people.
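The averaging effect described above can be shown with a toy Python calculation. The numbers below are entirely made up for illustration and are not measurements of yolov3, yolov4 or any other model; three classes stand in for the 80.

```python
from statistics import mean

# Hypothetical per-class average precision, NOT measured results.
model_a = {"person": 0.80, "car": 0.85, "potted plant": 0.30}
model_b = {"person": 0.77, "car": 0.86, "potted plant": 0.50}

for name, scores in (("model A", model_a), ("model B", model_b)):
    overall = mean(list(scores.values()))
    print(f"{name}: mean AP {overall:.3f}, person AP {scores['person']:.2f}")
```

In this toy case model B wins on the averaged score purely because it is better at potted plants, while being slightly worse on the one class that matters for an intruder alert.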

But ok, surveillance use case can have many meanings. Maybe the intended meaning for this thread revolves around people watching screens. In which case it’s a lot less critical.

@kimv9rqv I agree and actually am just running the stock resnet10.caffemodel_b8_gpu0_fp16.engine included with deepstream for the nano and turning off the Roadsign class. I currently have it operating on 4 full resolution streams at 15fps and very few false positives. I’m working on tweaking the deepstream sample app to my liking so I can feed triggers from it to Blue Iris. I’ll also use the output stream to a display as my primary means of monitoring. Works great so far!