How to make a pedestrian counter

Hello, I want to build a street pedestrian counter, just to test it from my window. I can already detect pedestrians with the examples on my Jetson TX1 and a USB Logitech 930 camera, and I guess that if it detects pedestrians I will be able to count them. I have spent a few days googling and exploring these forums and the Jetson repositories without finding clear solutions to some problems I ran into, and before making some decisions I would love some advice:

  • I run detectnet-camera with the multiped network, and the displayed video (with pedestrian detections) is a zoomed-in, small part of the camera view, around 400 px. In fact, all the examples I run have this same display size. My camera is configured in detectnet-camera.cpp at 1280 px and can display video at 1280 (./v4l2-display /dev/video0), so why is this happening? Where should I look? I think it may be a problem with the multiped network input size, but I cannot see that size in the prototxt file (I see 640 px). So, to avoid this reduced video size during inference, do I have to retrain the DetectNet network with 1280 px pedestrian inputs?
  • As I want to count pedestrians, I think I will be able to do it, because I see detectnet-camera printing bounding-box counts in the console. It is not very precise and has a lot of false positives, but I think it can work. So, to count the number of pedestrians passing in front of my window, is it better to modify the NVIDIA C++ code, or would it be faster and easier to install OpenCV + Python and do inference with Caffe (and in that case, can I use DetectNet models?)? I am an average coder, but Python has more examples available...
  • Does anyone have an idea how to deal with the problem I have to solve? I mean, I can detect pedestrians, but I am not sure how to count them. Pednet draws a bounding box in each frame, but it does not care whether it is the same person or a different one. What I need is some sort of person re-identification in the video scene, not just detection: something that follows each person until they disappear.

Thank you very much, Juan Luis


1. The general framework is (Input) -> pre-process -> inference -> draw -> (Display).
In the detectnet-camera sample, we downsize the image to the model input size in the pre-process step.

2. C++ is better, since we have accelerated it with TensorRT, which only supports C++.

3. This is an interesting use case; here is some advice for you.