Inference from very wide source

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
Jetson AGX Xavier
• DeepStream Version
4.0
• JetPack Version (valid for Jetson only)
4.4
• TensorRT Version
7.1.3.0
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type (questions, new requirements, bugs)
Questions

I’m using deepstream-app, deepstream-test4-app and different derivatives from those.

When I process a “normal” h264 file, 16:9 ratio, 1920x1080, it works great.

This is the mediainfo output for such file:

General
Complete name                            : una.h264
Format                                   : AVC
Format/Info                              : Advanced Video Codec
File size                                : 38.6 MiB

Video
Format                                   : AVC
Format/Info                              : Advanced Video Codec
Format profile                           : Baseline@L4
Format settings                          : 1 Ref Frames
Format settings, CABAC                   : No
Format settings, ReFrames                : 1 frame
Format settings, GOP                     : M=1, N=30
Width                                    : 1 920 pixels
Height                                   : 1 080 pixels
Display aspect ratio                     : 16:9
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Progressive

But what I have to process is a wider image, which is a stitch from 3 cameras, resulting in a 3840x720 image.

This is the mediainfo output of such file:

General
Complete name                            : ../filename.h264
Format                                   : AVC
Format/Info                              : Advanced Video Codec
File size                                : 103 MiB

Video
Format                                   : AVC
Format/Info                              : Advanced Video Codec
Format profile                           : Baseline@L5
Format settings                          : 1 Ref Frames
Format settings, CABAC                   : No
Format settings, ReFrames                : 1 frame
Format settings, GOP                     : M=1, N=30
Width                                    : 3 840 pixels
Height                                   : 720 pixels
Display aspect ratio                     : 5.333
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Progressive

Both files are similar in quality, total pixel count, etc., and show the same kinds of scenes.

But in the second case the inference is really poor. It only detects a couple of the bigger persons and nothing more. There are about 10-15 persons in the image, all of which are detected correctly in the 1920x1080 test.

So I guess my question here is: what do I have to tune in the inference config files in order to detect correctly on the wider image? What am I doing wrong?

Thanks a lot.


Hey, why do you still need to use DS 4.0? Could you upgrade to the latest DS 5.0.1?
Regarding your question: is it possible to feed the 3 camera streams into the DS pipeline individually instead of stitching them into one stream? I guess the resize in nvstreammux or nvinfer will affect the accuracy.
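For context: nvstreammux scales every input to the resolution configured in the deepstream-app [streammux] group, so if that stays at a 16:9 value, a 3840x720 stream gets squeezed before it ever reaches the detector. A sketch of the relevant group (key names follow the deepstream-app config conventions; the values are just an example for this source):

```ini
[streammux]
## All sources are scaled to this output resolution before batching.
## Leaving this at a 16:9 value (e.g. 1920x1080) distorts a 3840x720 input.
width=3840
height=720
batch-size=1
```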

I have to use JetPack 4.4.0 because of an issue with my drivers, and for historical reasons I kept using DS 4.0. I tried to run deepstream-test4 from DS 5.0, but it gave a lot of errors, so I abandoned it.

Is it much better to run DS5.0? Does it improve the accuracy in my case?

I know, but my use case is to first stitch and then detect objects in this wide video.
Is it not possible to accurately detect objects on this 3840x720 canvas? It's not that big of an image. Is it about the aspect ratio?

Can you tell me more about this resize? What would be the ideal input size? Does it have to be 16:9?

Thanks a lot

I would like to add some information here.

I activated the file output (and the screen output as well), and both show me a very distorted 16:9 output. I suppose, then, that the network is having a hard time detecting these distorted, "thin" persons.

Is it possible to avoid this resize? Can I tell the pipeline that I'm processing a 3840x720 image?
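From reading the docs, I suspect it would be something like this, but I'm not sure — this is just my guess at the relevant keys (in the deepstream-app config and in the nvinfer model config, respectively):

```ini
## deepstream-app config file, [streammux] group:
## keep the batched buffer at the source geometry instead of 16:9
[streammux]
width=3840
height=720

## nvinfer model config file, [property] group:
## pad (letterbox) rather than stretch when scaling to the network input
[property]
maintain-aspect-ratio=1
```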

Thanks.