DeepStream Inference Fails for ONNX Model with Batch Size different than 1

• Hardware Platform (Jetson / GPU) : NVIDIA Jetson AGX Orin
• DeepStream Version : 7.1
• JetPack Version (valid for Jetson only) : 6.1
• TensorRT Version : 8.6.2.3
• Issue Type( questions, new requirements, bugs) : question
Hello,

I am trying to run an ONNX model with an explicitly set batch size of 8 through a simple DeepStream pipeline that only performs inference. The model is available here, along with the config, labels, and the simple pipeline I run it with:
model.zip (1.0 MB)

When inspecting the model in Netron, the batch size is correctly shown as 8. To match this, I configured nvstreammux with batch-size=8.
According to the DeepStream FAQ, nvstreammux’s batch size should match either the number of input sources or the model’s batch size, so I believe it is set correctly. The relevant muxer settings are sketched below.
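For reference, a minimal sketch of how the muxer is configured (element and variable names here are illustrative; only batch-size=8 is taken from my actual pipeline):

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)

    # nvstreammux collects frames from the source(s) into batches for nvinfer
    streammux = Gst.ElementFactory.make("nvstreammux", "stream-muxer")
    streammux.set_property("batch-size", 8)   # match the model batch size
    streammux.set_property("width", 1920)     # illustrative output resolution
    streammux.set_property("height", 1080)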

However, when running inference, I encounter the following error:

ERROR: [TRT]: IExecutionContext::enqueueV3: Error Code 7: Internal Error (IShuffleLayer model/output/BiasAdd__82: reshaping failed for tensor: model/output/Sigmoid:0 reshape would change volume 50176 to 401408 Instruction: RESHAPEinput dims{1 1 224 224} reshape dims{8 224 224 1}.)
ERROR: Failed to enqueue trt inference batch
nvinfer gstnvinfer.cpp:1504:gst_nvinfer_input_queue_loop:<cp-nvinfer> error: Failed to queue input batch for inferencing

Interestingly, when I use the same model with batch size explicitly set to 1, it works without issues.

Question:

How can I perform inference on 8 frames simultaneously?

• Do I need to introduce a specific buffer element before nvinfer, or does nvinfer handle batching internally?

• How can I verify that inference is actually happening on 8 frames when the converted model reports:

INPUT  kFLOAT input  3x224x224  
min: 1x3x224x224  
opt: 8x3x224x224  
max: 8x3x224x224  

I would like to always perform inference with batch-size=8 instead of 1.
Any insights or suggestions would be greatly appreciated!

Your model can be handled with either an implicit batch dimension or full dimensions. Please set “force-implicit-batch-dim=1” in your model_config.txt file.
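For reference, a minimal sketch of where the key goes in the gst-nvinfer configuration file (only these two keys are shown; the rest of model_config.txt stays unchanged):

    [property]
    batch-size=8
    force-implicit-batch-dim=1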

@Fiona.Chen Thank you for your help; the solution worked!

However, after analyzing inference performance in Nsight Systems, I noticed that frames are not being stacked into batches of 8 as expected. Instead, they are processed sequentially, one by one, despite setting batch-size=8 and force-implicit-batch-dim=1 in the config file. Below is a screenshot for reference:

Question:

Is this the expected behavior? I would prefer frames to be processed in batches of 8 since my model is optimized for batched inference and performs more efficiently when processing multiple frames together rather than individually.

  1. The nvstreammux batch-size and nvinfer batch-size have different meanings. See: Frequently Asked Questions — DeepStream documentation
  2. From your code, there is only one camera input at 60 fps, it is a live source, and you set “batched-push-timeout=80000” with nvstreammux. That means nvstreammux waits at most 80 ms to collect frames for a batch from the live source. At 60 FPS, 8 frames take about 133 ms to arrive, so how can the single camera provide 8 frames within 80 ms?

How do you know this from the nsys log?

@Fiona.Chen Thank you for your reply.

  1. If I set nvstreammux’s batch-size to 1, can nvinfer still process 8 frames in a batch? If so, are the frames buffered internally, so that I can set nvstreammux’s batch-size to 1 and nvinfer’s batch-size to 8?
  2. You’re right, I initially set batched-push-timeout too low. I’ve now adjusted it to 280000, since each frame takes about 16.6 ms and there will be 8 of them (see the property sketch after this list).
  3. In Nsight Systems, I assumed that if the GstNvinfer row contains only one buffer_process_batch_num (blue marker), it indicates that frames are processed one at a time. If that’s incorrect, how can I verify how many frames my model processes per batch? Would iterating over frame data in a probe function be the best approach, or is there another way? Also, the inference takes the same amount of time as with the model whose batch size was explicitly set to 1. I am attaching the nsys log.
    nsys_log.nsys-rep.zip (754.6 KB)
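For reference, the adjusted muxer properties look roughly like this (streammux here is assumed to be the nvstreammux element created elsewhere in the pipeline):

    # 8 frames at 60 fps need about 8 * 16.7 ms = 133 ms to arrive, so a
    # batched-push-timeout of 280000 us (280 ms) leaves headroom for camera jitter
    streammux.set_property("batch-size", 8)
    streammux.set_property("batched-push-timeout", 280000)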

The engine is built with batch size 8 and always runs with batch size 8, but there may not be 8 frames of data in the batch.

The num_frames_in_batch field in NvDsBatchMeta (NVIDIA DeepStream SDK API Reference: _NvDsBatchMeta Struct Reference | NVIDIA Docs) shows you how many frames are in the batch.

No other way.

If there is only one live source, batch size 8 may not improve inferencing efficiency, because it always takes 8/60 second (about 133 ms) to form the batch, and if the camera has any instability the time may be even longer.

@Fiona.Chen Thank you for your reply.
So I added this to my probe function:

    gst_buffer = info.get_buffer()
    if not gst_buffer:
        print("Unable to get GstBuffer for arcing inference pad buffer probe")
        return Gst.PadProbeReturn.OK
    
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
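    # num_frames_in_batch: frames actually present in this batch;
    # max_frames_in_batch: the batch capacity configured on nvstreammux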
    print(batch_meta.num_frames_in_batch)
    print(batch_meta.max_frames_in_batch)

What I get as a result is

1
8

which means that max_frames_in_batch is indeed 8, but num_frames_in_batch is 1. A batch of 8 is not being created even though I have specified that the batch size should be 8.
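For completeness, the probe is attached roughly like this (the element variable, pad choice, and callback name are my own, just for illustration; any pad downstream of nvstreammux carries the batch meta):

    # attach the buffer probe to the nvinfer src pad so the batch meta reflects
    # what the model actually received
    nvinfer_src_pad = nvinfer.get_static_pad("src")
    nvinfer_src_pad.add_probe(Gst.PadProbeType.BUFFER, probe_callback, 0)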

Hello @Fiona.Chen
I wanted to check if there are any updates regarding my question. Would appreciate any insights you can share. Thanks again!

Please refer to the attached customized usb camera pipeline.

The features of my camera:

v4l2-ctl --device=/dev/video0 --list-formats-ext
ioctl: VIDIOC_ENUM_FMT
        Type: Video Capture

        [0]: 'MJPG' (Motion-JPEG, compressed)
                Size: Discrete 640x480
                        Interval: Discrete 0.040s (25.000 fps)
                Size: Discrete 1280x720
                        Interval: Discrete 0.040s (25.000 fps)
                Size: Discrete 1920x1080
                        Interval: Discrete 0.040s (25.000 fps)
        [1]: 'YUYV' (YUYV 4:2:2)
                Size: Discrete 640x480
                        Interval: Discrete 0.040s (25.000 fps)
                Size: Discrete 1280x720
                        Interval: Discrete 0.100s (10.000 fps)
                Size: Discrete 1920x1080
                        Interval: Discrete 0.200s (5.000 fps)

I configured the “YUYV, 25 fps, 640x480” caps after the v4l2src and linked v4l2src to videoconvert so that the camera outputs 25 fps 640x480 YUYV raw data.
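A minimal sketch of that caps configuration (variable names are illustrative; in GStreamer raw caps the V4L2 YUYV format is written as YUY2):

    caps_v4l2src = Gst.ElementFactory.make("capsfilter", "v4l2src_caps")
    caps_v4l2src.set_property(
        "caps",
        Gst.Caps.from_string("video/x-raw, format=YUY2, width=640, height=480, framerate=25/1"))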

Please pay attention to the property settings for “nvvideoconvert” and “nvstreammux”, and to the dstest1_pgie_config.txt configuration.

deepstream_test_1_usb.py (11.3 KB)
dstest1_pgie_config.txt (2.9 KB)


Parts of my log:

8
8
Frame Number=616 Number of Objects=1 Vehicle_count=1 Person_count=0
Frame Number=617 Number of Objects=1 Vehicle_count=2 Person_count=0
Frame Number=618 Number of Objects=0 Vehicle_count=2 Person_count=0
Frame Number=619 Number of Objects=0 Vehicle_count=2 Person_count=0
Frame Number=620 Number of Objects=0 Vehicle_count=2 Person_count=0
Frame Number=621 Number of Objects=0 Vehicle_count=2 Person_count=0
Frame Number=622 Number of Objects=0 Vehicle_count=2 Person_count=0
Frame Number=623 Number of Objects=0 Vehicle_count=2 Person_count=0
8
8
Frame Number=624 Number of Objects=0 Vehicle_count=0 Person_count=0
Frame Number=625 Number of Objects=0 Vehicle_count=0 Person_count=0
Frame Number=626 Number of Objects=1 Vehicle_count=1 Person_count=0
Frame Number=627 Number of Objects=1 Vehicle_count=2 Person_count=0
Frame Number=628 Number of Objects=0 Vehicle_count=2 Person_count=0
Frame Number=629 Number of Objects=0 Vehicle_count=2 Person_count=0
Frame Number=630 Number of Objects=0 Vehicle_count=2 Person_count=0
Frame Number=631 Number of Objects=1 Vehicle_count=3 Person_count=0
8
8
Frame Number=632 Number of Objects=0 Vehicle_count=0 Person_count=0
Frame Number=633 Number of Objects=1 Vehicle_count=1 Person_count=0
Frame Number=634 Number of Objects=1 Vehicle_count=2 Person_count=0
Frame Number=635 Number of Objects=1 Vehicle_count=3 Person_count=0
Frame Number=636 Number of Objects=0 Vehicle_count=3 Person_count=0
Frame Number=637 Number of Objects=0 Vehicle_count=3 Person_count=0
Frame Number=638 Number of Objects=0 Vehicle_count=3 Person_count=0
Frame Number=639 Number of Objects=0 Vehicle_count=3 Person_count=0
8
8
Frame Number=640 Number of Objects=0 Vehicle_count=0 Person_count=0
Frame Number=641 Number of Objects=0 Vehicle_count=0 Person_count=0
Frame Number=642 Number of Objects=0 Vehicle_count=0 Person_count=0
Frame Number=643 Number of Objects=0 Vehicle_count=0 Person_count=0
Frame Number=644 Number of Objects=0 Vehicle_count=0 Person_count=0
Frame Number=645 Number of Objects=0 Vehicle_count=0 Person_count=0
Frame Number=646 Number of Objects=0 Vehicle_count=0 Person_count=0
Frame Number=647 Number of Objects=0 Vehicle_count=0 Person_count=0

@Fiona.Chen Thank you for your response and for providing an example! I ran your example with minor adjustments for my camera setup, and it indeed runs with num_frames_in_batch = 8.

However, I noticed that you did not set streammux.set_property('live-source', True). Could you clarify the reason for this? When I enabled this property, I immediately observed num_frames_in_batch = 1 instead of 8. Does this property influence the number of frames buffered and passed through nvstreammux?

Additionally, if I want to run my pipeline with a camera and perform inference with a batch size of 8, do I need to disable live-source, or can I still process 8 frames per batch with live-source=True?

Lastly, I noticed that nvvidconvsrc has output-buffers set to 9, while the batch size for both nvstreammux and nvinfer is 8. Could you explain the reasoning behind this? When I commented this property out, the pipeline behaved the same way, with 8 frames in a batch.

Also, when I commented out nvmultistreamtiler and added only a fakesink after the nvinfer element, the pipeline seemed to get stuck. Is nvmultistreamtiler necessary after nvinfer to keep running inference with a batch size of 8?

Thanks in advance!

When the nvstreammux “live-source” property is set to TRUE, nvstreammux will not wait for the batch to be filled, because latency is very important to most customers who work with live sources.

You need to disable “live-source”.

The “output-buffers” property of nvvideoconvert is the output buffer pool size; if you want downstream elements to always be able to get 8 frames from nvvideoconvert, you need to set the pool size larger than 8.
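For example (assuming nvvidconvsrc is the nvvideoconvert element in the pipeline):

    # the output buffer pool must hold at least one full batch, so use a value
    # larger than the batch size of 8
    nvvidconvsrc.set_property("output-buffers", 9)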

It does not get stuck on my board. But you must use either “nvmultistreamtiler” or “nvstreamdemux” to convert the batched data back to non-batched data before you send the video data to any element that can’t handle batches; fakesink can’t handle batched data.
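A rough sketch of putting the tiler in front of a batch-unaware sink (element names and property values are illustrative):

    # nvmultistreamtiler composites the frames of a batch into one non-batched
    # frame, so the downstream fakesink never sees batched data
    tiler = Gst.ElementFactory.make("nvmultistreamtiler", "tiler")
    tiler.set_property("rows", 1)
    tiler.set_property("columns", 1)
    tiler.set_property("width", 1280)
    tiler.set_property("height", 720)

    pipeline.add(tiler)
    nvinfer.link(tiler)
    tiler.link(fakesink)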

@Fiona.Chen Thank you for your response! I have a few additional questions:

  1. In the case of nvvidconvsrc, when I commented out nvvidconvsrc.set_property('output-buffers', 9), the pipeline still processed 8 buffers at a time correctly. Why is that?

  2. My model performs a single inference, including postprocessing, in 3-4 ms. Would it be better/advisable to set live-source=True and perform inference frame by frame, since inference is faster than the generation of a single frame (16.6 ms), or should I set live-source=False to enable batch inference?

  3. Is there a sink element that can handle batched input directly, or must batch inference results always be converted back to non-batch format before passing them to a sink?

It seems you also changed something else.

It depends on you.

No.

Yes.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.