Pipeline freeze on infer with output-tensor-meta enabled - no error message or traceback

• Hardware Platform (dGPU)
• DeepStream Version 6.0
• TensorRT Version 8.2.3-1+cuda11.4
• NVIDIA GPU Driver Version 495.29.05 and 470.103.01
• Issue Type (bugs)
• How to reproduce the issue? Enable output-tensor-meta on a keypoint model

This is a strange one.

We have built a pipeline in DS6 using the Python bindings. It's a simple filesrc → pgie (detector) → sgie (classifier, etc.) → filesink pipeline. We have tested this with both TAO and in-house-built models as the PGIE, and it all works fine. We also tried mixing and matching Triton inference (nvinferserver) and plain nvinfer; it doesn't seem to change much.
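For reference, here is a stripped-down sketch of the pipeline as we build it with the Python bindings (file names, resolutions, and the H.264 elementary-stream input are placeholders rather than our actual setup; config/keypoints_inferserver.txt is the keypoint SGIE config mentioned below):

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# filesrc -> decode -> nvstreammux -> pgie -> sgie(s) -> osd -> encode -> filesink
pipeline = Gst.parse_launch(
    "filesrc location=test.h264 ! h264parse ! nvv4l2decoder ! "
    "m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! "
    "nvinferserver config-file-path=config/pgie_inferserver.txt ! "
    "nvinferserver config-file-path=config/keypoints_inferserver.txt ! "
    "nvvideoconvert ! nvdsosd ! nvvideoconvert ! nvv4l2h264enc ! "
    "h264parse ! filesink location=out.h264"
)

loop = GLib.MainLoop()
bus = pipeline.get_bus()
bus.add_signal_watch()
bus.connect("message::eos", lambda b, m: loop.quit())
bus.connect("message::error", lambda b, m: (print(m.parse_error()), loop.quit()))

pipeline.set_state(Gst.State.PLAYING)
try:
    loop.run()
finally:
    pipeline.set_state(Gst.State.NULL)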

However, we would like to implement a keypoint model, specifically mmpose hrnet_lite (we have tried other models; it's not just this one). As you are aware, DeepStream does not natively support keypoint post-processing, so we need to create our own post-processor. No big deal. However, as you are also aware, setting the post-processor to

postprocess { other { } }

warns you:

warning: Network(uid: 4) is defined for other postprocessing but output_tensor_meta is disabled to attach. If needed, please update output_control.output_tensor_meta: true in config file: config/keypoints_inferserver.txt.
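For context, the relevant parts of config/keypoints_inferserver.txt look roughly like this (a trimmed sketch with output_tensor_meta already switched on; all other fields are omitted and the values are illustrative):

infer_config {
  unique_id: 4
  max_batch_size: 1
  postprocess {
    other { }                   # no built-in parser; raw output tensors are attached instead
  }
  extra {
    output_buffer_pool_size: 2  # default
  }
}
output_control {
  output_tensor_meta: true      # attach NvDsInferTensorMeta for our own keypoint parsing
}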

Without output_tensor_meta enabled, the pipeline runs fine. The FPS drops from 250 to 7, but it runs to completion, of course with nothing related to the keypoint model in the output.

When output_tensor_meta is set to true, the pipeline never starts, freezing just before inference on the video begins. There is no error and no segfault; I can't even kill the pipeline with an interrupt. It just sits there forever. We're not even trying to do anything with the tensor meta yet.
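To be clear, the only plan for the tensor meta so far is a pad probe on the keypoint SGIE's src pad that walks the attached tensors, roughly like this (a pyds sketch; it only prints layer names, and the actual keypoint decoding is not written yet):

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds

def sgie_src_pad_probe(pad, info, _):
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        l_obj = frame_meta.obj_meta_list
        while l_obj is not None:
            obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)
            l_user = obj_meta.obj_user_meta_list
            while l_user is not None:
                user_meta = pyds.NvDsUserMeta.cast(l_user.data)
                if user_meta.base_meta.meta_type == pyds.NvDsMetaType.NVDSINFER_TENSOR_OUTPUT_META:
                    tensor_meta = pyds.NvDsInferTensorMeta.cast(user_meta.user_meta_data)
                    # each output layer holds the raw heatmaps we would post-process into keypoints
                    for i in range(tensor_meta.num_output_layers):
                        layer = pyds.get_nvds_LayerInfo(tensor_meta, i)
                        print("object", obj_meta.object_id, "tensor", layer.layerName)
                l_user = l_user.next
            l_obj = l_obj.next
        l_frame = l_frame.next
    return Gst.PadProbeReturn.OK

# attached with:
# sgie.get_static_pad("src").add_probe(Gst.PadProbeType.BUFFER, sgie_src_pad_probe, None)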

I've attached a screenshot of both docker stats and nvidia-smi; as you can see, it's not releasing the RAM or VRAM, but it's also not using the GPU at all.

Another thing to note is that it works fine on videos with very few instances of what we're trying to run inference on (2 bounding boxes work), but our usual test video contains anywhere from 8 to 12 at a time.

Any help would be appreciated, we’ve been stuck on this for months.

Sorry for the late response; we will have the team investigate and provide suggestions soon. Thanks

Actually, we have narrowed down the issue quite a bit over the week. It seems the culprit is a lack of dynamic batching combined with output_tensor_meta set to true on frames with too many objects (it depends on the model, but anything more than 3-4 seems to cause instability).

It only applies to SGIEs, as PGIEs are only run once per frame.

It's not just keypoints, either; detectors and classifiers both have the same issue. If I can't increase the batch size, it will freeze.
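For reference, the batch size in question is max_batch_size in the SGIE's nvinferserver config; a sketch with an illustrative value (the underlying TensorRT engine also has to be built to accept that batch, e.g. a dynamic-batch ONNX converted with trtexec --minShapes/--optShapes/--maxShapes):

infer_config {
  unique_id: 4
  max_batch_size: 16   # must be covered by the engine's (dynamic) batch dimension
  # ... rest of the config unchanged
}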

If this is a known bug, we've missed it. It potentially comes with some limitations, as some models, specifically transformers, don't really convert to dynamic batching.

Of course, if we build the parsers in C++ and leave output_tensor_meta disabled, there's no crashing.

Pipeline freezes when SGIE batch-size is less than PGIE batch-size
When "output-tensor-meta" is set to true for nvinfer or nvinferserver, this is a known issue; we will fix it in a future release.

Sorry about the above comment; this bug is fixed in DS 6.0.
For nvinferserver:
Update the secondary GIE config files (available under inferserver) to set the output buffer pool size:

extra {
  output_buffer_pool_size: 20   # use a larger value, e.g. 20, to infer more objects
}

We have done that; it seems to have little to no effect on the stability of the pipeline. The value for output_buffer_pool_size is 10 in all our SGIEs.

Crashes just the same as with the default value of 2.

How about your PGIE batch size?

PGIE has a batch size of 1.

There has been no update from you for a while, so we are assuming this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Can you try a larger size, e.g. tensor-meta-pool-size set to 20?
