Output-tensor-meta: Access raw model output with batch dimension

• GeForce RTX 4070 Ti Laptop dGPU
• DeepStream Version 7.1 triton-multiarch Docker Container
• TensorRT Version 10.3
• NVIDIA GPU Driver 575
• Issue Type: Question
My instance segmentation model has two outputs with shapes:

output[batch_dim, 39 (bbox+scores+mask_coeffs), 8400]
protos[batch_dim, 32, 160, 160]

Now, when my model is set to output-tensor-meta=1, I can access the NvDsInferTensorMeta, but its layer information and buffers do NOT contain a batch dimension. For my custom postprocessor I have noticed that running it in a for loop slows down the pipeline a lot when the batch size is large.
My postprocessor consists of parsing, NMS, RoiAlign, and mask-creation kernels, of which the RoiAlign kernel takes ~2 ms due to the large data size. When this slow kernel is called 32+ times per batch, the pipeline slows down drastically. I want to refactor my kernels to run on batches by adding a dimension, but the way the metadata is accessed makes this far too complicated. I just need a single tensor per output with the shapes I specified at the top, not split up along the batch dimension.

Is there any way to access the real raw output of the model?

  1. If output-tensor-meta is set to 1, you can access the inference results directly. Please refer to the native sample deepstream-infer-tensor-meta-test for how to get the inference output tensors.
  2. From the shapes, there is a batch dimension batch_dim in the output tensors. Could you share a DeepStream running log, which will include the printing of the actual dimensions?

I only added a print to each probe in the sample you suggested: deepstream-infer-tensor-meta-test

Everything else is exactly the same, configs etc. The prints never show a batch dimension, which should be dim[0]. Per configuration, this dim should be 1 for the PGIE and 16 for the SGIE, but the following is the output for the first 4 dims:

PGIE Frame Number: 112
-------------
PGIE InferLayer 0 Info: output_cov/Sigmoid:0
    dims: [4, 34, 60, 0]
-------------
PGIE InferLayer 1 Info: output_bbox/BiasAdd:0
    dims: [16, 34, 60, 0]
Frame Number = 107 Number of objects = 15 Vehicle Count = 11 Person Count = 4

SGIE Frame Number: 110
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [6, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [20, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [6, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [20, 0, 0, 0]


PGIE Frame Number: 113
-------------
PGIE InferLayer 0 Info: output_cov/Sigmoid:0
    dims: [4, 34, 60, 0]
-------------
PGIE InferLayer 1 Info: output_bbox/BiasAdd:0
    dims: [16, 34, 60, 0]

SGIE Frame Number: 111
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [6, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [20, 0, 0, 0]
Frame Number = 108 Number of objects = 12 Vehicle Count = 8 Person Count = 4


PGIE Frame Number: 114
-------------
PGIE InferLayer 0 Info: output_cov/Sigmoid:0
    dims: [4, 34, 60, 0]
-------------
PGIE InferLayer 1 Info: output_bbox/BiasAdd:0
    dims: [16, 34, 60, 0]

SGIE Frame Number: 112
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [6, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [20, 0, 0, 0]
Frame Number = 109 Number of objects = 11 Vehicle Count = 7 Person Count = 4


PGIE Frame Number: 115
-------------
PGIE InferLayer 0 Info: output_cov/Sigmoid:0
    dims: [4, 34, 60, 0]
-------------
PGIE InferLayer 1 Info: output_bbox/BiasAdd:0
    dims: [16, 34, 60, 0]

SGIE Frame Number: 113
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [6, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [20, 0, 0, 0]
Frame Number = 110 Number of objects = 10 Vehicle Count = 6 Person Count = 4

For the SGIE, the 6 and 20 are just the number of classes, not the dynamic batch. So I have to iterate through all units inside batch_meta, be they ROIs, objects, or frames; each iteration contains the result for a SINGLE batch unit, but I want the WHOLE tensor, so the output should show dims: [16, 20, 0, 0].
I looked at the Softmax output layer of the vehicle-make model's ONNX file, and it also shows the dims:

predictions/Softmax:0
tensor: float32[unk__94,20]

with unk__94 being the batch_dim I am desperately looking for. It looks like the meta utilities are removing the dim at position 0 and “splitting” each batch unit into its own NvDsInferTensorMeta data buffer.

this is output on startup for the vehicle make etc. models used by sample:

root@flowx16:/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-infer-tensor-meta-test# /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-infer-tensor-meta-test/deepstream-infer-tensor-meta-app /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264
With tracker
max_fps_dur 8.33333e+06 min_fps_dur 2e+08
Now playing...
Failed to query video capabilities: Invalid argument
0:00:00.144840677 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<secondary2-nvinference-engine> NvDsInferContext[UID 3]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:2092> [UID = 3]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-7.1/samples/models/Secondary_VehicleTypes/resnet18_vehicletypenet_pruned.onnx_b16_gpu0_int8.engine
INFO: ../nvdsinfer/nvdsinfer_model_builder.cpp:327 [FullDims Engine Info]: layers num: 2
0   INPUT  kFLOAT input_1:0       3x224x224       min: 1x3x224x224     opt: 16x3x224x224    Max: 16x3x224x224    
1   OUTPUT kFLOAT predictions/Softmax:0 6               min: 0               opt: 0               Max: 0               

0:00:00.144876361 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<secondary2-nvinference-engine> NvDsInferContext[UID 3]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2195> [UID = 3]: Use deserialized engine model: /opt/nvidia/deepstream/deepstream-7.1/samples/models/Secondary_VehicleTypes/resnet18_vehicletypenet_pruned.onnx_b16_gpu0_int8.engine
0:00:00.149713450 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer_impl.cpp:343:notifyLoadModelStatus:<secondary2-nvinference-engine> [UID 3]: Load new model:dstensor_sgie2_config.txt sucessfully
0:00:00.154977079 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<secondary1-nvinference-engine> NvDsInferContext[UID 2]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:2092> [UID = 2]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-7.1/samples/models/Secondary_VehicleMake/resnet18_vehiclemakenet_pruned.onnx_b16_gpu0_int8.engine
INFO: ../nvdsinfer/nvdsinfer_model_builder.cpp:327 [FullDims Engine Info]: layers num: 2
0   INPUT  kFLOAT input_1:0       3x224x224       min: 1x3x224x224     opt: 16x3x224x224    Max: 16x3x224x224    
1   OUTPUT kFLOAT predictions/Softmax:0 20              min: 0               opt: 0               Max: 0               

0:00:00.155002228 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<secondary1-nvinference-engine> NvDsInferContext[UID 2]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2195> [UID = 2]: Use deserialized engine model: /opt/nvidia/deepstream/deepstream-7.1/samples/models/Secondary_VehicleMake/resnet18_vehiclemakenet_pruned.onnx_b16_gpu0_int8.engine
0:00:00.155510811 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer_impl.cpp:343:notifyLoadModelStatus:<secondary1-nvinference-engine> [UID 2]: Load new model:dstensor_sgie1_config.txt sucessfully
0:00:00.160862850 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<primary-nvinference-engine> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:2092> [UID = 1]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-7.1/samples/models/Primary_Detector/resnet18_trafficcamnet_pruned.onnx_b1_gpu0_int8.engine
Implicit layer support has been deprecated
INFO: ../nvdsinfer/nvdsinfer_model_builder.cpp:327 [Implicit Engine Info]: layers num: 0

0:00:00.160883614 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<primary-nvinference-engine> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2195> [UID = 1]: Use deserialized engine model: /opt/nvidia/deepstream/deepstream-7.1/samples/models/Primary_Detector/resnet18_trafficcamnet_pruned.onnx_b1_gpu0_int8.engine
0:00:00.164382456 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer_impl.cpp:343:notifyLoadModelStatus:<primary-nvinference-engine> [UID 1]: Load new model:dstensor_pgie_config.txt sucessfully
Running...
max_fps_dur 8.33333e+06 min_fps_dur 2e+08

The "num_frames_in_batch" field in NvDsBatchMeta tells you the size of the batch.

The batched output tensors are not in one continuous buffer. Copying the tensors from different places into one continuous buffer also needs extra effort and time. Why do you need to read the whole batch at once? What will you do to accelerate the postprocessing?

Can you tell us why you use the "output-tensor-meta" method to customize the postprocessing instead of customizing with the "NvDsInferParseCustomFunc" interface?

I don't use the custom function because in this function, configurable in the config, I could not find a way to get the device buffers of NvDsInferTensorMeta or any other buffer ON the GPU/device; only buffers on the host are given.
I want to do the postprocessing directly on the GPU using batch-processing kernels for non-maximum suppression and ROI alignment for mask creation. I tried modifying the model's output head, but this showed bad performance because TensorRT's INMSLayer synchronizes with the host on every call. I couldn't find a solution for this either, and on the forum the topic gets no answers:
https://forums.developer.nvidia.com/t/inmslayer-cuda-graph-invalidation-devicetoshapehostcopy/338025/6

And since my kernels need relatively large buffers for processing, I don't want to allocate them on every single call, but once at initialization with the maximum expected size.

I don't understand how some of these design decisions were made. Why is the originally complete output buffer split up, or is this due to internals of how TensorRT handles batched inference?

  1. gst-nvinfer is open source. The output tensor buffers are managed by a buffer pool, and the buffers inside the pool are reused from batch to batch while the pipeline runs. The order of the frame inputs within the batch is not fixed, and we want to manage the buffers with the batch, not the memory. Please refer to the source code of gst-nvinfer for how the buffers are allocated and used.

  2. With the "NvDsInferParseCustomFunc" interface, the CUDA device buffer is available for customized postprocessing. E.g., in the sample deepstream_tools/yolo_deepstream/deepstream_yolo/config_infer_primary_yoloV8.txt at main · NVIDIA-AI-IOT/deepstream_tools, we use a CUDA kernel to do the postprocessing.

  3. If you don't want the gst-nvinfer internal clustering algorithm, you can set "cluster-mode=4" in the nvinfer configuration file.
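An illustrative (incomplete) nvinfer config fragment combining these two settings:

```ini
[property]
# attach the raw output tensors as NvDsInferTensorMeta
output-tensor-meta=1
# 4 = skip the gst-nvinfer internal clustering algorithm
cluster-mode=4
```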

I would clearly prefer editing the model, as in the repository GitHub - marcoslucianops/DeepStream-Yolo-Seg: NVIDIA DeepStream SDK 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 implementation for YOLO-Segmentation models, by adding NMS and RoiAlign to the model's output.
As stated in https://forums.developer.nvidia.com/t/inmslayer-cuda-graph-invalidation-devicetoshapehostcopy/338025/6, the INMSLayer is somehow synchronizing with the host, I guess because of the dynamic output size. Is there a way to fix the dimension of the INMSLayer output to the total maxOutputBoxes and use the NumOutputBoxes output to access only the valid NumOutputBoxes entries in SelectedIndices? TensorRT: nvinfer1::INMSLayer Class Reference

Please raise the topic in the TensorRT forum: Latest Deep Learning (Training & Inference)/TensorRT topics - NVIDIA Developer Forums

Done here:
