Output-tensor-meta: Access raw model output with batch dimension

• GeForce RTX 4070 Ti Laptop dGPU
• DeepStream Version 7.1 triton-multiarch Docker Container
• TensorRT Version 10.3
• NVIDIA GPU Driver 575
• Issue Type: Question
My instance segmentation model has two outputs with shapes:

output[batch_dim, 39 (bbox+scores+mask_coeffs), 8400]
protos[batch_dim, 32, 160, 160]

Now, when my model is set to output-tensor-meta=1, I can access the NvDsInferTensorMeta, but its layer information and buffers do NOT contain a batch dimension. For my custom postprocessor I have noticed that running it in a for loop slows down the pipeline a lot when the batch size is large.
My postprocessor consists of parsing, NMS, RoiAlign, and mask-creation kernels, of which the RoiAlign kernel takes ~2 ms due to the large data size. When this slow kernel is called 32+ times per batch, the pipeline slows down drastically. I want to refactor my kernels to run on batches by adding a dimension, but the way the metadata is accessed makes this far too complicated. I just need a single tensor per output with the shapes I specified at the top, not split up along the batch dimension.

Is there any way to access the real raw output of the model?

  1. If output-tensor-meta is set to 1, you can access the inference results directly. Please refer to the native sample deepstream-infer-tensor-meta-test for how to get the inference output tensors.
  2. From the shapes, there is a batch dimension batch_dim in the output tensors. Could you share a DeepStream running log, which will include the printing of the actual dimensions?

I only added a print to each probe in the sample you suggested: deepstream-infer-tensor-meta-test

Everything else is exactly the same, configs etc. The prints never show a batch dimension, which should be dim[0]. Per configuration, this dim should be 1 for the PGIE and 16 for the SGIE, but the following is the output for the first 4 dims:

PGIE Frame Number: 112
-------------
PGIE InferLayer 0 Info: output_cov/Sigmoid:0
    dims: [4, 34, 60, 0]
-------------
PGIE InferLayer 1 Info: output_bbox/BiasAdd:0
    dims: [16, 34, 60, 0]
Frame Number = 107 Number of objects = 15 Vehicle Count = 11 Person Count = 4

SGIE Frame Number: 110
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [6, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [20, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [6, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [20, 0, 0, 0]


PGIE Frame Number: 113
-------------
PGIE InferLayer 0 Info: output_cov/Sigmoid:0
    dims: [4, 34, 60, 0]
-------------
PGIE InferLayer 1 Info: output_bbox/BiasAdd:0
    dims: [16, 34, 60, 0]

SGIE Frame Number: 111
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [6, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [20, 0, 0, 0]
Frame Number = 108 Number of objects = 12 Vehicle Count = 8 Person Count = 4


PGIE Frame Number: 114
-------------
PGIE InferLayer 0 Info: output_cov/Sigmoid:0
    dims: [4, 34, 60, 0]
-------------
PGIE InferLayer 1 Info: output_bbox/BiasAdd:0
    dims: [16, 34, 60, 0]

SGIE Frame Number: 112
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [6, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [20, 0, 0, 0]
Frame Number = 109 Number of objects = 11 Vehicle Count = 7 Person Count = 4


PGIE Frame Number: 115
-------------
PGIE InferLayer 0 Info: output_cov/Sigmoid:0
    dims: [4, 34, 60, 0]
-------------
PGIE InferLayer 1 Info: output_bbox/BiasAdd:0
    dims: [16, 34, 60, 0]

SGIE Frame Number: 113
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [6, 0, 0, 0]
-------------
SGIE InferLayer 0 Info: predictions/Softmax:0
    dims: [20, 0, 0, 0]
Frame Number = 110 Number of objects = 10 Vehicle Count = 6 Person Count = 4

For the SGIE, the 6 and 20 are just the number of classes, not the dynamic batch. So I have to iterate through all units inside batch_meta, be they ROIs, objects, or frames; each iteration contains the result for a SINGLE batch unit, but I want the WHOLE tensor, so the output should show dims: [16, 20, 0, 0].
I looked at the Softmax output layer of the vehicle-make model's ONNX file, and it also shows the dims:

predictions/Softmax:0
tensor: float32[unk__94,20]

with unk__94 being the batch_dim I am desperately looking for. It looks like the meta utilities are removing the dim at position 0 and “splitting” each batch unit into its own NvDsInferTensorMeta data buffer.

this is output on startup for the vehicle make etc. models used by sample:

root@flowx16:/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-infer-tensor-meta-test# /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-infer-tensor-meta-test/deepstream-infer-tensor-meta-app /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264
With tracker
max_fps_dur 8.33333e+06 min_fps_dur 2e+08
Now playing...
Failed to query video capabilities: Invalid argument
0:00:00.144840677 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<secondary2-nvinference-engine> NvDsInferContext[UID 3]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:2092> [UID = 3]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-7.1/samples/models/Secondary_VehicleTypes/resnet18_vehicletypenet_pruned.onnx_b16_gpu0_int8.engine
INFO: ../nvdsinfer/nvdsinfer_model_builder.cpp:327 [FullDims Engine Info]: layers num: 2
0   INPUT  kFLOAT input_1:0       3x224x224       min: 1x3x224x224     opt: 16x3x224x224    Max: 16x3x224x224    
1   OUTPUT kFLOAT predictions/Softmax:0 6               min: 0               opt: 0               Max: 0               

0:00:00.144876361 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<secondary2-nvinference-engine> NvDsInferContext[UID 3]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2195> [UID = 3]: Use deserialized engine model: /opt/nvidia/deepstream/deepstream-7.1/samples/models/Secondary_VehicleTypes/resnet18_vehicletypenet_pruned.onnx_b16_gpu0_int8.engine
0:00:00.149713450 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer_impl.cpp:343:notifyLoadModelStatus:<secondary2-nvinference-engine> [UID 3]: Load new model:dstensor_sgie2_config.txt sucessfully
0:00:00.154977079 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<secondary1-nvinference-engine> NvDsInferContext[UID 2]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:2092> [UID = 2]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-7.1/samples/models/Secondary_VehicleMake/resnet18_vehiclemakenet_pruned.onnx_b16_gpu0_int8.engine
INFO: ../nvdsinfer/nvdsinfer_model_builder.cpp:327 [FullDims Engine Info]: layers num: 2
0   INPUT  kFLOAT input_1:0       3x224x224       min: 1x3x224x224     opt: 16x3x224x224    Max: 16x3x224x224    
1   OUTPUT kFLOAT predictions/Softmax:0 20              min: 0               opt: 0               Max: 0               

0:00:00.155002228 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<secondary1-nvinference-engine> NvDsInferContext[UID 2]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2195> [UID = 2]: Use deserialized engine model: /opt/nvidia/deepstream/deepstream-7.1/samples/models/Secondary_VehicleMake/resnet18_vehiclemakenet_pruned.onnx_b16_gpu0_int8.engine
0:00:00.155510811 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer_impl.cpp:343:notifyLoadModelStatus:<secondary1-nvinference-engine> [UID 2]: Load new model:dstensor_sgie1_config.txt sucessfully
0:00:00.160862850 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<primary-nvinference-engine> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:2092> [UID = 1]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-7.1/samples/models/Primary_Detector/resnet18_trafficcamnet_pruned.onnx_b1_gpu0_int8.engine
Implicit layer support has been deprecated
INFO: ../nvdsinfer/nvdsinfer_model_builder.cpp:327 [Implicit Engine Info]: layers num: 0

0:00:00.160883614 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer.cpp:684:gst_nvinfer_logger:<primary-nvinference-engine> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2195> [UID = 1]: Use deserialized engine model: /opt/nvidia/deepstream/deepstream-7.1/samples/models/Primary_Detector/resnet18_trafficcamnet_pruned.onnx_b1_gpu0_int8.engine
0:00:00.164382456 470186 0x59c5dc3f37a0 INFO                 nvinfer gstnvinfer_impl.cpp:343:notifyLoadModelStatus:<primary-nvinference-engine> [UID 1]: Load new model:dstensor_pgie_config.txt sucessfully
Running...
max_fps_dur 8.33333e+06 min_fps_dur 2e+08

The "num_frames_in_batch" field in NvDsBatchMeta tells you the size of the batch.

The batched output tensors are not in one continuous buffer. Copying the tensors from different places into one continuous buffer also needs extra effort and time. Why do you need to read the whole batch at once? What will you do to accelerate the postprocessing?

Can you tell us why you use the "output-tensor-meta" method to customize the postprocessing instead of customizing with the "NvDsInferParseCustomFunc" interface?

I don't use the custom function because in this function, configurable in the config, I could not find a way to get the device buffers of NvDsInferTensorMeta or any other buffer ON the GPU/device; only buffers on the host are given.
I want to do the postprocessing directly on the GPU using batch-processing kernels for non-maximum suppression and ROI alignment for mask creation. I tried modifying the model's output head, but this showed bad performance because TensorRT's INMSLayer synchronizes with the host on every call. I couldn't find a solution for this either, and on the forum the topic gets no answers:
https://forums.developer.nvidia.com/t/inmslayer-cuda-graph-invalidation-devicetoshapehostcopy/338025/6

And since my kernels need relatively large buffers for processing, I don't want to allocate them on every single call, but once at initialization with the maximum expected size.

I don't understand how some of these design decisions were made. Why is the originally complete output buffer split up, or is this due to internals of how TensorRT handles batched inference?

  1. gst-nvinfer is open source. The output tensor buffers are managed by a buffer pool, and the buffers inside the pool are reused from batch to batch while the pipeline runs. The order of the frame inputs within the batch is not fixed, and we want to manage the buffers with the batch, not the memory. Please refer to the source code of gst-nvinfer for how the buffers are allocated and used.

  2. With the "NvDsInferParseCustomFunc" interface, the CUDA device buffer is available for customized postprocessing. E.g., in the sample deepstream_tools/yolo_deepstream/deepstream_yolo/config_infer_primary_yoloV8.txt at main · NVIDIA-AI-IOT/deepstream_tools, we use a CUDA kernel to do the postprocessing.

  3. If you don't want the gst-nvinfer internal clustering algorithm, you can set "cluster-mode=4" in the nvinfer configuration file.
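An illustrative (incomplete) nvinfer config fragment combining these two settings:

```ini
[property]
# attach the raw output tensors as NvDsInferTensorMeta
output-tensor-meta=1
# 4 = skip the gst-nvinfer internal clustering algorithm
cluster-mode=4
```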

I would clearly prefer editing the model, as in the repository GitHub - marcoslucianops/DeepStream-Yolo-Seg: NVIDIA DeepStream SDK 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 implementation for YOLO-Segmentation models, by adding NMS and RoiAlign to the model's output.
As stated in https://forums.developer.nvidia.com/t/inmslayer-cuda-graph-invalidation-devicetoshapehostcopy/338025/6, the INMSLayer is somehow synchronizing with the host, I guess because of the dynamic output size. Is there a way to fix the dimension of the INMSLayer output to the total maxOutputBoxes and use the NumOutputBoxes output to access only the valid NumOutputBoxes entries in SelectedIndices? TensorRT: nvinfer1::INMSLayer Class Reference

Please raise the topic in the TensorRT forum: Latest Deep Learning (Training & Inference)/TensorRT topics - NVIDIA Developer Forums

Done here:
