Incorrect Bounding Box Decoding with YOLOv8 TensorRT Engine in DeepStream (Output Shape [5, 8400])


Description:

I have exported a YOLOv8 model (a face-detection variant) to ONNX, built a TensorRT engine from it, and integrated it into DeepStream 7.0 using a custom output parser. The output layer is named output0, with shape [5, 8400] — interpreted per box as:

[x_center, y_center, width, height, confidence]

However, DeepStream shows incorrect bounding box locations — the boxes are not aligning with actual objects in the video (faces).


Details:

DeepStream Version: 7.0

TensorRT Version: 8.x

YOLOv8 Exported via: yolo export model=yolov8n_face.pt format=onnx opset=17

Engine Built With: trtexec

Input resolution: 640×640

Output Layer Shape: [5, 8400]

Classes: 1 (face only)


Observed Output Logging:

Raw Output Sample:

Layer Name: output0
Dims: (5, 8400)
Box 0: 9.61096 5.68841 17.7715 12.5132 0.400749
Box 1: 10.81 5.52857 16.5815 11.1546 0.372961

Interpretation Attempt:

float x_center = output[0 * num_boxes + i];
float y_center = output[1 * num_boxes + i];
float width = output[2 * num_boxes + i];
float height = output[3 * num_boxes + i];
float conf = output[4 * num_boxes + i];

These values are then converted to [left, top, width, height] and clipped to the frame bounds.


Parser Code:

extern "C"
bool NvDsInferParseCustomYoloV8(
std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
NvDsInferNetworkInfo const &networkInfo,
NvDsInferParseDetectionParams const &detectionParams,
std::vector<NvDsInferObjectDetectionInfo> &objectList)
{
const NvDsInferLayerInfo &layer = outputLayersInfo[0];
const float *output = reinterpret_cast<const float *>(layer.buffer);
int num_attrs = layer.dims.d[0]; // 5
int num_boxes = layer.dims.d[1]; // 8400

for (int i = 0; i < num_boxes; ++i) {
    float x_center = output[0 * num_boxes + i];
    float y_center = output[1 * num_boxes + i];
    float width    = output[2 * num_boxes + i];
    float height   = output[3 * num_boxes + i];
    float conf     = output[4 * num_boxes + i];

    if (conf < detectionParams.perClassThreshold[0]) continue;

    float left = std::max(x_center - width / 2.0f, 0.0f);
    float top  = std::max(y_center - height / 2.0f, 0.0f);

    NvDsInferObjectDetectionInfo obj;
    obj.classId = 0;
    obj.left = left;
    obj.top = top;
    obj.width = std::min(width, networkInfo.width - left);
    obj.height = std::min(height, networkInfo.height - top);
    obj.detectionConfidence = conf;
    objectList.push_back(obj);
}

return true;

}


Problem:

Despite proper shape parsing and decoding, detections appear completely misplaced on screen. We verified that:

Data buffer is correct (floats match across frames)

Output is not normalized (values like x_center = 10, width = 20)

Detection boxes drawn do not align with faces


Request:

Could NVIDIA clarify the expected output format for YOLOv8 exported to ONNX and then to TensorRT for DeepStream?

Is there a transform (e.g., normalization or anchor/grid decode) missing from this setup?

Is there any official sample for YOLOv8 with DeepStream?

Why the subtraction here? It’s already the width and height.

Please refer to the official sample for YOLOv8.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.