Description:
I have exported a YOLOv8 face-detection model (yolov8n_face) to TensorRT via ONNX and integrated it into DeepStream 7.0 with a custom bounding-box parser. The output layer is named output0, with shape [5, 8400], which I interpret per detection as:
[x_center, y_center, width, height, confidence]
However, DeepStream draws the bounding boxes in the wrong locations: they do not align with the faces in the video.
Details:
DeepStream Version: 7.0
TensorRT Version: 8.x
YOLOv8 Exported via: yolo export model=yolov8n_face.pt format=onnx opset=17
Engine Built With: trtexec (approximate command shown below this list)
Input resolution: 640×640
Output Layer Shape: [5, 8400]
Classes: 1 (face only)
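For completeness, the engine was built with a command along these lines (reconstructed from memory; file names illustrative):

trtexec --onnx=yolov8n_face.onnx --saveEngine=yolov8n_face.engine --fp16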
Observed Output Logging:
Raw Output Sample:
Layer Name: output0
Dims: (5, 8400)
Box 0: 9.61096 5.68841 17.7715 12.5132 0.400749
Box 1: 10.81 5.52857 16.5815 11.1546 0.372961
…
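For clarity on what "Box N" means above, this is roughly the loop that produced the dump (a reconstruction; it uses the same attribute-major indexing as the parser below, with output pointing at the output0 float buffer):

// Reconstruction of the debug dump; 'output' is assumed to be laid out
// attribute-major as [5][8400].
for (int i = 0; i < 2; ++i) {
    printf("Box %d:", i);
    for (int a = 0; a < num_attrs; ++a)
        printf(" %g", output[a * num_boxes + i]);
    printf("\n");
}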
Interpretation Attempt:
// Attribute-major indexing into the flat [5 x 8400] buffer:
float x_center = output[0 * num_boxes + i];
float y_center = output[1 * num_boxes + i];
float width    = output[2 * num_boxes + i];
float height   = output[3 * num_boxes + i];
float conf     = output[4 * num_boxes + i];
These values are then converted to [left, top, width, height] and clipped to the frame (full parser below).
Parser Code:
#include <algorithm>
#include <vector>

#include "nvdsinfer_custom_impl.h"

extern "C" bool NvDsInferParseCustomYoloV8(
    std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
    NvDsInferNetworkInfo const &networkInfo,
    NvDsInferParseDetectionParams const &detectionParams,
    std::vector<NvDsInferObjectDetectionInfo> &objectList)
{
    const NvDsInferLayerInfo &layer = outputLayersInfo[0]; // output0
    const float *output = reinterpret_cast<const float *>(layer.buffer);

    // Buffer is attribute-major: [5][8400] = [x, y, w, h, conf] per box.
    const int num_attrs = layer.inferDims.d[0]; // 5
    const int num_boxes = layer.inferDims.d[1]; // 8400
    if (num_attrs != 5)
        return false; // unexpected output layout

    for (int i = 0; i < num_boxes; ++i) {
        float x_center = output[0 * num_boxes + i];
        float y_center = output[1 * num_boxes + i];
        float width    = output[2 * num_boxes + i];
        float height   = output[3 * num_boxes + i];
        float conf     = output[4 * num_boxes + i];

        if (conf < detectionParams.perClassPreclusterThreshold[0])
            continue;

        // Center format -> corner format, clipped to the network input.
        float left   = std::max(x_center - width  / 2.0f, 0.0f);
        float top    = std::max(y_center - height / 2.0f, 0.0f);
        float right  = std::min(x_center + width  / 2.0f, (float)networkInfo.width);
        float bottom = std::min(y_center + height / 2.0f, (float)networkInfo.height);

        NvDsInferObjectDetectionInfo obj;
        obj.classId = 0; // single class: face
        obj.left   = left;
        obj.top    = top;
        obj.width  = right - left;
        obj.height = bottom - top;
        obj.detectionConfidence = conf;
        objectList.push_back(obj);
    }
    return true;
}

// Validate the parser prototype at compile time (DeepStream convention).
CHECK_CUSTOM_PARSE_FUNC_PROTOTYPE(NvDsInferParseCustomYoloV8);
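For context, the relevant nvinfer config entries (library path and some values illustrative; in particular, we are unsure whether maintain-aspect-ratio and symmetric-padding here must match the letterboxing the model was trained and exported with):

[property]
onnx-file=yolov8n_face.onnx
model-engine-file=yolov8n_face.engine
network-mode=2
num-detected-classes=1
net-scale-factor=0.0039215697906911373
maintain-aspect-ratio=1
symmetric-padding=1
parse-bbox-func-name=NvDsInferParseCustomYoloV8
custom-lib-path=./libnvdsinfer_yolov8_parser.so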
Problem:
Despite what we believe is correct shape parsing and decoding, the detections drawn on screen are completely misplaced. We have verified that:
The output buffer is read consistently (the same float values recur across frames)
The output does not appear to be normalized to [0, 1] (we see values such as x_center = 10, width = 20)
The drawn detection boxes still do not align with the faces
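To rule out a transposed buffer or normalized coordinates on our side, we can drop a quick diagnostic into the parser (a sketch; the helper name and the interpretation thresholds are ours):

#include <algorithm>
#include <cstdio>

// Diagnostic sketch: report the maximum x_center under both plausible
// layouts of the [5 x 8400] buffer. If the attribute-major maximum stays
// <= ~1.0, the output is normalized and must be scaled by the network
// input size; if only the box-major values look sane, the buffer is
// transposed relative to what the parser assumes.
static void dumpLayoutStats(const float *output, int num_attrs, int num_boxes)
{
    float maxAttrMajor = 0.0f; // layout [5][8400] (parser's assumption)
    float maxBoxMajor  = 0.0f; // layout [8400][5] (transposed)
    for (int i = 0; i < num_boxes; ++i) {
        maxAttrMajor = std::max(maxAttrMajor, output[0 * num_boxes + i]);
        maxBoxMajor  = std::max(maxBoxMajor,  output[i * num_attrs + 0]);
    }
    std::printf("max x_center: attr-major=%f box-major=%f\n",
                maxAttrMajor, maxBoxMajor);
}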
Request:
Could NVIDIA clarify the expected output format for YOLOv8 exported to ONNX and then to TensorRT for DeepStream?
Is there a transform (e.g., normalization, anchor/grid decode, or letterbox un-mapping) missing from this setup? (A sketch of the letterbox un-mapping we have in mind follows at the end of this post.)
Is there any official sample for YOLOv8 with DeepStream?
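For reference, this is the kind of transform the second question refers to: a sketch of mapping a box decoded in 640x640 network coordinates back to the original frame, under our assumptions (aspect-preserving resize with symmetric padding, YOLO-style letterbox; the struct and function names are ours, not a DeepStream API):

#include <algorithm>

struct Box { float left, top, width, height; };

// Map a box from netW x netH network coordinates back to the original
// frameW x frameH frame, assuming the frame was scaled with preserved
// aspect ratio and padded symmetrically to the network input size.
static Box unletterbox(Box b, int netW, int netH, int frameW, int frameH)
{
    const float gain = std::min(netW / (float)frameW, netH / (float)frameH);
    const float padX = (netW - frameW * gain) / 2.0f; // horizontal padding
    const float padY = (netH - frameH * gain) / 2.0f; // vertical padding
    b.left   = (b.left - padX) / gain;
    b.top    = (b.top  - padY) / gain;
    b.width  = b.width  / gain;
    b.height = b.height / gain;
    return b;
}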