Issue with Bounding Boxes and Object Detection in DeepStream Using YOLOv8 Model

Dear NVIDIA Developer Team,

We are currently developing a DeepStream app for Personal Protective Equipment (PPE) detection using a YOLOv8 object detection model. However, we have encountered issues with the bounding boxes and the number of detected objects while the app is running. Below is a detailed description of our process, the issue we are facing, and the steps we have already taken to troubleshoot the problem.

Environment:
• Hardware Platform (Jetson): Jetson Orin NX
• JetPack Version: 6.2
• DeepStream Version: 7.1
• TensorRT Version: 10.3.0
• CUDA Version: 12.6
• Operating System + Version: Ubuntu 22.04
• Python Version (if applicable): 3.10.12

1. Model and Dataset
• We have trained a YOLOv8 object detection model on our PPE dataset, which includes the following classes:
  • Safety goggles
  • Non-safety goggles
  • Toe guards
  • Non-safety shoes
  • Safety shoes

2. Model Conversion
• After training the model, we converted it into a .engine file using the trtexec command (an example invocation is shown below).
• The model is now being used in our DeepStream application.
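
For reference, a typical trtexec invocation for this kind of conversion looks roughly like the line below; the ONNX file name and the FP16 flag are illustrative assumptions, not necessarily our exact command:

trtexec --onnx=yolov8_ppe.onnx --saveEngine=Goggels_Shoes.engine --fp16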

3. Problem Description
• When we run the app, we encounter two main issues:
  • Bounding Boxes: The bounding boxes are drawn at the corners of the image rather than surrounding the detected objects.
  • Object Count: The number of detected objects is also incorrect.

4. Debugging Steps Taken
   To identify the root cause, we performed the following debugging steps:
• Model Check: We ran the .engine model outside of DeepStream to see whether the issue persisted. We observed the same bounding box issue in this case.

Terminal output:

Class ID: 4, Label: Non Safety Goggles: 4.66
Warning: class_id 62 is out of bounds, skipping detection.
Warning: class_id 9 is out of bounds, skipping detection.
Warning: class_id 35 is out of bounds, skipping detection.
Class ID: 4, Label: Non Safety Goggles: 4.18
Warning: class_id 60 is out of bounds, skipping detection.
Warning: class_id 8 is out of bounds, skipping detection.
Warning: class_id 35 is out of bounds, skipping detection.
Class ID: 4, Label: Non Safety Goggles: 4.39
Warning: class_id 60 is out of bounds, skipping detection.
Warning: class_id 9 is out of bounds, skipping detection.
Warning: class_id 35 is out of bounds, skipping detection.
Class ID: 4, Label: Non Safety Goggles: 4.04
Warning: class_id 61 is out of bounds, skipping detection.
Warning: class_id 8 is out of bounds, skipping detection.

  • Parser Check: We suspected that the issue might lie in the parser, so we ran the pretrained YOLOv8 object detection model through the same parser with a single "person" class: we put "person" in the labels.txt file and changed the number of detected classes in the parser accordingly. However, we encountered the same bounding box problem.

The following image illustrates our issue:

Our parser script is:
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstring>
#include <fstream>
#include <iostream>
#include <unordered_map>
#include "nvdsinfer_custom_impl.h"

static const int NUM_CLASSES_YOLO = 1; // Only detecting "person" class

float clamp(const float val, const float minVal, const float maxVal)
{
    assert(minVal <= maxVal);
    return std::min(maxVal, std::max(minVal, val));
}

static NvDsInferParseObjectInfo convertBBoxYoloV8(const float& bx, const float& by, const float& bw,
                                                  const float& bh, const int& stride, const uint& netW,
                                                  const uint& netH)
{
    NvDsInferParseObjectInfo b;
    float xCenter = bx * stride;
    float yCenter = by * stride;
    float x0 = xCenter - bw / 2;
    float y0 = yCenter - bh / 2;
    float x1 = x0 + bw;
    float y1 = y0 + bh;

    x0 = clamp(x0, 0, netW);
    y0 = clamp(y0, 0, netH);
    x1 = clamp(x1, 0, netW);
    y1 = clamp(y1, 0, netH);

    b.left = x0;
    b.width = clamp(x1 - x0, 0, netW);
    b.top = y0;
    b.height = clamp(y1 - y0, 0, netH);

    return b;
}

static void addBBoxProposalYoloV8(const float bx, const float by, const float bw, const float bh,
                                  const uint stride, const uint& netW, const uint& netH, const int maxIndex,
                                  const float maxProb, std::vector<NvDsInferParseObjectInfo>& binfo)
{
    NvDsInferParseObjectInfo bbi = convertBBoxYoloV8(bx, by, bw, bh, stride, netW, netH);
    if (bbi.width < 1 || bbi.height < 1) return;

    bbi.detectionConfidence = maxProb;
    bbi.classId = maxIndex;
    binfo.push_back(bbi);
}

static bool NvDsInferParseYoloV8(
    std::vector<NvDsInferLayerInfo> const& outputLayersInfo,
    NvDsInferNetworkInfo const& networkInfo,
    NvDsInferParseDetectionParams const& detectionParams,
    std::vector<NvDsInferParseObjectInfo>& objectList)
{
    if (outputLayersInfo.empty()) {
        std::cerr << "Could not find output layer in bbox parsing" << std::endl;
        return false;
    }
    const NvDsInferLayerInfo &layer = outputLayersInfo[0];

    if (NUM_CLASSES_YOLO != detectionParams.numClassesConfigured)
    {
        std::cerr << "WARNING: Num classes mismatch. Configured:"
                  << detectionParams.numClassesConfigured
                  << ", detected by network: " << NUM_CLASSES_YOLO << std::endl;
    }

    std::vector<NvDsInferParseObjectInfo> objects;

    float* data = (float*)layer.buffer;
    const int dimensions = layer.inferDims.d[1];
    int rows = layer.inferDims.numElements / layer.inferDims.d[1];

    for (int i = 0; i < rows; ++i) {
        // Assumed row layout: x, y, w, h, score0 ... (e.g. 85 values per row for COCO)
        float bx = data[0];
        float by = data[1];
        float bw = data[2];
        float bh = data[3];
        float* classes_scores = data + 4;

        float maxScore = 0;
        int index = 0;
        // Only one class ("person", index 0), so just check its score:
        if (*classes_scores > maxScore) {
            index = 0;
            maxScore = *classes_scores;
        }

        // Check the confidence threshold for the "person" class (index 0)
        if (maxScore > detectionParams.perClassPreclusterThreshold[index]) {
            int maxIndex = index;
            data += dimensions;
            addBBoxProposalYoloV8(bx, by, bw, bh, 1, networkInfo.width, networkInfo.height, maxIndex, maxScore, objects);
        } else {
            data += dimensions;
        }
    }
    objectList = objects;
    return true;
}

extern "C" bool NvDsInferParseCustomYoloV8(
    std::vector<NvDsInferLayerInfo> const& outputLayersInfo,
    NvDsInferNetworkInfo const& networkInfo,
    NvDsInferParseDetectionParams const& detectionParams,
    std::vector<NvDsInferParseObjectInfo>& objectList)
{
    return NvDsInferParseYoloV8(
        outputLayersInfo, networkInfo, detectionParams, objectList);
}

/* Check that the custom function has been defined correctly */
CHECK_CUSTOM_PARSE_FUNC_PROTOTYPE(NvDsInferParseCustomYoloV8);
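
One additional doubt on our side (an assumption, not something we have verified): Ultralytics YOLOv8 usually exports its raw detection head channel-first, as [batch, 4 + num_classes, num_anchors] (e.g. [1, 84, 8400] for COCO), while the loop above reads the buffer as [num_anchors, 4 + num_classes] rows. If our engine output is in fact channel-first, the per-anchor values would have to be gathered with a stride instead, along these lines (an untested sketch replacing the row loop inside NvDsInferParseYoloV8, not our working code):

// Untested sketch for a channel-first [1, C, N] output, where
// C = 4 + NUM_CLASSES_YOLO and N = the number of anchor cells (e.g. 8400).
// With the batch dimension stripped, inferDims.d[0] == C and d[1] == N.
const int C = layer.inferDims.d[0];
const int N = layer.inferDims.d[1];
float* data = (float*)layer.buffer;
for (int i = 0; i < N; ++i) {
    float bx = data[0 * N + i];
    float by = data[1 * N + i];
    float bw = data[2 * N + i];
    float bh = data[3 * N + i];
    float maxScore = 0;
    int maxIndex = 0;
    for (int j = 0; j < C - 4; ++j) {
        float score = data[(4 + j) * N + i];  // class scores start at channel 4
        if (score > maxScore) { maxScore = score; maxIndex = j; }
    }
    if (maxScore > detectionParams.perClassPreclusterThreshold[maxIndex])
        addBBoxProposalYoloV8(bx, by, bw, bh, 1, networkInfo.width,
                              networkInfo.height, maxIndex, maxScore, objects);
}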

  • Parser or Display Issue: Based on our observations, we suspect the issue lies in the parser, in the display handling, or in the model itself.
5. Next Steps & Request for Assistance

Our complete .cpp parser script (for the 5 PPE classes):
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstring>
#include <fstream>
#include <iostream>
#include <unordered_map>
#include "nvdsinfer_custom_impl.h"

static const int NUM_CLASSES_YOLO = 5;

float clamp(const float val, const float minVal, const float maxVal)
{
    assert(minVal <= maxVal);
    return std::min(maxVal, std::max(minVal, val));
}

static NvDsInferParseObjectInfo convertBBoxYoloV8(const float& bx, const float& by, const float& bw,
                                                  const float& bh, const int& stride, const uint& netW,
                                                  const uint& netH)
{
    NvDsInferParseObjectInfo b;
    // Restore coordinates to network input resolution
    float xCenter = bx * stride;
    float yCenter = by * stride;
    float x0 = xCenter - bw / 2;
    float y0 = yCenter - bh / 2;
    float x1 = x0 + bw;
    float y1 = y0 + bh;

    x0 = clamp(x0, 0, netW);
    y0 = clamp(y0, 0, netH);
    x1 = clamp(x1, 0, netW);
    y1 = clamp(y1, 0, netH);

    b.left = x0;
    b.width = clamp(x1 - x0, 0, netW);
    b.top = y0;
    b.height = clamp(y1 - y0, 0, netH);

    return b;
}

static void addBBoxProposalYoloV8(const float bx, const float by, const float bw, const float bh,
                                  const uint stride, const uint& netW, const uint& netH, const int maxIndex,
                                  const float maxProb, std::vector<NvDsInferParseObjectInfo>& binfo)
{
    NvDsInferParseObjectInfo bbi = convertBBoxYoloV8(bx, by, bw, bh, stride, netW, netH);
    if (bbi.width < 1 || bbi.height < 1) return;

    bbi.detectionConfidence = maxProb;
    bbi.classId = maxIndex;
    binfo.push_back(bbi);
}

static bool NvDsInferParseYoloV8(
    std::vector<NvDsInferLayerInfo> const& outputLayersInfo,
    NvDsInferNetworkInfo const& networkInfo,
    NvDsInferParseDetectionParams const& detectionParams,
    std::vector<NvDsInferParseObjectInfo>& objectList)
{
    if (outputLayersInfo.empty()) {
        std::cerr << "Could not find output layer in bbox parsing" << std::endl;
        return false;
    }
    const NvDsInferLayerInfo &layer = outputLayersInfo[0];

    if (NUM_CLASSES_YOLO != detectionParams.numClassesConfigured)
    {
        std::cerr << "WARNING: Num classes mismatch. Configured:"
                  << detectionParams.numClassesConfigured
                  << ", detected by network: " << NUM_CLASSES_YOLO << std::endl;
    }

    std::vector<NvDsInferParseObjectInfo> objects;

    float* data = (float*)layer.buffer;
    const int dimensions = layer.inferDims.d[1];
    int rows = layer.inferDims.numElements / layer.inferDims.d[1];

    for (int i = 0; i < rows; ++i) {
        // Assumed row layout: x, y, w, h, score0 ... score(NUM_CLASSES_YOLO - 1)
        float bx = data[0];
        float by = data[1];
        float bw = data[2];
        float bh = data[3];
        float* classes_scores = data + 4;

        float maxScore = 0;
        int index = 0;
        for (int j = 0; j < NUM_CLASSES_YOLO; j++) {
            if (*classes_scores > maxScore) {
                index = j;
                maxScore = *classes_scores;
            }
            classes_scores++;
        }

        // Important: check the confidence threshold here
        if (maxScore > detectionParams.perClassPreclusterThreshold[index]) {
            int maxIndex = index;
            data += dimensions;
            // Use maxScore as the confidence instead of always using 1.0
            addBBoxProposalYoloV8(bx, by, bw, bh, 1, networkInfo.width, networkInfo.height, maxIndex, maxScore, objects);
        } else {
            data += dimensions;
        }
    }
    objectList = objects;
    return true;
}

extern "C" bool NvDsInferParseCustomYoloV8(
    std::vector<NvDsInferLayerInfo> const& outputLayersInfo,
    NvDsInferNetworkInfo const& networkInfo,
    NvDsInferParseDetectionParams const& detectionParams,
    std::vector<NvDsInferParseObjectInfo>& objectList)
{
    return NvDsInferParseYoloV8(
        outputLayersInfo, networkInfo, detectionParams, objectList);
}

/* Check that the custom function has been defined correctly */
CHECK_CUSTOM_PARSE_FUNC_PROTOTYPE(NvDsInferParseCustomYoloV8);

  • We would appreciate your assistance in identifying the cause of the bounding box issue and any guidance on how to resolve it.

  • We are particularly looking for help in debugging the parser or display code, as we believe one of these components may be the source of the problem.

Please let us know if you require additional information or if we can provide any further details to assist in troubleshooting.
Thank you for your time and support!

Hey @fanzh and @yingliu, could you please help with this issue? Your guidance would be greatly appreciated!

How do you run that engine outside of DeepStream?

Could you attach your config file and whole pipeline?

Hey @yuweiw

No, we didn't attach any config file; we just loaded the engine model to check the model's performance after converting it to .engine.

I don't know whether I followed the correct approach or not, but I used this code:

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import tensorrt as trt
import cv2

CONFIDENCE_THRESHOLD = 0.5
CLASS_LABELS = {
    0: "Safety Goggle",
    1: "ToeGuard",
    2: "Non Safety Shoes",
    3: "Safety Shoes",
    4: "Non Safety Goggles"
}

Load the TensorRT engine

def load_engine(engine_path):
    """Load the TensorRT engine from a file."""
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    runtime = trt.Runtime(TRT_LOGGER)
    engine = runtime.deserialize_cuda_engine(engine_data)
    print(f"Loaded engine successfully: {engine}")
    return engine

Allocate memory buffers for input and output

def allocate_buffers(engine):
    """Allocate memory for input/output buffers."""
    inputs, outputs, bindings, host_outputs = [], [], [], []

    engine_context = engine.create_execution_context()
    num_tensors = engine.num_io_tensors
    print(f"Number of tensors: {num_tensors}")

    for tensor_idx in range(num_tensors):
        tensor_name = engine.get_tensor_name(tensor_idx)
        shape = engine_context.get_tensor_shape(tensor_name)
        dtype = trt.nptype(engine.get_tensor_dtype(tensor_name))

        size = trt.volume(shape) * np.dtype(dtype).itemsize
        device_mem = cuda.mem_alloc(size)
        host_mem = np.empty(shape, dtype=dtype)

        bindings.append(int(device_mem))
        host_outputs.append(host_mem)

        if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
            inputs.append((device_mem, host_mem))
        else:
            outputs.append((device_mem, host_mem))

    return inputs, outputs, bindings, host_outputs, engine_context

Perform inference using the TensorRT engine

def infer(engine, inputs, outputs, bindings, stream, engine_context):
    """Perform inference with the TensorRT engine and return output."""
    for input_mem, host_mem in inputs:
        cuda.memcpy_htod(input_mem, host_mem)

    engine_context.execute_v2(bindings)

    output_data = []  # Store results
    for output_mem, host_mem in outputs:
        cuda.memcpy_dtoh(host_mem, output_mem)
        output_data.append(host_mem.copy())  # Copy output to avoid overwriting

    return output_data

Preprocess the frame before inference

def preprocess(frame, input_shape):
    """Resize and normalize frame for TensorRT model input."""
    # input_shape is assumed to be NCHW; cv2.resize expects (width, height)
    frame_resized = cv2.resize(frame, (input_shape[3], input_shape[2]))
    frame_transposed = frame_resized.transpose((2, 0, 1))
    frame_normalized = frame_transposed.astype(np.float32) / 255.0
    return np.expand_dims(frame_normalized, axis=0)

Postprocess the output and draw bounding boxes

def postprocess(output_data, original_frame):
    """Extract bounding boxes, confidence scores, and class IDs from model output."""
    h, w, _ = original_frame.shape
    detections = []

    output_data = np.squeeze(output_data)  # Remove singleton dimensions

    for detection in output_data:
        # Extract values from the detection (this could vary depending on the model)
        x_center, y_center, bbox_width, bbox_height, confidence, class_id = detection[:6]
        confidence = float(confidence)

        # Cast class_id to integer after rounding
        class_id = int(round(class_id))

        if confidence > CONFIDENCE_THRESHOLD:  # If confidence is above the threshold
            # Ensure class_id is within the valid range (0-4)
            if class_id < 0 or class_id > 4:
                print(f"Warning: class_id {class_id} is out of bounds, skipping detection.")
                continue  # Skip detections with invalid class IDs

            # Convert normalized coordinates to pixel coordinates
            x1 = int((x_center - bbox_width / 2) * w)
            y1 = int((y_center - bbox_height / 2) * h)
            x2 = int((x_center + bbox_width / 2) * w)
            y2 = int((y_center + bbox_height / 2) * h)

            # Append the detection info
            detections.append((x1, y1, x2, y2, confidence, class_id))

            # Map class_id to the corresponding label
            label = CLASS_LABELS.get(class_id, "Unknown")
            print(f"Class ID: {class_id}, Label: {label}: {confidence:.2f}")

    return detections

Draw bounding boxes on the frame

def draw_detections(frame, detections):
    """Draw bounding boxes and labels on the frame."""
    for (x1, y1, x2, y2, confidence, class_id) in detections:
        label = f"{CLASS_LABELS[class_id]}: {confidence:.2f}"
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

Main function for running inference on a live camera stream

def main():
    engine_path = "Goggels_Shoes.engine"
    engine = load_engine(engine_path)

    inputs, outputs, bindings, host_outputs, engine_context = allocate_buffers(engine)

    stream = cuda.Stream()

    # Camera index 0; replace with your video path if needed
    cap = cv2.VideoCapture(0)

    if not cap.isOpened():
        print("Error: Could not open video stream.")
        return

    while True:
        ret, frame = cap.read()
        if not ret:
            print("Error: Failed to capture frame.")
            break

        input_data = preprocess(frame, inputs[0][1].shape)
        inputs[0][1][:] = input_data

        output_data = infer(engine, inputs, outputs, bindings, stream, engine_context)

        detections = postprocess(output_data[0], frame)

        draw_detections(frame, detections)

        cv2.imshow("Object Detection", frame)

        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()

if __name__ == "__main__":
    main()

So do you still have the same problems using this method above?

Hello,

Have you tried to follow the one and only: Marcos?

You can convert your YOLOv8 .pt into .onnx using this and generate the custom parser lib (.so) to parse it; a rough sketch of the export step is below.

Enjoy
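
For reference, the generic Ultralytics export route looks roughly like the sketch below. Note this is an assumption for illustration: the DeepStream-Yolo repo ships its own dedicated YOLOv8 export script, which may produce a different output layout, and "best.pt" is a hypothetical stand-in for the trained weights file.

# Illustrative only: generic Ultralytics YOLOv8 -> ONNX export; the
# DeepStream-Yolo repo's own export script may differ.
from ultralytics import YOLO

model = YOLO("best.pt")  # hypothetical path to the trained weights
model.export(format="onnx", imgsz=640, opset=12, simplify=True)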


Hi @henri, thanks!


Hello @henri
I am facing an issue while converting an ONNX model to a TensorRT model on JetPack 5.1.4. During the conversion process, I see the following logs in the terminal:

[04/02/2025-17:23:18] [W] [TRT] Tactic Device request: 7151MB Available: 5868MB. Device memory is insufficient to use tactic.
[04/02/2025-17:23:18] [W] [TRT] Skipping tactic 3 due to insufficient memory on requested size of 7151 detected for tactic 0x0000000000000004.
Try decreasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
[04/02/2025-17:23:18] [W] [TRT] Tactic Device request: 7151MB Available: 5871MB. Device memory is insufficient to use tactic.
[04/02/2025-17:23:18] [W] [TRT] Skipping tactic 8 due to insufficient memory on requested size of 7151 detected for tactic 0x000000000000003c.
Try decreasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
[04/02/2025-17:23:19] [W] [TRT] Tactic Device request: 7151MB Available: 5870MB. Device memory is insufficient to use tactic.
[04/02/2025-17:23:19] [W] [TRT] Skipping tactic 13 due to insufficient memory on requested size of 7151 detected for tactic 0x0000000000000074.
Try decreasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
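
For reference, the limit these warnings refer to can be lowered either with trtexec's --memPoolSize=workspace:<MiB> option or programmatically through the builder config, roughly as sketched here (the 2 GiB cap is an arbitrary illustration, not my exact setting):

#include <NvInfer.h>

// Illustrative sketch only: cap the TensorRT builder workspace pool so the
// builder stops requesting more device memory than the Jetson has free.
// The 2 GiB value is an arbitrary example, not a recommendation.
void limitWorkspace(nvinfer1::IBuilderConfig* config)
{
    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 2ULL << 30);
}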

The model conversion succeeds, but I notice a significant performance degradation.
When I decrease the workspace size, the same logs continue to appear and the performance degradation persists.
Interestingly, when I perform the same conversion on JetPack 6.2 (Jetson Orin) with TensorRT 10.3.0, the conversion completes successfully without any performance degradation, and I do not see the above logs in the terminal.
Could anyone help me understand why this performance degradation occurs on JetPack 5.1.4 with TensorRT 8.5.2, while JetPack 6.2 with TensorRT 10.3.0 does not exhibit the same issue?
Additionally, are there any recommendations for overcoming this memory issue or performance degradation on JetPack 5.1.4?

Hello,

I haven't investigated this point yet. Could it be that newer TensorRT versions better support some types of layers?

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.