YOLOv8 gives wrong output with TensorRT

Description

I am converting my ONNX model to TensorRT. After completing the conversion and running inference, I got a very strange result that is far from what I expected.

I suspect something in my program is preventing the engine from working properly, most likely in the TensorRT inference part, but I haven't been able to identify the cause.

This is the output of my program. The values keep increasing:

My model has 2 classes, so the output should look like this:
[x, y, w, h, c1, c2] * 78200
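
One thing worth checking (a sketch under assumptions, not a confirmed diagnosis): the ONNX Runtime script later in this post transposes `prediction[0]` before reading boxes, and `config.h` sizes the output as `1 * 6 * 78200`. That suggests the raw tensor may be laid out as `[1, 6, 78200]`, meaning all x values first, then all y values, and so on, rather than 78200 consecutive rows of 6. If so, the first 20 floats would all be x-centres of neighbouring grid cells, and steadily increasing values would be expected. The names `Candidate` and `readCandidate` below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One decoded candidate: box centre, box size, and the two class scores.
struct Candidate {
    float x, y, w, h;
    float c1, c2;
};

// Read anchor `i` from a channel-major buffer of shape [1, 6, numAnchors].
// Element (channel c, anchor i) sits at flat index c * numAnchors + i.
Candidate readCandidate(const std::vector<float>& out, std::size_t numAnchors, std::size_t i)
{
    assert(out.size() == 6 * numAnchors && i < numAnchors);
    return Candidate{
        out[0 * numAnchors + i],  // x centre
        out[1 * numAnchors + i],  // y centre
        out[2 * numAnchors + i],  // width
        out[3 * numAnchors + i],  // height
        out[4 * numAnchors + i],  // score of class 1
        out[5 * numAnchors + i],  // score of class 2
    };
}
```

With the buffers from this post, `readCandidate(output, 78200, i)` would return the i-th candidate if the engine does use this layout. Printing `output[0]` through `output[19]` would then show only x values.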

Environment

TensorRT Version: 8.5.2.2
GPU Type: Jetson Xavier NX
Jetpack Version: 5.1.2
CUDA Version: 11.4.315
CUDNN Version: 8.6.0.166

The following is my main program:

#include <opencv2/opencv.hpp>
#include <string>
#include <chrono>
#include <iostream>
#include <vector>
#include "Yolo.h"
#include "Camera.h"
#include "config.h"
#include "Tools.h"

std::string engineFile = "/home/nvidia/Desktop/6.3.engine";


int main()
{
    Yolo yolo(engineFile);
    Camera camera(0);

    std::vector<float> input(inputSize, 0.0f);
    std::vector<float> output(outputSize, 0.0f);

    cv::Mat frame = camera.get_frame();
    input = preprocessImage(frame);
    if (!yolo.infer(input, output))
        std::cout << "yolo.infer error" << std::endl;

    for (int i = 0; i < 20; i++)
        std::cout << output[i] << std::endl;


    return 0;
}

Yolo.cpp

//
// Created by JIN on 25-5-21.
//

#include "Yolo.h"
#include <NvInferRuntime.h>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

#include "Tools.h"
#include "config.h"



Yolo::Yolo(const std::string& path)
{
    try
    {
        // 1. Load the engine data
        engineData = loadEngineFile(path);
        if (engineData.empty())
            throw std::runtime_error("Failed to load engine data.");

        // 2. Create the runtime
        runtime = nvinfer1::createInferRuntime(logger);
        if (!runtime)
            throw std::runtime_error("Failed to create TensorRT runtime.");

        // 3. Deserialize the engine
        engine = runtime->deserializeCudaEngine(engineData.data(), engineData.size());
        if (!engine)
            throw std::runtime_error("Failed to deserialize engine.");

        // 4. Create the execution context
        context = engine->createExecutionContext();
        if (!context)
            throw std::runtime_error("Failed to create execution context.");

        // 5. Create the CUDA stream
        cudaError_t err = cudaStreamCreate(&stream);
        if (err != cudaSuccess)
            throw std::runtime_error("Failed to create CUDA stream: " + std::string(cudaGetErrorString(err)));

        // 6. Allocate GPU memory
        err = cudaMallocAsync(&inputDevice, inputSize * sizeof(float), stream);
        if (err != cudaSuccess)
            throw std::runtime_error("Failed to allocate input device memory.");
        err = cudaMallocAsync(&outputDevice, outputSize * sizeof(float), stream);
        if (err != cudaSuccess)
            throw std::runtime_error("Failed to allocate output device memory.");

        // 7. Set the tensor addresses
        if (!context->setTensorAddress(inputName, inputDevice) ||
            !context->setTensorAddress(outputName, outputDevice))
            throw std::runtime_error("Failed to set tensor addresses.");

    }
    catch (const std::exception& ex)
    {
        std::cerr << "Yolo constructor failed: " << ex.what() << std::endl;
        cleanup(); // release any resources already allocated
        throw;     // rethrow the exception
    }
}



Yolo::~Yolo()
{
    cleanup();
}

bool Yolo::infer(const std::vector<float>& inputHost, std::vector<float>& outputHost) const
{
    // Check the input size (guards against out-of-bounds access)
    if (inputHost.size() != inputSize) {
        std::cerr << "[infer] Input size mismatch: expected " << inputSize << ", got " << inputHost.size() << std::endl;
        return false;
    }

    // 1. Copy the input data to the GPU
    cudaError_t err = cudaMemcpyAsync(inputDevice, inputHost.data(), inputSize * sizeof(float),
                                      cudaMemcpyHostToDevice, stream);
    if (err != cudaSuccess) {
        std::cerr << "[infer] cudaMemcpyAsync (input) failed: " << cudaGetErrorString(err) << std::endl;
        return false;
    }

    // 2. Run inference
    if (!context || !context->enqueueV3(stream)) {
        std::cerr << "[infer] enqueueV3 failed!" << std::endl;
        return false;
    }

    // 3. Synchronize the stream
    err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) {
        std::cerr << "[infer] cudaStreamSynchronize failed: " << cudaGetErrorString(err) << std::endl;
        return false;
    }

    // 4. Copy the output data to the host
    outputHost.resize(outputSize);  // no-op when the size already matches
    err = cudaMemcpy(outputHost.data(), outputDevice, outputSize * sizeof(float),
                     cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::cerr << "[infer] cudaMemcpy (output) failed: " << cudaGetErrorString(err) << std::endl;
        return false;
    }

    return true;
}


void Yolo::cleanup() const
{
    if (inputDevice) cudaFree(inputDevice);
    if (outputDevice) cudaFree(outputDevice);
    if (stream) cudaStreamDestroy(stream);

    delete context;
    delete engine;
    delete runtime;
}

config.h

#ifndef CONFIG_H
#define CONFIG_H

inline constexpr int imgWidth = 736;
inline constexpr int imgHeight = 1280;
inline constexpr int inputSize = 1 * 3 * imgHeight * imgWidth;
inline constexpr int outputSize = 1 * 6 * 78200;
inline constexpr const char* inputName = "images";
inline constexpr const char* outputName = "output0";

#endif //CONFIG_H
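
As a sanity check on `outputSize`, the candidate count can be derived from the input resolution. The sketch below assumes one candidate per grid cell and detection strides of 4, 8, 16 and 32; the post does not state the head configuration, so the stride set is an assumption chosen because it reproduces 78200 for a 1280 x 736 input. `countCandidates` is a hypothetical helper.

```cpp
#include <cassert>
#include <vector>

// Number of candidates emitted by an anchor-free detection head:
// one per grid cell, summed over every detection scale.
long countCandidates(int width, int height, const std::vector<int>& strides)
{
    long total = 0;
    for (int s : strides) {
        // Each scale divides the input into (width / s) x (height / s) cells.
        assert(s > 0 && width % s == 0 && height % s == 0);
        total += static_cast<long>(width / s) * (height / s);
    }
    return total;
}
```

For comparison, the same input with only strides 8, 16 and 32 would give 19320 candidates, so a mismatch between this figure and the engine's real output shape is one way to detect a wrong `outputSize`.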

Tool.cpp

//
// Created by JIN on 25-6-3.
//

#include "Tools.h"
#include <cstring>
#include <fstream>
#include <stdexcept>
#include "config.h"

// Read the engine file into an in-memory buffer
std::vector<char> loadEngineFile(const std::string& filepath)
{
    std::ifstream file(filepath, std::ios::binary | std::ios::ate);
    if (!file)
        throw std::runtime_error("Failed to open engine file");

    std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);

    std::vector<char> buffer(size);
    if (!file.read(buffer.data(), size))
        throw std::runtime_error("Failed to read engine file");

    return buffer;
}

std::vector<float> preprocessImage(const cv::Mat& frame)
{
    if (frame.empty()) {
        throw std::runtime_error("Input frame is empty.");
    }

    // 1. Resize to model input size (1280x736)
    cv::Mat resized;
    cv::resize(frame, resized, cv::Size(imgWidth, imgHeight));

    // 2. Convert BGR to RGB
    cv::Mat rgb;
    cv::cvtColor(resized, rgb, cv::COLOR_BGR2RGB);

    // 3. Convert to float and normalize to [0,1]
    rgb.convertTo(rgb, CV_32FC3, 1.0 / 255.0);

    // 4. Split channels (to get HWC -> CHW layout)
    std::vector<cv::Mat> channels(3);
    cv::split(rgb, channels);

    int imageSize = imgWidth * imgHeight;
    std::vector<float> tensor(3 * imageSize);  // CHW format

    // 5. Copy each channel data to tensor buffer
    for (int i = 0; i < 3; ++i) {
        memcpy(tensor.data() + i * imageSize, channels[i].ptr<float>(), imageSize * sizeof(float));
    }

    return tensor;
}
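
The HWC to CHW packing above can also be written out in plain C++, which makes the index arithmetic explicit and the height/width order easy to check. This matters here because the ONNX Runtime reference in this post feeds a `[1, 3, 736, 1280]` tensor (height 736, width 1280), while `config.h` sets `imgWidth = 736` and `imgHeight = 1280`; whether those values match the engine's input shape is worth verifying. The function name `hwcBgrToChwRgb` is hypothetical, and the sketch assumes a tightly packed 8-bit BGR image.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert an interleaved 8-bit BGR image (HWC) into a planar float RGB
// tensor (CHW) scaled to [0, 1].
std::vector<float> hwcBgrToChwRgb(const std::vector<std::uint8_t>& bgr, int height, int width)
{
    const std::size_t plane = static_cast<std::size_t>(height) * width;
    assert(bgr.size() == 3 * plane);

    std::vector<float> chw(3 * plane);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const std::size_t pixel = static_cast<std::size_t>(y) * width + x;
            for (int c = 0; c < 3; ++c) {
                // Source order is B, G, R; destination planes are R, G, B.
                chw[(2 - c) * plane + pixel] = bgr[3 * pixel + c] / 255.0f;
            }
        }
    }
    return chw;
}
```

Note that the row stride is `width`, so passing the two dimensions in the wrong order produces a buffer of the correct size but with scrambled pixel positions, which no size check will catch.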

I sincerely hope someone can help me. Thank you very much.

Additional note:
With the Python version of ONNX Runtime, this model runs normally.

This is the relevant code:

import os
import queue
import sys
import time
from datetime import datetime
import onnxruntime as ort
import cv2
import numpy as np
import threading
import serial

onnx_file = r"/home/nvidia/Desktop/HB_3.13.onnx"
result_file = r"/home/nvidia/Desktop/result/"
class_names = ['bird', 'airplane']

cap_source = -1
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']


class Yolov8_ORT_MUL_Thread:
    def __init__(self):
        ## Private variables #######################################################
        self._p_th = 0.45
        self.p_th = 0.45
        self.iou_th = 0.5
        self.area_th = 150
        self.imgsz = [1280, 736]
        self.lock = threading.Lock()
        self.session_lock = threading.Lock()  # newly added session lock

        ## Peripherals #######################################################
        self.EXIT_FLAG = False
        self.predict_boxes = queue.Queue(maxsize=5)

        self.cap = None
        self.ort_session = None

        self.target_flag = 0

    def run(self):
        self.__check_dir()
        self.__CapInit()
        self.__OrtSessionInit()
        try:
            self.ser_1 = serial.Serial('/dev/ttyUSB0', 9600, timeout=1)
            self.ser_0 = serial.Serial('/dev/ttyUSB1', 9600, timeout=1)
        except serial.SerialException as e:
            print(f"Failed to open serial port: {e}")
            return

        self.__warm_up()

        cv2.createTrackbar('Area', 'result', self.area_th, 1000, self.__On_Area_Trackbar)

        serial_listen_thread = threading.Thread(target=self.__serial_listen, daemon=True)
        pre_process_thread_1 = threading.Thread(target=self.__Prediction, daemon=True)
        pre_process_thread_2 = threading.Thread(target=self.__Prediction, daemon=True)

        serial_listen_thread.start()
        pre_process_thread_1.start()
        pre_process_thread_2.start()

        _times = 0
        start_time = time.time()
        fps = 0
        while not self.EXIT_FLAG:
            try:
                if not self.predict_boxes.empty():
                    frame, prediction = self.predict_boxes.get()
                    boxes = self.__NMS(prediction)
                    _frame = self.__draw_boxes(frame, boxes)

                    # if boxes is not None and self.target_flag:
                    #     self.ser.write(b'\xAA\x55\x15\x55\xAA')
                    #     print("Serial data sent")

                    _times += 1
                    if _times == 5:
                        _times = 0
                        end_time = time.time()
                        execution_time = end_time - start_time
                        fps = round(5.0/execution_time, 2)
                        start_time = time.time()

                    cv2.putText(_frame, f"FPS: {fps}", (30, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2)
                    cv2.imshow('result', _frame)
                    if cv2.waitKey(1) & 0xFF == 27:
                        self.EXIT_FLAG = True
            except Exception as e:
                print(f"Error in __PostProcess: {e}")

        serial_listen_thread.join()
        pre_process_thread_1.join()
        pre_process_thread_2.join()

        self.cap.release()
        if self.ser_0 and self.ser_0.is_open:
            self.ser_0.close()
        if self.ser_1 and self.ser_1.is_open:
            self.ser_1.close()
        cv2.destroyAllWindows()

    def __CapInit(self):
        self.cap = cv2.VideoCapture(cap_source)
        if not self.cap.isOpened():
            print("Error: Could not open capture.")
            exit()
        self.cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter.fourcc('M', 'J', 'P', 'G'))
        self.cap.set(cv2.CAP_PROP_FPS, 120)
        self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
        self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 736)
        print("Capture init successful!")
        cv2.namedWindow("result", 0)
        cv2.resizeWindow("result", 400, 300)


    def __serial_listen(self):
        while not self.EXIT_FLAG:
            try:
                if self.ser_1.in_waiting > 0:
                    data = self.ser_1.read(1)
                    if data == b'\xFF':
                        self.target_flag = 1
                        print("Received FF, flag set to 1")
                    elif data == b'\x00':
                        self.target_flag = 0
                        print("Received 00, flag set to 0")
            except Exception as e:
                print(f"Serial listener error: {e}")

    def __OrtSessionInit(self):
        self.ort_session = ort.InferenceSession(onnx_file, providers=providers)

    def __Prediction(self):
        while not self.EXIT_FLAG:
            try:
                with self.lock:
                    ret, frame = self.cap.read()
                if ret:
                    if not self.predict_boxes.full():
                        frame, _frame = self.__frame_resize(frame)
                        with self.session_lock:  # guard inference with the session lock
                            prediction = self.ort_session.run(None, {'images': _frame})[0]
                        self.predict_boxes.put([frame, prediction])
                else:
                    self.EXIT_FLAG = True
            except Exception as e:
                print(f"Error in __PreProcess: {e}")


    ## Function ############################################################################################################
    def __warm_up(self):
        dummy_input = np.random.random([1, 3, 736, 1280]).astype(np.float32)
        for i in range(5):
            try:
                # Run one inference pass with dummy_input
                self.ort_session.run(None, {'images': dummy_input})
            except Exception as e:
                print(f"Error during warm-up: {e}")
                sys.exit()
        print("ONNX session warm-up complete.")

    def __frame_resize(self, frame):
        frame_resize = cv2.resize(frame, self.imgsz)
        _frame_resize = frame_resize[:, :, ::-1].transpose(2, 0, 1).astype(np.float32)  # BGR2RGB and HWC2CHW
        _frame_resize /= 255.0
        _frame_resize = np.expand_dims(_frame_resize, axis=0)
        _frame_resize = np.ascontiguousarray(_frame_resize)
        return frame_resize, _frame_resize

    def __NMS(self, prediction):
        ### 1. Pre-process the predictions ##########################################################
        # 3.1 [1, 4 + nc, N] -> [N, 4 + 1 + nc] (for the stock COCO model: [1, 84, 8400] -> [8400, 85])
        _prediction = prediction[0]
        _prediction = np.transpose(_prediction, (1, 0))
        _max = np.max(_prediction[:, 4:], axis=-1)
        _prediction = np.insert(_prediction, 4, _max, axis=-1)

        # 3.2 Drop low-confidence detections
        mask = _prediction[:, 4] > self._p_th
        _prediction = _prediction[mask]
        if len(_prediction) == 0:
            return None

        # 3.3 Convert x, y, w, h to x1, y1, x2, y2
        xyxy = np.zeros_like(_prediction[:, :4])
        xyxy[:, 0] = _prediction[:, 0] - _prediction[:, 2] / 2
        xyxy[:, 1] = _prediction[:, 1] - _prediction[:, 3] / 2
        xyxy[:, 2] = _prediction[:, 0] + _prediction[:, 2] / 2
        xyxy[:, 3] = _prediction[:, 1] + _prediction[:, 3] / 2
        _prediction[:, :4] = xyxy

        # 3.4 [N, 4 + max score + nc] -> [N, 4 + max score + class index of the max score]
        max_index = np.expand_dims(np.argmax(_prediction[:, 5:], -1), -1)
        _prediction = np.concatenate([_prediction[:, :5], max_index], -1)


        ### 2. NMS ################################################################
        # 1. Collect the unique classes
        unique_class = np.unique(_prediction[:, -1])
        if len(unique_class) == 0:
            return None

        # 2. Run NMS for each class
        class_boxes = []
        all_class_boxes = []
        for c in unique_class:
            # 1. Select the detections of one class
            cls_mask = _prediction[:, -1] == c
            prediction_cls = _prediction[cls_mask]

            # 2. Sort by score
            cls_scores = prediction_cls[:, 4]
            arg_sort = np.argsort(cls_scores)[::-1]
            prediction_cls_sort = prediction_cls[arg_sort]

            # 3. NMS
            while len(prediction_cls_sort) != 0:
                class_boxes.append(prediction_cls_sort[0])
                if len(prediction_cls_sort) == 1:
                    break
                class_iou = self.__iou_calc(class_boxes[-1], prediction_cls_sort[1:])
                prediction_cls_sort = prediction_cls_sort[1:][class_iou < self.iou_th]

        all_class_boxes.append(class_boxes)

        all_class_boxes = np.array(all_class_boxes)
        all_class_boxes = np.squeeze(all_class_boxes, axis=0)
        return all_class_boxes

    def __draw_boxes(self, frame, boxes):
        if boxes is None:
            return frame

        img_dataset = frame.copy()
        img_display = frame.copy()

        orig_img_path, dataset_img_path, display_img_path = self.__get_save_path()

        # Draw
        save_display_flag = False
        for box in boxes:
            x1, y1, x2, y2 = box[:4].astype(np.int32)
            p = np.round(float(box[4]), 2)
            class_id = box[5].astype(np.int32)
            if class_id == 0:
                area = (x2-x1)*(y2-y1)
                label = f"{class_names[class_id]}:{p} {area}"  # look up the class name

                cv2.rectangle(img_dataset, (x1, y1), (x2, y2), (255, 0, 0), 2)
                cv2.putText(img_dataset, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)

                if p > self.p_th and area > self.area_th:
                    save_display_flag = True
                    cv2.rectangle(img_display, (x1, y1), (x2, y2), (255, 0, 0), 2)
                    cv2.putText(img_display, label, (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)

        # Save
        if self.target_flag:
            cv2.imwrite(orig_img_path, frame)
            cv2.imwrite(dataset_img_path, img_dataset)

            if save_display_flag:
                print(f"{self.__get_time()}: get target")
                self.ser_0.write(b'\xAA\x55\x15\x55\xAA')
                print("Serial data sent")
                cv2.imwrite(display_img_path, img_display)
        return img_display

    ### Utility functions ################################################################
    def __check_dir(self):
        if not os.path.exists(result_file + 'orig'):
            os.makedirs(result_file + 'orig')
        if not os.path.exists(result_file + 'display'):
            os.makedirs(result_file + 'display')
        if not os.path.exists(result_file + 'dataset'):
            os.makedirs(result_file + 'dataset')

    def __get_time(self):
        current_time = datetime.now()
        # time_string = current_time.strftime("%Y-%m-%d_%H:%M:%S")
        time_string = current_time.strftime("%Y-%m-%d_%H-%M-%S-%f")[:-3]
        return time_string


    def __get_save_path(self):
        time_string = self.__get_time()
        orig_img_path = f"{result_file}orig/{time_string}.jpg"
        display_img_path = f"{result_file}display/{time_string}.jpg"
        dataset_img_path = f"{result_file}dataset/{time_string}.jpg"
        return orig_img_path, dataset_img_path, display_img_path


    def __iou_calc(self, b1, b2):
        # Extract coordinates
        b1_x1, b1_y1, b1_x2, b1_y2 = b1[0], b1[1], b1[2], b1[3]
        b2_x1, b2_y1, b2_x2, b2_y2 = b2[:, 0], b2[:, 1], b2[:, 2], b2[:, 3]
        # Top-left and bottom-right corners of the intersection
        inter_x1 = np.maximum(b1_x1, b2_x1)
        inter_y1 = np.maximum(b1_y1, b2_y1)
        inter_x2 = np.minimum(b1_x2, b2_x2)
        inter_y2 = np.minimum(b1_y2, b2_y2)
        # Intersection area
        inter_area = np.maximum(0, inter_x2 - inter_x1) * np.maximum(0, inter_y2 - inter_y1)
        # Area of each bounding box
        b1_area = (b1_x2 - b1_x1) * (b1_y2 - b1_y1)
        b2_area = (b2_x2 - b2_x1) * (b2_y2 - b2_y1)
        # Compute IoU
        iou = inter_area / (b1_area + b2_area - inter_area + 1e-6)
        return iou

    def __On_Area_Trackbar(self, area):
        self.area_th = area



if __name__ == "__main__":
    yolo = Yolov8_ORT_MUL_Thread()
    yolo.run()

I’d be happy to help you troubleshoot the issue with your TensorRT inference.

Based on the information provided, here are some potential steps to help identify the cause of the problem:

  1. Verify the ONNX model: Before converting the ONNX model to TensorRT, confirm that it runs correctly, for example in ONNX Runtime. If the graph itself needs changes, a tool such as ONNX-GraphSurgeon can be used to modify it.
  2. Check the TensorRT conversion process: Review the process of converting the ONNX model to TensorRT. Make sure that the conversion is done correctly, and the resulting TensorRT engine is properly configured.
  3. Inspect the TensorRT engine: Use tools like Nsight Deep Learning Designer to inspect the TensorRT engine and verify that it matches the expected architecture.
  4. Profile the inference: Use NVIDIA Nsight Systems to profile the inference process and identify any performance bottlenecks or issues.
  5. Compare with a baseline: If possible, compare the output of your TensorRT inference with a baseline output from the original ONNX model or another inference engine. This can help identify if the issue is specific to TensorRT or the conversion process.
  6. Check the environment: Ensure that the environment is correctly set up, including the Jetson Xavier NX, CUDA, and cuDNN versions.
  7. Review the code: Carefully review the main program code to ensure that there are no errors or issues that could be causing the unexpected output.
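
For the baseline comparison in step 5, a minimal sketch is shown here. It assumes both runs received the identical preprocessed input and that both output buffers use the same memory layout; `maxAbsDiff` is a hypothetical helper, and how the ONNX Runtime output reaches the C++ program (for example, a raw float file dumped from Python) is left open.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest element-wise absolute difference between two equally sized buffers.
float maxAbsDiff(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

A small difference points to the post-processing or the interpretation of the output, while a large one points to preprocessing or the engine itself. What counts as small depends on the precision the engine was built with.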

Some specific questions to help further troubleshoot the issue:

  • Can you provide more details about the unexpected output? Is it a specific value or a pattern that is incorrect?
  • Have you tried running the inference on a different platform or environment to see if the issue is specific to the Jetson Xavier NX?
  • Are there any error messages or warnings during the conversion or inference process?
  • Have you tried using a different version of TensorRT or the ONNX model to see if the issue persists?

By following these steps and providing more information, we can work together to identify the cause of the issue and find a solution.