TensorRT engine giving wrong/different output in DeepStream

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) -> dGPU T4
• DeepStream Version -> 5.0
• TensorRT Version -> 7.1 (probably)
• NVIDIA GPU Driver Version (valid for GPU only) -> 440.82

archs => 7.5 [deviceQuery]

I built the yolov5s engine in the DS docker container using tensorrtx/yolov5, which gives the output below for the number of boxes detected (inference on this image):

Detected before NMS: 80
Detected after NMS : 4

The same engine file in DeepStream gives conflicting and completely wrong results, as read from the NvDsInferLayerInfo buffer:

Detected before NMS: 8632

Below is the config file:

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
workspace-size=1024
model-engine-file=./models/yolov5s.engine
labelfile-path=./labels.txt
force-implicit-batch-dim=1
batch-size=1
process-mode=1
model-color-format=1
network-mode=2
num-detected-classes=80
interval=0
gie-unique-id=1
is-classifier=0
output-blob-names=prob
parse-bbox-func-name=NvDsInferParseCustomFD
custom-lib-path=./nvdsinfer_custom_parser/libyoloplugin.so

[class-attrs-all]
pre-cluster-threshold=0.3
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0

And this is how I build the plugin library for the yololayer and the custom bbox parsing (NMS):

rm -f libyoloplugin.so nvdsinfer_custombboxparser_fd.o yololayer.o

/usr/local/cuda/bin/nvcc -ccbin g++ --compiler-options '-fPIC' -I./ -I/opt/nvidia/deepstream/deepstream-5.0/sources/includes/ -m64    --std=c++11 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o nvdsinfer_custombboxparser_fd.o -c nvdsinfer_custombboxparser_fd.cpp

/usr/local/cuda/bin/nvcc -ccbin g++ --compiler-options '-fPIC' -I./  -m64    --std=c++11 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o yololayer.o -dc yololayer.cu

/usr/local/cuda/bin/nvcc -ccbin g++ -m64 --shared -arch=sm_75 \
-o libyoloplugin.so yololayer.o nvdsinfer_custombboxparser_fd.o  \
-L/usr/local/cuda/lib64/stubs -L/usr/lib/x86_64-linux-gnu/ -lnvinfer -lnvinfer_plugin  -lcuda 

I don't really understand where I am going wrong. Any help would be great. Or is it that DeepStream doesn't support the yolov5 pipeline? Thanks!!

Hi,

You should be able to use a pregenerated TRT engine file in DeepStream. If the model works with TRT, then you should be able to use it in DeepStream and get the same outputs. A couple of questions:

  1. Have you updated the bounding box parser code in the custom lib to match your network’s output buffer format?
  2. Are you doing the NMS clustering in the custom parsing lib? An NMS clustering method has been added to the nvinfer plugin, so you don't need to implement it in the custom lib. You can enable NMS using the “cluster-mode=2” config param and set the threshold using “nms-iou-threshold”.

@CJR 1. Yes, I have updated the bounding box parser code. But that code is only used after fetching the outputs from the NvDsInferLayerInfo buffer, which is giving the wrong number of outputs! So I haven't been able to test whether the bbox parser code works or not.
2. I’ll try cluster-mode=2, but would it matter, since the output buffer itself holds wrong outputs?

  1. You will get the wrong number of outputs if the output parser is not correct, so I would suggest double-checking the logic there. Keep in mind that each call to your bbox parsing function receives the buffer for a single frame of the batch, not the buffer for the entire batch, in case you were not already aware. Also double-check the preprocessing config params. Do they match your other setup?

  2. You need to decouple a few things here. Either do the clustering in your custom bbox parser and set “cluster-mode=4” (“NONE”) so that no further clustering is performed; otherwise the plugin will run another round of clustering using the default, OpenCV’s groupRectangles algorithm. Or skip clustering in the bbox parsing library entirely and let the nvinfer plugin handle everything.
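
On the preprocessing point: nvinfer normalizes each pixel as y = net-scale-factor * (x - mean), so the net-scale-factor=0.0039215697906911373 in the config above is 1/255 and maps 8-bit pixel values to roughly [0, 1]. Below is a minimal C++ sketch of that formula, for comparing against whatever scaling the standalone TensorRT code applies. The function name normalizePixel is made up for illustration, and a mean of 0 is assumed because the config sets no offsets.

```cpp
#include <cassert>
#include <cmath>

// Sketch of the per-pixel normalization nvinfer applies:
//   y = net-scale-factor * (x - mean)
// ASSUMPTION: mean defaults to 0 because the config above sets no offsets.
inline float normalizePixel(float x, float netScaleFactor, float mean = 0.0f) {
    return netScaleFactor * (x - mean);
}
```

If the standalone code divides by 255 as well, the scaling matches; resize behaviour and channel order are separate things to compare.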


@CJR I also checked the logic in the output parser. It is exactly the same as what I use for inference in TensorRT, and the output in TRT with that parsing is perfect.

I have double-checked my box parsing function, which mainly does NMS, but the outputs fetched from the buffer seem to be incorrect. I did not use any parser to create the TRT engine; I used the tensorrtx/yolov5 repo. Or could you give me a direction to move in?
@CJR Below is the only config file in the DeepStream app for the yolov5 detector (with cluster-mode=4 added), which did not make any difference. In the bbox parser I am using exactly the same method to parse the output of the TRT engine as I used for TensorRT inference:

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
workspace-size=1024
model-engine-file=./models/yolov5s.engine
labelfile-path=./labels.txt
batch-size=1
process-mode=1
model-color-format=1
network-mode=2
num-detected-classes=80
interval=0
gie-unique-id=1
is-classifier=0
cluster-mode=4
output-blob-names=prob
parse-bbox-func-name=NvDsInferParseCustomFD
custom-lib-path=./nvdsinfer_custom_parser/libyoloplugin.so

[class-attrs-all]
pre-cluster-threshold=0.3
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0

This is the parsing function. For NMS to work, the buffer must hold correct values, and I don't think it currently does, as the number of bboxes detected is way too high:

#include <algorithm>
#include <cassert>
#include <cstring>
#include <iostream>
#include <map>
#include <vector>
#include "nvdsinfer_custom_impl.h"

#define NMS_THRESH 0.5
#define CONF_THRESH 0.4


static constexpr int LOCATIONS = 4;
struct alignas(float) Detection{
    //center_x center_y w h
    float bbox[LOCATIONS];
    float conf;  // bbox_conf * cls_conf
    float class_id;
};

// stuff we know about the network and the input/output blobs
static const int INPUT_H = 608;
static const int INPUT_W = 608;
static const int MAX_OUTPUT_BBOX_COUNT = 1000;
static const int OUTPUT_SIZE = MAX_OUTPUT_BBOX_COUNT * sizeof(Detection) / sizeof(float) + 1;  // we assume the yololayer outputs no more than 1000 boxes that conf >= 0.1

/* C-linkage to prevent name-mangling */

extern "C"
bool NvDsInferParseCustomFD (
         std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
         NvDsInferNetworkInfo  const &networkInfo,
         NvDsInferParseDetectionParams const &detectionParams,
         std::vector<NvDsInferObjectDetectionInfo> &objectList);


/* Internal helpers (no C linkage needed) */


float iou(float lbox[4], float rbox[4]) {
    float interBox[] = {
        std::max(lbox[0] - lbox[2]/2.f , rbox[0] - rbox[2]/2.f), //left
        std::min(lbox[0] + lbox[2]/2.f , rbox[0] + rbox[2]/2.f), //right
        std::max(lbox[1] - lbox[3]/2.f , rbox[1] - rbox[3]/2.f), //top
        std::min(lbox[1] + lbox[3]/2.f , rbox[1] + rbox[3]/2.f), //bottom
    };

    if(interBox[2] > interBox[3] || interBox[0] > interBox[1])
        return 0.0f;

    float interBoxS =(interBox[1]-interBox[0])*(interBox[3]-interBox[2]);
    return interBoxS/(lbox[2]*lbox[3] + rbox[2]*rbox[3] -interBoxS);
}

bool cmp(const Detection& a, const Detection& b) {
    return a.conf > b.conf;
}

// Per-class NMS over the raw layer output: output[0] holds the box count,
// followed by that many Detection records.
void nms(std::vector<Detection>& res, float *output, float conf_thresh, float nms_thresh = 0.5) {
    int det_size = sizeof(Detection) / sizeof(float);
    std::cout << "detected before nms -> " << output[0] << std::endl;
    std::map<float, std::vector<Detection>> m;
    // Never read past the number of boxes the output layer can hold.
    for (int i = 0; i < output[0] && i < MAX_OUTPUT_BBOX_COUNT; i++) {
        if (output[1 + det_size * i + 4] <= conf_thresh) continue;
        Detection det;
        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));
        if (m.count(det.class_id) == 0) m.emplace(det.class_id, std::vector<Detection>());
        m[det.class_id].push_back(det);
    }
    for (auto it = m.begin(); it != m.end(); it++) {
        //std::cout << it->second[0].class_id << " --- " << std::endl;
        auto& dets = it->second;
        std::sort(dets.begin(), dets.end(), cmp);
        // Keep the highest-confidence box, drop lower-ranked boxes that overlap it.
        for (size_t j = 0; j < dets.size(); ++j) {
            auto& item = dets[j];
            res.push_back(item);
            for (size_t n = j + 1; n < dets.size(); ++n) {
                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {
                    dets.erase(dets.begin()+n);
                    --n;
                }
            }
        }
    }
}


bool NvDsInferParseCustomFD (std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
                                   NvDsInferNetworkInfo  const &networkInfo,
                                   NvDsInferParseDetectionParams const &detectionParams,
                                   std::vector<NvDsInferObjectDetectionInfo> &objectList) {
    static int decodeIndex = -1;

    /* Find the decode layer */
    if (decodeIndex == -1) {
        for (unsigned int i = 0; i < outputLayersInfo.size(); i++) {
            if (strcmp(outputLayersInfo[i].layerName, "prob") == 0) {
                decodeIndex = i;
                std::cout << "Found decode layer buffer while parsing" << decodeIndex << std::endl;
                break;
            }
            std::cout << outputLayersInfo[i].layerName << " " << std::endl;
        }
        if (decodeIndex == -1) {
            std::cerr << "Could not find decode layer buffer while parsing" << std::endl;
            return false;
        }
    }

    // Host memory for "decode"
    float* out_decode = (float *) outputLayersInfo[decodeIndex].buffer;
    
    const int batch_id = 0;
    const int out_class_size = detectionParams.numClassesConfigured;
    const float threshold = detectionParams.perClassThreshold[0];
    std::cout<<"out_class_size: "<< out_class_size << std::endl;
    std::cout<<"threshold: "<< threshold << std::endl;

    std::vector<Detection> res;
    nms(res, &out_decode[0], CONF_THRESH, NMS_THRESH);
    
    std::cout << "after nms -> " << res.size() << std::endl;

    for (size_t j = 0; j < res.size(); j++){
        if (res[j].conf < 0.1) continue;
        // std::cout << "class -> " << res[j].class_id;
        // std::cout << " conf -> " << res[j].conf << std::endl;
        NvDsInferObjectDetectionInfo object;
        object.classId = res[j].class_id;
        object.detectionConfidence = res[j].conf;

        /* Convert center-format box to left/top/width/height (no clipping is applied here) */
        float left = res[j].bbox[0] - res[j].bbox[2]/2.f;
        float top = res[j].bbox[1] - res[j].bbox[3]/2.f;
            
        object.left = left;
        object.top = top;
        object.width = res[j].bbox[2];
        object.height = res[j].bbox[3];
        objectList.push_back(object);
    }
    return true;
}

/* Check that the custom function has been defined correctly */
CHECK_CUSTOM_PARSE_FUNC_PROTOTYPE(NvDsInferParseCustomFD);
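
The parser above converts center-format boxes to left/top/width/height without clipping them to the network resolution. If clipping is wanted, a sketch could look like the following. Box, clampTo and toClippedBox are illustrative names, not DeepStream API; in the real parser, networkInfo.width and networkInfo.height would be the natural source for netW and netH.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative result type (hypothetical, not a DeepStream struct).
struct Box { float left, top, width, height; };

// Clamp v into [lo, hi]; written by hand to stay C++11-compatible.
inline float clampTo(float v, float lo, float hi) {
    return std::min(std::max(v, lo), hi);
}

// Convert a center-format box (cx, cy, w, h) to left/top/width/height,
// clipped to the rectangle [0, netW] x [0, netH].
inline Box toClippedBox(const float bbox[4], float netW, float netH) {
    const float x1 = clampTo(bbox[0] - bbox[2] / 2.f, 0.0f, netW);
    const float y1 = clampTo(bbox[1] - bbox[3] / 2.f, 0.0f, netH);
    const float x2 = clampTo(bbox[0] + bbox[2] / 2.f, 0.0f, netW);
    const float y2 = clampTo(bbox[1] + bbox[3] / 2.f, 0.0f, netH);
    return Box{x1, y1, x2 - x1, y2 - y1};
}
```

A box that lies entirely outside the network input collapses to zero width or height, so such boxes can be skipped before they are pushed to objectList.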

Below is the terminal output:

root@91fbcc5a3a74:/opt/nvidia/deepstream/deepstream-5.0/sources/apps/deepstream-yolov5-img# ./deepstream-custom -c yolo_pgie_config.txt -i samples/bus.jpg 
Now playing: yolo_pgie_config.txt
WARNING: ../nvdsinfer/nvdsinfer_func_utils.cpp:34 [TRT]: Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
0:00:02.430127300 14944 0x563e38873360 INFO                 nvinfer gstnvinfer.cpp:602:gst_nvinfer_logger:<primary-nvinference-engine> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:1577> [UID = 1]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-5.0/sources/apps/deepstream-yolov5-img/models/yolov5s.engine
INFO: ../nvdsinfer/nvdsinfer_model_builder.cpp:685 [Implicit Engine Info]: layers num: 2
0   INPUT  kFLOAT data            3x608x608       
1   OUTPUT kFLOAT prob            6001x1x1        

0:00:02.430234100 14944 0x563e38873360 INFO                 nvinfer gstnvinfer.cpp:602:gst_nvinfer_logger:<primary-nvinference-engine> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:1681> [UID = 1]: Use deserialized engine model: /opt/nvidia/deepstream/deepstream-5.0/sources/apps/deepstream-yolov5-img/models/yolov5s.engine
0:00:02.431288464 14944 0x563e38873360 INFO                 nvinfer gstnvinfer_impl.cpp:311:notifyLoadModelStatus:<primary-nvinference-engine> [UID 1]: Load new model:yolo_pgie_config.txt sucessfully
Running...
Found decode layer buffer while parsing0
out_class_size: 80
threshold: 0.3
detected before nms -> 8295
after nms -> 0
End of stream
Returned, stopping playback
Deleting pipeline

Trt inference:

detected before NMS => 84

Thanks. Let me know if you have any suggestions!!
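
A note on the log above: the prob layer is reported as 6001x1x1, which matches 1 count value plus 1000 detections of 6 floats each (the Detection struct). A count of 8295 cannot describe valid contents of that buffer, which seems to point at the buffer contents rather than at the NMS logic. As a hedged sketch, the parser could reject such counts before running NMS. validatedCount and kFloatsPerDet are hypothetical names introduced here, and the [count, det0, det1, ...] layout is assumed from the parsing code above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Floats per detection: 4 bbox values + conf + class_id (as in struct Detection).
static const std::size_t kFloatsPerDet = 6;

// Returns the reported detection count if it is a finite, non-negative value
// that fits into a buffer of numFloats floats laid out as
// [count, det0, det1, ...]; returns -1 otherwise.
inline int validatedCount(const float* out, std::size_t numFloats) {
    if (out == nullptr || numFloats < 1) return -1;
    const float reported = out[0];
    if (!std::isfinite(reported) || reported < 0.0f) return -1;
    const std::size_t capacity = (numFloats - 1) / kFloatsPerDet;
    if (reported > static_cast<float>(capacity)) return -1;
    return static_cast<int>(reported);
}
```

With the 6001-float layer, a count of 84 passes, while 8295 returns -1 and can be logged as an invalid buffer instead of being parsed.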

@CJR cluster-mode=2 doesn’t work on my end either. With cluster-mode=2, do I still need a parsing function?

@CJR

  1. I checked the preprocessing step; the setup is the same as in TRT inference. I also double-checked the parsing function, which is the same as in TRT inference.

  2. I tried cluster-mode=4, which did not make any difference to the wrong output.

Let me know if you have any pointers for me!!

There seems to be a bug in the open source code you are following. The kernels in the plugin layer should use the same incoming CUDA stream that the plugin receives in the call to “enqueue”. This ensures in-order execution of all the kernels in the entire network.

@CJR This is the change that I made in the yolov5 TensorRT kernel launch, referring to objectDetector_Yolo in the DS sample apps:

CalDetection<<< (yolo.width*yolo.height*batchSize + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>>
            (inputs[i], output, numElem, yolo.width, yolo.height, (float *)mAnchor[i], mClassCount, outputElem); 

But this did not make any difference in DeepStream!! I don't know if I did this the right way, but with this change I still get correct outputs in TensorRT.
Thanks

Can you also make the memset op here async and use the same CUDA stream for performing that operation?

@CJR Can you tell me more about this? I don’t actually know how to make the memset async. Is there some resource for it?

Replace that call to cudaMemset with cudaMemsetAsync. See API documentation here.

@CJR I made this change, which gave correct results in TensorRT but strange results in DeepStream:

for(int idx = 0 ; idx < batchSize; ++idx) {
        CUDA_CHECK(cudaMemsetAsync(output + idx*outputElem, 0, sizeof(float), stream));
    }

but in DeepStream the output buffer gives 0 boxes:

float* out_decode = (float *) outputLayersInfo[yoloLayerIndex].buffer;
std::cout << "detected before nms -> " << out_decode[0] << std::endl;

output:

detected before nms -> 0
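
A count of 0 on its own does not show whether the rest of the buffer holds any data. One low-cost check is to print the first few floats (the count plus the first detection) in both the standalone TensorRT run and the DeepStream parser and compare them. A minimal sketch follows; dumpHead is a hypothetical helper name, and the default of 7 floats assumes the 1 + 6 layout used by the parser above.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Format the first n floats of a buffer (at most numFloats) as a
// space-separated string for side-by-side comparison between two runs.
inline std::string dumpHead(const float* out, std::size_t numFloats, std::size_t n = 7) {
    std::ostringstream os;
    const std::size_t limit = (n < numFloats) ? n : numFloats;
    for (std::size_t i = 0; i < limit; ++i) {
        if (i != 0) os << ' ';
        os << out[i];
    }
    return os.str();
}
```

In the parser this could be called as dumpHead(out_decode, 6001) right after the buffer pointer is fetched, with the same call added to the standalone inference code.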

Does the TensorRT inference still work if you change this line to

CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

@CJR Yes! With the previous changes kept as they are, it works perfectly fine in TensorRT.

@CJR Hey, so I am trying to debug this by trial and error. If you have any suggestions, please let me know!! Thanks.