Get wrong infer results while testing yolov4 on deepstream 5.0

Envs

• Hardware Platform (Jetson / GPU) GeForce GTX 1070
• DeepStream Version 5.0
• JetPack Version (valid for Jetson only)
• TensorRT Version 7.0.0.11
• NVIDIA GPU Driver Version (valid for GPU only) 440.33.01

Problem Description

I followed the tensorrtx project to generate a yolov4 engine file. The engine runs well when tested with the tensorrtx yolov4 sample.

Then I added the engine file to DeepStream 5.0, following deepstream-app. I changed the config files and rewrote nvdsparsebbox_Yolo.cpp, nvdsinfer_yolo_engine.cpp, etc. I can get the inference results via std::vector<NvDsInferLayerInfo> const &outputLayersInfo, but they differ from the tensorrtx results and seem wrong.

Errors Print

I printed the results (before NMS) that I get from DeepStream, as follows:

...
x: 72 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 80 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 88 y: 272 w: inf h: inf det Confidence: 1 id: 3 class Confidence: 1
x: 96 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 136 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 144 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 152 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 160 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 200 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 208 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 216 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 224 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 264 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 272 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 280 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 288 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 104 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 112 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 120 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 128 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 168 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
x: 176 y: 272 w: inf h: inf det Confidence: 1 id: 1 class Confidence: 1
...
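
A quick check for non-finite values on the raw output, before any decoding or NMS, can show whether the inf values are already present in the engine output or are introduced by the parsing code. The following is only a minimal sketch: RawBox, isFiniteBox and countNonFinite are hypothetical names for illustration and are not DeepStream types or functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one printed detection; not a DeepStream type.
struct RawBox { float x, y, w, h, conf; };

// Returns true only if every field is a finite number (neither inf nor NaN).
static bool isFiniteBox(const RawBox& b)
{
    return std::isfinite(b.x) && std::isfinite(b.y) &&
           std::isfinite(b.w) && std::isfinite(b.h) &&
           std::isfinite(b.conf);
}

// Counts how many boxes contain at least one inf/NaN field.
static std::size_t countNonFinite(const std::vector<RawBox>& boxes)
{
    std::size_t bad = 0;
    for (const RawBox& b : boxes)
    {
        if (!isFiniteBox(b)) ++bad;
    }
    return bad;
}
```

If the count is already non-zero on the untouched output buffer, the problem is likely upstream of the bounding box parser (engine, plugin, or input data) rather than in the parser itself.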

Implements

I suspect that the inference pipeline in DeepStream produces the wrong results, because the tensorrtx inference results are right. This is the tensorrtx doInference function:

void doInference(IExecutionContext &context, float *input, float *output, int batchSize)
{
    const ICudaEngine &engine = context.getEngine();

    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void *buffers[2];

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);

    // Create GPU buffers on device
    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));

    // Create stream
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueue(batchSize, buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);

    // Release stream and buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}

Additional

I also referred to the thread "Iplugin tensorrt engine error for ds5.0 #5", but I can’t generate an engine file when I run $ sudo /usr/local/TensorRT-7.0.0.11/bin/trtexec --onnx=yolov4_4_3_608_608.onnx --workspace=4096 --saveEngine=yolov4.engine --fp16 --explicitBatch. The errors are:

(yolov4) dreamdeck@mjj:~/Documents/code/test/yolov4/pytorch-YOLOv4$ sudo /usr/local/TensorRT-7.0.0.11/bin/trtexec --onnx=yolov4_4_3_608_608.onnx --workspace=4096 --saveEngine=yolov4.engine --fp16 --explicitBatch
&&&& RUNNING TensorRT.trtexec # /usr/local/TensorRT-7.0.0.11/bin/trtexec --onnx=yolov4_4_3_608_608.onnx --workspace=4096 --saveEngine=yolov4.engine --fp16 --explicitBatch
[05/30/2020-18:16:23] [I] === Model Options ===
[05/30/2020-18:16:23] [I] Format: ONNX
[05/30/2020-18:16:23] [I] Model: yolov4_4_3_608_608.onnx
[05/30/2020-18:16:23] [I] Output:
[05/30/2020-18:16:23] [I] === Build Options ===
[05/30/2020-18:16:23] [I] Max batch: explicit
[05/30/2020-18:16:23] [I] Workspace: 4096 MB
[05/30/2020-18:16:23] [I] minTiming: 1
[05/30/2020-18:16:23] [I] avgTiming: 8
[05/30/2020-18:16:23] [I] Precision: FP16
[05/30/2020-18:16:23] [I] Calibration: 
[05/30/2020-18:16:23] [I] Safe mode: Disabled
[05/30/2020-18:16:23] [I] Save engine: yolov4.engine
[05/30/2020-18:16:23] [I] Load engine: 
[05/30/2020-18:16:23] [I] Inputs format: fp32:CHW
[05/30/2020-18:16:23] [I] Outputs format: fp32:CHW
[05/30/2020-18:16:23] [I] Input build shapes: model
[05/30/2020-18:16:23] [I] === System Options ===
[05/30/2020-18:16:23] [I] Device: 0
[05/30/2020-18:16:23] [I] DLACore: 
[05/30/2020-18:16:23] [I] Plugins:
[05/30/2020-18:16:23] [I] === Inference Options ===
[05/30/2020-18:16:23] [I] Batch: Explicit
[05/30/2020-18:16:23] [I] Iterations: 10
[05/30/2020-18:16:23] [I] Duration: 3s (+ 200ms warm up)
[05/30/2020-18:16:23] [I] Sleep time: 0ms
[05/30/2020-18:16:23] [I] Streams: 1
[05/30/2020-18:16:23] [I] ExposeDMA: Disabled
[05/30/2020-18:16:23] [I] Spin-wait: Disabled
[05/30/2020-18:16:23] [I] Multithreading: Disabled
[05/30/2020-18:16:23] [I] CUDA Graph: Disabled
[05/30/2020-18:16:23] [I] Skip inference: Disabled
[05/30/2020-18:16:23] [I] Inputs:
[05/30/2020-18:16:23] [I] === Reporting Options ===
[05/30/2020-18:16:23] [I] Verbose: Disabled
[05/30/2020-18:16:23] [I] Averages: 10 inferences
[05/30/2020-18:16:23] [I] Percentile: 99
[05/30/2020-18:16:23] [I] Dump output: Disabled
[05/30/2020-18:16:23] [I] Profile: Disabled
[05/30/2020-18:16:23] [I] Export timing to JSON file: 
[05/30/2020-18:16:23] [I] Export output to JSON file: 
[05/30/2020-18:16:23] [I] Export profile to JSON file: 
[05/30/2020-18:16:23] [I] 
----------------------------------------------------------------
Input filename:   yolov4_4_3_608_608.onnx
ONNX IR version:  0.0.6
Opset version:    11
Producer name:    pytorch
Producer version: 1.5
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/30/2020-18:16:24] [W] [TRT] Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
[05/30/2020-18:16:24] [W] [TRT] Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
[05/30/2020-18:16:24] [E] [TRT] Layer: (Unnamed Layer* 426)[Select]'s output can not be used as shape tensor.
[05/30/2020-18:16:24] [E] [TRT] Network validation failed.
[05/30/2020-18:16:24] [E] Engine creation failed
[05/30/2020-18:16:24] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec # /usr/local/TensorRT-7.0.0.11/bin/trtexec --onnx=yolov4_4_3_608_608.onnx --workspace=4096 --saveEngine=yolov4.engine --fp16 --explicitBatch

I have no idea how to solve it.

Thanks.

DeepStream issue

The DS YOLO pipeline was designed for YOLOv3.
There is no suitable DS pipeline for YOLOv4 yet.
The new pipeline for YOLOv4 is still under development.

YOLOv4 onnx file parsing issue

The ONNX exporter of PyTorch 1.5 seems to behave differently from earlier PyTorch versions when dealing with constant parameters for expand operations.
Try generating the ONNX file with PyTorch 1.4 or PyTorch 1.3.

Please see the compatible PyTorch versions in the TensorRT 7 release notes: https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-7.html

PyTorch and ONNX are evolving quickly and we are trying our best to catch up.
Let me know if TensorRT reports an error again.

Thanks. I will try to generate the YOLOv4 ONNX file with PyTorch 1.4.

But I don’t understand this:

There is no suitable DS pipeline for YOLOv4 yet.

I have already implemented the yolo layer and generated the engine file, and it runs well. Doesn't DeepStream just need to run the engine and do inference? Or does DS also do something else, such as preprocessing?

@jiejing_ma

Yes, preprocessing of images is included in DS.
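
As a rough illustration of why this matters: Gst-nvinfer documents its per-pixel normalization as y = net-scale-factor * (x - mean), so the values the engine sees in DS depend on the config file and may differ from what a standalone program feeds it. The sketch below only mirrors that formula; normalizePixel is a hypothetical helper, and the assumption that the standalone code scales pixels to [0, 1] is mine, not something confirmed in this thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Mirrors the documented Gst-nvinfer normalization for one pixel value:
//   y = net-scale-factor * (x - mean)
// netScaleFactor corresponds to the net-scale-factor config key and
// mean to the per-channel offset (0 when no offsets are configured).
static float normalizePixel(std::uint8_t x, float netScaleFactor, float mean)
{
    return netScaleFactor * (static_cast<float>(x) - mean);
}
```

With net-scale-factor set to 1/255 and no offsets, a pixel value of 255 maps to roughly 1.0. If the standalone code uses a different scale, channel order, or resize method, the same engine can produce different results in DS.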

We recommend relying on the ONNX standard: first convert models from other DL frameworks into ONNX, and then convert the ONNX model into a TensorRT engine.

Please pull the latest source from https://github.com/Tianxiaomo/pytorch-YOLOv4 and try to follow sections 2, 3, 4 and 5 of its README.

I am now looking into the DS pipeline to check the compatibility of post-processing.

Hey there. Any news on this? We would really like to try YOLO-4 with our DS application.
Benchmarks for YOLO-4 look impressive…

cheers,
Gaylord

@gaylord

The integration solution for YoloV4 and DS is now under development.
Manuals and a new code release will be available in the near future.

Hi ersheng,
any news about the integration of YoloV4 and DS? When will it be released?
Thanks

@ersheng So does this mean that yolov5 is also not working because of DeepStream compatibility? @CJR says the reason for the incorrect results is wrong execution of the CUDA kernels. Do you mind shedding some light on what the main issue is? Thanks

@gaylord @hymanzhu1983 @y14uc339 @jiejing_ma

The current Yolo implementation via CUDA kernel in DeepStream is based on old Yolo models (v2, v3), so it may not suit new Yolo models like YoloV4. Location: /opt/nvidia/deepstream/deepstream-5.0/sources/objectDetector_Yolo/nvdsinfer_custom_impl_Yolo/kernels.cu

We are trying to embed the Yolo layer into the TensorRT engine before deploying to DeepStream, so that the Yolo CUDA kernel in DeepStream is no longer used.

We have not officially released YoloV4 solutions for DeepStream yet, but you can try the following steps:

  1. Go to https://github.com/Tianxiaomo/pytorch-YOLOv4 and generate a TensorRT engine according to this workflow: DarkNet or Pytorch --> ONNX --> TensorRT.
  2. Add the C++ functions listed after these steps to objectDetector_Yolo/nvdsinfer_custom_impl_Yolo/nvdsparsebbox_Yolo.cpp and rebuild libnvdsinfer_custom_impl_Yolo.so.
  3. Here are configuration files for reference (you will have to adjust them a little to suit your environment):
    config_infer_primary_yoloV4.txt (3.4 KB)
    deepstream_app_config_yoloV4.txt (3.8 KB)

static NvDsInferParseObjectInfo convertBBoxYoloV4(const float& bx, const float& by, const float& bw,
                                     const float& bh, const uint& netW, const uint& netH)
{
    NvDsInferParseObjectInfo b;
    // Restore coordinates to network input resolution
    float xCenter = bx * netW;
    float yCenter = by * netH;

    float w = bw * netW;
    float h = bh * netH;

    float x0 = xCenter - w * 0.5;
    float y0 = yCenter - h * 0.5;
    float x1 = x0 + w;
    float y1 = y0 + h;

    x0 = clamp(x0, 0, netW);
    y0 = clamp(y0, 0, netH);
    x1 = clamp(x1, 0, netW);
    y1 = clamp(y1, 0, netH);

    b.left = x0;
    b.width = clamp(x1 - x0, 0, netW);
    b.top = y0;
    b.height = clamp(y1 - y0, 0, netH);

    return b;
}

static void addBBoxProposalYoloV4(const float bx, const float by, const float bw, const float bh,
                     const uint& netW, const uint& netH, const int maxIndex,
                     const float maxProb, std::vector<NvDsInferParseObjectInfo>& binfo)
{
    NvDsInferParseObjectInfo bbi = convertBBoxYoloV4(bx, by, bw, bh, netW, netH);
    if (bbi.width < 1 || bbi.height < 1) return;

    bbi.detectionConfidence = maxProb;
    bbi.classId = maxIndex;
    binfo.push_back(bbi);
}

static std::vector<NvDsInferParseObjectInfo>
decodeYoloV4Tensor(
    const float* detections, const uint num_bboxes,
    NvDsInferParseDetectionParams const& detectionParams,
    const uint& netW, const uint& netH)
{
    std::vector<NvDsInferParseObjectInfo> binfo;

    uint bbox_location = 0;
    for (uint b = 0; b < num_bboxes; ++b)
    {
        float bx = detections[bbox_location];
        float by = detections[bbox_location + 1];
        float bw = detections[bbox_location + 2];
        float bh = detections[bbox_location + 3];

        float maxProb = 0.0f;
        int maxIndex = -1;

        uint cls_location = bbox_location + 4;
        for (uint c = 0; c < detectionParams.numClassesConfigured; ++c)
        {
            float prob = detections[cls_location + c];
            if (prob > maxProb)
            {
                maxProb = prob;
                maxIndex = c;
            }
        }

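        // maxIndex stays -1 when no class score is positive; skip such rows
        // so the threshold array is never indexed with -1.
        if (maxIndex >= 0 && maxProb > detectionParams.perClassPreclusterThreshold[maxIndex])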
        {
            addBBoxProposalYoloV4(bx, by, bw, bh, netW, netH, maxIndex, maxProb, binfo);
        }

        bbox_location += 4 + detectionParams.numClassesConfigured;
    }

    return binfo;
}

static bool NvDsInferParseYoloV4(
    std::vector<NvDsInferLayerInfo> const& outputLayersInfo,
    NvDsInferNetworkInfo const& networkInfo,
    NvDsInferParseDetectionParams const& detectionParams,
    std::vector<NvDsInferParseObjectInfo>& objectList)
{
    if (NUM_CLASSES_YOLO != detectionParams.numClassesConfigured)
    {
        std::cerr << "WARNING: Num classes mismatch. Configured:"
                  << detectionParams.numClassesConfigured
                  << ", detected by network: " << NUM_CLASSES_YOLO << std::endl;
    }

    std::vector<NvDsInferParseObjectInfo> objects;

    const NvDsInferLayerInfo &layer = outputLayersInfo[0]; // num_boxes x (4 + num_classes)

    // 2 dimensional: [num_boxes, 4 + num_classes]
    assert(layer.inferDims.numDims == 2);
    // The second dimension should be 4 + num_classes
    assert(detectionParams.numClassesConfigured == layer.inferDims.d[1] - 4);

    uint num_bboxes = layer.inferDims.d[0];

    // std::cout << "Network Info: " << networkInfo.height << "  " << networkInfo.width << std::endl;

    std::vector<NvDsInferParseObjectInfo> outObjs =
        decodeYoloV4Tensor(
            (const float*)(layer.buffer), num_bboxes, detectionParams,
            networkInfo.width, networkInfo.height);

    objects.insert(objects.end(), outObjs.begin(), outObjs.end());

    objectList = objects;

    return true;
}
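
To make the expected output layout easier to check outside DeepStream, here is a minimal standalone sketch of the same decoding idea: each row holds 4 normalized box values (center x, center y, width, height) followed by one score per class. Box, clampf and decodeRows are hypothetical names, a single threshold replaces the per-class thresholds, and the sketch does not use any DeepStream types, so treat it as an illustration under those assumptions rather than a replacement for the functions above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical plain struct standing in for NvDsInferParseObjectInfo.
struct Box { float left, top, width, height, confidence; int classId; };

static float clampf(float v, float lo, float hi)
{
    return std::max(lo, std::min(v, hi));
}

// det holds rows of [cx, cy, w, h, score_0 .. score_{numClasses-1}],
// with box values normalized to [0, 1] relative to the network input.
static std::vector<Box> decodeRows(const std::vector<float>& det, unsigned numClasses,
                                   float threshold, float netW, float netH)
{
    assert(numClasses > 0);
    const std::size_t stride = 4 + numClasses;
    assert(det.size() % stride == 0);

    std::vector<Box> out;
    for (std::size_t off = 0; off + stride <= det.size(); off += stride)
    {
        // Pick the best-scoring class for this row.
        int best = -1;
        float bestProb = 0.0f;
        for (unsigned c = 0; c < numClasses; ++c)
        {
            const float p = det[off + 4 + c];
            if (p > bestProb) { bestProb = p; best = static_cast<int>(c); }
        }
        if (best < 0 || bestProb <= threshold) continue;

        // Scale to network resolution and convert center/size to corners.
        const float w = det[off + 2] * netW;
        const float h = det[off + 3] * netH;
        const float x0 = clampf(det[off] * netW - 0.5f * w, 0.0f, netW);
        const float y0 = clampf(det[off + 1] * netH - 0.5f * h, 0.0f, netH);
        const float x1 = clampf(det[off] * netW + 0.5f * w, 0.0f, netW);
        const float y1 = clampf(det[off + 1] * netH + 0.5f * h, 0.0f, netH);

        // Drop boxes smaller than one pixel after clamping.
        if (x1 - x0 < 1.0f || y1 - y0 < 1.0f) continue;
        out.push_back({x0, y0, x1 - x0, y1 - y0, bestProb, best});
    }
    return out;
}
```

Feeding a few hand-written rows through such a function is a cheap way to confirm that an engine's output really uses this [num_boxes, 4 + num_classes] layout before wiring it into the DS parser.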

@ersheng Thanks! But my question was mainly about compatibility with yolov5, which was released recently!

@y14uc339

YoloV5 may have similar problems too.
However, we have not thoroughly studied YoloV5 compatibility yet.
We may add YoloV5 to our agenda soon.

Hi @ersheng. DeepStream supports TensorRT, and we implemented a CUDA kernel for yolov5 which works fine in TensorRT. Why is that CUDA kernel not working in DeepStream when DS is using the same TRT? I mean, what exactly is causing the problem? Here @CJR says that it should work in DS. Any thoughts on this?
Thanks!!

@y14uc339

The highest Yolo version that the CUDA kernel in /opt/nvidia/deepstream/deepstream-5.0/sources/objectDetector_Yolo/nvdsinfer_custom_impl_Yolo/ can support is YoloV3.

We are trying to embed the Yolo layer into the TensorRT engine before deploying to DeepStream, so that the Yolo CUDA kernel in DeepStream is no longer used. You can have a look at my previous post here: YoloV4 Solution.

YoloV5 may have a similar problem, and we will work on it by applying the same solution. But you can also adapt this YoloV4 solution to solve your YoloV5 problem yourself.

@ersheng This might be a dumb question! I understand that the highest Yolo version the CUDA kernel in /opt/nvidia/deepstream/deepstream-5.0/sources/objectDetector_Yolo/nvdsinfer_custom_impl_Yolo/ can support is YoloV3. But I am not using that kernel to implement yolov5; I am using a different kernel. So are you saying that even a different CUDA kernel implementation that works for yolov5 in TRT would not work in DeepStream?

@y14uc339 @CJR

Sorry for the misunderstanding.
CJR is providing you with a solution for the YoloV5 from https://github.com/wang-xinyu/tensorrtx in this thread, and you can continue to follow it there.

However, I can suggest a different workflow:
Pytorch --> ONNX --> TRT
Converting to ONNX first is a more standardized way to handle YoloV5 from the official page: https://github.com/ultralytics/yolov5.

You can choose either way to solve your problem, and I hope they do not clash with each other.

@ersheng Thanks!

@ersheng I’ll try it both ways since @CJR is busy/unavailable currently. I’ll go with Pytorch -> Onnx -> TRT approach. It would be great if you can help out with the custom parsing functions and config files for smooth implementation of yolov5 in TRT!
Thanks

@ersheng Thanks a lot. I tried this way and it works!
But something still seems wrong with the results.

It also prints a warning:

WARNING: …/nvdsinfer/nvdsinfer_func_utils.cpp:34 [TRT]: Explicit batch network detected and batch size specified, use enqueue without batch size instead.


Implements

I changed the input size to width=320 height=512.

I generated the ONNX file from Darknet rather than PyTorch, and set batchsize=1 using this command:

python demo_darknet2onnx.py yolov4.cfg yolov4.weights ./data/dog.jpg 1

onnx2tensorrt

trtexec --onnx=yolov4_1_3_512_320.onnx --explicitBatch --saveEngine=yolov4_1_3_320_512_fp16.engine --workspace=4096 --fp16

Questions

When I set batchsize=4, it gives errors and quits. Does the batch size have to be 1 and the input size 320*512? Must I use the PyTorch model? Can the workflow be Darknet -> ONNX -> TensorRT?

@jiejing_ma

For the warning

I agree that this warning is annoying, but for now you can simply ignore it.
It is a legacy issue caused by backward compatibility with Caffe and UFF models.
It will be removed in later TensorRT versions.

For the error

At which step does the program quit with the error? As far as I know, the batch size should be consistent throughout the workflow ONNX -> TRT -> DS pipeline:

           batchsize=4  batchsize=4     batchsize=4
Darknet  ->   ONNX   ->   TensorRT  ->  DS pipeline

Have you configured batch size of both [streammux] and [primary-gie]?
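
As a hedged illustration of what "consistent" could look like for an engine built with batch size 4, the fragment below sets the batch-size key in the places mentioned above, plus the [property] group of the nvinfer config. The file names in the comments are the reference files from earlier in this thread; the value 4 is only an example, and the rest of each group is omitted.

```ini
# deepstream-app config (e.g. deepstream_app_config_yoloV4.txt)
[streammux]
batch-size=4

[primary-gie]
batch-size=4

# nvinfer config (e.g. config_infer_primary_yoloV4.txt)
[property]
batch-size=4
```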

For ratio of input

I think the aspect ratio of the model input should match the aspect ratio of the original image, or at least be close to it.
For example, if your image input is 1080 * 1920, then 320 * 512 or 320 * 608 may be a good choice;
if your image input is 1280 * 1280, then 416 * 416, 512 * 512 or 608 * 608 may be recommended for the model.

There is an argument named maintain-aspect-ratio in config_infer_primary_yoloV4.txt.
If maintain-aspect-ratio=1, the image is padded so that its aspect ratio matches the model input; otherwise, the image is stretched vertically or horizontally whenever its aspect ratio differs from the model input.
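
The arithmetic behind padding versus stretching can be sketched as follows. This is only an illustration of the general letterbox calculation: Letterbox and computeLetterbox are hypothetical names, and where DeepStream actually places the padding is not covered here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Result of fitting a source image into a network input while keeping
// its aspect ratio; padW/padH are the total leftover pixels per axis.
struct Letterbox { float scale; int scaledW, scaledH, padW, padH; };

static Letterbox computeLetterbox(int srcW, int srcH, int netW, int netH)
{
    assert(srcW > 0 && srcH > 0 && netW > 0 && netH > 0);
    // Use the smaller of the two ratios so the whole image fits.
    const float scale = std::min(static_cast<float>(netW) / srcW,
                                 static_cast<float>(netH) / srcH);
    const int scaledW = static_cast<int>(std::lround(srcW * scale));
    const int scaledH = static_cast<int>(std::lround(srcH * scale));
    return {scale, scaledW, scaledH, netW - scaledW, netH - scaledH};
}
```

For a frame 1920 wide and 1080 high with a network input 512 wide and 320 high, this gives a 512 x 288 scaled image and 32 rows of padding, which is why these two sizes are considered close in ratio. With a 608 x 608 input the same frame would need 266 rows of padding.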

DarkNet or Pytorch

Convert from darknet to onnx if you just want to use the YoloV4 official pretrained model.
Convert from pytorch to onnx if you want to use the model trained by Pytorch.

Hi @jiejing_ma @ersheng
I have implemented Yolov3 with DeepStream, but my attempt with Yolov4 failed.
Could you please share your workflow and some of the links you referred to?
I would like to reproduce the results you obtained in the screenshot you shared. Please help me with a summary, a workflow, or reference links.
Thanks