DeepStream 6 on Xavier NX: detection performance issues

We are using NVIDIA Xavier NX units in an automated machine vision application. Under DeepStream 5 on JetPack 4.4 we got decent performance, but we need some of the new features of JetPack 4.6 for deployment.

We are currently using Azure Custom Vision for training, so our model options are very limited. We are currently running a YOLO Tiny ONNX model, but its performance is 20 to 30% of what we see on Azure and on DS5.

We have also tried exporting the model as SSD, but we can't get DeepStream to import that ONNX either.

Do you have any advice? This is becoming a big issue for us. We are using the reference application and have tried almost every combination of parameters.

Let me know what I can send to help troubleshoot.

Hi,

It would be good if you could fill out our template first:

• Hardware Platform (Jetson / GPU)
• DeepStream Version
• JetPack Version (valid for Jetson only)
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type (questions, new requirements, bugs)
• How to reproduce the issue? (This is for bugs. Include which sample app is being used, the configuration file contents, the command line used, and other details for reproducing.)
• Requirement details (This is for new requirements. Include the module name — which plugin or which sample application — and the function description.)

Back to your question: have you tried our TAO Toolkit?
It provides some popular detection models and has pruning mechanisms that can improve performance.

https://developer.nvidia.com/tao-toolkit

Thanks.

No, we haven't, but we will add TAO to our list to evaluate. Unfortunately, we have a code investment in Azure Custom Vision training management, so this could take some time if we go in this direction.

I just applied for TAO Early Access. That doesn't fix the current issue, but it does give us some options.

I look forward to next steps from your side.

Hi,

1. Did you compare the performance with DeepStream 6.0.1 (JetPack 4.6.1) or DeepStream 6.0 (JetPack 4.6)?
If version 6.0 was used, would you mind testing this on DeepStream 6.0.1?

2. Please also try the patch shared below to see if it helps in your use case:

If the performance regression still occurs after the above two suggestions, would you mind sharing the ONNX model and the corresponding DeepStream configuration so we can reproduce the issue?

Thanks.

@AastaLLL
Yes, we are using 6.0.1. I saw that posting and verified that we had that patch several days ago.

Also, our issue isn't speed, it's accuracy/detection: we lost one to two orders of magnitude of detections when we went from DS5 to DS6. To get any detections at all, we had to turn our pre-cluster-threshold down to 0.001, which still didn't yield usable results and didn't offer any control. Obviously, something was very wrong. We found that many bounding boxes were returning large negative probabilities (-300%, -500%, even -800%).
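For anyone reproducing this, the threshold in question is the per-class one in the nvinfer config file; a sketch of the relevant section (key name per the DeepStream nvinfer documentation, the value shown is just the usual default, not our setting):

```ini
[class-attrs-all]
# Detections scoring below this are discarded before clustering;
# we had to drop this to 0.001 just to see any boxes at all.
pre-cluster-threshold=0.2
```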

But I think we have found the issue, and I want to share it because it's a problem with the DeepStream code base (/opt/nvidia/deepstream/deepstream-6.0/sources/objectDetector_Yolo/nvdsinfer_custom_impl_Yolo/).

In DS5, we were using a YOLO parser provided by an outside resource (and I don't currently have the source for it). That parser wasn't available for DS6, so we used the libnvdsinfer_custom_impl_Yolo library provided with DeepStream 6.0.1.
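(For context, this custom parser library is wired into the nvinfer config roughly as below; the key names and function name are the ones the objectDetector_Yolo sample uses, but check your own config for the exact values and paths:)

```ini
[property]
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
parse-bbox-func-name=NvDsInferParseCustomYoloV2Tiny
```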

Digging deep into the detection issues, what we found was that the provided code does not properly process the YOLOv2 (Tiny) output. YOLOv2 output requires special parsing to normalize the bounding boxes and probabilities into something usable by the rest of the pipeline.

Digging into the parsing code below (from 6.0.1), the first thing I noticed was that there is no post-processing/scaling on any of the output fields except for an exp on W/H. YOLO needs a sigmoid on X/Y and objectness, and a softmax on the class probabilities.
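For reference, the full per-anchor YOLOv2 decode is (standard YOLOv2 formulation, not DeepStream-specific; t* are the raw network outputs, cx/cy the grid-cell indices, pw/ph the anchor priors):

```
bx      = cx + sigmoid(tx)
by      = cy + sigmoid(ty)
bw      = pw * exp(tw)
bh      = ph * exp(th)
score_c = sigmoid(to) * softmax(t_class)[c]
```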

static std::vector<NvDsInferParseObjectInfo>
decodeYoloV2Tensor(
    const float* detections, const std::vector<float> &anchors,
    const uint gridSizeW, const uint gridSizeH, const uint stride, const uint numBBoxes,
    const uint numOutputClasses, const uint& netW,
    const uint& netH)
{
    std::vector<NvDsInferParseObjectInfo> binfo;
    for (uint y = 0; y < gridSizeH; ++y) {
        for (uint x = 0; x < gridSizeW; ++x) {
            for (uint b = 0; b < numBBoxes; ++b)
            {
                const float pw = anchors[b * 2];
                const float ph = anchors[b * 2 + 1];

                const int numGridCells = gridSizeH * gridSizeW;
                const int bbindex = y * gridSizeW + x;
                const float bx
                    = x + detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 0)];
                const float by
                    = y + detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 1)];
                const float bw
                    = pw * exp (detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 2)]);
                const float bh
                    = ph * exp (detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 3)]);

                const float objectness
                    = detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 4)];

                float maxProb = 0.0f;
                int maxIndex = -1;

                for (uint i = 0; i < numOutputClasses; ++i)
                {
                    float prob
                        = (detections[bbindex
                                      + numGridCells * (b * (5 + numOutputClasses) + (5 + i))]);

                    if (prob > maxProb)
                    {
                        maxProb = prob;
                        maxIndex = i;
                    }
                }
                maxProb = objectness * maxProb;

                addBBoxProposal(bx, by, bw, bh, stride, netW, netH, maxIndex, maxProb, binfo);
            }
        }
    }
    return binfo;
}

I modified the code as shown below. (I will update this post if I find any further issues, but it seems to give very "real" results compared to the supplied function.)

// Numerically stable sigmoid: evaluate exp() only on the negative
// side so large-magnitude logits cannot overflow.
static float safesigmoid(float x)
{
    if (x > 0.0f)
    {
        return (float)(1.0f / (1.0f + exp(-x)));
    }
    else
    {
        auto e = exp(x);
        return (float)(e / (1.0f + e));
    }
}


static std::vector<NvDsInferParseObjectInfo>
decodeYoloV2Tensor(
    const float* detections, const std::vector<float> &anchors,
    const uint gridSizeW, const uint gridSizeH, const uint stride, const uint numBBoxes,
    const uint numOutputClasses, const uint& netW,
    const uint& netH)
{

    std::vector<NvDsInferParseObjectInfo> binfo;
    for (uint y = 0; y < gridSizeH; ++y) {
        for (uint x = 0; x < gridSizeW; ++x) {
            for (uint b = 0; b < numBBoxes; ++b)
            {
                const float pw = anchors[b * 2];
                const float ph = anchors[b * 2 + 1];

                const int numGridCells = gridSizeH * gridSizeW;
                const int bbindex = y * gridSizeW + x;
                const float bx = x + safesigmoid (detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 0)]);
                const float by = y + safesigmoid (detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 1)]);
                const float bw = pw * exp (detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 2)]);
                const float bh = ph * exp (detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 3)]);

                const float objectness = safesigmoid(detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 4)]);


                float maxProb = 0.0f;
                int maxIndex = -1;

                float sum = 0.0f;
                for (uint i = 0; i < numOutputClasses; ++i)
                {
                    float prob = exp(detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + (5 + i))]);
                    sum += prob;

                    if (prob > maxProb)
                    {
                        maxProb = prob;
                        maxIndex = i;
                    }
                }

                if (sum > 0)
                {
                    maxProb = objectness * maxProb / sum;
                }

                addBBoxProposal(bx, by, bw, bh, stride, netW, netH, maxIndex, maxProb, binfo);
            }
        }
    }
    return binfo;
}

I wanted to update you on this as soon as possible and make it searchable for others dealing with poor detections on YOLOv2 Tiny, especially models trained with Microsoft Azure Custom Vision ONNX export (Compact domain). I assume there may be equivalent issues with the other YOLO versions as well, but I don't have the time or a test setup to verify and/or fix them.

Please let me know what you think, whether you have any questions, and whether there is anything additional I can supply.

Hi,

Our parser is verified against the original Darknet .cfg/.weights from the author.
Do you see a similar issue with the Darknet-format model?

Since the parser is model-dependent, it's possible that some updates are required when using an ONNX-based model from Azure.
Thanks.

Hi,

Have you fixed this issue with the change you shared on Apr 8?
Please let us know if you have further questions or need more help.

Thanks.

I fixed it by modifying the DS code as noted earlier.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.