Yolo2 App for DS

I was following the sample code at https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps/tree/master/yolo/samples/objectDetector_YoloV3

I modified the respective variables and was able to build, but the detection boxes are all incorrect. Is there anywhere I can find a yolov2 sample app for DS?

Hi,

Can you provide more information regarding the changes you have made and what's incorrect in the output detections?

So one of the obvious things that needs to be done is to create a yolov2 engine. I created that. The rest of the computations are pretty much the same as objectDetector_YoloV3. I configured the paths accordingly and I can see that the YoloV2 engine is getting used. The detection boxes are very small. Are we supposed to scale the bounding box values for yolov2? Any help in this regard is much appreciated.

Well, I would double-check that all the parameters here match the engine you have generated -

https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps/blob/master/yolo/samples/objectDetector_YoloV3/nvdsinfer_custom_impl_YoloV3/nvdsparsebbox_YoloV3.cpp#L280

Along with that, keep in mind that the decodeTensor operations vary between yolov2 and yolov3.

See differences here -
https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps/blob/master/yolo/lib/yolov2.cpp#L59
https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps/blob/master/yolo/lib/yolov3.cpp#L59
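
Roughly, the difference in the anchor lookup looks like this (a minimal sketch paraphrasing those two files, not the exact repo code):

// yolov3 (three output scales): each scale selects its anchors through
// a mask, and the anchors in the cfg are already in input resolution.
const float pw = anchors[mask[b] * 2];
const float ph = anchors[mask[b] * 2 + 1];

// yolov2 (single output, no masks): anchors are indexed directly, and
// the cfg expresses them in grid-cell units, so they must be scaled to
// input resolution before use.
const float pw = anchors[2 * b];
const float ph = anchors[2 * b + 1];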

Yes, that is exactly what I’ve done.

In https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps/blob/3a8957b2d985d7fc2498a0f070832eb145e809ca/yolo/samples/objectDetector_YoloV3/nvdsinfer_custom_impl_YoloV3/nvdsparsebbox_YoloV3.cpp#L160:

const float bx
    = x + detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 0)];
const float by
    = y + detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 1)];
const float bw
    = pw * exp(detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 2)]);
const float bh
    = ph * exp(detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 3)]);
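
For reference, these indices follow the reference parser, where (if I'm reading it right) numGridCells = gridSize * gridSize and bbindex = y * gridSize + x:

// Element c of box b at grid cell (x, y) sits at
// detections[(b * (5 + numOutputClasses) + c) * numGridCells + y * gridSize + x],
// i.e. a CHW buffer whose channel index is b * (5 + numOutputClasses) + c.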

In https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps/blob/3a8957b2d985d7fc2498a0f070832eb145e809ca/yolo/samples/objectDetector_YoloV3/nvdsinfer_custom_impl_YoloV3/nvdsparsebbox_YoloV3.cpp#L210:

static int outputBlobIndex1 = -1;
static const int NUM_CLASSES_YOLO_V2 = 80;
static bool classMismatchWarn = false;

if (outputBlobIndex1 == -1)
{
    for (uint i = 0; i < outputLayersInfo.size(); i++)
    {
        if (strcmp(outputLayersInfo[i].layerName, "region_32") == 0)
        {
            outputBlobIndex1 = i;
            break;
        }
    }
    if (outputBlobIndex1 == -1)
    {
        std::cerr << "Could not find output layer 'region_32' while parsing" << std::endl;
        return false;
    }
}

And in https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps/blob/3a8957b2d985d7fc2498a0f070832eb145e809ca/yolo/samples/objectDetector_YoloV3/nvdsinfer_custom_impl_YoloV3/nvdsparsebbox_YoloV3.cpp#L275:

std::vector<float*> outputBlobs(1, nullptr);
outputBlobs.at(0) = (float*) outputLayersInfo[outputBlobIndex1].buffer;

const float kNMS_THRESH = 0.4f;
const float kPROB_THRESH = 0.5f;
const uint kNUM_BBOXES = 5;
const uint kINPUT_H = 608;
const uint kINPUT_W = 608;
const uint kSTRIDE_1 = 32;
const uint kGRID_SIZE_1 = kINPUT_H / kSTRIDE_1;

std::vector<NvDsInferParseObjectInfo> objects;
std::vector<NvDsInferParseObjectInfo> objects1
    = decodeTensor(outputBlobs.at(0), kMASK_1, kANCHORS, kGRID_SIZE_1, kSTRIDE_1, kNUM_BBOXES,
                   NUM_CLASSES_YOLO_V2, kPROB_THRESH, kINPUT_W, kINPUT_H);

objectList.clear();

objectList = nmsAllClasses(kNMS_THRESH, objects1, NUM_CLASSES_YOLO_V2);

But my results are all goofed up.

What about the anchors? Have you updated them? Anchors differ between yolov2 and yolov3, and they also need to be in network input resolution.

Have a look at how it's done here - https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps/blob/3a8957b2d985d7fc2498a0f070832eb145e809ca/yolo/lib/yolo.cpp#L294

Hi, thanks for pointing that out to me. I modified the code accordingly and multiplied the anchors by the stride, but the results are the same.

FYI, my anchors for YoloV2 are defined thus:

const std::vector<float> kANCHORS
    = {0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828};

These need to be multiplied by the stride.
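
For example, with the constants from your snippet (a sketch; kSTRIDE_1 = 32):

// Scale the grid-cell-unit yolov2 anchors to network input resolution
// by multiplying each value by the stride of the output layer.
std::vector<float> scaledAnchors;
scaledAnchors.reserve(kANCHORS.size());
for (const float a : kANCHORS)
    scaledAnchors.push_back(a * kSTRIDE_1);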

How are you computing the values of ph and pw? There are no masks in yolov2, yet you still seem to be using them. All the information you need to implement the decoding of bounding boxes from the network's output is present over here. Please check that your implementation is exactly the same.

https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps/blob/master/yolo/lib/yolov2.cpp#L33

Yes, I did multiply by the stride when you suggested that in your answer posted 07/02/2019 05:48 PM.

My ph and pw are in line with your suggestion:

const float pw = anchors[2 * b];
const float ph = anchors[2 * b + 1];

...

const float bw = pw
    * exp(detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 2)]);
const float bh = ph
    * exp(detections[bbindex + numGridCells * (b * (5 + numOutputClasses) + 3)]);

In fact, from the beginning I’ve referred to https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps/blob/master/yolo/lib/yolov2.cpp#L33. The thing is, the bounding boxes are all messy.

I’m not a big expert in yolo, but do you suspect the problem could be that the way the final layer output is stacked in yolov2 differs from yolov3?

The output layer implementations of yolov2 and yolov3 are different, and the changes in the decodeTensor function should take care of that.

Can you share a sample image of what the outputs look like? I would double-check that the cfg file used to generate the engine has the same parameters as your NvDsInferParseCustomYoloV3(…) function. Typically, if the network input height and width differ between the engine file and the parsing function, you would see such behavior.
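
As a quick way to catch that, you could add an early check to the parsing function (a sketch, assuming the standard NvDsInferParseCustomYoloV3 signature where networkInfo carries the engine's input dimensions):

// Fail fast if the engine's input resolution disagrees with the
// constants hard-coded in the parsing function.
if (networkInfo.width != kINPUT_W || networkInfo.height != kINPUT_H)
{
    std::cerr << "Input resolution mismatch: engine " << networkInfo.width
              << "x" << networkInfo.height << " vs parser " << kINPUT_W
              << "x" << kINPUT_H << std::endl;
    return false;
}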