NvOFTracker-YOLOv3: Recreating Algorithm Benchmark Values

I’ve been trying to reproduce the algorithm benchmark values reported in the NVOFA Tracker documentation for NvOFTracker-YOLOv3. I’ve built the NvOFTracker library and the NvOFTSample executable in a Docker container with the following specs:

  • Ubuntu 18.04
  • CUDA 11.0
  • cuDNN 8.0
  • TensorRT 7.2.2
  • Video Codec SDK 10.0
  • OpenCV 4.5.1

I’ve built the YOLOv3 TensorRT engine as described in the installation README, and I’m using the MOT16 training set specified in the Tracker documentation. First, I stitched each set of training image frames into a video with the command below:

ffmpeg -framerate <framerate> -i <input/frames> -codec copy <output_name>.mkv
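For concreteness, here is a sketch of what that invocation could look like for a single sequence. The sequence name, frame rate, and directory layout (`img1/%06d.jpg`) are assumptions based on the standard MOT16 release, not details confirmed above:

```python
# Sketch: build the ffmpeg stitching command for one (hypothetical) MOT16
# sequence. MOT16-02 ships at 30 fps in the standard release, with frames
# numbered img1/000001.jpg onward; adjust per sequence.
def ffmpeg_stitch_cmd(seq_dir, framerate, out_name):
    return [
        "ffmpeg",
        "-framerate", str(framerate),      # must match the sequence's native rate
        "-i", f"{seq_dir}/img1/%06d.jpg",  # MOT16 frame-numbering pattern
        "-codec", "copy",                  # stream-copy the JPEGs, no re-encode
        f"{out_name}.mkv",
    ]

cmd = ffmpeg_stitch_cmd("MOT16/train/MOT16-02", 30, "MOT16-02")
print(" ".join(cmd))
```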

Then, I ran ./NvOFTSample on each of the stitched videos. I modified the DumpTrackedObjects() function in NvOFTSample.cpp to produce output compliant with the MOT16 Challenge format. Finally, I ran the output files through py-motmetrics to obtain the accuracy measurements. These are the values I obtained:

  • MOTA: 25.4%
  • FP: 20122
  • FN: 61489
  • IDSW (ID switches): 739
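For reference, the MOT Challenge text format that the modified DumpTrackedObjects() needs to emit is one CSV line per tracked box per frame. A minimal sketch of that layout (the helper name is mine; for 2D tracking the last three fields are conventionally -1):

```python
# Sketch of one MOT Challenge result line:
#   frame, id, bb_left, bb_top, bb_width, bb_height, conf, x, y, z
# For 2D tracking, x/y/z are unused and set to -1 by convention.
def mot_line(frame, track_id, left, top, width, height, conf=1.0):
    return (f"{frame},{track_id},{left:.2f},{top:.2f},"
            f"{width:.2f},{height:.2f},{conf:.2f},-1,-1,-1")

print(mot_line(1, 7, 100.0, 50.0, 40.0, 80.0))
# → 1,7,100.00,50.00,40.00,80.00,1.00,-1,-1,-1
```

Lines are written per frame, in frame order, one file per sequence.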

I’ve managed to get FN and ID-switch values similar to the ones reported in the documentation, but my FP value is much higher and my MOTA value correspondingly much lower. Am I on the right track with this benchmarking procedure? If so, what could cause such a discrepancy in the FP count? And if not, what is the correct way to benchmark NvOFTracker-YOLOv3?
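As a sanity check on the figures above: MOTA is defined as 1 − (FP + FN + IDSW) / GT, where GT is the total number of ground-truth boxes. Taking roughly 110,407 annotated boxes for the MOT16 training set (my figure, not stated in the post), the reported values are internally consistent, which suggests the low MOTA is driven almost entirely by the inflated FP count:

```python
# MOTA = 1 - (FP + FN + IDSW) / GT
# GT below is the approximate ground-truth box count of the MOT16
# training set (an assumption on my part, not from the post above).
FP, FN, IDSW = 20122, 61489, 739
GT = 110407
mota = 1.0 - (FP + FN + IDSW) / GT
print(f"MOTA = {mota:.1%}")  # → MOTA = 25.4%
```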

NvOFTracker does delayed entry and removal of rects. That is, a new incoming rect is only admitted to the list of tracked rects after a 4-5 frame buffer, and along the same lines, rects for which no match is found are kept alive for 4-5 frames before removal.
You mention stitching the videos into one long sequence. We suspect this is producing a number of ghost rects (rects with nothing inside them, in other words false positives) at the video boundaries, i.e. at the end of each sequence. These ghost rects will persist for 4-5 frames due to the scheme described above. Could you try your experiment without stitching the videos, running each sequence independently instead?
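The suggestion above can be sketched as a driver loop: stitch and track each MOT16 training sequence on its own, so tracker state (and any lingering ghost rects) cannot leak across sequence boundaries. The sequence names are the standard MOT16 train split; the NvOFTSample flags shown are placeholders, not the sample's actual CLI, so check its usage string:

```python
# Sketch: process each MOT16 training sequence independently rather than
# as one concatenated video. NvOFTSample's real flag names may differ;
# treat "-i"/"-o" below as placeholders.
SEQUENCES = ["MOT16-02", "MOT16-04", "MOT16-05", "MOT16-09",
             "MOT16-10", "MOT16-11", "MOT16-13"]  # standard MOT16 train split

def per_sequence_cmds(seq, framerate):
    # framerate should come from each sequence's seqinfo.ini (they differ)
    stitch = ["ffmpeg", "-framerate", str(framerate),
              "-i", f"MOT16/train/{seq}/img1/%06d.jpg",
              "-codec", "copy", f"{seq}.mkv"]
    track = ["./NvOFTSample", "-i", f"{seq}.mkv", "-o", f"{seq}.txt"]
    return stitch, track

stitch, track = per_sequence_cmds("MOT16-05", 14)
```

Each per-sequence result file can then be scored against its own ground truth with py-motmetrics, and the per-sequence FP counts compared with the stitched run to confirm the boundary-ghost hypothesis.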