VPI performance for background subtraction is SLOW - need advice

So my first, very simple project on the TX2 was less than satisfactory. I work with videos, and one of the first pre-processing steps is background subtraction. On my PC I currently use OpenCV's MOG2 background subtraction (cv2.createBackgroundSubtractorMOG2). After a bit of research, I decided to try VPI background subtraction (vpi.BackgroundSubtractor) on the TX2, but the results were not what I expected. With CV2 MOG2 on the TX2, processing a 1.6 MB file (about 170 frames at 800x600) took about 4.7 seconds. VPI subtraction on the CUDA backend took 5.7 seconds, and on the CPU backend 15 seconds. Is CV2 MOG2 simply a much faster algorithm than the one VPI uses, or am I not using VPI background subtraction correctly? The faster the better! Any pointers would help. Below are the two scripts I used:

VPI
import cv2
import vpi
import time

cap = cv2.VideoCapture("path/to/video.mp4")

videosize = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))

with vpi.Backend.CUDA:
    cuda_sub = vpi.BackgroundSubtractor(videosize, vpi.Format.BGR8)

start_time = time.time()

while True:
    ret, frame = cap.read()

    if not ret:
        break

    # Wrap the decoded OpenCV frame as a VPI image and run one subtraction step
    mask, image = cuda_sub(vpi.asimage(frame, vpi.Format.BGR8), learnrate=0.01)

execution_time = (time.time() - start_time)

print("Execution time: " + str(execution_time))

cap.release()
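
One thing I am not sure about with the VPI version: if VPI submits the CUDA work asynchronously, taking time.time() right after the loop might stop the clock before the last frames have actually finished on the GPU. A sketch of the fix, assuming vpi.Stream.current.sync() is the right call for this in the Python API:

# After the while loop, before stopping the timer:
vpi.Stream.current.sync()  # assumed to block until all queued VPI work is done

execution_time = (time.time() - start_time)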

CV2 MOG
import cv2
import time

cap = cv2.VideoCapture("path/to/video.mp4")

subtractor = cv2.createBackgroundSubtractorMOG2(history=10, varThreshold=25, detectShadows=False)

start_time = time.time()

while True:
    ret, frame = cap.read()

    if not ret:
        break

    # Run one MOG2 update + foreground-mask extraction on the frame
    frame_mask = subtractor.apply(frame)

execution_time = (time.time() - start_time)

print("Execution time: " + str(execution_time))

cap.release()
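
Note that both timings include cap.read() video decoding, which is common to the two measurements and can dominate for a short clip. A variant that pre-decodes the frames and times only the subtraction step would isolate the algorithm cost better (a sketch, not what I ran):

import cv2
import time

cap = cv2.VideoCapture("path/to/video.mp4")

# Pre-decode all frames so the timed loop measures only subtraction.
# Fine for a short 800x600 clip; a long video would need batching.
frames = []
while True:
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(frame)
cap.release()

subtractor = cv2.createBackgroundSubtractorMOG2(history=10, varThreshold=25,
                                                detectShadows=False)

start_time = time.time()
for frame in frames:
    frame_mask = subtractor.apply(frame)
elapsed = time.time() - start_time

print(f"{len(frames)} frames, {elapsed / len(frames) * 1000:.1f} ms/frame")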

Hi,

You can find the benchmark data of the background subtractor below:
https://docs.nvidia.com/vpi/algo_background_subtractor.html

For TX2 with a 1920x1080 RGB8 input, it’s expected to take 35.0±0.7 ms per frame on the CUDA backend.
Could you check the sample in the document to see if there is anything different between the implementations?

Also, please remember to maximize performance with the VPI clock script first:
https://docs.nvidia.com/vpi/algo_performance.html#maxout_clocks

Thanks.

Is there any way to obtain the benchmark source code, to see if there were any optimizations?
Also, regarding the VPI clock script, is there any reason why I can’t just leave the TX2 in the maxed-out state all the time, if power consumption is not an issue?

Hi,

Sorry, I just realized that the timing you mentioned is for the whole video.
Based on that number, it takes 5.7 s / 170 frames ≈ 33 ms per frame, which is close to the benchmark score.

It seems that we don’t have a comparison between OpenCV and VPI for the background subtraction algorithm.
We are going to reproduce this internally to see the behavior in our environment.
Will share more information with you later.

Thanks.

Hi,

Confirmed that we can reproduce the performance difference.
We are checking this with our internal team.

Will share more information later.
Thanks.

I made a mistake: the 170-frame count was an estimate. The actual video turns out to be 331 frames at 800x600. Regarding the benchmark score, the benchmark was run on 1920x1080 frames. Since mine are much smaller, about 480K pixels versus over 2,073K pixels for the benchmark, should there be a corresponding increase in performance, from 35 ms per frame to much less?
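
To put numbers on it, assuming runtime scales linearly with pixel count (which probably doesn't hold exactly, since fixed per-frame overheads like kernel launches and image wrapping don't shrink with resolution):

# Back-of-the-envelope scaling check for the CUDA-backend benchmark figure
bench_ms = 35.0              # benchmark: 1920x1080 RGB8 on TX2, CUDA backend
bench_px = 1920 * 1080       # ~2,073K pixels
my_px = 800 * 600            # ~480K pixels

print(bench_ms * my_px / bench_px)   # predicted: ~8.1 ms/frame
print(5.7 / 331 * 1000)              # measured:  ~17.2 ms/frame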

Hi,

Thanks for the update.

We can also reproduce the performance issue in our environment.
To give more suggestions, we need to check more details with our internal team.

Will share more information with you once we get feedback.
Thanks.