Active speaker detection

• Hardware Platform (Jetson / GPU)
Jetson Orin 4012, NVIDIA Jetson Orin NX Bundle, 8x 2GHz, 16GB DDR5
• DeepStream Version
Container: deepstream:7.0-triton-multiarch
• JetPack Version (valid for Jetson only)
see Container: deepstream:7.0-triton-multiarch
• TensorRT Version
see Container: deepstream:7.0-triton-multiarch
• NVIDIA GPU Driver Version (valid for GPU only)
Container: deepstream:7.0-triton-multiarch
• Issue Type( questions, new requirements, bugs)
Question

We want to answer the following question for a combined video and audio feed, using a GStreamer pipeline built around the nvinfer plugin:
Given a video with an arbitrary number of people in it, is one of them speaking and, if so, which one?

Currently we apply a heuristic to the output of the following pipeline, which uses the NVIDIA FacialLandmarks net:

gst-launch-1.0 v4l2src device=/dev/video0 !
      nvvideoconvert src-crop=0:0:1920:1080 !
      m.sink_0 nvstreammux name=m batch-size=1 live-source=1 width=1280 height=920 !
      nvinfer config-file-path=ai_pipeline/configs/facedetect.yml !
      nvinfer config-file-path=ai_pipeline/configs/landmarks.yml !
      fakesink

However, we are not satisfied with the results and wondered whether a solution for this problem already exists somewhere in the NVIDIA Model Zoo.

What makes you dissatisfied? The accuracy? The deployment? Or something else?

The “NVIDIA Model Zoo” provides pre-trained models and model backbones. You may need to retrain these models before using them in your product.

The problem was accuracy: the inferred lip landmark data was too noisy to reliably decide whether a face is currently speaking. We got good enough results by smoothing the data, but we wondered whether NVIDIA already offers a prebuilt solution for active speaker detection on live video + audio that we had not found.
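
For anyone reading along, here is a minimal Python sketch of the kind of smoothing heuristic described above. It is not our production code and not an NVIDIA API. It assumes a hypothetical per-frame value `mouth_opening` (for example the gap between upper and lower lip landmarks divided by the face box height), and the class name, `alpha`, `window` and `threshold` are illustrative values that would need tuning:

```python
from collections import deque
from statistics import pstdev


class SpeakingHeuristic:
    """Per-face speaking heuristic: smooth the mouth opening, then measure its variation."""

    def __init__(self, alpha=0.4, window=15, threshold=0.01):
        self.alpha = alpha            # EMA smoothing factor (assumed value)
        self.threshold = threshold    # std-dev above this counts as speaking (assumed value)
        self.history = deque(maxlen=window)  # last `window` smoothed values
        self.smoothed = None

    def update(self, mouth_opening):
        """Feed one frame's mouth opening; return True if the face looks like it is speaking."""
        if self.smoothed is None:
            self.smoothed = mouth_opening
        else:
            # Exponential moving average damps single-frame landmark jitter.
            self.smoothed = self.alpha * mouth_opening + (1 - self.alpha) * self.smoothed
        self.history.append(self.smoothed)
        if len(self.history) < self.history.maxlen:
            return False  # not enough frames collected yet
        # A mouth that keeps opening and closing yields a high spread in the window.
        return pstdev(self.history) > self.threshold
```

One instance would be kept per tracked face and `update` called once per frame with that face's value. The sketch looks at video only; gating the result with an audio-based voice activity check would be a separate step that is not shown here.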