Streaming Support for maxine-audio2face-2d

Hi,

I’m currently working with the Maxine Audio2Face-2D NIM container and have run into a few issues and questions:

Priority 1: Performance Issue

I’ve tested the container extensively on a Turing GPU using the following configuration:

feature_params = {
    "portrait_image": portrait_image_encoded,
    "model_selection": ModelSelection.MODEL_SELECTION_PERF,
    "animation_crop_mode": AnimationCroppingMode.ANIMATION_CROPPING_MODE_INSET_BLENDING,
    "enable_lookaway": 1,
    "lookaway_max_offset": 25,
    "lookaway_interval_min": 1,
    "lookaway_interval_range": 600,
    "blink_frequency": 1,
    "blink_duration": 2,
    "mouth_expression_multiplier": 1.0,
    "head_pose_mode": head_pose_mode,
    "head_pose_multiplier": 0.0
}

Even in Performance mode (MODEL_SELECTION_PERF), processing is roughly 50% slower than real time: about 15 seconds to process 10 seconds of audio.
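
For reference, this is roughly how I’m measuring it; send_to_audio2face_2d is just a placeholder for my gRPC client call into the NIM, not an actual API name:

import time
import wave

# Rough timing harness (sketch only): send_to_audio2face_2d is a placeholder
# for my gRPC client call into the Audio2Face-2D NIM, not an official API name.
def measure_realtime_factor(audio_path, feature_params):
    with wave.open(audio_path, "rb") as w:
        audio_seconds = w.getnframes() / w.getframerate()

    start = time.perf_counter()
    send_to_audio2face_2d(audio_path, feature_params)  # placeholder client call
    elapsed = time.perf_counter() - start

    # A factor above 1.0 means slower than real time; I see roughly 1.5 here.
    print(f"{elapsed:.1f}s to process {audio_seconds:.1f}s of audio "
          f"(real-time factor {elapsed / audio_seconds:.2f})")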

Additionally, the following log message appears consistently:
“Failed to query video capabilities: Invalid argument”

Could this performance issue be related to running driver version 570.124.04, given that the recommended driver (571.21 or newer) isn’t publicly available yet?

Could you suggest troubleshooting steps or adjustments to bring performance closer to real time?

Priority 2: Better Streaming Options

Currently, Maxine Audio2Face-2D only outputs standard MP4 files, which aren’t well suited to near-real-time streaming. My use case is real-time streaming of the generated animation via a proxy server, driven by audio responses produced by STT and an LLM.
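
To make the problem concrete: a plain MP4 generally can’t be played until its moov index is available, so a player has to wait for the whole file, whereas a fragmented MP4 interleaves moof/mdat pairs that can be consumed as they are produced. A quick, generic (not Maxine-specific) way to check which layout a given file uses is to list its top-level boxes:

import struct

def list_top_level_boxes(path):
    """List top-level MP4 box types; any 'moof' boxes indicate fragmented MP4."""
    boxes = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            boxes.append(box_type.decode("ascii", errors="replace"))
            if size == 1:
                # 64-bit extended size stored in the next 8 bytes
                size = struct.unpack(">Q", f.read(8))[0]
                f.seek(size - 16, 1)
            elif size == 0:
                break  # box extends to end of file
            else:
                f.seek(size - 8, 1)
    return boxes

print(list_top_level_boxes("a2f2d_output.mp4"))  # placeholder filename

If no moof boxes show up, the file is a monolithic MP4 and can’t be handed to a player incrementally.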

Questions:

  1. Are there existing options or roadmap plans to support streaming-friendly video formats, such as fragmented MP4 (fMP4) or similar segmented streaming approaches? (See the sketch after this list for the kind of output I mean.)
  2. What does NVIDIA recommend as the best practice or reference architecture to implement real-time streaming with Audio2Face-2D?
  3. The Digital Humans for Customer Service demo appears to support effective streaming. Could you clarify how streaming is implemented in this demo, and when or if such functionality will become publicly accessible?
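
To make question 1 concrete: the kind of output I’m after is roughly what a post-hoc remux into fragmented MP4 produces, for example with ffmpeg (filenames are placeholders, and this only works once the full file already exists, which defeats the purpose; ideally the NIM itself would emit fragments or segments as frames are generated):

import subprocess

# Illustrative remux only (no re-encode): turn a finished Audio2Face-2D output
# into fragmented MP4 so a player can start before the file is complete.
# Filenames are placeholders.
subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "a2f2d_output.mp4",
        "-c", "copy",
        "-movflags", "frag_keyframe+empty_moov+default_base_moof",
        "fragmented_output.mp4",
    ],
    check=True,
)

A streaming-friendly response from the NIM side (fMP4/CMAF segments, or frames I can mux myself) would remove the need for this intermediate file entirely.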

Thanks!