Streaming Support for maxine-audio2face-2d

Hi,

I’m currently working with the Maxine Audio2Face-2D NIM container and have run into a few issues and questions:

Priority 1: Performance Issue

I’ve tested the container extensively on a Turing GPU using the following configuration:

feature_params = {
    "portrait_image": portrait_image_encoded,
    "model_selection": ModelSelection.MODEL_SELECTION_PERF,
    "animation_crop_mode": AnimationCroppingMode.ANIMATION_CROPPING_MODE_INSET_BLENDING,
    "enable_lookaway": 1,
    "lookaway_max_offset": 25,
    "lookaway_interval_min": 1,
    "lookaway_interval_range": 600,
    "blink_frequency": 1,
    "blink_duration": 2,
    "mouth_expression_multiplier": 1.0,
    "head_pose_mode": head_pose_mode,
    "head_pose_multiplier": 0.0
}

Even in Performance mode (MODEL_SELECTION_PERF), processing is roughly 50% slower than real time: about 15 seconds to process 10 seconds of audio.
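
For reference, this is roughly how I’m measuring it; send_to_audio2face_2d is just a placeholder for my gRPC client call into the NIM, not an actual API name:

import time
import wave

# Rough timing harness (sketch only): send_to_audio2face_2d is a placeholder
# for my gRPC client call into the Audio2Face-2D NIM, not an official API name.
def measure_realtime_factor(audio_path, feature_params):
    with wave.open(audio_path, "rb") as w:
        audio_seconds = w.getnframes() / w.getframerate()

    start = time.perf_counter()
    send_to_audio2face_2d(audio_path, feature_params)  # placeholder client call
    elapsed = time.perf_counter() - start

    # A factor above 1.0 means slower than real time; I see roughly 1.5 here.
    print(f"{elapsed:.1f}s to process {audio_seconds:.1f}s of audio "
          f"(real-time factor {elapsed / audio_seconds:.2f})")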

Additionally, the following log message appears consistently:
“Failed to query video capabilities: Invalid argument”

Could this performance issue be related to running driver version 570.124.04, given that the recommended driver (571.21 or newer) isn’t publicly available yet?

Could you suggest troubleshooting steps or adjustments to bring performance closer to real time?

Priority 2: Better Streaming Options

Currently, Maxine Audio2Face-2D only outputs standard MP4 files, which aren’t well suited to near-real-time streaming. My use case is real-time streaming of the generated animation via a proxy server, driven by audio responses produced by STT and an LLM.
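
To make the problem concrete: a plain MP4 generally can’t be played until its moov index is available, so a player has to wait for the whole file, whereas a fragmented MP4 interleaves moof/mdat pairs that can be consumed as they are produced. A quick, generic (not Maxine-specific) way to check which layout a given file uses is to list its top-level boxes:

import struct

def list_top_level_boxes(path):
    """List top-level MP4 box types; any 'moof' boxes indicate fragmented MP4."""
    boxes = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            boxes.append(box_type.decode("ascii", errors="replace"))
            if size == 1:
                # 64-bit extended size stored in the next 8 bytes
                size = struct.unpack(">Q", f.read(8))[0]
                f.seek(size - 16, 1)
            elif size == 0:
                break  # box extends to end of file
            else:
                f.seek(size - 8, 1)
    return boxes

print(list_top_level_boxes("a2f2d_output.mp4"))  # placeholder filename

If no moof boxes show up, the file is a monolithic MP4 and can’t be handed to a player incrementally.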

Questions:

  1. Are there existing options or roadmap plans to support streaming-friendly video formats, such as fragmented MP4 (fMP4) or similar segmented streaming approaches? (See the sketch after this list for the kind of output I mean.)
  2. What does NVIDIA recommend as the best practice or reference architecture to implement real-time streaming with Audio2Face-2D?
  3. The Digital Humans for Customer Service demo appears to support effective streaming. Could you clarify how streaming is implemented in this demo, and when or if such functionality will become publicly accessible?
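
To make question 1 concrete: the kind of output I’m after is roughly what a post-hoc remux into fragmented MP4 produces, for example with ffmpeg (filenames are placeholders, and this only works once the full file already exists, which defeats the purpose; ideally the NIM itself would emit fragments or segments as frames are generated):

import subprocess

# Illustrative remux only (no re-encode): turn a finished Audio2Face-2D output
# into fragmented MP4 so a player can start before the file is complete.
# Filenames are placeholders.
subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "a2f2d_output.mp4",
        "-c", "copy",
        "-movflags", "frag_keyframe+empty_moov+default_base_moof",
        "fragmented_output.mp4",
    ],
    check=True,
)

A streaming-friendly response from the NIM side (fMP4/CMAF segments, or frames I can mux myself) would remove the need for this intermediate file entirely.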

Thanks!