Integrating ElevenLabs API with Audio2Face: Issues with Sample Rate and Buffer Size

Hello everyone,

I am reaching out for some expertise on a technical challenge I am currently facing. I am working on integrating the ElevenLabs API with Audio2Face, but I’m encountering difficulties related to the sample rate and buffer size.

Has anyone here attempted to combine these two tools before? If so, did you experience any issues with sample rate or buffer size? I would greatly appreciate any advice or solutions to overcome these hurdles.

Here are more details on the specific problems I am facing:

  • Inconsistencies in sample rate between the ElevenLabs API and Audio2Face, causing compatibility issues.
  • Difficulties with the buffer size that lead to delays and interruptions in audio processing.
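
On the buffer-size point, one common approach is to re-buffer the incoming stream into fixed-size blocks before handing it to the player. Below is a minimal sketch, assuming the TTS stream arrives as byte chunks of arbitrary length containing 16-bit mono PCM. The names `FixedBlockBuffer` and `block_bytes` are hypothetical and not part of the ElevenLabs or Audio2Face APIs.

```python
class FixedBlockBuffer:
    """Accumulate arbitrary-size byte chunks and emit fixed-size blocks."""

    def __init__(self, block_bytes: int):
        # 16-bit samples are 2 bytes each, so a block must hold whole samples.
        if block_bytes <= 0 or block_bytes % 2:
            raise ValueError("block_bytes must be a positive even number")
        self.block_bytes = block_bytes
        self._pending = bytearray()

    def push(self, chunk: bytes) -> list[bytes]:
        """Add a chunk and return every complete block now available."""
        self._pending.extend(chunk)
        blocks = []
        while len(self._pending) >= self.block_bytes:
            blocks.append(bytes(self._pending[:self.block_bytes]))
            del self._pending[:self.block_bytes]
        return blocks

    def flush(self) -> bytes:
        """Return the leftover partial block, zero-padded (silence) to full size."""
        if not self._pending:
            return b""
        missing = self.block_bytes - len(self._pending)
        padded = bytes(self._pending) + b"\x00" * missing
        self._pending.clear()
        return padded
```

Call `push()` for each chunk received from the network, forward the returned blocks to the consumer, and call `flush()` once when the stream ends. The right block size depends on the player and on how much latency is acceptable, so treat it as a tuning parameter.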

If anyone has suggestions, similar experiences to share, or can point me towards useful resources, it would be extremely helpful.

Thank you in advance for your help and feedback.

I’m not familiar with ElevenLabs, but I took a quick look and it seems it only exports audio in .mp3 format, right? Audio2Face only works with .wav.

As a test I created an mp3 using ElevenLabs, then converted it to wav and tested it in A2F and it seems to work as expected. Do you have an audio file we can test on our end?
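
For reference, the mp3-to-wav conversion step can be scripted. This is a minimal sketch, assuming `ffmpeg` is installed and on the PATH. The helper names `build_ffmpeg_cmd` and `convert_to_wav` are hypothetical, and the default sample rate and channel count are placeholders rather than values Audio2Face is known to require.

```python
import subprocess


def build_ffmpeg_cmd(src: str, dst: str, sample_rate: int = 16000, channels: int = 1) -> list[str]:
    """Build an ffmpeg command that decodes `src` to 16-bit PCM WAV at `dst`."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", src,               # input file (e.g. the mp3 from the TTS service)
        "-ar", str(sample_rate), # output sample rate in Hz
        "-ac", str(channels),    # number of output channels
        "-c:a", "pcm_s16le",     # signed 16-bit little-endian PCM
        dst,
    ]


def convert_to_wav(src: str, dst: str, **kwargs) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst, **kwargs), check=True)
```

For example, `convert_to_wav("speech.mp3", "speech.wav", sample_rate=22050)` would write a mono 22.05 kHz wav next to the source file.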

We want to use the ElevenLabs Text-to-Speech WebSocket API to send a live audio stream.

If you already have the .wav file and would like to send it to Audio2Face from the command line, take a look at the YouTube video “Overview of Streaming Audio Player in Omniverse Audio2Face”.

I have a question because I am facing exactly the same issue: ElevenLabs sends data in mp3 format, which makes sense for keeping network transfers fast, but A2F only supports wav. If we are building a system that operates in real time and we care about response speed, wouldn’t it be better to build mp3 support into A2F than to transcode from mp3 to wav every time we feed audio to it? Many people now use TTS solutions to generate voice, and because these are usually network services, they will not normally transmit wav, which is much larger. That’s why I always say ‘Let’s make life easier’: it’s better to build mp3 support into the program right away than to have everyone struggle with transcoding mp3 to wav each time. What is preventing the addition of this format?

Best regards

Thanks for bringing this to our attention, @chris508.

I guess the reason for not supporting MP3 was patent/copyright concerns, but I’ll bring this up with the team again and double-check.

I must admit that I was also afraid that this could be a licensing issue, because I don’t really see any other reason… and it would be useful :) It would make life easier for thousands of people


Another solution, considering the licensing issues, would be to use the ogg format, which has a comparable or even smaller size than mp3. I will talk to ElevenLabs to see if they can implement such a format. This could be a partial solution to the wav file problem.


I’ll bring this up with the team. Thanks Chris

Regarding MP3, I found this note: the patents for the MP3 format expired between 2012 and 2017. The last patents related to the MP3 format, owned by the Fraunhofer Society, expired in April 2017. After these patents expired, the MP3 format was no longer subject to licensing restrictions, which allowed for its wider and free use.


The audio system in Audio2Face is currently undergoing major changes. We’re hoping to support mp3 and ogg in future releases.


Oh, wonderful information! I can’t wait for the new options :)

Best regards

Hello, I’m not sure if I need to open a separate topic about this issue, but I wanted to mention it here because it’s a related topic. I am also trying to do similar things with the OpenAI Text-to-speech service, but the OpenAI TTS service only provides output with a 24kHz sample rate. Is there a technical obstacle to supporting 24kHz as well?
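
Until other rates are supported, one workaround is to resample on the client side. Here is a minimal sketch, assuming the audio is already decoded to 16-bit mono PCM samples. The function name `resample_linear` is hypothetical, and the target rate is whatever the receiving side expects. Linear interpolation is used only for simplicity; it applies no low-pass filter, so downsampling this way can alias, and a proper resampler would be preferable in production.

```python
from array import array


def resample_linear(samples: array, src_rate: int, dst_rate: int) -> array:
    """Resample 16-bit mono samples by linear interpolation between neighbours."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    out = array("h")
    if not samples:
        return out
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # input samples advanced per output sample
    last = len(samples) - 1
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)   # sample at or before the position
        hi = min(lo + 1, last)     # next sample, clamped at the end
        frac = pos - lo
        out.append(round(samples[lo] + (samples[hi] - samples[lo]) * frac))
    return out
```

For example, `resample_linear(pcm, 24000, 48000)` doubles the number of samples, and `resample_linear(pcm, 24000, 16000)` reduces it by a third.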

11labsgptatf.txt (6.3 KB)

This is what I use for ElevenLabs and any other TTS; it takes mp3s:

Place the Python script in:

Load and set up clarie_solved_arkit; you have to add a streaming player.
