Build an avatar with ASR, ChatGPT, TTS and Omniverse Audio2Face

Demo

Below, I present the results of my work using NVIDIA Audio2Face and ChatGPT to create a basic interactive virtual human. Users can engage with it through voice input and hold a conversation with it.

Description

This is an update to my previously published article on simple interactive, conversational virtual human technology. It has been a year since I last wrote about it, and I have finally found the time to release new content. Over the past year there have been significant developments, including improvements to Audio2Face and the launch of ChatGPT. With these convenient AI tools, creating a convincing, lifelike virtual human experience has become easier than ever.

The source code

I have published the source code for this micro-project on my GitHub repository. Feel free to download it from: https://github.com/metaiintw/build-an-avatar-with-ASR-TTS-Transformer-Omniverse-Audio2Face/tree/main/2.Avatar_With_ChatGPT

System requirements

| Element | Configuration used in the demo |
| --- | --- |
| OS supported | Ubuntu 22.04 |
| CPU | Intel Core i9-13900 |
| RAM | 96 GB |
| Storage | 2 TB SSD |
| GPU | NVIDIA GeForce RTX 4090 |

How to use the source code to create the virtual assistant experience demonstrated in the demo video

1. Build the virtual environment

Using this GitHub repo to build the avatar is straightforward: just use Anaconda to create a Python virtual environment from avatar_requirements.yml.
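With Anaconda (or Mamba) installed, `conda env create -f avatar_requirements.yml` builds the environment. After activating it, a quick way to confirm everything installed correctly is to check that the notebook’s dependencies import. The sketch below is only a hypothetical check: the package names are my assumptions about what avatar_requirements.yml installs, so adjust the list to match the actual file.

```python
# Hypothetical sanity check, run inside the activated Anaconda environment.
# The package list is an assumption about what avatar_requirements.yml installs;
# edit it to match the actual environment file in the repo.
import importlib

expected_packages = ["openai", "sentence_transformers", "sounddevice", "grpc", "numpy"]

for name in expected_packages:
    try:
        importlib.import_module(name)
        print(f"OK       {name}")
    except ImportError as exc:
        print(f"MISSING  {name}: {exc}")
```

If any package is reported missing (as in the sentence_transformers issue mentioned near the end of this thread), install it into the same environment before launching the notebook.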

2. Open the attached USD file with NVIDIA Audio2Face

Open claire_audio_streaming.usd in the USD_files folder using NVIDIA Audio2Face (Version 2023.1.0).
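For context on how audio reaches the avatar: the stage is set up for audio streaming, so the notebook can push TTS audio to Audio2Face over gRPC. The sketch below is a rough, hypothetical illustration built on the streaming sample client (test_client.py and its generated audio2face_pb2 modules) that ships with Audio2Face; the gRPC address and the player prim path are assumptions, so adjust them to match claire_audio_streaming.usd and your setup.

```python
# Hypothetical sketch: push a TTS waveform to the Audio2Face streaming player.
# Assumes the gRPC sample client (test_client.py / audio2face_pb2*) bundled with
# Audio2Face is importable, and that the stage exposes a streaming player prim.
import soundfile as sf
from test_client import push_audio_track  # helper from the A2F streaming sample

A2F_URL = "localhost:50051"                        # default port of the A2F streaming server
PLAYER_PRIM = "/World/audio2face/PlayerStreaming"  # adjust to the prim path in your stage

def speak(wav_path: str) -> None:
    """Read a mono WAV file produced by TTS and send it to Audio2Face."""
    audio, samplerate = sf.read(wav_path, dtype="float32")
    if audio.ndim > 1:          # the streaming player expects a single channel
        audio = audio.mean(axis=1)
    push_audio_track(A2F_URL, audio, samplerate, PLAYER_PRIM)

speak("tts_output.wav")
```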

3. Run the IPython notebook

Finally, activate the Python virtual environment and run build-an-avatar-with-ASR-TTS-ChatGPTOmniverse-Audio2Face.ipynb.

Please note that you need an OpenAI account and an API key (token) to use the ChatGPT API in the “build-an-avatar-with-ASR-TTS-ChatGPTOmniverse-Audio2Face.ipynb” notebook. Enter your key in the notebook to access the API; instructions on how to obtain it are included in the notebook.
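For reference, this is roughly what the ChatGPT call looks like with the pre-1.0 openai Python package. It is a minimal sketch, not the notebook’s exact code: the model choice, system prompt, and environment-variable handling here are illustrative assumptions.

```python
# Minimal ChatGPT API sketch (pre-1.0 openai package); the notebook's actual
# prompt handling and model choice may differ.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]  # or paste your API key directly

def chat(user_text: str) -> str:
    """Send the recognized speech to ChatGPT and return the reply text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a friendly virtual assistant."},  # illustrative prompt
            {"role": "user", "content": user_text},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(chat("Hello, who are you?"))
```

The reply text is what the TTS stage turns into audio and streams to Audio2Face.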

Once you have completed the above steps, you can start experiencing this simple virtual human application.

I will update the documentation on the GitHub repo and this article to provide more details about the development process. I hope this content is helpful to you.


Is an RTX 3060 with 16 GB of RAM suitable?

@dr.l.fadaly The general consensus is ‘the more VRAM you have, the better’, especially if you find yourself needing to work with A2F often. Here are the technical requirements for A2F for your reference:

https://docs.omniverse.nvidia.com/audio2face/latest/common/technical-requirements.html

Should importing the environment have taken 40+ minutes (it’s still going)?

@grumpy_bud What kind of “environment” are you referring to, which OV app are you using, and what are your hardware specifications?

Since you posted in this thread under the Digital Human category, it’s best to elaborate with this sort of detail so others can be better informed about your particular scenario. Better yet, I would encourage you to start a new thread in a more relevant forum if you are using a specific OV app.

Looks like using Anaconda Navigator was a bad idea, as it was using 11 GB of RAM and couldn’t even access the file. Mamba is working perfectly. (Edit: I have 16 GB.)

@Simplychenable When running the Jupyter notebook, it says sentence_transformers was not found.