Elevate Your Content with Text to Speech Avatar Videos

Text to speech avatar overview - Speech service - Azure AI services

Text to speech avatar converts text into a digital video of a photorealistic human (either a standard avatar or a custom text to speech avatar) speaking with a natural-sounding voice. The text to speech avatar video can be synthesized asynchronously or in real time. Developers can build applications integrated with text to speech avatar through an API, or use a content creation tool on Speech Studio to create video content without coding.

Advanced neural network models

Text to speech avatar uses advanced neural network models to produce lifelike, high-quality synthetic talking avatar videos for a variety of applications while adhering to responsible AI practices.

Tip: To convert text to speech with a no-code approach, try the Text to speech avatar tool in Speech Studio.

Capabilities of text to speech avatar

Text to speech avatar capabilities include:

Delivering lifelike and high-quality synthetic talking avatar videos
Choosing from a range of standard voices for the avatar
Supporting the same languages as text to speech
Accessing standard text to speech avatars through the Speech Studio portal or via API

The voice in the synthetic video could be an Azure AI Speech standard voice or a custom voice of voice talent selected by you.

Both batch synthesis and real-time synthesis produce video at a resolution of 1920 x 1080 and 25 frames per second (FPS). For batch synthesis, the codec can be h264, hevc, or av1 when the format is mp4, and vp9 or av1 when the format is webm; only vp9 can contain an alpha channel. For real-time synthesis, the codec is h264. The video bitrate can be configured in the request for both batch synthesis and real-time synthesis, with a default value of 2000000. More detailed configurations can be found in the sample code.
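The format and codec rules above can be checked before a request is sent. The following is a minimal sketch in Python, not part of the Speech SDK: the function names, the "batch" and "realtime" mode strings, and the constant names are all hypothetical and chosen only for illustration. The allowed combinations, the 25 FPS figure, and the default bitrate are taken from the paragraph above.

```python
# Hypothetical helper: encodes the codec rules described in the text.
# None of these names come from the Speech SDK or REST API.

# Batch synthesis: allowed codecs per container format.
BATCH_CODECS = {
    "mp4": {"h264", "hevc", "av1"},
    "webm": {"vp9", "av1"},
}
# Real-time synthesis: h264 only.
REALTIME_CODECS = {"h264"}
# Only vp9 can contain an alpha channel.
ALPHA_CODECS = {"vp9"}

FPS = 25                     # frames per second for both modes
DEFAULT_BITRATE = 2_000_000  # default video bitrate stated in the text


def validate_video_settings(mode, codec, video_format="mp4", alpha=False):
    """Return a list of problems; an empty list means the settings are consistent."""
    codec = codec.lower()
    video_format = video_format.lower()
    problems = []

    if mode == "realtime":
        # The text only states the real-time codec, so the format is not checked here.
        if codec not in REALTIME_CODECS:
            problems.append(f"real-time synthesis does not support codec '{codec}'")
    elif mode == "batch":
        allowed = BATCH_CODECS.get(video_format)
        if allowed is None:
            problems.append(f"unknown batch format '{video_format}'")
        elif codec not in allowed:
            problems.append(f"codec '{codec}' is not valid for format '{video_format}'")
    else:
        raise ValueError("mode must be 'batch' or 'realtime'")

    if alpha and codec not in ALPHA_CODECS:
        problems.append(f"codec '{codec}' cannot contain an alpha channel")
    return problems


def frame_count(duration_seconds):
    """Number of video frames for a clip of the given length at 25 FPS."""
    return round(duration_seconds * FPS)
```

For example, `validate_video_settings("batch", "vp9", "webm", alpha=True)` returns an empty list, while requesting an alpha channel with h264 in an mp4 container reports one problem.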

You can create custom text to speech avatars that are unique to your product or brand. All it takes to get started is 10 minutes of video recordings. If you also fine-tune a professional voice for the actor, the avatar can be highly realistic.

Voice sync for avatar is trained alongside the custom avatar, using audio from the training video. The voice is exclusively associated with the custom avatar and cannot be used independently.

Professional voice fine-tuning and custom text to speech avatar are separate features. You can use them independently or together. If you plan to also use professional voice fine-tuning with a text to speech avatar, you need to deploy or copy your fine-tuned professional voice model to one of the avatar supported regions.

For more information, see What is custom text to speech avatar.

Sample code for text to speech avatar is available on GitHub. These samples cover the most popular scenarios.

The text to speech avatar feature is only available in the following service regions: Southeast Asia, North Europe, West Europe, Sweden Central, South Central US, East US 2, and West US 2.
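An application can guard against unsupported regions before it creates a Speech resource or copies a voice model. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the short Azure region identifier is the display name lowercased with spaces removed (for example, "West US 2" becomes "westus2"). The region list itself is the one given above.

```python
# Hypothetical helper: checks a region name against the list in the text.
# Assumption: short region identifiers are the display names, lowercased,
# with whitespace removed.

AVATAR_REGION_NAMES = [
    "Southeast Asia",
    "North Europe",
    "West Europe",
    "Sweden Central",
    "South Central US",
    "East US 2",
    "West US 2",
]


def normalize_region(name):
    """Lowercase a region name and strip all whitespace."""
    return "".join(name.lower().split())


# Normalized identifiers, built once from the display names above.
AVATAR_REGIONS = {normalize_region(name) for name in AVATAR_REGION_NAMES}


def is_avatar_region(name):
    """Return True if the region is in the supported list from the text."""
    return normalize_region(name) in AVATAR_REGIONS
```

With this sketch, `is_avatar_region("West US 2")` and `is_avatar_region("westus2")` are both true, while `is_avatar_region("East US")` is false because only East US 2 is listed.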

We care about the people who use AI and the people who will be affected by it as much as we care about technology. For more information, see the Responsible AI transparency notes and disclosure for voice and avatar talent.