Text to speech avatar overview - Speech service - Azure AI services
Text to speech avatar converts text into a digital video of a photorealistic human (either a standard avatar or a custom text to speech avatar) speaking with a natural-sounding voice. The text to speech avatar video can be synthesized asynchronously or in real time. Developers can build applications integrated with text to speech avatar through an API, or use a content creation tool on Speech Studio to create video content without coding.
Advanced neural network models
Text to speech avatar uses advanced neural network models to produce lifelike, high-quality synthetic talking avatar videos for a range of applications while adhering to responsible AI practices.
Tip: To convert text to speech with a no-code approach, try the Text to speech avatar tool in Speech Studio.
Capabilities of text to speech avatar
Text to speech avatar capabilities include:
- Delivering lifelike and high-quality synthetic talking avatar videos
- Choosing from a range of standard voices for the avatar
- Language support aligned with text to speech
- Accessing standard text to speech avatars through the Speech Studio portal or via API
The voice in the synthetic video can be an Azure AI Speech standard voice or a custom voice from voice talent that you select.
Both batch synthesis and real-time synthesis produce video at a resolution of 1920 x 1080 and 25 frames per second (FPS). For batch synthesis, the codec can be h264, hevc, or av1 if the format is mp4, and vp9 or av1 if the format is webm; only vp9 can contain an alpha channel. For real-time synthesis, the codec is h264. Video bitrate can be configured in the request for both batch synthesis and real-time synthesis; the default value is 2000000. More detailed configurations can be found in the sample code.
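The format and codec rules above can be checked on the client before a request is sent. The following Python sketch is illustrative only: the function `validate_video_settings`, its parameter names, and the keys of the returned dictionary are hypothetical and are not the property names used by the Speech service API. Only the allowed combinations, the resolution, the frame rate, and the default bitrate come from this article.

```python
from typing import Optional

# Allowed codecs per container format for batch synthesis (from this article).
BATCH_CODECS = {
    "mp4": {"h264", "hevc", "av1"},
    "webm": {"vp9", "av1"},
}
# Real-time synthesis supports a single codec (from this article).
REALTIME_CODECS = {"h264"}
# Only vp9 can carry an alpha channel (from this article).
ALPHA_CODECS = {"vp9"}
# Default bitrate value stated in this article.
DEFAULT_BITRATE = 2_000_000


def validate_video_settings(
    mode: str,
    codec: str,
    video_format: Optional[str] = None,
    alpha: bool = False,
    bitrate: int = DEFAULT_BITRATE,
) -> dict:
    """Hypothetical helper: check a codec/format choice against the documented rules.

    Returns a plain dictionary describing the settings, or raises ValueError.
    The dictionary keys are illustrative, not the service's request schema.
    """
    if mode == "batch":
        allowed = BATCH_CODECS.get(video_format)
        if allowed is None:
            raise ValueError(f"unsupported batch format: {video_format!r}")
    elif mode == "realtime":
        # The article does not state a container format for real-time synthesis,
        # so video_format is not checked in this mode.
        allowed = REALTIME_CODECS
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    if codec not in allowed:
        raise ValueError(f"codec {codec!r} is not allowed here; choose from {sorted(allowed)}")
    if alpha and codec not in ALPHA_CODECS:
        raise ValueError("only vp9 can contain an alpha channel")
    if bitrate <= 0:
        raise ValueError("bitrate must be a positive integer")

    return {
        "mode": mode,
        "format": video_format,
        "codec": codec,
        "alpha": alpha,
        "bitrate": bitrate,
        "width": 1920,
        "height": 1080,
        "fps": 25,
    }
```

For example, `validate_video_settings("batch", "vp9", "webm", alpha=True)` passes, while `validate_video_settings("batch", "vp9", "mp4")` raises `ValueError` because vp9 is not listed for mp4. For the actual request properties, refer to the sample code.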
You can create custom text to speech avatars that are unique to your product or brand. All you need to get started is 10 minutes of video recordings. If you also fine-tune a professional voice for the actor, the avatar can be highly realistic.
Voice sync for avatar is trained alongside the custom avatar, using audio from the training video. The voice is exclusively associated with the custom avatar and can't be used independently.
Professional voice fine-tuning and custom text to speech avatar are separate features. You can use them independently or together. If you plan to also use professional voice fine-tuning with a text to speech avatar, you need to deploy or copy your fine-tuned professional voice model to one of the avatar supported regions.
For more information, see What is custom text to speech avatar.
Sample code for text to speech avatar is available on GitHub. These samples cover the most popular scenarios.
The text to speech avatar feature is only available in the following service regions: Southeast Asia, North Europe, West Europe, Sweden Central, South Central US, East US 2, and West US 2.
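Because the feature is limited to these regions, an application can check its configured Speech resource region before it attempts avatar synthesis. The Python sketch below is a minimal example under stated assumptions: the helper `is_avatar_region` is hypothetical, and the lowercase, space-free region identifiers (for example `westus2`) are assumed to match the display names listed above.

```python
# Regions listed in this article, written as lowercase identifiers without spaces.
# Assumption: these identifiers correspond to the display names above.
AVATAR_REGIONS = frozenset({
    "southeastasia",
    "northeurope",
    "westeurope",
    "swedencentral",
    "southcentralus",
    "eastus2",
    "westus2",
})


def is_avatar_region(region: str) -> bool:
    """Hypothetical helper: True if the region appears in the list above.

    Accepts either a display name ("West US 2") or an identifier ("westus2").
    """
    normalized = region.replace(" ", "").lower()
    return normalized in AVATAR_REGIONS
```

The same check applies when you pair a fine-tuned professional voice with an avatar, because the voice model must be deployed or copied to one of these regions.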
We care about the people who use AI and the people who will be affected by it as much as we care about technology. For more information, see the Responsible AI transparency notes and disclosure for voice and avatar talent.