OpenAI Whisper - Developer Support
Published on the Developer Support blog on July 9th, 2024.
Whisper is an advanced automatic speech recognition (ASR) system trained on 680,000 hours of supervised multilingual and multitask data collected from the web. This extensive and diverse dataset improves its robustness to varied accents, background noise, and technical jargon. Whisper not only transcribes speech in multiple languages but can also translate it into English. OpenAI has open-sourced the models and inference code to provide a robust foundation for building practical applications and advancing research in speech processing.
Whisper Model by OpenAI
The Whisper model, developed by OpenAI, converts speech to text and is ideal for transcribing audio files. Because a large share of its training data is English audio paired with English text, it excels at transcribing English speech, but it can also handle other languages and produce English text as output through translation.
Whisper models are also accessible through the Azure OpenAI Service, where they address a variety of scenarios. The model excels at transcribing and analyzing prerecorded audio and video files and is well suited to the quick processing of individual audio files. It can transcribe phone call recordings and provide analytics such as call summaries, sentiment, key topics, and custom insights. Similarly, it can transcribe meeting recordings and provide analytics such as meeting summaries, meeting chapters, and action-item extraction. The Whisper model also supports contact center voice agent services such as call routing and interactive voice response, and it is suitable for application-specific voice assistants in scenarios such as set-top boxes, mobile apps, in-car systems, and more. However, it does not support real-time transcription, pronunciation assessment, or translation of live audio; for translation, it is recommended for translating prerecorded audio from other languages into English.
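To make this concrete, here is a minimal sketch of transcribing and translating prerecorded audio with a Whisper deployment in Azure OpenAI, using the openai Python SDK (v1+). The endpoint, API key, API version, and deployment name are placeholders you would replace with your own values.

```python
# Minimal sketch: transcription and translation of prerecorded audio
# against an Azure OpenAI Whisper deployment. All credentials and the
# deployment name ("whisper") are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-API-KEY",                                   # placeholder
    api_version="2024-06-01",                                 # example version
)

# Transcribe a prerecorded audio file in its original language.
with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper",  # name of your Whisper deployment
        file=audio_file,
    )
print(transcript.text)

# Translate prerecorded non-English audio directly into English text.
with open("interview_spanish.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper",  # name of your Whisper deployment
        file=audio_file,
    )
print(translation.text)
```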
Developer Support with Azure AI Speech
Developers using Whisper in Azure AI Speech benefit from additional capabilities such as batch processing of large files up to 1 GB, speaker diarization, and the ability to fine-tune the Whisper model with audio plus human-labeled transcripts. Developers can access Whisper through Azure OpenAI Studio or, when using Azure AI Speech, through the batch transcription REST API. The Whisper REST API also supports translation from a growing list of languages into English. The Whisper model is a significant addition to Azure AI’s broad portfolio of capabilities, offering innovative ways to improve business productivity and user experience.
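For the Azure AI Speech path, the sketch below submits a batch transcription job with speaker diarization enabled via the batch transcription REST API. The region, key, and audio URL are placeholders, and the API version and Whisper model URI should be confirmed against the current Azure AI Speech documentation; available base models can be listed with GET /speechtotext/v3.2/models/base.

```python
# Minimal sketch: creating a batch transcription job in Azure AI Speech
# with speaker diarization enabled. Region, key, and audio URL are
# placeholders; to target a Whisper model, set "model" to the URI of a
# Whisper base model from the /models/base listing.
import requests

region = "eastus"  # placeholder region
endpoint = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"

payload = {
    "displayName": "whisper-batch-example",
    "locale": "en-US",
    "contentUrls": ["https://example.com/audio/meeting.wav"],  # placeholder URL
    "properties": {
        "diarizationEnabled": True,  # label speakers in the output
    },
    # "model": {"self": "<URI of a Whisper base model>"},  # optional
}

response = requests.post(
    endpoint,
    json=payload,
    headers={"Ocp-Apim-Subscription-Key": "YOUR-SPEECH-KEY"},  # placeholder key
)
response.raise_for_status()
# The created job's URL can be polled for status and result files.
print("Created transcription:", response.json()["self"])
```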
Best Practices for Using the Whisper API in Azure
The Whisper API offers a variety of parameters that can be used to tailor transcriptions. The prompt parameter lets you guide the transcription by providing context before decoding begins. For example, you can supply the correct spellings of product names, acronyms, or other jargon that appear in the audio, or nudge the model toward a particular style, such as including punctuation. Because the prompt conditions the model rather than acting as a strict instruction, treat it as a hint: it can help shape or filter the output, but results should be verified, especially when handling sensitive content. Used this way, the prompt parameter lets you customize the transcription output to better suit your specific needs.
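As an illustration, the following sketch passes a prompt alongside the audio file using the openai Python SDK against an Azure OpenAI Whisper deployment; the endpoint, key, deployment name, and file name are placeholders.

```python
# Minimal sketch: steering a Whisper transcription with the prompt
# parameter, e.g. toward correct spellings of domain terms.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-API-KEY",                                   # placeholder
    api_version="2024-06-01",
)

with open("support_call.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper",  # your Whisper deployment name
        file=audio_file,
        # The prompt conditions decoding on preferred vocabulary and style;
        # here it hints at product names and acronyms likely in the audio.
        prompt="The call discusses Azure OpenAI, Kubernetes, and the ARM64 SDK.",
        response_format="text",  # return the transcript as plain text
    )
print(transcript)
```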
Preprocessing
Preprocessing in the context of audio transcription involves preparing the audio data to improve the quality and accuracy of the transcription. It’s a crucial step that can significantly impact the results. Common preprocessing steps include trimming silence, normalizing volume, converting audio to a supported format, and splitting long recordings into chunks that fit within API file-size limits. For these tasks you can use PyDub, a simple and easy-to-use Python library for audio processing tasks such as slicing, concatenating, and exporting audio files, as shown below.
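For example, the following uses PyDub to downmix, resample, and split a long recording into chunks; the file names and chunk length are placeholders, and PyDub relies on ffmpeg being installed for non-WAV formats.

```python
# Minimal sketch: preprocessing with PyDub. Converts a recording to mono
# 16 kHz and splits it into 10-minute chunks so each piece stays under
# typical API file-size limits.
from pydub import AudioSegment

audio = AudioSegment.from_file("long_recording.mp3")

# Downmix to mono and resample; speech models do not need stereo or high rates.
audio = audio.set_channels(1).set_frame_rate(16000)

chunk_length_ms = 10 * 60 * 1000  # 10 minutes per chunk, in milliseconds
for i, start in enumerate(range(0, len(audio), chunk_length_ms)):
    chunk = audio[start:start + chunk_length_ms]
    chunk.export(f"chunk_{i:03d}.wav", format="wav")
```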
Post-Processing
In the context of audio transcription, the output of the initial transcription process can be further refined using language models such as GPT-3.5. This step is known as post-processing. The initial transcript, which may contain errors or inconsistencies, is passed to the language model, which, guided by its training and a system prompt, generates a corrected or refined version. This allows for the correction of errors, better use of context, and even rephrasing or summarization of the content, depending on the specific system prompt provided. It is an effective way to leverage the capabilities of language models to improve the quality and usefulness of audio transcriptions.
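As a sketch, the following passes a raw transcript to a chat model deployment (assumed here to be named gpt-35-turbo) with a system prompt that restricts it to light correction; the endpoint, key, deployment name, and sample transcript are placeholders.

```python
# Minimal sketch: post-processing a raw Whisper transcript with a chat
# model to fix punctuation, casing, and obvious misrecognitions.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-API-KEY",                                   # placeholder
    api_version="2024-06-01",
)

system_prompt = (
    "You are a transcript editor. Correct spelling, punctuation, and casing "
    "in the transcript below. Do not add, remove, or paraphrase content."
)

raw_transcript = "okay so lets talk about the q3 roadmap for the whisper api"  # example

response = client.chat.completions.create(
    model="gpt-35-turbo",  # your chat model deployment name
    temperature=0,          # deterministic edits, no creative rewriting
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": raw_transcript},
    ],
)
print(response.choices[0].message.content)
```

A low temperature and a narrowly scoped system prompt help keep the model from rewriting content rather than simply correcting it.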
Additional Information:
You can learn more about the Azure OpenAI Whisper models here.