Enhancing Meeting Summaries with Pyannote and Whisper

Published On Sun Jul 21 2024

Creating meeting summaries (without Microsoft Copilot) using open-source models

Microsoft Copilot offers a convenient way to summarize your Teams meetings. Integrated into the Microsoft 365 ecosystem, it summarizes meetings in real time, pulling out key points, action items, and decisions, and syncing effortlessly with your other Microsoft tools. But here’s the thing: Copilot is primarily designed for meetings on platforms like Teams. What if you have audio/video files from other sources, like panel discussions or casual brainstorming sessions? This article dives into the power of open-source tools, specifically Pyannote and Whisper, to achieve just that.

Pyannote is an open-source toolkit designed for speaker diarization, which is the process of segmenting an audio stream by speaker identity. This tool is crucial for meetings with multiple participants, as it allows for the identification and separation of different speakers within an audio recording. Pyannote uses advanced machine learning algorithms to achieve high accuracy in distinguishing between speakers.
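As a rough sketch of what this looks like in practice, the snippet below runs the pyannote.audio speaker-diarization pipeline and collects its output as (start, end, speaker) tuples. The `pyannote/speaker-diarization-3.1` model name follows the pyannote 3.x releases and the gated model requires a Hugging Face access token; the audio path, token, and the helper names `diarize` and `merge_turns` are my own illustrations, not part of the official API.

```python
def diarize(audio_path, hf_token):
    """Run pyannote speaker diarization; return (start, end, speaker) tuples."""
    from pyannote.audio import Pipeline  # lazy import: heavy dependency

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",  # gated model: needs an HF token
        use_auth_token=hf_token,
    )
    diarization = pipeline(audio_path)
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]


def merge_turns(turns, gap=0.5):
    """Merge consecutive turns by the same speaker separated by < gap seconds.

    Diarization often splits one utterance into several short turns; merging
    them keeps the later speaker-attributed transcript readable.
    """
    merged = []
    for start, end, speaker in turns:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] < gap:
            merged[-1] = (merged[-1][0], end, speaker)  # extend previous turn
        else:
            merged.append((start, end, speaker))
    return merged


if __name__ == "__main__":
    turns = diarize("meeting.wav", hf_token="YOUR_HF_TOKEN")  # placeholders
    for start, end, speaker in merge_turns(turns):
        print(f"{start:7.1f}s - {end:7.1f}s  {speaker}")
```

The `gap` threshold is a design choice: too small and the transcript stays fragmented, too large and you may glue distinct utterances together.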

pyannote audio: neural building blocks for speaker diarization

Advanced Speaker Diarization with Pyannote

If you want to review the technical report in detail, it is available here. Pyannote also has a commercial version available which, as per the website, is faster and more accurate. For our purposes, though, the open-source version is more than enough.

Developed by OpenAI, Whisper is a state-of-the-art model for speech recognition and transcription. Whisper excels at converting spoken language into text with high accuracy, even in noisy environments. The model employs a deep learning architecture, specifically a transformer-based neural network, to perform transcription tasks.
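A minimal sketch of running Whisper on a recording, assuming the open-source `openai-whisper` package (`load_model` and `transcribe` are its standard calls). The model size, audio path, and helper names are illustrative; larger models trade speed for accuracy.

```python
def transcribe(audio_path, model_size="base"):
    """Transcribe audio with Whisper; return (start, end, text) tuples."""
    import whisper  # lazy import: the openai-whisper package

    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in result["segments"]
    ]


def to_timestamped_lines(segments):
    """Format (start, end, text) segments as '[mm:ss] text' lines."""
    lines = []
    for start, _, text in segments:
        minutes, seconds = divmod(int(start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {text}")
    return lines


if __name__ == "__main__":
    segments = transcribe("meeting.wav")  # placeholder path
    print("\n".join(to_timestamped_lines(segments)))
```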

Transcription and speaker identification with OpenAI Whisper

Integrating Pyannote and Whisper for Meeting Summaries

We will now combine the diarization result from Pyannote with the transcription from Whisper to create a transcript with speaker attributions. This integration involves aligning the time-stamped speaker segments with the corresponding transcribed text, resulting in a comprehensive document that identifies who said what. This is crucial for accurate summarization. There is an excellent script present in this repo that can help with this integration.
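The alignment step itself can be sketched as a small pure function: assign each transcribed segment to the speaker whose diarization turn overlaps it the most. This is one simple alignment heuristic (the script linked above may differ in detail); the tuple shapes and function name are my own conventions.

```python
def assign_speakers(diarization_turns, whisper_segments):
    """Attribute each transcribed segment to a speaker by maximum time overlap.

    diarization_turns: list of (start, end, speaker) from diarization
    whisper_segments:  list of (start, end, text) from transcription
    Returns (start, end, speaker, text) tuples; segments with no overlapping
    turn are labeled 'UNKNOWN'.
    """
    labeled = []
    for w_start, w_end, text in whisper_segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for d_start, d_end, speaker in diarization_turns:
            # Length of the intersection of the two intervals (<= 0 if disjoint)
            overlap = min(w_end, d_end) - max(w_start, d_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((w_start, w_end, best_speaker, text))
    return labeled
```

Maximum overlap is robust to small timestamp disagreements between the two models, which rarely agree on exact segment boundaries.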

With a speaker-labeled transcript in hand, we can leverage the power of a large language model like Google Gemini 1.5 Flash or OpenAI GPT-4 to extract the key points and generate a concise summary of the entire meeting, capturing the essence of each speaker’s contribution.

Generating Meeting Summary

Feeding this transcript to Gemini or GPT produces the summary. I will be using the Google Gemini API here.
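A sketch of that last step using the `google-generativeai` Python SDK (`configure`, `GenerativeModel`, and `generate_content` are its standard calls). The prompt wording and the function names `build_prompt` and `summarize` are illustrative, not a fixed recipe.

```python
def build_prompt(labeled_transcript):
    """Build a summarization prompt from (start, end, speaker, text) tuples."""
    lines = [f"{speaker}: {text}" for _, _, speaker, text in labeled_transcript]
    return (
        "Summarize this meeting per speaker, listing key points, "
        "decisions, and action items.\n\n" + "\n".join(lines)
    )


def summarize(labeled_transcript, api_key):
    """Send the speaker-labeled transcript to Gemini 1.5 Flash."""
    import google.generativeai as genai  # lazy import: needs an API key

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")
    return model.generate_content(build_prompt(labeled_transcript)).text


if __name__ == "__main__":
    transcript = [(0.0, 4.0, "SPEAKER_00", "Welcome, everyone.")]
    print(summarize(transcript, api_key="YOUR_GEMINI_API_KEY"))  # placeholder
```

Asking explicitly for per-speaker key points steers the model toward the attribution structure we built in the previous step, rather than a single undifferentiated paragraph.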

And finally, we have our summary below!

SPEAKER_01: The discussion begins with the President, who is speaking about AIDS and the need for testing. He expresses hope that a cure will be found, even though it may not be immediately. The President then shares a story about rapid scientific progress in physics, providing an example of how unexpected breakthroughs can happen. He closes by expressing his hope and prayer that a cure for AIDS will be found.

SPEAKER_02: The Secretary of Health and Human Services then thanks the President for his commitment to fighting AIDS and welcomes the commission members. She highlights the administration’s dedication to research and progress in health and medicine, citing the significant amount of money spent by the National Institutes of Health (NIH) on research grants.