Ignite 2024: Azure AI Video Indexer Introduces Multi-Modal Video Summarization
We are thrilled to introduce Multi-Modal Video Summarization, an enhancement to our previously released textual video summarization for recorded video and audio files. This new feature allows customers to obtain concise textual summaries of their videos by identifying keyframes and processing them through a GenAI engine using Azure OpenAI or a Phi-3.5 model. By leveraging the keyframes as input, in addition to the audio and visual insights computed by Azure AI Video Indexer, prompts are generated to help the language model create a comprehensive video summary. This multi-modal approach produces a more accurate and contextually rich summary, suitable for more use cases and scenarios. The feature is available both in the cloud, powered by Azure OpenAI, and on the edge, as part of Azure AI Video Indexer enabled by Arc, utilizing the latest Phi-3.5 vision model, which can be configured to run on GPUs for improved performance.
Summary of a short video with no audio, generated by applying keyframe extraction as part of the textual summary using GPT-4V.
The Power of Keyframes
Video Indexer's keyframe extraction technology captures key moments in the video. These keyframes are combined with audio insights from the Video Indexer engine, such as transcripts and special sounds like alarms or applause, and with visual signals including Optical Character Recognition (OCR), object detection, labels, and more. Multi-Modal Video Summarization then passes these signals to language models such as Phi-3.5 or GPT-4V that accept a textual prompt as well as visual input. Providing the language model with a rich prompt built from the visual and audio insights, alongside the actual keyframes, ensures that the generated summaries are more accurate, contextually rich, and relevant to more use cases and industries.
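To illustrate the idea, here is a minimal, hypothetical sketch of how a multi-modal prompt could be assembled from Video Indexer insights and keyframe images before being sent to a vision-capable model. The insight fields, helper names, and chat-message format are illustrative assumptions, not the product's internal implementation.

```python
import base64
from pathlib import Path

def build_multimodal_prompt(insights: dict, keyframe_paths: list[str]) -> list[dict]:
    """Assemble a chat-style prompt from textual insights plus keyframe images.

    `insights` is assumed to hold Video Indexer outputs such as transcript lines,
    OCR text, labels, and detected audio events; the exact schema is illustrative.
    """
    text_parts = [
        "Summarize the video using the following signals.",
        "Transcript: " + " ".join(insights.get("transcript", [])),
        "OCR: " + ", ".join(insights.get("ocr", [])),
        "Labels: " + ", ".join(insights.get("labels", [])),
        "Audio events: " + ", ".join(insights.get("audio_events", [])),
    ]

    # One text part plus one image part per keyframe (base64-encoded),
    # following the common "image_url" content-part convention.
    content = [{"type": "text", "text": "\n".join(text_parts)}]
    for path in keyframe_paths:
        image_b64 = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
        })

    return [{"role": "user", "content": content}]
```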
Consider the scenario of summarizing long security camera footage with no audio. Relying solely on the audio track and precomputed visual insights might miss critical events captured in the footage. With the new multi-modal, keyframe-based summarization, the model can identify and highlight significant moments, such as individuals entering restricted areas or suspicious behavior occurring.
With these summaries, security analysts can quickly review hours of footage and identify critical events without watching the entire video, saving precious time and making security monitoring more effective.
GPUs at the Edge: Azure AI Video Indexer enabled by Arc Integrates with an SLM through Phi-3.5
Multi-Modal Textual Summarization at the edge has been upgraded to use the Phi-3.5-mini-instruct model. With its 128K-token context window and modest hardware requirements, the model now supports the image processing needed for the newly introduced keyframe processing. It can run on GPUs for better performance: on average, the runtime on an A100 is 14.5% of the video duration, and it can be even lower for some videos.
Creating an Azure AI Video Indexer Arc extension and configuring a GPU to run Textual Video Summarization.
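As a rough illustration of the runtime figure above, the sketch below estimates how long edge summarization might take on an A100, assuming the reported average ratio of 14.5% of video duration; actual runtimes vary per video.

```python
from datetime import timedelta

# Average reported ratio of summarization runtime to video duration on an A100.
A100_RUNTIME_RATIO = 0.145

def estimate_runtime(video_duration: timedelta, ratio: float = A100_RUNTIME_RATIO) -> timedelta:
    """Rough estimate of summarization runtime for one video; some videos finish faster."""
    return timedelta(seconds=video_duration.total_seconds() * ratio)

# Example: a one-hour recording and a 15-minute clip.
for duration in (timedelta(hours=1), timedelta(minutes=15)):
    print(f"{duration} of video -> ~{estimate_runtime(duration)} of processing")
```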
How do I make it available in my Azure AI Video Indexer account?
Use Textual Video Summarization in Your Public Cloud Environment:
If you already have an Azure AI Video Indexer account, follow these steps to use video summarization:
For detailed instructions on how to set up this integration, click here. Please note that this feature is not available in Video Indexer trial accounts or in legacy accounts that use Azure Media Services. Take this opportunity to also remove your dependency on Azure Media Services by following these instructions.
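For orientation, here is a minimal sketch of requesting a textual summary for an already-indexed video through the Video Indexer REST API. The route and query parameters shown (for example `Summaries/Textual` and `deploymentName`) are assumptions based on the feature described above; consult the linked instructions for the exact API reference and authentication flow.

```python
import requests

# All values below are placeholders; obtain a real access token via the
# Video Indexer / Azure authentication flow described in the documentation.
LOCATION = "<region>"          # e.g. "eastus"
ACCOUNT_ID = "<account-id>"
VIDEO_ID = "<video-id>"
ACCESS_TOKEN = "<access-token>"

# Assumed endpoint for creating a textual summary of an indexed video.
url = (
    f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}"
    f"/Videos/{VIDEO_ID}/Summaries/Textual"
)

response = requests.post(
    url,
    params={
        "accessToken": ACCESS_TOKEN,
        "deploymentName": "<azure-openai-deployment>",  # assumed parameter name
    },
)
response.raise_for_status()
print(response.json())  # summary job details / summary text, per the API reference
```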
Use Textual Video Summarization in Your Edge Environment, enabled by Arc:
If your edge appliances are integrated with the Azure Platform via Azure Arc, here’s how to activate the feature:
For detailed instructions on how to set up and use the feature, click here or watch the demo.
Our Video-to-Text API (aka Prompt Content API) now also supports Llama 2, Phi-3, Phi-3.5, GPT-4o, and GPT-4o mini
Our Video-to-Text API, also known as the Prompt Content API, now supports additional models: Llama 2, Phi-3, Phi-3.5, GPT-4o, and GPT-4o mini. This enhancement provides greater flexibility when converting video content to text, opening up more opportunities for Azure Video Indexer customers. Users can gather information from Azure Video Indexer in a prompt format that can be customized by selecting the model name and adjusting the prompt style. The "Summarized" style is ideal for tasks like video summaries, naming videos, and describing main events, while the "Full" style is better suited for Q&A, RAG, and search use cases. To learn more about this API, click here.
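As a rough sketch of how this might look in practice, the snippet below requests prompt content for an indexed video and then retrieves it, selecting a model and a prompt style. The route and parameter names (`PromptContent`, `modelName`, `promptStyle`) reflect the capability described above, but the exact values should be verified against the API reference before use.

```python
import requests

LOCATION = "<region>"
ACCOUNT_ID = "<account-id>"
VIDEO_ID = "<video-id>"
ACCESS_TOKEN = "<access-token>"  # placeholder; use the documented auth flow

base_url = (
    f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}"
    f"/Videos/{VIDEO_ID}/PromptContent"
)

# Ask Video Indexer to build prompt content with a chosen model and style.
# "Summarized" suits summaries and titles; "Full" suits Q&A, RAG, and search.
create = requests.post(
    base_url,
    params={
        "accessToken": ACCESS_TOKEN,
        "modelName": "Phi3_5",        # assumed value format; see the API reference
        "promptStyle": "Summarized",  # or "Full"
    },
)
create.raise_for_status()

# Retrieve the generated prompt content once it is ready.
result = requests.get(base_url, params={"accessToken": ACCESS_TOKEN})
result.raise_for_status()
print(result.json())
```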