Mastering Keyword Extraction in Word and PDF: An AI Approach

Published On Sun Jun 16 2024
Mastering Keyword Extraction in Word and PDF: An AI Approach

Re: Extract top N key words from a Word or PDF doc... - Power ...

I have a flow set up to trigger when a new file is created in a SharePoint Library, specifically Word or PDF documents (with PDFs being the majority). The goal is to extract the top 50 key words from the document using AI. The documents can have multiple pages, tables, images, and various formats. Instead of specific key words, I aim to let AI determine the top N key words based on relevance.

While I have explored extracting text using AI Builder, I encountered limitations such as a 5000-character restriction and the need to handle raw PDF/Word formats. Any advice, tips, or references on achieving this would be greatly appreciated. It's worth noting that I prefer not to rely on third-party connectors for this task, but premium Microsoft connectors are acceptable.

Possible Approach

One potential solution could involve creating a custom connector to Azure Open AI for streamlined processing. By passing the file through Azure Open AI, it may be possible to execute the necessary actions to extract the desired key words. You can learn more about this approach here.

A keyword extraction method from twitter messages represented as ...

When dealing with PDF files generated from Word documents and multilingual documents, especially those in languages like Chinese, Mongolian, and Zulu, traditional methods like XML/XPath may not be suitable. Custom connectors and AI solutions tailored to handle diverse languages could offer a more robust solution.

If you'd like to explore further, consider leveraging the capabilities of connectors like "Convert Word to PDF" for seamless integration. Additionally, ensuring compatibility with various languages and formats would be crucial in the extraction process.

Applied Sciences | Free Full-Text | Re-Thinking Data Strategy and ...

Feel free to share your thoughts and any insights you might have on refining this workflow for extracting key words effectively from Word or PDF documents.