Revolutionizing Literature Reviews with AI: A Deep Dive

Published On Sun Nov 10 2024

Leveraging Large Language Models for Comprehensive Literature Analysis

This research project addresses a significant limitation of existing large language model (LLM)-based artificial intelligence systems: their inability to generate literature reviews with accurate citations. Although LLMs can mimic complex patterns of human language and knowledge, they often struggle to reference specific sources and instead produce inaccurate, "hallucinated" citations. Our system tackles this issue by working from a 20-year, 5,844-document corpus published by the RAND Corporation.

Methodology

We have developed a multi-step process to overcome this limitation. The process extracts text from the corpus documents, segments the text into overlapping chunks, generates an LLM embedding for each chunk, and stores these embeddings in a vector database. At query time, the user's query is embedded in the same way, and the most semantically relevant corpus excerpts are retrieved by cosine similarity, along with their associated metadata. The retrieved text and metadata are then summarized and assembled into a literature review using the OpenAI API.
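The chunking and retrieval steps described above can be illustrated with a minimal sketch. It is not the project's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for real LLM embeddings, an in-memory list stands in for the vector database, and the chunk sizes, function names, and metadata fields (`title`, `year`) are assumptions chosen for illustration. The final OpenAI summarization call is omitted; `format_context` only shows how retrieved excerpts might be paired with their metadata before being placed in a prompt.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(text):
    """Hypothetical stand-in for an LLM embedding: a sparse word-count vector."""
    return Counter(word.lower().strip(".,;:!?") for word in text.split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(documents, chunk_size=50, overlap=10):
    """Embed every chunk and keep it alongside its source metadata.

    An in-memory list plays the role of the vector database here.
    """
    index = []
    for doc in documents:
        for chunk in chunk_text(doc["text"], chunk_size, overlap):
            index.append({
                "embedding": toy_embed(chunk),
                "text": chunk,
                "metadata": {"title": doc["title"], "year": doc["year"]},
            })
    return index


def retrieve(index, query, top_k=3):
    """Return the top_k (score, record) pairs ranked by cosine similarity."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, rec["embedding"]), rec) for rec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def format_context(hits):
    """Render retrieved excerpts with citation tags for a summarization prompt."""
    lines = []
    for _score, rec in hits:
        meta = rec["metadata"]
        lines.append(f"[{meta['title']}, {meta['year']}] {rec['text']}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder documents, not taken from the actual corpus.
    docs = [
        {"title": "Doc A", "year": 2010, "text": "military logistics supply chain planning"},
        {"title": "Doc B", "year": 2015, "text": "public health vaccination policy outcomes"},
    ]
    hits = retrieve(build_index(docs), "vaccination policy", top_k=1)
    print(format_context(hits))
```

Because each excerpt carries its metadata from indexing through retrieval, any citation in the generated review can be traced back to a document that exists in the corpus, which is the property the pipeline relies on to avoid hallucinated references.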

Results

Initial results suggest that this methodology is a practical approach to generating meaningful literature reviews with accurate, contextually relevant citations. The tool's overall accuracy still requires further evaluation, but it has shown potential as a resource for researchers starting a project and for program directors who need quick overviews of institutional research.

Significance

This research expands on the existing PaperQA framework by applying it to a substantial real-world corpus. The methodology's versatility suggests potential applications to other extensive document collections, including those that are not publicly accessible, and to fields such as the legal and regulatory sectors. This work thus offers a practical response to the challenges of literature review and citation generation within large corpora.