Evaluating Models Beyond Vanilla Needle-in-a-Haystack
With the recent release of our long context models, we’ve been taking a deeper look into model quality. Take a look at Gradient’s journey in evaluating long context models, using de facto standards like NIAH as well as more comprehensive evaluation methods like RULER.
Recently, long context models have been gaining popularity, due to continued advancements in deep learning architectures and enhanced hardware capabilities. While 16k context length was impressive just a year ago, today it is table stakes. The release of commercial models with long context (200k tokens, 1M tokens), as well as numerous research papers, highlights the evolution and critical role of long context models in AI applications.
Exploring Long Context Models
Working with partners like Crusoe, our team at Gradient was able to develop a range of long context models. Today we’re seeing a particular need for long context models in the text modality, across a variety of our enterprise use cases including:
- Generating code suggestions based on the context of an entire repository
- Automating analysis of large sets of poorly structured tabular data
- Generating legal analysis of a case using the historical precedent of previous court proceedings
For detail-critical tasks, where individual pieces of related information matter to the final output, typical RAG and summarization approaches often fall short, and long context models show strong promise.
The Evaluation Challenge
While there’s an undeniable amount of interest in the emergence of long context models, there’s currently no established method to evaluate these models. As of today, the de facto standard amongst the community has been the use of NIAH (Needle-in-a-Haystack) - a method which embeds specific targeted information (the “needle”) within a larger, more complex body of text (the “haystack”).
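To make the setup concrete, here is a minimal sketch of how a single needle-in-a-haystack test case can be constructed. The needle text, filler sentences, question wording, and scoring rule are illustrative choices of our own, not the exact templates used by any particular NIAH implementation.

```python
def build_niah_prompt(haystack_sentences, needle, depth_fraction):
    """Insert the needle at a relative depth within the haystack and
    return a prompt asking the model to retrieve it.

    haystack_sentences: list of distractor sentences (the "haystack")
    needle: the single sentence containing the target fact
    depth_fraction: 0.0 = start of the context, 1.0 = end of the context
    """
    insert_at = int(len(haystack_sentences) * depth_fraction)
    sentences = (
        haystack_sentences[:insert_at] + [needle] + haystack_sentences[insert_at:]
    )
    context = " ".join(sentences)
    question = "What is the magic number mentioned in the text above?"
    return f"{context}\n\n{question}"


# Illustrative usage: a filler haystack with one needle placed
# 30% of the way into the context.
filler = [f"Sentence {i} is ordinary filler text." for i in range(10_000)]
needle = "The magic number is 42."
prompt = build_niah_prompt(filler, needle, depth_fraction=0.3)

# Scoring is typically a simple check that the model's answer contains
# the needle's payload (here, "42"), repeated across context lengths and
# insertion depths to produce the familiar NIAH heat map.
```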
Despite its impact, the NIAH eval comes with its own set of nuances, which we recently documented in a blog post.
Introducing RULER Benchmark
Recognizing the limitations of NIAH, Gradient has been exploring more sophisticated evaluation methods such as RULER. RULER is a synthetic benchmark of 13 tasks that expands on the vanilla NIAH test, providing a more comprehensive evaluation of a model’s ability to perform under diverse types and quantities of data.
RULER Task Categories
The benchmark groups 13 tasks into 4 categories:
- Retrieval (NIAH): Single NIAH, Multi-key NIAH, Multi-value NIAH, Multi-query NIAH
- Multi-hop tracing: Variable tracking
- Aggregation: Common word extraction (CWE), Frequent word extraction (FWE)
- Question Answering (QA)
Taken together, the 13 tasks build on vanilla NIAH by further evaluating a model’s ability to disambiguate similar-sounding information, retrieve critical information, establish chains of reference, summarize long passages, and answer questions effectively.
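To give a feel for how these tasks differ from vanilla NIAH, the sketch below mimics a multi-key retrieval setup in the spirit of RULER: several key-value “needles” are embedded in the context, but the question targets only one of them, so the model must disambiguate between similar-looking facts rather than simply locate the one unusual sentence. The key/value format and wording are our own illustrative choices, not RULER’s exact templates.

```python
import random
import uuid


def build_multikey_prompt(num_filler_sentences, num_needles):
    """Build a multi-key retrieval prompt: several (key, value) needles are
    scattered through filler text, and the question targets exactly one key."""
    needles = {
        f"key-{uuid.uuid4().hex[:6]}": uuid.uuid4().hex[:8]
        for _ in range(num_needles)
    }
    filler = [
        f"This is filler sentence number {i}." for i in range(num_filler_sentences)
    ]
    needle_sentences = [
        f"The special magic value for {key} is {value}."
        for key, value in needles.items()
    ]

    # Scatter the needles at random positions within the filler text.
    sentences = filler[:]
    for sentence in needle_sentences:
        sentences.insert(random.randrange(len(sentences) + 1), sentence)

    target_key, target_value = random.choice(list(needles.items()))
    question = f"What is the special magic value for {target_key}?"
    prompt = " ".join(sentences) + "\n\n" + question
    return prompt, target_value


prompt, expected = build_multikey_prompt(num_filler_sentences=5_000, num_needles=4)
# A response is scored correct if it contains `expected`. With several
# near-identical needles present, the model has to match the exact key
# rather than retrieve the only unusual sentence in the context.
```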
RULER Results
As an initial result, we are sharing the scores for our Llama-3 8B 1M context length model. Our 8B model achieves a grand average score of 81.1 across sequence lengths, placing it 7th on RULER’s leaderboard and reflecting strong performance on retrieval and question answering tasks.
We are excited to continue our exploration of long context models and look forward to sharing more results in the future.