Key Takeaways:
Dense retrieval has become a prominent method for obtaining relevant context in open-domain NLP tasks. However, the granularity of the retrieval unit (document, passage, or sentence) at which the corpus is indexed is often overlooked. This research shows that using propositions as the retrieval unit significantly improves both retrieval and downstream task performance.
Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid in a concise, self-contained natural language format. Unlike typical passage- or sentence-based indexing, propositions are minimal semantic units that carry the context needed to stand alone.
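As a concrete illustration of this decontextualization, here is a hand-written example (the passage and its split are invented for this summary, not taken from FACTOIDWIKI). Note how the pronoun "It" is resolved so each proposition stands on its own:

```python
# Invented example: a passage decomposed into atomic, self-contained
# propositions. The pronoun "It" is replaced with the full entity name
# so each proposition is interpretable without surrounding context.
passage = (
    "The Leaning Tower of Pisa began to tilt during construction. "
    "It was stabilized by engineering work completed in 2001."
)

propositions = [
    "The Leaning Tower of Pisa began to tilt during construction.",
    "The Leaning Tower of Pisa was stabilized by engineering work "
    "completed in 2001.",
]
```

Each proposition expresses exactly one factoid, which is what makes it a useful minimal indexing unit.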
The researchers introduce FACTOIDWIKI, an English Wikipedia dump processed such that each document is segmented into propositions. A text generation model called the Propositionizer is fine-tuned to decompose passages into their constituent propositions.
The efficacy of proposition-based indexing is validated on five open-domain QA datasets. Retrieval and downstream QA performance is compared when Wikipedia is indexed at the passage, sentence and proposition levels, using six different dense retriever models.
Despite none of the retriever models being trained on proposition-level data, proposition-based retrieval consistently outperforms sentence- and passage-based methods.
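A minimal sketch of why granularity matters at retrieval time (a toy word-overlap score stands in for dense-embedding similarity, and the corpus and query are invented for illustration):

```python
# Toy sketch: the same corpus indexed at passage vs. proposition level.
# Word overlap with the query stands in for dense-embedding similarity;
# all text below is invented for illustration.

def tokens(text):
    return {w.strip(".,?!").lower() for w in text.split()}

def score(query, unit):
    q = tokens(query)
    return len(q & tokens(unit)) / len(q)

passages = [
    "Marie Curie won the Nobel Prize in Physics in 1903. "
    "She later won the Nobel Prize in Chemistry in 1911.",
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
]

# Proposition-level index: each atomic fact is its own retrieval unit.
propositions = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Marie Curie won the Nobel Prize in Chemistry in 1911.",
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
]

query = "When did Marie Curie win the Nobel Prize in Chemistry?"

best_passage = max(passages, key=lambda p: score(query, p))
best_prop = max(propositions, key=lambda p: score(query, p))
# The top proposition contains only the answer-bearing fact, while the top
# passage mixes it with unrelated content about the 1903 Physics prize.
```

The point of the sketch is that the proposition index returns a unit containing little beyond the answer, whereas the best passage dilutes the answer with adjacent, irrelevant facts.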
Analysis shows that propositions enable a higher density of question-relevant information in the retrieved results. The correct answer appears much earlier in the retrieved propositions compared to sentences or passages.
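One simple way to quantify this density effect is to record the earliest rank at which the gold answer string appears in the retrieved list. A toy version of that measurement (the ranked lists are invented for illustration):

```python
# Toy answer-density measurement: earliest 1-based rank at which the gold
# answer string appears in a ranked list of retrieved units.
# The ranked lists below are invented for illustration.

def earliest_answer_rank(ranked_units, answer):
    for rank, unit in enumerate(ranked_units, start=1):
        if answer.lower() in unit.lower():
            return rank
    return None  # answer never retrieved

ranked_propositions = [
    "Amsterdam is the capital of the Netherlands.",   # answer at rank 1
    "The Netherlands borders Germany and Belgium.",
]
ranked_passages = [
    "The Netherlands is a country in Western Europe. It borders Germany.",
    "Its capital is Amsterdam, while The Hague hosts the government.",
]

answer = "Amsterdam"
prop_rank = earliest_answer_rank(ranked_propositions, answer)      # 1
passage_rank = earliest_answer_rank(ranked_passages, answer)       # 2
```

A lower earliest rank for propositions, averaged over many questions, is the kind of evidence behind the density claim above.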
This research demonstrates that indexing retrieval corpora like Wikipedia at the proposition level can be a simple yet highly effective strategy for improving the performance of dense retrieval systems. The key implications are:
This study introduces propositions as a novel retrieval unit for dense retrieval systems. Comprehensive experiments show that segmenting and indexing textual corpora like Wikipedia at the proposition level yields significant gains in both retrieval accuracy and downstream open-domain QA performance.
Notably, proposition-based retrieval enhances the cross-task generalization abilities of dense retrieval models. It also enables feeding more condensed, question-relevant information to reader models in retrieve-then-read QA pipelines.
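Because reader models accept a fixed input budget, smaller and denser units let more distinct question-relevant facts fit in the context. A sketch of greedy packing under a word budget (the budget-as-word-count simplification and all text are invented for illustration):

```python
# Greedy packing of retrieved units into a fixed reader budget, here
# approximated as a word count. Denser units mean more distinct facts
# survive truncation. All text is invented for illustration.

def pack(units, budget_words):
    packed, used = [], 0
    for u in units:
        n = len(u.split())
        if used + n > budget_words:
            break  # budget exhausted; remaining units are dropped
        packed.append(u)
        used += n
    return packed

retrieved_props = [
    "The Nile flows into the Mediterranean Sea.",
    "The Nile is about 6,650 km long.",
    "The Amazon carries more water than any other river.",
]

# With a 20-word budget, the first two propositions fit; the third is cut.
context = pack(retrieved_props, budget_words=20)
```

In a real retrieve-then-read pipeline the budget would be measured in reader tokens rather than words, but the trade-off is the same: at a fixed budget, proposition-level retrieval feeds the reader more answer-bearing content.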
The findings suggest that retrieval granularity is an important design choice for dense retrieval systems. Propositions offer a promising new way to represent information that balances compactness and richness of context.