Key Takeaways:
Dense retrieval has become a prominent method for obtaining relevant context in open-domain NLP tasks. However, the granularity of the retrieval unit (document, passage, or sentence) at which the corpus is indexed is often overlooked. This research shows that using propositions as the retrieval unit significantly improves both retrieval and downstream task performance.
Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid in a concise, self-contained natural language format. Unlike typical passage- or sentence-based indexing, propositions are minimal semantic units that carry the context needed to stand alone.
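As a concrete illustration of this decontextualization, here is a hand-written example (the passage and its split are invented for this summary, not taken from FACTOIDWIKI). Note how the pronoun "It" is resolved so each proposition stands on its own:

```python
# Invented example: a passage decomposed into atomic, self-contained
# propositions. The pronoun "It" is replaced with the full entity name
# so each proposition is interpretable without surrounding context.
passage = (
    "The Leaning Tower of Pisa began to tilt during construction. "
    "It was stabilized by engineering work completed in 2001."
)

propositions = [
    "The Leaning Tower of Pisa began to tilt during construction.",
    "The Leaning Tower of Pisa was stabilized by engineering work "
    "completed in 2001.",
]
```

Each proposition expresses exactly one factoid, which is what makes it a useful minimal indexing unit.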
The researchers introduce FACTOIDWIKI, an English Wikipedia dump processed such that each document is segmented into propositions. A text generation model called the Propositionizer is fine-tuned to decompose passages into their constituent propositions.
The efficacy of proposition-based indexing is validated on five open-domain QA datasets. Retrieval and downstream QA performance is compared when Wikipedia is indexed at the passage, sentence and proposition levels, using six different dense retriever models.
Despite none of the retriever models being trained on proposition-level data, proposition-based retrieval consistently outperforms sentence- and passage-based methods.
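A minimal sketch of why granularity matters at retrieval time (a toy word-overlap score stands in for dense-embedding similarity, and the corpus and query are invented for illustration):

```python
# Toy sketch: the same corpus indexed at passage vs. proposition level.
# Word overlap with the query stands in for dense-embedding similarity;
# all text below is invented for illustration.

def tokens(text):
    return {w.strip(".,?!").lower() for w in text.split()}

def score(query, unit):
    q = tokens(query)
    return len(q & tokens(unit)) / len(q)

passages = [
    "Marie Curie won the Nobel Prize in Physics in 1903. "
    "She later won the Nobel Prize in Chemistry in 1911.",
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
]

# Proposition-level index: each atomic fact is its own retrieval unit.
propositions = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Marie Curie won the Nobel Prize in Chemistry in 1911.",
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
]

query = "When did Marie Curie win the Nobel Prize in Chemistry?"

best_passage = max(passages, key=lambda p: score(query, p))
best_prop = max(propositions, key=lambda p: score(query, p))
# The top proposition contains only the answer-bearing fact, while the top
# passage mixes it with unrelated content about the 1903 Physics prize.
```

The point of the sketch is that the proposition index returns a unit containing little beyond the answer, whereas the best passage dilutes the answer with adjacent, irrelevant facts.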
Analysis shows that propositions enable a higher density of question-relevant information in the retrieved results. The correct answer appears much earlier in the retrieved propositions compared to sentences or passages.
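One simple way to quantify this density effect is to record the earliest rank at which the gold answer string appears in the retrieved list. A toy version of that measurement (the ranked lists are invented for illustration):

```python
# Toy answer-density measurement: earliest 1-based rank at which the gold
# answer string appears in a ranked list of retrieved units.
# The ranked lists below are invented for illustration.

def earliest_answer_rank(ranked_units, answer):
    for rank, unit in enumerate(ranked_units, start=1):
        if answer.lower() in unit.lower():
            return rank
    return None  # answer never retrieved

ranked_propositions = [
    "Amsterdam is the capital of the Netherlands.",   # answer at rank 1
    "The Netherlands borders Germany and Belgium.",
]
ranked_passages = [
    "The Netherlands is a country in Western Europe. It borders Germany.",
    "Its capital is Amsterdam, while The Hague hosts the government.",
]

answer = "Amsterdam"
prop_rank = earliest_answer_rank(ranked_propositions, answer)      # 1
passage_rank = earliest_answer_rank(ranked_passages, answer)       # 2
```

A lower earliest rank for propositions, averaged over many questions, is the kind of evidence behind the density claim above.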
This research demonstrates that indexing retrieval corpora like Wikipedia at the proposition level can be a simple yet highly effective strategy for improving the performance of dense retrieval systems. The key implications are:
This study introduces propositions as a novel retrieval unit for dense retrieval systems. Comprehensive experiments show that segmenting and indexing textual corpora like Wikipedia at the proposition level yields significant gains in both retrieval accuracy and downstream open-domain QA performance.
Notably, proposition-based retrieval enhances the cross-task generalization abilities of dense retrieval models. It also enables feeding more condensed, question-relevant information to reader models in retrieve-then-read QA pipelines.
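Because reader models accept a fixed input budget, smaller and denser units let more distinct question-relevant facts fit in the context. A sketch of greedy packing under a word budget (the budget-as-word-count simplification and all text are invented for illustration):

```python
# Greedy packing of retrieved units into a fixed reader budget, here
# approximated as a word count. Denser units mean more distinct facts
# survive truncation. All text is invented for illustration.

def pack(units, budget_words):
    packed, used = [], 0
    for u in units:
        n = len(u.split())
        if used + n > budget_words:
            break  # budget exhausted; remaining units are dropped
        packed.append(u)
        used += n
    return packed

retrieved_props = [
    "The Nile flows into the Mediterranean Sea.",
    "The Nile is about 6,650 km long.",
    "The Amazon carries more water than any other river.",
]

# With a 20-word budget, the first two propositions fit; the third is cut.
context = pack(retrieved_props, budget_words=20)
```

In a real retrieve-then-read pipeline the budget would be measured in reader tokens rather than words, but the trade-off is the same: at a fixed budget, proposition-level retrieval feeds the reader more answer-bearing content.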
The findings suggest that retrieval granularity is an important design choice for dense retrieval systems. Propositions offer a promising new way to represent information that balances compactness and richness of context.