RAG: RAG, or Retrieval Augmented Generation, combines information retrieval and text generation so that users can search and synthesize over a specific dataset that is likely outside the training data of any LLM. That is, it's a structured way to find information relevant to your query/question (Retrieval), pass it as part of the prompt/context to an LLM (Augmentation), and have the model generate an answer a user can understand (Generation).
Long Context Length: Context length is the total number of tokens (prompt + generated answer) that an LLM can handle in a single call. It differs between models, and even between checkpoints of the same model. Eg: GPT-4 has shipped with 8k, 32k, and now 128k context windows. The sum of your input and output tokens has to fit within this limit.
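A quick way to check how much of a context window a document will use is to count its tokens. A minimal sketch using the tiktoken library (the file name is a placeholder):

```python
import tiktoken

# Use the tokenizer that matches the model you plan to call
enc = tiktoken.encoding_for_model("gpt-4")

text = open("report.txt").read()  # placeholder document

num_tokens = len(enc.encode(text))
# This count, plus the generated answer, must fit within the context window
print(f"{num_tokens} tokens")
```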
Since the launch of GPT-4 last year, there has been one axis along which all models keep getting better: context length. Intuitively, the longer the context, the more tokens you have available to give the model all the relevant information, and the better and more relevant its answers can be. That is an obvious win for a regular user looking for answers grounded in whatever text/image they have.
RAG, or Retrieval Augmented Generation, has been incredibly popular over the last year, since the introduction of ChatGPT. At an abstract level, RAG solved an information retrieval problem: it lets users ask questions of a custom dataset that may not be part of the given model's training data. RAG semantically searches for related passages in the text and dynamically appends them to the prompt to give the model context. It works when you are looking for specific information within the text. A lot of so-called AI wrappers, or apps built on top of AI models, use some version of RAG to get an LLM to generate better answers. The specific use case is when your dataset is > 100k tokens while the context window is small - a typical context for GPT-3.5 was 4k tokens - so it's better to use a RAG system and supply the LLM with only the specific information it needs to answer the question. (For context, one page of a PDF is about 750 tokens, and a token is roughly three-quarters of an English word.)
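To make the retrieve-augment-generate loop concrete, here is a minimal sketch assuming the OpenAI Python client for both embeddings and chat; the fixed-size chunking, the in-memory "index", and the file/model names are illustrative choices, not how any particular RAG app does it:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # One embedding vector per input string
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Index: split the source text into chunks and embed them (toy in-memory index)
document = open("report.txt").read()  # placeholder source file
chunks = [document[i:i + 2000] for i in range(0, len(document), 2000)]
chunk_vectors = embed(chunks)

# 2. Retrieve: embed the question and pick the most similar chunks
question = "What were the key findings?"
q_vec = embed([question])[0]
top_ids = np.argsort(chunk_vectors @ q_vec)[-3:]  # embeddings are ~unit length
top_chunks = [chunks[i] for i in top_ids]

# 3. Augment + generate: append the retrieved chunks to the prompt
prompt = (
    "Answer using only this context:\n"
    + "\n---\n".join(top_chunks)
    + f"\n\nQuestion: {question}"
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)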
With the introduction of large context windows in every provider's latest model, it seems to me that RAG is less useful. When your data is 100k tokens (roughly 130 pages of a PDF, by the estimate above) and the context length allows it, it is simply easier to upload the whole file and ask your questions directly, without going through RAG.
Today, Anthropic allows for 200k tokens, Gemini 1.5 has demonstrated up to 10M tokens, and GPT-4 allows for 128k tokens. Others offer similar ranges. Mistral Large allows for 32k tokens (Mistral's smaller 7B model uses a sliding window attention mechanism, attending to only about 4k tokens at any given step).
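The long-context alternative, by contrast, is just to put the whole file into the prompt. A minimal sketch using the Anthropic Python client (file name, question, and model choice are placeholder assumptions); the only constraint is that the document plus the answer must fit within the context window:

```python
import anthropic

client = anthropic.Anthropic()

document = open("report.txt").read()  # whole file, no chunking or retrieval

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"<document>\n{document}\n</document>\n\nWhat were the key findings?",
    }],
)
print(response.content[0].text)
```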
The main idea of in-context learning is learning by analogy, which allows the model to understand and apply patterns from a few examples or even just one (one-shot or few-shot learning, as it is known). In this method, a task explanation or a set of examples is written in everyday language and passed as part of the prompt, enabling the model to follow the pattern. Unlike conventional machine learning methods such as linear regression, which need labeled data and a separate learning phase, in-context learning works with models that have already been trained and doesn't need any changes to the model's parameters.
For example, if you wanted a model to understand how to greet people in different languages, you could give it a prompt like this:
"When we say 'Hello' in English, we say 'Bonjour' in French. What do we say in Spanish?"
The model would then respond with "Hola", even though it hasn't been specifically trained on translating greetings from English to Spanish. It's using the pattern provided in the prompt to figure out the answer. This is in-context learning.
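In code, in-context learning is nothing more than writing that pattern into the prompt; no weights change anywhere. A minimal sketch with the OpenAI client (the examples and model name are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# The "training data" is just a pattern laid out in the prompt itself
prompt = (
    "English: Hello -> French: Bonjour\n"
    "English: Hello -> German: Hallo\n"
    "English: Hello -> Spanish:"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: "Hola"
```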
The Gemini 1.5 paper shows in-context learning producing surprising new capabilities the model was never tuned for or intended to have. For example:
Gemini 1.5 Pro learnt to translate a new language (Kalamang, spoken by fewer than 200 people) from a single grammar book placed in its context, even though the language was not in the training data at all.
I am sure you have already seen Gemini's video Q&A demo.
Intuitively, the most obvious benefit of a longer context is that you can pass entire PDFs for a model to consume and have it synthesize an answer on that basis, as the examples above show. But in the context (no pun intended) of weighing the pros and cons against a large context length, RAG may still win out in certain cases, which I'll get to below.
Looking at this strictly as a developer, I tend to sort any chatter about a new development into three buckets - hope, cope, or mope. You see versions of each in this case as well.
Over the last year, a lot of developer energy and investor capital has gone into RAG-based apps. Popular open source projects like LlamaIndex and Haystack have focused extensively on RAG, coming up with new and improved engineering strategies to raise the efficacy of a fundamentally probabilistic and non-deterministic system.
Hence, any suggestion that RAG's effectiveness may be upended draws a reflexive, not especially reasoned reaction - from developers and experts alike. There is hope that long context will not affect their apps, and some hope that their unique retrieval techniques and prompting might still beat a scenario where you pass the entire file with a simple prompt.
Can a long context replace RAG?
Because a longer context means more input tokens and more compute per query, it will be costlier than a typical RAG app. In cases where precision is not as important (eg: search), RAG can do the job at a lower cost and with a smaller context window. Where precision matters, long context is the only thing that will work.
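For a rough sense of the cost gap, here is a back-of-the-envelope sketch; the per-token price and token counts are placeholder assumptions, not any provider's actual rates:

```python
# Hypothetical input price, in dollars per 1,000 prompt tokens
PRICE_PER_1K_TOKENS = 0.01

def prompt_cost(prompt_tokens: int) -> float:
    return prompt_tokens / 1000 * PRICE_PER_1K_TOKENS

# Long context: send the whole ~100k-token document with every question
long_context_cost = prompt_cost(100_000)

# RAG: send only a handful of retrieved chunks (say ~4k tokens) per question
rag_cost = prompt_cost(4_000)

print(f"long context: ${long_context_cost:.2f}/query, RAG: ${rag_cost:.2f}/query")
# With these assumptions, the long-context query costs ~25x more per question
```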
Latency varies from model to model, but if we are comparing the same model (eg: GPT-4 32k), a RAG app is slower by design due to the additional step of retrieving data from a vector database. Between variants of the same model with different context lengths (eg: GPT-4 8k vs GPT-4 32k), latency also depends on compute availability, and the 32k variant will generally be slightly slower to answer than the 8k one under the same load.
Long context will always be more effective than retrieval-based apps when it comes to reasoning. RAG, by design, retrieves only a subset of the information into context, which may or may not contain everything the LLM needs to answer the question correctly.
Questions about the quality or more abstract aspects of a text are, by design, outside the capabilities of retrieval-based apps; long context handles them well. Eg: "By reading this book, tell me what things are out of place based on the setting" is easily answerable by Gemini but not by any RAG-based app.
Hypothetically, you may have a file/document with > 10M tokens (that is, a PDF with more than 12,000 pages). In that case RAG would be more effective than long context, yes. I am not going to pretend it's a very realistic or common case though. At that level of data, RAG too loses its effectiveness a bit, and it's time to think about more involved strategies like training a model.
If total tokens across all your datasets exceed 10M, you should consider training an embedding model by yourself.
It's not a tech issue so much as a human one: we expect systems to intuitively understand what we mean, yet RAG tends to be precise only when you enter a longer, more specific query, which humans are averse to writing.
I don't think cost is a compelling enough reason for companies to settle for mediocre performance in certain cases. Remember, given the number of RAG-based apps available, it's the customers who have the leverage. They will likely go with long context directly, as it also gives them the kind of control you don't get with third-party apps.
Is it possible that RAG and long context can work in tandem to help startups land customers? With a longer context, instead of passing small chunks, it's likely more useful to retrieve and pass entire documents. That would be costly, but not break-the-bank costly. This is especially useful when the data is scattered across various apps, docs, and reports - for example, in enterprise search. It's not either/or; the combination can result in a better experience for the end users.
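A minimal sketch of that hybrid approach: retrieval only decides which whole documents are relevant, and the long context then holds them in full. It assumes the OpenAI client again; the file names, scoring, and truncation for embedding are all illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Hybrid: retrieve at the document level, then put whole documents into a
# long-context prompt instead of small chunks.
paths = ["q3_report.txt", "roadmap.txt", "tickets.txt"]  # placeholder sources
docs = {p: open(p).read() for p in paths}
question = "Why did churn increase last quarter?"

# 1. Retrieval step: rank whole documents by similarity to the question.
#    Only the first few thousand characters are embedded here, since
#    embedding models have their own input limits.
doc_vecs = embed([docs[p][:8000] for p in paths])
q_vec = embed([question])[0]
scores = doc_vecs @ q_vec
ranked = [p for _, p in sorted(zip(scores, paths), reverse=True)]

# 2. Long-context step: pass the top documents in full, not as chunks.
context = "\n\n".join(f"# {p}\n{docs[p]}" for p in ranked[:2])
response = client.chat.completions.create(
    model="gpt-4-turbo",  # a long-context model
    messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
)
print(response.choices[0].message.content)
```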
A good analogy here is memory and storage in computers (the good old RAM/virtual memory/disk trade-off). It fits quite well: you fit as much as RAM (the context window) can handle, and the rest goes to disk (the retrieval store).
The danger for RAG apps is not that newer apps implementing long context will do better. It is that engineering and AI departments will be more inclined to build in-house rather than buy apps. With long context, the complexity of building a good retrieval system is gone too; all you need is to integrate the foundational model's API.
This was always the risk for AI apps to begin with: when the foundational model provider is competing with you for the same customer, they will always have an edge. On the positive side, this will push developers to dig deeper and come up with experiences that are fundamentally different. The application layer will see a lot of change in the near term, and long context might just kickstart that, as devs will have to innovate to stay in the game.