RAG: RAG, or Retrieval Augmented Generation, combines information retrieval and text generation so that users can search and synthesize over a specific dataset that is likely outside the training data of any LLM. That is, it's a structured way to find information relevant to your query/question (Retrieval), pass it as part of the prompt/context to an LLM (Augmentation), and have the model generate an answer a user can understand (Generation).
Long Context Length: Context length is the total number of tokens (prompt + generated answer) that an LLM can handle in a single call. It differs between models, and even between checkpoints of the same model. Eg: GPT-4 has shipped with 8k, 32k, and now 128k context windows. The sum of your input and output tokens has to fit within this limit.
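A quick way to check how much of a context window a document will use is to count its tokens. A minimal sketch using the tiktoken library (the file name is a placeholder):

```python
import tiktoken

# Use the tokenizer that matches the model you plan to call
enc = tiktoken.encoding_for_model("gpt-4")

text = open("report.txt").read()  # placeholder document

num_tokens = len(enc.encode(text))
# This count, plus the generated answer, must fit within the context window
print(f"{num_tokens} tokens")
```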
Since the launch of GPT-4 last year, there has been one axis along which all models keep getting better: context length. Intuitively, the longer the context, the more tokens you have available to give the model all the relevant information, and the better and more relevant its answers can be. That is an obvious win for a regular user looking for answers grounded in whatever text/image they have.
RAG, or Retrieval Augmented Generation, has been incredibly popular over the last year, since the introduction of ChatGPT. At an abstract level, RAG solved an information retrieval problem: it lets users ask questions of a custom dataset that may not be part of the given model's training data. RAG semantically searches for related passages in the text and dynamically appends them to the prompt to give the model context. It works when you are looking for specific information within the text. A lot of so-called AI wrappers, or apps built on top of AI models, use some version of RAG to get an LLM to generate better answers. The specific use case is when your dataset is > 100k tokens while the context window is small - a typical context for GPT-3.5 was 4k tokens - so it's better to use a RAG system and supply the LLM with only the specific information it needs to answer the question. (For context, one page of a PDF is about 750 tokens, and a token is roughly three-quarters of an English word.)
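To make the retrieve-augment-generate loop concrete, here is a minimal sketch assuming the OpenAI Python client for both embeddings and chat; the fixed-size chunking, the in-memory "index", and the file/model names are illustrative choices, not how any particular RAG app does it:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # One embedding vector per input string
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Index: split the source text into chunks and embed them (toy in-memory index)
document = open("report.txt").read()  # placeholder source file
chunks = [document[i:i + 2000] for i in range(0, len(document), 2000)]
chunk_vectors = embed(chunks)

# 2. Retrieve: embed the question and pick the most similar chunks
question = "What were the key findings?"
q_vec = embed([question])[0]
top_ids = np.argsort(chunk_vectors @ q_vec)[-3:]  # embeddings are ~unit length
top_chunks = [chunks[i] for i in top_ids]

# 3. Augment + generate: append the retrieved chunks to the prompt
prompt = (
    "Answer using only this context:\n"
    + "\n---\n".join(top_chunks)
    + f"\n\nQuestion: {question}"
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)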
With the introduction of large context windows in every provider's latest model, it seems to me that RAG is less useful. When your data is 100k tokens (roughly 130 pages of a PDF, by the estimate above) and the context length allows it, it is simply easier to upload the whole file and ask your questions directly, without going through RAG.
Today, Anthropic allows for 200k tokens, Gemini 1.5 has demonstrated up to 10M tokens, and GPT-4 allows for 128k tokens. Others offer similar ranges. Mistral Large allows for 32k tokens (Mistral's smaller 7B model uses a sliding window attention mechanism, attending to only about 4k tokens at any given step).
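The long-context alternative, by contrast, is just to put the whole file into the prompt. A minimal sketch using the Anthropic Python client (file name, question, and model choice are placeholder assumptions); the only constraint is that the document plus the answer must fit within the context window:

```python
import anthropic

client = anthropic.Anthropic()

document = open("report.txt").read()  # whole file, no chunking or retrieval

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"<document>\n{document}\n</document>\n\nWhat were the key findings?",
    }],
)
print(response.content[0].text)
```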
The main idea of in-context learning is learning by analogy, which allows the model to understand and apply patterns from a few examples or even just one (one-shot or few-shot learning, as it is known). In this method, a task explanation or a set of examples is written in everyday language and passed as part of the prompt, enabling the model to follow the pattern. Unlike conventional machine learning methods such as linear regression, which need labeled data and a separate learning phase, in-context learning works with models that have already been trained and doesn't need any changes to the model's parameters.
For example, if you wanted a model to understand how to greet people in different languages, you could give it a prompt like this:
"When we say 'Hello' in English, we say 'Bonjour' in French. What do we say in Spanish?"
The model would then respond with "Hola", even though it hasn't been specifically trained on translating greetings from English to Spanish. It's using the pattern provided in the prompt to figure out the answer. This is in-context learning.
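In code, in-context learning is nothing more than writing that pattern into the prompt; no weights change anywhere. A minimal sketch with the OpenAI client (the examples and model name are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# The "training data" is just a pattern laid out in the prompt itself
prompt = (
    "English: Hello -> French: Bonjour\n"
    "English: Hello -> German: Hallo\n"
    "English: Hello -> Spanish:"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: "Hola"
```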
The Gemini 1.5 paper shows in-context learning producing surprising new capabilities the model was never tuned for or intended to have. For example:
Gemini 1.5 Pro learnt to translate a new language (Kalamang, spoken by fewer than 200 people) from a single grammar book placed in its context, even though the language was not in the training data at all.
I am sure you have already seen Gemini's video Q&A demo.
Intuitively, the most obvious benefit of a longer context is that you can pass entire PDFs for a model to consume and have it synthesize an answer on that basis, as the examples above show. But in the context (no pun intended) of weighing the pros and cons against a large context length, RAG may still win out in certain cases, which I'll get to below.
Looking at this strictly as a developer, I tend to sort any chatter about a new development into three buckets - hope, cope, or mope. You see versions of each in this case as well.
Over the last year, a lot of developer energy and investor capital has gone into RAG-based apps. Popular open source projects like LlamaIndex and Haystack have focused extensively on RAG, coming up with new and improved engineering strategies to raise the efficacy of a fundamentally probabilistic and non-deterministic system.
Hence, any suggestion that RAG's effectiveness may be upended draws a reflexive, not especially reasoned reaction - from developers and experts alike. There is hope that long context will not affect their apps, and some hope that their unique retrieval techniques and prompting might still beat a scenario where you pass the entire file with a simple prompt.
Can a long context replace RAG?
Because a longer context means more input tokens and more compute per query, it will be costlier than a typical RAG app. In cases where precision is not as important (eg: search), RAG can do the job at a lower cost and with a smaller context window. Where precision matters, long context is the only thing that will work.
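For a rough sense of the cost gap, here is a back-of-the-envelope sketch; the per-token price and token counts are placeholder assumptions, not any provider's actual rates:

```python
# Hypothetical input price, in dollars per 1,000 prompt tokens
PRICE_PER_1K_TOKENS = 0.01

def prompt_cost(prompt_tokens: int) -> float:
    return prompt_tokens / 1000 * PRICE_PER_1K_TOKENS

# Long context: send the whole ~100k-token document with every question
long_context_cost = prompt_cost(100_000)

# RAG: send only a handful of retrieved chunks (say ~4k tokens) per question
rag_cost = prompt_cost(4_000)

print(f"long context: ${long_context_cost:.2f}/query, RAG: ${rag_cost:.2f}/query")
# With these assumptions, the long-context query costs ~25x more per question
```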
Latency varies from model to model, but if we are comparing the same model (eg: GPT-4 32k), a RAG app is slower by design due to the additional step of retrieving data from a vector database. Between variants of the same model with different context lengths (eg: GPT-4 8k vs GPT-4 32k), latency also depends on compute availability, and the 32k variant will generally be slightly slower to answer than the 8k one under the same load.
Long context will always be more effective than retrieval-based apps when it comes to reasoning. RAG, by design, retrieves only a subset of the information into context, which may or may not contain everything the LLM needs to answer the question correctly.
Questions about the quality or more abstract aspects of a text are, by design, outside the capabilities of retrieval-based apps; long context handles them well. Eg: "By reading this book, tell me what things are out of place based on the setting" is easily answerable by Gemini but not by any RAG-based app.
Hypothetically, you may have a file/document with > 10M tokens (that is, a PDF with more than 12,000 pages). In that case RAG would be more effective than long context, yes. I am not going to pretend it's a very realistic or common case though. At that level of data, RAG too loses its effectiveness a bit, and it's time to think about more involved strategies like training a model.
If total tokens across all your datasets exceed 10M, you should consider training an embedding model by yourself.
It's not a tech issue so much as a human one: we expect systems to intuitively understand what we mean, yet RAG tends to be precise only when you enter a longer, more specific query, which humans are averse to writing.
I don't think cost is a compelling enough reason for companies to settle for mediocre performance in certain cases. Remember, given the number of RAG-based apps available, it's the customers who have the leverage. They will likely go with long context directly, as it also gives them the kind of control you don't get with third-party apps.
Is it possible that RAG and long context can work in tandem to help startups land customers? With a longer context, instead of passing small chunks, it's likely more useful to retrieve and pass entire documents. That would be costly, but not break-the-bank costly. This is especially useful when the data is scattered across various apps, docs, and reports - for example, in enterprise search. It's not either/or; the combination can result in a better experience for the end users.
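A minimal sketch of that hybrid approach: retrieval only decides which whole documents are relevant, and the long context then holds them in full. It assumes the OpenAI client again; the file names, scoring, and truncation for embedding are all illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Hybrid: retrieve at the document level, then put whole documents into a
# long-context prompt instead of small chunks.
paths = ["q3_report.txt", "roadmap.txt", "tickets.txt"]  # placeholder sources
docs = {p: open(p).read() for p in paths}
question = "Why did churn increase last quarter?"

# 1. Retrieval step: rank whole documents by similarity to the question.
#    Only the first few thousand characters are embedded here, since
#    embedding models have their own input limits.
doc_vecs = embed([docs[p][:8000] for p in paths])
q_vec = embed([question])[0]
scores = doc_vecs @ q_vec
ranked = [p for _, p in sorted(zip(scores, paths), reverse=True)]

# 2. Long-context step: pass the top documents in full, not as chunks.
context = "\n\n".join(f"# {p}\n{docs[p]}" for p in ranked[:2])
response = client.chat.completions.create(
    model="gpt-4-turbo",  # a long-context model
    messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
)
print(response.choices[0].message.content)
```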
A good analogy here is memory and storage in computers (the good old RAM/virtual memory/disk trade-off). It fits quite well: you fit as much as RAM (the context window) can handle, and the rest goes to disk (the retrieval store).
The danger for RAG apps is not that newer apps implementing long context will do better. It is that engineering and AI departments will be more inclined to build in-house rather than buy apps. With long context, the complexity of building a good retrieval system is gone too; all you need is to integrate the foundational model's API.
This was always the risk for AI apps to begin with: when the foundational model provider is competing with you for the same customer, they will always have an edge. On the positive side, this will push developers to dig deeper and come up with experiences that are fundamentally different. The application layer will see a lot of change in the near term, and long context might just kickstart that, as devs will have to innovate to stay in the game.