Imagine trying to summarize a long book or article. As humans, we can easily refer back to earlier parts of the text to understand the context and key points. However, this has been a major challenge for LLMs, which typically have a limited "context window" and can only process a certain amount of text at a time. This paper tackles this problem head-on by introducing Infini-attention, a new way for LLMs to handle very long sequences of text effectively.
LLMs like GPT-4 have been impressive in their ability to generate human-quality text and perform various language-based tasks. However, they have a limitation: they can only process a certain amount of text at once due to the nature of their attention mechanism.
The attention mechanism in LLMs works by comparing each word in the input text to every other word, so its cost grows quadratically as the text gets longer. This limits the "context window" of the model, meaning it can only "remember" and use information from a recent portion of the text.
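To make that cost concrete, here is a minimal NumPy sketch of standard scaled dot-product attention for a single head; the names and shapes are illustrative, not taken from the paper. The (n × n) score matrix is what makes compute and memory grow quadratically with the sequence length n.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    """Q, K, V: (n, d) arrays for a single attention head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (n, n): every token is compared to every other token
    return softmax(scores) @ V      # (n, d)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
print(standard_attention(Q, K, V).shape)  # (8, 4)
```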
Infini-attention addresses this limitation by introducing a "compressive memory" alongside the traditional attention mechanism. This memory stores information from earlier parts of the text in a compressed format, allowing the model to access and utilize it when processing later parts.
Here's how it works:
Local Attention Within Each Segment: The input is processed segment by segment, and within each segment the model uses the standard attention mechanism to focus on the relevant parts of the local context.
Reusing Attention States as Memory: Instead of creating a separate memory system, Infini-attention cleverly reuses the "query", "key", and "value" states that are already calculated as part of the standard attention mechanism. These states represent different aspects of the input text and their relationships. The memory is essentially an "associative matrix" where each "key" is linked with its corresponding "value". Think of it like a dictionary where you look up a word (key) to find its meaning (value).
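As a rough illustration of the dictionary analogy (a toy sketch, not the paper's implementation), an associative matrix can store key-value bindings as a sum of outer products and recover a value by multiplying a query against the matrix:

```python
import numpy as np

d = 4
M = np.zeros((d, d))                  # the associative matrix "memory"

def write(M, key, value):
    return M + np.outer(key, value)   # bind this key to this value

def read(M, query):
    return query @ M                  # keys similar to the query return their values

rng = np.random.default_rng(1)
k, v = rng.normal(size=d), rng.normal(size=d)
M = write(M, k, v)
print(np.allclose(read(M, k), (k @ k) * v))  # True: querying with k recovers a scaled copy of v
```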
Updating the Memory: When a new segment of text is processed, the key-value pairs from that segment are used to update the associative matrix. This is done incrementally, meaning the memory is constantly evolving and adapting to new information.
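Here is a hedged sketch of that incremental, segment-by-segment update, loosely following the paper's recurrence of adding σ(K)ᵀV to the memory along with a running key normalizer (the ELU + 1 non-linearity follows the paper; the shapes and names are illustrative):

```python
import numpy as np

def sigma(x):
    """Element-wise ELU(x) + 1, which keeps key/query features positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def update_memory(M, z, K, V):
    """K, V: (segment_len, d) key/value states of the current segment."""
    M = M + sigma(K).T @ V          # accumulate key-value associations
    z = z + sigma(K).sum(axis=0)    # running normalizer over the keys seen so far
    return M, z

d = 4
M, z = np.zeros((d, d)), np.zeros(d)
rng = np.random.default_rng(2)
for _ in range(3):                  # the memory evolves as new segments stream in
    K, V = rng.normal(size=(6, d)), rng.normal(size=(6, d))
    M, z = update_memory(M, z, K, V)
```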
Querying the Memory: When processing a new segment, the model uses the "query" state, which represents the current context, to retrieve relevant information from the memory.
Linear Attention Mechanism: This retrieval process is based on a "linear attention" mechanism. It calculates the similarity between the query and each key stored in the memory and uses these similarities to retrieve the most relevant values, much like the dot-product (or cosine) similarity used to compare vector embeddings.
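An illustrative retrieval step under the same assumptions as the update sketch above (σ is the ELU + 1 feature map, M and z are the accumulated memory and normalizer); this is what "linear attention" amounts to here, a normalized dot-product lookup rather than a full softmax over every past token:

```python
import numpy as np

def sigma(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # same ELU(x) + 1 feature map as before

def retrieve(M, z, Q, eps=1e-6):
    """Q: (segment_len, d) query states; M: (d, d) memory; z: (d,) normalizer."""
    sQ = sigma(Q)
    return (sQ @ M) / (sQ @ z + eps)[:, None]    # (segment_len, d) values read from memory
```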
Combining Local and Global Context: The model now has two sources of information: the local context from the current segment (obtained through the standard attention mechanism) and the global context from the entire text history (retrieved from the compressive memory). A "gating" mechanism combines these two sources, learning to balance the importance of local and global context depending on the specific task and input text.
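A sketch of the gating step (in the paper the gate is a learned scalar; here beta is just a placeholder parameter):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combine(A_mem, A_local, beta):
    """Blend memory-retrieved context with local attention via a learned sigmoid gate."""
    g = sigmoid(beta)                     # beta is learned during training
    return g * A_mem + (1.0 - g) * A_local
```

When the gate saturates near zero the layer behaves like ordinary local attention; near one it relies mostly on the long-term memory.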
The combined context, incorporating both local and global information, forms a rich representation of the text that the model can use for downstream tasks like summarization or question answering.
By efficiently storing and retrieving information, Infini-attention allows a model to have a much broader understanding of the text than traditional attention mechanisms, leading to improved performance on tasks involving long sequences.
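Putting the pieces together, here is a compact, self-contained sketch of one plausible per-segment Infini-attention step for a single head: local softmax attention over the current segment, a linear-attention read from the compressive memory, a gated blend of the two, and a memory update. It is a toy illustration under the assumptions above, not the authors' implementation (no trainable projections, batching, or multi-head handling).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigma(x):
    return np.where(x > 0, x + 1.0, np.exp(x))       # ELU(x) + 1

def infini_attention_step(M, z, Q, K, V, beta, eps=1e-6):
    """One segment for a single head. Q, K, V: (seg_len, d); M: (d, d); z: (d,)."""
    d = Q.shape[-1]
    # 1) Local context: standard attention within the segment.
    A_local = softmax(Q @ K.T / np.sqrt(d)) @ V
    # 2) Global context: linear-attention read from the compressive memory.
    sQ = sigma(Q)
    A_mem = (sQ @ M) / (sQ @ z + eps)[:, None]
    # 3) Gated combination of local and global context.
    g = 1.0 / (1.0 + np.exp(-beta))
    A = g * A_mem + (1.0 - g) * A_local
    # 4) Update the memory with the current segment's key-value pairs.
    M = M + sigma(K).T @ V
    z = z + sigma(K).sum(axis=0)
    return A, M, z

d, seg_len = 4, 6
M, z, beta = np.zeros((d, d)), np.zeros(d), 0.0
rng = np.random.default_rng(3)
for _ in range(4):                                    # stream segments through the layer
    Q, K, V = (rng.normal(size=(seg_len, d)) for _ in range(3))
    A, M, z = infini_attention_step(M, z, Q, K, V, beta)
print(A.shape)  # (6, 4)
```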
The researchers tested Infini-attention on various tasks involving long sequences of text and compared it to existing methods:
Infini-attention outperformed other long-context models, like Transformer-XL and Memorizing Transformers, on long-context language modeling (predicting the next token in a sequence), while using significantly less memory.
A 1B-parameter LLM with Infini-attention was able to accurately retrieve a "passkey" hidden within a text up to one million tokens long, even though it was only fine-tuned on examples about 5,000 tokens long. This demonstrates the model's ability to generalize to much longer contexts than it was trained on (a sketch of how such a passkey prompt can be constructed follows these results).
An 8B LLM with Infini-attention achieved state-of-the-art results on a book summarization task, demonstrating its ability to understand and summarize long and complex texts.
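For readers curious what the passkey test looks like in practice, here is a hypothetical sketch of how such a prompt can be constructed: a random passkey is buried inside a long stretch of repetitive filler text, and the model is asked to repeat it. The filler sentences and wording here are placeholders, not the paper's exact prompt.

```python
import random

def build_passkey_prompt(n_filler_blocks=50_000, seed=0):
    """Hide a random 5-digit passkey somewhere inside a long block of filler text."""
    rng = random.Random(seed)
    passkey = str(rng.randint(10_000, 99_999))
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    chunks = [filler] * n_filler_blocks
    chunks.insert(rng.randrange(len(chunks)), f"The pass key is {passkey}. Remember it. ")
    prompt = "".join(chunks) + "What is the pass key? The pass key is"
    return prompt, passkey

prompt, passkey = build_passkey_prompt()
print(len(prompt.split()), passkey)   # roughly 600k words of filler plus the hidden key
```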
The ability to process much longer sequences of text opens the door to exciting applications, from summarizing entire books to answering questions about long documents.
Infini-attention presents a significant step forward in enabling LLMs to handle long and complex texts. It allows models to maintain a global understanding of the context while efficiently using computational resources. This has the potential to unlock a wide range of applications and further enhance the capabilities of LLMs in various domains.