June 1, 2024
6 mins

Contextual Position Encoding: Learning to Count What’s Important

CoPE helps LLMs get better at counting tasks by making positional encoding context-dependent, in contrast to traditional token-based approaches
Paper Link

Key takeaways

  • This paper proposes a novel position encoding method called Contextual Position Encoding (CoPE) that moves away from the traditional token-based position paradigm.
  • CoPE measures position in a context-dependent way, allowing more freedom when addressing by position.
  • CoPE shows gains on several tasks, including the Flip-Flop, Selective Copy, Counting, Language Modeling, and Code Modeling tasks.
  • CoPE outperforms existing methods, especially in out-of-domain generalization scenarios.
  • CoPE has the potential to improve other domains, such as video and speech, where token position is less appropriate.

Introduction

Have you ever wondered how large language models (LLMs) understand the order of words in a sentence? This is a crucial part of language comprehension, and the mechanism that allows LLMs to do this is called positional encoding.

To understand the need for positional encoding, let's consider a simple example: "The cat sat on the mat". If you give the LLM just the unordered bag of words "the, cat, sat, on, the, mat", it has no way of knowing which word comes first, second, and so on. This is where positional encoding comes in.

Positional encoding methods add information about the position of each word in the sequence, helping LLMs decode the meaning. There are different methods to achieve this, such as absolute positional encoding, where each word is assigned a unique vector representing its position, and relative positional encoding, where the position is measured relative to the current word. However, both approaches are based on counting tokens and don’t consider context.

The authors of this paper argue that existing positional encoding methods are not enough for more complex tasks, such as understanding the structure of a sentence or paragraph. They propose a new method, Contextual Position Encoding (CoPE), that is context-dependent. This means that the position of a word is not only determined by its numerical position in the sequence but also by its relationship to other words and the overall structure of the text.

Background

Before we dive into CoPE, let's understand the basics of positional encoding.

Absolute Positional Encoding

Imagine a sequence of words, and each word has a unique vector associated with it. This vector represents its position in the sequence. So, the first word has a vector representing position 1, the second word has a vector representing position 2, and so on. This is called absolute positional encoding.
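As a rough sketch (PyTorch, with made-up sizes rather than anything from the paper), absolute positional encoding can be as simple as a second embedding table indexed by position:

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
vocab_size, max_len, d_model = 1000, 512, 64
tok_emb = nn.Embedding(vocab_size, d_model)   # one vector per token id
pos_emb = nn.Embedding(max_len, d_model)      # one vector per absolute position

tokens = torch.randint(0, vocab_size, (1, 10))           # a batch with 10 tokens
positions = torch.arange(tokens.size(1)).unsqueeze(0)    # [[0, 1, ..., 9]]
x = tok_emb(tokens) + pos_emb(positions)                 # position-aware input to the model
```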

Relative Positional Encoding

In contrast, relative positional encoding considers the relative positions of words in a sequence. Instead of assigning a fixed vector to each position, it calculates the distance between the current word and the word it is attending to. This is done by assigning vectors to different relative positions like "previous word", "two words back", and so on.
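A comparable sketch for relative positional encoding (again with assumed shapes): rather than tagging each token with its index, the attention score between tokens i and j gets a learned bias that depends only on the distance i - j:

```python
import torch
import torch.nn as nn

seq_len, max_dist = 10, 512
rel_bias = nn.Embedding(max_dist, 1)     # one learned scalar per relative distance

i = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
j = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
dist = (i - j).clamp(min=0)              # causal model: only distances 0, 1, 2, ... matter
bias = rel_bias(dist).squeeze(-1)        # (seq_len, seq_len) bias added to the q·k logits
```

Either way, position is still measured in tokens, which is exactly what CoPE changes.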

Motivation

The authors point out some limitations of existing positional encoding methods.

Standard positional encoding fails on simple tasks

In a simple example, they explain how a model using traditional positional encoding struggles to find the last occurrence of a specific word in a sequence. This is because the model can only address by token position, and the number of tokens between the current token and that last occurrence varies from sequence to sequence, so no fixed positional offset reliably points at it.

LLMs fail on simple counting problems

The paper also highlights the surprising fact that even powerful LLMs like GPT-4 and Llama-2 70B Chat struggle with simple counting tasks. Counting requires attending to specific words or sentences within a longer text, and token-based positions are a poor guide because each word or sentence can span a different number of tokens.

Contextual Position Encoding (CoPE)

CoPE offers a novel approach to address the limitations of existing methods by considering context.

How CoPE works

Gate Values

For each word in the sequence, CoPE calculates a gate value for every previous word. The gate value is determined by comparing the current word's query vector with the key vector of each previous word. A gate value of 1 indicates that the previous word is important and should be considered in the position measurement. A value of 0 indicates that the previous word is not relevant.
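In code, the gate computation looks roughly like the following (a minimal PyTorch sketch with toy shapes, not the authors' implementation):

```python
import torch

seq_len, d = 8, 16
q = torch.randn(seq_len, d)   # query vectors, one per token
k = torch.randn(seq_len, d)   # key vectors, one per token

# Soft gate g_ij = sigmoid(q_i · k_j): near 1 means "count token j", near 0 means "skip it".
gates = torch.sigmoid(q @ k.T)
gates = gates * torch.tril(torch.ones(seq_len, seq_len))   # token i only gates tokens j <= i
```

In practice the gates are soft values between 0 and 1 rather than hard 0/1 decisions, which keeps the whole operation differentiable.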

Position Calculation

CoPE uses a cumulative sum of the gate values to calculate the position of each word relative to the current word. This means that the position is not simply a number but a count of relevant words.
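Concretely, the position of token j relative to token i is p_ij = g_ij + g_i,j+1 + ... + g_ii, which can be computed for all pairs with a reversed cumulative sum (a self-contained sketch with a stand-in gate matrix):

```python
import torch

# A stand-in for the gate matrix from the previous sketch: (seq_len, seq_len), zero above the diagonal.
seq_len = 6
gates = torch.sigmoid(torch.randn(seq_len, seq_len)) * torch.tril(torch.ones(seq_len, seq_len))

# p_ij = sum of gates from j up to i; flipping, cumsumming, and flipping back does this per row.
positions = gates.flip(-1).cumsum(dim=-1).flip(-1)   # possibly fractional "counts of relevant tokens"
```

If every gate were 1, this would reduce to ordinary relative token positions; if only sentence-final tokens had gate 1, it would count sentences instead.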

Position Embeddings

Since the positions are not necessarily integers, CoPE uses interpolation between integer positions to generate position embeddings. These embeddings are then added to the key vectors, allowing the query vectors to use them in the attention operation.
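Putting the three steps together, here is a minimal end-to-end sketch (assumed shapes and a made-up pmax of 32; adding q_i · e[p_ij] to the logit is equivalent to adding e[p_ij] to the key before the dot product):

```python
import torch

seq_len, d, pmax = 8, 16, 32
q, k = torch.randn(seq_len, d), torch.randn(seq_len, d)
causal = torch.tril(torch.ones(seq_len, seq_len))

gates = torch.sigmoid(q @ k.T) * causal                    # step 1: soft "does j count?" gates
positions = gates.flip(-1).cumsum(dim=-1).flip(-1)         # step 2: contextual positions p_ij
positions = positions.clamp(max=pmax)                      # cap the largest position

pos_table = torch.randn(pmax + 1, d)                       # learned embeddings e[0..pmax]
lo, hi = positions.floor().long(), positions.ceil().long()
frac = (positions - positions.floor()).unsqueeze(-1)
e_p = (1 - frac) * pos_table[lo] + frac * pos_table[hi]    # step 3: interpolate between integers

logits = q @ k.T + torch.einsum('id,ijd->ij', q, e_p)      # content score + contextual-position score
attn = torch.softmax(logits.masked_fill(causal == 0, float('-inf')), dim=-1)
```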

CoPE Explained

Let's break down the process with an example:

Imagine a sequence with the words "Alice was tired. She tried reading. A rabbit came."

CoPE will first calculate gate values for each earlier word from the viewpoint of the current word (here, the last word, "came"). For instance, the gate value for the word "tired" might be close to 1, indicating that it is relevant, while the gate value for "rabbit" might be close to 0, indicating irrelevance.

Then, CoPE will use a cumulative sum of these gate values to calculate the positions. For example, the position of "tired" with respect to the last word "came" might be calculated as 2. This means that "tired" is two relevant words back from "came".

Next, CoPE will generate a position embedding for this position. Because the gates are soft, the computed position is generally a fractional value close to 2 rather than exactly 2, so CoPE interpolates between the embeddings of the neighboring integer positions (here, 2 and 3), weighted by how close the value is to each.

Finally, this position embedding is added to the key vector of the word "tired", allowing the model to attend to "tired" based on its contextual position, not just its numerical position in the sequence.
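To make the walkthrough concrete, here is the same example with hand-picked binary gates (real gates are learned, soft, and computed per query-key pair; these numbers are purely illustrative):

```python
# Viewpoint of the last word "came": only "tired" and "came" are marked relevant here.
words = ["Alice", "was", "tired", ".", "She", "tried", "reading", ".", "A", "rabbit", "came"]
gates = [0,       0,     1,       0,   0,     0,       0,         0,   0,   0,        1]

count = 0
contextual_pos = [0] * len(words)
for j in range(len(words) - 1, -1, -1):   # walk backwards from "came", accumulating gates
    count += gates[j]
    contextual_pos[j] = count

print(list(zip(words, contextual_pos)))
# "came" sits at contextual position 1 and "tired" at position 2: two relevant words back,
# even though the two words are eight token positions apart.
```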

Multi-head attention

CoPE can be used in multi-head attention, where each head can have its own independent set of gates, allowing the model to attend to different aspects of the context simultaneously. For example, one head can count words, while another head can count sentences.
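A sketch of what per-head gating looks like, with assumed shapes: queries and keys carry a head dimension, and each head gets its own gate and position matrices:

```python
import torch

n_heads, seq_len, d_head = 4, 8, 16
q = torch.randn(n_heads, seq_len, d_head)
k = torch.randn(n_heads, seq_len, d_head)
causal = torch.tril(torch.ones(seq_len, seq_len))

gates = torch.sigmoid(torch.einsum('hid,hjd->hij', q, k)) * causal   # (n_heads, seq, seq)
positions = gates.flip(-1).cumsum(dim=-1).flip(-1)                   # independent counts per head
```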

Limited positions

CoPE has a parameter called pmax, which limits the maximum position value that the model can attend to. This helps to improve computational efficiency and can be especially useful when dealing with long sequences.
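In the sketches above this is just a clamp; the practical effect is that the position-embedding table has a fixed size of pmax + 1 vectors no matter how long the sequence is (64 below is an arbitrary example value):

```python
positions = positions.clamp(max=64)   # no token is counted as more than pmax "relevant steps" away
```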

Computation

While CoPE introduces some additional computation, it can be optimized by reusing the key-query multiplication that is already computed in the attention mechanism. The additional computation is limited to the gate calculations and position embedding interpolations.
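A sketch of that reuse, with the same toy shapes as before: the q·k product is computed once and feeds both the attention logits and the gates, so the overhead is essentially the sigmoid, the cumulative sum, and the interpolation:

```python
import torch

seq_len, d = 8, 16
q, k = torch.randn(seq_len, d), torch.randn(seq_len, d)
causal = torch.tril(torch.ones(seq_len, seq_len))

qk = q @ k.T                          # one matrix multiplication...
logits = qk                           # ...reused for the standard attention scores
gates = torch.sigmoid(qk) * causal    # ...and reused again for the CoPE gates
```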

Computing gates

The authors explore different ways of calculating gate values, including using separate keys or value vectors. They find that using separate keys, called "sep-keys", helps to disentangle position from attention and achieves better performance.
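Roughly, the "sep-keys" variant gives the gates their own key projection, so the vectors that decide what to count can differ from the vectors that decide what to attend to (a hedged sketch; the projection names and shapes are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

seq_len, d = 8, 16
key_proj = nn.Linear(d, d)        # keys used by the attention scores
gate_key_proj = nn.Linear(d, d)   # separate keys used only by the gates

x = torch.randn(seq_len, d)       # token representations
q = torch.randn(seq_len, d)
k, k_gate = key_proj(x), gate_key_proj(x)

logits = q @ k.T                        # attention uses the normal keys
gates = torch.sigmoid(q @ k_gate.T)     # gating gets its own, disentangled keys
```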

Experiments

The paper evaluates CoPE on several tasks to demonstrate its effectiveness.

Flip-Flop Task

This task requires the model to remember the last occurrence of a specific word (w) and recall its associated value in a sequence that can have "ignore" instructions (i). CoPE outperforms existing methods in this task, especially in out-of-distribution scenarios where the distance between the current word and the last "w" is increased.

Selective Copy Task

This task requires the model to selectively copy tokens from a sequence, skipping designated "blank" tokens. CoPE achieves a perfect score on this task, showing its ability to attend to specific words and exclude unwanted elements.

Counting Task

The task involves counting specific operations in a sequence. CoPE demonstrates significant advantages over traditional positional encoding methods, especially when multiple variables are involved and when the context length changes.

Language Modelling

The paper uses the Wikitext-103 dataset to evaluate CoPE on a language modelling task. CoPE outperforms traditional methods, demonstrating its ability to improve model performance in a real-world scenario.

Code Modelling

CoPE is also tested on a code modelling task, showing that it improves perplexity scores compared to traditional methods. This suggests that CoPE can be beneficial in domains with more structured data.

Results

The experimental results demonstrate the superiority of CoPE. It consistently outperforms existing methods, especially in out-of-domain generalization scenarios where the model needs to adapt to longer or different context lengths.

Business implications

CoPE has the potential to augment various applications of LLMs.

Improved Information Extraction

By accurately attending to context and structure, CoPE can enhance information extraction from complex text documents, making it easier for LLMs to identify key facts and entities. This could lead to more accurate and efficient document analysis systems.

Enhanced Text Summarization

CoPE can help LLMs better understand the nuances of language, leading to more accurate and coherent text summarization. This could improve the effectiveness of content analysis and information retrieval systems.

More Robust Code Analysis

CoPE can enhance the ability of LLMs to analyze code, understand its structure and relationships, and identify potential errors or security vulnerabilities. This could have significant implications for software development, testing, and security.

Next-generation LLMs

CoPE could contribute to the development of next-generation LLMs that are more robust and capable of understanding complex language structures, making them more adaptable to different domains and applications.

Conclusion

This paper introduces a novel approach to positional encoding called Contextual Position Encoding (CoPE), which addresses the limitations of existing methods by considering context. CoPE demonstrates significant advantages in various tasks, including the Flip-Flop, Selective Copy, Counting, Language Modeling, and Code Modeling tasks. Its ability to generalize well to out-of-domain scenarios makes it particularly promising for real-world applications.

Why is CoPE more effective than other approaches?

CoPE is more effective than traditional positional encoding methods for several reasons.

  • Contextual awareness: Unlike traditional methods that rely solely on token positions, CoPE considers the context and relationships between words. This makes it more adaptable to complex language structures and allows it to attend to specific elements within a sequence.
  • Flexibility: CoPE uses a soft gating function to determine which tokens are relevant for position measurement, providing flexibility in identifying different levels of abstraction (words, sentences, paragraphs, etc.).
  • Efficient computation: CoPE can be optimized to minimize computational overhead, making it suitable for large-scale language models.

The success of CoPE highlights the importance of incorporating context into positional encoding for LLMs to effectively understand and process language. As LLMs continue to evolve, CoPE has the potential to play a key role in advancing their capabilities and expanding their applications.


