May 29, 2024
5 mins

LoRA Learns Less and Forgets Less

Analyzing LoRA to understand whether it can add new knowledge to an LLM.

Key Takeaways

  • LoRA, a popular PEFT method, underperforms full finetuning for large language models, particularly in demanding domains like code and math.
  • LoRA exhibits superior source-domain knowledge retention ("forgetting less") but struggles to match the accuracy of full finetuning in the target domain.
  • The study found that full finetuning leads to high-rank weight changes, challenging the assumption that low-rank adjustments are sufficient for effective adaptation.
  • The paper offers practical guidance on optimizing LoRA, emphasizing learning rate sensitivity and the impact of target module selection.

Introduction

Fine-tuning large language models (LLMs) on specific tasks like writing code or solving math problems often requires immense computational resources. Parameter-efficient fine-tuning (PEFT) techniques address this challenge by modifying only a small subset of the model's parameters, preserving the vast majority of the pretrained weights.

One such PEFT technique that has gained significant traction is Low-Rank Adaptation, or LoRA. This method makes targeted adjustments to the model by adding low-rank matrices to specific layers, effectively capturing task-specific knowledge without drastically altering the original model. In contrast, full finetuning is akin to retraining the model extensively for each new task, potentially causing it to forget or mix up previous learning. While effective, this approach is resource-intensive and may not be practical for rapidly adapting to a wide array of tasks.
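To make the idea concrete, here is a minimal sketch of a LoRA-style layer: a frozen pretrained linear layer plus a trainable low-rank update. This is an illustration of the general technique, not the paper's implementation; the class name, rank, and scaling values are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # Low-rank factors: A is randomly initialized, B starts at zero so the
        # update is a no-op at the beginning of training.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Output = frozen W x  +  scaled low-rank update B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Only `A` and `B` receive gradients, which is why LoRA trains a tiny fraction of the parameters touched by full finetuning.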

Background

LoRA’s popularity stems from its efficiency and its presumed ability to adapt large language models to new domains without compromising their original capabilities. The technique is rooted in the hypothesis that the changes induced by full finetuning are inherently low-rank, meaning they can be effectively captured by modifying only a small subspace of the model's parameters. This aligns with the observation that LLMs often exhibit a degree of "knowledge redundancy," where similar information is encoded across multiple parameters.

Experimental Setup

To rigorously evaluate LoRA's effectiveness, the authors designed experiments using two challenging domains:

Code: Training LLMs to understand and generate computer programs.

Math: Training LLMs to solve mathematical problems.

Datasets

The study employed the following datasets for continued pretraining (CPT) and instruction finetuning (IFT):

Code CPT: StarCoder-Python, a dataset of Python code from GitHub repositories.
Math CPT: OpenWebMath, a dataset containing mathematical web pages.
Code IFT: Magicoder-Evol-Instruct-110K, a dataset of programming question-answer pairs.
Math IFT: MetaMathQA, a dataset of mathematical word problems.

Measuring Learning

To assess the models' ability to "learn" new tasks, the authors used standard benchmarks in the respective domains:

Coding: HumanEval, a benchmark requiring the generation of Python code from function signatures and docstrings.
Math: GSM8K, a benchmark consisting of grade-school math word problems.

Forgetting Metrics

To quantify the extent to which models "forget" previously acquired knowledge, the study used the following benchmarks:

HellaSwag: Tests common sense reasoning by evaluating the plausibility of sentence completions.
WinoGrande: Focuses on pronoun resolution, requiring an understanding of context and common sense.
ARC-Challenge: Consists of grade-school-level science questions, evaluating reasoning and scientific knowledge.

Results

LoRA Trails Full Finetuning in Performance

The study's central finding is that LoRA consistently lags behind full finetuning in terms of accuracy on both coding and math tasks. As the authors state,

Across LoRA configurations and training durations, it still appears to underperform full finetuning. These effects are more pronounced for programming than math.

LoRA Excels at Retaining Prior Knowledge

Despite its performance gap in the target domain, LoRA demonstrates a remarkable ability to preserve source-domain knowledge. In other words, it forgets less than full finetuning. This suggests that LoRA's targeted adaptation mechanism helps maintain the model's original capabilities even when learning new tasks.

The Learning-Forgetting Tradeoff

It is unsurprising that models that change less during finetuning also forget less of their original knowledge. The more meaningful comparison is the tradeoff itself: how much new learning each method buys for a given amount of forgetting. The authors frame this as a Pareto curve of learning versus forgetting, with LoRA sitting at the "learns less, forgets less" end and full finetuning at the "learns more, forgets more" end.

Full Finetuning Induces High-Rank Changes

Contrary to the assumption of low-rank changes during finetuning, the authors observed that "full finetuning finds weight perturbations that are far from being low-rank." This suggests that the success of LoRA might stem from its regularization effect (keeping the model close to its initial state) rather than its ability to accurately capture the full complexity of the finetuning process.
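One way to probe this claim is to compare a finetuned weight matrix against its pretrained counterpart and ask how many singular values are needed to explain most of the change. The sketch below is a plausible way to do such an analysis, not the paper's exact procedure; the energy threshold and the commented layer name are hypothetical.

```python
import torch

def effective_rank(delta_w: torch.Tensor, energy: float = 0.90) -> int:
    """Smallest number of singular values capturing `energy` of the squared
    spectral mass of a weight change (one rough gauge of how low-rank it is)."""
    s = torch.linalg.svdvals(delta_w.float())
    cumulative = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
    return int(torch.searchsorted(cumulative, torch.tensor(energy)).item()) + 1

# Hypothetical usage: subtract the pretrained weights from the finetuned ones.
# w_before = base_model.state_dict()["model.layers.0.mlp.up_proj.weight"]
# w_after  = finetuned_model.state_dict()["model.layers.0.mlp.up_proj.weight"]
# print(effective_rank(w_after - w_before))
```

If full finetuning's weight changes were truly low-rank, this count would be small relative to the matrix dimensions; the paper's observation is that it is not.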

Practical Insights for LoRA Optimization

The study offers valuable recommendations for effectively utilizing LoRA:

Learning Rate Sensitivity

LoRA exhibits significant sensitivity to learning rate adjustments, often requiring rates an order of magnitude higher than full finetuning.

Target Modules Matter

Selecting the appropriate modules (e.g., attention, MLP) within the LLM architecture for LoRA adaptation significantly impacts performance.

Rank Considerations

While LoRA's performance generally improves with higher-rank adaptations, the gains diminish, suggesting a trade-off between accuracy and efficiency.
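The three recommendations above translate naturally into configuration choices. Below is a hedged sketch using Hugging Face's `peft` library; the model name, rank, and learning-rate values are illustrative assumptions, not the paper's exact settings.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; substitute whichever checkpoint you are adapting.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,            # adapter rank; higher ranks tend to help, with diminishing returns
    lora_alpha=32,   # scaling factor applied to the low-rank update
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",      # attention projections
        "gate_proj", "up_proj", "down_proj",          # MLP projections
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# LoRA is typically trained with a learning rate roughly an order of magnitude
# higher than full finetuning, e.g. ~2e-4 for LoRA vs. ~2e-5 for full finetuning
# (exact values should be swept per task).
```

Targeting both attention and MLP modules, rather than attention alone, is the kind of choice the study flags as having a significant effect on downstream accuracy.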

Business Implications

This research carries important implications for businesses looking to leverage LLMs:

Resource Allocation

For tasks demanding high accuracy, particularly in complex domains like code generation, businesses might need to prioritize full finetuning despite its higher computational cost. LoRA, in these cases, might serve as a starting point for rapid prototyping or for tasks where resource efficiency is paramount.

Knowledge Preservation

LoRA's strength in retaining prior knowledge makes it suitable for applications where maintaining a broad range of capabilities is crucial. This could include scenarios where an LLM needs to switch between tasks or domains without significant retraining.

Hybrid Approaches

Future research could explore hybrid strategies combining the strengths of both LoRA and full finetuning. For instance, an initial phase of LoRA adaptation could be followed by targeted full finetuning of specific layers or modules, potentially striking a balance between accuracy and efficiency.

Conclusion

The paper provides a comprehensive analysis of LoRA, revealing both its limitations and its unique strengths. While LoRA falls short of full finetuning in terms of accuracy, especially for demanding domains, it excels at preserving source-domain knowledge and offers a computationally efficient adaptation method. The study's insights into LoRA's behavior and its sensitivity to hyperparameters provide valuable guidance for practitioners and pave the way for exploring hybrid or refined PEFT techniques that combine the best of both worlds.
