Large language models are getting bigger and better, but their size comes with a cost. Training them requires immense computing power, and storing them takes up a lot of space. That's where parameter-efficient fine-tuning (PEFT) comes in. PEFT techniques adapt these powerful LLMs to specific tasks without updating all of the model's parameters, which saves resources and makes the process far more efficient.
One popular PEFT approach is Low-Rank Adaptation (LoRA). LoRA updates only a small part of the model's parameters through low-rank matrices, which act like simplified representations of the full weight update. This can be quite effective, but LoRA's reliance on low-rank updates may hinder its ability to fully learn and store new information. From the paper:
One plausible explanation for this limitation observed with LoRA could be its reliance on low-rank updates (Lialin et al., 2023). The low-rank update matrix, ∆W, struggles to estimate the full-rank updates in FFT, particularly in memory-intensive tasks like continual pretraining that require memorizing domain-specific knowledge.
To address this, the authors propose MoRA, which uses a single square matrix instead of two low-rank matrices. This allows for higher-rank updates while using the same number of trainable parameters as LoRA. To bridge the mismatch between the square matrix's dimensions and the model's, MoRA incorporates non-parameterized operators that act like "compressors" and "decompressors" on the input and output. Think of it like a coding system that efficiently shrinks and expands the data before and after it passes through the square matrix.
Full fine-tuning (FFT) is the most traditional approach. It updates all of the model's parameters, giving the model the most flexibility to learn, and it is the strongest baseline for memorizing new knowledge. But it is also by far the most computationally expensive and resource-intensive option.
LoRA is a popular PEFT method that makes a clever trade-off between performance and efficiency. It learns two low-rank matrices (A and B) whose product forms the update to the model's weights. Because A and B are small, LoRA trains only a tiny fraction of the parameters, making it much more efficient. The low-rank update can also be merged back into the original weights after training, avoiding any additional computational cost during inference.
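To make the mechanics concrete, here is a minimal PyTorch sketch of the LoRA idea (illustrative only; the class name, initialization, and scaling details are my assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: output = base(x) + scale * (x @ A^T) @ B^T."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        d_out, d_in = base.weight.shape
        # A starts with small random values and B with zeros, so the update
        # B @ A is zero at first and training begins from the base model.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```

After training, the product B @ A can be added directly into base.weight, which is why LoRA adds no overhead at inference time.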
The paper points out that LoRA's effectiveness depends on the task. While it performs well on tasks like text classification and instruction tuning, it struggles with tasks that require the model to learn and store significant new information. From the paper:
Low-rank updating by LoRA shows on-par performance with full-rank updating in some tasks such as text classification or instruction tuning (Liu et al., 2024; Meng et al., 2024). However, for tasks like complex reasoning or continual pretraining, LoRA tends to show worse performance (Liu et al., 2023).
The paper uses a memorization task to illustrate this point. The authors trained LLMs to associate randomly generated UUIDs (unique identifiers) with each other, a task that requires the model to learn a large amount of new information that is not present in its pretraining data. The results showed that LoRA, even with a high rank, performed significantly worse than full fine-tuning. From the paper:
Based on Figure 2, we observe low-rank updating are hard to memorizing new knowledge compared to FFT. Although constantly increasing the rank of LoRA can alleviate this problem, the gap still exists.
This difference in performance highlights the limitation of LoRA's low-rank updates. While it can effectively leverage the existing knowledge of the LLM, it struggles to acquire and store new, complex information.
To overcome the limitations of LoRA, the paper introduces MoRA. The main difference is that MoRA uses a square matrix (M) instead of two low-rank matrices. This allows MoRA to achieve higher ranks with the same number of trainable parameters, giving it more capacity to learn and store new information.
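Some illustrative numbers make this concrete (using a hidden size of 4096, typical of LLaMA-style models; the arithmetic is mine, not quoted from the paper): LoRA with rank 8 trains A (8×4096) and B (4096×8), i.e. 2 × 4096 × 8 = 65,536 parameters, yet the update ∆W = BA has rank at most 8. A square matrix with the same 65,536-parameter budget is 256×256, so its rank can be as high as 256, a 32-fold increase for the same parameter count.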
However, a square matrix large enough to act on the full hidden dimension would require far more parameters and computation. To stay within the parameter budget, MoRA uses a much smaller square matrix and introduces non-parameterized operators, f_comp and f_decomp, which act like compression and decompression functions. These operators shrink the input to the square matrix's size and expand its output back to the model's hidden dimension. The paper explores several ways to implement them, including truncating the dimension, sharing rows and columns, reshaping the input into smaller chunks, and applying rotation operators inspired by RoPE.
The key is that these operators are non-parameterized, meaning they don't require any learning. This ensures that MoRA can be merged back into the LLM after training, just like LoRA, without introducing additional computational cost during inference.
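Here is a rough PyTorch sketch of one simple variant, assuming a reshape-based compressor (the paper studies several operators, and its actual implementation differs in detail):

```python
import torch
import torch.nn as nn

class MoRALayer(nn.Module):
    """Sketch of MoRA's update path: f_comp -> square matrix M -> f_decomp.

    In this variant, f_comp reshapes the input into chunks of size r_hat and
    f_decomp flattens the result back; neither adds trainable parameters.
    """

    def __init__(self, dim: int = 4096, r_hat: int = 256):
        super().__init__()
        assert dim % r_hat == 0, "this simple variant needs dim divisible by r_hat"
        self.r_hat = r_hat
        # The only trainable parameters: one square matrix, initialized to
        # zero so the update starts as a no-op.
        self.M = nn.Parameter(torch.zeros(r_hat, r_hat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        *batch, dim = x.shape
        chunks = x.reshape(*batch, dim // self.r_hat, self.r_hat)  # f_comp
        out = chunks @ self.M.T                                    # square update
        return out.reshape(*batch, dim)                            # f_decomp
```

With dim = 4096 and r_hat = 256, M holds 65,536 parameters, the same budget as the rank-8 LoRA example above, but the update is no longer capped at rank 8.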
The authors evaluated MoRA on several tasks to understand its impact. They used different datasets and compared its performance against LoRA and other baseline methods.
This task was designed to test how well each method can learn and store new knowledge. It involves associating pairs of randomly generated UUIDs, so the model must acquire the associations during fine-tuning rather than drawing on its pretrained knowledge. The paper found that MoRA significantly outperformed LoRA on this task. From the paper:
Our method shows significant improvements over LoRA with the same number of trainable parameters, benefiting from high-rank updating. We also report character-level accuracy at various training steps in Table 2. MoRA requires fewer training steps to memorize these UUID pairs compared to LoRA.
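As a rough illustration of the setup, a dataset like this could be generated as follows (the pairing format is my assumption; the paper's exact prompt template may differ):

```python
import random
import uuid

def make_uuid_pairs(n_pairs: int = 10000, seed: int = 0) -> list[str]:
    """Generate key -> value pairs of random UUIDs. Because both sides are
    random, a model can only answer correctly by memorizing the pairs it
    saw during fine-tuning."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_pairs):
        key = uuid.UUID(int=rng.getrandbits(128), version=4)
        value = uuid.UUID(int=rng.getrandbits(128), version=4)
        examples.append(f"{key} -> {value}")
    return examples
```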
The paper evaluated MoRA on three different fine-tuning tasks:
Instruction Tuning
This task aims to adapt the LLM to better understand and respond to specific instructions. The authors used Tülu v2, a dataset combining several high-quality instruction datasets.
Mathematical Reasoning
This task assesses the LLM's ability to solve mathematical problems. The authors used MetaMath, a dataset of 395k samples designed to enhance mathematical reasoning capabilities, as well as GSM8K and MATH for further evaluation.
Continual Pretraining
This task involves fine-tuning the LLM to perform well on specific domains, like biomedicine or finance. The authors used PubMed abstracts and financial news data to train the models.
The paper also tested MoRA on a pretraining task, training a transformer model from scratch on the C4 dataset. They compared MoRA with LoRA and with ReLoRA, a method that periodically merges the low-rank matrices into the model during training to increase the rank of the accumulated updates. The paper also introduces ReMoRA, which combines MoRA with the merge-and-reinit strategy from ReLoRA.
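The merge-and-reinit step itself is easy to sketch (illustrative PyTorch, not the authors' code). ReMoRA applies the same principle to MoRA's square matrix and, per the paper, adjusts the compression and decompression operators between merges so the accumulated updates keep raising the total rank:

```python
import torch

@torch.no_grad()
def relora_merge_and_reinit(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor):
    """One ReLoRA-style merge-and-reinit step: fold the current low-rank
    update into the frozen weight, then reset A and B so the next phase
    learns a fresh update. Summing several rank-r updates lets the total
    update exceed rank r."""
    W += B @ A                          # merge the current update into the model
    torch.nn.init.normal_(A, std=0.01)  # restart A with small noise
    B.zero_()                           # zero B so the new update starts at zero
```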
Across the different tasks, the results showed that:
MoRA achieved comparable performance to LoRA on instruction tuning and mathematical reasoning tasks. This suggests that MoRA can effectively leverage existing knowledge.
MoRA outperformed LoRA on continual pretraining and the memorization task. This shows that MoRA is better suited to tasks that require the model to acquire and memorize new knowledge.
MoRA's performance on pretraining was better than both LoRA and ReLoRA. This indicates that high-rank updating is beneficial even when training the model from scratch.
ReMoRA achieved further improvements over MoRA, demonstrating the effectiveness of merging the square matrix during training.
The paper also examined the singular values of the learned weight updates. MoRA and ReMoRA produced far more significant singular values than LoRA and ReLoRA, indicating that they achieved higher-rank updates, and the number of significant singular values correlated strongly with overall model performance.
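This kind of rank analysis is straightforward to reproduce on any weight delta; here is a minimal sketch (the 0.1 threshold is an arbitrary illustration, the paper uses its own criterion for "significant"):

```python
import torch

def significant_singular_values(delta_w: torch.Tensor, tol: float = 0.1) -> int:
    """Count singular values of a weight update above a threshold; more
    large singular values means a higher effective rank."""
    s = torch.linalg.svdvals(delta_w)  # singular values, largest first
    return int((s > tol).sum())
```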
MoRA has several important implications for businesses that rely on LLMs:
MoRA's ability to learn new information effectively means that LLMs can be adapted to new tasks and domains more quickly and efficiently. This is particularly valuable for businesses that work in rapidly evolving industries or need to customize LLMs for specific applications.
MoRA's increased capacity for memorization allows LLMs to retain and recall information more accurately. This is crucial for applications requiring knowledge-intensive tasks, like question answering, summarization, or personalized recommendations.
MoRA's parameter efficiency means it can achieve comparable or even better performance than LoRA with the same number of trainable parameters, narrowing the gap with full fine-tuning at PEFT-level cost. This keeps training, deployment, and inference affordable, making LLMs more accessible to businesses with limited resources.
The ability to merge the square matrix back into the LLM after training provides flexibility. Businesses can fine-tune the model with MoRA and then easily deploy it without sacrificing inference speed.
The paper demonstrates the importance of high-rank updating for LLMs, particularly for tasks requiring the acquisition of new knowledge. MoRA, with its square matrix and non-parameterized operators, offers a promising alternative to LoRA, achieving comparable or better performance with the same number of trainable parameters.
MoRA's success lies in its ability to achieve higher ranks with the same number of trainable parameters as LoRA. This allows MoRA to store and process more complex information. Think of it like expanding the LLM's memory capacity while keeping its overall size the same. The paper also highlights the importance of choosing appropriate compression and decompression functions within the MoRA framework. These functions act like specialized coding systems, making sure that the information isn't lost when the data is shrunk and expanded before and after it's processed by the square matrix.
The paper is a step forward in addressing the challenges of adapting LLMs to new tasks. As research in this area continues, we can expect to see even more innovative and efficient PEFT methods, further accelerating the progress of LLMs and their impact on various domains.
Editor's note: The paper is very interesting, but the results aren't as promising as the claims. The authors also did not publish it on OpenReview, so I could not see any reviewer comments. Good to have a read, but take the claims with a healthy degree of scepticism.