Text embedding models play a crucial role in various natural language processing (NLP) and generative AI tasks by representing text as dense vectors. These embeddings capture semantic relationships between words and sentences, enabling applications like document retrieval, sentence similarity, classification, and clustering.
While recent efforts have focused on developing general-purpose text embedding models, these models often require vast amounts of training data. Gecko addresses this challenge by leveraging the knowledge contained within large language models (LLMs) through a knowledge distillation approach. This allows Gecko to achieve strong performance with a compact model size and lower dimensionality compared to other models.
The model draws on insights from knowledge distillation to build its embedding model around a two-step, LLM-powered data generation process.
Figure: Overview of Gecko. Gecko is a versatile text embedding model trained on a variety of tasks including document retrieval, semantic similarity, and classification. To train Gecko, we utilize FRet, in which queries are generated by LLMs and their positive and negative passages are mined by LLMs.
The resulting dataset, FRet, is then combined with human-annotated data and used to fine-tune Gecko. This combination of LLM-generated and LLM-ranked data with human-annotated data allows Gecko to achieve strong performance on a variety of tasks.
This two-step distillation process is key to Gecko's success, as it allows the model to leverage the vast knowledge and understanding of LLMs to create a high-quality training dataset.
Gecko builds upon several existing concepts in NLP and machine learning:
Text Embedding Models: Existing models like SBERT, Universal Sentence Encoder, and Sentence T5 aim to provide general-purpose embeddings for various NLP tasks. However, they often struggle to generalize across different domains and tasks. Gecko addresses this limitation by leveraging LLM knowledge and a diverse training dataset.
Contrastive Learning: This technique trains models to pull similar examples together and push dissimilar ones apart. Gecko utilizes contrastive learning by employing LLM-ranked positive and hard negative passages during training; a minimal loss sketch follows this list.
Synthetic Data Generation: LLMs can be used to generate synthetic data for training NLP models. Gecko leverages this capability to create a diverse and task-agnostic training dataset, overcoming the limitations of manually labeled data.
Retrieval with Instructions: Recent research explores incorporating instructions into retrieval tasks to guide the model's behavior. Gecko adopts this concept by generating task descriptions along with queries, allowing the model to adapt to different retrieval objectives.
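As referenced above under Contrastive Learning, here is a minimal PyTorch sketch (not Gecko's actual training code) of an in-batch softmax loss in which each query must score its own positive above the other in-batch positives and an explicitly mined hard negative; the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, p, n, tau=0.05):
    """In-batch softmax contrastive loss with one mined hard negative per query.

    q, p, n: [B, D] tensors of query, positive-passage, and hard-negative embeddings.
    tau: softmax temperature (an assumed value, not taken from the paper).
    """
    q, p, n = (F.normalize(x, dim=-1) for x in (q, p, n))

    pos_scores = q @ p.T                               # [B, B] scores against all in-batch positives
    neg_scores = (q * n).sum(dim=-1, keepdim=True)     # [B, 1] score against the query's own hard negative
    logits = torch.cat([pos_scores, neg_scores], dim=1) / tau

    # For query i, the correct "class" is its own positive in column i.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```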
Gecko is based on a 1.2B parameter pre-trained transformer language model that undergoes two additional training stages: pre-finetuning and fine-tuning.
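For intuition, a dual encoder like this turns a piece of text into a single vector by pooling the transformer's token representations. The sketch below assumes a Hugging Face-style encoder and masked mean pooling, common choices for dual encoders rather than confirmed details of Gecko's implementation:

```python
import torch
import torch.nn.functional as F

def embed_text(encoder, tokenizer, text, device="cpu"):
    """Sketch of producing a single embedding from a transformer encoder.

    Masked mean pooling over token states is assumed here; it is a common
    choice for dual encoders, not necessarily Gecko's exact pooling.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state          # [1, T, D] token representations
    mask = inputs["attention_mask"].unsqueeze(-1)              # [1, T, 1] ignores padding tokens
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # masked mean pooling
    return F.normalize(pooled, dim=-1)                         # unit-length text embedding
```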
Gecko's training pipeline consists of three parts: pre-finetuning, creation of the FRet dataset, and fine-tuning.
Gecko starts with a pre-trained language model, which is further trained on a large corpus of text pairs through self-supervised tasks. This exposes the model to diverse textual data and improves its ability to capture semantic relationships. Training on text pairs has been shown to improve the performance of smaller-scale dual encoders on a variety of downstream tasks, including document retrieval and semantic similarity.
FRet stands for Few-shot prompted Retrieval dataset.
This is the core of Gecko's knowledge distillation process. It involves two steps: first, an LLM reads a seed passage sampled from a web corpus and generates both a task description and a query relevant to that task; second, candidate passages are retrieved for the generated query and the LLM ranks them to mine a positive passage and a hard negative passage.
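A rough sketch of the first step: the LLM is prompted with a seed passage and asked for a task description plus a matching query. The prompt wording and the call_llm helper are hypothetical stand-ins, not the paper's actual few-shot template or API:

```python
import json

PROMPT_TEMPLATE = """You are given a web passage. Propose a retrieval task description
and a query for that task which the passage could help answer.
Return JSON with the keys "task" and "query".

Passage: {passage}
"""

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in whatever LLM API you actually use."""
    raise NotImplementedError

def generate_task_and_query(seed_passage: str) -> tuple[str, str]:
    """FRet step 1: the LLM reads a seed passage and proposes a (task, query) pair."""
    response = call_llm(PROMPT_TEMPLATE.format(passage=seed_passage))
    parsed = json.loads(response)
    return parsed["task"], parsed["query"]
```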
Gecko is fine-tuned on a mixture of FRet and several academic datasets, all formatted in a unified way with task descriptions, queries, positive passages, and negative passages. This allows the model to learn from both synthetic and human-annotated data, further improving its performance and versatility.
Gecko demonstrates superior performance on the MTEB benchmark compared to other text embedding models with similar size and dimensionality. Notably, Gecko achieves strong results even when trained solely on the synthetic FRet dataset, highlighting its zero-shot generalization capabilities.
Gecko also shows promising results on multilingual retrieval tasks, even though the FRet dataset is only available in English. This suggests that the knowledge distilled from LLMs can be effectively transferred to other languages.
Several factors contribute to Gecko's success:
Using LLMs to identify better positive and negative passages significantly improves the quality of the training data, and with it the trained model. In many cases the LLM-mined positive matches the generated query better than the seed passage the query was generated from; using the mined positive (together with a genuinely hard negative) instead of the seed passage improves quality.
Example:
Seed Passage: Tagged: Batman, Robin, DC, DC Comics, Comics, …
Generated Task: Given a query, find a passage that allows you to check whether the query is true or not.
Generated Query: Batman is from DC comics
LLM-mined Positive: The Batman is an American superhero film based on the DC Comics character of the same name. Produced by DC Films and distributed by Warner Bros. Pictures, it is a reboot of the Batman film franchise.
LLM-mined Negative: "One of my employees wants to dress up in Batman attire," Gaskins said. "As long as he’s at work, I told him it was fine." New York Times News Service contributed to this report.
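The second FRet step, LLM-ranked mining, could look roughly like the sketch below: retrieve candidate passages for the generated query with an existing embedder, have the LLM grade each one, then take the top-ranked candidate as the positive and a low-ranked but related candidate as the hard negative. The embed helper, the grading prompt, and the 0-to-5 scoring are hypothetical illustrations (the paper's exact ranking scheme may differ), and call_llm is the same placeholder as in the earlier sketch:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical placeholder for a pre-existing text embedder used for retrieval."""
    raise NotImplementedError

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mine_positive_and_negative(task: str, query: str, corpus: list[str], k: int = 20):
    """FRet step 2 (illustrative): LLM-ranked positive and hard-negative mining."""
    # Retrieve the k passages closest to the generated query.
    query_vec = embed(query)
    candidates = sorted(corpus, key=lambda p: -cosine(query_vec, embed(p)))[:k]

    # Ask the LLM how well each candidate satisfies the query for this task.
    scores = []
    for passage in candidates:
        prompt = (f"Task: {task}\nQuery: {query}\nPassage: {passage}\n"
                  "On a scale of 0 to 5, how well does the passage satisfy the query? "
                  "Answer with a single number.")
        scores.append(float(call_llm(prompt)))

    ranked = [p for _, p in sorted(zip(scores, candidates), key=lambda t: -t[0])]
    positive = ranked[0]         # best candidate, which may differ from the seed passage
    hard_negative = ranked[-1]   # related but lower-ranked, hence a hard negative
    return positive, hard_negative
```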
The diversity of tasks and queries within FRet allows Gecko to learn general-purpose representations that can be applied to various generative AI and NLP tasks. Interestingly, unified formatting also has a significant effect on embedding quality, as it helps the model better separate different tasks.
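To make the unified format concrete, here is one way the Batman example above could be serialized into a single training record; the exact "task: ... | query: ..." template is an assumed layout for illustration rather than a confirmed detail of the paper:

```python
# One unified training record with the fields shared by FRet and the academic datasets.
example = {
    "task": "Given a query, find a passage that allows you to check whether the query is true or not.",
    "query": "Batman is from DC comics",
    "positive": "The Batman is an American superhero film based on the DC Comics character of the same name. ...",
    "negative": '"One of my employees wants to dress up in Batman attire," Gaskins said. ...',
}

def format_query(ex: dict) -> str:
    # Prepending the task description lets one model serve different retrieval objectives.
    return f"task: {ex['task']} | query: {ex['query']}"
```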
By conditioning on seed passages, the LLM generates diverse tasks and queries. It can also surface a passage that answers the generated query more directly than the seed passage itself, and the LLM-ranked hard negatives force the embedding model to learn nuanced distinctions between closely related passages.
The two-step LLM distillation process effectively brings the LLM's diverse domain knowledge and global ranking preferences into the text embedding model.
Gecko's versatility and strong performance open the door to a range of business applications, from semantic search and document retrieval to classification and clustering.
Gecko presents a novel approach for training versatile text embedding models by distilling knowledge from LLMs. Its strong performance and zero-shot capabilities make it a promising tool for various NLP tasks and business applications. As LLM technology continues to advance, we can expect further improvements in Gecko's capabilities and its potential impact on the field of NLP.
The main reason Gecko works so well is its two-step LLM distillation process, which lets it draw on the broad knowledge of LLMs to build a high-quality training dataset and, in turn, a stronger embedding model.
Here's how Gecko's approach differs from previous methods: rather than relying primarily on large amounts of manually labeled data, Gecko uses LLMs both to generate task descriptions and queries and to rank candidate passages, so its positives and hard negatives reflect the LLM's relevance judgments.
In essence, Gecko's LLM distillation process allows it to learn from the knowledge and reasoning capabilities of LLMs, which ultimately leads to better text embedding models. This approach is more efficient and scalable than relying on manually labeled data, and it has the potential to revolutionize the way text embedding models are trained.