Contents
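1. Introduction
Text embeddings are vector representations that encode the semantic information of natural language, and they are widely used in natural language processing (NLP) tasks such as information retrieval (IR) and question answering. To overcome the limitations of existing text embedding methods, this work proposes a new method that uses large language models (LLMs) to generate synthetic data covering a diverse range of languages and tasks.
2. Related Work
Early work on text embeddings started from simple methods such as weighted averages of word embeddings. More recently, methods were proposed that learn text embeddings by fine-tuning BERT on natural language inference (NLI) datasets. However, the multi-stage training pipelines used by state-of-the-art methods such as E5 and BGE require curating large amounts of data and are limited in language diversity.
3. Method
3.1 Synthetic Data Generation
LLMs are used to generate the data needed for a variety of text embedding tasks. Diversity across languages and task types is essential for improving the robustness of the embeddings. The synthetic data is generated with a two-step prompting strategy: the first step generates a pool of candidate embedding tasks, and the second step generates data conditioned on a given task.
3.2 Training
A pretrained LLM is fine-tuned on the synthetic data. Training uses the InfoNCE loss, which is computed from matching scores between queries and documents:
\[L = -\log \frac{\phi(q^+, d^+)}{\phi(q^+, d^+) + \sum_{n_i \in N} \phi(q^+, n_i)}\]
Here \(N\) is the set of negatives and \(\phi(q, d)\) is the temperature-scaled matching score between query \(q\) and document \(d\). The temperature \(\tau\) is set to 0.02 in the experiments.
4. Experimental Results
Mistral-7B fine-tuned only on synthetic data achieves competitive performance on the BEIR and MTEB benchmarks. Notably, this is achieved without any labeled data.
5. Analysis
5.1 Is Contrastive Pre-training Necessary?
The results suggest that extensive auto-regressive pre-training already lets LLMs learn effective text representations, so minimal fine-tuning is enough to turn them into effective embedding models.
5.2 Extending to Long Text Embeddings
To evaluate how well the embeddings handle long contexts, a personalized passkey retrieval task is introduced and evaluated. The results show that the embeddings can effectively encode long-context information.
This paper presents a new approach that uses LLMs to produce diverse and robust text embeddings, pointing to a promising direction for NLP.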
Text embeddings are vector representations of natural language that encode its semantic information. They are widely used in various natural language processing (NLP) tasks, such as information retrieval (IR), question answering, semantic textual similarity, bitext mining, item recommendation, etc. In the field of IR, the first-stage retrieval often relies on text embeddings to efficiently recall a small set of candidate documents from a large-scale corpus using approximate nearest neighbor search techniques. Embedding-based retrieval is also a crucial component of retrieval-augmented generation (RAG) [21], which is an emerging paradigm that enables large language models (LLMs) to access dynamic external knowledge without modifying the model parameters. Source attribution of generated text is another important application of text embeddings [14] that can improve the interpretability and trustworthiness of LLMs.
Previous studies have demonstrated that weighted average of pre-trained word embeddings [35, 1] is a strong baseline for measuring semantic similarity. However, these methods fail to capture the rich contextual information of natural language. With the advent of pre-trained language models [11], Sentence-BERT [37] and SimCSE [13] have been proposed to learn text embeddings by fine-tuning BERT on natural language inference (NLI) datasets. To further enhance the performance and robustness of text embeddings, state-of-the-art methods like E5 [46] and BGE [48] employ a more complex multi-stage training paradigm that first pre-trains on billions of weakly-supervised text pairs, and then fine-tunes on several labeled datasets.
Existing multi-stage approaches suffer from several drawbacks. Firstly, they entail a complex multi-stage training pipeline that demands substantial engineering efforts to curate large amounts of relevance pairs. Secondly, they rely on manually collected datasets that are often constrained by the diversity of tasks and the coverage of languages. For instance, Instructor [40] is only trained on instructions from 330 English datasets, whereas BGE [48] only focuses on high-resource languages such as English and Chinese. Moreover, most existing methods employ BERT-style encoders as the backbone, neglecting the recent advances of training better LLMs and related techniques such as context length extension [38].
In this paper, we propose a novel method for text embeddings that leverages LLMs to overcome the limitations of existing approaches. We use proprietary LLMs to generate synthetic data for a diverse range of text embedding tasks in 93 languages, covering hundreds of thousands of embedding tasks. Specifically, we use a two-step prompting strategy that first prompts the LLMs to brainstorm a pool of candidate tasks, and then prompts the LLMs to generate data conditioned on a given task from the pool. To cover various application scenarios, we design multiple prompt templates for each task type and combine the generated data from different templates to boost diversity. For the text embedding models, we opt for fine-tuning powerful open-source LLMs rather than small BERT-style models. Since LLMs such as Mistral [19] have been extensively pre-trained on web-scale data, contrastive pre-training offers little additional benefit.
We demonstrate that Mistral-7B, when fine-tuned solely on synthetic data, attains competitive performance on the BEIR [42] and MTEB [28] benchmarks. This is particularly intriguing considering that this setting does not involve any labeled data. When fine-tuned on a mixture of synthetic and labeled data, our model achieves new state-of-the-art results, surpassing previous methods by a significant margin (+2%). The entire training process requires fewer than 1k steps.
Moreover, we empirically validate that our model can effectively perform personalized passkey retrieval for inputs up to 32k tokens by altering the rotation base of the position embeddings, extending the context length beyond the conventional 512 token limit. Regarding its multilinguality, our model excels on high-resource languages. However, for low-resource languages, there is still room for improvement as current open-source LLMs are not adequately pre-trained on them.
Text Embeddings are continuous low-dimensional representations of text and have been extensively applied to various downstream tasks such as information retrieval, question answering, and retrieval-augmented generation (RAG). Early work on text embeddings includes latent semantic indexing [10] and weighted average of word embeddings [25]. More recent methods exploit supervision from natural language inference [3] and labeled query-document pairs, such as the MS-MARCO passage ranking dataset [5], to train text embeddings [37, 6, 13]. However, labeled data are often limited in terms of task diversity and language coverage. To address this challenge, methods like Contriever [18], OpenAI Embeddings [30], E5 [46], and BGE [48] adopt a multi-stage training paradigm. They first pre-train on large-scale weakly-supervised text pairs using contrastive loss and then fine-tune on small-scale but high-quality datasets. In this paper, we demonstrate that it is possible to obtain state-of-the-art text embeddings with single-stage training.
Synthetic Data. Synthetic data generation is a widely studied topic in information retrieval research, with various methods proposed to enhance retrieval systems with artificially created data. For instance, Doc2query [33], InPars [2], and Promptagator [8] generate synthetic queries for unlabeled documents, which are then leveraged for document expansion or model training. GPL [45] employs a cross-encoder to produce pseudo-labels for query-document pairs. Similarly, Query2doc [47] generates pseudo-documents for query expansion by few-shot prompting LLMs. Unlike these methods, our approach does not rely on any unlabeled documents or queries and thus can generate more diverse synthetic data.
Another related line of work focuses on knowledge distillation from black-box LLMs by training on synthetic data generated from them. DINO [39] generates synthetic text pairs for semantic textual similarity. Unnatural Instructions [16] is a synthetic instruction-following dataset created by prompting existing LLMs. Orca [29] and Phi [15] propose to train better small language models by using high-quality synthetic data from GPT-3.5/4 [34].
Large Language Models. With the popularization of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities in instruction following and few-shot in-context learning [4]. However, the most advanced LLMs such as GPT-4 [34] are proprietary and have few technical details disclosed. To bridge the gap between proprietary and open-source LLMs, several notable efforts have been made, such as LLaMA-2 [44] and Mistral [19] models. A major limitation of LLMs is that they lack awareness of recent events and private knowledge. This issue can be partly mitigated by augmenting LLMs with information retrieved from external sources, a technique known as retrieval-augmented generation (RAG). On the other hand, LLMs can also serve as foundation models to enhance text embeddings. RepLLaMA [24] proposes to fine-tune LLaMA-2 with bi-encoder architecture for ad-hoc retrieval. SGPT [27], GTR [32], and Udever [51] demonstrate the scaling law of text embeddings empirically, but their performance still falls behind small bidirectional encoders such as E5 [46] and BGE [48]. In this paper, we present a novel approach to train state-of-the-art text embeddings by exploiting the latest advances of LLMs and synthetic data.
Figure 1: An example two-step prompt template for generating synthetic data with GPT-4. We first prompt GPT-4 to brainstorm a list of potential retrieval tasks, and then generate (query, positive, hard negative) triplets for each task. “{…}” denotes a placeholder that will be replaced by sampling from a predefined set of values. Full prompts are available in Appendix C.
You have been assigned a retrieval task: {task}. Your mission is to write one text retrieval example for this task in JSON format. The JSON object must contain the following keys:
- "user_query": a string, a random user search query specified by the retrieval task.
- "positive_document": a string, a relevant document for the user query.
- "hard_negative_document": a string, a hard negative document that only appears relevant to the query.
Please adhere to the following guidelines:
- "user_query" should be {query_type}, {query_length}, {clarity}, and diverse in topic.
Your output must always be a JSON object only, do not explain yourself or output anything else. Be creative!
{
"user_query": "How to use Microsoft Power BI for data analysis",
"positive_document": "Microsoft Power BI is a sophisticated tool that requires time and practice to master. In this tutorial, we'll show you how to navigate Power BI … (omitted)",
"hard_negative_document": "Excel is an incredibly powerful tool for managing and analyzing large amounts of data. Our tutorial series focuses on how you…(omitted)"
}
Utilizing synthetic data generated by advanced LLMs such as GPT-4 presents a compelling opportunity, especially in terms of enhancing diversity across a multitude of tasks and languages. Such diversity is essential for developing robust text embeddings that can perform well across different tasks, be it semantic retrieval, textual similarity, or clustering.
To generate diverse synthetic data, we propose a simple taxonomy that categorizes embedding tasks into several groups, and then apply different prompt templates to each group.
Asymmetric Tasks. This category comprises tasks where the query and document are semantically related but are not paraphrases of each other. Depending on the length of the query and document, we further divide asymmetric tasks into four subgroups: short-long match, long-short match, short-short match, and long-long match. For instance, short-long match tasks involve a short query and a long document, which is a typical scenario in commercial search engines. For each subgroup, we design a two-step prompt template that first prompts LLMs to brainstorm a list of tasks, and then generates a concrete example conditioned on the task definition. In Figure 1, we show an example prompt for the short-long match subgroup. The outputs from GPT-4 are mostly coherent and of high quality. In our preliminary experiments, we also attempted to generate the task definition and query-document pairs using a single prompt, but the data diversity was not as satisfactory as the proposed two-step approach.
Symmetric Tasks. Symmetric tasks involve queries and documents that have similar semantic meanings but different surface forms. We examine two application scenarios: monolingual semantic textual similarity (STS) and bitext retrieval. We design two distinct prompt templates for each scenario, tailored to their specific objectives. Since the task definition is straightforward, we omit the brainstorming step for symmetric tasks.
To further boost the diversity of the prompts and thus the synthetic data, we incorporate several placeholders in each prompt template, whose values are randomly sampled at runtime. For example, in Figure 1, the value of {query_length} is sampled from the set {less than 5 words, 5-10 words, at least 10 words}.
To generate multilingual data, we sample the value of {language} from the language list of XLM-R, giving more weight to high-resource languages. Any generated data that does not conform to the predefined JSON format is discarded during the parsing process. We also remove duplicates based on exact string matching.
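The sampling and filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the three-language list, and the sampling weights are hypothetical, and only the {query_length} options and the JSON keys are taken from the text and Figure 1.

```python
import json
import random
from typing import Dict, List

# Hypothetical placeholder values. Only the query_length options come from the text;
# the language list and its weights are made up for illustration.
PLACEHOLDERS = {
    "query_length": ["less than 5 words", "5-10 words", "at least 10 words"],
    "language": ["English", "German", "Swahili"],
}
LANGUAGE_WEIGHTS = [0.6, 0.3, 0.1]  # assumed: more weight on high-resource languages

# Keys required by the short-long match prompt in Figure 1.
REQUIRED_KEYS = {"user_query", "positive_document", "hard_negative_document"}


def fill_template(template: str, rng: random.Random) -> str:
    """Replace {query_length} and {language} with randomly sampled values."""
    values = {
        "query_length": rng.choice(PLACEHOLDERS["query_length"]),
        "language": rng.choices(
            PLACEHOLDERS["language"], weights=LANGUAGE_WEIGHTS, k=1
        )[0],
    }
    return template.format(**values)


def parse_and_dedupe(raw_outputs: List[str]) -> List[Dict[str, str]]:
    """Keep well-formed JSON examples and drop exact duplicates."""
    seen = set()
    kept = []
    for raw in raw_outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON: discard
        if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
            continue  # missing a required key: discard
        if not all(isinstance(obj[k], str) for k in REQUIRED_KEYS):
            continue  # required fields must be strings
        key = tuple(obj[k] for k in sorted(REQUIRED_KEYS))
        if key in seen:
            continue  # exact string duplicate
        seen.add(key)
        kept.append(obj)
    return kept
```

For example, `fill_template("Write a query of {query_length} in {language}.", random.Random(0))` returns a prompt with both placeholders filled, and `parse_and_dedupe` can then be applied to the raw model outputs collected for that prompt.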
Example output of the brainstorming step, a list of candidate retrieval tasks:
[
"Retrieve company's financial reports for a given stock ticker symbol.",
"Given a book name as a query, retrieve reviews, ratings, and summaries of that book.",
"Search for scientific research papers supporting a medical diagnosis for a specified disease.",
"Find news articles discussing the economic impact of a recent natural disaster.",
"Retrieve documents explaining the historical significance of a given landmark.",
"Search for user manuals and troubleshooting guides for a specified electronic device.",
"Find scholarly articles that debate a controversial topic in education.",
"Retrieve recipes and cooking tips for a specified ingredient.",
"Search for travel guides and tips for visiting a specific country or city.",
"Find blog posts discussing the benefits and drawbacks of a particular diet.",
"Retrieve academic papers exploring the effects of climate change on marine life.",
"Search for tutorials and guides on how to use a specific programming language.",
"Find customer reviews and ratings for a particular product.",
"Retrieve documents explaining the steps to apply for a specific type of visa.",
"Search for articles that provide investment advice for beginners.",
"Find studies that analyze the impact of social media on mental health.",
"Retrieve guidelines and standards for building construction in a specified region.",
"Search for historical documents related to a significant event in world history.",
"Find instructional materials for teaching a particular subject to elementary students.",
"Retrieve policy papers discussing the implications of a new government regulation."
]
Given a relevant query-document pair \((q^+, d^+)\), we first apply the following instruction template to the original query \(q^+\) to generate a new one \(q^+_{\text{inst}}\):
\[q^+_{\text{inst}} = \text{Instruct: } \{\text{task\_definition}\} \; \backslash\mathrm{n} \; \text{Query: } \{q^+\} \tag{1}\]
where “{task_definition}” is a placeholder for a one-sentence description of the embedding task. For generated synthetic data, we use the outputs from the brainstorming step. For other datasets, such as MS-MARCO, we manually craft the task definitions and apply them to all the queries in the dataset. We do not modify the document side with any instruction prefix. In this way, the document index can be prebuilt, and we can customize the task to perform by changing only the query side.
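As a small illustration of the template in Equation 1, the query-side formatting might look like the sketch below. The function name is hypothetical, and joining the two parts with a bare newline (no surrounding spaces) is an assumption about how the template is rendered.

```python
def add_instruction(task_definition: str, query: str) -> str:
    """Prefix a query with its one-sentence task definition (Equation 1).

    Assumption: the two parts are separated by a single newline character.
    Documents are left unchanged, so they are not passed through this function.
    """
    return f"Instruct: {task_definition}\nQuery: {query}"


# Example with a task definition taken from the brainstormed list.
formatted = add_instruction(
    "Retrieve recipes and cooking tips for a specified ingredient.",
    "fresh basil",
)
print(formatted)
```

Because only the query string changes, the same prebuilt document index can serve different tasks by swapping the task definition.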
Given a pretrained LLM, we append an [EOS] token to the end of the query and document, and then feed them into the LLM to obtain the query and document embeddings \((h_{q^+_{\text{inst}}}, h_{d^+})\) by taking the last layer [EOS] vector. To train the embedding model, we adopt the standard InfoNCE loss \(L\) over the in-batch negatives and hard negatives:
\[\min L = -\log \frac{\phi(q^+_{\text{inst}}, d^+)}{\phi(q^+_{\text{inst}}, d^+) + \sum_{n_i \in N} \phi(q^+_{\text{inst}}, n_i)} \tag{2}\]
where \(N\) denotes the set of all negatives, and \(\phi(q, d)\) is a function that computes the matching score between query \(q\) and document \(d\). In this paper, we adopt the temperature-scaled cosine similarity
\[\phi(q, d) = \exp\left(\frac{1}{\tau} \cos(h_q, h_d)\right) \tag{3}\]
where \(\tau\) is a temperature hyper-parameter, which is fixed to 0.02 in our experiments.
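To make the loss concrete, here is a minimal NumPy sketch of InfoNCE with temperature-scaled cosine similarity. It is an illustration under stated assumptions rather than the training code: the function names are hypothetical, and the negative set for each query is assumed to be the positives of the other queries in the batch plus every hard negative in the batch.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def info_nce(q: np.ndarray, d_pos: np.ndarray, d_hard: np.ndarray,
             tau: float = 0.02) -> float:
    """Mean InfoNCE loss over a batch.

    q, d_pos and d_hard all have shape (batch, dim). For query i the positive
    is d_pos[i]; the negatives (an assumption of this sketch) are the other
    rows of d_pos and all rows of d_hard.
    """
    docs = np.concatenate([d_pos, d_hard], axis=0)  # (2 * batch, dim)
    logits = cosine(q, docs) / tau                  # log phi(q, d)
    # Numerically stable log-softmax over all candidate documents.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(q.shape[0])
    return float(-log_probs[idx, idx].mean())
```

With a temperature as small as 0.02, a cosine gap of 0.1 between the positive and a negative already becomes a logit gap of 5, which is why the loss concentrates on the hardest negatives.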
Figure 2: Task type and language statistics of the generated synthetic data (see Section 3.1 for task type definitions). The “Others” category contains the remaining languages from the XLM-R language list.
Figure 2 presents the statistics of our generated synthetic data. We manage to generate 500k examples with 150k unique instructions using Azure OpenAI Service, among which 25% are generated by GPT-35-Turbo and others are generated by GPT-4. The total token consumption is about 180M. The predominant language is English, with coverage extending to a total of 93 languages. For the bottom 75 low-resource languages, there are about 1k examples per language on average.
In terms of data quality, we find that a portion of GPT-35-Turbo outputs do not strictly follow the guidelines specified in the prompt templates. Nevertheless, the overall quality remains acceptable, and preliminary experiments have demonstrated the benefits of incorporating this data subset.
The pretrained Mistral-7b [19] checkpoint is fine-tuned for 1 epoch using the loss in Equation 2. We follow the training recipe from RankLLaMA [24] and utilize LoRA [17] with rank 16. To further reduce GPU memory requirement, techniques including gradient checkpointing, mixed precision training, and DeepSpeed ZeRO-3 are applied.
For the training data, we utilize both the generated synthetic data and a collection of 13 public datasets, yielding approximately 1.8M examples after sampling. More details are available in Appendix A. To provide a fair comparison with some previous work, we also report results when the only labeled supervision is the MS-MARCO passage ranking [5] dataset.
We evaluate the trained model on the MTEB benchmark [28]. Note that the retrieval category in MTEB corresponds to the 15 publicly available datasets in the BEIR benchmark [42]. Evaluation of one model takes about 3 days on 8 V100 GPUs due to the need to encode a large number of documents. Although our model can accommodate sequence length beyond 512, we only evaluate on the first 512 tokens for efficiency. Official metrics are reported for each category. For more details about the evaluation protocol, please refer to the original papers [28, 42].
Table 1: Results on the MTEB benchmark [28] (56 datasets in the English subset). The numbers are averaged for each category. Please refer to Table 15 for the scores per dataset.
In Table 1, our model “E5mistral-7b + full data” attains the highest average score on the MTEB benchmark, outperforming the previous state-of-the-art model by 2.4 points. In the “w/ synthetic data only” setting, no labeled data is used for training, and yet the performance remains quite competitive. We posit that generative language modeling and text embeddings are the two sides of the same coin, with both tasks requiring the model to have a deep understanding of the natural language. Given an embedding task definition, a truly robust LLM should be able to generate training data on its own and then be transformed into an embedding model through light-weight fine-tuning. Our experiments shed light on the potential of this direction, and more research is needed to fully explore it.
Table 2: Comparison with commercial models and the model that tops the MTEB leaderboard (as of 2023-12-22). For the commercial models listed here, few details are available on their model architectures and training data.
In Table 2, we also present a comparison with several commercial text embedding models. However, due to the lack of transparency and documentation about these models, a fair comparison is not feasible. We focus especially on the retrieval performance on the BEIR benchmark, since RAG is an emerging technique to enhance LLM with external knowledge and proprietary data. As Table 2 shows, our model outperforms the current commercial models by a significant margin.
To assess the multilingual capabilities of our model, we conduct an evaluation on the MIRACL dataset [53], which comprises human-annotated queries and relevance judgments across 18 languages. As shown in Table 3, our model surpasses mE5large on high-resource languages, notably on English. Nevertheless, for low-resource languages, our model remains suboptimal compared to mE5base. We attribute this to the fact that Mistral-7B is predominantly pre-trained on English data, and we anticipate that future multilingual LLMs will leverage our method to bridge this gap.
Table 3: nDCG@10 on the dev set of the MIRACL dataset for both high-resource and low-resource languages. We select the 4 high-resource languages and the 4 low-resource languages according to the number of candidate documents. The numbers for BM25 and mDPR come from Zhang et al. [53]. For the complete results on all 18 languages, please see Table 5.
Figure 3: Effects of contrastive pre-training. Detailed numbers are in Appendix Table 6.
Weakly-supervised contrastive pre-training is one of the key factors behind the success of existing text embedding models. For instance, Contriever [18] treats random cropped spans as positive pairs for pre-training, while E5 [46] and BGE [48] collect and filter text pairs from various sources.
This section re-evaluates the necessity of contrastive pre-training for LLMs, particularly those that have been pre-trained on trillions of tokens. Figure 3 shows that contrastive pre-training benefits XLM-Rlarge, enhancing its retrieval performance by 8.2 points when fine-tuned on the same data, which aligns with prior findings. However, for Mistral-7B based models, contrastive pre-training has negligible impact on the model quality. This implies that extensive auto-regressive pre-training enables LLMs to acquire good text representations, and only minimal fine-tuning is required to transform them into effective embedding models.
Figure 4: Illustration of the personalized passkey retrieval task adapted from Mohtashami and Jaggi [26].
Existing evaluation datasets for text embedding models are typically short. To evaluate the long-context capability of our model, we introduce a novel synthetic task called personalized passkey retrieval, which is illustrated in Figure 4. This task requires encoding the passkey information in a long context into the embeddings. We compare the performance of different variants by changing the sliding window size and the RoPE rotation base [41] in Figure 5. The results show that the default configuration with 4k sliding window attains 100% accuracy within 4k tokens, but the accuracy deteriorates quickly as the context length grows. Naively extending the sliding window size to 32k results in worse performance. By changing the RoPE rotation base to \(10^5\), the model can achieve over 90% accuracy within 32k tokens. However, this entails a minor trade-off in performance for shorter contexts. A potential avenue for future research is to efficiently adapt the model to longer contexts through lightweight post-training [54].
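A rough way to see why a larger rotation base helps with long inputs is to compare the wavelengths of the rotary frequencies. The sketch below uses the standard RoPE frequency formula; the head dimension of 128 and the baseline base of 10^4 are assumptions about the default Mistral-7B configuration, not values given in the text.

```python
import numpy as np


def rope_wavelengths(head_dim: int, base: float) -> np.ndarray:
    """Wavelength in tokens of each rotary frequency pair.

    Standard RoPE rotates pair i with angular frequency base ** (-2i / head_dim),
    so its wavelength is 2 * pi * base ** (2i / head_dim).
    """
    exponents = np.arange(0, head_dim, 2) / head_dim
    return 2 * np.pi * base ** exponents


default = rope_wavelengths(128, 1e4)   # assumed default rotation base
extended = rope_wavelengths(128, 1e5)  # rotation base used in the text

print(f"longest wavelength at base 1e4: {default[-1]:.0f} tokens")
print(f"longest wavelength at base 1e5: {extended[-1]:.0f} tokens")
```

Raising the base stretches every wavelength except the shortest, so positions far apart in a 32k-token input map to rotation angles that stay distinguishable. It also shifts the angles seen at short range, which is one plausible reading of the small short-context trade-off reported above.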
Table 4: Results on the MTEB benchmark with various hyperparameters. The first row corresponds to the default setting, which employs last-token pooling, LoRA rank 16, and natural language instructions. Unless otherwise stated, all models are trained on the synthetic and MS-MARCO passage ranking data.
Table 4 presents the results under different configurations. We notice that the Mistral-7B initialization holds an advantage over LLaMA-2 7B, in line with the findings from the Mistral-7B technical report [19]. The choice of pooling types and LoRA ranks does not affect the overall performance substantially, hence we adhere to the default setting despite the marginal superiority of LoRA rank 8. On the other hand, the way of adding instructions has a considerable impact on the performance. We conjecture that natural language instructions better inform the model regarding the embedding task at hand, and thus enable the model to generate more discriminative embeddings. Our framework also provides a way to customize the behavior of text embeddings through instructions without the need to fine-tune the model or rebuild the document index.
This paper shows that the quality of text embeddings can be substantially enhanced by exploiting LLMs. We prompt proprietary LLMs such as GPT-4 to generate diverse synthetic data with instructions in many languages. Combined with the strong language understanding capability of the Mistral model, we establish new state-of-the-art results for nearly all task categories on the competitive MTEB benchmark. The training process is much more streamlined and efficient than existing multi-stage approaches, thereby obviating the need for intermediate pre-training.
For future work, we aim to further improve the multilingual performance of our model and explore the possibility of using open-source LLMs to generate synthetic data. We also intend to investigate ways to improve the inference efficiency and lower the storage cost for LLM based text embeddings.
Hyperparameters for Fine-tuning. When fine-tuning Mistral-7b, the batch size is set to 2048 and the learning rate is \(10^{-4}\) with 100-step warmup and linear decay. The weight decay is 0.1. We add 1 hard negative for each query-document pair. The fine-tuning process takes roughly 18 hours on 32 V100 GPUs with a maximum sequence length of 512. We add LoRA adapters to all linear layers, resulting in a total of 42M trainable parameters. Our implementation is based on the HuggingFace PEFT library at https://github.com/huggingface/peft.
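The 42M figure can be checked with a short calculation. The layer shapes below are assumptions about the Mistral-7B architecture (they are not stated in the text), and the sketch assumes that "all linear layers" means the seven projection matrices inside each transformer block, excluding the embedding and output layers.

```python
# Assumed Mistral-7B shapes: 32 blocks, hidden size 4096, MLP size 14336,
# and 1024-dimensional key/value projections (grouped-query attention).
NUM_LAYERS = 32
HIDDEN = 4096
KV_DIM = 1024
MLP = 14336
RANK = 16  # LoRA rank from the default setting

# (in_features, out_features) of each linear layer in one transformer block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_block = sum(lora_params(i, o, RANK) for i, o in LINEAR_SHAPES.values())
total = NUM_LAYERS * per_block
print(f"{total:,} trainable parameters")  # 41,943,040
```

Under these assumptions the count comes to about 41.9M, which is consistent with the 42M trainable parameters reported above.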
The model and dataset release information is available at https://github.com/microsoft/unilm/tree/master/e5.
To assess the test set contamination on all the datasets in the MTEB benchmark, we perform a string match based analysis between the test set and our training set, disregarding differences in character case and spacing. We categorize the train-test overlaps into three types:
In summary, we did not detect substantial contamination risks that could alter the main findings of this paper.
Another aspect to consider is the possibility of test set contamination in the training data of Mistral-7B and GPT-4. However, since the training data of these models is not publicly accessible, it is challenging to estimate the degree of such contamination. Given their widespread use in the research community, we believe it is still a valid comparison if other works also employ these models.
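The string-match analysis described in this section can be approximated with a few lines of code. This is a hedged sketch and not the authors' script: the function names are hypothetical, and "disregarding differences in character case and spacing" is interpreted here as lowercasing and removing all whitespace before comparison.

```python
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and remove all whitespace so case and spacing are ignored."""
    return "".join(text.lower().split())


def find_overlaps(train_texts: Iterable[str], test_texts: Iterable[str]) -> List[str]:
    """Return the test texts whose normalized form also occurs in the training set."""
    train_keys = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in train_keys]


overlaps = find_overlaps(
    ["How to use  Power BI"],
    ["how to use power bi", "an unrelated query"],
)
print(overlaps)
```

Hashing the normalized strings into a set keeps the check linear in the total number of texts, which matters when the training set has over a million examples.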
Table 5: nDCG@10 and Recall@100 on the dev set of the MIRACL dataset for all 18 languages.
Table 6: Detailed results for the effects of contrastive pre-training. For the “E5mistral-7b w/ cont. pre-train” setting, we pre-train Mistral-7B following the mE5 recipe for 10k steps.
Table 7: Prompt template for the short-long matching subgroup. For placeholders, “{query_type}” ∈ {extremely long-tail, long-tail, common}, “{query_length}” ∈ {less than 5 words, 5 to 15 words, at least 10 words}, “{difficulty}” ∈ {high school, college, PhD}, “{clarity}” ∈ {clear, understandable with some effort, ambiguous}, “{num_words}” ∈ {50, 100, 200, 300, 400, 500}.