
CPO | Contrastive Preference Optimization

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-01-13

Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

  • url: https://arxiv.org/abs/2401.08417
  • pdf: https://arxiv.org/pdf/2401.08417
  • abstract: Moderate-sized large language models (LLMs) – those with 7B or 13B parameters – exhibit promising machine translation (MT) performance. However, even the top-performing 13B LLM-based translation models, like ALMA, do not match the performance of state-of-the-art conventional encoder-decoder translation models or larger-scale LLMs such as GPT-4. In this study, we bridge this performance gap. We first assess the shortcomings of supervised fine-tuning for LLMs in the MT task, emphasizing the quality issues present in the reference data, despite being human-generated. Then, in contrast to SFT which mimics reference translations, we introduce Contrastive Preference Optimization (CPO), a novel approach that trains models to avoid generating adequate but not perfect translations. Applying CPO to ALMA models with only 22K parallel sentences and 12M parameters yields significant improvements. The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4 on WMT’21, WMT’22 and WMT’23 test datasets.

[Index note: CPO, in connection with doubts about translation quality and benchmarks]

[Related model: ALMA]


Contents

TL;DR


  1. (Optimizing MT models) Machine translation today mainly relies on transformer-based encoder-decoder architectures, and improving translation across languages involves careful fine-tuning with monolingual data and high-quality parallel data.
  2. (The ALMA model) With Contrastive Preference Optimization (CPO), the ALMA model overcomes the performance ceiling of conventional SFT and becomes competitive with much larger models.
  3. (Problems with gold-standard references) The paper shows that existing gold-standard reference translations can be of lower quality than the output of strong models, and proposes a new approach that trains the model to generate the best possible translation instead.

1. Introduction

Machine translation (MT) predominantly relies on transformer-based encoder-decoder architectures, used in models such as NLLB-200, M2M100, BIBERT, and MT5. More recently, decoder-only large language models (LLMs) have shown improved effectiveness across a range of NLP tasks, but smaller LLMs still lag behind conventional translation models.

To close this gap, the ALMA models are first fine-tuned on large amounts of monolingual data in multiple languages and then trained with supervised fine-tuning (SFT) on high-quality parallel data to focus on translation generation. This allows ALMA to surpass other moderate-size LLMs, but a gap to the best translation models remains.

1.1. Mathematical Background

The effectiveness of a translation model rests on a dataset of high-quality translation pairs \(D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N\), and the model \(\pi_{\theta}\) is trained to minimize the following loss:

\[L_{\text{NLL}} = -\mathbb{E}_{(x,y) \sim D} [\log \pi_{\theta}(y \,|\, x)] \tag{1}\]

Here \(x\) is the source sentence and \(y\) is the target translation; the loss drives the model to reproduce the reference translation.
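As an illustration of Eq. (1), the sketch below scores a single (x, y) pair with a Hugging Face causal LM; the checkpoint name, prompt format, and example sentence are placeholders rather than the paper's exact setup.

```python
# Minimal sketch of the SFT/NLL objective in Eq. (1): -log pi_theta(y | x).
# Checkpoint and prompt are illustrative placeholders, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "haoranxu/ALMA-13B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

prompt = "Translate this from German to English:\nGerman: Guten Morgen!\nEnglish:"
target = " Good morning!"

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + target, return_tensors="pt").input_ids

labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # score only the target tokens y, not the prompt x
# (Splitting prompt/target at a token boundary is approximate in this sketch.)

# Hugging Face returns the mean cross-entropy over non-masked tokens, i.e. a
# Monte-Carlo estimate of L_NLL for this single (x, y) pair.
loss = model(input_ids=full_ids, labels=labels).loss
print(loss.item())
```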

1.2. Problem Statement and Proposed Approach

Since gold-standard reference data can be of lower quality than the best translations a model could produce, it becomes important to train the model to generate higher-quality translations and to reject merely adequate ones. To this end, the paper introduces Contrastive Preference Optimization (CPO), which moves beyond the limits of reference-based training and sets a new performance boundary. CPO removes the bottleneck created by reference-mimicking training and offers advantages in memory efficiency, speed, and effectiveness at improving translation quality.


2. Gold or Gilded? Scrutinizing Gold Reference Quality

  1. The effectiveness of a machine translation model depends on high-quality translation pairs.
  2. Current translation models can produce translations better than human-written references.
  3. Training should therefore be steered toward preferring superior translations and rejecting inferior ones.

Target references (gold references) matter in machine translation because the training paradigm relies heavily on their quality: the model is optimized with a loss defined to minimize the difference between the predicted output and the gold reference.

The dataset \(D\) consists of pairs of source sentences \(x\) and their corresponding target sentences (gold references) \(y\), written as \(D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N\), where \(N\) is the total number of sentence pairs.

The negative log-likelihood loss for a model \(\pi_{\theta}\) parameterized by \(\theta\) is defined as:

\[L_{\text{NLL}} = -\mathbb{E}_{(x,y) \sim D} [\log \pi_{\theta}(y \mid x)]\]

This loss is directly tied to how well the model translates, but recent work reports that evaluation tools are sensitive to the quality of gold references and that substandard references can compromise the precision of the evaluation.

To compare the quality of gold references against the outputs of today's strongest translation models, the paper proposes evaluating both with a reference-free evaluation framework. In translation examples from the FLORES-200 dataset, for instance, system outputs are sometimes of higher quality than the gold references.

Takeaway

Because a model's translation ability depends heavily on the availability of high-quality translation pairs, the standard approach is to minimize the loss between the gold reference and the model output.

The experiments, however, show that gold references are not always of the highest quality and that model outputs can sometimes be better, which the paper presents as an important reason to reconsider how translation models are evaluated.


3. Contrastive Preference Optimization (CPO)

This section introduces Contrastive Preference Optimization (CPO), a new preference-learning technique that guides the model to develop a tendency to generate ‘better’ translations while rejecting ‘worse’ ones.

3.1 Deriving the CPO Objective

The CPO objective is derived starting from Direct Preference Optimization (DPO), a more direct optimization objective used in reinforcement learning from human feedback (RLHF). Given a dataset \(D\) of source sentences \(x\) with preferred translations \(y_w\) and dis-preferred translations \(y_l\), the DPO loss is defined as:

\[\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\left( \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]\]

Here \(\pi_{\text{ref}}\) is a pre-trained reference (translation) model, \(\sigma\) is the sigmoid function, and \(\beta\) is a hyperparameter. DPO, however, is memory-inefficient and slow, and CPO is introduced to resolve these inefficiencies.

In the CPO setup, \(\pi_{\text{ref}}\) is set to \(\pi_w\), an ideal policy for which \(\pi_w(y_w \mid x) = 1\) and \(0 \leq \pi_w(y_l \mid x) \leq 1\) hold for any data point \((x, y_w, y_l)\). With this change, predictions on the preferred data no longer need to be reweighted by a reference model, and after approximating away the unknown term \(\beta \log \pi_w(y_l \mid x)\), the DPO loss reduces to the preference term

\[\mathcal{L}_{\text{prefer}} = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\left( \beta \log \pi_{\theta}(y_w \mid x) - \beta \log \pi_{\theta}(y_l \mid x) \right) \right]\]

which is combined with a log-likelihood term on the preferred data to give the full objective \(\mathcal{L}_{\text{CPO}} = \mathcal{L}_{\text{prefer}} + \mathcal{L}_{\text{NLL}}\).

3.2 Constructing the Triplet Preference Data

The preference dataset \(D\) is built from FLORES-200 data and contains 2009 parallel sentences per language pair. For a given source sentence \(x\), GPT-4 and ALMA-13B-LoRA each generate a translation, \(y_{\text{gpt-4}}\) and \(y_{\text{alma}}\), which together with the original target reference \(y_{\text{ref}}\) form a triplet.

The reference-free evaluation models KIWI-XXL and XCOMET then score these translations; the highest-scoring translation is labeled the preferred translation \(y_w\) and the lowest-scoring one the dis-preferred translation \(y_l\). The authors note that using such high-quality yet imperfect translations as dis-preferred data helps train the model to refine details and push its generated translations toward perfection.
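A minimal sketch of this triplet construction is given below; `score_kiwi_xxl` and `score_xcomet` are hypothetical stand-ins for the two reference-free evaluation models, each assumed to return one quality score per (source, translation) pair.

```python
# Sketch of the Sec. 3.2 triplet construction: pick y_w / y_l among
# (reference, GPT-4, ALMA) by averaged reference-free scores.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    src: str
    chosen: str    # y_w: highest averaged score
    rejected: str  # y_l: lowest averaged score

def build_pair(src, y_ref, y_gpt4, y_alma, score_kiwi_xxl, score_xcomet):
    candidates = [y_ref, y_gpt4, y_alma]
    # Average the two reference-free metrics for each candidate translation.
    scores = [(score_kiwi_xxl(src, y) + score_xcomet(src, y)) / 2 for y in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    worst = min(range(len(candidates)), key=scores.__getitem__)
    if best == worst:  # all candidates tied; no usable preference signal
        return None
    # The middle-scoring candidate is simply never used, mirroring the paper.
    return PreferencePair(src=src, chosen=candidates[best], rejected=candidates[worst])
```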


4. Experiments

4.1 Data

Building on the insight from the ALMA models that a small amount of high-quality data can yield impressive translation results, the training dataset is derived from the FLORES-200 dataset and contains 2K × 10 directions = 20K sentence pairs in total. In addition, internally labeled human preference data is included.

4.2 Training Setup

ALMA-13B-LoRA serves as the initial checkpoint, and the model is trained in a many-to-many multilingual machine translation manner. During training, only the weights of the added LoRA parameters are updated. This keeps memory and speed efficient and allows preference learning with CPO at the same space and time complexity as ordinary SFT.


5. Analysis

All analyses use the WMT’21 and WMT’22 test datasets, reporting averaged performance.

5.1 Are the Translations Really Better, or Just Metric-Preferred?

Because the preference data is selected with reference-free models and the same models are used for evaluation, the study examines the potential for ‘cheating’ in this process.

Specifically, the questions are:

  • (1) whether the improved translation scores reflect genuinely better translations, or
  • (2) whether they merely align more closely with the preferences of the evaluation models. Question (2) splits into two parts:
    • (2-1) At the metric level: when a model is trained on data preferred by a specific metric (e.g., KIWI-XXL), is the improvement consistent across other metrics? To test this, the preference data is rebuilt using only KIWI-XXL or only XCOMET and the ALMA-13B-LoRA model is retrained with CPO. The results in Table 5 show no meaningful bias toward the metric used to select the preference data; similar and consistent improvements are observed across all metrics.
    • (2-2) At the method level: does training on metric-preferred data always lead to better scores on that metric? The connection is not straightforward; for instance, SFT on the preferred data paradoxically reduces performance on all three metrics, as shown in Table 2.

The analysis therefore supports the robustness and validity of using the reference-free models KIWI-XXL and XCOMET both for constructing preference data and for evaluation, and indicates no bias in this approach. Table 5 also shows that choosing KIWI-XXL, XCOMET, or an ensemble of the two has minimal impact on the results.

5.2 Ablation Study

The CPO loss function has two components: \(\mathcal{L}_{\text{prefer}}\) for preference learning and \(\mathcal{L}_{\text{NLL}}\), which keeps the model from drifting far from the preferred data distribution. To show the importance of each term, the model is retrained with only one component at a time; training with \(\mathcal{L}_{\text{NLL}}\) alone is equivalent to the baseline of SFT on the preferred data. As Table 6 shows, including both terms gives the best performance, and dropping either one degrades it.

Preference data components: Selecting the preference data involves choosing preferred and dis-preferred translations from a triplet of translations from GPT-4, ALMA, and the gold reference. Table 7 evaluates the impact of each by excluding ALMA- or GPT-4-generated data from the preference triplet and retraining the model. The findings highlight the importance of ALMA-generated data for en→xx translations and of GPT-4-generated data for xx→en translations.

5.3 Why Does the Quality of Dis-preferred Data Matter?

In the experimental setup, the dis-preferred data comes from strong translation models but receives the lowest score among the three translation outputs. A natural question is whether the quality of dis-preferred data significantly affects model performance, and whether high-quality (though imperfect) dis-preferred data helps improve translation. To test this, a new preference dataset is constructed in which the dis-preferred translations \(y_l\) are artificially generated rather than naturally derived high-quality translations.

In this new dataset, the preferred translation \(y_w\) is still chosen as the best of the three candidates, exactly as in Section 3.2. The dis-preferred translation, however, is deliberately replaced with a noised version of \(y_w\): following the method of Zeng et al. (2023), words are randomly deleted with probability 0.15 and swapped within a range of 1 with probability 0.3. This produces worse translations that are unnatural and artificial.

Table 8 compares performance when using this manually noised dis-preferred data versus the original, naturally occurring high-quality dis-preferred data.

The results show that performance drops substantially on all three metrics and in both translation directions when the dis-preferred data is manually noised, underscoring that the quality of dis-preferred data matters for improving translation performance.


1 Introduction

Machine translation (MT) predominantly utilizes transformer encoder-decoder architectures (Vaswani et al., 2017), which is evident in prominent models such as NLLB-200 (NLLB TEAM et al., 2022), M2M100 (Fan et al., 2021), BIBERT (Xu et al., 2021), and MT5 (Xue et al., 2021).

1 We release our code and models at: https://github.com/fe1ixxu/ALMA.

However, the emergence of decoder-only large language models (LLMs) such as the GPT series (Brown et al., 2020; OpenAI, 2023), Mistral (Jiang et al., 2023), LLaMA (Touvron et al., 2023a;b), Falcon (Almazrouei et al., 2023), and others, which have shown remarkable efficacy in various NLP tasks, has attracted interest in developing machine translation with these decoder-only LLMs. Recent studies (Zhu et al., 2023a; Jiao et al., 2023b; Hendy et al., 2023; Kocmi et al., 2023; Freitag et al., 2023) indicate that larger LLMs such as GPT-3.5 (175B) and GPT-4 exhibit strong translation abilities. However, the performance of smaller-sized LLMs (7B or 13B) still falls short when compared to conventional translation models (Zhu et al., 2023a).

Therefore, several studies intend to enhance the translation performance of these smaller LLMs (Yang et al., 2023; Zeng et al., 2023; Chen et al., 2023; Zhu et al., 2023b; Li et al., 2023; Jiao et al., 2023a; Zhang et al., 2023), but their improvements are relatively modest, primarily due to the predominant pre-training of LLMs on English-centric datasets, resulting in limited linguistic diversity (Xu et al., 2023). Addressing this limitation, Xu et al. (2023) introduce the ALMA models, which initially fine-tune LLaMA-2 (Touvron et al., 2023b) with extensive monolingual data in various languages to enhance their multilingual abilities, and then perform supervised fine-tuning (SFT) with small but high-quality parallel data to induce the model toward translation generation. This method has allowed ALMA models to outperform all prior moderate-size LLMs, and even larger models such as GPT-3.5, on the translation task. Nonetheless, the performance still lags behind leading translation models such as GPT-4 and the WMT competition winners. Our study bridges this gap by further fine-tuning ALMA models with our novel training method, Contrastive Preference Optimization (CPO), at minimal cost, i.e., only 12M learnable parameters (equivalent to 0.1% of the original model size) and a 22K dataset for 10 directions. The fine-tuned model is referred to as ALMA-R. A detailed performance comparison is illustrated in Figure 1.

CPO aims to mitigate two fundamental shortcomings of SFT. First, SFT’s methodology of minimizing the discrepancy between predicted outputs and gold-standard references inherently caps model performance at the quality level of the training data. This limitation is significant, as even human-written data, traditionally considered high-quality, is not immune to quality issues (more details in Section 2). For instance, one may notice that some strong translation models are capable of producing translations superior to the gold reference, as illustrated in Figure 1. Secondly, SFT lacks a mechanism to prevent the model from rejecting mistakes in translations. While strong translation models can produce high-quality translations, they occasionally exhibit minor errors, such as omitting parts of the translation. Preventing the production of these near-perfect but ultimately flawed translations is essential. To overcome these issues, we introduce Contrastive Preference Optimization (CPO) to train the ALMA model using specially curated preference data. After CPO training, the ALMA-R model shows marked improvements, achieving performance levels that match or even surpass those of GPT-4 and WMT competition winners.

Are References Gold or Gilded? We conducted an in-depth analysis of the training data (FLORES-200 data) utilized by the ALMA model. We meticulously compared the quality of the reference translations with those generated by strong translation models. Our findings reveal that, in numerous instances, the quality of human-written parallel data is even inferior to that of system-generated translations. This observation underscores a critical insight: training models exclusively towards replicating reference translations may not be the most effective approach, and reliance on reference-based evaluation could be flawed.

Pushing the Performance Boundary of SFT We introduce Contrastive Preference Optimization, which offers advantages in terms of memory efficiency, speed, and, crucially, enhanced effectiveness in improving translation quality. CPO breaks the performance bottleneck inherent in SFT’s reference-mimicking learning process and pushes the performance boundary of models that have reached saturation through SFT training.

2. Gold or Gilded? Scrutinizing Gold Reference Quality

Target references are paramount in machine translation tasks. The paradigm of training models on the machine translation task heavily relies on the quality of the references, since the model is commonly optimized using a loss that is defined to minimize the difference between the predicted outputs and the gold reference. Consider a dataset \(D\), comprising pairs of source sentences \(x\) and their corresponding target sentences (gold references) \(y\), represented as \(D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N\), where \(N\) is the total number of parallel sentences. The negative log-likelihood loss for these parallel sentences, in relation to a model \(\pi_{\theta}\) parameterized by \(\theta\), is defined as follows:

\[L_{\text{NLL}} = -\mathbb{E}_{(x,y) \sim D} [\log \pi_{\theta}(y \mid x)] \tag{1}\]

Hence, the ability of models to effectively translate is contingent upon the availability of high-quality translation pairs (Xu et al., 2023; Maillard et al., 2023). Furthermore, prevalent evaluation tools such as BLEU (Papineni et al., 2002) and COMET-22 (Rei et al., 2022) predominantly rely on reference-based metrics. However, the precision of these evaluations is sensitive to and compromised by substandard references (Kocmi et al., 2023; Freitag et al., 2023). Recent research (Xu et al., 2023; Kocmi et al., 2023; Freitag et al., 2023) has shifted attention towards assessing the quality of parallel datasets, indicating that target references may not consistently represent the highest quality. In Figure 2, we take a translation example from the FLORES-200 dataset and compare the gold reference translation with outputs from the best ALMA model and GPT-4. This comparison illustrates that the gold reference is not always flawless and can be surpassed by the model outputs.

Figure 1: A performance comparison featuring our proposed model ALMA-13B-R against other recently released 13B LLM-based models, as well as top-performing translation systems like GPT-4, WMT winners, Google Translate, and NLLB-200. This evaluation covers the WMT’22 test data across 8 directions, involving translations to and from English for German, Czech, Chinese, and Russian. Scores are averaged based on assessments from three reference-free models: wmt23-cometkiwi-da-xxl, XCOMET-XXL, and wmt22-cometkiwi-da, and are also averaged across all directions. The gold reference is also evaluated due to the reference-free approach. Our model, ALMA-13B-R, developed by further training ALMA-13B-LoRA using our proposed CPO method, either matches or surpasses the most advanced translation models. We show the detailed numerical data for all systems presented in the figure in Appendix A.

Table 1: A performance comparison between gold references and outputs from advanced translation models, as assessed by two 10B-size reference-free evaluation models with the highest correlation to human preferences. The results indicate that the average performance of these strong translation models can even exceed that of the gold references, achieving a high success rate in beating the reference.

Figure 2: An example demonstrating that a human-written gold reference may not always be flawless, and could be surpassed by translations from advanced translation models. In this case, the reference retains the abbreviation “CEP” but fails to provide its full name. The highlighted phrases in the model-generated translations indicate the portions omitted by the gold reference.

Models We scrutinize the translation outputs from ALMA-13B-LoRA², as well as zero-shot translations from the most recent GPT-4 (gpt-4-1106-preview). To assess the quality of these outputs, we employ two of the latest and largest reference-free models, each with a 10B parameter size and demonstrating very high correlation with human judgements (Freitag et al., 2023). These models are Unbabel/wmt23-cometkiwi-da-xxl (henceforth referred to as KIWI-XXL) (Rei et al., 2023) and Unbabel/XCOMET-XXL (subsequently referred to as XCOMET) (Guerreiro et al., 2023).
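For reference, scoring candidate translations with a reference-free COMET-style model might look like the sketch below. It uses the smaller Unbabel/wmt22-cometkiwi-da checkpoint purely for illustration (the paper's KIWI-XXL and XCOMET are the ~10B variants) and assumes the unbabel-comet package with its v2-style API; the example sentences are made up.

```python
# Hedged sketch: segment-level quality estimation with a reference-free COMET model.
# Assumes `pip install unbabel-comet`; gated checkpoints require a Hugging Face login.
from comet import download_model, load_from_checkpoint

ckpt = download_model("Unbabel/wmt22-cometkiwi-da")  # smaller stand-in for KIWI-XXL / XCOMET
scorer = load_from_checkpoint(ckpt)

data = [
    {"src": "Der CEP wurde 2004 gegründet.", "mt": "The CEP was founded in 2004."},
    {"src": "Der CEP wurde 2004 gegründet.", "mt": "CEP founded 2004."},
]
out = scorer.predict(data, batch_size=8, gpus=0)
print(out.scores)  # one quality score per (src, mt) pair; no reference translation needed
```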

Data We consider the high-quality and human-written FLORES-200 dataset (NLLB TEAM et al., 2022), comprising both development and test data, amounting to a total of 2009 samples for each language direction, to compare the gold references with the outputs generated by the models. We employed ALMA-13B-LoRA and GPT-4 to perform translations across five English-centric language pairs, covering both translations from and to English. These pairs include German (de), Czech (cs), Icelandic (is), Chinese (zh), and Russian (ru), with Icelandic (is) categorized as a low-resource language and the others as high-resource languages.

Prompt The prompt employed for generating translations with ALMA models is consistent with the one used in Xu et al. (2023). For GPT-4 translation generation, we follow the guidelines suggested by Hendy et al. (2023). The specifics of these prompts are detailed in Appendix B.

2 ALMA-13B-LoRA is the best 13B translation model in the ALMA families. It initially undergoes full-weight fine-tuning on monolingual data, followed by fine-tuning on high-quality human-written parallel data using low-rank adaptation (LoRA) (Hu et al., 2022).

Model Outputs Can Be Better References In Table 1, we present the evaluation scores of KIWI-XXL and XCOMET for the gold references, ALMA-13B-LoRA outputs, and GPT-4 outputs. Additionally, we report Win Ratio, reflecting the proportion of instances where model outputs surpass the gold standard references. These metrics are calculated as an average across five languages. Remarkably, even comparing with the high-quality Flores-200 dataset, the average performance of translation models in xx→en translations significantly exceeds that of the references, showing approximately 3-4 point increases in KIWI-XXL and 4-6 point gains in XCOMET. Notably, a significant proportion of outputs are rated higher than the references by KIWI-XXL (e.g., 73.24% for ALMA), with a slightly reduced yet still substantial percentage when assessed using XCOMET (60.17% for ALMA). In the en→xx direction, while the overall performance between the translations from reference and two systems is comparable, approximately 40% are still deemed superior to the reference translations.

Motivation: Help The Model Learn Rejection The aforementioned findings illustrate that translations produced by advanced models can sometimes surpass the quality of gold standard references. This raises the question of how to effectively utilize such data. A straightforward approach would involve fine-tuning the model using the source and the superior translations as references. While this could enhance the model’s translation abilities, it does not equip the model with the discernment to identify and avoid generating suboptimal translations, exemplified by the “good but not perfect” translations depicted in Figure 2. Consequently, this situation motivates us to develop a new training objective, which aims to instruct the model in prioritizing the generation of higher-quality translations and rejecting lesser ones, in a style of contrastive learning with hard negative examples (Oord et al., 2018; Chen et al., 2020; He et al., 2020; Robinson et al., 2021; Tan et al., 2023). This objective moves beyond the traditional focus on merely minimizing cross-entropy loss towards the reference.

3. Contrastive Preference Optimization

In this section, we present a novel preference learning technique, termed Contrastive Preference Optimization (CPO). This method is designed to guide the model in developing a propensity for generating ‘better’ translations while simultaneously learning to reject ‘worse’ ones, even in cases where these ‘worse’ translations are of high quality but not perfect.

3.1. Deriving the CPO Objective

We discuss the derivation of the Contrastive Preference Optimization (CPO) objective, beginning with an analysis of Direct Preference Optimization (DPO) (Rafailov et al., 2023). DPO represents a more direct optimization objective utilized in reinforcement learning from human feedback (RLHF) (Ziegler et al., 2019; Ouyang et al., 2022). Given a set of source sentences \(x\), alongside preferred translation targets \(y_w\) and less preferred ones \(y_l\), we can access a static dataset of comparisons, denoted by \(D\).

\[\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\left( \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]\]

where \(\pi_{\text{ref}}\) is a pre-trained language (translation) model, \(\sigma\) is the sigmoid function, and \(\beta\) is a hyperparameter. The DPO loss is derived via a reparameterization process of the ground truth reward and the corresponding optimal policy in the Proximal Policy Optimization (PPO) framework (Schulman et al., 2017). As a result, DPO training can be conducted in a supervised fine-tuning style, as it relies exclusively on labeled preference data and does not require interaction between agents and their environment.

However, DPO has notable drawbacks compared to common SFT. Firstly, DPO is memory-inefficient: it necessitates twice the memory capacity to simultaneously store both the parameterized policy and the reference policy. Secondly, it is speed-inefficient: executing the model sequentially for two policies doubles the processing time. To address these inefficiencies, we introduce contrastive preference optimization.
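To make the inefficiency concrete, a minimal sketch of the DPO loss is given below; it assumes the per-example sequence log-probabilities have already been computed by running both the trainable policy and the frozen reference policy, which is exactly the doubled memory and compute that CPO removes (a CPO counterpart appears at the end of this subsection).

```python
# Hedged sketch of the DPO loss from pre-computed sequence log-probabilities.
# Note the two extra inputs from the frozen reference policy pi_ref: keeping that
# second model in memory and running it forward is the overhead CPO eliminates.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor, policy_logp_l: torch.Tensor,  # log pi_theta(y_w|x), log pi_theta(y_l|x)
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,        # same quantities under frozen pi_ref
             beta: float = 0.1) -> torch.Tensor:
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities for a batch of two examples:
loss = dpo_loss(torch.tensor([-12.0, -8.5]), torch.tensor([-15.0, -9.0]),
                torch.tensor([-13.0, -8.0]), torch.tensor([-14.0, -9.5]))
```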

Initially, we set \(\pi_{\text{ref}} = \pi_w\), representing an ideal policy that perfectly aligns with the true data distribution of the preferred data. Specifically, for any given data point \((x, y_w, y_l)\) from the dataset \(D\), the conditions \(\pi_w(y_w \mid x) = 1\) and \(0 \leq \pi_w(y_l \mid x) \leq 1\) hold true. This contrasts with the conventional approach of assigning \(\pi_{\text{ref}}\) to the initial pre-trained model checkpoint. This modification is feasible because the primary role of \(\pi_{\text{ref}}\) is to prevent deviation from the original model, and our approach similarly aligns with this goal, maintaining the model’s proximity to the ideal preferred model. Consequently, under this setup, the predictions for preferred data do not require reweighting by the reference model, and the DPO loss can be reformulated as follows:

\[\mathcal{L}(\pi_{\theta}; \pi_w) = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\left( \beta \log \pi_{\theta}(y_w \mid x) - \beta \log \pi_{\theta}(y_l \mid x) + \beta \log \pi_w(y_l \mid x) \right) \right]\]

However, \(\beta \log \pi_w(y_l \mid x)\) is unknown, as the ideal model \(\pi_w\) is unreachable, so we approximate the optimization. Dropping this non-parameterized term and minimizing the resulting upper bound yields our new preference learning objective, which no longer requires computing \(\beta \log \pi_w(y_l \mid x)\):

\[\mathcal{L}_{\text{prefer}} = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\left( \beta \log \pi_{\theta}(y_w \mid x) - \beta \log \pi_{\theta}(y_l \mid x) \right) \right]\]

A detailed step-by-step derivation from DPO to CPO is provided in Appendix C. Our approach also involves implementing a straightforward signal to guide the learnable policy \(\pi_{\theta}\) towards the preferred data distribution. Specifically, we incorporate a log-likelihood supervised fine-tuning loss applied to the preferred data:

\[\mathcal{L}_{\text{NLL}} = -\mathbb{E}_{(x, y_w) \sim D} \left[ \log \pi_{\theta}(y_w \mid x) \right]\]

The final CPO objective combines the two terms: \(\mathcal{L}_{\text{CPO}} = \mathcal{L}_{\text{prefer}} + \mathcal{L}_{\text{NLL}}\).

The transition from DPO to CPO resolves issues of memory and speed inefficiency. This efficiency is achieved as CPO only requires the storage and processing of one policy \(\pi_{\theta}\), so CPO facilitates preference learning with the same space and time complexity as common SFT methods. We also show the substantially superior performance of CPO compared to DPO in Section 4.
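Putting the two terms together, a minimal sketch of the CPO objective is shown below; it assumes the sequence log-probabilities of the chosen and rejected translations have already been summed over target tokens, and it omits implementation details such as per-token length normalization.

```python
# Hedged sketch of L_CPO = L_prefer + L_NLL from pre-computed sequence log-probs.
# Only the single trainable policy pi_theta is involved; no reference model is needed.
import torch
import torch.nn.functional as F

def cpo_loss(logp_chosen: torch.Tensor,    # [B] log pi_theta(y_w | x)
             logp_rejected: torch.Tensor,  # [B] log pi_theta(y_l | x)
             beta: float = 0.1) -> torch.Tensor:
    # Preference term: push the chosen translation above the rejected one.
    prefer = -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()
    # Behaviour-cloning / NLL term keeps the policy close to the preferred data.
    nll = -logp_chosen.mean()
    return prefer + nll

# Toy usage with made-up log-probabilities for a batch of two examples:
loss = cpo_loss(torch.tensor([-12.0, -8.5]), torch.tensor([-15.0, -9.0]))
```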

3.2. Triplet Preference Data

Construction of Preference Data \(D\)

This dataset is developed using the FLORES-200 data (both development and test sets) and encompasses the same language pairs as discussed in Section 2. For each language pair, the dataset comprises 2009 parallel sentences.

For a given source sentence \(x\), whether translated from or to English, we utilize both GPT-4 and ALMA-13B-LoRA to generate respective translations, denoted as \(y_{\text{gpt-4}}\) and \(y_{\text{alma}}\). Together with the original target reference \(y_{\text{ref}}\), this forms a triplet \(y = (y_{\text{ref}}, y_{\text{gpt-4}}, y_{\text{alma}})\), representing three different translation outputs for the input \(x\).

The reference-free evaluation models KIWI-XXL and XCOMET are then employed to score these translations, with the average scores represented as \(s = (s_{\text{ref}}, s_{\text{gpt-4}}, s_{\text{alma}})\). The highest-scoring translation is labeled as the preferred translation \(y_w\), and the lowest-scoring as the dis-preferred translation \(y_l\):

\[y_w = y_{\arg \max_i (s)}, \quad y_l = y_{\arg \min_i (s)}\]

where \(i\) represents the index in the triplet. Translations with intermediate scores are not considered. An illustrative example of this selection process is depicted in Figure 3.

It is important to note that even the dis-preferred translations may be of high quality. The designation ‘dis-preferred’ indicates that there is still room for improvement, perhaps through the addition of minor details. This approach of using high-quality but not flawless translations as dis-preferred data aids in training the model to refine details and achieve perfection in generated translations.

4. Experiments

4.1. Data

Following Section 2, we consider 10 translation directions in the paper: cs↔en, de↔en, is↔en, zh↔en, ru↔en. Building on the ALMA models’ (Xu et al., 2023) insights that a small quantity of high-quality data can yield impressive translation results, our training dataset is even more compact. As detailed in Section 3.2, our preference training data is derived from the FLORES-200 dataset, a subset of which has also been employed in the training of ALMA models. This results in a total of 2K × 10 directions = 20K paired sentences. In addition to preference data assessed by large evaluation models, our dataset incorporates 1K internal human-labeled preference data, containing preferred and dis-preferred translations along with human preference. However, the human-labeled data is limited to just two translation directions: en→zh and en→de. The details regarding the composition and influence of human-labeled data are explored in Appendix D.⁴ In alignment with Xu et al. (2023), our primary focus is on the test set drawn from WMT’21 for is and WMT’22 for other languages. Additionally, we conduct auxiliary experiments evaluating models on WMT’23, covering six directions: de↔en, zh↔en, and ru↔en.

Figure 3: A triplet of translations, either model-generated or derived from a reference, accompanied by their respective scores as assessed by reference-free models. For a given source sentence, the translation with the highest score is designated as the preferred translation, while the one with the lowest score is considered dis-preferred, and the translation with a middle score is disregarded.

4.2. Training Setup

We train the model in a many-to-many multilingual machine translation manner, starting with ALMA-13B-LoRA as the initial checkpoint. During the training phase, we focus exclusively on updating the weights of the added LoRA parameters. These weights have a rank of 16 and only add an additional 12M parameters to the original 13B size of the model. We adhere to the default β value of 0.1 as suggested by Rafailov et al. (2023). The fine-tuning process of ALMA-13B-LoRA involves a batch size of 128, a warm-up ratio of 0.01, spanning a single epoch, and accommodating sequences with a maximum length of 512 tokens. To optimize training efficiency, we integrate the deepspeed tool (Rasley et al., 2020). We utilize the same prompt as Xu et al. (2023) and do not compute the loss for the prompt. While our primary focus is on the performance of 13B models, CPO markedly benefits 7B models as well. Consequently, we also release ALMA-7B-R and provide a detailed discussion in the appendix.

3 The impact of using different evaluation models, such as only using XCOMET or KIWI-XXL, is explored in Section 5.1.

4 TL;DR: A brief overview of the impact of this human-labeled data suggests a minimal effect.
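A hedged sketch of this fine-tuning configuration is shown below using the peft and transformers libraries; the rank, batch size, warmup ratio, epoch count, and DeepSpeed usage follow the numbers above, while the checkpoint name, target_modules, lora_alpha, and DeepSpeed config path are illustrative assumptions.

```python
# Hedged sketch of the training setup: LoRA rank 16 (~12M extra parameters),
# effective batch size 128, warmup ratio 0.01, one epoch, DeepSpeed enabled.
# Checkpoint name, target_modules, lora_alpha, and ds_config.json are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-13B")  # placeholder checkpoint

lora_cfg = LoraConfig(
    r=16,                                   # rank reported in the paper
    lora_alpha=32,                          # assumption
    target_modules=["q_proj", "v_proj"],    # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)     # only the LoRA weights remain trainable

args = TrainingArguments(
    output_dir="alma-13b-r-cpo",
    per_device_train_batch_size=8,          # with accumulation for an effective batch of 128
    gradient_accumulation_steps=16,
    warmup_ratio=0.01,
    num_train_epochs=1,
    deepspeed="ds_config.json",             # placeholder DeepSpeed config
)
# Sequences are truncated to 512 tokens during tokenization, and beta = 0.1 is
# passed to the CPO loss (see the sketch in Section 3.1).
```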

Table 2: The overall results in en→xx for WMT’21 and WMT’22. The application of the CPO method to fine-tune the ALMA-13B-LoRA model leads to a significant enhancement in performance, equalling or surpassing that of WMT competition winners and GPT-4. Bold numbers denote the highest scores across all systems. Deep green boxes highlight improvements exceeding 1 point over the ALMA model after fine-tuning with preference data, while more modest gains under 1 point are shown in shallow green boxes. Decreases in performance are marked with red boxes.

4.3. Baselines

SoTA Models In this category, our benchmarks are established against, to the best of our knowledge, the strongest publicly available translation models. We first compare with ALMA-13B-LoRA, recognized as one of the top moderate-size language-model based translation systems, surpassing notable conventional models such as NLLB-54B in both WMT’21 and WMT’22. We also compare our results with TowerInstruct⁵, a recently released LLM-based translation model and a contemporary work in the field.⁶ Additionally, we evaluate against the zero-shot performance of the latest GPT-4 (gpt-4-1106-preview), currently shown to be the best translation model among all LLM-based translation systems (Xu et al., 2023; Zhang et al., 2023; Zeng et al., 2023; Jiao et al., 2023a). Lastly, we include comparisons with the WMT competition winners, representing the highest standard of translation models within the competition, though it is noted that the winning models vary across different language directions.⁷

SFT and DPO We also compare different training objectives. Given that CPO is designed to steer learning towards preferred data, a straightforward benchmark is to compare its performance against directly SFT on the same preferred data set. Furthermore, considering that CPO is an evolution of DPO, we also include a comparative analysis with DPO.

5 https://huggingface.co/datasets/Unbabel/

6 Note that TowerInstruct has used WMT’22 test data for training, so we exclude it from comparison on the WMT’22 test dataset.

7 The WMT winner systems used for comparison in each direction are provided in Appendix E.

4.4. WMT’21 and WMT’22 Results

We present the primary results for en→xx and xx→en in Table 2 and Table 3, respectively. Our emphasis is primarily on reference-free evaluation models, due to our analysis in Section 2, which questions the reliability of gold references and highlights that evaluations can be compromised by poor-quality references (Kocmi et al., 2023; Freitag et al., 2023). These models include KIWI-XXL, XCOMET, and a smaller yet popular model, Unbabel/wmt22-cometkiwi-da (hereinafter referred to as KIWI-22). Scores highlighted in bold represent the highest achieved across all systems. For a comprehensive comparison, we also include reference-based evaluations using sacreBLEU (Post, 2018) and COMET-22 (Unbabel/wmt22-comet-da) (Rei et al., 2022) in Appendix A.

Comparing With SoTA Models While ALMA-13B-LoRA ranks as one of the top moderate-size LLM translation models, it slightly trails behind GPT-4 and the WMT competition winners. However, the incorporation of CPO significantly enhances ALMA’s capabilities, bringing its performance to a level that is comparable to or even surpasses that of GPT-4 and WMT winners. For example, ALMA-13B-R achieves an average score of 85.74 on KIWI-XXL and 94.05 on XCOMET for en→xx translations. These scores outperform GPT-4, which scores 83.83 on KIWI-XXL and 93.23 on XCOMET, as well as the WMT winners, who score 84.81

Table 3: The overall results in xx→en for WMT’21 and WMT’22. The usage of color and boldface are the same in Table 2.

Table 4: The average performance in WMT’23 across all six directions, with the highest score among all systems highlighted in bold.

All training-objective baselines (SFT and DPO) use the ALMA-13B-LoRA model as a base. In Tables 2 and 3, we employ deep green boxes to indicate improvements exceeding 1 point, and shallow green boxes for improvements less than 1 point. Conversely, red boxes denote a decline in performance. We observe that SFT on preferred data marginally enhances the ALMA model’s translation capability for xx→en, and results in a slight deterioration for en→xx. Similarly, DPO slightly decreases model performance. In contrast, CPO demonstrates significant improvements across all translation directions.

4.5. WMT’23 Results

We show the average results across all six directions in Table 4, and provide the performance in each direction in Appendix F due to the space constraints. Consistent with observations from WMT’21 and WMT’22, ALMA-13B-R surpasses contemporary moderate-size LLM-based translators such as ALMA-13B-LoRA and TowerInstruct, and either matches or exceeds the performance of WMT winners.

Table 5: The influence of employing various reference-free models for creating preference data. The results illustrate that the final performance disparities are minimal whether using solely KIWI-XXL, XCOMET, or their combined ensemble.

5. Analyses

All analyses were conducted using the WMT’21 and WMT’22 test datasets, with their averaged performance being reported.

5.1. Are Translations Really Better or Just Metric-Preferred?

In our study, since the preferred data is selected by reference-free models and the same models are used for evaluation, we investigate the potential for ‘cheating’ in the scoring process. Specifically, we question whether the improved translation scores reflect genuinely better translations or whether they simply align more closely with the evaluation model’s preferences. This inquiry is addressed in two parts:

At the metric level, we examine if training a model on data preferred by a specific metric (such as KIWI-XXL) yields improvements that are consistent across other metrics. To investigate this, we reconstruct the preference data using only KIWI-XXL or XCOMET and re-train the ALMA-13B-LoRA model using the CPO method. The results, presented in Table 5, do not indicate a significant bias towards the metric used for selecting preferred data. We observed similar and consistent improvements across all metrics, regardless of the specific metric used to select the preferred data.

Table 6: An ablation study evaluating the significance of individual components in the CPO loss function, specifically analyzing how the preference learning loss \(\mathcal{L}_{\text{prefer}}\) and the log-likelihood loss \(\mathcal{L}_{\text{NLL}}\) each contribute to enhancing translation performance.

At the method level, we question whether training on metric-preferred data always leads to better scores on that metric, regardless of the method we use. However, the connection is not straightforward; for instance, SFT on preferred data paradoxically results in diminished performance across all three metrics, as shown in Table 2.

Consequently, our analysis supports the robustness and validity of using reference-free models like KIWI-XXL and XCOMET both for constructing preference data and for evaluation purposes, underscoring the absence of bias in this approach. Furthermore, Table 5 demonstrates that the choice between using KIWI-XXL, XCOMET, or an ensemble of both has a minimal impact on the results.

5.2. Ablation Study

CPO Loss Components The CPO loss function consists of two components: \(\mathcal{L}_{\text{prefer}}\) for preference learning, and \(\mathcal{L}_{\text{NLL}}\), which ensures the model does not deviate significantly from the preferred data distribution. To illustrate the significance of each term, we re-train the model exclusively with one of the components. It is important to note that training solely with \(\mathcal{L}_{\text{NLL}}\) equates to the baseline scenario of SFT on preferred data. As depicted in Table 6, the inclusion of both terms yields the optimal performance, while the absence of either leads to a decrease in performance.

Preference Data Components: Our preference data selection involves choosing preferred and dis-preferred translations from a triplet consisting of outputs from GPT-4, ALMA, and the gold reference. In Table 7, we evaluate the impact of each component by excluding either ALMA- or GPT-4-generated data from the preference triplet and re-training the model; the findings highlight the importance of ALMA-generated data for en→xx translations and GPT-4-generated data for xx→en translations.

Table 7: An ablation study assessing the significance of each component in the translation triplet. By excluding either ALMA or GPT-4 generated data from the preference triplet and re-training the model, we evaluate their respective impacts. The findings highlight the importance of ALMA-generated data for en→xx translations and GPT-4-generated data for xx→en translations.

Table 8: An examination of the impact of dis-preferred data quality, contrasting noised data with natural, high-quality translations receiving the lowest scores as dis-preferred data. The findings underscore the importance of natural and highquality dis-preferred data.

5.3. Does The Quality of Dis-preferred Data Matter?

In our experimental setup, dis-preferred data, though originating from strong translation models, receives the lowest scores when compared with two other translation outputs. A pertinent question arises: does the quality of dis-preferred data significantly impact model performance, and can high-quality (albeit imperfect) dis-preferred data aid in translation improvement? To explore this, we constructed a new set of preference data where the dis-preferred translations (\(y_l\)) are artificially generated, as opposed to being naturally derived high-quality translations.

In this new dataset, the preferred translation (\(y_w\)) remains the best of the three translation candidates, selected in the same manner as in Section 3.2. However, the dis-preferred translation is intentionally modified to be a noised version of \(y_w\). We applied random deletions of words with a probability of 0.15 and word swaps within a range of 1 with a probability of 0.3, following the method suggested by Zeng et al. (2023) for creating manually noised dis-preferred data. This approach produces worse translations that are unnatural and artificial.
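A minimal sketch of this noising procedure is given below; the exact implementation of Zeng et al. (2023) may differ in its details, so the function is an illustrative approximation using the stated deletion and swap probabilities.

```python
# Hedged sketch of the artificial noising used for the ablation's dis-preferred data:
# random word deletion (p = 0.15) and swaps with the adjacent word, i.e. range 1 (p = 0.3),
# applied to the preferred translation y_w. Details may differ from Zeng et al. (2023).
import random

def noise_translation(y_w: str, p_del: float = 0.15, p_swap: float = 0.3, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = y_w.split()
    # Random word deletion; keep at least one word so the output is never empty.
    words = [w for w in words if rng.random() > p_del] or words[:1]
    # Random swap with the neighbouring word (range 1).
    i = 0
    while i < len(words) - 1:
        if rng.random() < p_swap:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2  # skip ahead so a word is not swapped twice in a row
        else:
            i += 1
    return " ".join(words)

print(noise_translation("The CEP was founded in 2004."))
```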

Table 8 compares the performance when using these manually noised dis-preferred data versus the original, naturally occurring high-quality dis-preferred data. The results show a substantial decline in performance across all three metrics and both translation directions when the dis-preferred data is manually noised, underscoring the importance of the quality of dis-preferred data in enhancing translation performance.

6. Conclusion

In this study, we initially proposed the potential quality issues of gold references in machine translation tasks, highlighting instances where advanced translation models outperform these references. This finding challenges the conventional assumption of gold references as the optimal standard, impacting not only model training, which often relies on minimizing the difference between predicted tokens and gold references, but also potentially skewing results in reference-based evaluation metrics. Subsequently, we introduce Contrastive Preference Optimization, a more efficient variant of DPO. This method leverages both model-generated and reference data to guide the model in avoiding near-perfect yet flawed translations and learning superior ones. Our developed model, ALMA-13B-R, stands out as the first moderate-size LLM-based translation model to match, and in some cases surpass, the performance of cutting-edge systems such as GPT-4 and WMT competition winners, marking a significant advancement in the field of neural machine translation.

