MinWoo(Daniel) Park | Tech Blog


Model | MAP-Neo*

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-05-31

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

  • url: https://arxiv.org/abs/2405.19327
  • pdf: https://arxiv.org/pdf/2405.19327
  • html: https://arxiv.org/html/2405.19327v1
  • abstract: Large Language Models (LLMs) have made great strides in recent years to achieve unprecedented performance across different tasks. However, due to commercial interest, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing the training details. Recently, many institutions have open-sourced several strong LLMs like LLaMA-3, comparable to existing closed-source LLMs. However, only the model’s weights are provided with most details (e.g., intermediate checkpoints, pre-training corpus, and training code, etc.) being undisclosed. To improve the transparency of LLMs, the research community has formed to open-source truly open LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training corpus and training code) are being provided. These models have greatly advanced the scientific study of these large models including their strengths, weaknesses, biases and risks. However, we observe that the existing truly open LLMs on reasoning, knowledge, and coding tasks are still inferior to existing state-of-the-art LLMs with similar model sizes. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the first fully open-sourced bilingual LLM with comparable performance compared to existing state-of-the-art LLMs. Moreover, we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided. Finally, we hope our MAP-Neo will enhance and strengthen the open research community and inspire more innovations and creativities to facilitate the further improvements of LLMs.


TL;DR


Openness and Performance in Large Language Models: The Development of MAP-Neo-7B

  • Problem: existing large language models lack openness and transparency, and their performance degrades in certain languages.
  • Approach: develop MAP-Neo-7B, a fully open model that every researcher can access and reproduce, and that improves capability across languages.
  • Method: train on diverse data sources with a carefully built tokenizer, delivering a model that offers both transparency and strong performance.

[Prior Work and Problem Definition]

Large language models (LLMs) have become central to recent AI research, but most models are biased toward specific languages, especially English, and their development processes are not transparent, making reproduction and verification difficult. For example, the OLMo model is transparent but falls short in performance, particularly on languages other than English.

[Method]

[Tokenizer] The MAP-Neo-7B tokenizer uses byte-pair encoding (BPE) and is designed to handle characters from many languages efficiently. To maximize performance on Chinese and English in particular, specific character-handling rules were refined: for example, numbers are split into individual digits, and unknown UTF-8 characters fall back to byte-level pieces.

[Dataset Construction and Training] MAP-Neo-7B was trained on high-quality data. A bilingual training corpus of 4.5T tokens was built and used for pre-training. The data was collected from the web and went through high-quality filtering and cleaning. In particular, the training dataset includes challenging data such as programming code and mathematical text, with an emphasis on improving the model's ability to solve a wide range of problems.

[Performance Evaluation]

MAP-Neo-7B performed well across a wide range of benchmarks, recording high scores in Chinese and English understanding, mathematical reasoning, and code generation. To verify both transparency and performance, all datasets and the entire training process are fully released, so other researchers can reproduce and validate the model.


1 Introduction

Large language models (LLMs) have emerged as contributors to progress toward artificial general intelligence (AGI). These models show broad capabilities, including complex inference, creative writing, and handling of multiple languages. While many advanced models remain closed for commercial reasons, models such as LLaMA, BLOOM, and LLM360 have released key data and code, providing a more transparent development environment. OLMo improved its data processing pipeline and released training logs and intermediate checkpoints, but it still falls short of top-tier models in many areas.

This paper introduces MAP-Neo, a fully transparent bilingual LLM covering the entire development workflow. It systematically describes everything from training-data curation to model architecture, tokenizer training, and fine-tuning. To improve transparency and reproducibility, intermediate checkpoints and training logs are also released, and detailed benchmarks are provided for evaluating the model.


2 Related Work

Most prior models remained closed, lacking transparency. MAP-Neo addresses this by releasing all training data and intermediate checkpoints. This transparent approach enables independent verification by the research community and raises trust in the model. In terms of performance, MAP-Neo-7B shows strong results on GSM8K, MATH, and HumanEval, and it is the only model that satisfies every transparency criterion.

\[\text{Performance} \, \propto \, \text{Transparency} \, \times \, \text{Data Quality}\]

This expression conveys the idea that performance scales with the product of transparency and data quality. MAP-Neo leverages this relationship, achieving strong performance through open data and a systematic development process.


3 Tokenizer

The tokenizer was trained with the byte-pair encoding (BPE) algorithm on 50B samples. The algorithm iteratively merges the most frequent byte pairs in the data, reflecting the structural characteristics of each language and enabling efficient tokenization. MAP-Neo shows different compression rates for Chinese and English data, which stems from differences in character usage frequency and linguistic structure between the two languages.

\[\text{Compression Rate} = \frac{\#\text{Char}}{\#\text{Token}}\]

Chinese data typically shows a lower compression rate (fewer characters per token), which can affect the tokenizer's behavior. Analyses like this make it possible to adjust tokenizer settings and find configurations that perform well across languages.


4 Matrix Data Pile

  • Introduces Matrix, a 4.5T-token dataset for training a transparent bilingual LLM
  • Improves model performance through refined data collection and processing
  • Applies diverse data sources together with filtering and deduplication techniques

High-quality training data is essential for training language models and is a key driver of their progress. In large-scale LLM development, both the quantity and quality of data directly affect model performance. This work introduces Matrix, a bilingual training dataset of 4.5T tokens. Matrix comes with a transparent data collection and processing pipeline, including detailed processing methods for data gathered from diverse sources (web content, programming code, academic papers, and more).

4.1 Data Re-processing Pipeline

Datasets released in prior work are mostly English-based, and the resulting data often has quality limitations. The Matrix project aims to re-process such data to improve its quality. The re-processing consists of two main stages: filtering and deduplication.

4.1.1 Filtering

The reliability and consistency of content are key determinants of text quality. Heuristic rules are applied to corpora collected from diverse sources to remove low-quality data. This is done at both the document and sentence levels and includes removing duplicate text and text containing blacklisted terms.

4.1.2 Deduplication

Because duplicated text can reduce the efficiency of model training, removing it is an important step. MinHash LSH is used to remove near duplicates, while exact document deduplication compares entire documents. Distributed processing is used so that large volumes of data can be handled efficiently.

4.2 Web Content Crawl-from-Scratch Pipeline

To improve the quality of Chinese data, a separate pipeline was developed to crawl web content from scratch. This pipeline collects data directly from Chinese web pages and then applies filtering and deduplication to it.

4.2.1 Filtering

Because HTML-converted data makes up a large share of the Chinese datasets, the filtering focuses on removing HTML-related artifacts. The filtering rules are also adjusted for the linguistic differences between Chinese and English, and a tokenization method suited to the characteristics of Chinese text is applied.

4.2.2 Deduplication

Deduplication of the Chinese data includes exact document deduplication, MinHash deduplication, and similar line deduplication. In particular, similar line deduplication splits the text into lines and compares the similarity between lines, removing a line when it is judged similar to an earlier one.

4.3 Document Conversion Pipeline

Extracting high-quality text from digital documents involves many challenges. The goal is to analyze a document's layout information and identify its various layout elements. Open-source solutions such as PP-StructureV2 are used, and the process includes several stages: layout detection, element recognition, ordering, and post-processing.

Through these steps, a high-quality and transparent dataset for language model training is constructed, which in turn maximizes model performance. These data processing techniques will serve as an important reference for future LLM development.


5 Model

  • MAP-Neo models: transformer-based architecture with tuned hyperparameters
  • Two-stage pre-training strategy for better performance and reliability
  • Stronger code generation and language understanding through continued training and refinement

5.1 Model Architecture

The MAP-Neo models are based on the transformer decoder architecture and include several enhancements. They are trained with a context length of 8192 tokens and have the following characteristics:

  • Multi-Query Attention: the 2B model uses multi-query attention with a single key-value head configuration to improve efficiency, which has proven effective at smaller scales.
  • Rotary Position Embeddings (RoPE): rotary positional embeddings are used at each layer instead of traditional absolute positional embeddings, minimizing model size.
  • RMSNorm: each transformer sub-layer is normalized with RMSNorm for stable training.
  • Activation function: SwiGLU is used as the activation function.

5.2 Model Scale Hyperparameters

This work compares two model scales, 2B and 7B parameters. Both are standard dense transformers with the following hyperparameter relationship:

\[d_{\text{ff}} = 8 \times d_{\text{model}}\]

\(d_{\text{model}}\) denotes the model dimension; both models are trained with the same vocabulary and batch size.


6 Pre-training

6.1 Fundamental Phase: General Ability Acquisition

In the fundamental phase, a two-stage learning rate scheduler gives the model a strong general text generation capability. The learning rate is modeled as follows:

\[f(t) = \begin{cases} \eta_a + (\eta_{\text{max}} - \eta_a) \frac{t}{t_{\text{warmup}}} & \text{if } t \leq t_{\text{warmup}} \\ \eta_b + \frac{1}{2}(\eta_{\text{max}} - \eta_b) \left(1 + \cos\left(\pi \frac{t-t_{\text{warmup}}}{t_{\text{total}}-t_{\text{warmup}}}\right)\right) & \text{if } t_{\text{warmup}} < t \leq t_{\text{total}} \end{cases},\]

where \(t\) is the current timestep, \(t_{\text{warmup}}\) is the duration of the warmup phase, and \(t_{\text{total}}\) is the total number of training timesteps.

6.2 Decay Phase: Improvement and Rectification

A tokenizer training issue caused failures on code generation tasks. To address this, a dedicated decay phase was introduced. Its learning rate follows an exponential decay:

\[f(t) = \eta_c \times 0.5^{\frac{t}{T}} \quad \text{if } t \leq t_{\text{decay}},\]

In this phase, instruction data and a higher proportion of code data are used to improve the model's performance. These adjustments let the model handle complex coding tasks effectively and make it adept at generating professional responses across diverse domains.


7 Alignment

  • Supervised fine-tuning (SFT) and iterative Direct Preference Optimization (DPO) to align the model with human behavior
  • Multi-stage training to improve chat ability and foundational skills
  • Preference-based learning to further align and improve the model's behavior

7.1 Supervised Fine-tuning

7.1.1 Data

The first phase uses more than 2 million instruction examples to strengthen the model's foundational abilities (coding and math). This phase includes the full OpenHermes 2.5 dataset, with the portions related to the TheoremQA benchmark excluded to prevent data leakage. The second phase collects more than 100k multi-turn dialogue examples to improve the model's chat ability, focusing on improving chat skills while preserving the foundational capabilities acquired in the first phase.

7.1.2 Training

Supervised fine-tuning uses the next-token prediction objective as the training task. The first phase trains for 3 epochs on more than 2 million instruction examples to strengthen foundational abilities, and the second phase trains for 1 epoch on more than 100k multi-turn dialogue examples to improve chat ability.

7.2 Iterative Direct Preference Optimization (Iterative DPO)

DPO is a simple and effective method for optimizing a language model based on human preferences. It converts the preference loss into a loss function over the language model, removing the need for explicit reward modeling or reinforcement learning. DPO directly estimates the language model's parameters via maximum likelihood estimation on a preference dataset.

\[L_{\text{DPO}}(\pi_{\theta}; \pi_{\text{sft}}, D) = -\mathbb{E}_{(x,y_{\text{w}},y_{\text{l}}) \sim D} \left[ \log \sigma\!\left( \beta \log \frac{\pi_{\theta}(y_{\text{w}}|x)}{\pi_{\text{sft}}(y_{\text{w}}|x)} - \beta \log \frac{\pi_{\theta}(y_{\text{l}}|x)}{\pi_{\text{sft}}(y_{\text{l}}|x)} \right) \right].\]

This objective trains the model to widen the gap between the log-probability ratios of the preferred response \(y_{\text{w}}\) and the dispreferred response \(y_{\text{l}}\). The process is performed iteratively: in each iteration, response pairs are generated, a reward model labels the responses, and the LLM is trained with the DPO loss.

The iterative DPO process consists of three stages:

  1. Generate paired responses
  2. Label the responses with a reward model
  3. Train the LLM with the DPO loss

This approach brings the model into closer alignment with human values, enabling it to generate natural, helpful, and contextually accurate responses.


10 Evaluation

  • Performance evaluation of MAP-Neo against baselines
  • Improved results from strengthened code, math, and instruction-following abilities
  • Better chat-model performance through the iterative DPO method

10.1 Base Model Performance

Key Results

The MAP-Neo base models were compared against well-known LLMs such as LLaMA3-8B and Mistral-7B on standard academic benchmarks. The evaluation used a variety of public benchmarks covering both English and Chinese, including datasets that test language understanding and inference ability.

Multiple-choice problems are handled with perplexity-based evaluation. For generation-based datasets, free-form text is generated and the results are analyzed. The evaluation metric is computed with the following expression:

\[\text{Perplexity}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i|w_{i-1},...,w_1)\right)\]
where \(W\) is the word sequence, \(N\) is the sequence length, and \(p(w_i \mid w_{i-1}, \dots, w_1)\) is the conditional probability of word \(w_i\) given its preceding context. This quantifies how well the model understands and predicts the data.
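As a quick numerical illustration of the formula, perplexity follows directly from per-token conditional log-probabilities; the values below are made up purely for illustration:

```python
# Minimal sketch: perplexity from per-token conditional log-probabilities
# log p(w_i | w_1, ..., w_{i-1}). The values below are made up for illustration.
import math

log_probs = [-2.1, -0.7, -1.5, -3.2, -0.4]
perplexity = math.exp(-sum(log_probs) / len(log_probs))
print(f"perplexity = {perplexity:.2f}")
```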

10.2 Aligned Model Performance

Key Results

Several benchmarks were selected to evaluate real conversational performance; together they assess a broad range of the model's abilities.

When training the model with iterative DPO, maximum likelihood estimation on the preference dataset \(D\) is performed to optimize the model's parameters:

\[\max_{\theta} \mathbb{E}_{(x,y_{\text{w}},y_{\text{l}}) \sim D} \left[ \log \sigma\!\left( \beta \log \frac{\pi_{\theta}(y_{\text{w}}|x)}{\pi_{\text{sft}}(y_{\text{w}}|x)} - \beta \log \frac{\pi_{\theta}(y_{\text{l}}|x)}{\pi_{\text{sft}}(y_{\text{l}}|x)} \right) \right]\]

where \(\pi_{\theta}\) is the model being trained and \(\pi_{\text{sft}}\) is the supervised fine-tuned base model. This process helps the model better reflect human preferences and improves its performance.

This paper introduces MAP-Neo, a fully open bilingual LLM suite, to advance the transparency and accessibility of LLMs. By sharing every step in detail, from data curation and the pre-training corpus to model training and evaluation, it enables the academic and open-source communities to push transparent NLP research forward. MAP-Neo also narrows the gap with industry-level models (which are usually closed) while improving inference, instruction following, and coding ability.


1 Introduction

The advent of generalist large language models (LLMs) such as GPT-4 [1], Claude [4], and Gemini [80] has significantly expanded the boundaries of Natural Language Processing (NLP) and is paving the way towards Artificial General Intelligence (AGI). These models exhibit universal capabilities, including complex reasoning [116, 89], role-playing [107], creative writing [105], psychological assessment [112], scientific education [18], and music generation [115, 75, 29], among others. However, the most advanced ones remain closed-source due to commercial interests [1, 4, 80]. In this paper, we argue that open-source and transparent LLMs are essential for both the democratization of LLMs and further academic research, especially considering the substantial resources these models consume.

Previous works have released numerous open-source or even transparent LLMs. For example, the LLaMA series [101, 102, 3] released the weights, thereby significantly boosting the development of the open-source LLM community. However, they are not transparent because they do not disclose the details of their training data. BLOOM [86] trained a multilingual language model with 176 billion parameters and open-sourced its model weights, intermediate checkpoints, and training corpus. Models like LLM360 [66] and Pythia [9] further provided their training codes, optimizer state checkpoints, analysis codes, and data pipelines.

These models make significant contributions to building transparent ecosystems, yet generally lag behind industry-level LLMs such as LLaMA [3], Mistral [48] and Yi [113], etc. OLMo [36] has made a great stride in narrowing this gap by improving pre-training data and data processing pipelines, and introducing more open-source components, including training logs and ablations. Nonetheless, it remains less proficient, especially in areas like coding (HumanEval [15]), reasoning (MATH [41], GSM8K [23]), knowledge (MMLU [40]), and multilingualism (CMMLU [60]).

To remedy these issues, we introduce MAP-Neo, a fully open-source and transparent bilingual LLM suite that achieves superior performance to close the gap with closed-source models. Specifically, the entire workflow of building an LLM includes:

  1. Data Curation Pipeline: We provide the code for the curation and cleaning of training data (both English and Chinese), including a stable OCR system, the data recalling mechanism in DeepSeek-Math [89], the integration of previous open-source data processing pipelines, and support for distributed data processing based on Spark2, among others.
  2. Data: We release our pre-training corpus, namely Matrix Data Pile, along with the training data for supervised fine-tuning and alignment training.
  3. Model Architecture: We provide the codes and details of our modeling architecture.
  4. Model Training: We offer the training codes for our tokenizer, base models, instruction-tuned models, and aligned models. Additionally, we address some issues of the Megatron-LM framework3, enhancing its support for more robust and efficient distributed training. Moreover, we introduce the NEO Scaling Law designed to optimize scaling up LLMs using a pre-training dataset sourced from diverse corpora.
  5. Model Checkpoints: We not only release the final models on HuggingFace but also make the intermediate checkpoints available for reproducibility.
  6. Infrastructure: This report details the infrastructure for stable training.
  7. Evaluation: We also provide detailed evaluation codes and thorough evaluation settings for benchmarking the performance of LLMs.
  8. Analysis and Lessons: This report elaborates on numerous techniques and recipes, such as optimization tricks at different phases of pre-training, and offers insights into building LLMs through rigorous analysis and ablations.

Our work is a milestone towards fully transparent LLMs with advanced abilities, even competitive with the top closed-source LLMs. Notably, our contribution is not just a novel foundational model but also a comprehensive handbook for building LLMs from scratch, covering the entire workflow. We believe that our model provides a critical reference for the community, particularly for non-English regions of the world engaged in LLM research.

2 https://spark.apache.org/

3 https://github.com/NVIDIA/Megatron-LM

Table 1: Comparison with other open-source large language models (LLMs). All metrics are obtained using the same evaluation manner, and the details are shown in Table 9. Non-transparent models are listed above the dashed line, while the transparent LLMs are shown below.

The development of open-source large language models (LLMs) is pivotal for advancing artificial intelligence research and applications. Recent efforts in this domain have been focused on not only enhancing model performance [48, 3] but also ensuring transparency and reproducibility [9, 66, 36, 128]. Our model, MAP-Neo-7B, emerges as the new lead in this evolving landscape, as shown in Table 1, which balances performance and transparency.

The MAP-Neo model series represents a step forward in emphasizing full transparency, aligning it alongside other contemporary models such as Mistral [48], LLaMA3 [3], Pythia [9], Amber [66], and OLMo [36]. Unlike these models, which often lack either intermediate checkpoints, comprehensive data cleaning processes, or accessible pre-training corpus and reproduction code, MAP-Neo excels by integrating all these elements. This commitment to the openness of MAP-Neo facilitates in-depth analysis and independent validation by the research community.

Performance-wise, MAP-Neo-7B demonstrates superior capabilities across a broad scope of benchmarks including Chinese and English understanding on C-EVAL [46] and MMLU [20], mathematical ability on GSM8K [23] and MATH [41], and code ability on HumanEval [15]. Notably, MAP-Neo-7B is the only model in our comparative analysis to achieve all checks in transparency, as well as the highest scores across all tests compared with other transparent LLMs, underscoring the effectiveness of the training and the quality of the data.

The most similar work to MAP-Neo is OLMo [36], which is the pioneering work to fully open-source LLMs. However, their performance is compromised in several aspects like knowledge, coding, and mathematical reasoning. Moreover, OLMo cannot handle languages beyond English. MAP-Neo sets a new standard for transparency and performance in the field of open-source LLMs. By fostering a fully transparent development process, MAP-Neo not only enhances its utility and trustworthiness but also provides a valuable framework for future research, promoting further advancements and collaborative efforts in the community.

3 Tokenizer

We train our tokenizer using the byte-pair encoding (BPE) algorithm [88] via the implementation of SentencePiece [56]. The training data consists of 50B samples from the pre-training corpus, and the maximum length is cut to 64K. We assign higher sampling weights to code, math, and high-quality academic data. To balance the computational efficiency and model performance, we propose to set the vocabulary size to 64000 and constrain the max sentence-piece length to 16 to improve the Chinese performance.

Notably, we slice all numbers into individual digits and fall back unknown UTF-8 characters to byte granularity. We do not use any normalization strategy on the training samples and do not add dummy prefixes. The character coverage rate is set to 0.9999. Particularly, the remove extra whitespaces parameter is set to False, which is turned on by default in the SentencePieceTrainer. This setting can severely impact code performance during pre-training, as normal code indentation is treated as a single space. We encountered a specific issue during the initial phase of our model’s pre-training. Initially, we did not disable the ‘remove extra whitespaces’ parameter, which is enabled by default in the SentencePieceTrainer. In the training process, we observe steady improvements in the QA reasoning and mathematics benchmarks, but the code metrics exhibit fluctuations and do not show expected improvements. To address this issue, we fixed this bug in the second phase of our training (§6.2), which stabilizes and significantly improves the code metrics. Furthermore, we observe that this issue is well addressed in the decay phase training stages under the new tokenizer settings, where rapid improvements are achieved.
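A minimal sketch of what such a configuration could look like with the SentencePiece BPE trainer, reflecting the settings described above (vocabulary size 64000, max piece length 16, digit splitting, byte fallback, no normalization, no dummy prefix, character coverage 0.9999, and the 'remove extra whitespaces' option disabled); the corpus path and output prefix are placeholders, not the released artifacts:

```python
# Hedged sketch of the tokenizer settings described above; file names are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="matrix_tokenizer_corpus.txt",   # hypothetical sampled training corpus
    model_prefix="map_neo_tokenizer",      # hypothetical output prefix
    model_type="bpe",
    vocab_size=64000,                      # vocabulary size stated in the paper
    max_sentencepiece_length=16,           # cap piece length (helps Chinese performance)
    character_coverage=0.9999,
    split_digits=True,                     # slice numbers into individual digits
    byte_fallback=True,                    # unknown UTF-8 characters fall back to bytes
    normalization_rule_name="identity",    # no normalization of training samples
    add_dummy_prefix=False,                # no dummy prefix
    remove_extra_whitespaces=False,        # keep code indentation intact
)
```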

Moreover, we also investigate the compression rates across various categories of data, categorized by both language (Chinese and English) and data source quality (high-quality and web-sourced), as shown in Table 2. First, the high-quality data (HQ), which includes complex reasoning, mathematical, and general knowledge texts, shows different compression rates between Chinese (HQ cn) and English (HQ en): the HQ cn category has a compression rate of 1.577 characters per token, while the HQ en category exhibits a higher rate of 3.311. Second, English web-sourced data (Web) likewise comprises more characters per token than its Chinese counterpart. This suggests a significant variation in tokenization efficiency or character usage between languages, possibly due to linguistic structure and the tokenization method. Third, it should be noted that even with similar compression rates, tokenizer settings can cause significant fluctuations during pre-training. Therefore, it remains necessary to further investigate tokenization strategies for subsequent usage scenarios.
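The compression rate in Table 2 is simply characters per token; a small sketch of how one might measure it for a trained tokenizer (the model path is a placeholder):

```python
# Characters-per-token "compression rate" of a trained tokenizer on a sample text.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="map_neo_tokenizer.model")  # placeholder path

def compression_rate(text: str) -> float:
    tokens = sp.encode(text, out_type=int)
    return len(text) / max(len(tokens), 1)   # higher = more characters packed per token

print(compression_rate("The quick brown fox jumps over the lazy dog."))
```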

Table 2: Average Compression Rates by Category. These subsets are not uniformly proportioned in the training set. A detailed distribution is shown in Appendix 18.

4 Matrix Data Pile

Figure 2: Statistics of the Matrix Pile Data Distribution: The inner pie chart represents the language distribution, while the outer loop indicates the proportion of meta-categories in the corpus.

It is widely recognized that a well-constructed training corpus is essential for training LLMs. The training corpus serves as the fuel driving advancements in language modeling, as demonstrated by the emergent capabilities of models like ChatGPT, Claude, Gemini, and Llama. However, due to intellectual property restrictions, the pre-training data and processing toolkits of these (partially) proprietary LLMs are not disclosed upon release. Although the open-source research community has made substantial efforts to increase transparency in the collection and processing pipeline of language model pre-training data [9, 86, 95], the development of fully open-sourced LLMs still lags behind proprietary LLMs to some extent, primarily due to gaps in the quantity and quality of the datasets.

To address the pressing need for more diverse and transparent datasets in language modeling, we introduce Matrix, a bilingual pre-training corpus of 4.5T tokens. Upon its release, Matrix could be the largest transparent LLM pre-training corpus to our best knowledge. Specifically, Matrix provides the details of the data collection and processing along with a high-performance toolkit. Additionally, we design Matrix based on the idea of retrieving, filtering, and cleaning high-quality data under various practical circumstances, which are discussed as follows:

  • Given a set of existing (English) pre-training datasets, how do we re-process and improve the quality? §4.1
  • How do we construct a large-scale, topic-comprehensive corpus from scratch, on the less explored Chinese content? §4.2
  • If we have enormous printed documents, how do we build an efficient and effective system to extract viable textual contents? §4.3
  • When specifying a domain of interest, how do we find relevant high-quality data from the wild of web content? §4.4

The final composition of the corpus is as follows: 52.55% from Common Crawl, 22.29% from programming code, and the rest from academic papers, books, and other printed materials, as illustrated in Figure 2. The detailed methodologies for processing these sources are described in the subsequent sections, and a comprehensive illustration of the sources is provided in Table 16.

Table 3: The composition sources of re-processed English web subset. The proportion denotes dividing the size of the current dataset by the total size of the whole dataset.

4.1 Re-processing Pipeline for Open Datasets

Although several processed pre-training corpora (mostly in English) have been released by previous works [95, 74], we argue that there is still room for a more meticulously designed pipeline to improve the existing data. Besides, it should be mentioned that existing LLMs can be easily improved by continuous pre-training with high-quality data. Therefore, we further re-process the selected web content-based corpora to produce the English subset of the Matrix data mixture. The source comes from the Head and Middle parts of RedPajama-Data-V2 [25], the CC part of Dolma [95], the EN part of Cultrax [72], the Refined-Web part of Amber [66], SlimPajama [94], and Falcon [74]. The precise distribution of our English dataset is listed in Table 3. The procedure involves filtering and multi-step deduplication. The diagram in Figure 3a shows the processing orders and the retention rates.

4.1.1 Filtering

To further filter out the relatively low-quality corpus from open-source datasets, we propose to use heuristic rules for text filtering. These rules are designed to identify and remove poor-quality data, thereby preventing potential model performance degradation caused by a flawed pre-training corpus. Since our composite dataset is made up of corpora from multiple sources, we adapt well-designed cleaning methods [74, 14, 76, 78] and tailor our rules for each one to ensure quality consistency.

For the RedPajama-Data-v2 dataset [25], which provides quality annotations for each text, we integrate our heuristic rules with these annotations to refine data quality evaluation and further perform random sampling on the dataset to confirm the thresholds for every rule. For datasets lacking quality annotations, we apply the established rules and thresholds derived from RedPajama-V2, while customizing them to align with the unique characteristics of each dataset. For example, the Dolma dataset [95] comprises six subsets, namely Wikipedia, PeS2o, Stack Code, Gutenberg, C4, and CC, each with different data characteristics. Given the unique characteristics of each subset, we conduct individual sampling and evaluation to ensure that the modifications in rules and thresholds are aligned with our filtering requirements. Specifically, for the CC subset, we adjust the unique word and text length thresholds. For the Gutenberg subset, which predominantly contains book texts, we apply only a few rules to avoid the time-consuming process of executing extensive heuristic checks on long texts.

The filtering process involves: 1) Document-level and sentence-level filtering to ensure text length adequacy, character meaningfulness, and consistency; 2) Duplicate text removal, including n-grams and sentences; 3) Sensitive word check to eliminate texts containing any terms from a blacklist.

4.1.2 Deduplication

It has been reported that repetitive text can lead to a decline in model performance [58, 51, 42], which makes deduplication a crucial step in corpus processing. By eliminating duplicates, we can significantly reduce the rate of emitted memorization and make model training more efficient [58]. Repetitions can be categorized into exact duplicates and near duplicates. For exact duplicates, we employ exact document deduplication to remove them. For near duplicates, we utilize Minhash LSH deduplication to remove them as much as possible. In addition, there are instances where parts of the text are completely duplicated, and in these cases, the Minhash method struggles to remove them. To address this, we have adopted two methods for partially removing such content: paragraph deduplication and exact substring deduplication.

Exact Document Deduplication
Exact document deduplication is a method used to evaluate an entire text to determine if it is identical to another. If it is found to be exactly the same, the duplicate will be removed. For processing data in English, Spark is employed to handle the dataset. Due to the vast volume of data, there may be issues with insufficient memory. The solution to this problem involves batching the text data into separate buckets for storage. Each bucket’s data is then processed in turn to remove duplicates.

Minhash LSH Deduplication
Minhash [13] is an excellent method for removing near duplicates, especially for web page data, and is widely used for similarity search and duplicate detection in large datasets [104, 33, 37]. It can handle very common scenarios where the text content is essentially the same, but the scattered template blocks of the web pages are different. The principle of MinHash is to represent a set with smaller hash values, which can then be used to estimate the Jaccard similarity between two sets:

\[\text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}\]

MinHash involves using multiple distinct hash functions that map each element of a set to a larger numerical domain. For each set, these multiple hash functions are applied to all elements within the set, and the smallest hash value produced by each hash function is chosen as its minimum hash value. Thus, each set can be represented by a vector of these minimum hash values, forming the set’s MinHash signature. For text data, an n-gram approach can be used to construct a set.

After obtaining the signature of the text, Locality-Sensitive Hashing (LSH) [35] is employed to rapidly identify candidate set pairs that exceed a certain threshold in Jaccard similarity. This accelerates the search process for similar items. The specific approach divides the signature into several bands, each containing several hash values. Another hash function is then used to map each band to a hash bucket. All sets with the same band hash are mapped to the same hash bucket. All set pairs in the same hash bucket are considered candidate similar pairs without further specificity regarding their similarity. Here, we utilize 128 unique hash functions to form signatures, divided into 9 bands, with each band containing 13 hash values. Consequently, the Jaccard threshold is set at 0.8.

Upon identifying similar pairs, connected components are constructed. Within each component of the connected components, one text is retained while the others are eliminated. For processing vast amounts of data efficiently, a distributed implementation [53] based on map-reduce is adopted.
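A toy sketch of the candidate-pair detection step described above, written with the `datasketch` library (an illustrative choice; the paper's own distributed map-reduce implementation is not shown here). The 128 permutations match the text, and `threshold=0.8` stands in for the 9-band × 13-row banding:

```python
# Near-duplicate candidate detection with MinHash + LSH (datasketch is an
# illustrative choice, not the paper's implementation).
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128, n: int = 5) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - n + 1, 1)):   # character n-grams as the set elements
        m.update(text[i:i + n].encode("utf-8"))
    return m

docs = {
    "doc1": "MAP-Neo is a transparent bilingual large language model suite.",
    "doc2": "MAP-Neo is a transparent bilingual large language model suite!",
    "doc3": "Completely unrelated text about something else entirely here.",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
for key, text in docs.items():
    lsh.insert(key, signature(text))

print(lsh.query(signature(docs["doc1"])))   # doc1 and (very likely) doc2 share a bucket
```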

Paragraph Deduplication
Paragraph deduplication involves removing all duplicate paragraphs within a text. A paragraph is defined as a section of text separated by the newline UTF-8 character \n. Paragraph deduplication is an effective method for removing website navigation headers, advertisements, and similar elements. Since paragraph deduplication involves deleting parts of the text, it may cause some interference with content analysis.

Its concrete implementation first involves splitting the text into multiple paragraphs using newline utf-8 character \n, with each paragraph being tagged with its corresponding document id and offset in the text. Then, each paragraph is hashed using SHA256. Next, the hash values are deduplicated. After deduplication, the deduplicated text is restored according to the document ID and offset.
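A single-process sketch of this procedure (split on "\n", SHA256-hash each paragraph, keep only first occurrences); the real pipeline additionally tracks document IDs and offsets and runs distributed:

```python
# Paragraph deduplication sketch: hash each paragraph and keep the first occurrence.
# (A production version would likely skip empty paragraphs and run distributed.)
import hashlib

def dedup_paragraphs(docs: dict[str, str]) -> dict[str, str]:
    seen: set[str] = set()
    deduped = {}
    for doc_id, text in docs.items():
        kept = []
        for paragraph in text.split("\n"):
            digest = hashlib.sha256(paragraph.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(paragraph)
        deduped[doc_id] = "\n".join(kept)
    return deduped
```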

Exact Substring Deduplication
This method follows [58]. Given the diversity of languages, when the length of repeated text is sufficiently long, it is highly likely that the copies are either derived from one another or sourced from the same reference. Therefore, when two texts \(t_i\) and \(t_j\) share a sufficiently long substring \(t[a..a+k]\), one of them is removed. For the selection of the length threshold, we adhere to the setting in [58], choosing \(k=50\). Due to our distributed environment, the memory of a single node is insufficient to hold all the data. Therefore, we did not adopt the implementation in [58]. In our work, we segment each text into sliding windows of 50 characters with a step size of 1. We then compute the SHA256 hash value for each window along with its corresponding document ID and offset. Subsequently, for windows with identical hash values, we mark them as duplicates except the first one. Finally, using the text ID and offset, we restore the original strings and decide whether to delete a segment based on the duplicate marker.
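A simplified, single-machine sketch of the sliding-window hashing described above; the distributed hash table, offset bookkeeping, and final text restoration are omitted:

```python
# Mark duplicate 50-character windows across documents via SHA256 hashing.
import hashlib

K = 50  # substring length threshold, following [58]

def mark_duplicate_spans(docs: dict[str, str]) -> dict[str, list[tuple[int, int]]]:
    seen: set[str] = set()
    spans: dict[str, list[tuple[int, int]]] = {doc_id: [] for doc_id in docs}
    for doc_id, text in docs.items():
        for offset in range(max(len(text) - K + 1, 0)):
            digest = hashlib.sha256(text[offset:offset + K].encode("utf-8")).hexdigest()
            if digest in seen:
                spans[doc_id].append((offset, offset + K))   # span to delete later
            else:
                seen.add(digest)
    return spans
```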

(a) Re-processing retention rates for the corpora in §4.1. (b) Processing retention rates for the corpora crawled from scratch in §4.2.

Figure 3: Funnel Diagram for the two main data pipelines. The darker part of each row represents the retention proportion for each processing step and the lighter one for the filtered corpora.

4.2 Corpora Crawl from Scratch Pipeline

We further provide a pipeline to crawl and process the web content from scratch and showcase it with the Chinese language data, which could be a step-by-step guide for follow-up research to build a new up-to-date corpus. We take the corpus produced in such a pipeline as the Chinese subset of Matrix, where 80.6% is derived from the Chinese web pages we crawled and others from several open datasets, as listed in Table 4. The pipeline overview and details are illustrated in Figure 3b.

Table 4: The composition sources of the Chinese web subset.

4.2.1 Filtering

The filtering rules for Chinese datasets are specifically tailored to address their unique challenges, differing from those applied to relatively well-processed English datasets in §4.1. Considering the large proportion of HTML-converted data in Chinese datasets, we focus intensively on eliminating HTML-related artifacts and rectifying textual inconsistencies. Furthermore, given the significant linguistic differences between Chinese and English, we conduct targeted sampling of documents within Chinese datasets, which aims to reassess and adjust the thresholds and details of our filtering rules, ensuring their suitability for the unique language characteristics of Chinese text. For example, we refine the rules to distinguish between ‘characters’ and ‘words’ in Chinese texts, adapting the tokenization method accordingly.

Our Chinese filtering steps are similar to the rules adapted to filter the Massive Appropriate Pre-train Chinese Corpus (MAP-CC) [30]: 1) Data format unification to boost processing efficiency. 2) URL removal. This step is conducted in two stages: first, removing texts with URLs listed in Blacklist T1, followed by a comprehensive sweep to eliminate residual URLs. 3) Sentence-level and document filtering to discard text that is excessively brief, substandard, or logically incoherent. 4) Duplicate removal, including n-grams and sentences.

4.2.2 Deduplication

The deduplication of Chinese data includes Exact Document Deduplication, MinHash Deduplication, and Similar Line Deduplication. Due to difficulties in deploying Spark in the environment for processing Chinese, we have re-implemented the first two methods. For Exact Document Deduplication, there are slight differences from the implementation for English, mainly to save memory, where we have adopted a Bloom Filter approach and set the false positive rate of the Bloom Filter to 0.001. Discussions on Exact Document and MinHash LSH Deduplication can be found in §4.1.2.

We did not use exact substring deduplication because when crawling web pages, it is common to repeatedly crawl the same content multiple times in a single document. Additionally, when extracting the main text from HTML, there is often a loss of one or two words. The combination of these two situations violates the assumption in [58] that “it is rare for the same idea to be expressed identically in multiple documents unless one expression is derived from the other, or both are quoting from a shared source.” Therefore, after exact substring deduplication, there will be cases where extra words are retained, greatly reducing the readability of the text. Hence, we propose a Similar Line deduplication method to address this issue.

4.2.3 Similar Line Deduplication

To address the scenario where identical content appears multiple times within a text, a direct method is to divide the text into lines using specific delimiters and then compare the similarity between each line. If they are similar, the subsequent line is removed. The division of lines includes the use of the following delimiters: “[”, “.”, “!”, “?”, “\”, “. . . . . . ”, “]”. We use edit distance to judge whether two lines \(L_1\) and \(L_2\) are similar as follows:

\[\text{isSimilar}(L_1, L_2) = \begin{cases} \text{True} & \text{if } \min(|L_1|, |L_2|) \geq 15 \wedge \text{editDist}(L_1, L_2) < 0.1 \times \min(|L_1|, |L_2|) \\ \text{True} & \text{if } \min(|L_1|, |L_2|) < 15 \wedge L_1 = L_2 \\ \text{False} & \text{otherwise} \end{cases},\]
where \(|L|\) is the length of line \(L\) and “editDist” is short for edit distance.

Due to the computational complexity of calculating edit distance being \(O(\text{len}(L_1) \times \text{len}(L_2))\), to accelerate this process, we additionally propose two methods to judge dissimilarity:

  1. Is the length difference between the two lines greater than one-tenth of the length of the shorter line?
  2. Is the ratio of the intersection of the sets of characters and the union of the sets of characters in \(L_1\) and \(L_2\) less than one-third?

Note that the first method has a computational complexity of \(O(1)\), and the second method has a complexity of \(O(\text{len}(L_1) + \text{len}(L_2))\). Thus, these methods can significantly improve the speed of calculation. Clearly, if either of the above two questions is positive, the lines cannot be considered similar. Otherwise, we calculate \(\text{isSimilar}(L_1, L_2)\) to obtain the similarity between \(L_1\) and \(L_2\).
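A compact sketch of the whole similarity check, combining the two cheap rejection tests with a plain dynamic-programming edit distance (the concrete implementation in the paper may differ):

```python
# Similar-line check: quick O(1)/O(len) rejections first, edit distance last.
def edit_dist(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_similar(l1: str, l2: str) -> bool:
    short, long_ = sorted((len(l1), len(l2)))
    if short < 15:                                   # short lines must match exactly
        return l1 == l2
    if long_ - short > short / 10:                   # quick check 1: length gap too large
        return False
    s1, s2 = set(l1), set(l2)
    if len(s1 & s2) / len(s1 | s2) < 1 / 3:          # quick check 2: low character overlap
        return False
    return edit_dist(l1, l2) < 0.1 * short           # final edit-distance criterion
```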

4.3 Document Conversion Pipeline

The documents are usually better formatted, in concentrated topics, and with more consistent expressions compared to noisy web content. However, it seems to be a gold mine of high-quality corpus except that the golds lie deeply under the digital dirt. Such digital documents are mostly stored as standard PDFs with diverse layouts or scanned images with inconsistent quality, making it challenging to build datasets upon. We observe two core issues in designing an effective conversion pipeline to extract plain text from documents: i) analyzing layout information and identifying different layout elements including text, titles, captions, images, tables, and formulas, and ii) recognizing the relationships among these layout components.

Figure 4: The document conversion framework is composed of various sub-models for different parts.

We survey the existing open-source solutions for document conversion and find some distinguished projects with good performances: PP-StructureV2 [59], Marker4, Vary [108], and Nougat [11]. However, along with their merits, each of them exhibits limitations that could be addressed to further enhance performance: PP-StructureV2 cannot recognize LaTeX format content and lacks necessary post-processing stages; Marker and Texify5 support few languages and do not process figures effectively; Nougat has limited support for multi-column data and recognized languages; Vary and Vary-toy require considerable computational resources. Therefore, we propose a framework consisting of disentangled processing components, allowing us to leverage the strengths of these models together. For example, we utilize Marker for enhanced language support and PP-StructureV2 for efficient layout parsing. As illustrated in Fig. 4, our document conversion framework is comprised of four parts: Layout Detection, Element Recognition, Ordering, and Post Process. The decoupling between each module enhances interpretability and simplifies the upgrade, addition, and replacement of various components.

Layout Detection segments the document into multiple parts such as formulas, text, headers, and footers. The Pipeline employs a lightweight target detection model provided by PP-StructureV2, which is computationally efficient and performs exceptionally well. This model’s performance is further enhanced by employing the FGD (Feature Gradient Descent) algorithm, which optimizes feature extraction for more accurate layout detection.

Element Recognition incorporates various models to identify different elements. For formula recognition, the TrOCR model trained through Pix2Text outperforms other formula recognition models such as Latex-OCR and Texify, supporting recognition of formulas embedded within paragraphs and non-conventional formulas, thus effectively addressing most formula recognition scenarios. Text recognition employs PP-OCRv4, notable for its compatibility with multiple computing devices and strong recognition capabilities; approximately one hundred language recognition models have been publicly released, applicable to a broader range of document recognition tasks. Figures are saved as images and inserted in the subsequent merging phase. Table reconstruction is achieved using SLANet, which represents tables in HTML format. Other regions, such as headers, footers, and page numbers, are discarded and do not proceed to the post-processing and reconstruction stages.

4 https://github.com/VikParuchuri/marker

5 https://github.com/VikParuchuri/texify

Ordering In document conversion tasks, correctly handling the relationships between blocks is of paramount importance. To acquire high-quality conversion data, we need to properly handle complex layout scenarios such as multi-column and cross-page conditions. In the ordering stage, we use LayoutLMv3 [45] for column detection and sorting different areas according to specific rules. This strategy not only enhances the accuracy of the task but also significantly optimizes the readability.

Post-processing. The texts extracted by OCR usually could not be directly used and require additional processing as follows:

  1. Broken-up sentences: In text extracted from images, sentences may be fragmented across different lines or pages, resulting in a single sentence being divided into multiple segments. Effective OCR text extraction necessitates the identification and rejoining of these fragmented sentences to reconstruct coherent, complete sentences.
  2. Hyphenated words: Certain words may be split into two parts within the text due to formatting constraints, connected by hyphens (e.g., network-ing). Text extraction must recognize these hyphenated words and merge them back into a single, complete word (e.g., networking).
  3. Broken math formulas: OCRed mathematical formulas in Markdown may experience issues such as missing elements, incorrect symbols, or fragmented expressions. To address this issue, we fine-tune a 7-billion parameter open-source pre-trained language model [7] on supervised learning data pairs \((x_i, y_i)\). Here, \(x_i\) represents the instruction for detecting and correcting errors in the given texts, and \(y_i\) represents the corrected output texts. We adopt vLLM to enable faster inference through quantization and efficient memory management of attention keys and values using PagedAttention, among other optimizations. The prompt templates used for processing both languages are provided in Appendix A.10.

By incorporating these strategies, we can significantly improve the quality and coherence of OCR-ed texts, mitigating common errors and enhancing the overall readability and usability of extracted content. We use FastDeploy6, a highly efficient AI inference deployment tool, as the codebase of our implementation, which can fully exploit the advantages of multithreading to optimize inference speed and computational overhead. Overall, while maintaining performance and deployment efficiency, we provide a framework for document conversion that covers comprehensive scenarios, including recognizing layout information, supporting table reconstruction, and formula recognition.
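A hedged sketch of what the vLLM-served correction step could look like; the model path, prompt wording, and sampling settings are placeholders rather than the paper's released artifacts (its actual prompt templates are in Appendix A.10):

```python
# Serve a (hypothetical) fine-tuned 7B OCR corrector with vLLM and repair one snippet.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/finetuned-7b-ocr-corrector")   # placeholder checkpoint
params = SamplingParams(temperature=0.0, max_tokens=512)

ocr_text = "The net- work converges after 10 epo\nchs with loss $\\fra{1}{2}x^2$."
prompt = (
    "Detect and correct OCR errors (broken sentences, hyphenated words, broken math "
    "formulas) in the following text and return only the corrected text:\n" + ocr_text
)
print(llm.generate([prompt], params)[0].outputs[0].text)
```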

4.4 High-Quality Supplement Data Collection

In this section, we present our method for High-Quality Supplement Data Collection, which applies to a diverse range of topics and enhances the robustness of datasets. Inspired by [89], which adopts an iterative pipeline to facilitate the acquisition of large-scale, high-quality data from Common Crawl, we propose to select high-quality data for mathematics, scientific exam synthetic data, and wiki-based content in our Matrix.

The procedural phases of the iterative pipeline are enumerated as follows:

  • Seed Dataset Collection: Collect a high-quality seed dataset for the field of interest, like mathematics, code, or wiki-based content.
  • Domain Definition and Sampling: Define a domain as data entries within the seed dataset sharing the same base URL and extract samples from each domain in the seed dataset as positive samples to enhance format diversity. Correspondingly, acquire an equal amount of data from Common Crawl as negative samples.
  • Model Training: Employ a FastText model [50] for binary classification to discern data relevance to the specified field. Training parameters are set as follows: three epochs, a learning rate of 0.1, an embedding dimension of 256, and an n-gram of 3. The model is quantized to augment operational efficiency within constrained memory capacities, reducing its size to approximately 10% of its original footprint.
  • Data Confidence Assessment: Utilize the trained FastText model to estimate the confidence of Common Crawl data qualifying as positive. Retain data sequenced from highest to lowest confidence. To streamline the confidence sorting process, initially sample a subset of data to establish a viable threshold that balances data exclusion with retention needs.
  • Data Evaluation: Assess the retained data via ChatGPT 3.5 [1], employing the URL to determine field specificity. This stage aims to mitigate the incidence of false positives while maintaining a requisite recall rate.
  • Data Recall and Annotation: Revisit domains where over 10% of the data was recognized as field-specific. Annotate this data subset using ChatGPT 3.5 [1] via URL.
  • Model Refinement and Iteration: Integrate unconfirmed positive data from prior iterations into the positive samples to diversify the FastText model’s training base. Subsequently, initiate a new iteration cycle beginning from the training stage.

6 https://github.com/PaddlePaddle/FastDeploy

The data selection for Common Crawl focused on the English content of the RedPajama V2 dataset [25]. The seed dataset for the mathematics segment is sourced from OpenWebMath [6], while the science synthetic dataset is from specific domains such as Chemrxiv, biorxiv, and proprietary crawled exercise data from open-source datasets, e.g. wanjuan-exam [38], WebInstruct [117], Web Of Science [55]. Wiki data is procured directly from wiki websites.
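For the Model Training step of this pipeline, a rough FastText sketch with the hyperparameters listed above (3 epochs, learning rate 0.1, embedding dimension 256, word n-grams of 3), followed by quantization; file paths and label names are placeholders:

```python
# Train and quantize a binary domain-recall classifier with FastText.
import fasttext

# Each line of the training file looks like: "__label__math <document text>" or
# "__label__other <document text>" (seed data as positives, Common Crawl as negatives).
model = fasttext.train_supervised(
    input="seed_vs_cc_train.txt",   # placeholder training file
    epoch=3,
    lr=0.1,
    dim=256,
    wordNgrams=3,
)
model.quantize(input="seed_vs_cc_train.txt", retrain=True)   # shrink to roughly 10% of the size
model.save_model("domain_recall.ftz")

labels, probs = model.predict("Let x be a prime number greater than 2 ...")
print(labels[0], probs[0])   # keep documents whose positive-class confidence is high enough
```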

5 Model

5.1 Model Architecture

The MAP-Neo model architecture is grounded on the transformer decoder as outlined by Vaswani et al. [103]. The essential parameters defining this architecture are detailed in Table 5. The models are trained with a context length of 8192 tokens, incorporating several enhancements proposed after the original transformer concept. These enhancements are listed below:

Multi-Query Attention [92]. The 7B model variant employs multi-head attention, whereas the 2B model checkpoints implement multi-query attention, using a single key-value head configuration (num kv heads = 1). This modification is based on ablation studies indicating that multi-query attention is particularly effective at smaller scales [92].

RoPE Embeddings [97]. Instead of traditional absolute positional embeddings, we utilize rotary positional embeddings at each layer and share these embeddings between the inputs and outputs, minimizing the overall model size.

RMSNorm. To ensure stable training, each transformer sub-layer—including both the attention and feedforward layers—is normalized using RMSNorm [120].

Activation Function We use SwiGLU [93] as our activation function.
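As a rough PyTorch illustration of two of these components (RMSNorm and a SwiGLU feed-forward block); this is a sketch of the building blocks, not the released MAP-Neo training code, and how the gate projections are counted toward \(d_{\text{ff}} = 8 \times d_{\text{model}}\) is an assumption here:

```python
# Minimal RMSNorm + SwiGLU feed-forward sketch in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square over the last dimension, then rescale.
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        d_ff = 8 * d_model                 # feed-forward size from Table 5
        hidden = d_ff // 2                 # gate/up halves (counting convention assumed)
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)
print(SwiGLUFeedForward(512)(RMSNorm(512)(x)).shape)   # torch.Size([2, 16, 512])
```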

5.2 Model Scale Hyperparameters

In this work, we compare two different model scales: 2B and 7B parameters. These models are standard dense Transformers constructed using the hyperparameters in Table 5. The two models are trained identically (except for training data) using the same vocabulary and batch size. Training details are shown in §3 and §5.1.

Table 5: Model architecture details. We list the number of layers, \(d_{\text{model}}\), the number of attention heads, and the attention head size. The feed-forward size \(d_{\text{ff}}\) is always 8 × \(d_{\text{model}}\).

6 Pre-training

In the pre-training process, we employ a two-stage pre-training strategy to train the MAP-Neo model. The first stage, termed the fundamental phase, involves training the model on a vast corpus of generic texts to develop its general text generation capability. Subsequently, during the decay phase, we focus on enhancing the reliability of the model’s generated content by incorporating high-quality data and more code data. The distribution of data used across different phases is depicted in Figure 5. Note that we increase the volume of code data in the decay phase. Specifically, during the fundamental phase, since Stack V2 [68] was not yet available, we utilized Stack V1 [54] and repeated the dataset twice to achieve a balanced data ratio. In the decay phase, with the release of Stack V2 [68], we incorporated it as the code component for training. Moreover, we perform further data distribution tuning, including duplicating high-quality data sources such as books, judicial decisions, and government reports for training, to improve the model’s performance. The open-source data used for pre-training is shown in Table 16, the data repetition details are shown in Table 17, and the training hyperparameters are shown in Table 6.

Table 6: Model training details.

Figure 5: The data mixture ratios in MAP-Neo pre-training stage. The left is the fundamental phase and the right shows the decay phase.

6.1 Fundamental Phase: General Ability Acquisition

During the fundamental phase, we employ a two-stage learning rate scheduler (LRS) to equip the model with a robust capability for general text generation. The LRS is modeled as a piecewise function, consisting of an initial warmup phase where the learning rate linearly ascends from a base rate of $\eta_a = 2 \times 10^{-5}$ to peak learning rate $\eta_{\text{max}} = 2 \times 10^{-4}$ over $t_{\text{warmup}} = 2k$ steps. This is followed by a cosine decay phase, during which the rate gradually diminishes back to $\eta_b = 2 \times 10^{-5}$ over about 365k steps. The learning rate $f(t)$ as a function of time $t$ can be delineated as follows:

\[f(t) = \begin{cases} \eta_a + (\eta_{\text{max}} - \eta_a) \frac{t}{t_{\text{warmup}}} & \text{if } t \leq t_{\text{warmup}} \\ \eta_b + \frac{1}{2}(\eta_{\text{max}} - \eta_b) \left(1 + \cos\left(\pi \frac{t-t_{\text{warmup}}}{t_{\text{total}}-t_{\text{warmup}}}\right)\right) & \text{if } t_{\text{warmup}} < t \leq t_{\text{total}} \end{cases},\]

where $t$ is the current timestep, $t_{\text{warmup}}$ denotes the duration of the warmup phase, and $t_{\text{total}}$ represents the total number of training timesteps. This learning phase processes about 3,726 billion tokens, ensuring the model’s robust training on diverse textual data. This meticulous configuration of learning rates and extensive processing optimize training dynamics and efficiency, fostering a steady maturation of the model’s capabilities.
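The schedule can be transcribed directly; the sketch below plugs in the stated rates and approximate step counts (the 367k total is an assumption combining the ~2k warmup and ~365k cosine steps):

```python
# Fundamental-phase learning-rate schedule: linear warmup, then cosine decay.
import math

ETA_A, ETA_B, ETA_MAX = 2e-5, 2e-5, 2e-4
T_WARMUP, T_TOTAL = 2_000, 367_000      # approximate step counts from the text

def fundamental_lr(t: int) -> float:
    if t <= T_WARMUP:
        return ETA_A + (ETA_MAX - ETA_A) * t / T_WARMUP
    progress = (t - T_WARMUP) / (T_TOTAL - T_WARMUP)
    return ETA_B + 0.5 * (ETA_MAX - ETA_B) * (1 + math.cos(math.pi * progress))

print(fundamental_lr(0), fundamental_lr(T_WARMUP), fundamental_lr(T_TOTAL))
# 2e-05 0.0002 2e-05
```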

6.2 Decay Phase: Improvement and Rectification

Owing to the issue in training tokenizer as claimed in §3, the model encounters test failures in code generation tasks, despite its strong language understanding capabilities acquired during the fundamental phase. To address this issue, we have introduced an additional decay phase specifically designed to utilize a tokenizer of the fixed version. The learning rate in this decay phase initiates at $\eta_c = 2 \times 10^{-4}$ and undergoes exponential decay over $t_{\text{decay}} = 148k$ steps, with a half-life $T$ corresponding to half the $t_{\text{decay}}$ steps, similar to the decay phase employed by MiniCPM [44], which can be formulated as follows:

\[f(t) = \eta_c \times 0.5^{\frac{t}{T}} \quad \text{if } t \leq t_{\text{decay}},\]

where $t$ is the current timestep of the decay phase. This strategic adjustment not only rectifies the initial tokenization flaws but also enhances the model’s performance on code generation tasks. During this phase, the model processes a total of about 778 billion tokens, which primarily consist of high-quality instruction data. We also simultaneously increased the proportion of code in the data from 14.77% to 17.04%. This adjustment significantly enhances the overall performance of the model. The deliberate enrichment of the dataset with a higher ratio of code, coupled with instructional inputs, ensures a more robust and versatile model, adept at tackling complex coding tasks as well as understanding and generating professional responses in different fields.
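For completeness, the decay-phase schedule is a one-liner: exponential decay from $\eta_c = 2 \times 10^{-4}$ with a half-life of half the 148k decay steps:

```python
# Decay-phase learning rate: exponential decay with half-life T = t_decay / 2.
ETA_C, T_DECAY = 2e-4, 148_000
HALF_LIFE = T_DECAY / 2

def decay_lr(t: int) -> float:
    assert 0 <= t <= T_DECAY
    return ETA_C * 0.5 ** (t / HALF_LIFE)

print(decay_lr(0), decay_lr(HALF_LIFE), decay_lr(T_DECAY))   # 2e-4, 1e-4, 5e-5
```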

7 Alignment

7.1 Supervised Fine-tuning

To align LLMs with human behavior, the initial step is to perform Supervised Fine-Tuning (SFT). Our SFT also consists of two phases. In the first phase, we collect a large amount of instruction data to enhance the foundational abilities of LLMs. In the second phase, we build upon the capabilities established in the first phase and propose to improve the chat abilities of MAP-Neo. This process finetunes a pre-trained LLM on chat-style data, including both queries and responses. We illustrate the details of data construction and training strategies below.

7.1.1 Data

Foundational Phase: Enhancing Instruction Following Abilities
In the first phase, our focus is to significantly boost the model’s foundational abilities (e.g., code and math skills), where we utilize over 2 million instructional data points during this phase. Specifically, the first phase includes the entire OpenHermes 2.5 [99], where we exclude segments related to the TheoremQA benchmark [16] to prevent benchmark data leakage. Additionally, we incorporate the complete Code-Feedback [125] dataset and a subset of WebInstructSub [117] data.

Chat Phase: Enhancing Chat Abilities
In the second phase, we focus on improving the model’s chat abilities while maintaining the foundational skills acquired in the first phase. For this purpose, we collect over 100k multi-turn dialogue data sourced from real user conversations. To ensure the model retains its foundational capabilities, we include 5k math and code-related data points extracted from the first phase. Our experiments have demonstrated that this additional phase of SFT significantly boosts the model’s performance on chat benchmarks, such as MT-Bench [124] and AlpacaEval [62], without compromising its foundational abilities.

By following this two-phase approach, we ensure that our model can not only maintain a strong foundation in essential skills but also generate natural, helpful, and contextually accurate responses.

7.1.2 Training

Consistent with pre-training, we also apply the next-token prediction objective as the training task for SFT. Note that we apply the loss masks for the system and user inputs. The model’s training process utilizes the AdamW optimizer with the hyperparameters in Table 6.

The sequence length is limited to 8192, and the batch size is 512. The training process consists of two phases using the same hyperparameters. In the first phase, the model is trained for 3 epochs using over 2 million instructional data points, focusing on enhancing foundational abilities. In the second phase, the model is trained for 1 epoch using over 100k multi-turn dialogue data to enhance its chat abilities while maintaining the foundational skills acquired in the first phase.

7.2 Iterative DPO

DPO Direct Preference Optimization (DPO) [77] is a straightforward and effective method for aligning language models with human feedback. It converts the preference loss [12] into a loss function over the language model, thereby bypassing the need for explicit reward modeling [12] and reinforcement learning [19, 87]. Starting with a supervised fine-tuned language model, denoted as $\pi_{\text{sft}}$, DPO collects a dataset $D = \{(x, y_{\text{w}}, y_{\text{l}})_i\}$, which consists of human preferences between two responses generated by $\pi_{\text{sft}}$: $y_{\text{w}}$ (preferred) and $y_{\text{l}}$ (dispreferred) to the same prompt $x$. Using this dataset, DPO parameterizes a language model $\pi_{\theta}$ and directly estimates its parameters through maximum likelihood estimation on the human preference dataset $D$ as follows:

\[L_{\text{DPO}}(\pi_{\theta}; \pi_{\text{sft}}, D) = -\mathbb{E}_{(x,y_{\text{w}},y_{\text{l}}) \sim D} \left[ \log \sigma\!\left( \beta \log \frac{\pi_{\theta}(y_{\text{w}}|x)}{\pi_{\text{sft}}(y_{\text{w}}|x)} - \beta \log \frac{\pi_{\theta}(y_{\text{l}}|x)}{\pi_{\text{sft}}(y_{\text{l}}|x)} \right) \right].\]
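A minimal PyTorch sketch of this loss, assuming the per-sequence log-probabilities under the policy and the frozen SFT reference have already been computed (the batching, sampling, and reward-labeling machinery of the full pipeline is omitted):

```python
# DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, sft_logp_w, sft_logp_l, beta: float = 0.1):
    """Each argument: tensor of shape (batch,) holding log pi(y|x) summed over tokens."""
    chosen = policy_logp_w - sft_logp_w       # log [pi_theta(y_w|x) / pi_sft(y_w|x)]
    rejected = policy_logp_l - sft_logp_l     # log [pi_theta(y_l|x) / pi_sft(y_l|x)]
    return -F.logsigmoid(beta * (chosen - rejected)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
print(loss)   # smaller when the chosen-response margin is larger
```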

Iterative DPO. We follow Storm-7B [64] to use the Iterative DPO [111] pipeline to develop our chat model. Specifically, we employ three iterations, with each iteration consisting of three stages: 1) generating paired responses, 2) labeling responses using reward models, and 3) training the LLM with DPO loss as described in Eq. 3. We utilize Nectar7 as our prompt dataset and Starling-RM-34B8 [126] as our reward model. This model is finetuned from Yi-34B-Chat [113] and generates a scalar output for any given prompt and response. To preserve the multilingual capabilities of our model, we also adopt a preference dataset9 in Chinese in the third iteration.

We report the length-controlled win rate of AlpacaEval2.0 [32] to demonstrate the performance progress of our model in Table 7. The results show that performance improves with each iteration, indicating that our model becomes increasingly aligned with human values.

Table 7: The length-controlled win rate of MAP-Neo at different iterations on the AlpacaEval 2.0 leaderboard. For “SFT”, we report the performance of our model after two-phase SFT.

| Model | SFT | Iteration 1 | Iteration 2 | Iteration 3 |
|---|---|---|---|---|
| LC Win Rate (%) | 9.77 | 10.02 | 15.59 | 16.65 |

8 Scaling Law of MAP-Neo

8.1 Problem Definition

Scaling laws predict suitable training configurations for LLMs. The central quantity is the ratio between the amount of training data $D$ (measured in tokens) and the model size $N$ (measured in parameters). In this section, we apply the Chinchilla Law in Eq. 4 [43], the OpenAI Law in Eq. 5 [52], a derivation of the Symbolic Music Scaling (SMS) law in Eq. 6 [75], and our proposed NEO scaling law (Section 8.2) to fit our models, where $A$, $B$, $E$, $\alpha$, $\beta$, $\alpha_N$, $\alpha_D$, $N_c$, $D_c$, and $d$ are hyperparameters to be optimized.

Chinchilla Law (Eq. 4):
\[L(N, D) = A N^{\alpha} + B D^{\beta} + E\]
OpenAI Law (Eq. 5):
\[L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}\]
Derivation of the SMS law (Eq. 6):
\[L(N, D) = d \cdot N^{\alpha} \cdot D^{\beta} + A N^{\alpha} + B D^{\beta} + E\]

(Figures 4, 5, and 6 refer to specific equations and scaling predictions detailed within this section.)

Figure 6: The training loss value is represented by the blue line. The Chinchilla law prediction is shown in yellow, and the NEO scaling law prediction is depicted in green. We fit the Chinchilla law and the NEO law on the 250M, 460M, and 980M models and predict model behavior on both the training samples and samples from the 7B model.

The original SMS scaling law introduces two modifications to the Chinchilla law. The first modification addresses the repetition of training data, which is not considered in our study. The second modification concerns the interaction between the number of model parameters, $N$, and the dataset size, $D$. Specifically, it posits that the loss curve as a function of $D$, represented as $B D^\beta$, is influenced by $N$. This interaction between the number of model parameters and dataset size is also reflected in the OpenAI scaling law. However, our version of SMS law, as detailed in Eq. 6, is simpler and yields superior results compared to the corresponding model in the OpenAI framework.

The motivation for fitting scaling laws is to optimize the loss under a bounded compute budget. This is formalized as minimizing the validation cross-entropy loss $L$ subject to a constraint on the available computational resources $C$, measured in total floating-point operations (FLOPs), as denoted below:

\[\arg \min_{N,D} L(N, D) \quad \text{s.t.} \quad \text{FLOPs}(N, D) = C\]

Given that our model is trained on almost non-repetitive and high-quality data, we utilize the training loss instead of the validation loss for the scaling law application.

8.2 NEO Scaling Law

We train models with 250M, 460M, and 980M parameters using 1000B tokens of training data. These models are then used to fit the scaling law, which guides the training of a model with 7.8B parameters on 3.07T (3065B) tokens during phase 1. To evaluate the fit of the scaling law, we employ the Huber loss ($\delta = 10^{-3}$) between the actual log-loss and the predicted log-loss, along with the $R^2$ value between the true loss and the predicted loss. Optimization of the scaling law is performed using the L-BFGS algorithm. This approach is applied consistently across the Chinchilla law and the symbolic music scaling law. By leveraging these methods, we aim to ensure the accuracy and reliability of our scaling law predictions, enabling efficient training of large-scale language models.

Figure 6 illustrates the training loss values alongside the Chinchilla law predictions. Although the Chinchilla law fits well, with the predicted loss curve falling within the fluctuations of the actual loss curve, its trend appears flatter compared to the actual loss curve. The actual loss decreases more rapidly than predicted by the Chinchilla formula (i.e. $B D^\beta$), suggesting our dataset with diverse high-quality corpora can further decrease the loss value when $D$ is large. To address this discrepancy between Chinchilla prediction and observation, we introduce the following equation, denoted as NEO scaling law, which includes one additional regularization term $\log(D)$ for datasets containing several trillion tokens across various corpora:

\[L(N, D) = A N^\alpha + B D^\beta + E - d \cdot \log(D)\]

Note that although the regularization term $-d \cdot \log(D)$ theoretically leaves the loss without a lower bound as $D$ approaches infinity, suggesting a potential imperfection of the formula, the value of $d$ in our experiments typically ranges between $10^{-2}$ and $3 \times 10^{-2}$. Therefore, for dataset sizes below hundreds of trillions of tokens, the loss remains within a reasonable range.
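
As an illustration of the fitting procedure (Huber loss with $\delta = 10^{-3}$, optimized with L-BFGS), the sketch below fits the NEO law with SciPy; the observation arrays are placeholders standing in for the 250M/460M/980M training curves, not the paper’s data:

```python
import numpy as np
from scipy.optimize import minimize

def neo_law(params, N, D):
    """NEO scaling law: L(N, D) = A*N^alpha + B*D^beta + E - d*log(D)."""
    A, alpha, B, beta, E, d = params
    return A * N ** alpha + B * D ** beta + E - d * np.log(D)

def huber(residual, delta=1e-3):
    """Huber penalty with the paper's delta = 1e-3."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * residual ** 2, delta * (abs_r - 0.5 * delta))

def objective(params, N, D, loss_obs):
    """Mean Huber loss between actual and predicted log-loss."""
    pred = np.clip(neo_law(params, N, D), 1e-6, None)  # guard against non-positive predictions
    return np.mean(huber(np.log(loss_obs) - np.log(pred)))

# Placeholder observations: (parameters, tokens seen, training loss) triples.
N_obs = np.array([2.5e8, 2.5e8, 4.6e8, 4.6e8, 9.8e8, 9.8e8])
D_obs = np.array([2e11, 1e12, 2e11, 1e12, 2e11, 1e12])
L_obs = np.array([1.05, 0.95, 0.98, 0.88, 0.92, 0.82])

# Initial guess for (A, alpha, B, beta, E, d); alpha and beta are negative so the
# predicted loss decreases as N and D grow.
x0 = np.array([2.0, -0.1, 2.0, -0.1, 0.6, 0.01])
fit = minimize(objective, x0, args=(N_obs, D_obs, L_obs), method="L-BFGS-B")
A, alpha, B, beta, E, d = fit.x
```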

As shown in Table 8, the NEO scaling law yields significantly better fits on both the training and test sets.

Table 8: Comparison of parametric fits of different scaling laws, measured by $R^2$ (train) ↑, Huber loss (train) ↓, $R^2$ (test) ↑, and Huber loss (test) ↓.

Under the prediction of the NEO scaling law and a computational budget of $1.5 \times 10^{23}$ FLOPs, the optimal configuration is to train a 10B parameter model on 2.5T tokens, with a predicted loss of 0.6597. To ensure comparability with baseline models, we keep our model size at 7.8B parameters, similar to the Llama-base model. This configuration of a 7.8B parameter model trained on 3.07T tokens requires slightly fewer computational resources and results in a similar predicted loss (0.6618). After training, we observe that the real training loss for this configuration is 0.6591, which is close to the predicted value and demonstrates the effectiveness of the NEO scaling law.
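
Assuming the common approximation $\text{FLOPs}(N, D) \approx 6ND$ (the paper does not state its exact FLOPs formula here), the trade-off above can be reproduced by scanning model sizes under a fixed budget using fitted NEO-law coefficients; the coefficients in this sketch are illustrative placeholders:

```python
import numpy as np

def neo_law(N, D, A, alpha, B, beta, E, d):
    """NEO scaling law evaluated with fitted coefficients."""
    return A * N ** alpha + B * D ** beta + E - d * np.log(D)

def optimal_allocation(C, coeffs, n_grid=None):
    """Scan model sizes under FLOPs ~= 6*N*D = C and return the loss-minimizing (N, D, L)."""
    if n_grid is None:
        n_grid = np.logspace(9, 11, 400)     # 1B to 100B parameters
    d_grid = C / (6.0 * n_grid)              # tokens implied by the fixed budget
    losses = neo_law(n_grid, d_grid, *coeffs)
    i = int(np.argmin(losses))
    return n_grid[i], d_grid[i], losses[i]

coeffs = (2.0, -0.1, 2.0, -0.1, 0.6, 0.01)   # illustrative (A, alpha, B, beta, E, d)
N_opt, D_opt, L_opt = optimal_allocation(1.5e23, coeffs)
# For comparison, the chosen 7.8B / 3.07T configuration costs roughly
# 6 * 7.8e9 * 3.07e12 ~= 1.4e23 FLOPs, i.e. slightly under the 1.5e23 budget.
```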

8.3 Generalization of NEO Scaling Law

The NEO scaling law is applicable to a broader range of models beyond MAP-Neo. Specifically, in Figure 7, we illustrate the fit of the Chinchilla scaling law (yellow dashed line) and the NEO scaling law (red solid line) to the DeepSeek LLM [28] with 7B and 67B parameters, which is also pre-trained on a dataset with multiple corpora including Chinese, English, and code.

We observe that for the largest model sizes (i.e. MAP-Neo-7B and DeepSeek-67B), the Chinchilla Law tends to underestimate the actual loss when the dataset size $D$ is small and to overestimate it as model parameters and training data scale up. In contrast, the NEO Scaling Law produces better fits for MAP-Neo-7B and DeepSeek-67B.

Figure 7: The loss curve of Chinchilla Law prediction and the NEO Scaling law prediction for the DeepSeek LLM. We use loss values from both 7B and 67B for fitting and prediction.

We further suggest that the NEO Scaling law may be better suited to large, diverse pre-training datasets drawn from multiple high-quality sources. For more discussion of the NEO scaling law on other models, please refer to Appendix A.8.

9 Infrastructure

Our advanced infrastructure consists of two primary components: a data processing system and a training system. The training system is designed to support both pre-training and fine-tuning stages, enabling comprehensive model development.

Our infrastructure is designed to handle extensive data processing tasks for both English and Chinese datasets, with robust systems that ensure efficient and scalable processing across languages. Spark [118] is used for distributed computing, and object storage is used to persist the data. For English data processing, there are 94 machines in total, each configured with a 64-core CPU, 256 GB of memory, and 1 TB of local disk. For Chinese data processing, there are 14 machines in total: 6 with 96-core CPUs and 180 GB of memory, and 8 with 48-core CPUs and 190 GB of memory. Network File System (NFS) [84] is used as the distributed file storage system.
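
To give a sense of what this data-processing layer runs, the following is a minimal PySpark sketch of a distributed filtering and deduplication job; the paths, column names, and thresholds are illustrative rather than taken from the actual Matrix pipeline:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("matrix-data-cleaning").getOrCreate()

# Raw documents read from object storage (path is a placeholder).
docs = spark.read.json("s3a://bucket/raw_web_docs/")

cleaned = (
    docs
    .filter(F.length("text") > 200)   # drop very short documents (illustrative threshold)
    .dropDuplicates(["text"])         # exact deduplication on the text field
)

cleaned.write.mode("overwrite").parquet("s3a://bucket/cleaned_docs/")
```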

In the pre-training stage, the Megatron-Core toolkit is utilized for its capacity to train large-scale language models with up to hundreds of billions of parameters. Measured in tokens per second (TPS), training a 7B model with Megatron-Core reaches 7200 TPS, compared to 6400 TPS observed under the same settings without it. This is accomplished using both model and data parallelism. We implement several strategies to manage our large datasets and model complexity effectively. Firstly, we introduce programs that automatically inspect, predict, and label tainted computing nodes and temporarily remove them from the resource pool when software or hardware errors occur. Secondly, we modify Megatron-LM to prevent the overflow issues detailed in A.3 when processing large data corpora. Lastly, we implement task recovery mechanisms that use strategically selected checkpoint iterations to safeguard against potential failures during training. These enhancements ensure optimal performance and reliability in our large-scale training operations.

To ensure optimal utilization of our computational resources, our infrastructure design incorporates a sophisticated network topology and hardware configuration, facilitating efficient workload distribution and data transfer for complex model training tasks. We rely on distributed computing techniques to optimize training. Specifically, our 7B model is trained on an H800 configuration with 512 GPUs across 64 nodes, using NCCL as the distributed backend with ibp as the network interface and mlx5 InfiniBand hardware to enhance inter-GPU communication. Tensor model parallelism is set to 2 GPUs, distributing the execution of a single transformer module across these units to improve efficiency. For our 2B models, we use all 256 GPUs with tensor model parallelism set to 1 to ensure effective data replication. We further improve scalability and efficiency by employing techniques similar to ZeRO-1 for sharding the optimizer state. This approach enables the management of more extensive datasets and more complex model training with significantly reduced memory overhead.
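
For illustration, the NCCL settings mentioned above roughly correspond to the following process-group initialization under a torchrun-style launcher; only the interface and HCA names come from the description, and the rest is an assumed sketch rather than the actual Megatron-LM launch script:

```python
import os
import torch
import torch.distributed as dist

# Interface and InfiniBand adapter names from the description above; in practice
# these are usually exported by the launch script rather than set in Python.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ibp")
os.environ.setdefault("NCCL_IB_HCA", "mlx5")

def init_distributed() -> int:
    """Initialize the NCCL process group under a torchrun-style launcher."""
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # rank and world size come from the env
    return local_rank

if __name__ == "__main__":
    local_rank = init_distributed()
    # Tensor parallelism (size 2 for the 7B model) is then layered on top by
    # Megatron-Core; this sketch only covers the NCCL backend setup.
```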

Our cluster consists of machines with dual Intel Xeon CPUs and eight NVIDIA H800 GPUs. The architecture facilitates high-speed data transfer, with each CPU socket interfacing with two PCIe Gen4 x16 lanes connected to dedicated PCIe switches. These switches manage the connections to a local NVMe SSD, an RDMA-capable Network Interface Card (NIC), and two GPUs. Inter-CPU communication is facilitated by Intel’s Ultra Path Interconnect (UPI), with both CPUs linked to a dual-port TCP NIC supporting 100 Gbps. Each machine’s network configuration includes four RDMA NICs, each offering 200 Gbps of full duplex bandwidth and integrated GPU Direct RDMA capabilities. Notably, the GPU array is interconnected through four NVIDIA NVSwitches, enabling robust GPU-to-GPU communication within the node at a bandwidth of 400 Gbps. This advanced configuration underscores the cluster’s capability to handle large-scale model training with exceptional efficiency and speed.

Regarding the inter-machine connections of our data center, we implement a dual-layer Clos network architecture wherein each minipod accommodates at least 512 H800 servers interconnected via a high-speed RDMA network. Within this architecture, each S0 switch is equipped with 64 ports, each supporting a bandwidth of 400 Gbps. This arrangement ensures a network convergence ratio of 1:1, a critical factor in maintaining optimal data flow and reducing bottlenecks. Connectivity within this structure is meticulously organized such that every two S0 switches serve 32 servers, with a total of 32 S0 switches networking within each minipod. This setup exemplifies an advanced implementation designed to maximize throughput and minimize latency in data center environments.

Table 9: Performance comparison of various base models on different benchmarks. The best results are in blue, the second-best results are underlined, and the third-best results are boxed.

10 Evaluations

Our thorough evaluation demonstrates that the MAP-Neo model family achieves inspiring performance on automatic benchmarks for both base models and chat models. Compared to previous transparent LLM series, we highlight MAP-Neo’s distinctive performance on code, math, and instruction-following abilities, which endows MAP-Neo with both academic and practical value.

Table 10: Performance comparison of various aligned models on different benchmarks. The best results are in blue, the second-best results are underlined, and the third-best results are boxed.

10.1 Base Model Performance

10.1.1 Main Results

We present the results of our base models compared to several well-known LLMs, e.g. LLama3-8B and Mistral-7B, across standard academic benchmarks. All our evaluation metrics are derived from our assessments, ensuring consistency and transparency. We do not perform any post-processing on the evaluation content, maintaining the integrity of the raw outputs.

Our evaluation spans a comprehensive suite of public benchmarks in both English and Chinese, leveraging an internal evaluation framework designed for rigorous assessment. These benchmarks include a diverse range of datasets catering to multiple disciplines and aspects of language understanding and reasoning. Our evaluation strategy encompasses various metrics, including language modeling, specialized knowledge, and code generation. For datasets requiring multiple-choice selection, we employ a perplexity-based evaluation. For generation-based datasets, we generate free text and parse the results accordingly. The detailed results of our comparison with other base models are shown in Table 9.
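
As an example of the perplexity-based protocol for multiple-choice datasets, the sketch below scores each candidate answer by the language-model loss of the concatenated question and choice and picks the lowest; the model path and prompt format are placeholders, and scoring conventions (e.g., whether the question tokens are masked) may differ from the paper’s internal framework:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "path/to/base-model"  # placeholder model path
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def choose(question: str, choices: list[str]) -> int:
    """Return the index of the choice whose continuation has the lowest per-token loss."""
    losses = []
    for choice in choices:
        enc = tokenizer(question + " " + choice, return_tensors="pt")
        out = model(**enc, labels=enc["input_ids"])  # mean cross-entropy over the sequence
        losses.append(out.loss.item())
    return int(torch.tensor(losses).argmin())
```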

Standard Benchmarks We include Boolean Questions (BoolQ) [21], Physical Interaction QA (PIQA) [10], Social Interaction QA (SIQA) [85], HellaSwag [119], WinoGrande [83], ARC-Challenge (ARC-c) [22], OpenBookQA-Fact [70], CommonsenseQA [98], and MMLU [40] to assess general reasoning capabilities. All these benchmarks are tested with a 0-shot configuration, except for MMLU, which is evaluated with a 5-shot setup.

Code Generation We report the pass@1 scores of the evaluated models on HumanEval [15], HumanEval-Plus, MBPP [5], and MBPP-Plus, all with a 0-shot configuration, following the EvalPlus framework [63].
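
pass@k is conventionally computed with the unbiased estimator from the HumanEval paper [15]; with a single sample per problem, pass@1 reduces to the fraction of problems whose generated program passes all unit tests (whether greedy decoding or sampling is used is not stated here). A small sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Benchmark-level pass@1 is the average of per-problem estimates;
# the (n, c) pairs below are illustrative only.
results = [(1, 1), (1, 0), (1, 1)]
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
```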

World Knowledge We include NaturalQuestions (NQ) [57] and TriviaQA [49] to assess world knowledge. Both benchmarks are tested with a 0-shot configuration.

Reading Comprehension We report the 0-shot average on SQuAD2.0 [79].

Exams We report the average scores for MATH [41] and GSM8K [23], both with a 4-shot configuration. For GSM8K, we employ a simple Chain-of-Thought prompting strategy: “Let’s think step by step.” For both datasets, we use the MAmmoTH evaluation framework [116].

Chinese We use CMMLU [60] and CEval [46] to assess performance on Chinese language tasks. Both benchmarks are tested with a 5-shot configuration.

10.1.2 Discussions

Data Quality MAP-Neo demonstrates significantly better performance on math, code, and complex reasoning by incorporating high-quality data, compared to previous transparent LLMs, e.g. Amber [66] and Pythia [9], which adopt (presumably) lower-quality data.

Gap between our MAP-Neo and other transparent LLMs In Table 9, we note that transparent LLMs still significantly lag behind frontier industrial open-weight LLMs of similar size (e.g. LLama3-8B, Mistral-7B). In contrast, our MAP-Neo can match or even surpass them on some of the automatic benchmarks covering math, code, and Chinese knowledge. We call for increased participation in the development of transparent LLMs to further advance LLM democratization.

10.2 Aligned Model Performance

10.2.1 Main Results

To accurately evaluate the realistic conversational performance of our aligned models, we selected several benchmarks that measure different aspects of model capability. These benchmarks were chosen for their ability to comprehensively assess key abilities such as alignment, instruction following, real-world performance, and agreement with human preferences. Below are the specific benchmarks we used and the unique capabilities they evaluate:

AlignBench [65] AlignBench evaluates the alignment capabilities of Chinese LLMs, ensuring high reliability and interpretability through a comprehensive, multi-dimensional benchmark and human-in-the-loop data curation.

AlpacaEval [62, 32, 31] AlpacaEval measures instruction-following models’ performance efficiently and reliably through an LLM-based automatic evaluation, validated against extensive human annotations.

Arena-Hard [61] Arena-Hard evaluates LLMs’ real-world performance and ability to reflect human preferences by constructing benchmarks from live data and ensuring robust model capability separation.

CHC-Bench [30] CHC-Bench evaluates LLMs on their proficiency in Chinese culture, history, and language, with tasks like composing poetry, understanding ancient Chinese, and explaining Chinese terms, emphasizing the challenges for models trained mainly on English datasets.

MT-Bench [124] MT-Bench assesses LLM-based chat assistants’ alignment with human preferences using strong LLMs as judges, achieving high agreement with human evaluations.

MMLU-Pro [106] For the aligned models, we further evaluate MMLU-Pro with a 5-shot configuration to reflect the models’ capabilities more comprehensively.

10.2.2 Discussions

The effectiveness of Iterative DPO In Table 10, when compared to Neo-7B-SFT, Neo-7B-Instruct shows significant improvement on the chat-related benchmark datasets (e.g., AlignBench, AlpacaEval, Arena-Hard, and CHC-Bench), which further demonstrates the effectiveness of our Iterative DPO.

The performance of the chat model Table 10 shows that Amber-7B-Chat and OLMo-7B-Instruct perform poorly on Chat Benchmarks. We assume that the limited capabilities of the base model may severely limit the performance of corresponding instruction-tuned models on chat benchmarks.

11 Societal Impact

Data colonialism is a deep concern when firms decide to exploit an algorithmic product. [27] conceptualize the data colonialism framework and argue that Big Tech giants, particularly in the U.S., use their massive data power to manipulate human behaviors and judgments and to track people’s traces continuously, forming a new social order. This suggests that controlling and owning data strengthens firms’ market status and generates large returns, so keeping LLMs as proprietary models is common practice in the industry. [2] discuss the barriers to AI democratization, such as the concentration of AI capabilities in large tech firms and elite universities. They underscore the importance of democratizing access to AI resources to mitigate the risks of data colonialism and promote equitable access to AI technologies across all institutions. [91] discuss the dominance of proprietary LLMs and the need for high-performing open-source alternatives. They propose methods to enhance open-source models to compete with proprietary models while addressing privacy and resource constraints. They also point out how important open-source models are to the LLM community and acknowledge that firms with fewer resources and sensitive information are hesitant to trust proprietary models. However, most LLMs are the product of massive English corpora and are trained from scratch primarily on English data [122]. How open-source models can benefit non-English language communities and their data democratization remains unclear.

Additionally, most open-source models are not thoroughly transparent. Open-source large language models (LLMs) often claim to be transparent and accessible, but many critical aspects of their development, such as data cleaning processes and pre-training code, remain undisclosed. This lack of transparency hampers reproducibility and the ability to fully understand and trust these models [110]. For firms with financial constraints and privacy concerns, it is not economical to train their own LLMs. Even though most open-source models give open access to the final and some intermediate checkpoints, they keep data sources, pre-training code, and data processing methods opaque, which are the most costly parts of setting up an LLM. That is the key issue we aim to tackle, and we hope thereby to promote full transparency in our community.

In our report, the MAP-Neo model might help address the current scarcity of Chinese corpora in LLMs. Importantly, our bilingual language model is a “thorough” open-source model, disclosing all key processes from sourcing the original data and data cleaning to the pre-training code base. These disclosures significantly reduce the cost of deploying and customizing an LLM, especially a Chinese LLM, and may have notable societal impacts. Firms that need a Chinese LLM but face resource constraints can more readily benefit from LLMs by using or referencing our MAP-Neo model. This can improve social welfare overall and foster a more vibrant and diversified Chinese LLM community [24]. Our advocacy for thorough open-source practice may attract more Chinese LLM researchers or relevant firms to fully disclose their models, because thoroughly transparent open-source models can bring them sizable benefits from more constructive feedback and criticism. These effects could improve their models and eventually accelerate the iteration of Chinese LLMs and empower the local community [81]. Overall, open innovation practices like disclosing the MAP-Neo model might alleviate the dominance of English LLMs and improve the inclusivity of the international LLM community.

These open innovation practices may also help small and medium-sized enterprises (SMEs) introduce new products effectively [96] and efficiently by easing the implementation of their own customized LLMs, which may partially mitigate the threat of data colonialism from Big Tech giants. The open and economical attributes of our MAP-Neo model also give an optimistic outlook for researchers in academia: they suggest that it is neither difficult nor prohibitively costly for a university to set up its own AI without depending on specific Big Tech giants. If universities have independent and decentralized control over their data and AI processes, this will help prevent large companies from monopolizing AI and promote data and AI democratization.

12 Conclusion

In this paper, we introduce MAP-Neo, which makes strides toward enhancing the transparency and accessibility of large language models (LLMs) by offering a fully open-source bilingual LLM suite. By sharing thoroughly detailed processes, from data curation, pre-training corpus (i.e., Matrix Data Pile), and model training to evaluation, we aim to support the academic and open-source communities in advancing transparent NLP research. Moreover, MAP-Neo narrows the gap with industry-level models (typically closed-source) with enhanced reasoning, instruction-following, and coding abilities. We hope that our work provides a valuable resource for researchers and developers, contributing to a broader effort to democratize access to advanced LLM technologies.

