[sLLM 파라미터 대비 효율성(100M) 연구 색인마킹]
Contents
1. 서론
대규모 언어모델(LLM)은 다양한 작업에서 향상된 성능을 보이며, 특히 언어 처리 및 멀티모달 작업에 향상된 능력을 보였습니다. 그러나 이들 모델의 훈련 비용은 높으며, 더 많은 training dataset 사용이 추세입니다. 본 논문에서는 100B+ 파라미터 규모의 LLM을 처음부터 훈련하는 새로운 성장 전략을 제시하여, 비용을 절감하는 방법을 모색합니다. 이 성장 전략은 모델의 파라미터 수가 고정되지 않고 훈련이 진행되는 동안 작은 규모에서 큰 규모로 확장됩니다.
2. 개요 및 FLM-101B 모델
2.1 구조
FLM-101B 모델은 GPT와 유사한 디코더 전용의 트랜스포머 구조를 기반으로 하되, 언어 모델링 목적과 teacher 신호 목적이 통합된 새로운 학습 목표를 도입하여 구성하였습니다. 이는 훈련 중 크기 증가에 따른 안정성을 확보하는데 기여합니다.
2.2 훈련 설정
이중 언어(영어 및 중국어) 데이터를 이용하여 pre-trained FLM-101B는 초기 작은 모델에서 시작하여 점차 큰 모델로 성장하면서 이전 모델로부터 지식을 상속받습니다. 성장 전략을 통해 모델은 초기 16B에서 시작하여 51B, 마지막으로 101B 규모로 확장되며, 각 단계에서는 이전 단계의 지식을 상속받아 더 높은 효율성과 저렴한 비용으로 훈련이 가능합니다.
3. 훈련 안정성
FLM-101B의 훈련 안정성을 위해 트랜스포머의 기본 블록, 즉 LayerNorm과 Dropout을 사용하여 GPU 리소스 및 메모리 사용을 최적화합니다. 또한, 모델 성장 전략에 따라 각 단계의 안정적인 성장이 가능하도록 함수 보존 성장을 도입합니다.
4. 벤치마크 평가 및 IQ 테스트 기반 평가
4.1 벤치마크 평가
FLM-101B는 다양한 벤치마크를 통해 평가되었습니다. 기존의 지식 평가 벤치마크와 더불어, 새롭게 도입된 IQ 테스트에 기반한 평가 방식은 모델의 심볼릭 매핑, 규칙 이해, 패턴 마이닝, 그리고 방해 요소 차단 능력을 종합적으로 평가합니다.
4.2 전문 지식 강화 버전 평가
전문 지식 데이터를 사용하여 훈련된 eFLM-16B 모델은 향상된 지식 평가에서 우수한 성능을 보였으며, 전문 지식이 training dataset에 포함될 때 LLM의 능력이 어떻게 변화하는지를 확인할 수 있습니다.
[Data Composition 도메인 스페서픽 데이터 실험 색인마킹]
5. 결론
본 논문은 100B+ 파라미터 규모의 LLM을 효과적으로 훈련할 수 있는 새로운 성장 전략을 제안하며, 이를 통해 비용을 크게 절감할 수 있음을 보여줍니다. 또한, 전통적인 벤치마크와 IQ 테스트 기반 평가를 통해 모델의 다양한 능력을 평가하고, 이런 평가가 LLM의 복잡한 능력을 더 잘 반영할 수 있음을 제시합니다.
Large language models (LLMs) have demonstrated great successes in a wide range of tasks, particularly in language processing [65; 64; 11; 30] and multimodal tasks [82; 33]. Throughout their development, many model architectures have been proposed and evaluated, including decoderonly structures (e.g., the GPT series [40; 41; 3] and the LLAMA series [58; 59]), encoder-only structures (e.g., BERT [10]), and encoder-decoder structures (e.g., T5 [44]), along with their variants [29; 21; 55; 45]. Regardless of the differences in model architectures, all LLMs face the same challenge of high training cost. There is also a current trend suggesting using larger amounts of training data. For example, the LLAMA-1 [58] models use 1-1.4 T tokens for training, while LLAMA-2 [59] series use 2T tokens. A primary emphasis in LLM research hence is to find effective solutions to reduce training costs.
In this paper, we present our solutions to train an LLM at the 100B-parameter scale using a growth strategy inspired by our previous research [78]. “Growth” means that the number of parameters is not fixed, but expands from small to large along the training progresses. Figure 1 illustrates three typical scenarios for growth strategies. As the FLOPs of LLMs are approximately proportional to their number of parameters [19], the area under the parameter curve represents the computational cost of training.
Figure 1: An overview of different growth strategies.
Figure 1(a) serves as a reference for the cost with a constant number of parameters (y-axis) w.r.t. the number of tokens (x-axis). Figure 1(b) illustrates a straightforward linear growth strategy, leading to a cost-saving of exactly 50%; Figure 1(c) showcases a modest growth strategy that reduces the cost by less than 50%; in contrast, Figure 1(d) represents an aggressive growth strategy, which reduces the cost by more than 50%. This analysis informs our decision to employ the aggressive growth strategy for maximal computational savings. In our model training, we achieve aggressive growth with an enhanced growth strategy originated in our previous work MSG [78], a strategy that achieves strict function-preserving when growing.
With a fixed $100K budget, we focus on 100B+ parameters. Although the Chinchilla laws [19] suggest that training a smaller model with more data may potentially result in higher scores on some benchmarks due to more sufficient training, we believe that verifying the feasibility of a growth strategy [15; 51; 6; 78] would be a new direction and beneficial to the community of LLM as well. This is because (i) larger models have higher upper bounds for capabilities that may not be reached by scaling only the training data [69], and (ii) data can be linearly scaled up with the budget, while a growth strategy has the potential for saving cost regardless of the amount of available data, if it turns out to be feasible. Existing studies such as [19] have not extensively investigated this area because they only consider the scenarios where model sizes are fixed through training.
Another critical challenge in LLM research is evaluation. Existing mainstream evaluations can be broadly grouped into two categories: knowledge evaluation (i.e., MMLU [17] and C-Eval [20]), and NLP tasks evaluation. Such evaluations may not fully reflect the model capability due to potential data leakage if some of the evaluation datasets were also used in model training. In addition, it is also difficult to distinguish whether the models remember a piece of knowledge or possess the capacity for reasoning and/or inference. Borrowing some ideas from Intelligence Quotient (IQ) tests (i.e., Perceptual Reasoning and Working Memory [67]), we consolidate another range of evaluations on LLMs, including symbolic mapping, rule understanding, pattern mining, and anti-interference evaluations. Symbolic mapping [71] evaluation tests the capability of LLMs in learning to use (less meaningful) symbols instead of (more meaningful) category labels for some forms of classification tasks. Rule understanding evaluation is to test the capability of understanding some given rules, and then to perform corresponding actions. Pattern mining (involving both induction and deduction), is often used in various levels of competition. It tests the pattern-finding capability (e.g., repetition of certain parts of a given input). Last but not least, anti-interference is an ability to recognize core information from noisy input [5; 84]. We believe the evaluations inspired by IQ tests are less likely to be affected by data leakage or memorization, hence providing another dimension for fair, objective, and reliable evaluations of LLMs.
To summarize, the paper has made the following contributions. First, to the best of our knowledge, this is the first attempt to use a growth strategy to train an LLM with 100B+ parameters from scratch. Simultaneously, it is probably the lowest-cost model with 100B+ parameters, costing only 100,000 US dollars. Second, we address several instability issues via promising approaches for hyperparameter search, function-preserving growth, and improvements based on our FreeLM [25]. Our methodology holds potential benefits for the broader research community. Third, we conduct extensive evaluations, including both the commonly used knowledge-oriented benchmarks and the new range of evaluations inspired by IQ tests. Experimental results show that, despite its low training cost, FLM-101B is competitive and robust. Lastly, we release the model checkpoints, code, related tools, et al. to promote research on bilingual Chinese and English LLMs at the scale of 100B+.
In this section, we provide an outline of FLM-101B, detailing its architecture, pre-training methods, and configuration specifics.
The architecture of an LLM significantly impacts its capabilities. Current researches [80; 3] under-score the high costs associated with experimenting on diverse architectures. Hence, it is more suitable to select an architecture with great potential for cost effectiveness and model capability.
Backbone. Among the many existing model architectures, we adopt FreeLM [25] as the backbone for our models, with modifications. FreeLM is based on GPT [41], a transformer-like architecture with a decoder-only configuration known for its exceptional performance. Different from GPT, FreeLM features two pre-training objectives: the language objective and the teacher objective (Section 2.2). We preserve the GPT-style transformer block designs, including the Pre-LayerNorm and the additional LayerNorm after the last transformer layer. We employ the tokenizer derived from GPT-4, characterized by a vocabulary size of 100, 256.
Integration of xPos. To enhance long sequence modeling, we integrate the Extrapolatable Position Embedding (xPos) [56] in FLM-101B. This innovation draws inspiration from the principles of RoPE [54], which aims to improve the length extrapolation ability. By introducing an exponential decay into the rotation matrix, xPos strives to rectify this hurdle. To the best of our knowledge, FLM-101B is the largest model to date that incorporates the xPos technology.
Model Sizes. Benefiting from the proposed growth strategy, the FLM series produces three models with 16B, 51B, and 101B (i.e., FLM-101B) parameters in a single training. The training process is carried out in a sequential manner, starting from a smaller model (i.e., 16B) and progressively growing to larger ones (i.e., 51B and 101B).
FLM-101B. By design, FLM-101B is an English-Chinese bilingual model pre-trained with causal language modeling. It mixes English and Chinese corpora at a ratio of approximately 53.5% : 46.5% for language modeling. Inspired by the finding that instruction data can augment LLMs’ comprehension capabilities [37], we integrate multi-task instructionally prompted data: OIG (Open Instruction Generalist) 1 and COIG (Chinese Open Instruction Generalist) 2, in the pre-training stage.
eFLM-16B. To evaluate the effect of using domain-specific knowledge data (Section 4.2), we apply the FreeLM teacher signals [25] to enhance FLM. Due to computational cost, we incorporate the teacher signals only in the smallest 16B model. This knowledge-enhanced FLM-16B is named eFLM-16B.
Table 1: Partial configurations for different growth stages.
The original FreeLM incorporates two training objectives: language modeling objective guided by language signals and binary classification objective guided by teacher signals. In FLM-101B, we unify the two objectives by using a masking strategy and two specialized tokens. These tokens facilitate the transformation of the binary classification objective into the unified language modeling format. The unified training objective leads to training stability when the model becomes much larger in scale. Hence, for eFLM-16B, we transform this binary classification into the format of causal (U+1F608) 3, from language modeling. Specifically, we employ two emojis: the vocabulary to replace the original binary labels of 1 and 0. We apply zero-masking to the loss for tokens in the propositions and predict one of these two special tokens at the end of each proposition. By this method, we unify the teacher objective and language modeling. Moreover, we discard the original Iterative Training approach [25] and completely mix the samples from both signals in every batch. This strategy can enhance the consistency of data sampling distribution as well as improve training stability.
The essence of the low cost in scaling FLM-101B up is the growth strategy in model training. Specifically, we train three models, with 16B, 51B, and 101B parameters respectively, in a sequential manner. Each model inherits knowledge from its predecessor. This is contrary to the common practice that the models of different sizes are trained independently [58; 59].
Function-preserving Growth. Function preservation means that before and after growth, the models yield consistent outputs given the same arbitrary inputs. This property has proven beneficial for both knowledge inheritance [8; 6; 51] and training stability [78]. The growth operators used in FLM-101B training originate from [78], with improvement. Specifically, to adapt these operators to the multi-node 3D parallel framework, we implement them by extending the model structures offline and reloading the checkpoint when the next stage starts.
Schedules and Cost-Effectiveness. Model growth scheduling is a trade-off between the pros and cons inherent to models of different sizes [78]: a smaller model is faster in computing each training step, enabling more rapid consumption of training data for broader commonsense knowledge; conversely, a larger model is better in the reduction of loss per step, indicating a deeper understanding of the nuanced linguistic patterns. We train the 16B model with 245.37B tokens, the 51B model with 39.64B tokens, and the 101B model with 26.54B tokens. The billion tokens per day of different sizes are listed in Table 1. Under this growth schedule, the total time cost for our 101B model is 21.54 days, which is 72% time-saving (or a 3.56x speedup) compared to training a 101B model from scratch (76.74 days). This is consistent with our motivations depicted in Figure 1.
FLM-101B is trained on a cluster of 24 DGX-A800 GPU (8×80G) servers. Following the growth strategy, we sequentially complete the model training for sizes 16B, 51B, and 101B on this cluster.
The Parallel Strategies. Data parallelism [60] and tensor model parallelism [52] have become the standard approaches for training models at the billion scale. Nevertheless, an excessive amount of tensor parallelism may escalate GPU communication overheads, hampering training efficiency. To tackle this problem, we integrate pipeline model parallelism [35] and employ a 3D parallel strategy for optimal throughput. Moreover, by employing sequence parallelism [24], we slice the inputs to the
Transformer core’s LayerNorm and Dropout layers along the sequence length dimension, leading to additional savings in GPU computational resources and memory utilization. We also utilize the Megetron-LM 4 implementation of the distributed optimizer [46] to further reduce the GPU memory consumption, which is a technique that evenly distributes the optimizer states across data parallel ranks.
Table 2: Parallel strategies and throughput for different growth stages. For NVIDIA A800 GPUs, the peak theoretical FLOPs per second is 312 teraFLOPs/sec. Gradient accumulation is applied for the large global batch size.
Table 2 shows the parallelism configurations and training throughput in each stage of FLM-101B training under our growth strategy. In different stages, we configure different Tensor Parallel × Pipeline Parallel sizes to achieve higher throughput. The single-GPU throughput for all three training stages consistently exceeds 160 teraFLOPs/sec with a utilization rate of at least 51.3%. For comparison, GLM-130B achieves 135 teraFLOPs/sec [80] with a 42.27% utilization rate. We can also find that FLM-101B has a higher FLOP utilization rate than Megatron-LM [24] under a similar model size.
FLM-101B Configurations. The FLM-101B model is structured with a hidden state dimension of 10, 240, a layer number of 80, a context window of 2,048 tokens, 80 attention heads, and a vocabulary size of 100, 256. FLM-101B uses the AdamW optimizer [31] with β1 = 0.9 and β2 = 0.95. A cosine learning rate schedule is employed, leading to a final learning rate of 6e − 6. We use a weight decay of 0.1 and gradient clipping of 1.0.
Table 1 presents part of the hyperparameters used in different growth stages. In each growth stage, we approximately inherit the previous learning rate and adhere to the same schedule. The learning rate at the beginning of each stage is reported in the table. In the 16B stage, 4,608k samples are used for learning rate warmup, while in later growth stages, we use fewer samples of 230.4k. Note that we do not apply batch size warmup because we address the stability issue in a different manner, detailed in Section 3.
The training duration and token consumption for each stage are also outlined in Table 1. In total, FLM-101B training is accomplished within 22 days using 311.54B tokens.
Models beyond 100B parameters [49; 80] usually suffer from a bunch of notorious stability issues including loss divergence, gradient explosion, and numerical overflow/underflow. This not only inflates the cost of searching for feasible hyperparameters like optimal learning rates, but also intensifies ongoing maintenance during training, such as babysitting, issue resolution, data adjustment, and rebooting. Moreover, this makes the budget of the whole project unpredictable. We have undertaken the following efforts to mitigate these issues.
Loss Prediction. The Tensor Programs theories [75; 28] unveil the universal relations across the training dynamics of a series of models with the model width tending to infinite. For certain classes of hyperparameters, this results in a parameterized mapping for their optimal value between a small model and its larger counterparts, which is termed µP [76]. Two important insights are:
Based on these findings, our method to solve training stability is as follows: we first determine the data distribution before the FLM-16B training starts. Next, we perform a grid search on three hyperparameters including the learning rate, initialization standard deviation, and the softmax temperature in the output layer. This grid search is performed by running a proxy model (less than 100M ) with a hidden state dimension (“model width”) of 256 and a head number of 2. All the other structural hyperparameters and training data of the proxy model are identical to those of FLM-16B. A single run of grid search takes 24.6 hours with data parallelism on 6 nodes, which is equivalent to 6 hours per run given our 24-node infrastructure. Finally, We find a group of well-performing hyperparameters: learning rate = 4e − 4, standard deviation = 1.6e − 2, and softmax temperature = 2.0, through this grid search. Transferring these hyperparameters to the 16B model via µP [76] led to a seamless training experience devoid of instabilities. Combining with MSG [78], we also witness no post-growth divergence in FLM-51B and FLM-101B.
Figure 2: Training loss for FLM-101B models.
The full training loss curve is presented in Figure 2. The first stage (16B) stably goes through 246B tokens. Immediately afterwards, FLM grows from 16B to 51B. As expected, the training is stable. More importantly, we observe that the loss curve becomes steeper. It matches the intuition that a larger model is better in loss reduction per step. Subsequently, FLM grows to 101B. Although the training data for the 51B stage are only 40B tokens, the 101B training remains stable, and the loss curve becomes slightly steeper again. This loss curve proves the effectiveness of the growth strategy.
Our implementations of µP are largely consistent with those in µScaling [77], with modifications to handle the rotary embedding. Thus, the intermediate loss ranges for FLM-16B are also predictable with the results from multiple proxy widths at the same steps.
Mixed Precision with Bfloat16. We apply mixed-precision training to save run-time memory and reduce time costs. Specifically, we choose Bfloat16 instead of FP16 due to its superior precision for values approaching zero, making it more suitable for µP. As a result, we do not encounter the FP16 underflow issue reported by [76]. To our knowledge, the FLM models are currently the largest ones successfully trained with mixed precision + µP. Moreover, Bfloat16 negates the need for loss scale adjustments, making our training procedure more promising and reproducible.
Many existing benchmarks (e.g., Open LLM) focus on assessing the knowledgeability of LLMs. In this section, we discuss the results of FLM on these benchmarks. We argue that knowledge alone might not comprehensively reflect LLM’s capability (see Section 4.2 for more details). Thus, in addition to the common benchmark evaluation, we borrow the concept of IQ tests and evaluate LLMs with some specific tasks in Section 5.
Cost Estimation Method. Due to the considerable computational expense of LLMs, we also emphasize their associated costs in our experimental results. However, it is hard to directly compare 16B Stage51B Stage101B StageProcessed Tokens(Billions)TrainingLoss the actual cost of LLMs due to their different infrastructures, and the different costs incurred on different hardware.
To objectively compare training costs, we use the number of floating-point operations for training as the cost estimation index, which can be estimated from the model’s hyperparameters, configuration, and training data [35]. Since many models do not release the complete training configuration (e.g., GPT-3, LLAMA series), we estimate FLOPs within a range5.
For monolingual LLMs, e.g., GPT-3, the cost from monolingual data is equal to the total cost. The computational cost of GPT-3 is calculated as 376.41 (±53.77) zettaFLOPs, and LLAMA-2 (13B) as 210.37 (±28.77) zettaFLOPs. Because the cost is linear to both model parameters and training data [19], we could calculate the cost of the remaining LLAMA models easily. For bilingual or multilingual models, it is necessary to estimate based on the amount of data in the corresponding language. The total cost of GLM-130B is 421.60 zettaFLOPs. We know that the data ratio of English and Chinese is 1:1. Hence, the cost of GLM-130B for English is 210.80 zettaFLOPs, and the same for Chinese. The data ratio of FLM-101B is 53.5% : 46.5% for English and Chinese. The total cost of FLM-101B is 52.76 zettaFLOPs. According to the data ratio, the cost for English and Chinese is 28.22 zettaFLOPs and 24.54 zettaFLOPs, respectively.
Open LLM is an open-source project 6. Its target is to track and evaluate the open-sourced LLMs and chatbots. Open LLM contains four tasks: ARC-Challenge (ARC for short), HellaSwag, MMLU, and TruthfulQA. The Open LLM Leaderboard applies the average score of these tasks as a metric.
ARC: The ARC [9] dataset is proposed for graduate-school level closed book science question-answering tasks. Most problems in ARC are solvable with life experiences and Wikipedia searches. Thus, a model is expected to perform better if exposed to more commonsense and factual data.
HellaSwag: This is a sentence completion task emphasizing on commonsense inference [79]. We observe that the increase in HellaSwag performance is highly correlated with the reduction of training loss. This is intuitive because the training data is usually enriched with common sense.
MMLU: MMLU includes 57 multiple-choice tasks covering subjects spanning STEM to social science [17]. The tasks differ significantly in complexity, with many STEM-oriented questions demanding domain-specific professional knowledge and intricate reasoning to be solved.
TruthfulQA: TruthfulQA contains 817 factual questions to detect model falsehoods caused by naively mimicking human language patterns [27]. The solutions to these questions are closely associated with English Wikipedia sources. The task probes a model’s factual knowledge and resistance to popular misconceptions.
Table 3: Performance of FLM-101B and baselines including LLAMA series and GLM-130B. In order to visually compare the performance and cost, we estimate the floating-point operations (zetta = 1021) of the training process.
Table 3 details the performance of FLM-101B and strong baselines, including LLAMA series and GLM-130B. Because GPT-3 is closed-source, we could not get the probability values for a fair comparison. As a result, we cannot list GPT-3 here. GLM-130B results are achieved by our run on an open-sourced checkpoint.
5 This range originates from the use of checkpoint activation. Please check [35] for more details.
6 https://huggingface.co/spaces/HuggingFaceH4/open_TextGenerationLLM_leaderboard
Results. Among all the baseline models, FLM-101B ranks last with an average of 43.94. However, going deeper into the nature of these tasks, this does not necessarily indicate the inferiority of our model and training procedures.
(i) MMLU typically requires domain knowledge to solve. In our training of FLM-101B, no English textbook or sample exam questions are intentionally used. Nevertheless, in an FLM variant that incorporates this knowledge with FreeLM objectives (eFLM-16B, Section 2.2), even a 16B FLM model can outperform GLM-130B, supporting our claims here.
(ii) As aforementioned, TruthfulQA, ARC, and HellaSwag emphasize more on common sense and Wiki-level knowledge, and their performances improve with the increased amount of data and the reduction of training loss. With less than 0.16T English data (about one-tenth of LLAMA-2), FLM-101B already achieves the best accuracy of 41.47 among all the baselines on TruthfulQA. On ARC and HellaSwag, FLM-101B is comparable to GLM-130B with a similar amount of English data (approximately 0.2T). Also, the training data of GLM-130B includes ARC and HellaSwag, as expressly claimed in [80]. In our understanding, superior performance of FLM-101B can be expected on these three tasks if exposed to more training data.
We have also conducted experiments on a knowledge-enhanced version (eFLM-16B, detailed in Section 2.2) of the FLM to validate the effect of using domain-specific knowledge data. To reduce the training cost, we continue to train the smallest FLM-16B with teacher signals from a combination of (i) part of the auxiliary training data of MMLU [17], (ii) exam questions in similar domains and formats to C-Eval [20] 7, and (iii) other domain knowledge data. Note that, eFLM-16B is not a typical fine-tuning with additional data, which may affect the language capability of LLM. Recall that the FLM series uses FreeLM as its backbone which can learn both language and teacher signals. In this training, we preserve the language signal. Table 4 lists the result of eFLM-16B and baselines on C-Eval.
Table 4: Performance of eFLM-16B and baselines on C-eval. In this table, eFLM-16B refers to the professional-knowledge-enhanced FLM-16B. Note that C-Eval leaderboard only keeps one decimal place for the evaluation results.
Results. Enhanced with professional knowledge, significant improvements are observed. On MMLU task, the incorporation of the teacher signals with professional knowledge data results in a score of 44.50 for eFLM-16B (see Table 3), which surpasses GLM-130B (42.59), a model that also uses multi-task data in the related domain [80]. As a comparison, the MMLU score is 27.02 for the unenhanced FLM-16B. On C-Eval tasks 8, we observe that eFLM-16B performs better than GLM-130B by about 2 points. As a comparison, the average C-Eval score of the vanilla FLM-16B is 27.0, which underperforms GLM-130B. These results suggest that evaluation with professional knowledge may not fully reflect the capability of LLMs, particularly when different LLMs are trained with different data collections, and some may not come with a clear list.
Our core method for reducing computational cost is the growth strategy. We would like to answer the question of whether our growth strategy is effective in knowledge inheritance, and the trajectory of how model capabilities grow with size. Hence, we evaluate the performance of FLM on all the stages: 16B, 51B, and 101B. The training data for each stage is 0.245T, 0.04T, and 0.027T, respectively, in 7C-Eval can be considered as a Chinese version of MMLU. 8The scores are achieved on the test set by submitting to the C-Eval platform.
Table 5: Performance of the three stages of FLM on Open LLM. To reduce the computational cost during evaluation, we sample 20% and 30% items for HellaSwag and MMLU tasks, respectively. Parameters Training Data Average ARC Hellaswag MMLU TruthfulQA
Results. As expected, the performance of FLM improves with the increase in model size. FLM-101B achieves the best performance on almost all tasks. This means that our model inherits knowledge from the previous stage after each growth. We also observe that the 101B model improves the performance scores more significantly than the 51B model, with less data. This indicates that the models are successfully incorporating new weights in training after growth, and taking advantage of larger model sizes when the loss is low. Interestingly, the performance on ARC and HellaSwag increases steadily and significantly. This corresponds exactly to the steady decline of the model loss. Again, as we claimed in Section 4.1, when more training data is processed, FLM’s performance on Open LLM becomes better.
The above experiments evaluate the knowledge-related ability of FLM and how the performances depend on the amount and domain of training data. We also conduct an additional range of evaluations inspired by IQ tests in the following section.
Section 4 details the evaluation of existing benchmarks, focusing on knowledge. As we discussed in Section 1, knowledge could not fully reflect the Intelligence Quotient (IQ) of LLMs. To this end, we use existing IQ-related datasets [71; 72; 53] and make necessary modifications or generate new synthetic datasets where necessary.
Specifically, the IQ test mainly considers four aspects: symbolic mapping, rule understanding, pattern mining, and anti-interference. A common key property of these tasks is that they are dependent on the inference and generalization in a new context, instead of the previously-learned knowledge. We re-organize the modified existing datasets and our newly generated datasets under these four aspects, and introduce the motivation for each aspect, as well as the detailed execution methods.
Compared Methods. Borrowing psychological ideas that the measurement of IQ is dependent on age 9, we mainly consider models trained with similar amounts of data to FLM-101B. As a milestone of LLM development, GPT-3 (175B) [3] proposed in-context learning for the first time. GLM-130B [80] is the first open English-Chinese bilingual LLM. Hence, we select them as baseline models. Both models are trained with 300 ~400 billion tokens, which are in the same range as ours. GPT-3 focuses on English, so it is not included in the Chinese-related evaluation (i.e., CLUE-IQ).
An existing study [71] points out that classification tasks (e.g., document classification, sentiment classification) in textual forms often lack generalization. This is because they often come with very indicative and meaningful category labels. Such labels may laterally appear in the raw training data or popular websites, i.e., SemEval, IMDB [32], and Yelp 10 et al.. This leads a model to over-fit the semantics of the labels instead of inferring them from the new context, while the latter is critical for measuring intelligence as well. Considering this, we use a symbolic mapping method to replace the original category labels with symbols that are unlikely to be seen in the training data. Hence, we can evaluate the LLMs’ language understanding ability as well as the generalization abilities to a new context. Because the labels are from a given scope, we form our evaluation task as in-context learning with few-shot examples for each label.
9 https://ocw.mit.edu/ans7870/9/9.00SC/MIT9_00SCF11_text.pdf, page 367.
Figure 3: An example of symbolic mapping. The main difference is that the symbolic mapping method replaces the original label with random strings. In this example, we use <30mFC%4Z> and <?V9qP@Rx> to replace entailment and not entailment, respectively.
We use the existing benchmark datasets (e.g., SuperGLUE [61], CLUE [74]) as the source and sample up to 300 instances. Then, we replace the original category labels with random strings. Figure 3 shows an example. In this case, the entailment category is replaced by random string <30mFC%4Z> while the not entailment category is replaced by <?V9qP@Rx>. This processing also mitigates the problem that these datasets may contaminate the LLM pre-training data, since both benchmarks are public with lots of reproductions. Table 6 presents the statistics and task types of the rebuilt datasets.
Table 6: Statistics for SuperGLUE-IQ and CLUE-IQ datasets. “WSD” stands for “Word Sense Disambiguation”; “SS” stands for “Sentence Similarity”; “KR” stands for “Keyword Recognition”; coref. stands for “coreference resolution”.
SuperGLUE is a benchmark dataset used in evaluating the classification ability of various models including LLMs. However, the data is publicly available and many websites have reproduced this dataset. As a result, it is inevitable that the models might have already been trained on it. Thus, we build a new dataset named SuperGLUE-IQ based on the original dataset. Since the answers for the test set of SuperGLUE are not publicly available, we use a validation set here. There are two rules for selecting the sub-tasks: (i) the number of instances exceeds 100; (ii) the classification categories are fixed sets. The building process is detailed in Section 5.1.1. Table 7 lists the performance of FLM-101B and the baselines.
Results.
## Task Comparisons
On the BoolQ, WiC, and RTE tasks, FLM-101B and GPT-3 perform at the same level, and both outperform GLM-130B.
Specifically, GPT-3 and FLM-101B achieve more than 9 points better performance than GLM-ExamplesPromptSymbolic Mapping Method.
## Examples
### Task: BoolQ (Boolean Questions)
**Premise 1:**
- Kozlowski and the company's former chief financial officer, Mark Swartz, were sentenced, on Monday, to up to 25 years in prison.
**Hypothesis 1:**
- Kozlowski was sentenced on Monday to serve up to 25 years in prison.
**Answer 1:**
- <30mFC%4Z>...
### Task: WiC (Word-in-Context)
**Premise 2:**
- Note that SBB, CFF, and FFS stand out for the main railway company, in German, French, and Italian.
**Hypothesis 2:**
- The French railway company is called SNCF.
**Answer 2:**
- <?V9qP@Rx>
### Task: RTE (Recognizing Textual Entailment)
**Premise 3:**
- Pibul Songgram was the pro-Japanese military dictator of Thailand during World War 2.
**Hypothesis 3:**
- Pibul was the dictator of Thailand.
**Answer 3:**
## Textual Relationship Determination
Given the premise and hypothesis, determine the relationship between the two sentences.
## Instruction
### Traditional Direct Method
**Premise 1:**
- Kozlowski and the company's former chief financial officer, Mark Swartz, were sentenced, on Monday, to up to 25 years in prison.
**Hypothesis 1:**
- Kozlowski was sentenced on Monday to serve up to 25 years in prison.
**Answer 1:**
- Entailment...
**Premise 2:**
- Note that SBB, CFF, and FFS stand out for the main railway company, in German, French, and Italian.
**Hypothesis 2:**
- The French railway company is called SNCF.
**Answer 2:**
- Not entailment
**Premise 3:**
- Pibul Songgram was the pro-Japanese military dictator of Thailand during World War 2.
**Hypothesis 3:**
- Pibul was the dictator of Thailand.
**Answer 3:**
Given the premise and hypothesis, determine the relationship between the two sentences.
Table 7: Performance on SuperGLUE-IQ of GPT-3, GLM-130B, and FLM-101B. The result of GPT-3 is evaluated by API. GLM-130B is evaluated with its open-sourced checkpoint.
130B on BoolQ. On WSC task, FLM-101B and GPT-3 perform comparably while both perform worse than GLM-130B with about an 18 points gap. The technical report of GLM-130B [80] shows that they use both the WSC and RTE datasets in training. It is interesting to observe that the performance of GLM-130B on the two tasks has such a difference. Since the original label is replaced by a random string, overfitting can be ruled out to a certain extent. We believe that the main reason lies in the structure of language models: GLM-130B contains a bidirectional encoder while FLM-101B and GPT-3 are uni-directional. This feature potentially makes GLM-130B perform better in English coreference resolution tasks, while poor in reasoning-related tasks (e.g., BoolQ). More importantly, the costs of the three models are very different. FLM-101B achieves a comparable performance with GPT-3 under about 1/13 of its computational cost.
CLUE [74] is an open benchmark for Chinese NLP tasks. Similar to SuperGLUE-IQ, we build CLUE-IQ based on the CLUE dataset. Because GPT-3 is unable to handle Chinese well, here we compare FLM-101B with GLM-130B only. There are four tasks to be evaluated, including AFQMC, CSL, OCNLI, and CLUEWSC2020.11 Similar to SuperGLUE-IQ, we follow the same two rules to filter the original CLUE. Table 8 lists the performances of FLM-101B and GLM-130B.
Table 8: Performance on CLUE-IQ for GLM-130B and FLM-101B.
Results. On CLUE-IQ, our proposed FLM-101B achieves the best average performance of 42.07. Among the evaluated tasks, FLM-101B outperforms GLM-130B on AFQMC, CSL, and CLUEWSC2020. The results show that FLM-101B has good Chinese ability at the level of 100B parameters. Interestingly, FLM-101B performs better than GLM-130B on Chinese WSC, while worse than GLM-130B on English WSC. In addition, FLM-101B performs worse than GLM-103B on OCNLI. These results suggest that Chinese and English are different in nature and a model excelling in one language may not be good at both. Finally, from a cost-effective perspective, FLM-101B achieves better performance in Chinese at about 12% of the training cost of the counterpart.
Symbolic mapping is able to lighten the negative effects of data overfitting. From a different perspective, we consider understanding rules and executing them according to the given rules is a strong indication of reasoning capability. To this end, we design rule understanding evaluation. Note that, this test is different from reasoning based on the chain of thought. The former focuses on the understanding ability of simple rules (e.g., counting) and performing the right action in a closed setting, while the latter focuses on reasoning ability in an open setting (e.g., different valid reasons for the same conclusion). For example, “counting an increasing sequence of numbers” is a typical task for rule understanding evaluation, which can be zero-shot.
Details of Selected Tasks and Data. Counting (0-shot) is the simplest test method for rule understanding ability. Here, we build a bilingual dataset with 300 randomly generated items and report
11 For the details of these tasks, please refer to the original work [74].
A typical example is “Let’s count from 10010 to 10035: 10010, 10011, 10012,”. String replacement (4-shots) is another task that examines the model’s capacity to edit the text precisely following human intention. We build two sub-tasks: Replace-Word and Replace-Lowercase, each of which contains 300 instances. Each instance starts with a clear instruction: for the “Replace-Word” task, it is like “In the following sentence, replace the specified word with the target word. word to replace: WQHF target word: DFBB”; for the “Replace-Lowercase” task, it is like “For the following text, please modify all uppercase letters to lowercase”. The counting range and words to replace are sampled with a uniform distribution. Table 9 shows the performance of our proposed FLM-101B against GPT-3 and GLM-130B on both counting and string replacement tasks.
Table 9: Performance of FLM-101B, GPT-3, and GLM-130B on rule understanding tasks.
Results. On counting task, FLM-101B achieves 69.59%, about 9 points better than GLM-130B. GPT-3 wins the first place in counting and Replace-Lowercase, and second place in Replace-Word. This is potentially because GPT-3 has the largest amount of English training data. This experiment shows that the advantages of each model are varied. Hence, in future work, rule understanding evaluation tasks should cover more scenarios. Finally, considering the cost of each model, the performance of FLM-101B is satisfactory.
Pattern Mining test is common in IQ tests. In detail, it is the induction and deduction of the patterns emerging in a new context. In general, it is difficult even for humans and is frequently used in intelligence tests. Again, we face the problem that the same test data might have appeared in large quantities, so we also use replacement methods similar to Section 5.1 to alleviate this problem.
Specifically, we build a benchmark with three tasks (i.e., Head & Tail, Full Repeating, and Head Slicing) for evaluation. Head & Tail is to add a head and a tail to the given input, which should be exactly the same as the ones in the given examples. Regarding Full Repeating, the input sequence should be fully repeated once. For the Head Slicing task, the model needs to return the first fixed number of characters of the input. The number can be inferred from the preceding examples. No instruction or clue is provided except the examples.
Figure 4: Examples of pattern mining evaluation.
Figure 4 shows examples of these tasks. We sample the input strings, heads, and tails from a uniform distribution. These tasks are actually the “alphabetical” versions of the list_functions sub-task of Big-Bench [53]. The original numerical version is so simple that most existing LLMs could achieve 90%+ accuracy. To improve the distinctiveness, we replace the numbers with characters. All these tasks require the model to discover the behavior patterns inside the given examples. Each task is 5-shot and contains 100 instances. Table 10 lists the experimental results of our proposed FLM-101B against GPT-3 and GLM-130B on pattern mining tasks.
Table 10: Performance of FLM-101B, GPT-3, and GLM-130B on pattern mining tasks.
Results. On all three tasks, FLM-101B outperforms GLM-130B by a large margin. For the head & tail and full repeating tasks, FLM-101B is a few points behind GPT-3, but outperforms the latter on the head slicing task. Considering the computational cost, FLM-101B exhibits noticeable abilities in this area.
Anti-interference capability is critical for finding and utilizing information that is truly related to a specific goal, in an unseen and noisy context (Figure 5). We believe that in addition to generalization, anti-interference is also one of the important principles of AGI. For example, many LLMs will babble when given noisy cues. Another famous hard problem, the cocktail party problem in speech recognition [38], also suggests the importance of the anti-interference ability of intelligent agents. To this end, we conduct this anti-interference evaluation. Figure 5 shows two typical examples of this test.
Figure 5: Examples of anti-interference evaluation.
Selected Tasks and Data Collection. We conduct anti-interference evaluation in three task types: multiple key retrievals, single supporting fact tracking, and two supporting facts tracking. Multiple key retrieval is a kind of puzzle that hides some important information (referred to as keys) inside a lot of irrelevant text. If the anti-interference ability of LLMs is not good enough, they will output the wrong or even meaningless words. Even if LLMs pass the first challenge, they may still fail due to multiple relevant noises. We collect a multiple key retrieval dataset in similar formats as those in [7] with at most 3 keys in each instance, exemplified in Figure 5. The single supporting fact tracking and two supporting facts tracking tasks test whether a model can find the chain of supporting facts to answer a question correctly, which is hidden inside a set of irrelevant statements. There are two sub-tasks in the babi-20 [72] benchmark (qa1 and qa2 12) that are aligned with this setting. Thus, we
12 We drop qa3 due to the long context length and extraordinary difficulty for all the models
Anti-interference EvaluationThere is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.Here we go. There and back again….Here we go. There and back again. Pass key 1 is 4. Remember it. I4kh-DMS8y is pass key 2.Here we go. There and back again….Here we go. There and back again.The pass key 1 I told you wasSupporting FactsDaniel went back to the office.
Daniel travelled to the bathroom.
- Q: Where is Daniel?
- A: bathroomSandra journeyed to the kitchen. Daniel journeyed to the bathroom.
- Q: Where is Sandra?
- A: kitchenDaniel travelled to the hallway. John moved to the office. John went to the bathroom. John travelled to the office.
- Q: Where is Daniel?
- A: hallwayDaniel went back to the hallway. Daniel travelled to the garden. Sandra went to the office. Sandra journeyed to the kitchen.
- Q: Where is Daniel?
- A: Prompt Examples Multiple Key Retrival directly modify them in a generative format with 3 shots.
We randomly sampled 300 questions for each of these three tasks. Table 11 shows the evaluation results on anti-interference.
Table 11: Performance of FLM-101B, GPT-3, and GLM-130B on anti-interference evaluation.
Results. Among all the baselines for this evaluation, FLM-101B achieves the second-best passing rates of 89.00%, 59.00%, and 32.33%, respectively, which is an advantage of about 11%, 3%, and 6% compared to GLM-130B. Considering the computational cost, FLM-101B delivers exciting performance.
In conclusion, on our four additional evaluations inspired by the IQ tests, FLM-101B outperforms GLM-130B and obtains competitive results compared to GPT-3 in some tasks with much lower costs. Except for the impacts of training data, the superiority may be owed to a story that in the growth strategy, the smaller models in early stages refine a more efficient searching space, which keeps taking effect when the model grows larger with increased generalization ability.
Scaling Up Language Models to 100B. The burgeoning advancements in hardware and computational techniques in recent years [47; 52] have laid a robust groundwork for the expansion of language models. The benefits of scaling up LLMs include discernible advantages in language perplexity supported by studies on scaling laws [23; 18; 19; 77], as well as the emergent cognitive competencies in models [69; 4].
In the realm of 100+ billion parameters, examples of closed-source pre-trained LLMs include GPT-3 [3], Gopher [42], and Palm [1]. For closed-source models trained on Chinese data, notable mentions are Ernie 3.0 [63], Pangu-Σ [48], and InternLM [57]. Turning our attention to open-source variants, OPT [81] and BLOOM [49] are among the counterparts to GPT-3; the Llama [58; 59] series strategically operates on a slightly reduced scale (approximately 70B parameters) but amplifies the data to 2T. GLM-130B [80] is an open-source bilingual model with decent performance in both Chinese and English tasks. Nevertheless, the development trajectory and cost of GLM-130B remain largely inaccessible to many academic and industrial entities. FLM-101B is an exemplary paradigm for achieving comparable performance with a relatively small $100K budget. It is our aspiration that this model serves as a catalyst, expediting research advancements and making them more economically feasible in this domain.
Aligning with Humans. Despite the evidence that foundation LLMs present reasoning abilities in zero/few-shot learning and chain-of-thought prompting [3; 70], further refinement is needed to enhance their abilities to follow instructions [68] and align with human preferences [37; 36; 13; 2]. Supervised fine-tuning releases the potential of LLMs to imitate the instruction-following formats and provide human-like responses in dialogical and problem-solving contexts [66; 73; 34; 26]. Meanwhile, policy optimization methods [50; 43] lead LLMs to generate responses that maximize rewards congruent with human preferences, e.g., being helpful and harmless [12].
On the other hand, although these post-training techniques have proven effective and successful in industrial applications, the scaling laws regarding model sizes persist even after alignment with humans: larger models provide more factual and reasonable responses [16], as well as being better calibrated with their confidence probabilities [22]. We hereby release FLM-101B as a large foundation model, making it an accessible starting point for subsequent alignment studies.
LLM Evaluation. Widely-used approaches to evaluate LLMs include natural language processing benchmarks [74; 61], commonsense knowledge benchmarks [9; 79; 27], and professional knowledge benchmarks [17; 20]. For chatbots after fine-tuning, automatic and semi-automatic playgrounds are developed to evaluate their human alignment abilities [83]. Although knowledge-oriented ability is important, the results can be substantially impacted by training data and domains. To measure other classes of abilities, existing research like Big-Bench [53] and babi-20 [72] include some sub-tasks relevant to IQ tests, while others still depend more on NLP and knowledge. In this work, we add additional ranges of evaluation in the IQ-test paradigms by re-organizing existing datasets as well as creating new ones where proper.
Model Growth A line of existing work studies the progressive expansion of structures in training Transformer-like models [14; 51; 15; 6; 39; 62; 78]. To our knowledge, FLM-101B presents the first attempt to use a growth strategy to train LLMs in the 100B+ scale. For a more comprehensive summary, please refer to [78].
In this paper, we introduce FLM-101B, an open-source LLM that is successfully trained from scratch within a $100,000 budget. The key idea of reducing the training cost of FLM-101B is to utilize the growth strategy to break through the fixed number of model parameters. To fairly evaluate LLMs, we conduct a set of evaluations inspired by IQ tests. We believe that along this pathway, better IQ evaluation methods will continue to emerge in future studies. Experimental results show that FLM-101B outperforms strong baseline models under the same computational cost.
The power of LLMs is very exciting. We believe that LLMs are one of the important possible technical paths to AGI. For the sustainable development of LLMs, we believe that it may be an effective path to construct a basic LLM with strong reasoning capabilities but not a large amount of knowledge (for cost saving), and then expand the knowledge of the LLM in different domains to better support applications. Besides, our exploration on the growth strategy as well as training stability would potentially be beneficial for future attempts of further scaling up LLMs, e.g., beyond 1T parameters.