
Survey | LLM Survey

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-02-17

Large Language Models: A Survey

  • url: https://arxiv.org/abs/2402.06196
  • pdf: https://arxiv.org/pdf/2402.06196
  • abstract: Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs’ ability of general-purpose language understanding and generation is acquired by training billions of model’s parameters on massive amounts of text data, as predicted by scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.

1 INTRODUCTION

Language modeling is a long-standing research topic, dating back to the 1950s with Shannon’s application of information theory to human language, where he measured how well simple n-gram language models predict or compress natural language text [3]. Since then, statistical language modeling became fundamental to many natural language understanding and generation tasks, ranging from speech recognition and machine translation to information retrieval [4], [5], [6].

The recent advances in transformer-based large language models (LLMs), pre-trained on Web-scale text corpora, have significantly extended the capabilities of language models. For example, OpenAI’s ChatGPT and GPT-4 can be used not only for natural language processing, but also as general task solvers that power Microsoft’s Co-Pilot systems; they can follow human instructions for complex new tasks, performing multi-step reasoning when needed. LLMs are thus becoming the basic building block for the development of general-purpose AI agents or artificial general intelligence (AGI).

As the field of LLMs is moving fast, with new findings, models and techniques being published in a matter of months or weeks [7], [8], [9], [10], [11], AI researchers and practitioners often find it challenging to figure out the best recipes to build LLM-powered AI systems for their tasks. This paper gives a timely survey of the recent advances in LLMs. We hope this survey will prove a valuable and accessible resource for students, researchers and developers.

LLMs are large-scale, pre-trained, statistical language models based on neural networks. The recent success of LLMs is an accumulation of decades of research and development of language models, which can be categorized into four waves that have different starting points and velocities: statistical language models, neural language models, pre-trained language models, and LLMs.

Statistical language models (SLMs) view text as a sequence of words, and estimate the probability of text as the product of their word probabilities. The dominant form of SLMs are Markov chain models known as n-gram models, which compute the probability of a word conditioned on its immediately preceding n − 1 words. Since word probabilities are estimated using word and n-gram counts collected from text corpora, the model needs to deal with data sparsity (i.e., assigning zero probabilities to unseen words or n-grams) by using smoothing, where some probability mass of the model is reserved for unseen n-grams [12]. N-gram models are widely used in many NLP systems. However, these models are incomplete in that they cannot fully capture the diversity and variability of natural language due to data sparsity.
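
To make the n-gram formulation concrete, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing; the tiny corpus and vocabulary are invented for illustration, not taken from the survey.

```python
from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus (hypothetical)
tokens = [tok for sent in corpus for tok in f"<s> {sent} </s>".split()]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigrams)

def bigram_prob(prev, word):
    # Add-one (Laplace) smoothing reserves probability mass for unseen bigrams,
    # so no sentence is assigned exactly zero probability.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(sentence):
    words = f"<s> {sentence} </s>".split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("the cat sat"))  # seen n-grams -> higher probability
print(sentence_prob("the dog ran"))  # unseen bigram still gets non-zero mass
```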

Early neural language models (NLMs) [13], [14], [15], [16] deal with data sparsity by mapping words to low-dimensional continuous vectors (embedding vectors) and predicting the next word based on the aggregation of the embedding vectors of its preceding words using neural networks. The embedding vectors learned by NLMs define a hidden space where the semantic similarity between vectors can be readily computed as their distance. This opens the door to computing the semantic similarity of any two inputs regardless of their forms (e.g., queries vs. documents in Web search [17], [18], sentences in different languages in machine translation [19], [20]) or modalities (e.g., image and text in image captioning [21], [22]). Early NLMs are task-specific models, in that they are trained on task-specific data and their learned hidden space is task-specific.
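
As a toy illustration of how a shared embedding space lets us compare arbitrary inputs, the sketch below averages word vectors and scores them with cosine similarity; the 4-dimensional vectors are made up for the example (real NLMs learn them from data).

```python
import numpy as np

# Hypothetical 4-dimensional word embeddings.
embeddings = {
    "cheap":   np.array([0.9, 0.1, 0.0, 0.2]),
    "flights": np.array([0.1, 0.8, 0.3, 0.0]),
    "low":     np.array([0.8, 0.2, 0.1, 0.1]),
    "cost":    np.array([0.7, 0.1, 0.2, 0.3]),
    "airfare": np.array([0.2, 0.9, 0.2, 0.1]),
    "weather": np.array([0.0, 0.1, 0.9, 0.4]),
}

def embed(text):
    # Represent a query/document as the average of its word embeddings.
    vecs = [embeddings[w] for w in text.split() if w in embeddings]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = embed("cheap flights")
print(cosine(query, embed("low cost airfare")))  # semantically close -> higher score
print(cosine(query, embed("weather")))           # unrelated -> lower score
```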

Pre-trained language models (PLMs), unlike early NLMs, are task-agnostic. This generality also extends to the learned hidden embedding space. The training and inference of PLMs follows the pre-training and fine-tuning paradigm, where language models with recurrent neural networks [23] or transformers [24], [25], [26] are pre-trained on Web-scale unlabeled text corpora for general tasks such as word prediction, and then fine-tuned to specific tasks using small amounts of (labeled) task-specific data. Recent surveys on PLMs include [8], [27], [28].

LLMs refer to transformer-based neural language models that contain tens to hundreds of billions of parameters and are pre-trained on massive text data, such as PaLM [31], LLaMA [32], and GPT-4 [33], as summarized in Table III. Compared to PLMs, LLMs are not only much larger in model size, but also exhibit stronger language understanding and generation abilities, and, more importantly, emergent abilities that are not present in smaller-scale language models.

Fig. 1: LLM Capabilities.

As illustrated in Fig. 1, these emergent abilities include (1) in-context learning, where LLMs learn a new task from a small set of examples presented in the prompt at inference time, (2) instruction following, where LLMs, after instruction tuning, can follow the instructions for new types of tasks without using explicit examples, and (3) multi-step reasoning, where LLMs can solve a complex task by breaking it down into intermediate reasoning steps, as demonstrated in chain-of-thought prompting [34]. LLMs can also be augmented with external knowledge and tools [35], [36] so that they can effectively interact with users and the environment [37], and continually improve themselves using feedback data collected through interactions (e.g., via reinforcement learning with human feedback (RLHF)).
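
For example, in-context learning requires no gradient updates: the task is specified entirely inside the prompt. The snippet below sketches what such a few-shot prompt might look like; the reviews and labels are invented for illustration.

```python
# A hypothetical few-shot prompt: two demonstrations followed by a new input.
# The LLM infers the task (sentiment classification) purely from the examples.
few_shot_prompt = """\
Review: The film was a delightful surprise from start to finish.
Sentiment: positive

Review: The plot dragged and the acting felt wooden.
Sentiment: negative

Review: An absolute joy, I would watch it again tomorrow.
Sentiment:"""

# Sending this prompt to an LLM at inference time typically yields "positive",
# even though the model was never fine-tuned on this particular sentiment task.
```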

Through advanced usage and augmentation techniques, LLMs can be deployed as so-called AI agents: artificial entities that sense their environment, make decisions, and take actions. Previous research has focused on developing agents for specific tasks and domains. The emergent abilities demonstrated by LLMs make it possible to build general-purpose AI agents based on LLMs. While LLMs are trained to produce responses in static settings, AI agents need to take actions to interact with dynamic environments. Therefore, LLM-based agents often need to augment LLMs to, e.g., obtain updated information from external knowledge bases, verify whether a system action produces the expected result, and cope with situations where things do not go as expected. We will discuss LLM-based agents in detail in Section IV.

In the rest of this paper, Section II presents an overview of the state of the art of LLMs, focusing on three LLM families (GPT, LLaMA and PaLM) and other representative models. Section III discusses how LLMs are built. Section IV discusses how LLMs are used and augmented for real-world applications. Sections V and VI review popular datasets and benchmarks for evaluating LLMs, and summarize the reported LLM evaluation results. Finally, Section VII concludes the paper by summarizing the challenges and future research directions.

2 LARGE LANGUAGE MODELS

In this section we start with a review of early pre-trained neural language models, as they are the base of LLMs, and then focus our discussion on three families of LLMs: GPT, LLaMA, and PaLM. Table I provides an overview of some of these models and their characteristics.

A. Early Pre-trained Neural Language Models

Language modeling using neural networks was pioneered by [38], [39], [40]. Bengio et al. [13] developed one of the first neural language models (NLMs) that are comparable to n-gram models. Then, [14] successfully applied NLMs to machine translation. The release of RNNLM (an open source NLM toolkit) by Mikolov [41], [42] helped significantly popularize NLMs. Afterwards, NLMs based on recurrent neural networks (RNNs) and their variants, such as long short-term memory (LSTM) [19] and gated recurrent unit (GRU) [20], were widely used for many natural language applications including machine translation, text generation and text classification [43].

Then, the invention of the Transformer architecture [44] marks another milestone in the development of NLMs. By applying self-attention to compute, in parallel, an “attention score” for every word in a sentence or document that models the influence each word has on the others, Transformers allow for much more parallelization than RNNs, which makes it possible to efficiently pre-train very large language models on large amounts of data on GPUs. These pre-trained language models (PLMs) can be fine-tuned for many downstream tasks.

Fig. 2: The paper structure.

We group early popular Transformer-based PLMs, based on their neural architectures, into three main categories: encoder-only, decoder-only, and encoder-decoder models. Comprehensive surveys of early PLMs are provided in [43], [28].

1) Encoder-only PLMs: As the name suggests, encoder-only models consist only of an encoder network. These models were originally developed for language understanding tasks, such as text classification, where the model needs to predict a class label for an input text. Representative encoder-only models include BERT and its variants, e.g., RoBERTa, ALBERT, DeBERTa, XLM, XLNet, and UNILM, as described below.

BERT (Bidirectional Encoder Representations from Transformers) [24] is one of the most widely used encoder-only language models. BERT consists of three modules: (1) an embedding module that converts input text into a sequence of embedding vectors, (2) a stack of Transformer encoders that converts embedding vectors into contextual representation vectors, and (3) a fully connected layer that converts the representation vectors (at the final layer) to one-hot vectors. BERT is pre-trained using two objectives: masked language modeling (MLM) and next sentence prediction. The pre-trained BERT model can be fine-tuned by adding a classifier layer for many language understanding tasks, ranging from text classification and question answering to language inference. A high-level overview of the BERT framework is shown in Fig 3. As BERT significantly improved the state of the art on a wide range of language understanding tasks when it was published, the AI community was inspired to develop many similar encoder-only language models based on BERT.

TABLE I: High-level Overview of Popular Language Models

RoBERTa [25] significantly improves the robustness of BERT using a set of model design choices and training strategies, such as modifying a few key hyperparameters, removing the next-sentence pre-training objective and training with much larger mini-batches and learning rates. ALBERT [45] uses two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT: (1) splitting the embedding matrix into two smaller matrices, and (2) using repeating layers split among groups. DeBERTa (Decoding-enhanced BERT with disentangled attention) [26] improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a novel virtual adversarial training method is used for fine-tuning to improve models’ generalization. ELECTRA [46] uses a new pre-training task, known as replaced token detection (RTD), which is empirically proven to be more sample-efficient than MLM. Instead of masking the input, RTD corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, a discriminative model is trained to predict whether a token in the corrupted input was replaced by a generated sample or not. RTD is more sample-efficient than MLM because the former is defined over all input tokens rather than just the small subset being masked out, as illustrated in Fig 4.

Fig. 3: Overall pre-training and fine-tuning procedures for BERT. Courtesy of [24]

Fig. 5: Cross-lingual language model pretraining. The MLM objective is similar to BERT, but with continuous streams of text as opposed to sentence pairs. The TLM objective extends MLM to pairs of parallel sentences. To predict a masked English word, the model can attend to both the English sentence and its French translation, and is encouraged to align English and French representations. Courtesy of [47].

Fig. 4: A comparison between replaced token detection and masked language modeling. Courtesy of [46].

XLMs [47] extended BERT to cross-lingual language models using two methods: (1) an unsupervised method that only relies on monolingual data, and (2) a supervised method that leverages parallel data with a new cross-lingual language model objective, as illustrated in Fig 5. XLMs obtained state-of-the-art results on cross-lingual classification, and unsupervised and supervised machine translation, at the time they were proposed.

There are also encoder-only language models that leverage the advantages of auto-regressive (decoder) models for model training and inference. Two examples are XLNet and UNILM. XLNet [48] is based on Transformer-XL, pre-trained using a generalized autoregressive method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. UNILM (UNIfied pre-trained Language Model) [49] is pre-trained on three types of language modeling objectives: unidirectional, bidirectional, and sequence-to-sequence prediction, with the model parameters shared across objectives, as illustrated in Fig 6.

Fig. 6: Overview of unified LM pre-training. The model parameters are shared across the LM objectives (i.e., bidirectional LM, unidirectional LM, and sequence-to-sequence LM). Courtesy of [49].

2) Decoder-only PLMs: Two of the most widely used decoder-only PLMs are GPT-1 and GPT-2, developed by OpenAI. These models laid the foundation for the more powerful LLMs that followed, i.e., GPT-3 and GPT-4.

GPT-1 [50] demonstrates for the first time that good performance over a wide range of natural language tasks can be obtained by Generative Pre-Training (GPT) of a decoder-only Transformer model on a diverse corpus of unlabeled text in a self-supervised learning fashion (i.e., next word/token prediction), followed by discriminative fine-tuning on each specific downstream task (with much fewer samples), as illustrated in Fig 7. GPT-1 paves the way for subsequent GPT models, with each version improving upon the architecture and achieving better performance on various language tasks.


Fig. 7: High-level overview of GPT pretraining, and fine-tuning steps. Courtesy of OpenAI.

GPT-2 [51] shows that language models are able to learn to perform specific natural language tasks without any explicit supervision when trained on a large WebText dataset consisting of millions of webpages. The GPT-2 model follows the model design of GPT-1 with a few modifications: layer normalization is moved to the input of each sub-block, additional layer normalization is added after the final self-attention block, initialization is modified to account for the accumulation on the residual path (scaling the weights of residual layers accordingly), the vocabulary size is expanded to 50,257, and the context size is increased from 512 to 1024 tokens.

3) Encoder-Decoder PLMs: In [52], Raffel et al. show that almost all NLP tasks can be cast as a sequence-to-sequence generation task. Thus, an encoder-decoder language model is, by design, a unified model in that it can perform all natural language understanding and generation tasks. Representative encoder-decoder PLMs we review below are T5, mT5, MASS, and BART.

T5 [52] is a Text-to-Text Transfer Transformer (T5) model, where transfer learning is effectively exploited for NLP via an introduction of a unified framework in which all NLP tasks are cast as a text-to-text generation task. mT5 [53] is a multilingual variant of T5, which is pre-trained on a new Common Crawl-based dataset consisting of texts in 101 languages.

MASS (MAsked Sequence to Sequence pre-training) [54] adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence. The encoder takes a sentence with a randomly masked fragment (several consecutive tokens) as input, and the decoder predicts the masked fragment. In this way, MASS jointly trains the encoder and decoder for language embedding and generation, respectively.

BART [55] uses a standard sequence-to-sequence translation model architecture. It is pre-trained by corrupting text with an arbitrary noising function, and then learning to reconstruct the original text.

B. Large Language Model Families

Large language models (LLMs) mainly refer to transformer-based PLMs that contain tens to hundreds of billions of parameters. Compared to the PLMs reviewed above, LLMs are not only much larger in model size, but also exhibit stronger language understanding and generation abilities, as well as emergent abilities that are not present in smaller-scale models. In what follows, we review three LLM families: GPT, LLaMA, and PaLM, as illustrated in Fig 8.

1) The GPT Family: Generative Pre-trained Transformers (GPT) are a family of decoder-only Transformer-based language models developed by OpenAI. This family consists of GPT-1, GPT-2, GPT-3, InstructGPT, ChatGPT, GPT-4, CODEX, and WebGPT. Although early GPT models, such as GPT-1 and GPT-2, are open-source, recent models, such as GPT-3 and GPT-4, are closed-source and can only be accessed via APIs. GPT-1 and GPT-2 were discussed in the early PLM subsection above. We start with GPT-3 below.

GPT-3 [56] is a pre-trained autoregressive language model with 175 billion parameters. GPT-3 is widely considered the first LLM in that it not only is much larger than previous PLMs, but also for the first time demonstrates emergent abilities that are not observed in previous smaller PLMs. GPT-3 shows the emergent ability of in-context learning, which means GPT-3 can be applied to any downstream task without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieved strong performance on many NLP tasks, including translation, question answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, and 3-digit arithmetic. Fig 9 plots the performance of GPT-3 as a function of the number of examples in in-context prompts.

CODEX [57], released by OpenAI in 2021, is a general-purpose programming model that can parse natural language and generate code in response. CODEX is a descendant of GPT-3, fine-tuned for programming applications on code corpora collected from GitHub. CODEX powers Microsoft’s GitHub Copilot.

WebGPT [58] is another descendant of GPT-3, fine-tuned to answer open-ended questions using a text-based web browser, allowing it to search and navigate the web. Specifically, WebGPT is trained in three steps. First, WebGPT learns to mimic human browsing behaviors using human demonstration data. Then, a reward function is learned to predict human preferences. Finally, WebGPT is refined to optimize the reward function via reinforcement learning and rejection sampling.

To enable LLMs to follow expected human instructions, InstructGPT [59] is proposed to align language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, a dataset of labeler demonstrations of the desired model behavior is collected, and GPT-3 is fine-tuned on this dataset. Then, a dataset of human-ranked model outputs is collected to further fine-tune the model using reinforcement learning. The method is known as Reinforcement Learning from Human Feedback (RLHF), as shown in Fig 10. The resultant InstructGPT models show improvements in truthfulness and reductions in toxic output generation, while having minimal performance regressions on public NLP datasets.

A major milestone in LLM development was the launch of ChatGPT (Chat Generative Pre-trained Transformer) [60] on November 30, 2022. ChatGPT is a chatbot that enables users to steer a conversation to complete a wide range of tasks such as question answering, information seeking, text summarization, and more. ChatGPT is powered by GPT-3.5 (and later by GPT-4), a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response.

Fig. 8: Popular LLM Families.

GPT-4 [33] is the latest and most powerful LLM in the GPT family. Launched in March 2023, GPT-4 is a multi-modal LLM in that it can take image and text as inputs and produce text outputs. While still less capable than humans in some of the most challenging real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers, as shown in Fig 11. Like early GPT models, GPT-4 was first pre-trained to predict next tokens on large text corpora, and then fine-tuned with RLHF to align model behaviors with human-desired ones.

2) The LLaMA Family: LLaMA is a collection of foundation language models released by Meta. Unlike GPT models, LLaMA models are open-source, i.e., model weights are released to the research community under a noncommercial license. Thus, the LLaMA family has grown rapidly, as these models are widely used by many research groups to develop better open-source LLMs to compete with closed-source ones, or to develop task-specific LLMs for mission-critical applications.

The first set of LLaMA models [32] was released in February 2023, ranging from 7B to 65B parameters. These models are pre-trained on trillions of tokens collected from publicly available datasets. LLaMA uses the transformer architecture of GPT-3, with a few minor architectural modifications, including (1) using the SwiGLU activation function instead of ReLU, (2) using rotary positional embeddings instead of absolute positional embeddings, and (3) using root-mean-squared layer normalization instead of standard layer normalization. The open-source LLaMA-13B model outperforms the proprietary GPT-3 (175B) model on most benchmarks, making it a good baseline for LLM research.
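
To illustrate two of these modifications, here is a minimal NumPy sketch of root-mean-squared layer normalization (RMSNorm) next to standard LayerNorm, plus the SwiGLU gating; the shapes, epsilon, and random weights are illustrative, and the learnable gain/bias parameters of the real layers are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm: center and scale by mean/variance over the feature dim.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm: rescale by the root mean square only, with no mean-centering.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

def swiglu(x, W, V):
    # SwiGLU activation: Swish(xW) gated elementwise by a second projection xV.
    swish = (x @ W) * (1.0 / (1.0 + np.exp(-(x @ W))))  # SiLU / Swish
    return swish * (x @ V)

x = np.random.randn(2, 8)       # (batch, hidden), illustrative sizes
W = np.random.randn(8, 16)
V = np.random.randn(8, 16)
print(layer_norm(x).shape, rms_norm(x).shape, swiglu(x, W, V).shape)
```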

Fig. 9: GPT-3 shows that larger models make increasingly efficient use of in-context information. It shows in-context learning performance on a simple task requiring the model to remove random symbols from a word, both with and without a natural language task description. Courtesy of [56].


Fig. 10: The high-level overview of RLHF. Courtesy of [59].

Fig. 11: GPT-4 performance on academic and professional exams, compared with GPT 3.5. Courtesy of [33].

Fig. 13: Relative response quality of Vicuna and a few other well-known models, as assessed by GPT-4. Courtesy of Vicuna Team.

In July 2023, Meta, in partnership with Microsoft, released the LLaMA-2 collection [61], which includes both foundation language models and chat models fine-tuned for dialog, known as LLaMA-2 Chat. The LLaMA-2 Chat models were reported to outperform other open-source models on many public benchmarks. Fig 12 shows the training process of LLaMA-2 Chat. The process begins with pre-training LLaMA-2 using publicly available online data. Then, an initial version of LLaMA-2 Chat is built via supervised fine-tuning. Subsequently, the model is iteratively refined using RLHF, rejection sampling and proximal policy optimization. In the RLHF stage, the accumulation of human feedback for revising the reward model is important to keep the reward model from being changed too much, which could hurt the stability of LLaMA model training.

Fig. 12: Training of LLaMA-2 Chat. Courtesy of [61].

Alpaca [62] is fine-tuned from the LLaMA-7B model using 52K instruction-following demonstrations generated in the style of self-instruct using GPT-3.5 (text-davinci-003). Alpaca is very cost-effective to train, which makes it especially attractive for academic research. On the self-instruct evaluation set, Alpaca performs similarly to GPT-3.5, despite being much smaller.

The Vicuna team developed a 13B chat model, Vicuna-13B, by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge showed that Vicuna-13B attains response quality competitive with well-known chat models such as ChatGPT, as illustrated in Fig 13.

Like Alpaca and Vicuna, the Guanaco models [63] are also finetuned LLaMA models using instruction-following data. But the fine-tuning is done very efficiently using QLoRA such that fine-tuning a 65B parameter model can be done on a single 48GB GPU. QLoRA back-propagates gradients through a frozen, 4-bit quantized pre-trained language model into Low Rank Adapters (LoRA). The best Guanaco model outperforms all previously released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of fine-tuning on a single GPU.
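
As a rough illustration of a QLoRA-style setup, here is a hedged sketch using the Hugging Face transformers, peft, and bitsandbytes libraries: the base model is loaded as a frozen 4-bit NF4-quantized network and small trainable LoRA adapters are attached. The model name, rank, target modules, and other hyperparameters are placeholders chosen for the example, not the exact Guanaco recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model

# Load the frozen base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attach small trainable Low-Rank Adapters; gradients flow through the frozen
# 4-bit weights into these adapters only.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of projection layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Because only the adapter weights are trained, the optimizer state and gradients stay small, which is what makes fine-tuning a 65B model on a single 48GB GPU feasible.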

Koala [64] is yet another instruction-following language model built on LLaMA, but with a specific focus on interaction data that include user inputs and responses generated by highly capable closed-source chat models such as ChatGPT. The Koala-13B model performs competitively with state-of-the-art chat models according to human evaluation based on real-world user prompts.

Mistral-7B [65] is a 7B-parameter language model engineered for superior performance and efficiency. Mistral-7B outperforms the best open-source 13B model (LLaMA-2-13B) across all evaluated benchmarks, and the best open-source 34B model (LLaMA-34B) in reasoning, mathematics, and code generation. This model leverages grouped-query attention for faster inference, coupled with sliding window attention to effectively handle sequences of arbitrary length with a reduced inference cost.

The LLaMA family is growing rapidly, as more instruction-following models have been built on LLaMA or LLaMA-2, including Code LLaMA [66], Gorilla [67], Giraffe [68], Vigogne [69], Tulu 65B [70], Long LLaMA [71], and Stable Beluga2 [72], just to name a few.

3) The PaLM Family: The PaLM (Pathways Language Model) family is developed by Google. The first PaLM model [31] was announced in April 2022 and remained private until March 2023. It is a 540B parameter transformer-based LLM. The model is pre-trained on a high-quality text corpus consisting of 780 billion tokens that comprises a wide range of natural language tasks and use cases. PaLM is pre-trained on 6144 TPU v4 chips using the Pathways system, which enables highly efficient training across multiple TPU Pods. PaLM demonstrates continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. PaLM-540B not only outperforms state-of-the-art fine-tuned models on a suite of multi-step reasoning tasks, but is also on par with humans on the recently released BIG-bench benchmark.

The U-PaLM models of 8B, 62B, and 540B scales are continually trained on PaLM with UL2R, a method of continuing to train LLMs for a few steps with UL2’s mixture-of-denoisers objective [73]. An approximately 2x computational saving is reported.

U-PaLM is later instruction-finetuned as Flan-PaLM [74]. Compared to other instruction fine-tuning work mentioned above, Flan-PaLM’s fine-tuning is performed using a much larger number of tasks, larger model sizes, and chain-of-thought data. As a result, Flan-PaLM substantially outperforms previous instruction-following models. For instance, Flan-PaLM-540B, which is instruction-finetuned on 1.8K tasks, outperforms PaLM-540B by a large margin (+9.4% on average). The fine-tuning data comprises 473 datasets, 146 task categories, and 1,836 total tasks, as illustrated in Fig 14.

C. Other Representative LLMs

In addition to the models discussed in the previous subsections, there are other popular LLMs which do not belong to those three model families, yet they have achieved great performance and have pushed the LLM field forward. We briefly describe these LLMs in this subsection.

FLAN: In [78], Wei et al. explored a simple method for improving the zero-shot learning abilities of language models. They showed that instruction tuning language models on a collection of datasets described via instructions substantially improves zero-shot performance on unseen tasks. They take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. They call this instruction-tuned model FLAN. Fig 15 provides a comparison of instruction tuning with pretrain–finetune and prompting.

Fig. 15: Comparison of instruction tuning with pretrain–finetune and prompting. Courtesy of [78].

Fig. 14: Flan-PaLM fine-tuning consists of 473 datasets in the above task categories. Courtesy of [74].

PaLM-2 [75] is a more compute-efficient LLM with better multilingual and reasoning capabilities, compared to its predecessor PaLM. PaLM-2 is trained using a mixture of objectives. Through extensive evaluations on English, multilingual, and reasoning tasks, PaLM-2 significantly improves the model performance on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference than PaLM.

Med-PaLM [76] is a domain-specific PaLM, designed to provide high-quality answers to medical questions. Med-PaLM is fine-tuned on PaLM using instruction prompt tuning, a parameter-efficient method for aligning LLMs to new domains using a few exemplars. Med-PaLM obtains very encouraging results on many healthcare tasks, although it is still inferior to human clinicians. Med-PaLM 2 improves Med-PaLM via medical-domain fine-tuning and ensemble prompting [77]. Med-PaLM 2 scored up to 86.5% on the MedQA dataset (i.e., a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries), improving upon Med-PaLM by over 19% and setting a new state-of-the-art.

Gopher: In [79], Rae et al. presented an analysis of Transformer-based language model performance across a wide range of model scales — from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models were evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. The number of layers, the key/value size, and other hyper-parameters of different model sizes are shown in Fig 16.

Fig. 16: Model architecture details of Gopher with different numbers of parameters. Courtesy of [79].

T0: In [80], Sanh et al. developed T0, a system for easily mapping any natural language task into a human-readable prompted form. The model is trained on a multitask mixture of such prompted datasets and attains strong zero-shot generalization to held-out tasks.

ERNIE 3.0: In [81], Sun et al. proposed a unified framework named ERNIE 3.0 for pre-training large-scale knowledge-enhanced models. It fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks using zero-shot learning, few-shot learning or fine-tuning. They have trained ERNIE 3.0 with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph. Fig 17 illustrates the model architecture of ERNIE 3.0.

Fig. 18: Retro architecture. Left: simplified version where a sequence of length n = 12 is split into l = 3 chunks of size m = 4. For each chunk, we retrieve k = 2 neighbours of r = 5 tokens each. The retrieval pathway is shown on top. Right: Details of the interactions in the CCA operator. Causality is maintained as neighbours of the first chunk only affect the last token of the first chunk and tokens from the second chunk. Courtesy of [82].

Fig. 17: High-level model architecture of ERNIE 3.0. Courtesy of [81].

RETRO: In [82], Borgeaud et al. enhanced auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. Using a 2-trillion-token database, the Retrieval-Enhanced Transformer (Retro) obtains comparable performance to GPT-3 and Jurassic-1 [83] on the Pile, despite using 25x fewer parameters. As shown in Fig 18, Retro combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training.

GLaM: In [84], Du et al. proposed a family of LLMs named GLaM (Generalist Language Model), which use a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-, one- and few-shot performance across 29 NLP tasks. Fig 19 shows the high-level architecture of GLaM.

LaMDA: In [85], Thoppilan et al. presented LaMDA, a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. They showed that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding.

Fig. 19: GLaM model architecture. Each MoE layer (the bottom block) is interleaved with a Transformer layer (the upper block). Courtesy of [84].

OPT: In [86], Zhang et al. presented Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which they share with researchers. The architecture details of the OPT models are shown in Fig 20.

Chinchilla: In [2], Hoffmann et al. investigated the optimal model size and number of tokens for training a transformer language model under a given compute budget. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they found that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. They tested this hypothesis by training a predicted compute-optimal model, Chinchilla, which uses the same compute budget as Gopher but has 70B parameters and 4x more data.
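
As a rough back-of-the-envelope check of this equal-scaling rule, the sketch below uses the common approximation that training compute is about 6 × parameters × tokens FLOPs; the round numbers are illustrative, not figures from the paper.

```python
def train_flops(params, tokens):
    # Common approximation: training compute C ≈ 6 * N * D FLOPs.
    return 6 * params * tokens

# Gopher-like vs. Chinchilla-like configurations (illustrative round numbers).
gopher = train_flops(280e9, 300e9)       # many parameters, relatively few tokens
chinchilla = train_flops(70e9, 1.4e12)   # 4x fewer parameters, ~4x more tokens
print(f"Gopher-like:     {gopher:.2e} FLOPs")
print(f"Chinchilla-like: {chinchilla:.2e} FLOPs")  # roughly the same budget

# For a fixed budget C, scaling N and D together means each grows roughly as
# sqrt(C); doubling the model without doubling the data moves off the optimum.
budget = chinchilla
for scale in (1, 2, 4):
    n = 70e9 * scale
    d = budget / (6 * n)
    print(f"N = {n / 1e9:.0f}B -> tokens that fit the budget: {d / 1e9:.0f}B")
```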

Fig. 20: Different OPT Models’ architecture details. Courtesy of [86].

Galactica: In [87], Taylor et al. introduced Galactica, a large language model that can store, combine and reason about scientific knowledge. It is trained on a large scientific corpus of papers, reference material, knowledge bases and many other sources. Galactica performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%.

CodeGen: In [88], Nijkamp et al. trained and released a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open-sourced the training library JAXFORMER. They showed the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. They further investigated the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying sub-problems. They also constructed an open benchmark, the Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts.

AlexaTM: In [89], Soltan et al. demonstrated that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. They trained a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and showed that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM consists of 46 encoder layers, 32 decoder layers, 32 attention heads, and \(d_{model} = 4096\).

Sparrow: In [90], Glaese et al. presented Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. They used reinforcement learning from human feedback to train their models, with two new additions to help human raters judge agent behaviour. The high-level pipeline of the Sparrow model is shown in Fig 21.

Minerva: In [91], Lewkowycz et al. introduced Minerva, a large language model pretrained on general natural language data and further trained on technical content, to tackle the struggles of previous LLMs with quantitative reasoning (such as solving mathematics, science, and engineering problems).

MoD: In [92], Tay et al. presented a generalized and unified perspective for self-supervision in NLP and showed how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. They proposed Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. This framework is known as Unifying Language Learning (UL2). An overview of the UL2 pretraining paradigm is shown in Fig 22.

Fig. 21: Sparrow pipeline relies on human participation to continually expand a training set. Courtesy of [90].

Fig. 22: An overview of UL2 pretraining paradigm. Courtesy of [92].

BLOOM: In [93], Scao et al. presented BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). An overview of the BLOOM architecture is shown in Fig 23.

GLM: In [94], Zeng et al. introduced GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It was an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained.

Pythia: In [95], Biderman et al. introduced Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. They provide public access to 154 checkpoints for each of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study.


Fig 24 provides an overview of some of the most representative LLM frameworks, and the relevant works that have contributed to the success of LLMs and helped to push the limits of LLMs.

Fig. 23: An overview of BLOOM architecture. Courtesy of [93].


Orca: In [96], Mukherjee et al. developed Orca, a 13-billion parameter model that learns to imitate the reasoning process of large foundation models. Orca learns from rich signals from GPT-4, including explanation traces, step-by-step thought processes, and other complex instructions, guided by teacher assistance from ChatGPT.

StarCoder: In [97], Li et al. introduced StarCoder and StarCoderBase. They are 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on one trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. They fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. They performed the most comprehensive evaluation of Code LLMs to date and showed that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model.

KOSMOS: In [98], Huang et al. introduced KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, they trained KOSMOS-1 from scratch on web-scale multi-modal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. Experimental results show that KOSMOS-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions).

Gemini: In [99], the Gemini team introduced a new family of multimodal models that exhibit promising capabilities across image, audio, video, and text understanding. The Gemini family includes three versions: Ultra for highly complex tasks, Pro for enhanced performance and deployability at scale, and Nano for on-device applications. The Gemini architecture is built on top of Transformer decoders, and is trained to support a 32k context length (using efficient attention mechanisms).

Some of the other popular LLM frameworks (or techniques used for efficient development of LLMs) include Inner-Monologue [100], Megatron-Turing NLG [101], LongFormer [102], OPT-IML [103], MeTaLM [104], Dromedary [105], Palmyra [106], Camel [107], Yalm [108], MPT [109], ORCA-2 [110], Gorilla [67], PAL [111], Claude [112], CodeGen 2 [113], Zephyr [114], Grok [115], Qwen [116], Mamba [30], Mixtral-8x7B [117], DocLLM [118], DeepSeek-Coder [119], FuseLLM-7B [120], TinyLlama-1.1B [121], and LLaMA-Pro-8B [122].

3 HOW LLMS ARE BUILT

In this section, we first review the popular architectures used for LLMs, and then discuss data and modeling techniques, ranging from data preparation and tokenization to pre-training, instruction tuning, and alignment.

Once the model architecture is chosen, the major steps involved in training an LLM include: data preparation (collection, cleaning, deduping, etc.), tokenization, model pre-training (in a self-supervised learning fashion), instruction tuning, and alignment. We will explain each of them in a separate subsection below. These steps are also illustrated in Fig 25.

A. Dominant LLM Architectures

The most widely used LLM architectures are encoder-only, decoder-only, and encoder-decoder. Most of them are based on Transformer (as the building block). Therefore we also review the Transformer architecture here.

1) Transformer: In a ground-breaking work [44], Vaswani et al. proposed the Transformer framework, which was originally designed for effective parallel computing using GPUs. At the heart of the Transformer is the (self-)attention mechanism, which can capture long-term contextual information much more effectively using GPUs than recurrence and convolution mechanisms. Fig 26 provides a high-level overview of the Transformer architecture. In this section we provide an overview of the main elements and variants; see [44], [123] for more details.

The Transformer language model architecture, originally proposed for machine translation, consists of an encoder and a decoder. The encoder is composed of a stack of N = 6 identical Transformer layers. Each layer has two sub-layers. The first one is a multi-head self-attention layer, and the other one is a simple position-wise fully connected feed-forward network. The decoder is composed of a stack of 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder has a third sub-layer, which performs multi-head attention over the output of the encoder stack. The attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. Instead of performing a single attention function with \(d_{model}\)-dimensional keys, values and queries, it is found to be beneficial to linearly project the queries, keys and values h times with different, learned linear projections to \(d_k\), \(d_k\) and \(d_v\) dimensions, respectively. Positional encoding is incorporated to fuse information about the relative or absolute position of the tokens in the sequence.
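
To make the attention computation concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head, without masking or the learned projection matrices; shapes are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Compatibility scores between each query and every key, scaled by sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_queries, n_keys)
    weights = softmax(scores)          # attention weights sum to 1 for each query
    return weights @ V, weights        # output is a weighted sum of the values

seq_len, d_k, d_v = 5, 8, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (5, 8) (5, 5)
```

Multi-head attention simply runs h such attention functions on separately projected queries, keys, and values, and concatenates the outputs.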

Fig. 24: Timeline of some of the most representative LLM frameworks (so far). In addition to large language models with our #parameters threshold, we included a few representative works, which pushed the limits of language models, and paved the way for their success (e.g. vanilla Transformer, BERT, GPT-1), as well as some small language models. ♣ shows entities that serve not only as models but also as approaches. ♦ shows only approaches.

2) Encoder-Only: For this family, at each stage, the attention layers can access all the words in the initial sentence. The pre-training of these models usually consists of somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence. Encoder models are great for tasks requiring an understanding of the full sequence, such as sentence classification, named entity recognition, and extractive question answering. One prominent encoder-only model is BERT (Bidirectional Encoder Representations from Transformers), proposed in [24].

3) Decoder-Only: For these models, at each stage, for any word, the attention layers can only access the words positioned before it in the sentence. These models are also sometimes called auto-regressive models. The pretraining of these models is usually formulated as predicting the next word (or token) in the sequence. The decoder-only models are best suited for tasks involving text generation. GPT models are a prominent example of this category.

4) Encoder-Decoder: These models use both an encoder and a decoder, and are sometimes called sequence-to-sequence models. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input. These models are usually pre-trained using the objectives of encoder or decoder models, but usually involve something a bit more complex. For instance, some models are pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces. Encoder-decoder models are best suited for tasks about generating new sentences conditioned on a given input, such as summarization, translation, or generative question answering.

B. Data Cleaning

Data quality is crucial to the performance of language models trained on it. Data cleaning techniques such as filtering and deduplication have been shown to have a big impact on model performance.

As an example, in Falcon40B [124], Penedo et al. showed that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, they were able to obtain five trillion tokens from CommonCrawl. They also released an extract of 600 billion tokens from their REFINEDWEB dataset, and 1.3B and 7.5B parameter language models trained on it. Fig 27 shows the refinement process of CommonCrawl data in this work.

1) Data Filtering: Data filtering aims to enhance the quality of training data and the effectiveness of the trained LLMs. Common data filtering techniques include:

Removing Noise: refers to eliminating irrelevant or noisy data that might impact the model’s ability to generalize well. As an example, one can think of removing false information from the training data to lower the chance of the model generating false responses. Two mainstream approaches for quality filtering include classifier-based and heuristic-based frameworks.

Handling Outliers: Identifying and handling outliers or anomalies in the data to prevent them from disproportionately influencing the model.

Addressing Imbalances: Balancing the distribution of classes or categories in the dataset to avoid biases and ensure fair representation. This is especially useful for responsible model training and evaluation.

Text Preprocessing: Cleaning and standardizing text data by removing stop words, punctuation, or other elements that may not contribute significantly to the model’s learning.

Dealing with Ambiguities: Resolving or excluding ambiguous or contradictory data that might confuse the model during training. This can help the model provide more definite and reliable answers.

Fig. 25: This figure shows different components of LLMs.

2) Deduplication: De-duplication refers to the process of removing duplicate instances or repeated occurrences of the same data in a dataset. Duplicate data points can introduce biases in the model training process and reduce the diversity, as the model may learn from the same examples multiple times, potentially leading to overfitting on those particular instances. Some works [125] have shown that de-duplication improves models’ ability to generalize to new, unseen data.

The de-duplication process is particularly important when dealing with large datasets, as duplicates can unintentionally inflate the importance of certain patterns or characteristics. This is especially relevant in NLP tasks, where diverse and representative training data is crucial for building robust lan- guage models.

The specific de-duplication method can vary based on the nature of the data and the requirements of the particular language model being trained. It may involve comparing entire data points or specific features to identify and eliminate du- plicates. At the document level, existing works mainly rely on the overlap ratio of high-level features (e.g. n-grams overlap) between documents to detect duplicate samples.
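
As a simple illustration of document-level de-duplication by n-gram overlap, the sketch below flags document pairs whose word 3-gram Jaccard similarity exceeds a threshold; the threshold and the toy documents are arbitrary choices for the example.

```python
from itertools import combinations

def ngrams(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # Overlap ratio of the two n-gram sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

docs = {
    "doc1": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc2": "the quick brown fox jumps over the lazy dog near the river today",
    "doc3": "large language models are trained on massive web scale text corpora",
}

THRESHOLD = 0.7  # arbitrary cut-off for calling a pair "near-duplicate"
for (name_a, text_a), (name_b, text_b) in combinations(docs.items(), 2):
    sim = jaccard(ngrams(text_a), ngrams(text_b))
    if sim >= THRESHOLD:
        print(f"{name_a} and {name_b} look like near-duplicates (Jaccard = {sim:.2f})")
```

Production pipelines typically use hashing tricks such as MinHash to approximate this overlap at corpus scale, but the underlying idea is the same.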

C. Tokenization

Tokenization refers to the process of converting a sequence of text into smaller parts, known as tokens. While the simplest tokenization tool simply chops text into tokens based on white space, most tokenization tools rely on a word dictionary. However, out-of-vocabulary (OOV) words are a problem in this case because the tokenizer only knows the words in its dictionary. To increase the coverage of dictionaries, popular tokenizers used for LLMs are based on sub-words, which can be combined to form a large number of words, including words unseen in the training data or words in different languages. In what follows, we describe three popular tokenizers.

1) BytePairEncoding: BytePairEncoding is originally a data compression algorithm that uses frequent patterns at the byte level to compress data. By design, this algorithm mainly tries to keep frequent words in their original form and break down ones that are not common. This simple paradigm keeps the vocabulary reasonably small while still representing common words well. Morphological forms of frequent words can also be represented well if their suffixes or prefixes appear frequently in the training data of the algorithm.
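
The core of BPE training is a simple loop: repeatedly merge the most frequent adjacent pair of symbols. The sketch below shows that loop on a toy word-frequency table; production tokenizers add byte-level handling, special tokens, and a much larger corpus.

```python
from collections import Counter

def get_pair_counts(vocab):
    # vocab maps a space-separated symbol sequence (one word) to its frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the adjacent symbol pair with its concatenation.
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

# Toy word-frequency table, each word split into characters plus an end marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):  # the number of merges is a hyperparameter
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges[:5])   # learned merge rules, most frequent first
print(list(vocab))  # words represented with progressively larger subword units
```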

2) WordPieceEncoding: This algorithm is mainly used in well-known models such as BERT and ELECTRA. At the beginning of training, the algorithm takes in all the characters (the alphabet) of the training data, to make sure that nothing from the training dataset is left as UNK (unknown); the UNK case arises when the tokenizer is given an input containing characters it cannot tokenize. Similar to BytePairEncoding, it tries to maximize the likelihood of putting all the tokens in the vocabulary, based on their frequency.

Fig. 26: High-level overview of the Transformer architecture. Courtesy of [44].

Fig. 27: Subsequent stages of Macrodata Refinement remove nearly 90% of the documents originally in CommonCrawl. Courtesy of [124].


3) SentencePieceEncoding: Although both tokenizers described above are strong and have many advantages compared to white-space tokenization, they take for granted the assumption that words are always separated by white space. This assumption is not always true; in fact, in some languages, words can be corrupted by many noisy elements such as unwanted spaces or even invented words. SentencePieceEncoding tries to address this issue.

D. Positional Encoding

1) Absolute Positional Embeddings: Absolute positional embeddings (APE) [44] were used in the original Transformer model to preserve the information of sequence order. The positional information of words is added to the input embeddings at the bottom of both the encoder and decoder stacks. There are various options for positional encodings, either learned or fixed. In the vanilla Transformer, sine and cosine functions are employed for this purpose. The main drawback of using APE in Transformers is the restriction to a certain number of tokens. Additionally, APE fails to account for the relative distances between tokens.
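
A short sketch of the fixed sine/cosine encoding from the vanilla Transformer, added directly to the token embeddings; the sequence length and model dimension are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Even dimensions use sine, odd dimensions use cosine, with wavelengths
    # forming a geometric progression (as in the original Transformer [44]).
    positions = np.arange(seq_len)[:, None]     # (seq_len, 1)
    dims = np.arange(d_model)[None, :]          # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

embeddings = np.random.randn(16, 64)                       # (seq_len, d_model) token embeddings
x = embeddings + sinusoidal_positional_encoding(16, 64)    # inject position information
print(x.shape)
```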

2) Relative Positional Embeddings: Relative positional embeddings (RPE) [126] involve extending self-attention to take into account the pairwise links between input elements. RPE is added to the model at two levels: first as an additional component to the keys, and subsequently as a sub-component of the values matrix. This approach looks at the input as a fully-connected graph with labels and directed edges. In the case of linear sequences, edges can capture information about the relative position differences between input elements. A clipping distance, represented as k (with 2 ≤ k ≤ n − 4), specifies the maximum limit on relative locations. This allows the model to make reasonable predictions for sequence lengths that are not part of the training data.

3) Rotary Position Embeddings: Rotary Positional Embedding (RoPE) [127] tackles problems with existing approaches. Learned absolute positional encodings can lack generalizability and meaningfulness, particularly when sentences are short. Moreover, current methods like T5’s positional embedding face challenges with constructing a full attention matrix between positions. RoPE uses a rotation matrix to encode the absolute position of words and simultaneously includes explicit relative position details in self-attention. RoPE brings useful features like flexibility with sentence lengths, a decrease in word dependency as relative distances increase, and the ability to improve linear self-attention with relative position encoding. GPT-NeoX-20B, PaLM, CODEGEN, and LLaMA are among the models that take advantage of RoPE in their architectures.
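
A minimal sketch of applying rotary position embeddings to query/key vectors, using the common "rotate-half" formulation found in open-source implementations; the dimensions and base are illustrative.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    # x: (seq_len, d) with d even. Each half-dimension pair is rotated by an angle
    # proportional to the token position, so dot products between rotated queries
    # and keys depend only on the relative positions of the tokens.
    seq_len, d = x.shape
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), inv_freq)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64)   # (seq_len, head_dim) query vectors
k = np.random.randn(8, 64)
q_rot, k_rot = apply_rope(q), apply_rope(k)
print(q_rot.shape)           # attention scores are then computed on q_rot @ k_rot.T
```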

4) Relative Positional Bias: The concept behind this type of positional embedding is to facilitate extrapolation during inference for sequences longer than those encountered in training. In [128] Press et al. proposed Attention with Linear Biases (ALiBi). Instead of simply adding positional embeddings to word embeddings, they introduced a bias to the attention scores of query-key pairs, imposing a penalty proportional to their distance. In the BLOOM model, ALiBi is leveraged.
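A minimal sketch of this idea follows (NumPy, single attention head, illustrative `slope` value): the bias is a distance-proportional penalty added directly to the causal attention scores, with no positional embeddings at all.

```python
# A minimal sketch of the ALiBi idea: a distance-proportional penalty is added
# to the (causal) attention scores instead of adding positional embeddings.
import numpy as np

def alibi_bias(seq_len: int, slope: float = 0.25) -> np.ndarray:
    """Return a (seq_len, seq_len) additive bias for one attention head."""
    positions = np.arange(seq_len)
    distance = positions[:, None] - positions[None, :]   # i - j (query minus key)
    bias = -slope * np.maximum(distance, 0)              # penalty grows with distance
    bias[distance < 0] = -np.inf                          # mask future tokens (causal)
    return bias

# attention_scores = q @ k.T / np.sqrt(d) + alibi_bias(seq_len, slope)
```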

E. Model Pre-training

Pre-training is the very first step in the large language model training pipeline, and it helps LLMs to acquire fundamental language understanding capabilities, which can be useful in a wide range of language-related tasks. During pre-training, the LLM is trained on a massive amount of (usually) unlabeled text, usually in a self-supervised manner. There are different approaches used for pre-training, such as next sentence prediction [24], but the two most common ones are next token prediction (autoregressive language modeling) and masked language modeling.

In Autoregressive Language Modeling framework, given a sequence of n tokens \(x_1, ..., x_n\), the model tries to predict the next token \(x_{n+1}\) (and sometimes the next sequence of tokens) in an auto-regressive fashion. One popular loss function in this case is the log-likelihood of predicted tokens as shown in Eq 2

\(L_{ALM}(x) = \sum_{i=1}^{n} \log p(x_{i+n} \mid x_i, ..., x_{i+n-1})\) … (Equation 2)

Given the auto-regressive nature, the decoder-only models are naturally better suited to learn how to accomplish this task.
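In practice this objective is simply a next-token cross-entropy loss. The snippet below (PyTorch, with a hypothetical `model` that returns per-position vocabulary logits) is a minimal sketch of how it is typically computed.

```python
# A minimal sketch of the autoregressive (next-token prediction) objective:
# the logits at position t are scored against the token at position t+1.
import torch
import torch.nn.functional as F

def autoregressive_lm_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    logits = model(token_ids)                    # (B, T, V), hypothetical model call
    shift_logits = logits[:, :-1, :]             # predictions for positions 1..T-1
    shift_labels = token_ids[:, 1:]              # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )  # negative log-likelihood averaged over predicted tokens
```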

In Masked Language Modeling, some words are masked in a sequence, and the model is trained to predict the masked words based on the surrounding context. Sometimes people refer to this approach as denoising autoencoding, too. If we denote the masked/corrupted samples in the sequence \(x\) as \(\tilde{x}\), then the training objective of this approach can be written as:

\[L_{MLM}(x) = \sum_{\tilde{x} \in M} \log p(\tilde{x} \mid \text{context}(\tilde{x}))\]
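As a minimal sketch of this objective (PyTorch, with a hypothetical `model`, an illustrative `mask_token_id`, and the commonly used 15% masking rate), the loss is computed only at the corrupted positions:

```python
# A minimal sketch of the masked language modeling objective: a fraction of
# tokens is replaced by a mask id, and the loss is computed only at those positions.
import torch
import torch.nn.functional as F

def mlm_loss(model, token_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15):
    labels = token_ids.clone()
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_prob   # choose positions to corrupt
    corrupted = token_ids.masked_fill(mask, mask_token_id)             # corrupt the input
    labels[~mask] = -100                                                # ignore unmasked positions

    logits = model(corrupted)                                           # (B, T, V), hypothetical model call
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```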

And more recently, Mixture of Experts (MoE) [130], [131] has become very popular in the LLM space too. MoEs enable models to be pre-trained with much less compute, which means one can dramatically scale up the model or dataset size with the same compute budget as a dense model. MoE consists of two main elements: Sparse MoE layers, which are used instead of dense feed-forward network (FFN) layers and contain a certain number of “experts” (e.g., 8), where each expert is a neural network (in practice the experts are FFNs, but they can also be more complex networks); and a gate network or router, which determines which tokens are sent to which expert. It is worth noting that one can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs: the router is composed of learned parameters and is pre-trained at the same time as the rest of the network. Fig 29 provides an illustration of a Switch Transformer encoder block, which is used in MoE models.
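The sketch below (PyTorch, with illustrative sizes and a top-1 router, as in the Switch Transformer) shows the basic routing mechanic: the gate scores every expert, and each token is processed only by its highest-scoring expert, weighted by the gate probability. It is a simplified illustration and omits load-balancing losses and capacity limits used in real systems.

```python
# A minimal sketch of a sparse MoE layer with a learned top-1 router.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, num_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)          # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)               # (num_tokens, num_experts)
        top_prob, top_idx = probs.max(dim=-1)                 # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            chosen = top_idx == e                              # tokens routed to expert e
            if chosen.any():
                out[chosen] = top_prob[chosen, None] * expert(x[chosen])
        return out
```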

F. Fine-tuning and Instruction Tuning

Early language models such as BERT trained using self-supervision as explained in section III-E were not able to perform specific tasks. In order for the foundation model to be useful it needed to be fine-tuned to a specific task with labeled data (so-called supervised fine-tuning or SFT for short). For example, in the original BERT paper [24], the model was fine-tuned to 11 different tasks. While more recent LLMs no longer require fine-tuning to be used, they can still benefit from task or data-specific fine-tuning. For example, OpenAI reports that the much smaller GPT-3.5 Turbo model can outperform GPT-4 when fine-tuned with task-specific data 2.

2 https://platform.openai.com/docs/guides/fine-tuning

Fig. 28: Various positional encodings are employed in LLMs: (a) Absolute Positional Embeddings [129], (b) Relative Positional Embeddings, (c) Rotary Positional Embedding [127], (d) Relative Positional Bias [128].

Fine-tuning does not need to be performed for a single task though, and there are different approaches to multi-task fine-tuning (see e.g. Mahabi et al. [132]). Fine-tuning to one or more tasks is known to improve results and reduce the complexity of prompt engineering, and it can serve as an alternative to retrieval augmented generation. Furthermore, there are other reasons why it might be advisable to fine-tune. For example, one might want to fine-tune to expose the model to new or proprietary data that it has not been exposed to during pre-training.

An important reason to fine-tune LLMs is to align the responses to the expectations humans will have when providing instructions through prompts. This is the so-called instruction tuning [133]. We dive into the details of how to design and engineer prompts in section IV-B, but in the context of instruction tuning, it is important to understand that the instruction is a prompt that specifies the task that the LLM should accomplish. Instruction tuning datasets such as Natural Instructions [134] include not only the task definition but other components such as positive/negative examples or things to avoid.

The specific approach and instruction datasets used to instruction-tune an LLM vary, but, generally speaking, instruction-tuned models outperform the original foundation models they are based on. For example, InstructGPT [59] outperforms GPT-3 on most benchmarks. The same is true for Alpaca [62] when compared to LLaMA.

Self-Instruct [135], proposed by Wang et al., is also a popular approach along this line, in which they introduced a framework for improving the instruction-following capabilities of pre-trained language models by bootstrapping their own generations. Their pipeline generates instructions, input, and output samples from a language model, then filters invalid or similar ones before using them to fine-tune the original model.

G. Alignment

AI Alignment is the process of steering AI systems towards human goals, preferences, and principles. LLMs, pre-trained for word prediction, often exhibit unintended behaviors. For example, they might generate content that is toxic, harmful, misleading, or biased.

Instruction tuning, discussed above, gets LLMs a step closer to being aligned. However, in many cases, it is important to include further steps to improve the alignment of the model and avoid unintended behaviors 3. We review the most popular approaches to alignment in this subsection.

Fig. 29: Illustration of a Switch Transformer encoder block. They replaced the dense feed forward network (FFN) layer present in the Transformer with a sparse Switch FFN layer (light blue). Courtesy of [131].


RLHF (reinforcement learning from human feedback) and RLAIF (reinforcement learning from AI feedback) are two popular approaches. RLHF uses a reward model to learn alignment from human feedback. This reward model, after being tuned, is able to rate different outputs and score them according to the alignment preferences given by humans. The reward model gives feedback to the original LLM and this feedback is used to tune the LLM further [137]. Reinforcement learning from AI feedback, on the other hand, directly connects a pretrained and well-aligned model to the LLM and helps it to learn from larger and more aligned models [138].

In another recent work (known as DPO) [139], Rafailov et al. discussed that RLHF is a complex and often unstable procedure, and tried to address this with a new approach. They leveraged a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which they called Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning. They observed that fine-tuning with DPO exceeds RLHF’s ability to control sentiment of generations and improves response quality in summarization. Fig 30 shows the high-level comparison between DPO and RLHF.
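As a minimal sketch of the classification objective described above (PyTorch, assuming the summed log-probabilities of the preferred response y_w and the dispreferred response y_l under both the trained policy and the frozen reference model have already been computed), the DPO loss can be written as:

```python
# A minimal sketch of the DPO objective on a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """All inputs are tensors of shape (batch,) holding sequence log-probabilities."""
    chosen_reward = beta * (policy_logp_w - ref_logp_w)     # implicit reward of the preferred answer
    rejected_reward = beta * (policy_logp_l - ref_logp_l)   # implicit reward of the dispreferred answer
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```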

Fig. 30: DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, without an explicit reward function or RL. Courtesy of [139].

Even more recently Ethayarajh et al. proposed a new alignment approach called Kahneman-Tversky Optimization (KTO) [136]. Unlike existing state-of-the-art approaches, KTO does not require paired preference data (x, y_w, y_l); it only needs (x, y) and knowledge of whether y is desirable or undesirable. KTO-aligned models are shown to be as good as or better than DPO-aligned models at scales from 1B to 30B, despite not using paired preferences. KTO is also far easier to use in the real world than preference optimization methods, as the kind of data it needs is far more abundant. As an example, every retail company has a lot of customer interaction data and whether that interaction was successful (e.g., purchase made) or unsuccessful (e.g., no purchase made). However, they have little to no counterfactual data (i.e., what would have made an unsuccessful customer interaction y_l into a successful one y_w). Fig 31 shows a high-level comparison between KTO and the other alignment approaches discussed above.

3 According to very recent research by Ethayarajh et al. [136], further alignment besides SFT mainly improves models of at least 7B parameters. For smaller models, SFT is sufficient.

Fig. 31: LLM alignment involves supervised fine-tuning followed by optimizing a human-centered loss (HALO). However, the paired preferences that existing approaches need are hard to obtain. In contrast, KTO uses a far more abundant kind of data, making it much easier to use in the real world. Courtesy of [136].

H. Decoding Strategies

Decoding refers to the process of text generation using pre-trained LLMs. Given an input prompt, the tokenizer translates each token in the input text into a corresponding token ID. Then, the language model uses these token IDs as input and predicts the next most likely token (or a sequence of tokens). Finally, the model generates logits, which are converted to probabilities using a softmax function. Different decoding strategies have been proposed. Some of the most popular ones are greedy search and beam search, as well as different sampling techniques such as top-k and top-p (nucleus sampling).

1) Greedy Search: Greedy search takes the most probable token at each step as the next token in the sequence, discarding all other potential options. As you can imagine, this is a simple approach and can lose a lot of temporal consistency and coherency. It only considers the most probable token at each step, without considering the overall effect on the sequence. This property makes it fast, but it also means that it can miss out on better sequences that might have appeared with slightly less probable next tokens.

2) Beam Search: Unlike greedy search that only considers the next most probable token, beam search takes into account the N most likely tokens, where N denotes the number of beams. This procedure is repeated until a predefined maximum sequence length is reached or an end-of-sequence token appears. At this point, the sequence of tokens (AKA “beam”) with the highest overall score is chosen as the output. For example, for a beam size of 2 and maximum length of 5, beam search needs to keep track of 2^5 = 32 possible sequences. So it is more computationally intensive than greedy search.

3) Top-k Sampling: Top-k sampling is a technique that uses the probability distribution generated by the language model to select a token randomly from the k most likely options.

4) Top-p Sampling: Top-p sampling, also known as nucleus sampling, takes a slightly different approach from top-k sampling. Instead of selecting the top k most probable tokens, nucleus sampling chooses a cutoff value p such that the sum of the probabilities of the selected tokens exceeds p. This forms a “nucleus” of tokens from which to randomly choose the next token. In other words, in top-p sampling the language model examines the most probable tokens in descending order and keeps adding them to the list until the sum of probabilities surpasses the threshold p. As you can imagine, this can be better especially in scenarios where the top-k tokens do not have a large probability mass. Unlike top-k sampling, the number of tokens included in nucleus sampling is not fixed. This variability often results in a more diverse and creative output, making nucleus sampling popular for text generation related tasks.

Suppose we have 6 tokens (A, B, C, D, E, F) and k=2, and P(A)= 30%, and P(B)= 20%, P(C)= P(D)= P(E)= P(F)= 12.5%. In top-k sampling, tokens C, D, E, F are disregarded, and the model outputs A 60% of the time, and B, 40% of the time. This approach ensures that we prioritize the most probable tokens while introducing an element of randomness in the selection process.

The randomness is usually introduced via the concept of temperature. The temperature T is a positive parameter (typically between 0 and 1) that rescales the probabilities produced by the softmax function. In practice, it simply consists of dividing the input logits by the temperature value before applying the softmax: \(p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\).

A low temperature sharpens the probability distribution, prioritizing the tokens with the highest probabilities, while a high temperature flattens the distribution and introduces more randomness (this is commonly used in text generation to control the level of “creativity” in the generated output). Top-k is a creative way of sampling, and can be used along with beam search. The sequence chosen by top-k sampling may not be the sequence with the highest probability in beam search. But it is important to remember that the highest scores do not always lead to more realistic or meaningful sequences.
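The sketch below (Python/NumPy, with illustrative default parameter values) ties these pieces together: temperature scaling of the logits, followed by optional top-k and top-p filtering, followed by sampling from the renormalized distribution.

```python
# A minimal sketch of temperature, top-k and top-p (nucleus) sampling applied
# to one vector of next-token logits.
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.7,
                      top_k: int = 0, top_p: float = 1.0) -> int:
    logits = logits / max(temperature, 1e-6)            # temperature scaling of the logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                  # softmax

    order = np.argsort(probs)[::-1]                       # token ids sorted by probability
    if top_k > 0:
        order = order[:top_k]                             # keep only the k most likely tokens
    if top_p < 1.0:
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1   # smallest nucleus whose mass exceeds p
        order = order[:cutoff]

    kept = probs[order] / probs[order].sum()              # renormalize the kept tokens
    return int(np.random.choice(order, p=kept))

# Example from the text: with k=2 and P(A)=0.3, P(B)=0.2, only A and B remain,
# sampled with probabilities 0.6 and 0.4 respectively.
```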

I. Cost-Effective Training/Inference/Adaptation/Compression

In this part, we review some of the popular approaches used for more cost-friendly (and compute-friendly) training and usage of LLMs.

1) Optimized Training: There are many frameworks developed for optimized training of LLMs; here we introduce some of the prominent ones.

ZeRO: In [140], Rajbhandari et al. developed a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed of LLMs while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing one to scale the model size proportional to the number of devices with sustained high efficiency.

RWKV: In [141], Peng et al. proposed a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Their approach leverages a linear attention mechanism and allows them to formulate the model as either a Transformer or an RNN, which parallelizes computations during training and maintains constant computational and memory complexity during inference, leading to the first non-transformer architecture to be scaled to tens of billions of parameters. The RWKV architecture is shown in Fig 32, and the time complexity comparison of RWKV with different Transformers is provided in Fig 33.

Fig. 32: RWKV architecture. Courtesy of [141].


Fig. 33: Time Complexity comparison of RWKV with different Transformers. Here T denotes the sequence length, d the feature dimension, and c is MEGA’s chunk size of quadratic attention. Courtesy of [141].


Fig. 34: An illustration of LoRA reparametrization. Only A and B are trained during this process. Courtesy of [142].

2) Low-Rank Adaptation (LoRA): Low-Rank Adaptation is a popular and lightweight training technique that significantly reduces the number of trainable parameters. It is based on a crucial insight: the difference between the fine-tuned weights for a specialized task and the initial pre-trained weights often exhibits “low intrinsic rank”, meaning that it can be approximated well by a low-rank matrix [142]. Training with LoRA is much faster and more memory-efficient, and produces smaller model weights (a few hundred MBs) that are easier to store and share. One property of low-rank matrices is that they can be represented as the product of two smaller matrices. This realization leads to the hypothesis that this delta between fine-tuned weights and initial pre-trained weights can be represented as the matrix product of two much smaller matrices. By focusing on updating these two smaller matrices rather than the entire original weight matrix, computational efficiency can be substantially improved.

Specifically, for a pre-trained weight matrix \(W_0 \in \mathbb{R}^{d \times k}\), LoRA constrains its update by representing the latter with a low-rank decomposition \(W_0 + \Delta W = W_0 + BA\), where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\), and the rank \(r \ll \min(d, k)\). During training, \(W_0\) is frozen and does not receive gradient updates, while A and B contain trainable parameters. It is worth mentioning that both \(W_0\) and \(\Delta W = BA\) are multiplied with the same input, and their respective output vectors are summed coordinate-wise. For \(h = W_0 x\), the modified forward pass yields: \(h = W_0 x + \Delta W x = W_0 x + BAx\). Usually a random Gaussian initialization is used for A, and zero initialization for B, so \(\Delta W = BA\) is zero at the beginning of training. They then scale \(\Delta W x\) by \(\alpha / r\), where \(\alpha\) is a constant in r. This reparametrization is illustrated in Figure 34.
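The following is a minimal sketch of a LoRA-adapted linear layer implementing the forward pass above (PyTorch; the dimensions, rank r, and scaling α are illustrative choices, not values prescribed by the original paper).

```python
# A minimal sketch of h = W0 x + (alpha / r) * B A x, with W0 frozen and only A, B trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)               # W0 is frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # Gaussian initialization
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init, so delta_W = BA starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T
```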

It is worth mentioning that LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. In the Transformer architecture, there are four weight matrices in the self-attention module (Wq, Wk, Wv, Wo), and two in the MLP module. Most of the time, LoRA is focused on adapting the attention weights only for downstream tasks, and freezes the MLP modules, so they are not trained in downstream tasks, both for simplicity and parameter-efficiency.

3) Knowledge Distillation: Knowledge distillation is the process of learning from a larger model [143]. Earlier releases of best-performing models have proven that this approach is very useful, even when it is used in an API distillation setting. It is also referred to as an approach to distill the knowledge of not just a single model but in fact multiple models into a smaller one. Creating smaller models by this approach yields smaller model sizes that can be used even on edge devices. Fig 35 illustrates a general setup of this training scheme.

Fig. 35: A generic knowledge distillation framework with student and teacher (Courtesy of [144]).

Knowledge can be transferred by different forms of learning: response distillation, feature distillation, and API distillation. Response distillation is concerned only with the outputs of the teacher model and tries to teach the student model how to perform exactly, or at least similarly, to the teacher (in the sense of prediction). Feature distillation uses not only the last layer but also intermediate layers to create a better inner representation for the student model. This helps the smaller model to have a similar representation as the teacher model.

API distillation is the process of using an API (typically from an LLM provider such as OpenAI) to train smaller models. In the case of LLMs, it is used to train the model from the direct output of the larger model, which makes it very similar to response distillation. Many concerns are raised by this type of distillation, because in cases where the model itself is not openly available, a (usually) paid API is exposed to end users. Moreover, while users pay for each call, how the predictions may be used is limited; for example, OpenAI prohibits usage of its API to create LLMs that will later be used to compete with it. The main value in such a case is the training data.
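As a minimal sketch of response distillation (PyTorch, with an illustrative temperature and loss weighting), the student is trained to match the teacher's softened output distribution, optionally mixed with the usual hard-label loss:

```python
# A minimal sketch of response (logit) distillation from a teacher to a student.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                     # match the teacher's soft targets
    hard = F.cross_entropy(student_logits, labels)   # standard supervised loss on ground-truth labels
    return alpha * soft + (1 - alpha) * hard
```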

4) Quantization: Deep learning, at its core, is a set of mathematical functions applied to matrices, with a specific precision for model weights. Reducing the precision of the weights can be used to reduce the size of the model and also make it faster. As an example, Float-32 operations are slower than Int-8 operations. This process, called quantization, can be applied in different phases. The main approaches to model quantization can be categorized as post-training quantization and quantization-aware training. Post-training quantization quantizes already-trained models using two well-known methods: dynamic and static. Dynamic post-training quantization computes the range of quantization at runtime and is slower compared to static. Quantization-aware training adds quantization criteria into training, so a quantized model is trained and optimized during the training process. This approach ensures that the end model will have good performance and also does not need to be quantized after training.
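To illustrate the basic idea, the sketch below (Python/NumPy) performs symmetric int8 quantization of a weight matrix with a single scale factor and then dequantizes it for comparison; it is a simplified illustration of the general principle, not any specific toolkit's implementation.

```python
# A minimal sketch of symmetric int8 post-training quantization of a weight matrix.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                       # one scale factor for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                   # approximate reconstruction

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```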

4 HOW LLMS ARE USED AND AUGMENTED

Once the LLMs are trained, we can use them to generate desired outputs for a variety of tasks. LLMs can be used directly through basic prompting. However, in order to exploit their full potential or to address some of their shortcomings, we need to augment the models through some external means. In this section we first provide a brief overview of the main shortcomings of LLMs, with a deeper look at the issue of hallucination. We then describe how prompting and some augmentation approaches can not only address those limitations but also be used to augment the capabilities of LLMs, going as far as turning an LLM into a full-blown AI agent with the ability to interface with the external world.

A. LLM limitations

It is important to remember that LLMs are trained to predict a token. While fine-tuning and alignment improve their performance and add different dimensions to their abilities, there are still some important limitations that come up, particularly if they are used naively. Some of them include the following:

They don’t have state/memory. LLMs on their own cannot remember even what was sent to them in the previous prompt. That is an important limitation for many of the use cases that require some form of state.

They are stochastic/probabilistic. If you send the same prompt to an LLM several times, you are likely to get different responses. While there are parameters, and in particular the temperature, to limit the variability in the response, this is an inherent property of their training that can create issues.

They have stale information and, on their own, don’t have access to external data. An LLM on its own does not even know about the current time or day and does not have access to any information that was not present in its training set.

They are generally very large. This means that many costly GPU machines are needed for training and serving. In some cases, the largest models have poor SLAs, particularly in terms of latency.

They hallucinate. LLMs do not have a notion of ”truth” and they have usually been trained on a mix of good and bad content. They can produce very plausible but untruthful answers.

While the previous limitations can all become important for some applications, it is worth diving a bit deeper into the last one, hallucinations, since it has gathered a lot of interest over the past few months and it has also sparked many of the prompt approaches and LLM augmentation methods we will later describe.

Hallucination: In the realm of Large Language Models (LLMs), the phenomenon of ”hallucinations” has garnered significant attention. Defined in the literature, notably in the ”Survey of Hallucination in Natural Language Generation” paper [145], hallucination in an LLM is characterized as ”the generation of content that is nonsensical or unfaithful to the provided source.” This terminology, although rooted in psychological parlance, has been appropriated within the field of artificial intelligence.

Hallucinations in LLMs can be broadly categorized into two types:

1) Intrinsic Hallucinations: These directly conflict with the source material, introducing factual inaccuracies or logical inconsistencies.
2) Extrinsic Hallucinations: These, while not contradicting, are unverifiable against the source, encompassing speculative or unconfirmable elements.

The definition of ’source’ in LLM contexts varies with the task. In dialogue-based tasks, it refers to ’world knowledge’, whereas in text summarization, it pertains to the input text itself. This distinction plays a crucial role in evaluating and interpreting hallucinations. The impact of hallucinations is also highly context-dependent. For instance, in creative endeavors like poem writing, hallucinations might be deemed acceptable or even beneficial.

LLMs, trained on diverse datasets including the internet, books, and Wikipedia, generate text based on probabilistic models without an inherent understanding of truth or falsity. Recent advancements like instruct tuning and Reinforcement Learning from Human Feedback (RLHF) have attempted to steer LLMs towards more factual outputs, but the fundamental probabilistic nature and its inherent limitations remain. A recent study, “Sources of Hallucination by Large Language Models on Inference Tasks” [146], highlights two key aspects contributing to hallucinations in LLMs: the veracity prior and the relative frequency heuristic, underscoring the complexities inherent in LLM training and output generation.

Effective automated measurement of hallucinations in LLMs requires a combination of statistical and model-based metrics.

Statistical Metrics:

  • Metrics like ROUGE [147] and BLEU [148] are common for assessing text similarity, focusing on intrinsic hallucinations.

  • Advanced metrics such as PARENT [149], PARENT-T [150], and Knowledge F1 [151] are utilized when structured knowledge sources are available. These metrics, while effective, have limitations in capturing syntactic and semantic nuances.

Model-Based Metrics:

  • IE-Based Metrics: Utilize Information Extraction models to simplify knowledge into relational tuples, then compare these with the source.
  • QA-Based Metrics: Assess the overlap between generated content and the source through a question-answering framework (see [152]).
  • NLI-Based Metrics: Use Natural Language Inference datasets to evaluate the truthfulness of a generated hypothesis based on a given premise (see [153]).
  • Faithfulness Classification Metrics: Offer a refined assessment by creating task-specific datasets for a nuanced evaluation (see [154]).

Fig. 36: How LLMs Are Used and Augmented.

Despite advances in automated metrics, human judgment remains a vital piece. It typically involves two methodologies:
1) Scoring: Human evaluators rate the level of hallucination within a predefined scale.
2) Comparative Analysis: Evaluators compare generated content against baseline or ground-truth references, adding an essential layer of subjective assessment.

FactScore [155] is a recent example of a metric that can be used both for human and model-based evaluation. The metric breaks an LLM generation into “atomic facts”. The final score is computed as the sum of the accuracy of each atomic fact, giving each of them equal weight. Accuracy is a binary number that simply states whether the atomic fact is supported by the source. The authors implement different automation strategies that use LLMs to estimate this metric.
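As a minimal sketch of the equal-weight aggregation described above (assuming the generation has already been split into atomic facts and each fact carries a binary support judgment produced by a human or a model):

```python
# A minimal sketch of a FactScore-style computation over atomic facts.
def fact_score(supported_flags: list[bool]) -> float:
    """Equal-weight average of per-fact accuracy; 1.0 means fully supported by the source."""
    if not supported_flags:
        return 0.0
    return sum(supported_flags) / len(supported_flags)

# fact_score([True, True, False, True])  ->  0.75
```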

Finally, mitigating hallucinations in LLMs is a multifaceted challenge, requiring tailored strategies to suit various applications. Those include:

  • Product Design and User Interaction Strategies such as use case design, structuring the input/output, or providing mechanisms for user feedback.
  • Data Management and Continuous Improvement. Maintaining and analyzing a tracking set of hallucina- tions is essential for ongoing model improvement.
  • Prompt Engineering and Metaprompt Design. Many of the advanced prompt techniques described in IV-B, such as Retrieval Augmented Generation, directly address hallucination risks.

  • Model Selection and Configuration for Hallucination Mitigation. For example, larger models with lower temperature settings usually perform better. Also, techniques such as RLHF or domain-specific fine-tuning can mitigate hallucination risks.

B. Using LLMs: Prompt Design and Engineering

A prompt in generative AI models is the textual input provided by users to guide the model’s output. This could range from simple questions to detailed descriptions or specific tasks. Prompts generally consist of instructions, questions, input data, and examples. In practice, to elicit a desired response from an AI model, a prompt must contain either instructions or questions, with other elements being optional. Advanced prompts involve more complex structures, such as ”chain of thought” prompting, where the model is guided to follow a logical reasoning process to arrive at an answer.

Prompt engineering is a rapidly evolving discipline that shapes the interactions and outputs of LLMs and other gen- erative AI models. The essence of prompt engineering lies in crafting the optimal prompt to achieve a specific goal with a generative model. This process is not only about instructing the model but also involves some understanding of the model’s capabilities and limitations, and the context within which it operates.

Prompt engineering transcends the mere construction of prompts; it requires a blend of domain knowledge, understanding of the AI model, and a methodical approach to tailor prompts for different contexts. This might involve creating templates that can be programmatically modified based on a given dataset or context. For example, generating personalized responses based on user data might use a template that is dynamically filled with relevant user information.

Furthermore, prompt engineering is an iterative and exploratory process, akin to traditional machine learning practices such as model evaluation or hyperparameter tuning. The rapid growth of this field suggests its potential to revolutionize certain aspects of machine learning, moving beyond traditional methods like feature or architecture engineering. On the other hand, traditional engineering practices such as version control and regression testing need to be adapted to this new paradigm just like they were adapted to other machine learning approaches [156].

In the following paragraphs we detail some of the most interesting and popular prompt engineering approaches.

1) Chain of Thought (CoT): The Chain of Thought (CoT) technique, initially described in the paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”[34] by Google researchers, represents a pivotal advancement in prompt engineering for Large Language Models (LLMs). This approach hinges on the understanding that LLMs, while proficient in token prediction, are not inherently designed for explicit reasoning. CoT addresses this by guiding the model through essential reasoning steps.

CoT is based on making the implicit reasoning process of LLMs explicit. By outlining the steps required for reasoning, the model is directed closer to a logical and reasoned output, especially in scenarios demanding more than simple information retrieval or pattern recognition.

CoT prompting manifests in two primary forms:

1) Zero-Shot CoT: This form involves instructing the LLM to “think step by step”, prompting it to deconstruct the problem and articulate each stage of reasoning.

2) Manual CoT: A more complex variant, it requires providing step-by-step reasoning examples as templates for the model. While yielding more effective results, it poses challenges in scalability and maintenance.

Manual CoT is more effective than zero-shot. However, the effectiveness of this example-based CoT depends on the choice of diverse examples, and constructing prompts with such examples of step by step reasoning by hand is hard and error prone. That is where automatic CoT [157] comes into play.

2) Tree of Thought (ToT): The Tree of Thought (ToT) [158] prompting technique is inspired by the concept of considering various alternative solutions or thought processes before converging on the most plausible one. ToT is based on the idea of branching out into multiple ”thought trees” where each branch represents a different line of reasoning. This method allows the LLM to explore various possibilities and hypotheses, much like human cognitive processes where multiple scenarios are considered before determining the most likely one.

A critical aspect of ToT is the evaluation of these reasoning paths. As the LLM generates different branches of thought, each is assessed for its validity and relevance to the query. This process involves real-time analysis and comparison of the branches, leading to a selection of the most coherent and logical outcome.

ToT is particularly useful in complex problem-solving scenarios where a single line of reasoning might not suffice. It allows LLMs to mimic a more human-like problem-solving approach, considering a range of possibilities before arriving at a conclusion. This technique enhances the model’s ability to handle ambiguity, complexity, and nuanced tasks, making it a valuable tool in advanced AI applications.

3) Self-Consistency: Self-Consistency [159] utilizes an ensemble-based method, where the LLM is prompted to generate multiple responses to the same query. The consistency among these responses serves as an indicator of their accuracy and reliability.

The Self-Consistency approach is grounded in the principle that if an LLM generates multiple, similar responses to the same prompt, it is more likely that the response is accurate. This method involves asking the LLM to tackle a query multiple times, each time analyzing the response for consistency. This technique is especially useful in scenarios where factual accuracy and precision are paramount.

The consistency of responses can be measured using various methods. One common approach is to analyze the overlap in the content of the responses. Other methods may include comparing the semantic similarity of responses or employing more sophisticated techniques like BERT-scores or n-gram overlaps. These measures help in quantifying the level of agreement among the responses generated by the LLM.

Self-Consistency has significant applications in fields where the veracity of information is critical. It is particularly relevant in scenarios like fact-checking, where ensuring the accuracy of information provided by AI models is essential. By employing this technique, prompt engineers can enhance the trustworthiness of LLMs, making them more reliable for tasks that require high levels of factual accuracy.
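A minimal sketch of the simplest form of this technique follows (Python; `generate` is a hypothetical stand-in for any LLM call that returns a final answer string): several reasoning paths are sampled for the same prompt, and the most frequent final answer is selected.

```python
# A minimal sketch of self-consistency via majority voting over sampled answers.
from collections import Counter

def self_consistent_answer(generate, prompt: str, n_samples: int = 5) -> str:
    answers = [generate(prompt, temperature=0.7) for _ in range(n_samples)]  # sample diverse paths
    answer, votes = Counter(answers).most_common(1)[0]                        # majority vote
    return answer  # votes / n_samples can also serve as a rough confidence signal
```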

4) Reflection: Reflection [160] involves prompting LLMs to assess and potentially revise their own outputs based on reasoning about the correctness and coherence of their responses. The concept of Reflection centers on the ability of LLMs to engage in a form of self-evaluation. After generating an initial response, the model is prompted to reflect on its own output, considering factors like factual accuracy, logical consistency, and relevance. This introspective process can lead to the generation of revised or improved responses.

A key aspect of Reflection is the LLM’s capacity for self-editing. By evaluating its initial response, the model can identify potential errors or areas of improvement. This iterative process of generation, reflection, and revision enables the LLM to refine its output, enhancing the overall quality and reliability of its responses.

5) Expert Prompting: Expert Prompting [161] enhances the capabilities of Large Language Models (LLMs) by simulating the responses of experts in various fields. This method involves prompting the LLMs to assume the role of an expert and respond accordingly, providing high-quality, informed answers. A key strategy within Expert Prompting is the multi-expert approach. The LLM is prompted to consider responses from multiple expert perspectives, which are then synthesized to form a comprehensive and well-rounded answer. This technique not only enhances the depth of the response but also incorporates a range of viewpoints, reflecting a more holistic understanding of the subject matter.

6) Chains: Chains refer to the method of linking multiple components in a sequence to handle complex tasks with Large Language Models (LLMs). This approach involves creating a series of interconnected steps or processes, each contributing to the final outcome. The concept of Chains is based on the idea of constructing a workflow where different stages or components are sequentially arranged. Each component in a Chain performs a specific function, and the output of one serves as the input for the next. This end-to-end arrangement allows for more complex and nuanced processing, as each stage can be tailored to handle a specific aspect of the task. Chains can vary in complexity and structure, depending on the requirements. In “PromptChainer: Chaining Large Language Model Prompts through Visual Programming” [162], the authors not only describe the main challenges in designing chains, but also describe a visual tool to support those tasks.

7) Rails: Rails in advanced prompt engineering refer to a method of guiding and controlling the output of Large Language Models (LLMs) through predefined rules or templates. This approach is designed to ensure that the model’s responses adhere to certain standards or criteria, enhancing the relevance, safety, and accuracy of the output. The concept of Rails involves setting up a framework or a set of guidelines that the LLM must follow while generating responses. These guidelines are typically defined using a modeling language or templates known as Canonical Forms, which standardize the way natural language sentences are structured and delivered.

Rails can be designed for various purposes, depending on the specific needs of the application:

  • Topical Rails: Ensure that the LLM sticks to a particular topic or domain.

  • Fact-Checking Rails: Aimed at minimizing the generation of false or misleading information.

  • Jailbreaking Rails: Prevent the LLM from generating responses that attempt to bypass its own operational constraints or guidelines.

8) Automatic Prompt Engineering (APE): Automatic Prompt Engineering (APE) [163] focuses on automating the process of prompt creation for Large Language Models (LLMs). APE seeks to streamline and optimize the prompt design process, leveraging the capabilities of LLMs themselves to generate and evaluate prompts. APE involves using LLMs in a self-referential manner where the model is employed to generate, score, and refine prompts. This recursive use of LLMs enables the creation of high-quality prompts that are more likely to elicit the desired response or outcome.

The methodology of APE can be broken down into several key steps:

  • Prompt Generation: The LLM generates a range of potential prompts based on a given task or objective.

  • Prompt Scoring: Each generated prompt is then evaluated for its effectiveness, often using criteria like clarity, specificity, and likelihood of eliciting the desired response.

  • Refinement and Iteration: Based on these evaluations, prompts can be refined and iterated upon, further enhancing their quality and effectiveness.

C. Augmenting LLMs through external knowledge - RAG

One of the main limitations of pre-trained LLMs is their lack of up-to-date knowledge or access to private or use-case-specific information. This is where retrieval augmented generation (RAG) comes into the picture [164]. RAG, illustrated in figure 37, involves extracting a query from the input prompt and using that query to retrieve relevant information from an external knowledge source (e.g. a search engine or a knowledge graph, see figure 38). The relevant information is then added to the original prompt and fed to the LLM in order for the model to generate the final response. A RAG system includes three important components: Retrieval, Generation, Augmentation [165].
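The sketch below (Python/NumPy) illustrates this flow at its simplest: embed the query, retrieve the most similar documents from an external store, and prepend them to the prompt before generation. The `embed` and `llm_generate` callables are hypothetical stand-ins for an embedding model and an LLM call; real RAG systems typically add chunking, vector indexes, and re-ranking on top of this.

```python
# A minimal sketch of retrieval augmented generation (RAG).
import numpy as np

def retrieve(query: str, docs: list[str], embed, k: int = 3) -> list[str]:
    q = embed(query)
    doc_vecs = np.stack([embed(d) for d in docs])
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]     # top-k documents by cosine similarity

def rag_answer(question: str, docs: list[str], embed, llm_generate) -> str:
    context = "\n".join(retrieve(question, docs, embed))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)                                # augmented prompt fed to the LLM
```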

a) RAG-aware prompting techniques: Because of the importance of RAG for building advanced LLM systems, several RAG-aware prompting techniques have been developed recently. One such technique is Forward-looking Active Retrieval Augmented Generation (FLARE).

Forward-looking Active Retrieval Augmented Generation (FLARE) [168] enhances the capabilities of Large Language Models (LLMs) by iteratively combining prediction and information retrieval. FLARE represents an evolution in the use of retrieval-augmented generation, aimed at improving the accuracy and relevance of LLM responses.

FLARE involves an iterative process where the LLM actively predicts upcoming content and uses these predictions as queries to retrieve relevant information. This method contrasts with traditional retrieval-augmented models that typically retrieve information once and then proceed with generation. In FLARE, this process is dynamic and ongoing throughout the generation phase: each sentence or segment generated by the LLM is evaluated for confidence. If the confidence level is below a certain threshold, the model uses the generated content as a query to retrieve relevant information, which is then used to regenerate or refine the sentence. This iterative process ensures that each part of the response is informed by the most relevant and current information available.

Fig. 37: An example of synthesizing RAG with LLMs for question answering application [166].

a) Tool-aware prompting techniques: Similarly to what was described with RAG, several tool-aware prompting ap- proaches have been developed to make usage of tools more scalable. A popular technique is the so called Automatic Multi- step Reasoning and Tool-use (ART).

Automatic Multi-step Reasoning and Tool-use (ART) [170] is a prompt engineering technique that combines automated chain of thought prompting with the use of external tools. ART represents a convergence of multiple prompt engineering strategies, enhancing the ability of Large Language Models (LLMs) to handle complex tasks that require both reasoning and interaction with external data sources or tools.

ART involves a systematic approach where, given a task and input, the system first identifies similar tasks from a task library. These tasks are then used as examples in the prompt, guiding the LLM on how to approach and execute the current task. This method is particularly effective when tasks require a combination of internal reasoning and external data processing or retrieval.

E. LLM Agents

The idea of AI agents has been well-explored in the history of AI. An agent is typically an autonomous entity that can perceive the environment using its sensors, make a judgment based on the state it currently is, and accordingly act based on the actions that are available to it.

In the context of LLMs, an agent refers to a system based on a specialized instantiation of an (augmented) LLM that is capable of performing specific tasks autonomously. These agents are designed to interact with users and the environment to make decisions based on the input and the intended goal of the interaction. Agents are based on LLMs equipped with the ability to access and use tools, and to make decisions based on the given input. They are designed to handle tasks that require a degree of autonomy and decision-making, typically beyond simple response generation.

Fig. 38: This is one example of synthesizing the KG as a retriever with LLMs [167].

For more details on the RAG framework and its relevant works, we refer the readers to this survey of retrieval augmented generation [165].

D. Using External Tools

Retrieving information from an external knowledge source as described above is only one of the potential ways to augment an LLM. More generally, an LLM can access any number of external tools (e.g. an API to a service) to augment its functionality. In that regard, RAG can be seen as a specific instance of the broader category of the so-called ”tools”.

Tools in this context are external functions or services that LLMs can utilize. These tools extend the range of tasks an LLM can perform, from basic information retrieval to complex interactions with external databases or APIs.

In the paper ”Toolformer: Language Models Can Teach Themselves to Use Tools” [169], the authors go beyond simple tool usage by training an LLM to decide what tool to use when, and even what parameters the API needs. Tools include two different search engines, or a calculator. In the following examples, the LLM decides to call an external Q&A tool, a calculator, and a Wikipedia search engine. More recently, researchers at Berkeley have trained a new LLM called Gorilla [67] that beats GPT-4 at the use of APIs, a specific but quite general tool.

The functionalities of a generic LLM-based agent include:

  • Tool Access and Utilization: Agents have the capability to access external tools and services, and to utilize these resources effectively to accomplish tasks.
  • Decision Making: They can make decisions based on the input, context, and the tools available to them, often employing complex reasoning processes.

As an example, an LLM that has access to a function (or an API) such as a weather API can answer any question related to the weather of a specific place. In other words, it can use APIs to solve problems. Furthermore, if that LLM has access to an API that allows it to make purchases, a purchasing agent can be built that not only has the capability to read information from the external world, but can also act on it [171].

Fig. 40 shows another example of LLM-based agents for conversational information seeking [36], where an LLM is augmented with a set of plug-and-play modules, including a working memory that tracks the dialog state, a policy that makes an execution plan for the task and selects the next system action, an action executor that performs an action selected by the policy (consolidating evidence from external knowledge, or prompting the LLM to generate responses), and a utility that assesses the alignment of the LLM’s responses with user expectations or specific business requirements, and generates feedback to improve agent performance.

For more details on LLM-based AI agents see recent surveys [172], [173], [174].

a) Prompt engineering techniques for agents: Like RAG and Tools, prompt engineering techniques that specifically address the needs of LLM-based agents have been developed. Three such examples are Reasoning without Observation (ReWOO), Reason and Act (ReAct), and Dialog-Enabled Resolving Agents (DERA).

Reasoning without Observation (ReWOO) [175] aims to decouple reasoning from direct observations. ReWOO operates by enabling LLMs to formulate comprehensive reasoning plans or meta-plans without immediate reliance on external data or tools. This approach allows the agent to create a structured framework for reasoning that can be executed once the necessary data or observations are available. In ReWOO, the LLM initially develops a plan (a series of steps) that outlines how to approach and solve a given problem. This meta-planning phase is crucial as it sets the stage for the agent to process information once it becomes available. The execution phase then involves integrating actual data or observations into the pre-specified plan, leading to coherent and contextually relevant responses. ReWOO offers significant advantages in terms of token efficiency and robustness to tool failure. It enables LLMs to handle tasks where immediate access to external data is not available, relying instead on a well-structured reasoning framework. This method is particularly advantageous in scenarios where data retrieval is costly, slow, or uncertain, allowing the LLM-based agent to maintain a high level of performance and reliability.

Reason and Act (ReAct) [176] prompts LLMs to generate not only verbal reasoning but also actionable steps, thus enhancing the model’s dynamic problem-solving capabilities. ReAct is grounded in the principle of integrating reasoning with action. In this approach, the LLM is prompted to alternate between generating reasoning traces (explanations) and taking actions (steps or commands) in an interleaved manner. This approach allows the model to dynamically reason about a problem, and propose and take concrete actions simultaneously.

Dialog-Enabled Resolving Agents (DERA) [177] are specialized AI agents that can engage in dialogue, resolve queries, and make decisions based on interactive exchanges. DERA is developed based on the idea of utilizing multiple agents within a dialog context, each with specific roles and functions. These agents can include Researchers, who gather and analyze information, and Deciders, who make final judgments based on the information provided. This division of roles allows for a well-organized and efficient approach to problem-solving and decision-making. DERA is particularly advantageous in scenarios requiring complex decision-making and problem-solving, such as those in medical diagnostics or customer service. The collaborative and interactive nature of DERA agents allows them to handle intricate queries with a level of depth and nuance that single-agent systems might struggle with. Moreover, this approach aligns well with human decision-making processes, making AI reasoning more relatable and trustworthy.

5 POPULAR DATASETS FOR LLMS

Large language models exhibit promising accomplishments, but the main question that arises is how effectively they function and how their performance can be assessed in specific tasks or applications.

The evaluation of LLMs poses particular challenges due to the evolving landscape of their applications. The original intent behind developing LLMs was to boost the performance of NLP tasks such as translation, summarization, question-answering, and so on [178]. However, today these models are finding utility across diverse domains, including code generation and finance. Moreover, the evaluation of LLMs encompasses several critical considerations such as fairness and bias, fact-checking, and reasoning. In this section, we outline the commonly used benchmarks for assessing LLMs. These benchmarks are categorized based on the LLM capabilities they are used to train or evaluate.

A. Datasets for Basic Tasks: language modeling/understanding/generation

  • Natural Questions [179] is a QA dataset that consists of real anonymized, aggregated queries submitted to the Google search engine as questions. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present.

Fig. 39: HuggingGPT: An agent-based approach to use tools and planning [image courtesy of [171]]

  • HumanEval [182] is a dataset for the code generation task. This dataset consists of 164 hand-crafted programming challenges. Each challenge is accompanied by a function signature, docstring, code body, and multiple unit tests. The main intuition behind developing this dataset is to guarantee the exclusion of its contents from training datasets for code generation models.
  • APPS [183] is designed for the code generation task, focusing on the Python programming language. The APPS dataset contains a collection of 232,444 Python programs. Each program in the dataset has an average of 18 lines of Python code. Additionally, APPS offers access to a repository of 10,000 unique programming exercises, each with text-based problem descriptions. The final aspect to highlight is that it includes test cases.
  • WikiSQL [184] is crafted for the code generation task and has 87,726 carefully labeled pairs of SQL queries and corresponding natural language questions from Wikipedia tables. The SQL queries comprise three subsets: test (17,284 examples), development (9,145 examples), and training (61,297 examples).
  • TriviaQA [185] is designed for the QA task. This dataset comprises more than 650,000 question-answer-evidence triples. There are 95,000 question-answer pairs in this dataset, each authored by trivia enthusiasts and supported by an average of six independently sourced evidence documents. These documents are automatically acquired from Wikipedia or broader web search results. The dataset is categorized into two segments: one with authentic answers from Wikipedia and web domains, and a verified set containing the accurately answered questions along with their associated documents from both Wikipedia and the web.

Fig. 40: An LLM-based agent for conversational information seeking. Courtesy of [36].

  • MMLU [180] is intended to evaluate the knowledge gained in zero-shot and few-shot scenarios. That means that MMLU assesses both the general knowledge and problem-solving ability of a model. It covers 57 subjects in STEM, humanities, social sciences, and other areas. The benchmark varies in complexity, ranging from elementary to advanced professional. It is worth mentioning that the main contribution of this dataset is for multi-task language understanding, question answering, and arithmetic reasoning.
  • MBPP [181] stands for “Mostly Basic Python Problems” and provides a benchmark for evaluating the performance of models designed for code generation. The benchmark encompasses 974 short Python programs covering a wide range of topics, including fundamental programming concepts and standard library usage, and more. Each challenge comprises a task description, a code solution, and three automated test cases.

Fig. 41: Dataset applications.

  • RACE [186] is suited for the reading comprehension task. This dataset is based on English tests completed by Chinese students from middle school and high school, aged 12 to 18, and it contains roughly 28,000 texts and 100,000 questions rigorously prepared by human specialists, primarily English instructors. This dataset contains a wide range of subjects that were purposefully chosen to assess students’ comprehension and reasoning abilities. This dataset is available in three subgroups: RACE-M, RACE-H, and RACE. RACE-M refers to the middle school examinations, whereas RACE-H denotes the high school tests. Finally, RACE is the synthesis of RACE-M and RACE-H.
  • SQuAD [187] stands for “Stanford Question Answering Dataset” and is a crowdsourced reading comprehension dataset based on Wikipedia articles. It has approximately 100,000 question-answer pairs connected to more than 500 articles. The answers to these questions are typically text fragments or spans taken from the corresponding reading passages. The questions may be unanswerable in some cases. The dataset is divided into three sets: an 80% training set, a 10% development set, and a 10% hidden test set.

Fig. 42: Datasets licensed under different licenses.

  • BoolQ [188] is a yes/no question-answering dataset aimed at the reading comprehension task. BoolQ includes 15,942 examples. Each example is a triplet that includes a question, a relevant paragraph, and the solution. Although the main intuition behind this dataset is reading comprehension, it can be used for reasoning, natural language inference, and question-answering tasks.
  • MultiRC [189] is another dataset that fits the reading comprehension task. MultiRC contains brief paragraphs as well as multi-sentence questions that can be answered using the information in the paragraph. The paragraphs in this dataset come from a variety of sources, including news, fiction, historical texts, Wikipedia articles, discussions on society and law, elementary school science textbooks, and 9/11 reports. Each question has many response choices, with one or more of them being correct. Answering the questions requires reasoning across several sentences. The MultiRC dataset encompasses around 6,000 multi-sentence questions gathered from over 800 paragraphs. On average, each question offers about two valid answer alternatives out of a total of five.

B. Datasets for Emergent Abilities: ICL, reasoning (CoT), instruction following

This section centers on the benchmarks and datasets employed to evaluate the emergent abilities of LLMs.

  • GSM8K [190] is designed to evaluate a model’s ability for multi-step mathematical reasoning. GSM8K includes 8.5K linguistically diverse grade school math word problems written by humans. The dataset is split into a training set with 7.5K problems and a test set with 1K problems. These problems need 2 to 8 steps to solve, and solutions are mainly a series of elementary calculations using basic arithmetic operations.
  • MATH [191] assesses how well models can solve math problems. The MATH dataset has 12,500 problems from high school math competitions. Each problem has a step-by-step solution and a final answer enclosed in a box. The problems cover a wide range of topics and difficulty levels across seven subjects. Furthermore, the difficulty of each problem is rated on the AoPS scale from 1 to 5, where 1 denotes the easiest problems in a subject and 5 the most difficult. All problems and solutions are formatted using LaTeX and the Asymptote vector graphics language.
  • HellaSwag [192] is designed to assess commonsense reasoning in LLMs. This benchmark includes 70,000 multiple-choice questions. Each question is derived from one of two domains, ActivityNet or WikiHow, and presents four answer choices describing what might happen in the following situation. The correct answer is an actual statement describing the upcoming event, while the three wrong answers are crafted to confuse machines.
  • AI2 Reasoning Challenge (ARC) [193] is used for commonsense reasoning. This benchmark encompasses 7,787 science examination questions. These questions are in English, and most of them are in multiple-choice format. The questions are divided into two groups: a Challenge Set with 2,590 difficult questions and an Easy Set with 5,197 questions. Each set has also been pre-divided into Train, Development, and Test subsets.
  • SIQA [195] provides a framework for evaluating models’ ability for commonsense reasoning about social situations. The SIQA dataset has 38,000 multiple-choice questions designed to assess emotional and social intelligence in everyday circumstances, and it covers a wide variety of social scenarios. In SIQA, the candidate answers are a mixture of human-selected responses and machine-generated ones that have been filtered through adversarial processes.
  • OpenBookQA (OBQA) [196] is a question-answering dataset in which answering the questions requires additional common and commonsense knowledge not contained in the book, as well as rich text comprehension. This dataset includes around 6,000 multiple-choice questions. Each question is linked to one core fact, along with an additional collection of over 6,000 facts. The questions were developed using a multi-stage crowdsourcing and expert filtering procedure. OpenBookQA questions are difficult because they need multi-hop reasoning with limited background knowledge.
  • TruthfulQA [197] is designed specifically to evaluate the truthfulness of language models in generating answers to questions. This dataset includes 817 questions, written by the authors, from 38 different categories, including health, law, finance, and politics. These questions are purposefully designed to challenge human responders, as they may contain common misconceptions that lead to incorrect answers.
  • OPT-IML Bench [103] is a comprehensive benchmark for Instruction Meta-Learning. It covers 2000 NLP tasks from 8 existing benchmarks. The OPT-IML Bench consists of a training set with 17.9M examples, a dev set with 145K samples, and a test set with 321K samples.

C. Datasets for Augmented: using external knowledge/tools

This section focuses on datasets designed for the augmented abilities of LLMs.

  • HotpotQA [198] is designed as a diverse and explainable question-answering dataset that necessitates multi-hop reasoning. This dataset is derived from the English Wikipedia and consists of roughly 113,000 questions. Each question in the dataset comes with two paragraphs, called gold paragraphs, from two Wikipedia articles, along with a list of sentences in those paragraphs that crowdworkers have marked as important for answering the question.
  • ToolQA [199] is a question answering benchmark to evaluate LLMs’ ability to use external tools for answering questions.
  • GPT4Tools serves as an instructional dataset, generated by instructing advanced teachers (such as ChatGPT), with instructions conditioned on visual content and tool descriptions. This process results in the generation of instructions related to the use of tools. There are three versions of this dataset. The first version comprises 71,000 instruction-following data points used to fine-tune the GPT4Tools model. The second version consists of manually cleaned instruction data used for validation, covering instructions related to the tools from the first version. The last version is cleaned instruction data used for testing and includes instructions related to some tools that are not present in the first version.

6 PROMINENT LLMS’ PERFORMANCE ON BENCHMARKS

In this section we first provide an overview of some of the popular metrics used for evaluating the performance of LLMs under different scenarios. We then look at the performance of prominent large language models on some of the popular datasets and benchmarks.

Evaluating the performance of generative language models depends on the underlying task they are going to be used for. Tasks that mostly amount to selecting one choice out of a given set (such as sentiment analysis) can be treated as classification, and their performance can be evaluated using classification metrics such as accuracy, precision, recall, and F1. It is also important to note that for tasks such as multi-choice question answering the generated answer is always judged as either true or false; if the answer is not among the given options, it is counted as false as well.

However, tasks that are purely open-ended text generation cannot be evaluated in the same way as classification, and different metrics are required depending on the purpose of the evaluation. Code generation is a special case among open-ended generative evaluations: the generated code must pass the test suite, but it is also important to know whether a model can produce multiple distinct solutions and how likely it is that the correct one is among them. Pass@k is a good metric in this case. It works as follows: given a problem, several candidate solutions are generated and tested for correctness using functionality tests; from the n generated solutions, with c of them being correct, equation 4 provides the final value.
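Below is a minimal sketch of the unbiased pass@k estimator commonly used for this purpose (as popularized with the HumanEval/Codex evaluation), assuming n sampled solutions per problem of which c pass the tests; the numbers in the example are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: total number of generated solutions for a problem
    c: number of those solutions that pass all functionality tests
    k: number of samples the user is allowed to pick
    """
    if n - c < k:
        # Every possible k-subset contains at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 23 of them correct, evaluated at k = 1 and k = 10.
print(round(pass_at_k(200, 23, 1), 3))   # 0.115 (= 23/200)
print(round(pass_at_k(200, 23, 10), 3))  # considerably higher than pass@1
```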

TABLE II: LLM Datasets Overview.

Exact match (EM) is another metric that is mostly concerned with exact matches to (pre-defined) answers. It counts a prediction as correct if it exactly matches one of one or more reference texts, token by token. In some cases it coincides with accuracy, and equation 5 gives the mathematical definition, where M is the total number of correct answers and N is the total number of questions [202].
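A minimal sketch of computing EM (i.e., M/N per the definition above), assuming each question comes with one or more acceptable reference answers; practical implementations (e.g., SQuAD-style scoring) typically also normalize case and punctuation before comparing.

```python
def exact_match(predictions, references):
    """Exact match: fraction of predictions matching a reference answer exactly.

    references[i] is a list of acceptable answers for question i.
    """
    assert len(predictions) == len(references)
    correct = sum(
        1 for pred, refs in zip(predictions, references)
        if any(pred.strip() == ref.strip() for ref in refs)
    )
    return correct / len(predictions) if predictions else 0.0

# Example: 2 of 3 predictions match a reference exactly -> EM = 0.667
preds = ["Paris", "1969", "blue whale"]
refs = [["Paris"], ["1969", "the year 1969"], ["the blue whale"]]
print(round(exact_match(preds, refs), 3))
```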

Human equivalence score (HEQ), on the other hand, is an alternative to the F1 score [203]. HEQ-Q represents the precision of individual questions, wherein an answer is deemed correct if the model’s F1 score surpasses the average human F1 score. Likewise, HEQ-D denotes the precision of each dialogue; it is deemed accurate when all questions within the dialogue meet the criteria of HEQ [182].

Evaluation of other generative tasks such as machine translation is based on metrics such as ROUGE and BLEU. These scores work well when there is a reference text as ground truth (such as a translation) and a hypothesis generated by the generative model, in our case the LLM. They are mostly used when the goal is to measure the similarity between the answer and the ground truth computationally, meaning that nothing more elaborate than n-grams is used. Metrics such as BERTScore can also be applied here, but they carry their own error because another model is used as the judge. Even today, evaluating purely generated content is very hard and no completely satisfactory metric has been found; metrics either rely on simplistic surface features such as n-grams and skip-grams, or they are models whose own accuracy and precision are unknown [204].
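To illustrate what relying on nothing more than n-grams looks like in practice, here is a simplified ROUGE-N-style recall computed from raw token overlap. This is illustrative only; established implementations add tokenization rules, stemming, multiple references, and precision/F variants.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference: str, hypothesis: str, n: int = 2) -> float:
    """Simplified ROUGE-N recall: overlapping n-grams / reference n-grams."""
    ref_counts = ngrams(reference.split(), n)
    hyp_counts = ngrams(hypothesis.split(), n)
    if not ref_counts:
        return 0.0
    overlap = sum(min(count, hyp_counts[gram]) for gram, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

ref = "the cat sat on the mat"
hyp = "the cat is on the mat"
print(rouge_n_recall(ref, hyp, n=2))  # 3 of 5 reference bigrams recovered -> 0.6
```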

TABLE III: LLM categories and respective definitions.

| Category | Description |
| --- | --- |
| Small | Number of parameters ≤ 1B |
| Medium | 1B < Number of parameters ≤ 10B |
| Large | 10B < Number of parameters ≤ 100B |
| Very Large | 100B < Number of parameters |
| Foundation model | Pretrained language model |
| Instruction model | Pretrained and instruction fine-tuned language model |
| Chat model | Pretrained, instruction fine-tuned, and chat fine-tuned language model |
| Original model | An original model released as either a Foundation, Instruction, or Chat model |
| Tuned model | Fine-tuned version of an original model |
| Publicly available | Model and weights are available, with or without request |
| Publicly unavailable | Model and weights are not publicly available |

TABLE IV: Different LLM categorization.

Generative evaluation metrics are another type of evaluation metric for LLMs, in which another LLM is used to judge the answer. Depending on the task, evaluation may or may not be feasible in this way. Another dependency that makes generative evaluation error-prone is its reliance on the prompt itself. RAGAS is a good example of a framework that incorporates generative evaluation.
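As a rough illustration of generative (LLM-as-judge) evaluation, the sketch below scores a candidate answer with a judge model. Both `call_llm` and the prompt wording are hypothetical placeholders, not any specific framework's API; real setups (RAGAS included) use carefully designed prompts and aggregation over many samples.

```python
JUDGE_PROMPT = """You are a strict evaluator.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Rate the model answer from 1 (wrong) to 5 (fully correct and faithful).
Reply with only the number."""

def judge_answer(question: str, reference: str, candidate: str, call_llm) -> int:
    """Score a candidate answer with another LLM acting as the judge.

    call_llm: hypothetical callable taking a prompt string and returning
    the judge model's text completion.
    """
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    reply = call_llm(prompt).strip()
    try:
        return max(1, min(5, int(reply)))
    except ValueError:
        return 1  # unparseable judgment counts as the lowest score

# Example with a dummy judge that always answers "4".
print(judge_answer("Capital of France?", "Paris", "It is Paris.", lambda p: "4"))
```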

Various benchmarks and leaderboards have been proposed to address the most challenging question in the world of large language models: which one is better? However, there is no simple answer to this question; it depends on various aspects of large language models. Section V gives a categorical presentation of different tasks and the most important datasets in each category. We will follow the same categorization and provide a comparison based on each category. After providing a comparison for each category, we provide a broad overview of aggregated performance by averaging the reported performance metrics on different tasks.

Evaluating different LLMs can also be seen from different perspectives. For example, an LLM with drastically fewer parameters is not fully comparable to one with far more parameters. From this perspective, we categorize LLMs into four size classes: small (less than or equal to 1 billion parameters), medium (between 1 and 10 billion), large (between 10 and 100 billion), and very large (more than 100 billion). Another classification we use is the primary use case of each LLM: a Foundation model (pretrained language model with no instruction or chat fine-tuning), an Instruction model (pretrained language model with only instruction fine-tuning), or a Chat model (pretrained language model with instruction and chat fine-tuning). Apart from these categorizations, another category is needed to distinguish between original models and tuned ones. Original models are those released as a foundation model or a fine-tuned one, whereas tuned models take an original model and tune it with different datasets or even different training approaches; tuned models are thus usually original foundation models that have been further fine-tuned on specific datasets or with different approaches. Availability of the model weights, regardless of license, is another category in our classification: models whose weights are publicly available (even upon request) are noted as Public models, while others are noted as Private. Table III shows all of these definitions and abbreviations used in the rest of the article, and Figure 43 illustrates them visually.
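The size buckets above are a simple thresholding rule; the following snippet merely restates the Table III thresholds in code.

```python
def size_category(num_parameters_billion: float) -> str:
    """Map a parameter count (in billions) to the size buckets of Table III."""
    if num_parameters_billion <= 1:
        return "Small"
    if num_parameters_billion <= 10:
        return "Medium"
    if num_parameters_billion <= 100:
        return "Large"
    return "Very Large"

# Illustrative parameter counts, in billions.
for params in [0.35, 7, 70, 175]:
    print(f"{params}B -> {size_category(params)}")
# 0.35B -> Small, 7B -> Medium, 70B -> Large, 175B -> Very Large
```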

According to the provided categorizations, we can categorize and label each notable LLM as shown in Table IV. As can be seen from this table, models categorized as very large are also publicly unavailable.

B. LLMs’ Performance on Different Tasks

Commonsense reasoning is one of the important capabilities a model can obtain. This capability denotes the ability of the model to use prior knowledge in combination with reasoning skills. In the case of HellaSwag, for example, finding the continuation of a text is challenging because the given text contains only part of a story, the candidate continuations are deliberately tricky, and selecting the right one is not possible without prior knowledge about the world. This kind of reasoning deserves high attention because it requires combining previous knowledge with open, text-described scenes or facts. As can be seen from Table V, not just publicly unavailable models but also public ones can achieve good results on various tests.

Fig. 43: LLM categorizations.

From the results presented in Table V it is clear that GPT-4 achieves the best results on HellaSwag, while Davinci-003 is the best model on OBQA. It is also worth noting that results on OBQA are not reported for all of the models, so Davinci-003 may not actually be the best-performing model on OBQA.

TABLE V: Commonsense reasoning comparison.

TABLE VII: World knowledge comparison.

TABLE IX: Arithmetic reasoning comparison.

For some specific use cases, coding and code-generation capability is in high demand. Table VIII shows the results of different models on coding capability.

TABLE VIII: Coding capability comparison.

Large language models in some cases hallucinate answers simply because they are next-token prediction machines. Hallucination is one of the important factors in measuring how trustworthy and reliable a large language model is. Measuring hallucination, on the other hand, is not as easy as it seems, because each fact can be written in different styles and even the smallest change in wording makes it hard to detect. It is fair to assume that if a particular LLM is more capable of detecting hallucinated or false information in text, it is also more trustworthy. HaluEval is one of the datasets that aims to measure hallucination in this field [205]. Evaluation can also be performed by another model judging the response with regard to the actual answer [206]. Table X shows the evaluation of different models based on these datasets.

Arithmetic reasoning is another challenging reasoning capability to achieve. GSM8K, for example, contains grade school mathematical questions together with their answers. Table IX provides an insight into different model comparisons.

7 CHALLENGES AND FUTURE DIRECTIONS

As we have seen in the previous sections, large language models have achieved impressive results in the past 1-2 years.

TABLE X: Hallucination evaluation.

At the same time this is still a new and extremely active research area where the pace of innovation is increasing rather than slowing down. As in any other evolving area though, there are still numerous challenges ahead. Here we briefly mention some of the challenges and main active areas which are known so far.

A. Smaller and more efficient Language Models

This is a survey on large language models, and there has been an initial push towards “larger is better” that has clearly been rewarded with ever larger models like GPT-4 getting better accuracy and performance on benchmarks. However, those large models are costly and inefficient in several dimensions (e.g. high latency). In response to all of this, there is a current research trend to come up with Small Language Models (SLMs) as a cost-effective alternative to LLMs, particularly when used on specific tasks that might not require the full generality of larger models. Prominent works in this direction include Phi-1 [207], Phi-1.5 [208], and Phi-2 from Microsoft.

More generally, we should expect many research efforts in this area of how to train smaller and more efficient models. Techniques such as parameter-efficient fine-tuning (PEFT), teacher/student, and other forms of distillation – see section III-I – will continue to be used to build a smaller model out of larger ones.
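As one concrete flavor of PEFT, the sketch below illustrates the low-rank adapter idea (in the spirit of LoRA): the pretrained weight W stays frozen and only a small low-rank update B·A is trained. This is an illustrative numpy sketch under those assumptions, not any particular library's API; the dimensions and scaling factor are made up for the example.

```python
import numpy as np

d_out, d_in, rank = 512, 512, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable low-rank factor
B = np.zeros((d_out, rank))                     # trainable, zero-initialized so the
                                                # adapted layer starts identical to W
alpha = 16.0                                    # scaling factor for the update

def adapted_forward(x: np.ndarray) -> np.ndarray:
    """Forward pass with a LoRA-style low-rank update on top of frozen W."""
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

x = rng.normal(size=(4, d_in))
print(adapted_forward(x).shape)  # (4, 512)
# Only A and B are trained: rank * (d_in + d_out) parameters vs d_in * d_out frozen.
print(rank * (d_in + d_out), "trainable vs", d_in * d_out, "frozen")
```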

B. New Post-attention Architectural Paradigms

Transformer blocks have been a crucial and constant part of most current LLM frameworks, and it is a big question mark how much longer this architecture will remain in vogue, and what the next big architectural breakthrough in the field of deep learning (and NLP) will be. Since AlexNet in 2012, we have seen many architectures go in and out of fashion, including LSTM, GRU, and seq2seq, but Transformers have been the dominant approach since their inception. As described earlier, attention is the main mechanism driving transformers. More recently, there has been promising research in alternative approaches that are being labelled as post-attention.

An important class of such post-attention models are the so-called State Space Models (SSMs). While the notion of State Space Models has a long history in machine learning, in the context of language models SSM usually refers to the newer Structured State Space model architecture, or S4 for short (see Gu et al. [29]). Some recent models in this category are Mamba [30], Hyena [209], and Striped Hyena [210].

While all of those models are very competitive in terms of performance in leaderboards and efficiency, they also address an important challenge in more traditional attention-based architectures: the lack of support for larger context windows.

The longer the context length, the more tokens we can squeeze into the context, and the more information the model has access to, the better its response will be. On the other hand, with a very long context it is hard for the model to remember everything and efficiently process all the information. Attention-based models are highly inefficient for longer contexts, and that is why we should expect more research into different mechanisms that enable processing longer contexts and generally lead to more efficient architectures.

That being said, new architectures might not only propose alternatives for the attention mechanism but rather rethink the whole Transformer architecture. As an early example of this, Monarch Mixer [211] proposes a new architecture that uses the same sub-quadratic primitive that achieves high hardware efficiency on GPUs – Monarch matrices – along both sequence length and model dimension.

On the other end of the spectrum, it is worth mentioning that there are some attention-compatible architectural mechanisms that have recently been gaining steam and proving their value in creating better and more powerful LLMs. Probably the best example of such a mechanism is Mixture of Experts (MoE). MoEs have been around in machine learning for years, even before the deep learning era [212], but they have been gaining popularity since then, particularly in the context of Transformer models and LLMs.

In LLMs, MoEs allow training an extremely large model that is then only partially instantiated during inference: some of the experts are turned off wherever the gating/weighting function assigns them a low weight. As an example, the GLaM model has 1.2 trillion parameters, but during inference only 2 out of the 64 experts are used [84].
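A minimal numpy sketch of the top-k gating idea described above: for each token, the gate scores all experts, only the k best are executed, and their outputs are mixed by the normalized gate weights. This is illustrative only; real MoE layers add load-balancing losses, capacity limits, and sharded execution, and the dimensions here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

W_gate = rng.normal(size=(d_model, n_experts))
# Each "expert" here is just a small feed-forward weight matrix.
experts = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ W_gate                              # (tokens, n_experts) gate scores
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        top = np.argsort(logits[i])[-top_k:]         # indices of the k best experts
        weights = np.exp(logits[i][top] - logits[i][top].max())
        weights /= weights.sum()                     # softmax over the selected experts
        for w, e in zip(weights, top):
            out[i] += w * (token @ experts[e])       # only k of n experts run per token
    return out

tokens = rng.normal(size=(5, d_model))
print(moe_layer(tokens).shape)  # (5, 64)
```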

MoEs are nowadays an important component of the so-called frontier LLMs (i.e. the most advanced and capable models). GPT-4 itself is rumored to be based on an MoE architecture, and some of the best performing LLMs, such as Mixtral [117], are basically an MoE version of pre-existing LLMs.

Finally, it is important to note that MoEs can be used as a component of any architecture, regardless of whether it is based on attention or not. In fact, MoEs have also been applied to SSM-based LLMs like Mamba (e.g., the MoE-Mamba work). We should continue to see MoE-driven improvements in the future regardless of the underlying architecture.

C. Multi-modal Models

Future LLMs are expected to be multi-modal and handle a variety of data types, such as text, images, videos, and audio, in a unified manner. This opens up possibilities for more diverse applications in fields like question answering, content generation, creative arts, healthcare, robotics, and beyond. There are already several prominent multi-modal LLMs, including LLAVA [213], LLAVA-Plus [214], GPT-4 [33], Qwen-vl [116], and Next-GPT [215], and the trend is expected to continue. Evaluation of these models is also a new research topic, especially for conversational generative vision models [216]. Multi-modal LLMs can unlock huge potential in a variety of tasks, and there has already been decent progress in this direction, which deserves a dedicated paper to discuss all its details.

D. Improved LLM Usage and Augmentation techniques

As we described in Section IV, many of the shortcomings and limitations of LLMs, such as hallucination, can be addressed through advanced prompt engineering, use of tools, or other augmentation techniques. We should expect not only continued but accelerated research in this area.

LLM-based systems are already starting to replace machine learning systems that until recently used other approaches. As a clear example of this, LLMs are now being deployed to better understand people’s preferences and interests and to provide more personalized interactions, whether in customer service, content recommendation, or other applications. This involves better understanding user preferences and analyzing their past interactions to use them as context. We will continue to see research on applying LLMs not only to personalization and recommendation, but also to many other application areas currently served by other machine learning techniques.

Finally, another important area of research we expect to gather increased attention is that of LLM-based agents and multi-agent systems [172], [173], [174]. The development of LLM systems with access to external tools and decision-making capabilities is both exciting and challenging. We will see continued research and progress in this important area that some argue could lead to Artificial General Intelligence (AGI).

E. Security and Ethical/Responsible AI

Ensuring the robustness and security of LLMs against adversarial attacks and other vulnerabilities is a critical area of research [217]. As LLMs are increasingly deployed in real-world applications, they need to be protected from potential threats, to prevent them from being used to manipulate people or spread misinformation.

Addressing ethical concerns and biases in LLMs is another active area of research. Efforts are being made to ensure that LLMs are fair, unbiased, and capable of handling sensitive information responsibly. As LLMs are being used more and more by a large number of people on a daily basis, making sure they are unbiased and behave responsibly is crucial.

8 CONCLUSION

This paper presents a survey of LLMs developed in the past few years. We first provide an overview of early pre-trained language models (e.g., BERT), then review three popular LLM families (GPT, LLaMA, PaLM) and other representative LLMs. We then survey methods and techniques for building, augmenting, and using LLMs. We review popular LLM datasets and benchmarks, and compare the performance of a set of prominent models on public benchmarks. Finally, we present open challenges and future research directions.

Previous: CoT | Chain-of-Thought Without Prompting* Next: Survey | Continual Learning LLM Survey
