Contents
1. Introduction
2. Model Architecture
3. Training Infrastructure
4. Training Dataset
5. Evaluation
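Gemini models were developed with multimodal training spanning text, image, audio, and video. A key research question is whether this approach yields capabilities strong enough to compete with models specialized for each domain. The findings show that Gemini sets a new state of the art across a range of text, image, audio, and video benchmarks.
5.1. Text
5.1.1. Academic Benchmarks
Gemini Pro and Ultra were compared against existing LLMs and PaLM 2 on a range of text-based academic benchmarks. Gemini Ultra outperformed all existing models; on MMLU in particular, it exceeded human-expert performance with an accuracy of 90.04%. In mathematics, Gemini Ultra also outperformed all competing models on both grade-school math exams and competition-grade problems. In coding, Gemini Ultra scored highly across benchmarks, reaching the highest score of 74.9% on Python code generation.
5.1.2. Trends in Capabilities
The Gemini model family was evaluated on more than 50 benchmarks covering a range of capabilities. These include factuality, long-form summarization, math/science problem solving, and multilingual tasks. Quality improved consistently as model size increased, and Gemini Ultra was the best model across all capabilities.
5.1.3. Nano
Gemini Nano models are designed for on-device deployment and showed improved performance on summarization and reading comprehension tasks. Despite their small size, these models perform relatively well.
5.1.4. Multilinguality
Gemini models were also evaluated on multilingual tasks and performed well on translation between languages, summarization, and translated versions of common benchmarks. In particular, Gemini Ultra performed strongly across all language pairs in the WMT 23 translation benchmark.
5.1.5. Long Context
Gemini models were trained with a sequence length of 32,768 tokens and were found to use it effectively. This makes the models useful for understanding long documents and video.
5.1.6. Human Preference Evaluations
Human preference for the outputs of Gemini models is an important measure that complements automated evaluations. The Gemini Pro model was highly preferred for creativity, instruction following, and safety.
5.1.7. Complex Reasoning Systems
Combined with search and tool use, Gemini can be used to build powerful reasoning systems that solve complex multi-step problems. For example, AlphaCode 2 showed improved performance at solving competitive programming problems.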
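5.2. Multimodal
Gemini models are natively multimodal and can process a variety of inputs, including images, video, and audio. They show a range of capabilities such as high-level object recognition, fine-grained transcription, chart understanding, and multimodal reasoning.
5.2.1. Image Understanding
Gemini performed strongly on a variety of image understanding tasks and was particularly good at reading text in natural images without an OCR tool.
5.2.2. Video Understanding
Understanding video input is an important step toward a useful generalist agent. Gemini achieved state-of-the-art results on several video captioning and question answering tasks.
5.2.3. Image Generation
Gemini can output images directly from prompts that interleave text and images. This can be useful, for example, for suggesting image and text designs for a blog post or website.
5.2.4. Audio Understanding
Gemini Nano and Pro models were evaluated on several audio understanding tasks from public benchmarks and showed improved performance across languages. These models perform particularly well on automatic speech recognition and speech translation.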
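6. Responsible Deployment
During the development of the Gemini models, we follow a structured approach to responsible deployment in order to identify, measure, and manage foreseeable societal impacts, in line with previous releases of Google’s AI technology. This section outlines that approach and the key findings; further details will be shared in a forthcoming report.
6.1. Impact Assessment
We developed model impact assessments to identify, assess, and document the key societal benefits and harms associated with the development of Gemini models. They focus on areas such as factuality, child safety, harmful content, cybersecurity, biorisk, and representation and inclusivity. These assessments are updated in tandem with model development.
6.2. Model Policy
Building on this understanding of known and anticipated effects, we developed a set of “model policies” to guide model development and evaluation. They serve as standardized criteria and a prioritization scheme for responsible development, and as an indicator of launch readiness.
6.3. Evaluations
To assess the Gemini models against the policy areas and other key risk areas identified in the impact assessments, we developed a suite of evaluations across the full lifecycle of model development. Development evaluations are conducted for the purpose of ‘hill-climbing’ during training and fine-tuning, while assurance evaluations are conducted for governance and review at key milestones or at the end of training runs. External evaluations are conducted by partners outside Google to identify the model’s blind spots.
6.4. Mitigations
Mitigations are developed in response to the outcomes of the assessment, policy, and evaluation approaches described above. These mitigations concern data, instruction tuning, and factuality. (See the main text.)
6.4.1. Data
Prior to training, we take various steps at the data curation and collection stage to mitigate potential downstream harms. We filter high-risk content and ensure that all data is of high quality. We also ensure that data enrichment workers are paid at least a local living wage.
6.4.2. Instruction Tuning
Instruction tuning encompasses supervised fine-tuning (SFT) and reinforcement learning through human feedback (RLHF). We apply instruction tuning in both text and multimodal settings. Instruction tuning recipes are designed to balance increased helpfulness against reduced model harms.
6.4.3. Factuality
It is important that the model generates factual responses in a variety of scenarios. We focused instruction tuning efforts on three key behaviors that reflect real-world scenarios: attribution, closed-book response generation, and hedging. These behaviors were elicited by curating targeted supervised fine-tuning datasets and performing RLHF.
6.5. Deployment
After reviews are complete, we create a model card for each approved Gemini model, recording key performance and responsibility metrics in structured, consistent internal documentation, and we communicate these metrics externally as appropriate over time.
6.6. Responsible Governance
Throughout the responsible development process, we undertake ethics and safety reviews with the Google DeepMind Responsibility and Safety Council (RSC). The RSC evaluates Google DeepMind’s projects, papers, and collaborations against Google’s AI Principles and provides feedback on impact assessments, policies, evaluations, and mitigation efforts.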
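7. Limitations
Despite these impressive capabilities, it should be noted that there are limitations to the use of LLMs. Continued research and development on the “hallucinations” generated by LLMs is needed so that model outputs become more reliable and verifiable.
LLMs show impressive performance on exam benchmarks but struggle with tasks that require high-level reasoning abilities such as causal understanding, logical deduction, and counterfactual reasoning. This underscores the need for more challenging and robust evaluations to measure true understanding, as current state-of-the-art LLMs saturate many benchmarks.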
We present Gemini, a family of highly capable multimodal models developed at Google. We trained Gemini jointly across image, audio, video, and text data for the purpose of building a model with both strong generalist capabilities across modalities and cutting-edge understanding and reasoning performance in each respective domain.
Gemini 1.0, our first version, comes in three sizes: Ultra for highly-complex tasks, Pro for enhanced performance and deployability at scale, and Nano for on-device applications. Each size is specifically tailored to address different computational limitations and application requirements. We evaluate the performance of Gemini models on a comprehensive suite of internal and external benchmarks covering a wide range of language, coding, reasoning, and multimodal tasks.
Gemini advances state-of-the-art in large-scale language modeling (Anil et al., 2023; Brown et al., 2020; Chowdhery et al., 2023; Hoffmann et al., 2022; OpenAI, 2023a; Radford et al., 2019; Rae et al., 2021), image understanding (Alayrac et al., 2022; Chen et al., 2022; Dosovitskiy et al., 2020; OpenAI, 2023b; Reed et al., 2022; Yu et al., 2022a), audio processing (Radford et al., 2023; Zhang et al., 2023), and video understanding (Alayrac et al., 2022; Chen et al., 2023). It also builds on the work on sequence models (Sutskever et al., 2014), a long history of work in deep learning based on neural networks (LeCun et al., 2015), and machine learning distributed systems (Barham et al., 2022; Bradbury et al., 2018; Dean et al., 2012) that enable large-scale training.
Our most capable model, Gemini Ultra, achieves new state-of-the-art results in 30 of 32 benchmarks we report on, including 10 of 12 popular text and reasoning benchmarks, 9 of 9 image understanding benchmarks, 6 of 6 video understanding benchmarks, and 5 of 5 speech recognition and speech translation benchmarks. Gemini Ultra is the first model to achieve human-expert performance on MMLU (Hendrycks et al., 2021a), a prominent benchmark testing knowledge and reasoning via a suite of exams, with a score above 90%. Beyond text, Gemini Ultra makes notable advances on challenging multimodal reasoning tasks. For example, on the recent MMMU benchmark (Yue et al., 2023), which comprises questions about images on multi-discipline tasks requiring college-level subject knowledge and deliberate reasoning, Gemini Ultra achieves a new state-of-the-art score of 62.4%, outperforming the previous best model by more than 5 percentage points. It provides a uniform performance lift for video question answering and audio understanding benchmarks.
Qualitative evaluation showcases impressive crossmodal reasoning capabilities, enabling the model to understand and reason across an input sequence of audio, images, and text natively (see Figure 5 and Table 13). Consider the educational setting depicted in Figure 1 as an example. A teacher has drawn a physics problem of a skier going down a slope, and a student has worked through a solution to it. Using Gemini’s multimodal reasoning capabilities, the model is able to understand the messy handwriting, correctly understand the problem formulation, convert both the problem and solution to mathematical typesetting, identify the specific step of reasoning where the student went wrong in solving the problem, and then give a worked through correct solution to the problem. This opens up exciting educational possibilities, and we believe the new multimodal and reasoning capabilities of Gemini models have dramatic applications across many fields.
Figure 1. Verifying a student’s solution to a physics problem. The model is able to correctly recognize all of the handwritten content and verify the reasoning. On top of understanding the text in the image, it needs to understand the problem setup and correctly follow instructions to generate LaTeX.
The reasoning capabilities of large language models show promise toward building generalist agents that can tackle more complex multi-step problems. The AlphaCode team built AlphaCode 2 (Leblond et al., 2023), a new Gemini-powered agent that combines Gemini’s reasoning capabilities with search and tool-use to excel at solving competitive programming problems. AlphaCode 2 ranks within the top 15% of entrants on the Codeforces competitive programming platform, a large improvement over its state-of-the-art predecessor in the top 50% (Li et al., 2022).
In tandem, we advance the frontier of efficiency with Gemini Nano, a series of small models targeting on-device deployment. These models excel in on-device tasks, such as summarization, reading comprehension, and text completion, and exhibit impressive capabilities in reasoning, STEM, coding, multimodal, and multilingual tasks relative to their sizes.
In the following sections, we first provide an overview of the model architecture, training infrastructure, and training dataset. We then present detailed evaluations of the Gemini model family, covering well-studied benchmarks and human-preference evaluations across text, code, image, audio and video, which include both English performance and multilingual capabilities. We also discuss our approach to responsible deployment, including our process for impact assessments, developing model policies, evaluations, and mitigations of harm before deployment decisions. Finally, we discuss the broader implications of Gemini, its limitations alongside its potential applications, paving the way for a new era of research and innovation in AI.
Gemini models build on top of Transformer decoders (Vaswani et al., 2017) that are enhanced with improvements in architecture and model optimization to enable stable training at scale and optimized inference on Google’s Tensor Processing Units. They are trained to support 32k context length, employing efficient attention mechanisms (e.g., multi-query attention (Shazeer, 2019)). Our first version, Gemini 1.0, comprises three main sizes to support a wide range of applications, as summarized in Table 1.
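To make the multi-query attention idea concrete, here is a minimal pure-Python sketch in which several query heads share a single set of keys and values. It is an illustration only: the function names, the tensor layout, and the causal mask are assumptions chosen for a decoder-style example, not details taken from the Gemini implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def multi_query_attention(queries, keys, values):
    """Causal multi-query attention.

    queries: [num_heads][seq_len][d]  (one set of queries per head)
    keys, values: [seq_len][d]        (a single set shared by every head)
    Returns a list shaped [num_heads][seq_len][d].
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for head_queries in queries:
        head_out = []
        for t, q in enumerate(head_queries):
            # Causal mask: position t attends only to positions 0..t.
            scores = [dot(q, keys[s]) * scale for s in range(t + 1)]
            weights = softmax(scores)
            head_out.append([
                sum(w * values[s][j] for s, w in enumerate(weights))
                for j in range(len(values[0]))
            ])
        outputs.append(head_out)
    return outputs

# Two query heads, sequence length 2, dimension 2 (toy values).
queries = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
out = multi_query_attention(queries, keys, values)
```

Because the keys and values are stored once rather than once per head, the cache that must be kept during decoding shrinks by roughly the number of heads, which is the usual motivation for this variant.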
Model size | Model description |
---|---|
Ultra | Our most capable model that delivers state-of-the-art performance across a wide range of highly complex tasks, including reasoning and multimodal tasks. It is efficiently serveable at scale on TPU accelerators due to the Gemini architecture. |
Pro | A performance-optimized model in terms of cost as well as latency that delivers significant performance across a wide range of tasks. This model exhibits strong reasoning performance and broad multimodal capabilities. |
Nano | Our most efficient model, designed to run on-device. We trained two versions of Nano, with 1.8B (Nano-1) and 3.25B (Nano-2) parameters, targeting low and high memory devices respectively. It is trained by distilling from larger Gemini models. It is 4-bit quantized for deployment and provides best-in-class performance. |
Table 1. An overview of the Gemini 1.0 model family.
Gemini models are trained to accommodate textual input interleaved with a wide variety of audio and visual inputs, such as natural images, charts, screenshots, PDFs, and videos, and they can produce text and image outputs (see Figure 2). The visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022), with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b).
Video understanding is accomplished by encoding the video as a sequence of frames in the large context window. Video frames or images can be interleaved naturally with text or audio as part of the model input. The models can handle variable input resolution in order to spend more compute on tasks that require fine-grained understanding. In addition, Gemini can directly ingest audio signals at 16kHz from Universal Speech Model (USM) (Zhang et al., 2023) features. This enables the model to capture nuances that are typically lost when the audio is naively mapped to a text input (for example, see audio understanding demo on the website).
Figure 2. Gemini supports interleaved sequences of text, image, audio, and video as inputs (illustrated by tokens of different colors in the input sequence). It can output responses with interleaved image and text.
Training the Gemini family of models required innovations in training algorithms, dataset, and infrastructure. For the Pro model, the inherent scalability of our infrastructure and learning algorithms enables us to complete pretraining in a matter of weeks, leveraging a fraction of the Ultra’s resources. The Nano series of models leverages additional advancements in distillation and training algorithms to produce the best-in-class small language models for a wide variety of tasks, such as summarization and reading comprehension, which power our next generation on-device experiences.
We trained Gemini models using TPUv5e and TPUv4 (Jouppi et al., 2023), depending on their sizes and configuration. Training Gemini Ultra used a large fleet of TPUv4 accelerators across multiple datacenters. This represents a significant increase in scale over our prior flagship model PaLM-2, which presented new infrastructure challenges. Scaling up the number of accelerators results in a proportionate decrease in the mean time between failures of hardware in the overall system. We minimized the rate of planned reschedules and preemptions, but genuine machine failures are commonplace across all hardware accelerators at such large scales.
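The proportional relationship between accelerator count and failure frequency can be sketched with a simple model. The snippet below assumes chips fail independently at a constant rate, so the system-wide mean time between failures is the per-chip value divided by the chip count. The per-chip figure and the fleet sizes are made-up placeholders, not numbers from this report.

```python
def system_mtbf_hours(per_chip_mtbf_hours: float, num_chips: int) -> float:
    """System MTBF under the assumption of independent, constant-rate failures."""
    if num_chips <= 0:
        raise ValueError("num_chips must be positive")
    return per_chip_mtbf_hours / num_chips

# Hypothetical per-chip MTBF of 50,000 hours (an assumption for illustration).
for chips in (1_024, 4_096, 16_384):
    print(chips, "chips ->", round(system_mtbf_hours(50_000, chips), 2), "hours between failures")
```

Under this model, quadrupling the fleet cuts the expected time between failures to a quarter, which is why recovery speed matters more as the job grows.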
TPUv4 accelerators are deployed in “SuperPods” of 4096 chips, each connected to a dedicated optical switch, which can dynamically reconfigure 4x4x4 chip cubes into arbitrary 3D torus topologies in around 10 seconds (Jouppi et al., 2023). For Gemini Ultra, we decided to retain a small number of cubes per superpod to allow for hot standbys and rolling maintenance.
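The arithmetic behind these figures is simple enough to check directly. The sketch below derives the number of 4x4x4 cubes in a 4096-chip SuperPod and shows how many chips remain if some cubes are held back as spares. The spare count is a hypothetical parameter, and reading "retain" as "reserve for standby" is an interpretation; the report does not give the number.

```python
CHIPS_PER_SUPERPOD = 4096   # from the text
CUBE_SHAPE = (4, 4, 4)      # from the text

def chips_per_cube(shape=CUBE_SHAPE) -> int:
    return shape[0] * shape[1] * shape[2]

def cubes_per_superpod(chips: int = CHIPS_PER_SUPERPOD) -> int:
    return chips // chips_per_cube()

def chips_available(spare_cubes: int) -> int:
    """Chips left for training after reserving `spare_cubes` cubes (hypothetical count)."""
    total = cubes_per_superpod()
    if not 0 <= spare_cubes <= total:
        raise ValueError("spare_cubes must be between 0 and the cube count")
    return (total - spare_cubes) * chips_per_cube()

print(cubes_per_superpod())   # number of cubes in one SuperPod
print(chips_available(2))     # chips left if two cubes are reserved
```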
TPU accelerators primarily communicate over the high speed inter-chip-interconnect, but at Gemini Ultra scale, we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network (Poutievski et al., 2022; Wetherall et al., 2023; yao Hong et al., 2018). Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.
The ‘single controller’ programming model of JAX (Bradbury et al., 2018) and Pathways (Barham et al., 2022) allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow. The GSPMD partitioner (Xu et al., 2021) in the XLA compiler partitions the training step computation, and the MegaScale XLA compiler (XLA, 2019) pass statically schedules appropriate collectives so that they maximally overlap with the computation with very little variation in step time.
Maintaining a high goodput at this scale would have been impossible using the conventional approach of periodic checkpointing of weights to persistent cluster storage. For Gemini, we instead made use of redundant in-memory copies of the model state, and on any unplanned hardware failures, we rapidly recover directly from an intact model replica. Compared to both PaLM and PaLM-2 (Anil et al., 2023), this provided a substantial speedup in recovery time, despite the significantly larger training resources being used. As a result, the overall goodput for the largest-scale training job increased from 85% to 97%.
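Using the report's definition of goodput (time spent computing useful new steps divided by the elapsed time of the training job), the 85% and 97% figures can be turned into lost time. The sketch below does this for a hypothetical 100-hour window; the window length and function names are illustrative assumptions.

```python
def goodput(useful_step_hours: float, elapsed_hours: float) -> float:
    """Fraction of elapsed time spent computing useful new steps."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return useful_step_hours / elapsed_hours

def lost_hours(elapsed_hours: float, goodput_fraction: float) -> float:
    """Time not spent on useful steps (restarts, recovery, recomputation)."""
    return elapsed_hours * (1.0 - goodput_fraction)

window = 100.0  # hypothetical elapsed hours
print(lost_hours(window, 0.85))  # overhead at 85% goodput
print(lost_hours(window, 0.97))  # overhead at 97% goodput
```

In this example the overhead falls from about 15 hours to about 3 hours per 100 elapsed hours, a fivefold reduction in wasted accelerator time.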
Training at unprecedented scale invariably surfaces new and interesting systems failure modes and in this instance one of the problems that we needed to address was that of “Silent Data Corruption (SDC)” (Dixit et al., 2021; Hochschild et al., 2021; Vishwanathan et al., 2015). Although these are extremely rare, the scale of Gemini means that we can expect SDC events to impact training every week or two. Rapidly detecting and removing faulty hardware required several new techniques that exploit deterministic replay to isolate incorrect computations, combined with proactive SDC scanners on idle machines and hot standbys. Our fully deterministic infrastructure allowed us to quickly identify root causes (including hardware failures) during the development leading up to the Ultra model, and this was a crucial ingredient towards stable training.
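The idea of using deterministic replay to catch silent corruption can be sketched as follows: if a step is fully deterministic, rerunning it on the same inputs must give bit-identical outputs, so any mismatch in a fingerprint points at a faulty executor. This is a toy illustration with hypothetical function names, not the tooling used for Gemini.

```python
import hashlib
import struct

def fingerprint(values):
    """Bit-exact SHA-256 digest of a list of floats."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()

def training_step(weights, grads, lr=0.1):
    """Toy deterministic update standing in for a real training step."""
    return [w - lr * g for w, g in zip(weights, grads)]

def faulty_step(weights, grads, lr=0.1):
    """Same step, but with one silently corrupted output element."""
    out = training_step(weights, grads, lr)
    out[0] += 1e-6
    return out

def replay_matches(step_a, step_b, weights, grads):
    """Run identical inputs through two executors and compare fingerprints."""
    return fingerprint(step_a(weights, grads)) == fingerprint(step_b(weights, grads))

weights, grads = [1.0, 2.0], [0.5, 0.5]
print(replay_matches(training_step, training_step, weights, grads))  # healthy pair
print(replay_matches(training_step, faulty_step, weights, grads))    # corrupted executor
```

The comparison only works because the step is deterministic; with nondeterministic kernels, a fingerprint mismatch would not distinguish corruption from ordinary run-to-run variation.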
Gemini models are trained on a dataset that is both multimodal and multilingual. Our pretraining dataset uses data from web documents, books, and code, and includes image, audio, and video data.
We use the SentencePiece tokenizer (Kudo and Richardson, 2018) and find that training the tokenizer on a large sample of the entire training corpus improves the inferred vocabulary and subsequently improves model performance. For example, we find Gemini models can efficiently tokenize non-Latin scripts which can, in turn, benefit model quality as well as training and inference speed.
The number of tokens used to train the largest models was determined following the approach in Hoffmann et al. (2022). The smaller models are trained for significantly more tokens to improve performance for a given inference budget, similar to the approach advocated in Touvron et al. (2023a).
We apply quality filters to all datasets, using both heuristic rules and model-based classifiers. We also perform safety filtering to remove harmful content. We filter our evaluation sets from our training corpus. The final data mixtures and weights were determined through ablations on smaller models. We stage training to alter the mixture composition during training – increasing the weight of domain-relevant data towards the end of training. We find that data quality is critical to a highly performing model, and believe that many interesting questions remain around finding the optimal dataset distribution for pretraining.
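The report does not say how evaluation sets are filtered from the training corpus. One common approach, shown here purely as an assumed example, is to drop any training document that shares a long word n-gram with an evaluation example. The function names, whitespace tokenization, and the default n are all illustrative choices.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_eval_index(eval_examples, n: int = 8) -> set:
    """Union of all n-grams that appear in any evaluation example."""
    index = set()
    for example in eval_examples:
        index |= ngrams(example, n)
    return index

def decontaminate(train_docs, eval_examples, n: int = 8):
    """Keep only training documents with no n-gram overlap with the eval set."""
    index = build_eval_index(eval_examples, n)
    return [doc for doc in train_docs if not (ngrams(doc, n) & index)]

eval_set = ["the quick brown fox jumps"]
train = ["I saw the quick brown fox yesterday", "an unrelated sentence about cats"]
clean = decontaminate(train, eval_set, n=3)
```

A short n gives aggressive filtering with more false positives, while a long n misses paraphrased or lightly edited leaks, so the threshold is a judgment call in practice.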
We define goodput as the time spent computing useful new steps over the elapsed time of the training job.
The Gemini models are natively multimodal, as they are trained jointly across text, image, audio, and video. One open question is whether this joint training can result in a model which has strong capabilities in each domain – even when compared to models and approaches that are narrowly tailored to single domains. We find this to be the case: Gemini sets a new state of the art across a wide range of text, image, audio, and video benchmarks.
We compare Gemini Pro and Ultra to a suite of external LLMs and our previous best model PaLM 2 across a series of text-based academic benchmarks covering reasoning, reading comprehension, STEM, and coding. We report these results in Table 2. Broadly, we find that the performance of Gemini Pro outperforms inference-optimized models such as GPT-3.5 and performs comparably with several of the most capable models available, and Gemini Ultra outperforms all current models. In this section, we examine some of these findings.
On MMLU (Hendrycks et al., 2021a), Gemini Ultra can outperform all existing models, achieving an accuracy of 90.04%. MMLU is a holistic exam benchmark, which measures knowledge across a set of 57 subjects. Human expert performance is gauged at 89.8% by the benchmark authors, and Gemini Ultra is the first model to exceed this threshold, with the prior state-of-the-art result at 86.4%. Achieving high performance requires specialist knowledge across many domains (e.g. law, biology, history, etc.), alongside reading comprehension and reasoning. We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer; otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought. We refer the reader to the appendix for a detailed breakdown of how this approach compares with only chain-of-thought prompting or only greedy sampling.
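The routing rule described above can be written down in a few lines. The sketch below takes the final answers extracted from k chain-of-thought samples plus one greedy answer, and returns the majority answer only when its vote share exceeds the threshold. The function name, the strict comparison, and the example threshold are assumptions; the report says only that the threshold is chosen on the validation split.

```python
from collections import Counter

def uncertainty_routed_answer(cot_answers, greedy_answer, threshold):
    """Pick the consensus chain-of-thought answer if it is confident enough.

    cot_answers: final answers from k sampled chains of thought
    greedy_answer: answer from greedy decoding without chain of thought
    threshold: minimum vote share (0..1) the top answer must exceed
    """
    if not cot_answers:
        return greedy_answer
    answer, votes = Counter(cot_answers).most_common(1)[0]
    if votes / len(cot_answers) > threshold:
        return answer
    return greedy_answer

# k = 8 samples with a clear consensus: the majority answer wins.
confident = uncertainty_routed_answer(["B"] * 6 + ["C"] * 2, "A", threshold=0.7)
# k = 8 samples with no consensus: fall back to the greedy answer.
uncertain = uncertainty_routed_answer(["B", "C", "D", "B", "A", "C", "D", "A"], "A", threshold=0.7)
```

The fallback matters because a split vote signals that sampling is not converging, in which case the single maximum-likelihood answer is treated as the safer choice.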
In mathematics, a field commonly used to benchmark the analytical capabilities of models, Gemini Ultra shows strong performance on both elementary exams and competition-grade problem sets. For the grade-school math benchmark, GSM8K (Cobbe et al., 2021), we find Gemini Ultra reaches 94.4% accuracy with chain-of-thought prompting and self-consistency (Wang et al., 2022) compared to the previous best accuracy of 92% with the same prompting technique. Similar positive trends are observed in increased difficulty math problems drawn from middle and high-school math competitions (MATH benchmark), with the Gemini Ultra model outperforming all competitor models, reaching 53.2% using 4-shot prompting. The model also outperforms the state of the art on even harder tasks derived from American Mathematical Competitions (150 questions from 2022 and 2023). Smaller models perform poorly on this challenging task scoring close to random, but Gemini Ultra can solve 32% of the questions, compared to the 30% solve rate for GPT-4.
Gemini Ultra also excels in coding, a popular use case of current LLMs. We evaluate the model on many conventional and internal benchmarks and also measure its performance as part of more complex reasoning systems such as AlphaCode 2 (see section 5.1.7 on complex reasoning systems). For example, on HumanEval, a standard code-completion benchmark (Chen et al., 2021) mapping function descriptions to Python implementations, instruction-tuned Gemini Ultra correctly implements 74.4% of problems. On a new held-out evaluation benchmark for Python code generation tasks, Natural2Code, where we ensure no web leakage, Gemini Ultra achieves the highest score of 74.9%.
Benchmark | Gemini Ultra | Gemini Pro | GPT-4 | GPT-3.5 | PaLM 2-L | Claude 2 | Inflection-2 | Grok 1 | LLAMA-2 |
---|---|---|---|---|---|---|---|---|---|
MMLU | 90.04% CoT@32∗; 83.7% 5-shot | 79.13% CoT@8∗; 71.8% 5-shot | 87.29% CoT@32 (via API∗∗); 86.4% 5-shot (reported) | 70% 5-shot | 78.4% 5-shot | 78.5% 5-shot CoT | 79.6% 5-shot | 73.0% 5-shot | 68.0%∗∗∗ |
GSM8K | 94.4% Maj1@32 | 86.5% Maj1@32 | 92.0% SFT & 5-shot CoT | 57.1% 5-shot | 80.0% 5-shot | 88.0% 0-shot | 81.4% 8-shot | 62.9% 8-shot | 56.8% 5-shot |
MATH | 53.2% 4-shot | 32.6% 4-shot | 52.9% 4-shot (via API∗∗); 50.3% (Zheng et al., 2023) | 34.1% 4-shot (via API∗∗) | 34.4% 4-shot | — | 34.8% | 23.9% 4-shot | 13.5% 4-shot |
BIG-Bench-Hard | 83.6% 3-shot | 75.0% 3-shot | 83.1% 3-shot (via API∗∗) | 66.6% 3-shot (via API∗∗) | 77.7% 3-shot | — | — | — | 51.2% 3-shot |
HumanEval | 74.4% 0-shot (IT) | 67.7% 0-shot (IT) | 67.0% 0-shot (reported) | 48.1% 0-shot | — | 70.0% 0-shot | 44.5% 0-shot | 63.2% 0-shot | 29.9% 0-shot |
Natural2Code | 74.9% 0-shot | 69.6% 0-shot | 73.9% 0-shot (via API∗∗) | 62.3% 0-shot (via API∗∗) | — | — | — | — | — |
DROP | 82.4% Variable shots | 74.1% Variable shots | 80.9% 3-shot (reported) | 64.1% 3-shot | 82.0% Variable shots | — | — | — | — |
HellaSwag | 87.8% 10-shot | 84.7% 10-shot | 95.3% 10-shot (reported) | 85.5% 10-shot | 86.8% 10-shot | — | 89.0% 10-shot | — | 80.0%∗∗∗ |
WMT23 | 74.4 1-shot (IT) | 71.7 1-shot | 73.8 1-shot (via API∗∗) | — | 72.7 1-shot | — | — | — | — |
Table 2. Gemini performance on text benchmarks with external comparisons and PaLM 2-L.
∗ The model produces a chain of thought with k = 8 or 32 samples. If there is a consensus above a threshold (chosen based on the validation split), it selects this answer; otherwise it reverts to a greedy sample. Further analysis in Appendix 9.1.
∗∗ Results self-collected via the API in November 2023.
∗∗∗ Results shown use the decontaminated numbers from the Touvron et al. (2023b) report as the most relevant comparison to Gemini models, which have been decontaminated as well.
Evaluation on these benchmarks is challenging and may be affected by data contamination. We performed an extensive leaked data analysis after training to ensure the results we report here are as scientifically sound as possible, but still found some minor issues and decided not to report results on e.g. LAMBADA (Paperno et al., 2016). As part of the evaluation process, on a popular benchmark, HellaSwag (Zellers et al., 2019), we find that an additional hundred fine-tuning steps on specific website extracts corresponding to the HellaSwag training set (which were not included in Gemini pretraining set) improve the validation accuracy of Gemini Pro to 89.6% and Gemini Ultra to 96.0%, when measured with 1-shot prompting (we measured GPT-4 obtained 92.3% when evaluated 1-shot via the API). This suggests that the benchmark results are susceptible to the pretraining dataset composition. We choose to report HellaSwag decontaminated results only in a 10-shot evaluation setting. We believe there is a need for more robust and nuanced standardized evaluation benchmarks with no leaked data. So, we evaluate Gemini models on several new held-out evaluation datasets that were recently released, such as WMT23 and Math-AMC 2022-2023 problems, or internally generated from non-web sources, such as Natural2Code. We refer the reader to the appendix for a comprehensive list of our evaluation benchmarks.
Even so, model performance on these benchmarks gives us an indication of the model capabilities and where they may provide impact on real-world tasks. For example, Gemini Ultra’s impressive reasoning and STEM competencies pave the way for advancements in LLMs within the educational domain. The ability to tackle complex mathematical and scientific concepts opens up exciting possibilities for personalized learning and intelligent tutoring systems.
We investigate the trends in capabilities across the Gemini model family by evaluating them on a holistic harness of more than 50 benchmarks in six different capabilities, noting that some of the most notable benchmarks were discussed in the last section. These capabilities are: “Factuality” covering open/closed-book retrieval and question answering tasks; “Long-Context” covering long form summarization, retrieval and question answering tasks; “Math/Science” including tasks for mathematical problem solving, theorem proving, and scientific exams; “Reasoning” tasks that require arithmetic, scientific, and commonsense reasoning; “Multilingual” tasks for translation, summarization, and reasoning in multiple languages. Please see appendix for a detailed list of tasks included for each capability.
Figure 3. Language understanding and generation performance of Gemini model family across different capabilities (normalized by the Gemini Pro model).
We observe consistent quality gains with increased model size in Figure 3, especially in reasoning, math/science, summarization and long-context. Gemini Ultra is the best model across the board for all six capabilities. Gemini Pro, the second-largest model in the Gemini family of models, is also quite competitive while being a lot more efficient to serve.
Bringing AI closer to the user, we discuss the Gemini Nano 1 and Nano 2 models engineered for on-device deployments. These models excel in summarization and reading comprehension tasks with per-task fine-tuning. Figure 3 shows the performance of these pretrained models in comparison to the much larger Gemini Pro model, while Table 3 dives deeper into specific factuality, coding, Math/Science, and reasoning tasks. Nano-1 and Nano-2 model sizes are only 1.8B and 3.25B parameters respectively. Despite their size, they show exceptionally strong performance on factuality, i.e. retrieval-related tasks, and significant performance on reasoning, STEM, coding, multimodal and multilingual tasks.
See demos on website https://deepmind.google/gemini.
Task | Gemini Nano 1 Accuracy | Normalized by Pro (Nano 1) | Gemini Nano 2 Accuracy | Normalized by Pro (Nano 2) |
---|---|---|---|---|
BoolQ | 71.6 | 0.81 | 79.3 | 0.90 |
TydiQA (GoldP) | 68.9 | 0.85 | 74.2 | 0.91 |
NaturalQuestions (Retrieved) | 38.6 | 0.69 | 46.5 | 0.83 |
NaturalQuestions (Closed-book) | 18.8 | 0.43 | 24.8 | 0.56 |
BIG-Bench-Hard (3-shot) | 34.8 | 0.47 | 42.4 | 0.58 |
MBPP | 20.0 | 0.33 | 27.2 | 0.45 |
MATH (4-shot) | 13.5 | 0.41 | 22.8 | 0.70 |
MMLU (5-shot) | 45.9 | 0.64 | 55.8 | 0.78 |
This table shows the accuracy of the Gemini Nano 1 and Gemini Nano 2 models on a range of tasks, together with each accuracy normalized by the score of the Gemini Pro model.
Table 3. Performance of Gemini Nano series on factuality, summarization, reasoning, coding and STEM tasks compared to significantly larger Gemini Pro model.
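As a sanity check on the normalized columns, the sketch below assumes that "Normalized by Pro" means the Nano accuracy divided by the Gemini Pro accuracy on the same task, and back-computes the implied Pro score from each Nano row. If the assumption holds, the two estimates for a task should agree up to rounding. Only three rows from Table 3 are included, and the helper name is illustrative.

```python
# (Nano 1 accuracy, Nano 1 ratio, Nano 2 accuracy, Nano 2 ratio), copied from Table 3.
TABLE_3 = {
    "BoolQ": (71.6, 0.81, 79.3, 0.90),
    "MATH (4-shot)": (13.5, 0.41, 22.8, 0.70),
    "MMLU (5-shot)": (45.9, 0.64, 55.8, 0.78),
}

def implied_pro_score(accuracy: float, ratio: float) -> float:
    """Pro accuracy implied by ratio = accuracy / pro_accuracy (assumed definition)."""
    return accuracy / ratio

for task, (acc1, ratio1, acc2, ratio2) in TABLE_3.items():
    from_nano1 = implied_pro_score(acc1, ratio1)
    from_nano2 = implied_pro_score(acc2, ratio2)
    print(f"{task}: {from_nano1:.1f} (from Nano 1) vs {from_nano2:.1f} (from Nano 2)")
```

The ratios are rounded to two decimals, so the back-computed values are approximate and small disagreements between the two estimates are expected.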
The multilingual capabilities of the Gemini models are evaluated using a diverse set of tasks requiring multilingual understanding, cross-lingual generalization, and the generation of text in multiple languages. These tasks include machine translation benchmarks (WMT 23 for high-medium-low resource translation; Flores, NTREX for low and very low resource languages), summarization benchmarks (XLSum, Wikilingua), and translated versions of common benchmarks (MGSM: professionally translated into 11 languages).
Machine Translation. Translation is a canonical benchmark in machine learning with a rich history. We evaluated Gemini Ultra with instruction-tuning applied (see section 6.4.2) on the entire set of language pairs in the WMT 23 translation benchmark in a few-shot setting. Overall, we found that Gemini Ultra (and other Gemini models) performed remarkably well at translating from English to any other language, and surpassed the LLM-based translation methods when translating out-of-English, on high-resource, mid-resource and low-resource languages. In the WMT 23 out-of-English translation tasks, Gemini Ultra achieved the highest LLM-based translation quality, with an average BLEURT (Sellam et al., 2020) score of 74.8, compared to GPT-4’s score of 73.6, and PaLM 2’s score of 72.2. When averaged across all language pairs and directions for WMT 23, we see a similar trend with Gemini Ultra 74.4, GPT-4 73.8 and PaLM 2-L 72.7 average BLEURT scores on this benchmark.
Model | High Resource | Mid Resource | Out-of-English | Into-English | All Languages |
---|---|---|---|---|---|
Gemini Ultra | 74.2 | 74.7 | 74.8 | 73.9 | 74.4 |
Gemini Pro | 71.7 | 71.8 | 71.5 | 72.0 | 71.7 |
Gemini Nano 2 | 67.7 | 67.0 | 66.2 | 69.0 | 67.4 |
Gemini Nano 1 | 64.1 | 64.8 | 65.2 | 63.5 | 64.8 |
GPT-4 | 74.0 | 73.6 | 73.6 | 74.1 | 73.8 |
PaLM 2-L | 72.6 | 72.7 | 72.2 | 73.4 | 72.7 |
Table 4. Performance of Gemini models on WMT 23 translation benchmark. All numbers with 1-shot.
In addition to the languages and translation tasks above, we also evaluate Gemini Ultra on very low-resource languages. These languages were sampled from the tail of the following language sets: Flores-200 (Tamazight and Kanuri), NTREX (North Ndebele), and an internal benchmark (Quechua).
For these languages, both from and into English, Gemini Ultra achieved an average chrF score of 27.0 in 1-shot setup, while the next-best model, PaLM 2-L, achieved a score of 25.3.
Multilingual Math and Summarization. Beyond translation, we evaluated how well Gemini performs in challenging tasks across a range of languages. We specifically investigated the math benchmark MGSM (Shi et al., 2023), which is a translated variant of the math benchmark GSM8K (Cobbe et al., 2021). We find Gemini Ultra achieves an accuracy of 79.0%, an advance over PaLM 2-L which scores 74.7%, when averaged across all languages in an 8-shot setup. We also benchmark Gemini on the multilingual summarization benchmarks XLSum (Hasan et al., 2021) and WikiLingua (Ladhak et al., 2020). In XLSum, Gemini Ultra reached an average ROUGE-L score of 17.6 compared to 15.4 for PaLM 2. For WikiLingua, Gemini Ultra (5-shot) trails behind PaLM 2 (3-shot) measured in BLEURT score. See Table 5 for the full results. Overall, the diverse set of multilingual benchmarks shows that Gemini family models have broad language coverage, enabling them to also reach locales and regions with low-resource languages.
Model | MGSM (8-shot) | XLsum (3-shot) | Wikilingua |
---|---|---|---|
Gemini Ultra | 79.0 | 17.6 | 48.9 |
Gemini Pro | 63.5 | 16.2 | 47.8 |
GPT-4 | 74.5 | — | — |
PaLM 2-L | 74.7 | 15.4 | 50.4 |
Table 5. Performance of Gemini models on multilingual math and summarization.
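As a rough illustration of the rougeL metric reported for XLSum, the sketch below computes a sentence-level ROUGE-L F-measure from the longest common subsequence (LCS) of two token sequences. It is a minimal sketch under stated assumptions, not the scoring code used for Table 5: the function names are hypothetical, tokens are obtained by whitespace splitting (which does not suit languages written without spaces), and precision and recall are weighted equally.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F-measure with equal weight on precision and recall.

    Whitespace tokenization is an assumption made for this sketch.
    """
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Published scores usually come from an established scoring package with its own tokenization and stemming rules, so numbers from this sketch are not directly comparable with those in Table 5.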
Gemini models are trained with a sequence length of 32,768 tokens and we find that they make use of their context length effectively. We first verify this by running a synthetic retrieval test: we place key-value pairs at the beginning of the context, then add long filler text, and ask for the value associated with a particular key. We find that the Ultra model retrieves the correct value with 98% accuracy when queried across the full context length. We further investigate this by plotting the negative log likelihood (NLL) versus the token index across a held-out set of long documents in Figure 4. We find that the NLL decreases with sequence position up to the full 32K context length. The longer context length of Gemini models enables new use cases such as retrieval over documents and video understanding, discussed in section 5.2.2.
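The synthetic retrieval test described above can be sketched as follows. This is an illustrative reconstruction rather than the evaluation harness behind the reported 98% figure: the function names, the key and value formats, the prompt wording, and the use of repeated filler words in place of real tokens are all assumptions, and `model_fn` is a hypothetical callable that maps a prompt string to a response string.

```python
import random


def build_retrieval_prompt(num_pairs, filler_words, seed=0):
    """Build a prompt with key-value pairs first, filler next, and a query last.

    Returns (prompt, expected_value). The formatting is an assumption made
    for illustration only.
    """
    rng = random.Random(seed)
    pairs = {
        f"key-{i:04d}": f"value-{rng.randrange(10**6):06d}"
        for i in range(num_pairs)
    }
    query_key = rng.choice(sorted(pairs))

    header = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    # Filler text separates the query from the pairs it depends on.
    filler = " ".join(["lorem"] * filler_words)
    question = f"What is the value associated with {query_key}?"
    return f"{header}\n{filler}\n{question}", pairs[query_key]


def retrieval_accuracy(model_fn, num_trials, num_pairs, filler_words):
    """Fraction of trials in which the model's answer contains the right value."""
    correct = 0
    for trial in range(num_trials):
        prompt, expected = build_retrieval_prompt(num_pairs, filler_words, seed=trial)
        if expected in model_fn(prompt):
            correct += 1
    return correct / num_trials
```

Increasing `filler_words` until the prompt fills the context window probes whether information placed at the very start is still retrievable at the end.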
Human preference of the model outputs provides an important indication of quality that complements automated evaluations. We have evaluated the Gemini models in side-by-side blind evaluations where human raters judge responses of two models to the same prompt. We instruction tune (Ouyang et al., 2022) the pretrained model using techniques discussed in section 6.4.2. The instruction-tuned version of the model is evaluated on a range of specific capabilities, such as following instructions, creative writing, multimodal understanding, long-context understanding, and safety. These capabilities encompass a range of use cases inspired by current user needs and research-inspired potential future use cases.
Instruction-tuned Gemini Pro models provide a large improvement across a range of capabilities. In side-by-side comparisons, the Gemini Pro model is preferred over the PaLM 2 model API 65.0% of the time for creative writing, 59.2% of the time for following instructions, and 68.5% of the time for safer responses, as shown in Table 6. These improvements directly translate into a more helpful and safer user experience.
Category | Win-rate | 95% Confidence Interval |
---|---|---|
Creativity | 65.0% | [62.9%, 67.1%] |
Instruction Following | 59.2% | [57.6%, 60.8%] |
Safety | 68.5% | [66.0%, 70.8%] |
Table 6. Win rate of Gemini Pro over PaLM 2 (text-bison@001) with 95% confidence intervals.
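To make the relationship between a win rate and its confidence interval concrete, the sketch below computes a normal-approximation (Wald) interval for a proportion. The report does not state how the intervals in Table 6 were obtained or how many comparisons were rated, so both the method and the counts used here are assumptions chosen only for illustration.

```python
import math


def win_rate_with_ci(wins, total, z=1.96):
    """Win rate with a normal-approximation (Wald) confidence interval.

    z=1.96 corresponds to a two-sided 95% interval. Returns (rate, low, high).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts: 1300 wins out of 2000 side-by-side comparisons.
rate, low, high = win_rate_with_ci(wins=1300, total=2000)
print(f"{rate:.1%} [{low:.1%}, {high:.1%}]")
```

The interval narrows in proportion to the square root of the number of rated comparisons, which is why a few thousand ratings are enough for intervals about two percentage points wide on either side.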
Gemini can also be combined with additional techniques such as search and tool-use to create powerful reasoning systems that can tackle more complex multi-step problems. One example of such a system is AlphaCode 2, a new state-of-the-art agent that excels at solving competitive programming problems (Leblond et al., 2023). AlphaCode 2 uses a specialized version of Gemini Pro, tuned on competitive programming data similar to the data used in Li et al. (2022), to conduct a massive search over the space of possible programs. This is followed by a tailored filtering, clustering and reranking mechanism. Gemini Pro is fine-tuned both to be a coding model to generate proposal solution candidates, and to be a reward model that is leveraged to recognize and extract the most promising code candidates.
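The filter, cluster and rerank stages described above can be illustrated with a toy sketch. This is not the AlphaCode 2 implementation: candidate programs are represented by plain Python callables, `score_fn` is a hypothetical stand-in for the reward model, and ordering clusters by size before score is an assumption made for this example.

```python
from collections import defaultdict

_CRASHED = object()  # sentinel returned when a candidate raises


def _run(program, x):
    """Run one candidate on one input, mapping any exception to the sentinel."""
    try:
        return program(x)
    except Exception:
        return _CRASHED


def select_candidates(candidates, public_tests, probe_inputs, score_fn, top_k=1):
    """Filter, cluster, and rerank candidate programs (toy version).

    candidates:   callables standing in for generated programs
    public_tests: (input, expected_output) pairs used for filtering
    probe_inputs: extra inputs used only to group programs by behaviour
    score_fn:     callable(program) -> float, a stand-in for a reward model
    """
    # 1. Filtering: keep programs that pass every public test.
    passing = [c for c in candidates
               if all(_run(c, x) == y for x, y in public_tests)]

    # 2. Clustering: identical outputs on the probe inputs means same cluster.
    clusters = defaultdict(list)
    for c in passing:
        signature = tuple(repr(_run(c, x)) for x in probe_inputs)
        clusters[signature].append(c)

    # 3. Reranking: one representative per cluster, larger clusters first,
    #    ties broken by the representative's score.
    ranked = []
    for members in clusters.values():
        best = max(members, key=score_fn)
        ranked.append((len(members), score_fn(best), best))
    ranked.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [best for _, _, best in ranked[:top_k]]
```

Clustering by observed behaviour keeps the final submissions diverse: two programs that agree on every probe input are unlikely to differ on the hidden tests, so submitting both would waste an attempt.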
AlphaCode 2 is evaluated on Codeforces,5 the same platform as AlphaCode, on 12 contests from division 1 and 2, for a total of 77 problems. AlphaCode 2 solved 43% of these competition problems, a 1.7x improvement over the prior record-setting AlphaCode system which solved 25%. Mapping this to competition rankings, AlphaCode 2 built on top of Gemini Pro sits at an estimated 85th percentile on average – i.e. it performs better than 85% of entrants. This is a significant advance over AlphaCode, which only outperformed 50% of competitors.
The composition of powerful pretrained models with search and reasoning mechanisms is an exciting direction towards more general agents; another key ingredient is deep understanding across a range of modalities which we discuss in the next section.
Gemini models are natively multimodal. These models exhibit the unique ability to seamlessly combine their capabilities across modalities (e.g. extracting information and spatial layout out of a table, a chart, or a figure) with the strong reasoning capabilities of a language model (e.g. its state-of-art-performance in math and coding) as seen in examples in Figures 5 and 12. The models also show strong performance in discerning fine-grained details in inputs, aggregating context across space and time, and applying these capabilities over a temporally-related sequence of video frames and/or audio inputs.
The sections below provide more detailed evaluation of the model across different modalities (image, video, and audio), together with qualitative examples of the model’s capabilities for image generation and the ability to combine information across different modalities.
We evaluate the model on four different capabilities: high-level object recognition using captioning or question-answering tasks such as VQAv2; fine-grained transcription using tasks such as TextVQA and DocVQA requiring the model to recognize low-level details; chart understanding requiring spatial understanding of input layout using ChartQA and InfographicVQA tasks; and multimodal reasoning using tasks such as Ai2D, MathVista and MMMU. For zero-shot QA evaluation, the model is instructed to provide short answers aligned with the specific benchmark. All numbers are obtained using greedy sampling and without any use of external OCR tools.
MMMU (val) - Multi-discipline College-Level Problems
Model | Performance Metric |
---|---|
Gemini Ultra (pixel only) | 59.4% pass@1, 62.4% Maj1@32 |
Gemini Pro (pixel only) | 47.9% |
Gemini Nano 2 (pixel only) | 32.6% |
Gemini Nano 1 (pixel only) | 26.3% |
GPT-4V | 56.8% |
Prior SOTA | 56.8% (GPT-4V, 0-shot) |
TextVQA (val) - Text Reading on Natural Images
Model | Performance Metric |
---|---|
Gemini Ultra (pixel only) | 82.3% |
Gemini Pro (pixel only) | 74.6% |
Gemini Nano 2 (pixel only) | 65.9% |
Gemini Nano 1 (pixel only) | 62.5% |
GPT-4V | 78.0% |
Prior SOTA | 79.5% (Google PaLI-3, fine-tuned) |
Table 7. Image understanding. Gemini Ultra consistently outperforms existing approaches even in zero-shot, especially for OCR-related image understanding tasks for natural images, text, documents, and figures without using any external OCR engine (‘pixel only’). Many existing approaches fine-tune on the respective tasks, highlighted in gray, which makes the comparison with 0-shot not apples-to-apples.
We find that Gemini Ultra is state of the art across a wide range of image-understanding benchmarks in Table 7. It achieves strong performance across a diverse set of tasks such as answering questions on natural images and scanned documents as well as understanding infographics, charts and science diagrams. When compared against publicly reported results from other models (most notably GPT-4V), Gemini is better in zero-shot evaluation by a significant margin. It also exceeds several existing models that are specifically fine-tuned on the benchmark’s training sets for the majority of tasks. The capabilities of the Gemini models lead to significant improvements in the state of the art on academic benchmarks like MathVista (+3.1%)6 or InfographicVQA (+5.2%).
MMMU (Yue et al., 2023) is a recently released evaluation benchmark consisting of questions about images across 6 disciplines, with multiple subjects within each discipline, that require college-level knowledge to solve. Gemini Ultra achieves the best score on this benchmark, advancing the state-of-the-art result by more than 5 percentage points, and outperforms the previous best result in 5 of 6 disciplines (see Table 8), thus showcasing its multimodal reasoning capabilities.
MMMU (val) | Gemini Ultra (0-shot), Maj@32 | Gemini Ultra (0-shot), pass@1 | GPT-4V (0-shot), pass@1 |
---|---|---|---|
Art & Design | 74.2 | 70.0 | 65.8 |
Business | 62.7 | 56.7 | 59.3 |
Science | 49.3 | 48.0 | 54.7 |
Health & Medicine | 71.3 | 67.3 | 64.7 |
Humanities & Social Science | 78.3 | 78.3 | 72.5 |
Technology & Engineering | 53.0 | 47.1 | 36.7 |
Overall | 62.4 | 59.4 | 56.8 |
Table 8. Gemini Ultra performance on the MMMU benchmark (Yue et al., 2023) per discipline. Each discipline covers multiple subjects, requiring college-level knowledge and complex reasoning.
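Table 8 reports two columns for Gemini Ultra, Maj@32 and pass@1. The sketch below shows one common reading of these metrics: pass@1 scores a single sampled answer per question, while Maj@k scores the most frequent answer among k samples. That reading, the function names, and the exact-match comparison are assumptions for illustration, not a description of the report's evaluation code.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def maj_at_k_accuracy(sampled_answers, references):
    """Accuracy when each question is answered by majority vote over its samples."""
    hits = sum(majority_vote(s) == r for s, r in zip(sampled_answers, references))
    return hits / len(references)


def pass_at_1_accuracy(single_answers, references):
    """Accuracy when each question gets exactly one sampled answer."""
    hits = sum(a == r for a, r in zip(single_answers, references))
    return hits / len(references)
```

Majority voting helps when the correct answer is the single most likely output even if it is sampled less than half the time, which is consistent with the Maj@32 column being at or above the pass@1 column in every row of Table 8.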
Gemini models are also capable of operating across modalities and a diverse set of global languages simultaneously, both for image understanding tasks (e.g., images containing text in Icelandic) and for generation tasks (e.g., generating image descriptions for a wide range of languages). We evaluate the performance of generating image descriptions on a selected subset of languages in the Crossmodal3600 (XM-3600) benchmark in a 4-shot setting, using the Flamingo evaluation protocol (Alayrac et al., 2022), without any fine-tuning for all models. As shown in Table 9, Gemini models achieve a significant improvement over the existing best model, Google PaLI-X.
XM-3600 (CIDER) Performance Metrics by Language
Model | English | French | Hindi | Modern Hebrew | Romanian | Thai | Chinese | Average (of 7) |
---|---|---|---|---|---|---|---|---|
Gemini Ultra 4-shot | 86.4 | 77.9 | 31.1 | 54.5 | 39.0 | 86.7 | 33.3 | 58.4 |
Gemini Pro 4-shot | 87.1 | 76.7 | 29.8 | 52.6 | 37.7 | 77.0 | 30.2 | 55.9 |
Google PaLI-X 4-shot | 77.8 | 62.5 | 22.2 | 38.7 | 30.2 | 56.0 | 27.7 | 45.0 |
Table 9. Multilingual image understanding. Gemini models outperform existing models in captioning images in many languages when benchmarked on a subset of languages in XM-3600 dataset (Thapliyal et al., 2022).
6. MathVista is a comprehensive mathematical reasoning benchmark consisting of 28 previously published multimodal datasets and three newly created datasets. Our MathVista results were obtained by running the MathVista authors’ evaluation script.
Figure 5. Gemini’s multimodal reasoning capabilities to generate matplotlib code for rearranging the subplots. The multimodal prompt is shown at the top-left in gray. Gemini Ultra’s response, including its generated code, is shown in the right column in blue. The bottom left figure shows the rendered version of the generated code. Successfully solving this task shows the model’s capability to combine several capabilities: (1) recognition of the functions depicted in the plots; (2) inverse graphics to infer the code that would have generated the subplots; (3) instruction-following to put subplots in their desired positions; and (4) abstract reasoning to infer that the exponential plot must stay in its original place, because the sine plot must move out of the way for the 3-dimensional plot.
Qualitative evaluation in Figure 5 illustrates an example of Gemini Ultra’s multimodal reasoning capabilities. The model is required to solve the task of generating matplotlib code that would rearrange a set of subplots provided by the user. The model output shows that it successfully solves this task by combining multiple capabilities: understanding the user plot, inferring the code required to generate it, following user instructions to put subplots in their desired positions, and abstract reasoning about the output plot. This highlights Gemini Ultra’s native multimodality and alludes to its more complex reasoning abilities across interleaved sequences of image and text. We refer the reader to the appendix for more qualitative examples.
Understanding video input is an important step towards a useful generalist agent. We measure the video understanding capability across several established benchmarks that are held-out from training. These tasks measure whether the model is able to understand and reason over a temporally-related sequence of frames. For each video task, we sample 16 equally-spaced frames from each video clip and feed them to the Gemini models. For the YouTube video datasets (all datasets except NextQA and the Perception test), we evaluate the Gemini models on videos that were still publicly available in the month of November, 2023.
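The frame sampling step can be sketched as below. The report states only that 16 equally-spaced frames are taken from each clip; taking the centre of each of 16 equal-length segments, and returning every frame for clips shorter than 16 frames, are assumptions of this sketch, as is the function name.

```python
def equally_spaced_frame_indices(num_frames, num_samples=16):
    """Pick num_samples frame indices spread evenly over a clip.

    Each index is the centre of one of num_samples equal-length segments.
    If the clip has no more than num_samples frames, every frame is returned.
    """
    if num_frames <= 0:
        raise ValueError("clip must contain at least one frame")
    if num_frames <= num_samples:
        return list(range(num_frames))
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]
```

For a 160-frame clip this selects frames 5, 15, and so on up to 155. Sampling segment centres rather than endpoints avoids always picking the first and last frames, which are often title cards or fades.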
Gemini Ultra achieves state-of-the-art results on various few-shot video captioning tasks as well as zero-shot video question answering tasks as shown in Table 10. This demonstrates its capability of strong temporal reasoning across several frames. Figure 21 in the appendix provides a qualitative example of understanding the video of the ball-striking mechanics of a soccer player and reasoning about how the player can improve their game.
Task | Gemini Ultra | Gemini Pro | Few-shot SoTA |
---|---|---|---|
VATEX (test) - English video captioning | 62.7 (4-shots) | 57.4 (4-shots) | 56.0 (DeepMind Flamingo, 4-shots) |
VATEX ZH (test) - Chinese video captioning | 51.3 (4-shots) | 50.0 (4-shots) | – |
YouCook2 (val) - English cooking video captioning | 135.4 (4-shots) | 123.2 (4-shots) | 74.5 (DeepMind Flamingo, 4-shots) |
NextQA (test) - Video question answering | 29.9 (0-shot) | 28.0 (0-shot) | 26.7 (DeepMind Flamingo, 0-shot) |
ActivityNet-QA (test) - Video question answering | 52.2 (0-shot) | 49.8 (0-shot) | 45.3 (Video-LLAVA, 0-shot) |
Perception Test MCQA (test) - Video question answering | 54.7 (0-shot) | 51.1 (0-shot) | 46.3 (SeViLA, 0-shot) |
Table 10. Few-shot video understanding across tasks and languages on selected academic benchmarks. The reported metric is CIDER for video captioning, WUPS for NextQA, and top-1 accuracy for the Perception Test and ActivityNet-QA. For ActivityNet-QA, we use the Video-LLAVA (Lin et al., 2023) evaluation protocol.
Gemini is able to output images natively, without having to rely on an intermediate natural language description that can bottleneck the model’s ability to express images. This uniquely enables the model to generate images with prompts using interleaved sequences of image and text in a few-shot setting. For example, the user might prompt the model to design suggestions of images and text for a blog post or a website (see Figure 10 in the appendix).
Figure 6 shows an example of image generation in a 1-shot setting. The Gemini Ultra model is prompted with one example of interleaved image and text where the user provides two colors (blue and yellow) and image suggestions of creating a cute blue cat or a blue dog with a yellow ear from yarn. The model is then given two new colors (pink and green) and asked for two ideas about what to create using these colors. The model successfully generates an interleaved sequence of images and text with suggestions to create a cute green avocado with a pink seed or a green bunny with pink ears from yarn.
Figure 6. Image Generation. Gemini can output multiple images interleaved with text given a prompt composed of image and text. In the left figure, Gemini Ultra is prompted in a 1-shot setting with a user example of generating suggestions of creating cat and dog from yarn when given two colors, blue and yellow. Then, the model is prompted to generate creative suggestions with two new colors, pink and green, and it generates images of creative suggestions to make a cute green avocado with pink seed or a green bunny with pink ears from yarn as shown in the right figure.
We evaluate the Gemini Nano-1 and Gemini Pro models on a variety of public benchmarks and compare them with Universal Speech Model (USM) (Zhang et al., 2023) and Whisper (large-v2 (Radford et al., 2023) or large-v3 (OpenAI, 2023) as indicated). These benchmarks include automatic speech recognition (ASR) tasks such as FLEURS (Conneau et al., 2023), VoxPopuli (Wang et al., 2021), and Multi-lingual Librispeech (Pratap et al., 2020), as well as the speech translation task CoVoST 2, translating different languages into English (Wang et al., 2020). We also report on an internal benchmark YouTube test set. ASR tasks report a word error rate (WER) metric, where a lower number is better. Translation tasks report a BiLingual Evaluation Understudy (BLEU) score, where a higher number is better. FLEURS is reported on 62 languages that have language overlap with the training data. Four segmented languages (Mandarin, Japanese, Korean and Thai) report character error rate (CER), instead of WER, similar to Whisper (Radford et al., 2023).
Table 11 indicates that our Gemini Pro model significantly outperforms the USM and Whisper models across all ASR and AST tasks, both for English and multilingual test sets. Note that there is a large gain in FLEURS, compared to USM and Whisper, as our model is also trained with the FLEURS training dataset. However, training the same model without the FLEURS dataset results in a WER of 15.8, which still outperforms Whisper. The Gemini Nano-1 model also outperforms both USM and Whisper on all datasets except FLEURS. Note that we did not evaluate Gemini Ultra on audio yet, though we expect better performance from increased model scale.
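For reference, the word and character error rates used above can be computed from an edit distance, as in the following sketch. It scores a single utterance and assumes whitespace tokenization with no text normalization; benchmark results are normally aggregated over a whole test set after normalization, so this is an illustration of the metric rather than the scoring pipeline used for Table 11. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """WER (unit="word") or CER (unit="char") for one utterance."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Character-level scoring is the natural choice for languages written without spaces between words, since word boundaries there depend on the segmenter rather than on the recognizer.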
Performance Metrics in Automatic Speech Recognition and Translation Tasks
Task | Metric | Gemini Pro | Gemini Nano-1 | Whisper (OpenAI, 2023; Radford et al., 2023) | USM (Zhang et al., 2023) |
---|---|---|---|---|---|
Automatic Speech Recognition | | | | | |
YouTube (en-us) | WER (↓) | 4.9% | 5.5% | 6.5% (v3) | 6.2% |
Multilingual Librispeech (en-us) | WER (↓) | 4.8% | 5.9% | 6.2% (v2) | 7.0% |
FLEURS (62 lang) | WER (↓) | 7.6% | 14.2% | 17.6% (v3) | 11.8% |
VoxPopuli (14 lang) | WER (↓) | 9.1% | 9.5% | 15.9% (v2) | 13.4% |
Automatic Speech Translation | | | | | |
CoVoST 2 (21 lang) | BLEU (↑) | 40.1 | 35.4 | 29.1 (v2) | - |
Table 11. Speech evaluation results on selected benchmarks for ASR and AST. For ASR, the reported metric is WER where lower is better. For AST, the reported metric is BLEU where higher is better.
Table 12 shows further error analysis with USM and Gemini Pro. We find that Gemini Pro produces more understandable responses, particularly on rare words and proper nouns.
Table 13. Audio-visual qualitative example showcasing the ability of Gemini models to process interleaved sequences of text, vision, and audio, as well as reason across modalities. This example inputs interleaved images and audio from the user in a cooking scenario. The user prompts the model for instructions to make an omelet and to inspect whether it is fully cooked.
During the development of the Gemini models, we follow a structured approach to responsible deployment in order to identify, measure, and manage foreseeable downstream societal impacts of our models, in line with previous releases of Google’s AI technology (Kavukcuoglu et al., 2022). Throughout the lifecycle of the project, we follow the structure below. This section outlines our broad approach and key findings through this process. We will share more details on this in an upcoming report.
We develop model impact assessments to identify, assess, and document key downstream societal benefits and harms associated with the development of advanced Gemini models. These are informed by prior academic literature on language model risks (Weidinger et al., 2021), findings from similar prior exercises conducted across the industry (Anil et al., 2023; Anthropic, 2023; OpenAI, 2023a), ongoing engagement with experts internally and externally, and unstructured attempts to discover new model vulnerabilities. Areas of focus include: factuality, child safety, harmful content, cybersecurity, biorisk, representation and inclusivity. These assessments are updated in tandem with model development.
Impact assessments are used to guide mitigation and product delivery efforts, and inform deployment decisions. Gemini impact assessments spanned across different capabilities of Gemini models, assessing the potential consequences of these capabilities with Google’s AI Principles (Google, 2023).
Building upon this understanding of known and anticipated effects, we developed a set of “model policies” to steer model development and evaluations. Model policy definitions act as standardized criteria and a prioritization schema for responsible development and as an indication of launch-readiness. Gemini model policies cover a number of domains including: child safety, hate speech, factual accuracy, fairness and inclusion, and harassment.
To assess the Gemini models against policy areas and other key risk areas identified within impact assessments, we developed a suite of evaluations across the lifecycle of model development.
Development evaluations are conducted for the purpose of ‘hill-climbing’ throughout training and fine-tuning Gemini models. These evaluations are designed by the Gemini team, or are assessments against external academic benchmarks. Evaluations consider issues such as helpfulness (instruction following and creativity), safety and factuality. See section 5.1.6 and the next section on mitigations for a sample of results.
Assurance evaluations are conducted for the purpose of governance and review, usually at the end of key milestones or training runs by a group outside of the model development team. Assurance evaluations are standardized by modality and datasets are strictly held-out. Only high-level insights are fed back into the training process to assist with mitigation efforts. Assurance evaluations include testing across Gemini policies, and include ongoing testing for dangerous capabilities such as potential biohazards, persuasion, and cybersecurity (Shevlane et al., 2023).
External evaluations are conducted by partners outside of Google to identify blindspots. External groups stress-test our models across a range of issues, including areas listed in the White House Commitments,7 and tests are conducted through a mixture of structured evaluations and unstructured red teaming. The design of these evaluations is independent and results are reported periodically to the Google DeepMind team.
In addition to this suite of external evaluations, specialist internal teams conduct ongoing red teaming of our models across areas such as the Gemini policies and security. These activities include less structured processes involving sophisticated adversarial attacks to identify new vulnerabilities. Discovery of potential weaknesses can then be used to mitigate risks and improve evaluation approaches internally. We are committed to ongoing model transparency and plan to share additional results from across our evaluation suite over time.
Mitigations are developed in response to the outcomes of the assessment, policy, and evaluation approaches described above. Evaluations and mitigations are used in an iterative way, with evaluations being re-run following mitigation efforts. We discuss our efforts on mitigating model harms across data, instruction-tuning, and factuality below.
Prior to training, we take various steps to mitigate potential downstream harms at the data curation and data collection stage. As discussed in the section on “Training Data”, we filter training data for high-risk content and to ensure all training data is sufficiently high quality. Beyond filtering, we also take steps to ensure all data collected meets Google DeepMind’s best practices on data enrichment,8 developed based on the Partnership on AI’s “Responsible Sourcing of Data Enrichment Services”9. This includes ensuring all data enrichment workers are paid at least a local living wage.
7. https://whitehouse.gov/wp-content/uploads/2023/07/Ensuring-Safe-Secure-and-Trustworthy-AI.pdf
8. https://deepmind.google/discover/blog/best-practices-for-data-enrichment/
9. https://partnershiponai.org/responsible-sourcing-considerations/
Instruction tuning encompasses supervised fine tuning (SFT) and reinforcement learning through human feedback (RLHF) using a reward model. We apply instruction tuning in both text and multimodal settings. Instruction tuning recipes are carefully designed to balance the increase in helpfulness with decrease in model harms related to safety and hallucinations (Bai et al., 2022a).
Curation of “quality” data is critical for SFT, reward model training, and RLHF. The data mixture ratios are ablated with smaller models to balance the metrics on helpfulness (such as instruction following, creativity) and reduction of model harms, and these results generalize well to larger models. We have also observed that data quality is more important than quantity (Touvron et al., 2023b; Zhou et al., 2023), especially for larger models. Similarly, for reward model training, we find it critical to balance the dataset with examples where the model prefers to say, “I cannot help with that,” for safety reasons and examples where the model outputs helpful responses. We use multi-objective optimization with a weighted sum of reward scores from helpfulness, factuality, and safety, to train a multi-headed reward model.
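The weighted-sum combination of reward scores mentioned above can be written down in a few lines. The sketch below is illustrative only: the weight values, the score ranges, and the function name are hypothetical, and the report does not describe how the weights were chosen or how the reward model heads are trained.

```python
def combined_reward(scores, weights):
    """Weighted sum of per-objective reward scores.

    scores and weights are dicts keyed by objective name; every weighted
    objective must have a score.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical weights and per-head scores, for illustration only.
weights = {"helpfulness": 0.5, "factuality": 0.25, "safety": 0.25}
scores = {"helpfulness": 0.8, "factuality": 0.6, "safety": 1.0}
reward = combined_reward(scores, weights)
```

Collapsing several objectives into one scalar makes the trade-off explicit: raising the safety weight relative to helpfulness shifts the optimum toward more cautious responses, which is the balance the data mixture ablations are meant to tune.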
We further elaborate our approach to mitigate risks of harmful text generation. We enumerate approximately 20 harm types (e.g. hate speech, providing medical advice, suggesting dangerous behavior) across a wide variety of use cases. We generate a dataset of potential harm-inducing queries in these categories, either manually by policy experts and ML engineers, or via prompting high capability language models with topical keywords as seeds.
Given the harm-inducing queries, we probe our Gemini models and analyze the model responses via side-by-side evaluation. As discussed above, we balance the objective of model output response being harmless versus being helpful. From the detected risk areas, we create additional supervised fine-tuning data to demonstrate the desirable responses. To generate such responses at scale, we heavily rely on a custom data generation recipe loosely inspired by Constitutional AI (Bai et al., 2022b), where we inject variants of Google’s content policy language as “constitutions”, and utilize language models’ strong zero-shot reasoning abilities (Kojima et al., 2022) to revise responses and choose between multiple response candidates. We have found this recipe to be effective: for example, in Gemini Pro, this overall recipe was able to mitigate a majority of our identified text harm cases, without any perceptible decrease in response helpfulness.
It is important that our models generate responses that are factual in a variety of scenarios, and to reduce the frequency of hallucinations. We focused instruction tuning efforts on three key desired behaviors, reflecting real-world scenarios:
We provide three key results on respective challenge sets below.
Model | Condition | Inaccurate Rate (Factuality Set) | AIS (Attribution Set) | Accuracy (Hedging Set) |
---|---|---|---|---|
Gemini Pro | No factuality-focused adaptation | 7.9% [7%, 9%] | 40.2% [37.9%, 42.4%] | 0% |
Gemini Pro | Final stage of instruction tuning | 3.4% [2.8%, 4.1%] | 59.7% [57.2%, 61.9%] | 69.30% |
Table 14. Factuality mitigations: Impact of instruction-tuning on the rate of inaccuracy, presence of attribution and the rate of accurate hedging (with corresponding 95% confidence intervals).
Following the completion of reviews, model cards (Mitchell et al., 2019) for each approved Gemini model are created for structured and consistent internal documentation of critical performance and responsibility metrics as well as to inform appropriate external communication of these metrics over time.
Across the responsible development process, we undertake ethics and safety reviews with Google DeepMind’s Responsibility and Safety Council (RSC),10 an interdisciplinary group which evaluates Google DeepMind’s projects, papers and collaborations against Google’s AI Principles. The RSC provides input and feedback on impact assessments, policies, evaluations and mitigation efforts. During the Gemini project, the RSC set specific evaluation targets across key policy domains (e.g. child safety).
We have presented Gemini, a new family of models that advance multimodal model capabilities in text, code, image, audio, and video. This technical report evaluates the capabilities of Gemini on a diverse set of widely-studied benchmarks, and our most capable model Gemini Ultra makes significant advances across the board. In the natural language domain, the performance gains from careful developments in data and model training at scale continue to deliver quality improvements, setting new state of the art in several benchmarks. In particular, Gemini Ultra surpasses human-expert performance on the exam benchmark MMLU, scoring 90.0%, which has been a de facto measure of progress for LLMs ever since it was first released in 2020. In the multimodal domain, Gemini Ultra sets new state of the art on most of the image understanding, video understanding, and audio understanding benchmarks without task-specific modifications or tuning. In particular, Gemini Ultra’s multimodal reasoning capabilities are evident from its state-of-the-art performance on the recent MMMU benchmark (Yue et al., 2023), which comprises questions about images requiring college-level subject knowledge and deliberate reasoning.
Beyond the state-of-art results on benchmarks, what we are most excited about is the new use cases enabled by Gemini models. The new capabilities of Gemini models to parse complex images, such as charts or infographics, reason over interleaved sequences of images, audio, and text, and generate interleaved text and images as responses open a wide variety of new applications. As shown in figures throughout the report and appendix, Gemini can enable new approaches in areas like education, everyday problem solving, multilingual communication, information summarization, extraction, and creativity. We expect that the users of these models will find all kinds of beneficial new uses that we have only scratched the surface of in our own investigations.
Despite their impressive capabilities, we should note that there are limitations to the use of LLMs. There is a continued need for ongoing research and development on “hallucinations” generated by LLMs to ensure that model outputs are more reliable and verifiable. LLMs also struggle with tasks requiring high-level reasoning abilities like causal understanding, logical deduction, and counterfactual reasoning even though they achieve impressive performance on exam benchmarks. This underscores the need for more challenging and robust evaluations to measure their true understanding as the current state-of-the-art LLMs saturate many benchmarks.
Gemini is a further step towards our mission to solve intelligence, advance science and benefit humanity, and we are enthusiastic to see how these models are used by our colleagues at Google and beyond. We build on many innovations in machine learning, data, infrastructure, and responsible development – areas that we have been pursuing at Google for over a decade. The models we present in this report provide a strong foundation towards our broader future goal to develop a large-scale, modularized system that will have broad generalization capabilities across many modalities.