
Model | DeepSeek Coder

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-06-28

DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence

  • url: https://arxiv.org/abs/2401.14196
  • pdf: https://arxiv.org/pdf/2401.14196
  • abstract: The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.


TL;DR


  • Large-scale open-source code models developed through comprehensive data collection and advanced training strategies
  • Improved performance validated with the FIM technique and a broad set of benchmarks
  • Additional pre-training carried out for continued performance improvement

1. Introduction

The advancement of large language models has opened a new era of code intelligence and brought transformative change to software development. These models have the potential to automate and streamline a wide range of coding tasks, from bug detection to code generation. However, the performance gap between open-source and closed-source models remains a major challenge. In response, the DeepSeek-Coder series was developed. The series includes models ranging in size from 1.3B to 33B parameters, trained on 2 trillion tokens drawn from 87 programming languages. To strengthen the models' code completion ability, the Fill-In-the-Middle (FIM) approach was incorporated, and extensive experiments were conducted on a variety of public code-related benchmarks.


2. Data Collection

2.1 GitHub Data Crawling and Filtering

Data is collected from public GitHub repositories and filtered according to a set of rules to control data quality. Applying these filter rules reduces the data volume to 32.8% of its original size.

2.2 Dependency Parsing

Identifying the dependencies between files and ordering the files accordingly is essential for handling project-level code scenarios effectively. The dependencies between files are analyzed using a topological sort; the algorithm used for dependency analysis is shown below.

procedure TOPOLOGICAL_SORT(files)
    graphs ← {}
    inDegree ← {}
    for each file in files do
        graphs[file] ← []
        inDegree[file] ← 0
    end for
    for each (file, dependency) in dependencies do
        graphs[dependency].append(file)
        inDegree[file] ← inDegree[file] + 1
    end for
    queue ← [file for file in files if inDegree[file] == 0]
    allResults ← []
    while queue is not empty do
        current ← queue.pop(0)
        allResults.append(current)
        for each node in graphs[current] do
            inDegree[node] ← inDegree[node] - 1
            if inDegree[node] == 0 then
                queue.append(node)
        end for
    end while
    return allResults
end procedure

This procedure builds a graph in which each file is a node and each dependency is a directed edge. It computes the number of dependencies (in-degree) for each file, processes the files whose in-degree is zero first, and appends them to the result list. Whenever a file is processed, the in-degrees of the files that depend on it are decremented, and any file whose in-degree reaches zero is added to the processing queue; this repeats until every file has been processed.

2.3 Repository-Level Deduplication

Removing duplicates from the training dataset is an important factor in improving performance. Deduplicating at the repository level preserves the structural integrity of the data.

2.4 Quality Screening and Decontamination

To raise code quality, a compiler and a quality model are used to filter out additional low-quality data. An n-gram filtering process is implemented to prevent contamination from the test sets.


3. Training Policy

3.1 Training Strategy

Next-token prediction and FIM are the main training objectives. The FIM method plays an important role in the code pre-training scenario and, through its two modes, PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle), improves the model's ability to handle different structural arrangements of text.

3.2 Tokenizer

A BPE (Byte Pair Encoding) tokenizer is trained to fit the training dataset, optimizing how the model processes its input.

3.3 Model Architecture

Models with a range of parameter sizes are developed to meet the needs of a wide variety of applications. All models are decoder-only Transformers and incorporate RoPE (Rotary Position Embedding).

3.4 Optimization

The models are trained with the AdamW optimizer. Batch sizes and learning rates follow the scaling laws of DeepSeek LLM.

3.5 Environments

The models are trained on NVIDIA A100 and H800 GPUs. This setup provides a powerful infrastructure for carrying out complex computations efficiently.

3.6 Long Context

The RoPE parameters are reconfigured so that the model can handle longer contexts, which is especially useful in project-level code processing scenarios.

3.7 Instruction Tuning

DeepSeek-Coder-Instruct is developed by fine-tuning on high-quality instruction data, maximizing the model's performance in multi-turn dialogue scenarios.


4. Experimental Results

DeepSeek-Coder delivers top performance across a variety of benchmarks, with improved results in code generation, FIM code completion, cross-file code completion, and program-based math reasoning. These results demonstrate that the DeepSeek-Coder series can effectively solve real programming problems.

5. Continued Pre-Training from a General LLM

Additional pre-training is performed to improve the DeepSeek-Coder model's natural language understanding and mathematical reasoning abilities, contributing to further performance gains.

6. Conclusion

The DeepSeek-Coder series consists of LLMs designed specifically for code intelligence and shows strong performance on a variety of standard tests. The plan is to continue improving this performance by developing new versions of the model with additional natural language processing capabilities, which is important for enabling code-centric LLMs to interpret and execute natural language instructions effectively.


1 Introduction

The field of software development has been significantly transformed by the swift advancement of large language models (OpenAI, 2023; Touvron et al., 2023), which have brought about a new era of code intelligence. These models have the potential to automate and streamline many aspects of coding, from bug detection to code generation, thereby enhancing productivity and reducing the likelihood of human error. However, a major challenge in this field is the performance gap between open-source models (Li et al., 2023; Nijkamp et al., 2022; Roziere et al., 2023; Wang et al., 2021) and closed-source models (Gemini Team, 2023; OpenAI, 2023). The giant closed-source models, while powerful, are often inaccessible to many researchers and developers due to their proprietary nature.

In response to this challenge, we present the DeepSeek-Coder series. This series comprises a range of open-source code models, varying in size from 1.3B to 33B, including the base version and instructed version for each size. Each model in the series has been trained from scratch on 2 trillion tokens sourced from 87 programming languages, ensuring a comprehensive understanding of coding languages and syntax. Besides, we attempt to organize the pretraining data at the repository level to enhance the pre-trained model’s understanding capability within the context of cross-files within a repository. In addition to employing the next token prediction loss during pre-training, we have also incorporated the Fill-In-Middle (FIM) approach (Bavarian et al., 2022; Li et al., 2023). This approach is designed to further bolster the model’s code completion capabilities. To meet the requirements of handling longer code inputs, we have extended the context length to 16K. This adjustment allows our models to handle more complex and extensive coding tasks, thereby increasing their versatility and applicability in various coding scenarios.

We have carried out comprehensive experiments using a variety of public code-related benchmarks. The findings reveal that among open-source models, DeepSeek-Coder-Base 33B consistently delivers superior performance across all benchmarks. Furthermore, DeepSeek-Coder-Instruct 33B surpasses OpenAI GPT-3.5 Turbo in the majority of the evaluation benchmarks, significantly narrowing the performance gap between OpenAI GPT-4 and open-source models. Remarkably, despite having fewer parameters, DeepSeek-Coder-Base 7B demonstrates competitive performance when compared to models that are five times larger, such as CodeLlama-33B (Roziere et al., 2023). To summarize, our main contributions are:

• We introduce DeepSeek-Coder-Base and DeepSeek-Coder-Instruct, our advanced code-focused large language models (LLMs). Developed through extensive training on an expansive code corpus, these models exhibit proficiency in understanding 87 programming languages. Additionally, they are available in various model scales to cater to a wide range of computational and application needs.

• We make the first attempt to incorporate repository-level data construction during the pre-training phase of our models. We find that it can significantly boost the capability of cross-file code generation.

• Our analysis rigorously examines the impact of FIM training strategies on the pretraining phase of code models. The outcomes of these comprehensive studies shed light on intriguing aspects of FIM configurations, offering valuable insights that significantly contribute to the enhancement and development of code pretrained models.

• We conduct extensive evaluations of our code LLMs against a wide array of benchmarks encompassing numerous code-related tasks. The findings demonstrate that DeepSeek-Coder-Base surpasses all existing open-source code LLMs across these benchmarks. Furthermore, with meticulous fine-tuning using instructional data, DeepSeek-Coder-Instruct achieves better performance compared to the OpenAI GPT-3.5 Turbo model in code-related tasks.

2 Data Collection

The training dataset of DeepSeek-Coder is composed of 87% source code, 10% English code-related natural language corpus, and 3% code-unrelated Chinese natural language corpus. The English corpus consists of materials from GitHub's Markdown and StackExchange, which are used to enhance the model's understanding of code-related concepts and improve its ability to handle tasks like library usage and bug fixing. Meanwhile, the Chinese corpus consists of high-quality articles aimed at improving the model's proficiency in understanding the Chinese language. In this section, we will provide an overview of how we construct the code training data. This process involves data crawling, rule-based filtering, dependency parsing, repository-level deduplication, and quality screening, as illustrated in Figure 2. In the following, we will describe the data creation procedure step by step.

Figure 2 The Procedure of Dataset Creation

2.1. GitHub Data Crawling and Filtering

We collect public repositories created before February 2023 on GitHub and retain only 87 programming languages, as listed in Table 1. To reduce the amount of data to be processed, we apply filtering rules similar to those used in the StarCoder project (Li et al., 2023) to preliminarily filter out lower-quality code. By applying these filtering rules, we reduce the total amount of data to only 32.8% of its original size. To make the paper self-contained, we briefly describe the filter rules used in the StarCoder Data project:

Firstly, we filter out files with an average line length exceeding 100 characters or a maximum line length surpassing 1000 characters. Additionally, we remove files with fewer than 25% alphabetic characters. Except for the XSLT programming language, we further filter out files where the string “<?xml version=” appeared in the first 100 characters. For HTML files, we consider the ratio of visible text to HTML code. We retain files where the visible text constitutes at least 20% of the code and is no less than 100 characters. For JSON and YAML files, which typically contain more data, we only keep files that have a character count ranging from 50 to 5000 characters. This effectively removes most data-heavy files.
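
For concreteness, the rules above can be sketched in Python as follows. The function name, the file-extension handling, and the omission of the HTML visible-text ratio check are simplifications of mine; only the numeric thresholds come from the text.

```python
def passes_basic_filters(path: str, text: str) -> bool:
    lines = text.splitlines() or [""]
    avg_len = sum(len(l) for l in lines) / len(lines)
    max_len = max(len(l) for l in lines)
    # Average line length over 100 or maximum line length over 1000 characters.
    if avg_len > 100 or max_len > 1000:
        return False
    # Fewer than 25% alphabetic characters.
    if not text or sum(c.isalpha() for c in text) / len(text) < 0.25:
        return False
    # XML header near the start of the file (XSLT files are exempt).
    if not path.endswith(".xslt") and "<?xml version=" in text[:100]:
        return False
    # JSON/YAML: keep only files with 50 to 5000 characters.
    if path.endswith((".json", ".yaml", ".yml")) and not (50 <= len(text) <= 5000):
        return False
    # (The HTML visible-text ratio check is omitted here for brevity.)
    return True
```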

2.2. Dependency Parsing

In previous works (Chen et al., 2021; Li et al., 2023; Nijkamp et al., 2022; Roziere et al., 2023), large language models for code are mainly pre-trained on file-level source code, which ignores the dependencies between different files in a project. However, in practical applications, such models struggle to effectively scale to handle entire project-level code scenarios. Therefore, we will consider how to leverage the dependencies between files within the same repository in this step. Specifically, we first parse the dependencies between files and then arrange these files in an order that ensures the context each file relies on is placed before that file in the input sequence. By aligning the files in accordance with their dependencies, our dataset more accurately represents real coding practices and structures. This enhanced alignment not only makes our dataset more relevant but also potentially increases the practicality and applicability of the model in handling project-level code scenarios. It’s worth noting that we only consider the invocation relationships between files and use regular expressions to extract them, such as “import” in Python, “using” in C#, and “include” in C.
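
As a rough illustration of this regex-based extraction, a sketch is given below. The patterns are illustrative assumptions (the paper does not publish its exact regular expressions), and resolving a matched module name back to a concrete file path is left out because it is project-specific.

```python
import re

# Illustrative patterns only, keyed by file suffix.
IMPORT_PATTERNS = {
    ".py": re.compile(r"^\s*(?:from\s+([\w.]+)\s+import|import\s+([\w.]+))", re.MULTILINE),
    ".cs": re.compile(r"^\s*using\s+([\w.]+)\s*;", re.MULTILINE),
    ".c":  re.compile(r'^\s*#include\s+"([^"]+)"', re.MULTILINE),
}

def extract_references(path: str, text: str) -> set[str]:
    """Collect the raw module/header names that one source file refers to."""
    for suffix, pattern in IMPORT_PATTERNS.items():
        if path.endswith(suffix):
            return {g for m in pattern.finditer(text) for g in m.groups() if g}
    return set()
```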

Algorithm 1: Topological Sort for Dependency Analysis

1:  procedure TOPOLOGICAL_SORT(files)
2:      graphs ← {}
3:      inDegree ← {}
4:      for each file in files do
5:          graphs[file] ← []
6:          inDegree[file] ← 0
7:      end for
8:      for each fileA in files do
9:          for each fileB in files do
10:             if HAS_DEPENDENCY(fileA, fileB) then
11:                 graphs[fileB].append(fileA)
12:                 inDegree[fileA] ← inDegree[fileA] + 1
13:             end if
14:         end for
15:     end for
16:     subgraphs ← getDisconnectedSubgraphs(graphs)
17:     allResults ← []
18:     for each subgraph in subgraphs do
19:         results ← []
20:         while length(results) ≠ NumberOfNodes(subgraph) do
21:             file ← argmin({inDegree[file] | file ∈ subgraph and file ∉ results})
22:             for each node in graphs[file] do
23:                 inDegree[node] ← inDegree[node] - 1
24:             end for
25:             results.append(file)
26:         end while
27:         allResults.append(results)
28:     end for
29:     return allResults
30: end procedure


Algorithm 1 describes a topological sort for dependency analysis on a list of files within the same project. Initially, it sets up two data structures: an empty adjacency list named “graphs” to represent dependencies between files and an empty dictionary called “inDegree” for storing the in-degrees of each file. The algorithm then iterates over each file pair to identify dependencies, updating “graphs” and “inDegree” accordingly. Next, it identifies any disconnected subgraphs within the overall dependency graph. For each subgraph, the algorithm employs a modified topological sort. Unlike the standard approach that selects nodes with zero in-degrees, this algorithm selects nodes with minimal in-degrees, which allows it to handle cycles within the graph. Selected nodes are added to a “results” list, and the in-degrees of their connected nodes are decreased. This process continues until a topologically sorted sequence is generated for each subgraph. The algorithm concludes by returning a list of these sorted sequences, and each sequence’s files are concatenated to form a single training sample. To incorporate file path information, a comment indicating the file’s path is added at the beginning of each file. This method ensures that the path information is preserved in the training data.
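
A Python rendering of Algorithm 1 is sketched below. Here `has_dependency` is assumed to be the regex-based check from Section 2.2, and the subgraph detection is a plain BFS over an undirected view of the graph; both are simplifications, but the core idea, selecting the remaining node with the minimal (not necessarily zero) in-degree so that cycles cannot block progress, follows the listing above.

```python
from collections import defaultdict

def topological_sort(files, has_dependency):
    # Build the dependency graph: an edge dep -> dependent, exactly as in Algorithm 1.
    graph = defaultdict(list)
    in_degree = {f: 0 for f in files}
    for a in files:
        for b in files:
            if a != b and has_dependency(a, b):   # file a depends on file b
                graph[b].append(a)
                in_degree[a] += 1

    # Split the files into disconnected (weakly connected) subgraphs with a simple BFS.
    undirected = defaultdict(set)
    for b, dependents in graph.items():
        for a in dependents:
            undirected[a].add(b)
            undirected[b].add(a)
    seen, subgraphs = set(), []
    for f in files:
        if f in seen:
            continue
        component, queue = [], [f]
        seen.add(f)
        while queue:
            node = queue.pop()
            component.append(node)
            for neighbour in undirected[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        subgraphs.append(component)

    # Modified topological sort: always take the remaining node with minimal in-degree,
    # so a cycle (where no node ever reaches in-degree 0) still gets ordered.
    all_results = []
    for subgraph in subgraphs:
        remaining = {f: in_degree[f] for f in subgraph}
        results = []
        while remaining:
            current = min(remaining, key=remaining.get)
            for node in graph[current]:
                if node in remaining:
                    remaining[node] -= 1
            results.append(current)
            del remaining[current]
        all_results.append(results)
    return all_results
```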

2.3. Repo-Level Deduplication

Recent studies have demonstrated the significant performance improvements that can be achieved by deduplicating training datasets for Large Language Models (LLMs). Lee et al. (2022) have shown that language model training corpora often contain numerous near-duplicates, and the performance of LLMs can be enhanced by removing long repetitive substrings. Kocetkov et al. (2022) have applied a near-deduplication method to training data, resulting in dramatic improvements, and they emphasize that near-deduplication is a crucial preprocessing step for achieving competitive performance on code benchmark tasks. In our dataset, we have also employed near-deduplication. However, there is a distinction in our approach compared to previous works. We perform deduplication at the repository level of code, rather than at the file level, as the latter approach may filter out certain files within a repository, potentially disrupting the structure of the repository. Specifically, we treat the concatenated code from the repository level as a single sample and apply the same near-deduplication algorithm to ensure the integrity of the repository structure.
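
As a concrete illustration, the sketch below performs repository-level near-deduplication with MinHash and LSH. The `datasketch` library, the 5-token shingles, and the 0.85 Jaccard threshold are assumptions of mine rather than details given in the paper; the point it illustrates is that the unit being hashed is the concatenated repository, not the individual file.

```python
from datasketch import MinHash, MinHashLSH

def repo_signature(repo_files, num_perm=128):
    """Treat the whole repository as one sample: hash token shingles of the concatenated code."""
    text = "\n".join(repo_files.values())
    tokens = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(len(tokens) - 4):
        shingle = " ".join(tokens[i:i + 5])      # 5-token shingles (an assumption)
        m.update(shingle.encode("utf-8"))
    return m

def dedup_repositories(repos, threshold=0.85, num_perm=128):
    """repos: {repo_name: {file_path: code}}. Returns the names of kept repositories."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for name, files in repos.items():
        sig = repo_signature(files, num_perm)
        if lsh.query(sig):       # a near-duplicate repository was already kept
            continue
        lsh.insert(name, sig)
        kept.append(name)
    return kept
```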

2.4. Quality Screening and Decontamination

In addition to applying the filtering rules mentioned in Section 2.1, we also employ a compiler and a quality model, combined with heuristic rules, to further filter out low-quality data. This includes code with syntax errors, poor readability, and low modularity. We provide the statistical summary of source code in Table 1, which includes a total of 87 languages, detailing the disk size, number of files, and percentage for each language. The total data volume is 798 GB with 603 million files. To ensure that our code training data is not contaminated by information from the test set, which may be present on GitHub, we’ve implemented an n-gram filtering process. This process involves the removal of any code segments that match specific criteria. Specifically, we filter out files containing docstrings, questions, and solutions from sources such as HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). For the filtering criteria, we apply the following rules: if a piece of code includes a 10-gram string identical to any in the test data, it is excluded from our training data. In cases where the test data comprises strings that are shorter than 10-grams but no less than 3-grams, we use an exact match approach for filtering.
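
The decontamination rule can be sketched as follows. Whitespace tokenization and substring search for the short (3- to 9-token) test strings are simplifying assumptions on my part.

```python
def ngrams(tokens, n):
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_contamination_index(test_strings):
    """Collect 10-grams from long test strings and keep short test strings verbatim."""
    ten_grams, short_exact = set(), set()
    for s in test_strings:
        toks = s.split()
        if len(toks) >= 10:
            ten_grams |= ngrams(toks, 10)
        elif len(toks) >= 3:
            short_exact.add(" ".join(toks))
    return ten_grams, short_exact

def is_contaminated(code, ten_grams, short_exact):
    toks = code.split()
    if ngrams(toks, 10) & ten_grams:          # shares a 10-gram with the test data
        return True
    normalized = " ".join(toks)
    return any(s in normalized for s in short_exact)
```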

3. Training Policy

3.1. Training Strategy

3.1.1. Next Token Prediction

The first training objective for our model is known as next token prediction. In this process, various files are concatenated to form a fixed-length entry. Then, these entries are used to train the model, enabling it to predict the subsequent token based on the provided context.
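
A minimal packing sketch for this objective is shown below; the entry length should match the model's context window (16K in the final setup), and the end-of-sequence separator id is a placeholder.

```python
def pack_entries(tokenized_files, entry_length=16384, eos_id=0):
    """Concatenate tokenized files (lists of token ids) into fixed-length training entries."""
    buffer, entries = [], []
    for tokens in tokenized_files:
        buffer.extend(tokens)
        buffer.append(eos_id)                 # placeholder end-of-sequence separator
        while len(buffer) >= entry_length:
            entries.append(buffer[:entry_length])
            buffer = buffer[entry_length:]
    return entries                            # any trailing partial entry is dropped here
```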

3.1.2. Fill-in-the-Middle

The second training objective for our model is known as fill-in-the-middle. In the code pre-training scenario, it is often necessary to generate corresponding inserted content based on the given context and subsequent text. Due to specific dependencies in a programming language, relying solely on next token prediction is insufficient to learn this fill-in-the-middle capability. Therefore, several approaches (Bavarian et al., 2022; Li et al., 2023) propose the pretraining method of Fill-in-the-Middle (FIM). This approach involves randomly dividing the text into three parts, then shuffling the order of these parts and connecting them with special characters. This method aims to incorporate a fill-in-the-blank pretraining task during the training process. Within the FIM methodology, two distinct modes are employed: PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle). In the PSM mode, the training corpus is organized in the sequence Prefix, Suffix, Middle, aligning the text in a way that the middle segment is flanked by the prefix and suffix. Conversely, the SPM mode arranges the segments as Suffix, Prefix, Middle, presenting a different structural challenge. These modes are instrumental in enhancing the model’s capability to handle various structural arrangements in code, providing a robust training framework for advanced code prediction tasks.

Figure 3 The effectiveness of using FIM objective.

To determine the effectiveness of various hyperparameters within the FIM approach, we conducted a series of ablation experiments.

Experiment Settings: In this experiment, we employ DeepSeek-Coder-Base 1.3B as our model architecture. We focused on a Python subset from our training dataset to streamline the experimental process. Our primary objective was to assess the efficacy of the Fill-in-the-Middle (FIM) technique, utilizing the HumanEval-FIM benchmark (Fried et al., 2022). This benchmark specializes in a single-line FIM task for Python, in which one line of code from a HumanEval solution is randomly obscured, testing the model’s proficiency in predicting the missing line. We hypothesize that the PSM mode may exhibit subtle differences compared to the traditional next-token prediction objective. This is primarily because PSM involves rearranging the order of the original text, potentially impacting the learning dynamics of the model. Therefore, we implement the PSM mode for FIM across four distinct configurations: 0% FIM rate, 50% FIM rate, 100% FIM rate, and 50% MSP rate. The Masked Span Prediction (MSP) strategy, initially introduced in T5 (Raffel et al., 2023), conceals multiple text spans and trains the model to reconstruct these segments. According to CodeGen2.5 (Nijkamp et al., 2023), MSP may enhance FIM performance compared to PSM. Thus, we include this method in our comparative analysis.

Results: The outcomes of our experiment are illustrated in Figure 3. While the model demonstrates peak performance on the HumanEval-FIM with a 100% FIM rate, this configuration also results in the weakest code completion capability. This indicates a trade-off between FIM and code completion abilities. Moreover, we observe that with a 50% PSM rate, the model outperforms the MSP strategy. To achieve a balance between FIM efficiency and code completion proficiency, we ultimately choose the 50% PSM rate as our preferred training policy.

In our implementation, we have introduced three sentinel tokens specifically for this task. For each code file, we initially divide its content into three segments, denoted as f_pre, f_middle, and f_suf. Using the PSM mode, we construct the training example as follows:

<|fim_start|>f_pre<|fim_hole|>f_suf<|fim_end|>f_middle<eos_token>

We implement the Fill-in-the-Middle (FIM) method at the document level before the packing process, as proposed in the original work by Bavarian et al. (2022). This is done with an FIM rate of 0.5, following the PSM mode.
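
A document-level sketch of this PSM construction, using the sentinel tokens above and a 0.5 FIM rate, might look like the following; the character-level split points chosen uniformly at random are a simplification.

```python
import random

def apply_fim_psm(document: str, fim_rate: float = 0.5,
                  rng: random.Random = random.Random(0)) -> str:
    """Rewrite a document into the PSM format with probability `fim_rate`."""
    if len(document) < 2 or rng.random() >= fim_rate:
        return document
    # Randomly split the document into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    f_pre, f_middle, f_suf = document[:i], document[i:j], document[j:]
    # PSM: the prefix and suffix are given as context, the middle is predicted last.
    return f"<|fim_start|>{f_pre}<|fim_hole|>{f_suf}<|fim_end|>{f_middle}<eos_token>"
```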

3.2. Tokenizer

For the tokenization process, we employ the HuggingFace Tokenizer library2 to train Byte Pair Encoding (BPE) tokenizers, as outlined in Sennrich et al. (2015), on a subset of our training corpus. Ultimately, we utilize a tokenizer configured with a vocabulary size of 32,000.
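
A minimal sketch of such a training run with the HuggingFace `tokenizers` library is shown below; the ByteLevel pre-tokenizer and the particular special tokens are assumptions on my part.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe_tokenizer(corpus_iterator, vocab_size=32000):
    """Train a byte-level BPE tokenizer on an iterator of text samples."""
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|fim_start|>", "<|fim_hole|>", "<|fim_end|>", "<eos_token>"],
    )
    tokenizer.train_from_iterator(corpus_iterator, trainer=trainer)
    return tokenizer
```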

3.3. Model Architecture

We develop a range of models with varying parameters to cater to diverse applications, including models with 1.3B, 6.7B, and 33B parameters. These models are built upon the same framework as the DeepSeek Large Language Model (LLM) outlined by DeepSeek-AI (2024). Each model is a decoder-only Transformer, incorporating Rotary Position Embedding (RoPE) as described by Su et al. (2023). Notably, the DeepSeek 33B model integrates Grouped-Query-Attention (GQA) with a group size of 8, enhancing both training and inference efficiency. Additionally, we employ FlashAttention v2 (Dao, 2023) to expedite the computation involved in the attention mechanism. The architectural details of our models are summarized in Table 2.

3.4. Optimization

Following DeepSeek LLM (DeepSeek-AI, 2024), we use AdamW (Loshchilov and Hutter, 2019) as the optimizer, with β1 and β2 values of 0.9 and 0.95. We adapt batch sizes and learning rates according to the scaling laws suggested in DeepSeek LLM. For learning rate scheduling, we implement a three-stage policy that includes 2000 warm-up steps and sets the final learning rate to 10% of the initial rate. Notably, the learning rate at each stage is scaled down relative to the preceding stage's rate, following the guidelines established in DeepSeek LLM (DeepSeek-AI, 2024).
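
A rough PyTorch sketch of this setup follows. The base learning rate, the stage boundaries, and the per-stage scaling factor are placeholders (the actual values follow DeepSeek LLM's scaling laws); only the beta values, the 2000 warm-up steps, and the "final rate is about 10% of the initial rate" property come from the text.

```python
import torch

def build_optimizer_and_scheduler(model, base_lr, total_steps,
                                  warmup_steps=2000, stage_boundaries=(0.8, 0.9),
                                  stage_scale=0.3162):  # two decays of ~sqrt(1/10) end near 10%
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, betas=(0.9, 0.95))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)        # linear warm-up
        progress = step / total_steps
        if progress < stage_boundaries[0]:
            return 1.0                                # stage 1: full rate
        if progress < stage_boundaries[1]:
            return stage_scale                        # stage 2
        return stage_scale ** 2                       # stage 3: roughly 10% of the initial rate

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```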

3.5. Environments

Our experiments are conducted using the HAI-LLM (High-Flyer, 2023) framework, known for its efficiency and lightweight approach in training large language models. This framework incorporates a variety of parallelism strategies to optimize computational efficiency. These include tensor parallelism (Korthikanti et al., 2023), alongside ZeRO data parallelism (Rajbhandari et al., 2020) and PipeDream pipeline parallelism (Narayanan et al., 2019). Our experiments utilize clusters outfitted with NVIDIA A100 and H800 GPUs. In the A100 cluster, each node is configured with 8 GPUs, interconnected in pairs using NVLink bridges. The H800 cluster is similarly arranged, with each node containing 8 GPUs. These GPUs are interconnected using a combination of NVLink and NVSwitch technologies, ensuring efficient data transfer within nodes. To facilitate seamless communication between nodes in both A100 and H800 clusters, we employ InfiniBand interconnects, known for their high throughput and low latency. This setup provides a robust and efficient infrastructure for our computational experiments.

2 https://github.com/huggingface/tokenizers

3.6. Long Context

To enhance the capabilities of DeepSeek-Coder in handling extended contexts, particularly for scenarios like repository-level code processing, we have reconfigured the RoPE (Su et al., 2023) parameters to extend the default context window. Following previous practices (Chen et al., 2023; kaiokendev, 2023), we employed a linear scaling strategy, increasing the scaling factor from 1 to 4 and altering the base frequency from 10000 to 100000. The model underwent an additional 1000 steps of training, using a batch size of 512 and a sequence length of 16K. The learning rate was maintained as in the final pre-training phase. Theoretically, these modifications enable our model to process up to 64K tokens in context. However, empirical observations suggest that the model delivers its most reliable outputs within a 16K token range. Future research will continue to refine and evaluate the long-context adaptation methodology, aiming to further enhance DeepSeek-Coder’s efficiency and user-friendliness in processing extended contexts.
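
The linear scaling of RoPE described here can be sketched as follows; this function is illustrative and not the training framework's actual implementation.

```python
import torch

def rope_cos_sin(head_dim: int, positions: torch.Tensor,
                 base: float = 100_000.0, scaling_factor: float = 4.0):
    """Cos/sin tables for rotary embeddings with linearly scaled position indices."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    scaled = positions.to(torch.float32) / scaling_factor   # linear scaling: factor 1 -> 4
    angles = torch.outer(scaled, inv_freq)                  # shape (seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)
```

For example, `rope_cos_sin(128, torch.arange(16384))` produces tables covering a 16K-token window.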

3.7. Instruction Tuning

We develop DeepSeek-Coder-Instruct by enhancing DeepSeek-Coder-Base through instruction-based fine-tuning using high-quality data. This data comprises helpful and impartial human instructions, structured by the Alpaca Instruction format (Taori et al., 2023). To demarcate each dialogue turn, we employed a unique delimiter token to signify the conclusion of each segment. For training, we use a cosine schedule with 100 warm-up steps and an initial learning rate of 1e-5. We also use a batch size of 4M tokens and 2B tokens in total.

An example of using DeepSeek-Coder-Instruct 33B is depicted in Figure 4. This example is a multi-turn dialogue scenario for building a snake game. Initially, we ask the model to write a snake game using pygame. The model successfully creates a basic snake game that can run without bugs. To improve the game, we further request adding a scoring system in the top left corner. The model then introduces a “score” variable and a “display_score” function, along with an explanation of how to integrate these features. This example illustrates DeepSeek-Coder-Instruct’s ability to provide complete solutions in multi-turn dialogue settings. More cases can be found in Appendix A.

Figure 4 An example of responses from DeepSeek-Coder-Instruct 33B in a multi-turn setting.

4. Experimental Results

In this section, we evaluate DeepSeek-Coder on four tasks, including code generation (§4.1), FIM code completion (§4.2), cross-file code completion (§4.3) and program-based math reasoning (§4.4). We compare DeepSeek-Coder with the previous state-of-the-art large language models:

• CodeGeeX2 (Zheng et al., 2023) represents the second generation of the multilingual code generation model CodeGeeX. It is developed using the ChatGLM2 (Du et al., 2022) architecture and is enhanced with an extensive dataset of coding examples.

• StarCoder (Li et al., 2023) is a publicly accessible model with a substantial parameter count of 15 billion. It is specifically trained on a meticulously curated subset of the Stack dataset (Kocetkov et al., 2022), covering 86 programming languages, ensuring its proficiency across a wide range of coding tasks.

• CodeLlama (Roziere et al., 2023) encompasses a series of code-centric Large Language Models (LLMs) that are derivatives of LLaMA2 (Touvron et al., 2023). Available in three sizes — 7B, 13B, and 34B — these models undergo continued training on a vast 500 billion token code corpus, building upon the foundational LLaMA2 architecture.

• code-cushman-001 (Chen et al., 2021) is a 12 billion parameter model developed by OpenAI and served as the initial model for GitHub Copilot.

• GPT-3.5 and GPT-4 (OpenAI, 2023) are advanced generative AI models developed by OpenAI. While they are not explicitly trained for code generation, they also demonstrate notable performance in this domain. Their effectiveness in handling code generation tasks is largely attributed to their massive scale in terms of parameter count.

4.1. Code Generation

HumanEval and MBPP Benchmarks The HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) benchmarks are widely used for evaluating code LLMs. HumanEval consists of 164 hand-written Python problems that are validated using test cases to assess the code generated by a Code LLM in a zero-shot setting, while the MBPP benchmark includes 500 problems in a few-shot setting. To evaluate the model’s multilingual capabilities, we expanded the Python problems of the HumanEval benchmark to seven additional commonly used programming languages, namely C++, Java, PHP, TypeScript (TS), C#, Bash, and JavaScript (JS) (Cassano et al., 2023). For both benchmarks, we adopted a greedy search approach and re-implemented the baseline results using the same script and environment for a fair comparison.

Table 3 Performance of approaches on the Multilingual HumanEval and MBPP Benchmarks.

The results are presented in Table 3. As we can see, DeepSeek-Coder-Base achieves state-of-the-art performance with an average accuracy of 50.3% on HumanEval and 66.0% on MBPP. In comparison to the similarly sized open-source model CodeLlama-Base 34B, our model has demonstrated a notable improvement of 9% and 11% in accuracy, respectively. It’s worth noting that even our smaller model, DeepSeek-Coder-Base 6.7B, surpasses the performance of CodeLlama-Base 34B. After instruction fine-tuning, our model surpasses the closed-source GPT-3.5-Turbo model on the HumanEval benchmark, significantly reducing the performance gap between OpenAI GPT-4 and open-source models.

DS-1000 Benchmark HumanEval and MBPP have a significant drawback in that they rely heavily on straightforward programming tasks that may not accurately represent the kind of code most programmers typically write. In contrast, the DS-1000 benchmark, as introduced in the work by Lai et al. (2023), offers a comprehensive collection of 1,000 practical and realistic data science workflows across seven different libraries. This benchmark evaluates code generation by executing it against specific test cases. What sets DS-1000 apart is its categorization of problems based on the libraries involved, which encompass Matplotlib, NumPy, Pandas, SciPy, Scikit-Learn, PyTorch, and TensorFlow. The benchmark assesses the performance of base models in the code completion setting, and we provide pass@1 results for each library, as well as the overall score.

The results of DS-1000 benchmark are shown in Table 4. As can be seen from the table, the DeepSeek-Coder model achieves relatively high accuracy in all libraries, demonstrating that our model is not only capable of generating good code but also of using libraries more accurately in real data science workflows.

Table 4 Performance of different approaches on the DS-1000-Tasks.

LeetCode Contest Benchmark To further validate the model’s capability in real-world programming problems, we construct the LeetCode Contest benchmark3. LeetCode4 presents competition-level problems, offering significant challenges that test the model’s problem understanding and code generation skills. We collected the latest problems from LeetCode Contests to prevent the appearance of either the problems or their solutions in our pre-training data. A total of 180 problems were collected from July 2023 to January 2024. For each problem, we collected 100 test cases to ensure test coverage. We use the template “{problem_description}\nPlease complete the code below to solve the above problem:\npython\n{code_template}\n” to build the instruction prompt.
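
For reference, a small helper that fills this template for one problem might look like the following; the template string is reproduced exactly as quoted above.

```python
def build_leetcode_prompt(problem_description: str, code_template: str) -> str:
    """Fill the instruction template quoted above for one LeetCode problem."""
    return (
        f"{problem_description}\n"
        "Please complete the code below to solve the above problem:\n"
        f"python\n{code_template}\n"
    )
```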

The evaluation results are shown in Table 5. In our evaluation, the DeepSeek-Coder models demonstrate remarkable performance over current open-source coding models. Specifically, the DeepSeek-Coder-Instruct 6.7B and 33B achieve Pass@1 scores of 19.4% and 27.8% respectively in this benchmark. This performance notably surpasses existing open-sourced models such as Code-Llama-33B. The DeepSeek-Coder-Instruct 33B is the only open-sourced model that outperforms OpenAI’s GPT-3.5-Turbo in this task. However, there remains a substantial performance gap when compared to the more advanced GPT-4-Turbo.

Our analysis indicates that the implementation of Chain-of-Thought (CoT) prompting notably enhances the capabilities of DeepSeek-Coder-Instruct models. This improvement becomes particularly evident in the more challenging subsets of tasks. By adding the directive, “You need first to write a step-by-step outline and then write the code.” following the initial prompt, we have observed enhancements in performance. This observation leads us to believe that the process of first crafting detailed code descriptions assists the model in more effectively understanding and addressing the intricacies of logic and dependencies in coding tasks, particularly those of higher complexity. Therefore, we strongly recommend employing CoT prompting strategies when utilizing DeepSeek-Coder-Instruct models for complex coding challenges. Such an approach promotes a more methodical and logical framework for problem-solving, potentially resulting in more precise and efficient outcomes in code generation tasks.

3 We have published this benchmark in https://github.com/deepseek-ai/DeepSeek-Coder/tree/main/Evaluation/LeetCode.

4 https://leetcode.com/

Table 5 Performance of different models on the LeetCode Contest Benchmark.

It is important to acknowledge that despite our diligent efforts to gather the most recent code questions for model testing, the possibility of data contamination cannot be entirely ruled out. We observed that the GPT-4-Turbo and DeepSeek-Coder models achieved higher scores in the LeetCode Contest held in July and August. We encourage the research community to consider the potential issue of data contamination when evaluating models in future studies using our released LeetCode data.

4.2. Fill-in-the-Middle Code Completion

DeepSeek-Coder models are trained with a 0.5 FIM (Fill-In-the-Middle) rate during their pretraining phase. This specialized training strategy empowers the model to proficiently generate code by filling in blanks based on the surrounding context, both prefix and suffix, of the given code snippet. This capability is particularly advantageous in the realm of code completion tools. Several open-source models have emerged with similar capabilities. Notable among these are SantaCoder (Allal et al., 2023), StarCoder (Li et al., 2023), and CodeLlama (Roziere et al., 2023). These models have set a precedent in the field of code generation and completion. In evaluating the performance of DeepSeek-Coder models, we conducted a comparative analysis with the aforementioned models. The benchmark for this comparison was the Single-Line Infilling benchmark, encompassing three different programming languages, as proposed by Allal et al. (2023). This benchmark uses line exact match accuracy as the evaluation metric.

Table 6 Performance of different approaches on the FIM-Tasks.

The evaluation results are shown in Table 6. Despite being the smallest model with a capacity of 1.3 billion parameters, DeepSeek-Coder outperforms its larger counterparts, StarCoder and CodeLlama, in these benchmarks. This superior performance can be attributed to the high quality of the pre-trained data utilized by DeepSeek-Coder. Furthermore, a notable trend observed is the correlation between the size of the model and its performance. As the model size increases, there is a corresponding enhancement in performance. This trend underscores the importance of model capacity in achieving higher accuracy in code completion tasks. Based on these findings, we recommend the deployment of the DeepSeek-Coder-Base 6.7B model in code completion tools. This recommendation is grounded in the model’s demonstrated balance between efficiency and accuracy. The DeepSeek-Coder-Base 6.7B model, with its substantial parameter size, has proven to be highly effective in the context of code completion, making it an ideal choice for integrating advanced computational capabilities into coding environments.

4.3. Cross-File Code Completion

In this section, we will evaluate the performance of existing open-source models in cross-file code completion tasks. Unlike code generation discussed in the previous section, cross-file code completion requires the model to access and understand repositories that span multiple files with numerous cross-file dependencies. We use CrossCodeEval (Ding et al., 2023) to evaluate the capabilities of currently available open-source code models of 7B scale in cross-file completion tasks. This dataset is constructed on a diverse set of real-world, open-sourced, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. The dataset is specifically designed to strictly require cross-file context for accurate completion. Notably, this dataset was constructed from repositories created between March and June 2023, while our pre-training data only includes code created before February 2023, which ensures that this dataset was not present in our pre-training data, thus avoiding data leakage.

Table 7 Performance of different models on cross-file code completion.

In our evaluation of various models, we set the maximum sequence length to 2048 tokens, the maximum output length to 50 tokens, and a limit of 512 tokens for the cross-file context. For the cross-file context, we utilize the official BM25 search results provided by Ding et al. (2023). Evaluation metrics include exact match and edit similarity. The results, presented in Table 7, demonstrate that DeepSeek-Coder consistently outperforms other models in cross-file completion tasks across multiple languages, showcasing its superior practical application capabilities. When only utilizing file-level code corpus (w/o Repo Pre-training) to pre-train DeepSeek-Coder, we observe a decrease in performance in the Java, TypeScript, and C# languages, indicating the effectiveness of the repository-level pre-training.

4.4. Program-based Math Reasoning

Program-based math reasoning involves evaluating a model’s ability to understand and solve mathematical problems through programming. This type of reasoning is critical in fields such as data analysis and scientific computing. To conduct this assessment, we utilize the Program-Aided Math Reasoning (PAL) method as outlined in Gao et al. (2023). This approach is applied across seven distinct benchmarks, each offering unique challenges and contexts. These benchmarks include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), GSM-Hard (Gao et al., 2023), SVAMP (Patel et al., 2021), TabMWP (Lu et al., 2022), ASDiv (Miao et al., 2020) and MAWPS (Gou et al., 2023). In each of these benchmarks, the model is prompted to alternately describe a solution step in natural language and then execute that step with code. As seen in Table 8, DeepSeek-Coder models achieve remarkable performance across all benchmarks, especially the 33B variant, which demonstrates the potential of using such models in applications that require complex mathematical computations and problem-solving abilities.

Table 8 Performance of different approaches on the program-aid math reasoning tasks.

5. Continue Pre-Training From General LLM

To further enhance the natural language understanding and mathematical reasoning abilities of the DeepSeek-Coder model, we perform additional pre-training from the general language model DeepSeek-LLM-7B Base (DeepSeek-AI, 2024) on 2 trillion tokens, resulting in DeepSeek-Coder-v1.5 7B. For this pre-training, we specifically use the data sources listed in Table 9. Unlike DeepSeek-Coder, DeepSeek-Coder-v1.5 employs solely a next token prediction objective with a 4K context length during its pre-training phase.

We conduct a comparison between DeepSeek-Coder-v1.5 7B and DeepSeek-Coder 6.7B, and re-run all benchmarks using our evaluation pipeline to ensure a fair comparison. We evaluate performance across a wide range of tasks, which can be categorized as follows:

• Programming: This category includes evaluations in a multilingual setting using the HumanEval dataset by Chen et al. (2021), as well as evaluations in a Python setting using the MBPP dataset by Austin et al. (2021).

• Math Reasoning: We assess performance on math reasoning tasks using the GSM8K benchmark (Cobbe et al., 2021) and the MATH benchmark (Hendrycks et al., 2021). These tasks involve solving math problems by generating programs.

• Natural Language: Our evaluation in natural language tasks includes the MMLU (Hendrycks et al., 2020), BBH (Suzgun et al., 2022), HellaSwag (Zellers et al., 2019), Winogrande (Sakaguchi et al., 2021), and ARC-Challenge (Clark et al., 2018) benchmarks.

The results for the Base and Instruct models are presented in Table 10.

It is observed that the DeepSeek-Coder-Base-v1.5 model, despite a slight decrease in coding performance, shows marked improvements across most tasks when compared to the DeepSeek-Coder-Base model. In particular, in the Math Reasoning and Natural Language categories, DeepSeek-Coder-Base-v1.5 significantly outperforms its predecessor across all benchmarks, demonstrating substantial gains in its mathematical reasoning and natural language processing capabilities.

Table 10 Comparative analysis of performance between DeepSeek-Coder-Base and DeepSeek-Coder-Base-v1.5. Math tasks are solved through programming.

6. Conclusion

In this technical report, we introduce a series of specialized Large Language Models (LLMs) for coding, named DeepSeek-Coder, available in three distinct scales: 1.3B, 6.7B, and 33B parameters. These models are uniquely trained on a meticulously curated project-level code corpus, utilizing a “fill-in-the-blank” pre-training objective to enhance code infilling capabilities. A significant advancement is the extension of the models’ context window to 16,384 tokens, thereby greatly improving their effectiveness in handling extensive code generation tasks. Our evaluations reveal that the most advanced model in our series, DeepSeek-Coder-Base 33B, surpasses existing open-source code models across a variety of standard tests. Impressively, the DeepSeek-Coder-Base 6.7B model, despite its smaller scale, delivers performance on par with the 34B parameter CodeLlama, a testament to the high quality of our pretraining corpus.

To augment the zero-shot instruction capabilities of the DeepSeek-Coder-Base models, we have fine-tuned them with high-quality instructional data. This has led to the DeepSeek-Coder-Instruct 33B model outperforming OpenAI’s GPT-3.5 Turbo in a range of coding-related tasks, showcasing its exceptional proficiency in code generation and understanding.

To further improve the natural language understanding capabilities of the DeepSeek-Coder-Base models, we have conducted additional pretraining based on the DeepSeek-LLM 7B checkpoint. This additional training involved processing a diverse dataset comprising 2 trillion tokens, including natural language, code, and mathematical data. The result is the creation of a new and improved code model, DeepSeek-Coder-v1.5. Our observations indicate that DeepSeek-Coder-v1.5 not only maintains its predecessor’s high-level coding performance but also exhibits enhanced natural language comprehension. This advancement underscores our belief that the most effective code-focused Large Language Models (LLMs) are those built upon robust general LLMs. The reason is evident: to effectively interpret and execute coding tasks, these models must also possess a deep understanding of human instructions, which often come in various forms of natural language. Looking ahead, our commitment is to develop and openly share even more powerful code-focused LLMs based on larger-scale general LLMs.
