Contents
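1 Introduction
Recent advances in large language models (LLMs) are having a major impact on the workplace. LLMs for code (Code LLMs) in particular have been adopted quickly and are now used by many developers. At the same time, the development of these models has raised concerns about copyright, privacy, and openness. This paper introduces StarCoder and StarCoderBase, two open-access Code LLMs. They were developed through an open scientific collaboration and trained under a transparent data governance framework.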
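2 Related Work
Early large-scale language models used n-grams and simple smoothing techniques. The Transformer architecture later made language models far more scalable, and such models show a predictable relationship between language modeling loss and scaling factors such as model size, number of training tokens, and compute budget. For code, decoder-only Transformer architectures trained on mixtures of text and code from GitHub have produced strong generative models.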
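3 Data Curation and Cleaning
3.1 Programming Languages
From The Stack v1.2, 86 programming languages were selected. Data was assigned to languages based on file extension, and for quality control 30,000 files per language were randomly sampled for visual inspection. Near-deduplication was applied to avoid duplicated data, and language-specific filters were used to obtain a high-quality dataset.
3.2 Jupyter Notebooks
Jupyter notebooks were converted into two datasets: scripts and structured notebooks. The script conversion used Jupytext, while the structured dataset was built by merging consecutive Markdown and code blocks.
3.3 GitHub Issues
Conversations from GitHub issues and pull requests went through several filtering stages to preserve text quality. Non-English issues were filtered out using the fasttext library.
3.4 Git Commits
Git commit data was filtered to include only single-file commits, and unimportant file changes were excluded before the data was used for training.
3.5 Deduplication
Near-deduplication was applied to all programming languages and to the Jupyter notebooks. The process maps similar code files to the same bucket so that near-duplicates can be removed, which improves data quality.
3.6 Weighting of Data Sources
Because the compute budget allocated to a data source affects the model's performance on that source, the community discussed whether to up-sample or down-sample certain programming languages. However, the popular languages that provide the most data also benefit the largest number of users, so the natural distribution of the data was followed.
This paper places particular emphasis on mathematical reasoning and argumentation, explaining the mathematical background of the techniques and algorithms applied during data processing and model training. The heuristics and algorithms used to judge data quality directly affect model performance, and the resulting outcomes were validated on benchmarks.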
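4 PII Redaction
4.1 Data Collection
Using the Toloka platform, 1,399 crowd-workers from 35 countries annotated personally identifiable information (PII) in source code. These annotations served as the basis for training a PII detection model. The collected dataset consists of 12,000 files, each containing about 50 lines of code, written in a variety of programming languages. Of these, 7,100 files were pre-filtered to increase the representation of rare PII types, and the remaining 5,100 files were selected at random.
4.2 StarEncoder
StarEncoder is an encoder model based on a bidirectional self-attention Transformer, trained with BERT's Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) objectives. Its advantage is that it can be fine-tuned efficiently on both source code and code-related text. For the MLM loss, tokens in the input are masked independently with a probability of 15%; for the NSP loss, a linear classifier applied to the representation output at the [CLS] token predicts whether two sentences are neighbors in a document.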
\[\text{MLM Loss} = -\sum_{i \in \text{masked}} \log P(x_i \mid x_{\text{context}})\]
\[\text{NSP Loss} = -\left[\log P(\text{IsNext} \mid x_{\text{context}}) + \log\left(1 - P(\text{IsNext} \mid x_{\text{not\_context}})\right)\right]\]
where $x_{\text{context}}$ is the input sentence, $x_{\text{not\_context}}$ is a non-adjacent sentence, and $x_i$ is a masked token.
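4.3 PII Detection Model
This section describes fine-tuning StarEncoder for Named Entity Recognition (NER). A token classification head was added for 6 target classes: names, emails, keys, passwords, IP addresses, and usernames. The training and evaluation splits were constructed carefully to balance the PII types. The fine-tuned baseline achieved F1 scores above 90% on names, emails, and IP addresses, while performance on passwords, keys, and usernames was comparatively low. For keys, the low performance is attributed to the limited number of labels.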
\[\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]
Precision is the fraction of entities predicted as PII that are actually PII, and recall is the fraction of actual PII entities that the model correctly identifies.
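As a small worked illustration of the F1 formula, the sketch below computes precision, recall, and F1 from entity-level counts. The function name and the count-based interface are assumptions made for this example; the evaluation code used for the PII model is not described here.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts.

    Hypothetical helper for illustration: tp, fp, fn are entity-level counts.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

F1 is the harmonic mean of the two quantities, so it is pulled towards the lower of precision and recall.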
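Through mathematical reasoning and argumentation, the paper systematically describes how the data was collected and processed and how the models were trained, emphasising the mathematical background and logical connections at each step. This supports the efficiency and accuracy of the PII detection model and points to how such models could be applied to other kinds of data.
5 Model Training
This section provides information on the training process of the StarCoder models. It first clarifies the difference between the two models.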
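5.1 Data Formatting
Formatting guidelines are given for each data source. For code files, the repository name, file name, and number of stars are prepended to the context of the file, while taking care not to overfit on the exact number of stars. Each of the three metadata fields is prepended independently with probability 0.2.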
<reponame>reponame<filename>filename<gh_stars>stars\ncode
Following the template above, a Fill-in-the-Middle (FIM) transformation is applied to the source code at the character level. This increases the diversity of the training data and helps the model generalise.
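The formatting steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the FIM sentinel strings, the prefix-suffix-middle ordering, the uniform choice of split points, and the rule of adding the newline separator only when some metadata is present are all assumptions made for this example.

```python
import random

# Assumed sentinel strings for illustration; the section above specifies only
# the metadata template, not the FIM token names or ordering.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"


def add_metadata(code, reponame, filename, stars, rng, p=0.2):
    """Prepend each metadata field independently with probability p."""
    header = ""
    if rng.random() < p:
        header += f"<reponame>{reponame}"
    if rng.random() < p:
        header += f"<filename>{filename}"
    if rng.random() < p:
        header += f"<gh_stars>{stars}"
    # Assumption: the newline separator is added only if a header exists.
    return header + "\n" + code if header else code


def fim_transform(code, rng):
    """Split code at two random character positions and move the middle to the end."""
    i, j = sorted(rng.randint(0, len(code)) for _ in range(2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```

Splitting at character positions rather than token boundaries means the model sees spans that start and end in the middle of identifiers, which is what "character level" refers to here.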
5.2 Training Data Decontamination
Files containing docstrings or solutions from HumanEval and MBPP, docstrings from APPS, questions from GSM8K, or prompts from DS-1000 were removed from the training data. This keeps benchmark content out of the training set; for Python alone, 558 files were removed.
5.3 Tokenizer
The tokenizer is a byte-level Byte-Pair-Encoding tokenizer trained with the Hugging Face Tokenizers library. Its pre-tokenization step includes a digit splitter and the regex splitter from GPT-2.
5.4 Model Architecture
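This 15.5B-parameter model uses a decoder-only Transformer architecture with Multi-Query-Attention (MQA) and learned absolute positional embeddings. FIM transformations are applied to the training data to enable infilling, and FlashAttention (Dao et al., 2022) is used to speed up the attention computation and reduce memory usage.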
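5.5 Training Details
StarCoderBase was trained for 250k iterations on 1 trillion tokens. StarCoder was then obtained by fine-tuning on a Python subset of the data for 2 epochs, keeping the same settings and adjusting only the learning rate.
5.6 Multi-Node GPU Setup
The model was trained on a cluster with 512 NVIDIA A100 80GB GPUs. This configuration allows the model to be trained efficiently with high GPU utilization.
5.7 CO2 Emissions
Training StarCoderBase produced a total carbon footprint of 16.68 tonnes of CO2eq. This estimate takes into account the power consumption and the average power usage effectiveness of the data center.
6 Evaluation
This section evaluates several models in addition to StarCoder and StarCoderBase. It first reports Python performance on a range of benchmarks and then covers evaluation on multiple languages.
6.1 StarCoder: Python Evaluation
StarCoder showed improved performance on Python compared with both open-access and closed-access models. The two main benchmarks used are HumanEval and MBPP.
6.1.1 The HumanEval and MBPP Benchmarks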
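HumanEval and MBPP are benchmarks of Python programming problems that check whether generated code passes a set of test cases. On these benchmarks, a Code LLM generates code by sampling from its output distribution. Performance is reported with the pass@k metric: a problem is considered solved if at least one of $k$ samples passes all test cases.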
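A common way to estimate pass@k in practice is to draw $n \ge k$ samples per problem, count the $c$ samples that pass, and use the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ from Chen et al. (2021). The sketch below assumes that estimator; this section does not state which estimator was used, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass all tests."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For $k = 1$ the estimator reduces to $c / n$, the fraction of samples that pass.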
\[\text{pass@k} = \frac{\text{Number of problems solved by at least one of } k \text{ samples}}{\text{Total number of problems}}\]
6.1.2 The DS-1000 Python Data Science Benchmark
DS-1000 is a benchmark of realistic data science workflows that evaluates problems across several libraries, including Matplotlib, NumPy, and Pandas. It supports two modes, completion and insertion, and pass@1 is reported per library.
\[\text{pass@1} = \frac{\text{Number of problems passed on first attempt}}{\text{Total number of problems}}\]
6.1.3 The ODEX Open-Domain Coding Benchmark
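ODEX evaluates code generation in both the open domain (using library functions) and the closed domain (using only built-in Python functions). It contains coding queries written in several natural languages, including English, Spanish, Japanese, and Russian, and is evaluated by executing test cases.
6.2 StarCoder and StarCoderBase: Multi-Language Evaluation
StarCoderBase is evaluated on a variety of programming languages and tasks, including code generation from natural language descriptions, code documentation, and type annotation prediction. This section also shows that StarCoder, despite being specialised for Python, performs well across many languages.
6.2.1 Evaluation on 19 Programming Languages: MultiPL-E
MultiPL-E translates the HumanEval and MBPP Python benchmarks into 18 other programming languages. For each target language, the function signatures, assertions, and docstrings are translated so that model performance can be compared across languages.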
\[\text{pass@1}_{\text{language}} = \frac{\text{Number of problems passed on the first attempt in the given language}}{\text{Total number of problems}}\]
6.2.2 The “Asleep at the Keyboard” Security Benchmark
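This benchmark evaluates how likely a Code LLM is to generate security vulnerabilities. It covers a range of vulnerability classes, such as SQL injection, and includes scenarios in the Verilog hardware description language. The model generates 25 completions per scenario, and both the fraction of valid completions and the fraction of completions containing a vulnerability are reported.
6.3 Performance Improvement Throughout Training
StarCoderBase was evaluated at successive training checkpoints. For high-resource languages, performance is likely to keep improving with longer training, whereas for low-resource languages it may be limited by the amount of available training data.
6.4 Long Context and Perplexity
StarCoderBase was trained with an 8K-token window, which allows conditional generation on long code files. The model achieves lower perplexity on code when it can use 8K tokens of additional file-level and repository-level context.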
\[\text{Perplexity} = \exp \left(-\frac{1}{N} \sum_{i=1}^N \log P(x_i \mid x_{i-1}, \dots, x_1)\right)\]
These detailed evaluations demonstrate the versatility of StarCoderBase and StarCoder and the effectiveness of the training strategy.
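The perplexity formula can be evaluated directly from per-token log-probabilities. The sketch below assumes natural-log probabilities $\log P(x_i \mid x_{i-1}, \dots, x_1)$ have already been obtained from a model; the function name is hypothetical.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Exponentiated negative mean log-probability over N tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)
```

A model that assigns probability 0.5 to every token has perplexity 2, so lower values indicate better next-token prediction.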
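7 Natural Language Evaluation
7.1 Mathematical Reasoning
Recent work has found language models (LMs) for code and natural language to be useful mathematical reasoners. This section evaluates the mathematical reasoning ability of StarCoderBase on GSM8K, a dataset of middle-school-level math problems, to examine the model's reasoning from several angles.
Method
Two main methods are used for evaluation: Program-Aided Language models (PAL) and Chain-of-Thought (CoT). With PAL, the model reads the problem and generates a Python program that solves it; the program is then executed by the Python interpreter to produce the answer. With CoT, the model generates the reasoning steps in natural language before arriving at the answer.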
\[P(\text{output} \mid \text{input}, \text{context}) = \frac{e^{\text{model}(\text{input}, \text{context})}}{\sum e^{\text{model}(\text{input}, \text{alternative context})}}\]
Each method has its strengths and weaknesses: PAL enables structured reasoning through explicit programming, whereas CoT allows a freer form of problem solving. The two methods are compared to assess the soundness and efficiency of each.
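7.2 World Knowledge and Reading Comprehension
Datasets and benchmarks
Experimental design and method
StarCoderBase is evaluated by 5-shot accuracy on a range of multiple-choice questions, and its reading comprehension is evaluated by zero-shot F1 score on CoQA. Its performance is compared against other Code LLMs.
7.3 Measuring Harmful Generation
Social bias
StereoSet is used to measure social bias. It uses sentence-completion tests to assess whether a language model prefers particular stereotypes.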
\[\text{Stereotype score} = \frac{\text{Number of stereotypical completions chosen}}{\text{Total completions}}\]
Toxicity
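RealToxicityPrompts is used to evaluate the toxicity of the model's responses. A toxicity classifier and a list of offensive words are used to measure harmful content in the responses.
8 Conclusion and Discussion
This study analyses how well StarCoderBase performs on natural language tasks and mathematical reasoning. The model is evaluated systematically on datasets and benchmarks, from which its strengths and weaknesses are derived. The risks of social bias and toxic generation are also assessed to examine the model's safety.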
Generative AI and large language models (LLMs; Brown et al., 2020; Chen et al., 2021; Chowdhery et al., 2022; Zhang et al., 2022; OpenAI, 2023a) are predicted to significantly impact the workforce in the coming years (Eloundou et al., 2023; Bommasani et al., 2021; World Economic Forum, 2023) by boosting worker productivity. LLMs trained on code (Code LLMs) have seen particularly fast adoption: Microsoft’s Copilot has attracted over 1 million professional developers (Euronews, 2023) and GitHub reports that Copilot users rely on it to produce 35% of the code they write for some languages (Thompson, 2022). However, the development and use of LLMs has raised concerns of copyright, privacy, and openness.
Copyright concerns arise in many jurisdictions, including the U.S. and E.U., regarding the rights of content creators whose public data is used to train language models. It has been questioned whether machine learning models trained on such data fall under fair-use doctrine in the U.S. (Kuhn, 2022; Butterick, 2022; Rothchild & Rothchild, 2022), with fair use being most likely when the model generates novel content dissimilar to any copyrighted training data (Lemley & Casey, 2020; Levendowski, 2018). Henderson et al. (2023), therefore, suggest LLM developers should provide additional tools to ensure these models comply with current copyright laws. It is important to mention that these legal issues are not only the subject of scholarly debates: lawsuits have already been filed against GitHub Copilot (DOE 1 v. and GitHub, Inc., 2022) as well as Stable Diffusion (Andersen et al v. Stability AI et al, 2023).
Concerns about personal information led Italy to temporarily ban ChatGPT and launch an ongoing investigation into OpenAI’s compliance with the E.U.’s General Data Protection Regulation (GDPR) (BBC, 2023). According to these regulations (European Council, 2018; Lomas, 2022), organizations that process personal information must have a valid legal basis. These laws could potentially affect LLM developers who gather vast amounts of public data from the internet, which may include personal information. Obtaining explicit consent from data creators is difficult at this scale, and it is uncertain whether other legal grounds exist for processing this personal information. Moreover, even with a valid legal basis, GDPR mandates that data processors inform individuals as to how their data is being processed and provide data access controls, such as the right to have data deleted or to modify erroneous data. This would require LLM providers to be transparent about the data they have collected and provide tooling for individuals to inspect their data and have the possibility to delete it.
The lack of transparency and openness surrounding the development processes of generative AI models has also raised concerns in the scientific community. Many models are closed-access to varying degrees: from being available only within the organization that developed them (Chowdhery et al., 2022; Hoffmann et al., 2022) to being accessible publicly through a paid API but with many details on their development process hidden (Brown et al., 2020; OpenAI, 2023a). While API access allows researchers to experiment with these models, it limits their ability to research LLM safety (Perez et al., 2022), inspect the models’ inner workings (Olsson et al., 2022), and contribute to model improvements (Togelius & Yannakakis, 2023).
We use “open-access” to refer to models whose weights are public. Although other open-access models exist, the level of openness still varies across these projects; and some models with released weights have restrictions on model distribution (Touvron et al., 2023), or do not release their training datasets (Nijkamp et al., 2023; Zhang et al., 2022; Fried et al., 2022). Even in cases when models and training data are both released permissively (Raffel et al., 2020; Tay et al., 2022), external researchers typically do not have an opportunity to participate in guiding the development of industry-produced models. In contrast, other LLM development projects have taken a fully open approach which aims to allow for community inputs into model development, release training data, and enable external audits throughout the full development process (Solaiman, 2023). One example is the BigScience research workshop (BigScience Workshop, 2022), an open scientific collaboration (Akiki et al., 2022) comprising hundreds of researchers collaborating to release BLOOM, a multi-lingual LLM (Scao et al., 2022; Muennighoff et al., 2022). Similarly, EleutherAI, a grassroots-turned-nonprofit research initiative, has released open-access LLMs including GPT-NeoX (Black et al., 2022), GPT-J (Wang & Komatsuzaki, 2021), and Pythia (Biderman et al., 2023), as well as the associated training data (Gao et al., 2021a).
In this paper, we describe StarCoder and StarCoderBase, open-access code LLMs developed and released by the BigCode community, with a focus on respecting copyright, privacy, transparency, and community-driven model development. The project is an open-scientific collaboration focusing on the responsible development of LLMs for code. It is co-stewarded by two industry research labs and comprises more than 600 members from diverse academic institutes and industry labs. The Stack (Kocetkov et al., 2022) is a publicly available pre-training dataset for Code LLMs with a transparent data governance framework. The Stack consists of 6.4 TB of permissively licensed source code in 384 programming languages, and includes 54 GB of GitHub issues and repository-level metadata in the v1.2 version of the dataset. The dataset comes with “Am I in The Stack”, a governance tool for developers to check whether their source code is part of the dataset, and an opt-out process for those who wish to have their code removed from the dataset.
StarCoder and StarCoderBase are both 15.5B parameter models trained on permissively licensed data from The Stack. We trained StarCoderBase on 1 trillion tokens sourced from 80+ programming languages, GitHub issues, Git commits, and Jupyter notebooks. We fine-tuned StarCoderBase on another 35B Python tokens, leading to the StarCoder model. Both StarCoder models come with a novel combination of architectural features, such as an 8K token context length (Dao et al., 2022), infilling capabilities through Fill-in-the-Middle (FIM; Bavarian et al., 2022), and fast large-batch inference through Multi-Query-Attention (MQA; Shazeer, 2019). We present an extensive evaluation of the StarCoder models and release a demo along with an integrated attribution tool that can help users locate model generations that may have been copied from the training set. Overall, our contributions can be summarized as follows.
Language models Early efforts to build large-scale language models used n-grams and simple smoothing techniques (Brants et al., 2007; Heafield et al., 2013; Buck et al., 2014). Other approaches applied various types of neural network architectures, such as feedforward networks (Bengio et al., 2000) and recurrent networks (Mikolov et al., 2010; Jozefowicz et al., 2016), to the language modeling task. The Transformer architecture (Vaswani et al., 2017) led to the development of highly scalable language models (Radford et al., 2019; Brown et al., 2020), which have shown a predictable relationship between language modeling loss and scaling factors such as the model size, number of training tokens, and compute budget (Kaplan et al., 2020; Hoffmann et al., 2022).
Language Models for Code Language models were initially applied to code by Hindle et al. (2012), but relied on n-gram models trained at comparatively small scale. Many neural architectures developed in NLP were also applied successfully to code, including encoder-only models for producing code representations (Feng et al., 2020; Kanade et al., 2020) and encoder-decoder models for translation, editing, summarization, and language-to-code tasks (Wang et al., 2021; Ahmad et al., 2021; Li et al., 2022). Decoder-only Transformer architectures have produced strong generative models of code, typically by training on mixtures of text and code from GitHub (Chen et al., 2021; Austin et al., 2021; Fried et al., 2022; Zheng et al., 2023; Nijkamp et al., 2023). Most of these models have not been fully open, but PolyCoder (Xu et al., 2022) and SantaCoder (Ben Allal et al., 2023) are notable exceptions and have both open models and training data. However, these models are relatively small (2.7B and 1.1B parameters, respectively) and are trained on less data (< 300GB of code) than we explore in this work.
Closed-access LLMs Several large tech companies have developed top-performing LLMs without releasing them. Examples include Google’s PaLM (Chowdhery et al., 2022) and LaMDA (Thoppilan et al., 2022), DeepMind’s Chinchilla (Hoffmann et al., 2022) and Gopher (Rae et al., 2021), and NVIDIA’s Megatron-Turing NLG (Smith et al., 2022). OpenAI and other AI startups, including Cohere, Anthropic, and Aleph Alpha, offer LLMs as a paid API service. These companies have neither released model weights nor provided comprehensive information on the methodology used to create these models. OpenAI has published several technical reports of the GPT family of models (Brown et al., 2020; Chen et al., 2021; OpenAI, 2023a), showcasing the capabilities of their models.
Open-access LLMs Numerous open-access LLMs have been released to the AI community, although they are generally not as strong as closed-access ones. In this paper, we use the term “open-access LLM” when the model weights are publicly available. We still note that there are significant differences between open-access models in how transparent they have been about the training data and filtering techniques. For instance, EleutherAI released GPT-NeoX-20B (Black et al., 2022) and GPT-J-6B (Wang & Komatsuzaki, 2021), as well as the dataset these models were trained on (Gao et al., 2021a). Google released UL2-20B (Tay et al., 2022), an encoder-decoder model trained on the publicly available C4 (Raffel et al., 2020). Tsinghua University released the weights of GLM-130B (Zeng et al., 2022), a Chinese-English LLM, and CodeGeeX-13B (Zheng et al., 2023), a LLM for coding applications, without releasing the training sets. Salesforce released CodeGen-Mono-16B (Nijkamp et al., 2023) without disclosing a proprietary Python dataset. Meta released the OPT (Zhang et al., 2022), LLaMA (Touvron et al., 2023), and InCoder models (Fried et al., 2022) under a non-commercial license and only provided high-level details about the data collection and filtering process.
This section describes how we processed the training data of StarCoderBase. We restrict the training set to The Stack v1.2 (Kocetkov et al., 2022), which exclusively contains data from permissively licensed4 GitHub repositories. At the time of the data processing, 44 people opted out of The Stack. Below, we describe how we further cleaned the data by combining heuristic filtering and manual inspection.
4 See https://blueoakcouncil.org to learn more about permissive licenses and access a comprehensive collection of such
Selection of programming languages From the 358 programming languages in The Stack, we selected 86 languages. The assignment of data to programming languages was performed based solely on file extension (Kocetkov et al., 2022). We included all programming languages with more than 500 MB of data, as well as languages that were ranked in the top 50 on Githut 2.0 or the December 2022 TIOBE Index of programming language popularity. In addition, we included dialects of already selected programming languages (e.g., Racket and Scheme for Lisp). We excluded configuration languages (Nix, Puppet, etc.) and languages that are no longer actively supported (ActionScript). We also included data formats like JSON and YAML but limited their data volume (see “JSON and YAML” paragraph for details). The full list of selected programming languages can be found in Tables 1 and 2. Out of the languages present in MultiPL-E (Cassano et al., 2023), only D and Swift were not included in the training set. For D, language misclassification of the files led to less than 2 MB of data in The Stack (Kocetkov et al., 2022). Swift was excluded from the final list of languages due to human error.
Visual inspection We performed a visual inspection to ensure that we only retain data of high quality. To achieve this, we randomly selected 30,000 files from The Stack for each programming language, categorized them by extension, and kept a maximum of 1,000 files for each extension. We then reached out to our community for assistance with data inspection. We instructed the annotators to go through 50–100 files and confirm if the data appeared to be normal code written by humans, as opposed to text, data, or a single long line of autogenerated code. We also asked annotators to determine whether we should use our default alpha-numeric filter (which requires over 25% alpha-numeric symbols) and long-line filter (which requires lines to be less than 1,000 characters) for a given file extension. Eighteen community annotators evaluated 300 programming language extensions. After inspection, we excluded 36 extensions and eliminated the long-line filter for 27 extensions. The complete outcomes of the data inspection, including annotator remarks, can be found in this Google sheet.
XML filter As we inspected the data, we noticed that certain extensions often consisted of XML files. For example, the .sld extension had more than 50% of its files in XML format. To address this, we implemented a simple XML filter that checked for the presence of “<?xml version=” within the first 100 characters of the file. This filter proved to be effective and produced few false positives. Hence, we applied it to all programming languages except for XSLT, which uses XML syntax.
Alpha filter During our investigation, we discovered that certain extensions, such as MATLAB, contained numerous data files that frequently stored large tensors. To identify these files, we developed an alpha filter that removed files with fewer than 25% alphabetic characters. However, when we tested this filter on a small subset of data, we observed a high rate of false positives for certain programming languages, such as Assembly. To address this issue, we focused on the 25 extensions with the highest number of detections and manually verified whether or not the alpha filter should be applied.
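The XML and alpha filters described above are simple enough to sketch directly. The function names are hypothetical, and the per-extension exceptions (XSLT for the XML filter, the manually verified extensions for the alpha filter) are left to the caller.

```python
def looks_like_xml(content: str) -> bool:
    """XML filter: flag files whose first 100 characters contain an XML declaration."""
    return "<?xml version=" in content[:100]


def alpha_fraction(content: str) -> float:
    """Fraction of characters in the file that are alphabetic."""
    if not content:
        return 0.0
    return sum(ch.isalpha() for ch in content) / len(content)


def passes_alpha_filter(content: str, threshold: float = 0.25) -> bool:
    """Alpha filter: keep files with at least 25% alphabetic characters."""
    return alpha_fraction(content) >= threshold
```

A file of numeric tensor data fails the alpha filter, while ordinary source code usually passes. Assembly is a case where the threshold can misfire, which is why the filter was verified per extension.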
HTML We designed a custom HTML filter that targets excessive HTML boilerplate and links. We took into account the ratio of visible text in each file and only kept those files where the visible text makes up at least 20% of the HTML code and has a minimum length of 100 characters.
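One way to approximate the visible-text criterion is with Python's standard `html.parser`. This is a sketch under assumptions, not the custom filter itself: how visible text is extracted (here, all text outside `script` and `style` elements, with whitespace collapsed) and how the ratio is measured (visible characters over total file characters) are guesses made for this example.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes that are not inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def keep_html(content: str, min_ratio: float = 0.2, min_chars: int = 100) -> bool:
    """Keep files whose visible text is >= 20% of the file and >= 100 characters."""
    parser = _VisibleText()
    parser.feed(content)
    parser.close()
    visible = " ".join("".join(parser.parts).split())  # collapse whitespace
    if len(visible) < min_chars:
        return False
    return len(visible) / len(content) >= min_ratio
```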
JSON and YAML JSON and YAML files are naturally more data-heavy than other languages in The Stack. To remove most of the data files, we applied the following filters. For YAML, we kept files with 50–5000 characters, an average line length smaller than 100, a maximum line length smaller than 1000, and more than 50% alphabetic characters. These filters remove around 20% of the files and 90% of the volume. For JSON, we kept files with 50–5000 characters and more than 50% alphabetic characters, which removes around 70% of the files and 98% of the volume.
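The YAML and JSON thresholds above translate into two small predicates. The sketch assumes that "characters" means Python string length and that line lengths exclude the newline; the function names are hypothetical.

```python
def alpha_fraction(content: str) -> float:
    """Fraction of characters in the file that are alphabetic."""
    if not content:
        return 0.0
    return sum(ch.isalpha() for ch in content) / len(content)


def keep_yaml(content: str) -> bool:
    """50-5000 characters, average line < 100, longest line < 1000, > 50% alphabetic."""
    lines = content.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    return (
        50 <= len(content) <= 5000
        and avg_len < 100
        and max_len < 1000
        and alpha_fraction(content) > 0.5
    )


def keep_json(content: str) -> bool:
    """50-5000 characters and > 50% alphabetic characters."""
    return 50 <= len(content) <= 5000 and alpha_fraction(content) > 0.5
```

The alphabetic-fraction test is what removes most of the volume: large numeric arrays and encoded blobs fall well below 50% letters.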
Table 1: Overview of the training data for StarCoder. For the selected programming languages, we show the number of files and data volume after near-deduplication, as well as after filtering. See also Table 2.
Table 2: Overview of the training data for StarCoder. For the selected programming languages, we show the number of files and data volume after near-deduplication, as well as after filtering. See also Table 1.
All Jupyter notebooks were retrieved from The Stack. We transformed Jupyter notebooks into two different datasets: Jupyter – scripts and Jupyter – structured.
Table 3: Overview of the initially collected Jupyter scripts, with the number of files and the percentage.
We used natural language conversations from GitHub issues and pull requests, which were collected as a component of The Stack v1.2. Each conversation consists of a series of events with actions, such as opening the issue, creating a comment, or closing the issue. Each event includes the author’s username, a message, an action, and a creation date. We filtered this data as follows:
Lastly, we would like to point out that we anonymized the usernames in the conversations by replacing them with a participant counter within the conversation. See more details in Sections 4.3 and 5.1.
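The participant-counter anonymization can be sketched as below. The event dictionary layout and the `username_{i}` replacement format are assumptions made for illustration; only the idea of a per-conversation counter comes from the text.

```python
def anonymize_usernames(events: list[dict]) -> list[dict]:
    """Replace each author with a counter assigned in order of first appearance."""
    counters: dict[str, int] = {}
    anonymized = []
    for event in events:
        # A new author gets the next free index; a known author keeps theirs.
        index = counters.setdefault(event["author"], len(counters))
        anonymized.append({**event, "author": f"username_{index}"})
    return anonymized
```

Because the counter is local to a conversation, the same person maps to different placeholders in different issues, while turn-taking within one issue is preserved.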
The Git commit data was gathered from BigQuery and includes only single-file commits of repositories with the same licenses and file extension as used in The Stack (Kocetkov et al., 2022). We removed all repositories from users that opted out of The Stack. The raw dataset is around 4 TB in size. We sampled 50% of the files and filtered the remaining data with heuristics to build a high-quality dataset. We list and describe all filters in Table 4.
The number of line changes in a commit can be very low compared to the file size. To avoid spending too much compute budget on learning to copy the file content, we only used the full file 20% of the time, and for the remaining 80%, sampled a window between 0 and 32 lines around the first and last changed line. The resulting dataset contains 64 GB of commit data.
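The context sampling described above might look like the following sketch. It assumes 0-based inclusive line indices and that the margins before the first and after the last changed line are drawn independently and uniformly from 0 to 32; the text does not specify either detail.

```python
import random


def select_context(lines, first_changed, last_changed, rng, p_full=0.2, max_margin=32):
    """Return the full file with probability p_full, else a window around the change."""
    if rng.random() < p_full:
        return lines
    before = rng.randint(0, max_margin)  # lines kept above the first changed line
    after = rng.randint(0, max_margin)   # lines kept below the last changed line
    start = max(0, first_changed - before)
    end = min(len(lines), last_changed + after + 1)
    return lines[start:end]
```

For a small edit in a long file, the window keeps the changed region plus at most 64 surrounding lines, so most of the training signal comes from the change rather than from copying unchanged content.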
We followed the deduplication pipeline from Ben Allal et al. (2023), which consists of calculating the MinHashes (Broder, 2000) of all source code files, followed by Locality Sensitive Hashing (LSH) to map similar code files to the same bucket. We used 5-grams and a Jaccard similarity of 0.7. See this blogpost for more details regarding the pipeline.
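To make the MinHash and LSH steps concrete, here is a small pure-Python sketch. It is not the pipeline from Ben Allal et al. (2023): whitespace-token 5-grams, 64 SHA-1-based hash functions, and 16 bands of 4 rows are assumptions chosen for readability.

```python
import hashlib
from collections import defaultdict


def shingles(text: str, n: int = 5) -> set[str]:
    """Set of whitespace-token n-grams (assumed tokenization)."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def minhash(shingle_set: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per seeded hash function; equal minima estimate Jaccard."""
    def h(seed: int, s: str) -> int:
        digest = hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little")
    return [min(h(seed, s) for s in shingle_set) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions on which two documents agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signatures: dict[str, list[int]], bands: int = 16) -> dict:
    """Documents that share all rows of any band land in the same bucket."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    return buckets


def candidate_pairs(buckets: dict) -> set[tuple[str, str]]:
    """All pairs of documents that share at least one bucket."""
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Candidate pairs from the buckets would then be checked against the 0.7 threshold with `estimated_jaccard` (or the exact `jaccard`), and one file from each confirmed near-duplicate group kept.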
We applied this near-deduplication process to all programming languages and the Jupyter notebooks. However, due to time constraints, we could not apply this procedure to Git commits. Additionally, we deemed it unlikely to discover duplicates in GitHub issues, so we did not apply the process to them.
7 The lid.176.bin version of this language identification model: https://fasttext.cc/docs/en/language-identification.html
Table 4: Git commit filters.
There were several discussions within the community about whether to up-sample or down-sample certain programming languages, as the amount of compute budget allocated to a data source in a given language can significantly affect the model’s performance in that language. However, we realized that the largest amount of available data comes from popular programming languages and would, therefore, benefit a larger group of end-users. Moreover, after the deduplication process, we found that several high-resource programming languages, such as C, C++, C#, Java, JavaScript, Python, and PHP, had a similar amount of data ranging from 44–87 GB. This further reinforced our belief that we did not need to drastically re-weight the existing data distribution. Thus, in this work, we followed the natural distribution of data during training and sampled data sources proportionally to their volume. However, we did make an exception for JSON, YAML, and CSS, as we only want the LLM to learn the data format without wasting compute resources on memorizing the data in such files. For that reason, we re-weighted the volume of the data source to 1 GB for JSON and YAML and 3 GB for CSS.
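Proportional sampling with the three overrides can be expressed in a few lines. The function name and the dictionary interface are assumptions for this sketch; only the override values of 1 GB for JSON and YAML and 3 GB for CSS come from the text.

```python
# Override volumes in GB taken from the text above.
OVERRIDES_GB = {"json": 1.0, "yaml": 1.0, "css": 3.0}


def sampling_weights(volumes_gb: dict[str, float],
                     overrides_gb: dict[str, float] = OVERRIDES_GB) -> dict[str, float]:
    """Sample each source proportionally to its (possibly overridden) volume."""
    effective = {name: overrides_gb.get(name, vol) for name, vol in volumes_gb.items()}
    total = sum(effective.values())
    return {name: vol / total for name, vol in effective.items()}
```

For example, with hypothetical volumes of 60 GB of Python, 39 GB of JSON, and 10 GB of CSS, the effective volumes become 60, 1, and 3 GB, so Python is sampled with weight 60/64.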
This section outlines our efforts to remove Personally Identifiable Information (PII) from the training data. In Section 4.1, we first describe how we collected a large set of PII annotations. We used these annotations to explore various techniques to train a PII detection model in Section 4.3, building on top of the encoder model we developed in Section 4.2.
We utilized the Toloka platform to engage 1,399 crowd-workers from 35 countries in annotating a dataset for PII in source code. On average, participants completed 206 tasks, earned about $27, and worked 3.1 hours. Our goal was to identify PII in various forms, such as names, usernames, emails, IP addresses, keys, passwords, and IDs. To ensure that crowd-workers received fair compensation, we established an hourly pay rate of $7.30, taking into consideration different minimum wage rates across countries and their corresponding purchasing power. We limited annotation eligibility to countries where the hourly pay rate of $7.30 was equivalent to the highest minimum wage in the US ($16.50) in terms of purchasing power parity. A complete list of countries that participated in the annotation can be found in Table B.1 of Appendix B. Crowd workers in Toloka can do tasks whenever or wherever; there is no obligation to complete a certain task or spend a fixed amount of time on it. Thus, they utilize free choice when working on the tasks. Out of 1,399 crowd workers, 695 filled a survey on task quality, and 519 completed the survey. The average score for the question asking whether the participant would like to contribute to another project like this is 4.92 on a scale of 1–5.
Figure 1: Distribution of programming languages in the annotated PII dataset.
The dataset comprises 12,000 files, each containing approximately 50 lines of code written in 31 programming languages. Figure 1 shows the distribution of programming languages in the dataset. To increase the representation of rare PII types, such as keys and IP addresses, 7,100 files were pre-filtered from a larger sample. We utilized the detect-secrets tool with all default plugins activated, along with the regular expressions by Ben Allal et al. (2023) for detecting emails, IPv4 and IPv6 addresses. To prevent biasing the annotation too much towards these detection tools, the remaining 5,100 files were randomly selected from the dataset without pre-filtering.
During annotation, we differentiated between various types of PII based on the specific context in which it appeared. Specifically, we distinguished whether the PII was present in the code’s license header, was used as a placeholder, or constituted confidential data. This categorization was necessary because the PII in license headers is usually provided voluntarily by authors for code attribution and may not require masking. Similarly, placeholders are not real secrets and do not need to be masked. We applied this categorization to names, emails, and usernames. See Table 5 for an overview of all PII entities.
The annotators detected a total of 22,950 PII entities in the dataset. To evaluate the quality of the dataset, we manually inspected 300 files that contained various PII types and calculated the recall and precision for each type, as shown in Table 5. We found that annotating secret IDs was particularly challenging, as the annotators tended to produce many false positives and negatives. As a result, we decided to exclude this category from the PII detection model training.
As part of our PII detection efforts, we trained an encoder-only model (i.e., bi-directionally self-attentive Transformers) that can be efficiently fine-tuned for both code- and text-related tasks. We used the Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) objectives from BERT (Devlin et al., 2019; Liu et al., 2019) and predicted masked-out tokens from an input sentence and whether a pair of sentences occur as neighbors in a document.
We separate code snippets in the input as follows: [CLS] Snippet-1 [SEP] Snippet-2, where the two code snippets are selected randomly, either from the same source file or from two distinct documents. For the MLM loss, we mask tokens in the input independently with a probability of 15%. For the NSP loss, we use a linear classifier applied to the representation output at the [CLS] token. We train for 100,000 steps with a global batch size of 4,096 sequences of a maximum length of 1,024 so that approximately 400B tokens are observed. This takes roughly two days using 64 NVIDIA A100 GPUs. Details about the model architecture are reported in Table 6.
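The input layout, the independent 15% masking, and the token budget can be checked with a short sketch. It makes simplifying assumptions: tokens are plain strings, special tokens are never masked, and every selected token is replaced by a `[MASK]` placeholder (BERT's original random and keep replacements are not modelled, since the text does not say whether they were used).

```python
import random

SPECIAL = ("[CLS]", "[SEP]")


def build_input(snippet_1: list[str], snippet_2: list[str]) -> list[str]:
    """Lay out a snippet pair as [CLS] Snippet-1 [SEP] Snippet-2."""
    return ["[CLS]"] + snippet_1 + ["[SEP]"] + snippet_2


def mask_tokens(tokens, rng, p=0.15, mask_token="[MASK]"):
    """Mask each non-special token independently with probability p.

    Returns the masked sequence and per-position labels (None where unmasked).
    """
    masked, labels = [], []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < p:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Upper bound on tokens seen: steps x sequences per batch x maximum length.
TOKENS_SEEN = 100_000 * 4_096 * 1_024  # about 4.2e11, i.e. roughly 400B
```

The product is an upper bound because sequences shorter than the maximum length contribute fewer tokens.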
Table 6: Model architecture of StarEncoder.
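The input construction and independent masking described above can be sketched as follows. This is a minimal illustration rather than the training code: the `[MASK]` token name, the choice to never mask `[CLS]` and `[SEP]`, and always substituting `[MASK]` (instead of BERT's mixed replacement scheme) are assumptions made for brevity.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}  # assumption: special tokens are never masked


def build_pair(snippet_1, snippet_2):
    """Join two tokenized snippets as [CLS] Snippet-1 [SEP] Snippet-2."""
    return ["[CLS]", *snippet_1, "[SEP]", *snippet_2]


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Mask each non-special token independently with probability mask_prob.

    Returns the corrupted sequence and the MLM targets
    (None at positions that do not contribute to the loss).
    """
    rng = rng or random.Random()
    corrupted, targets = [], []
    for tok in tokens:
        if tok not in SPECIAL_TOKENS and rng.random() < mask_prob:
            corrupted.append("[MASK]")  # hypothetical mask token name
            targets.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets


pair = build_pair(["def", "f", "(", ")", ":"], ["return", "1"])
corrupted, targets = mask_tokens(pair, rng=random.Random(0))
```

Each position is either left untouched (target `None`) or replaced by the mask token with the original token kept as the prediction target.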
We fine-tuned StarEncoder on the annotated PII dataset for the Named Entity Recognition (NER) task. We added a linear layer as a token classification head on top of the model, with 6 target classes: names, emails, keys, passwords, IP addresses, and usernames. We excluded IDs due to low annotation quality and did not differentiate between the categorization of PII entities (license headers, placeholders) because of the model’s poor performance in distinguishing them. We split the dataset into a training set of 7,878 examples and a test set of 4,000 examples, ensuring that both splits have a balanced representation of the different PII types. See Table 7. We make the training and evaluation splits available under gated access at https://hf.co/BigCode/
Fine-tuning baseline We fine-tune StarEncoder on the PII training set, and 400 annotated files from Ben Allal et al. (2023). We achieve F1 scores of more than 90% on names, emails, and IP addresses and 73.39% on passwords. The model’s performance is comparatively low on keys and usernames, with F1 scores of only 56.66% and 59.39%, respectively. We attribute the low performance on keys to the limited number of labels for this type of PII, as only 308 instances were available. For usernames, we observed the model often confused them with decorators and values in paths. This is most likely because we annotated usernames inside links for social media platforms.
Table 8: Comparing PII detection performance: Regular Expressions, NER Pipeline with Annotated Data, and NER Pipeline with Annotated Data + Pseudo-Labels
Pseudo-labels To improve the detection of key and password entities, we employed a pseudo-labeling technique as described by Lee (2013). This method involves training a model on a small set of labeled data and subsequently generating predictions for a larger set of unlabeled data. Specifically, we annotated 18,000 files using an ensemble of two encoder models, which were fine-tuned on the 400-file PII dataset from Ben Allal et al. (2023). To identify reliable pseudo-labels, we calculated the average probability logits from our models and applied filtering criteria. Specifically, we set a minimum threshold of 0.5 for all entities, except for names and usernames, for which we used a higher threshold of 0.6. However, upon reviewing the results, we found a significant number of false positives for keys and passwords. As a result, we decided to only retain entities that were preceded by a trigger word, such as key, auth, or pwd, within the preceding 100 characters. Training on this synthetic dataset before fine-tuning on the annotated one yielded superior results for all PII categories, as demonstrated in Tables 8 and 9. Only the performance for detecting usernames did not show significant improvement, so we decided to exclude it from the PII redaction process.
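The filtering rules for pseudo-labels can be sketched as below. The entity dictionary layout, the type names, and the use of a simple substring search for trigger words are assumptions; only the thresholds, the 100-character window, and the three example trigger words come from the description above.

```python
# Example trigger words named in the text; the complete list is not given there.
TRIGGER_WORDS = ("key", "auth", "pwd")
HIGH_THRESHOLD_TYPES = {"NAME", "USERNAME"}  # stricter threshold of 0.6
TRIGGER_TYPES = {"KEY", "PASSWORD"}          # must follow a trigger word


def keep_pseudo_label(text, entity):
    """Decide whether a predicted entity is kept as a pseudo-label.

    entity is a dict with 'type', 'start', 'end' (character offsets) and
    'score' (the averaged ensemble probability); this layout is hypothetical.
    """
    threshold = 0.6 if entity["type"] in HIGH_THRESHOLD_TYPES else 0.5
    if entity["score"] < threshold:
        return False
    if entity["type"] in TRIGGER_TYPES:
        # Look for a trigger word in the 100 characters before the entity.
        window = text[max(0, entity["start"] - 100):entity["start"]].lower()
        return any(word in window for word in TRIGGER_WORDS)
    return True
```

Under these rules a confident key prediction is still dropped when nothing resembling `key`, `auth`, or `pwd` appears shortly before it, which is what targets the false positives mentioned above.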
Comparison against regex baseline We compared our PII detection models against the regular expressions (regexes) employed in Ben Allal et al. (2023). The regexes only support the detection of emails, IP addresses, and keys. Note that we enhanced the email regex, as explained in the Appendix, to address false positives we found during the evaluation on this benchmark. This modification boosted the F1 score of the regex from 81.8% to 96.83%. Nevertheless, our PII detection models still surpassed the regex approach in detecting all three entities, as shown in Table 8. We note that the performance difference was especially large on keys and found that the detect-secrets tool generated many false positives, especially in specific programming languages like Go and C-sharp that weren’t well represented in the regex evaluation. Consequently, the overall precision of the tool was below 4%.
Post-processing Before applying the best PII detection model to the full dataset, we observed a couple of frequent detection errors. We added the following post-processing techniques to reduce the number of false positives:
Table 9: Comparison of PII detection performance: NER Pipeline with Annotated Data vs. Annotated Data + Pseudo-Labels
PII placeholders We replaced the detected PII entities with the following tokens:
<NAME>, <EMAIL>, <KEY>, <PASSWORD>
To mask IP addresses, we randomly selected an IP address from 5 synthetic, private, non-internet-facing IP addresses of the same type that can be found in Appendix C.
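A minimal sketch of the replacement step, assuming the detector returns character spans with a type label. The entity layout and the five IPv4 addresses below are hypothetical stand-ins (the actual replacement addresses are those listed in Appendix C), and only IPv4 is handled here.

```python
import random

PLACEHOLDERS = {
    "NAME": "<NAME>",
    "EMAIL": "<EMAIL>",
    "KEY": "<KEY>",
    "PASSWORD": "<PASSWORD>",
}
# Hypothetical private IPv4 addresses used only for illustration.
SYNTHETIC_IPV4 = ["192.168.0.1", "10.0.0.1", "172.16.0.1", "192.168.1.1", "10.1.1.1"]


def redact(text, entities, rng=None):
    """Replace detected spans with placeholders, or a synthetic IP address."""
    rng = rng or random.Random()
    # Process spans right to left so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["type"] == "IP_ADDRESS":
            replacement = rng.choice(SYNTHETIC_IPV4)
        else:
            replacement = PLACEHOLDERS.get(ent["type"])
        if replacement is None:
            continue  # types without a placeholder (e.g. usernames) are left as-is
        text = text[:ent["start"]] + replacement + text[ent["end"]:]
    return text
```

Replacing from the end of the file backwards avoids recomputing offsets after each substitution.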
GitHub issues We already employed a regex approach to detect keys, IP addresses, and emails in the GitHub issues, so we only used the PII detection model to redact names. We anonymized the usernames of the authors by replacing them with a participant counter within the conversation, e.g. username_1 to refer to the second participant (see Section 5.1 for formatting details). We prepend these pseudonyms to the beginning of each comment such that we preserve the speaker identity of the author. In addition, we redact all mentions of these usernames in the messages. Note that we only mask the usernames of active participants in the conversation and mentions of non-participating users are not anonymized.
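The username anonymization can be illustrated with the following sketch. The input layout (a list of author and comment pairs) and the whole-word matching rule for mentions are assumptions; the counter-based pseudonyms and the decision to leave non-participants untouched follow the description above.

```python
import re


def anonymize_issue(comments):
    """Replace participant usernames with username_<counter> pseudonyms.

    comments is a list of (author, body) tuples in conversation order.
    Returns a list of (pseudonym, redacted_body) tuples.
    """
    pseudonyms = {}
    for author, _ in comments:
        # The counter follows the order of first appearance, starting at 0.
        pseudonyms.setdefault(author, f"username_{len(pseudonyms)}")
    if not pseudonyms:
        return []
    # Match participant names as whole words only (assumed matching rule).
    names = sorted(pseudonyms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(?:" + "|".join(re.escape(n) for n in names) + r")(?![\w-])"
    )
    return [
        (pseudonyms[author], pattern.sub(lambda m: pseudonyms[m.group(0)], body))
        for author, body in comments
    ]


thread = [
    ("alice", "Crash on start"),
    ("bob", "@alice can you share logs? cc @carol"),
    ("alice", "Sure, bob."),
]
anonymized = anonymize_issue(thread)
```

In this example `carol` never comments, so the mention of `@carol` is left unchanged, while mentions of `alice` and `bob` are replaced by their pseudonyms.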
Compute resources We used the PII detection model to identify PII across all programming languages in the training dataset, including GitHub issues (names only), Git commits, and Jupyter notebooks. The total dataset amounts to 815 GB in size. We ran inference on multiple NVIDIA A100 80 GB GPUs, which required 800 GPU-hours.
This section presents information on the training process of the StarCoder models. Before we proceed, we first clarify the differences between the two models:
Throughout the following, we show how we formatted the training data (Section 5.1), decontaminated the training data (Section 5.2), and provide details regarding the tokenizer (Section 5.3), the model architecture (Section 5.4), the training process (Section 5.5), multi-node GPU setup (Section 5.6), and CO2 emissions (Section 5.7).
We present the formatting guidelines for each of the data sources below. We provide the templates below in which
Code We prepend the repository name, file name, and the number of stars to the context of the code file. To not overfit on the exact number of stars, we categorized GitHub stars into five buckets: 0, 1–10, 10–100, 100–1000, 1000+. To enable the model to operate without this metadata during inference, we prefixed the repository name, filename, and stars independently at random, each with a probability of 0.2.
<reponame>reponame<filename>filename<gh_stars>stars\ncode<|endoftext|>
To the source code in this template (i.e. code), we apply the fill-in-the-middle transformation (FIM; Bavarian et al., 2022). More precisely, we apply FIM at the character level to the source code files with a FIM rate of 0.5, and use PSM mode with probability 0.5 and SPMv2 mode with probability 0.5.
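The code template and the FIM transformation can be sketched together as below. Several details are assumptions rather than statements from the text: the FIM sentinel names (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`), the exact PSM and SPMv2 layouts, uniform sampling of the two cut points, the handling of bucket boundaries (the listed ranges overlap at 10, 100, and 1000), and emitting the newline only when some metadata is present.

```python
import random


def star_bucket(stars):
    """Map a star count to one of the five buckets (boundary handling assumed)."""
    if stars == 0:
        return "0"
    if stars <= 10:
        return "1-10"
    if stars <= 100:
        return "10-100"
    if stars <= 1000:
        return "100-1000"
    return "1000+"


def apply_fim(code, rng, fim_rate=0.5, psm_prob=0.5):
    """Character-level fill-in-the-middle with PSM or SPMv2 ordering."""
    if len(code) < 2 or rng.random() >= fim_rate:
        return code  # leave the document in left-to-right order
    lo, hi = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:lo], code[lo:hi], code[hi:]
    if rng.random() < psm_prob:
        # PSM: prefix, suffix, then the middle to be predicted.
        return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
    # SPMv2 (assumed layout): suffix first, then prefix and middle contiguously.
    return f"<fim_prefix><fim_suffix>{suffix}<fim_middle>{prefix}{middle}"


def format_code_file(code, repo, filename, stars, rng, meta_prob=0.2, fim_rate=0.5):
    """Build one training document following the code template."""
    header = ""
    if rng.random() < meta_prob:
        header += f"<reponame>{repo}"
    if rng.random() < meta_prob:
        header += f"<filename>{filename}"
    if rng.random() < meta_prob:
        header += f"<gh_stars>{star_bucket(stars)}"
    body = apply_fim(code, rng, fim_rate=fim_rate)
    return header + ("\n" if header else "") + body + "<|endoftext|>"
```

Because each metadata field is sampled independently, most documents carry no metadata at all, which is what lets the model run without it at inference time.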
Issues We use sentinel tokens to mark the opening of an issue and subsequently include its title. We separate the sequence of comments by a
<issue_start>Title: title\nusername_id0:comment0<issue_comment>username_id1:comment1 ... <issue_closed (optional)><|endoftext|>
<jupyter_start><jupyter_text>text0<jupyter_code>code0 <jupyter_output>output0<jupyter_text> ... <|endoftext|>
Git commits We separate the code before the commit, the commit message, and the code after the commit with sentinel tokens. As explained in Section 3.4, we use the full files with 20% probability and otherwise use a small window (0-32 lines) around the changed lines.
<commit_before>code_before<commit_msg>message<commit_after>code_after<|endoftext|>
We summarize all sentinel tokens in Table 10.
The code training data was decontaminated by removing files that contained docstrings or solutions from HumanEval and MBPP, docstrings from APPS, questions from GSM8K, or prompts from DS-1000. (These benchmarks are further described in Section 6.) To give an indication of the amount of data removed by decontamination, Python is the language with the highest number of matches, with 558 files removed.
Table 10: Overview of the sentinel tokens.
The model’s tokenizer follows our insights presented in Ben Allal et al. (2023) and uses those same design choices: we use the Hugging Face Tokenizers library (MOI et al., 2022) to train a byte-level Byte-Pair-Encoding with a vocabulary size of 49,152 tokens, including the sentinel tokens from Table 10. The pre-tokenization step includes a digit-splitter and the regex splitter from the GPT-2 pre-tokenizer.
We trained a 15.5B parameter model with the same architecture as SantaCoder (Ben Allal et al., 2023). It is a decoder-only Transformer with Multi-Query-Attention (MQA; Shazeer, 2019), and learned absolute positional embeddings. We also apply Fill-in-the-Middle (FIM; Bavarian et al., 2022) transformations to the training data, see Section 5.1. We used FlashAttention (Dao et al., 2022) to speed up the attention computation and reduce its memory footprint, allowing us to scale to an 8K context length. To make FlashAttention work with MQA during training, we simply expand the key and value before calling the attention kernel. The architecture hyper-parameters are given in Table 11. In addition, we have included the hyperparameters of SantaCoder (Ben Allal et al., 2023) for comparison.
StarCoderBase The model was trained for 250k iterations, with a batch size of 4M tokens, for a total of one trillion tokens. We used Adam (Kingma & Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-8}$ and a weight decay of 0.1. The learning rate followed a cosine decay from $3 \times 10^{-4}$ to $3 \times 10^{-5}$ after a linear warmup of 2,000 iterations.
StarCoder Starting from StarCoderBase, we fine-tuned a Python variant of the model for 2 epochs on the Python subset of the training data. We used the same settings as StarCoderBase, except that we used a learning rate of $5 \times 10^{-5}$ and decayed it to $5 \times 10^{-6}$ after 1,000 iterations of linear warmup. We trained for 8,500 steps.
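The StarCoderBase learning-rate schedule can be written down as a short function. This is a sketch under two assumptions that the text does not spell out: the warmup rises linearly from zero, and the cosine decay spans all remaining iterations, reaching the minimum exactly at the final step.

```python
import math


def learning_rate(step, max_lr=3e-4, min_lr=3e-5, warmup=2_000, total=250_000):
    """Linear warmup followed by cosine decay from max_lr to min_lr."""
    if step < warmup:
        return max_lr * step / warmup  # assumed to start from zero
    progress = min((step - warmup) / (total - warmup), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With these defaults the rate peaks at the end of warmup and sits halfway between the two bounds at the midpoint of the decay phase.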
Table 11: Model architecture of StarCoder. We also include SantaCoder (prior work by the community).
We trained our model on a GPU cluster with 512 A100 80 GB GPUs distributed across 64 nodes. We partitioned the model with a 3D-parallel layout that shards the model with both tensor and pipeline parallelism rank 4, requiring 16 GPUs (two nodes) for one replica. To fully leverage the cluster’s capabilities, we used 32-fold data parallelism. To optimize GPU utilization and reduce idle compute bubbles, we maintained a micro-batch size of 1 and accumulated for 16 steps, resulting in a global batch size of 512 (equivalent to 4M tokens). We used Megatron-LM’s distributed optimizer because we found that it leads to slightly higher throughput in this configuration. Since it requires the gradient reduction step in FP32, the training in BF16 leads to 10% lower throughput than FP16, but we used it anyway to avoid training instabilities.
Except for a few restarts, we did not experience significant training instabilities.
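The parallel layout and batch size follow from simple arithmetic, reproduced here as a check. The only assumption is that the 8K context length corresponds to 8,192 tokens per sequence.

```python
# Cluster and parallelism settings taken from the description above.
total_gpus = 512
nodes = 64
tensor_parallel = 4
pipeline_parallel = 4
micro_batch_size = 1
grad_accumulation_steps = 16
sequence_length = 8192  # assumption: "8K" context means 8,192 tokens

gpus_per_node = total_gpus // nodes
gpus_per_replica = tensor_parallel * pipeline_parallel
data_parallel = total_gpus // gpus_per_replica

# Sequences processed per optimizer step across all replicas.
global_batch_size = micro_batch_size * grad_accumulation_steps * data_parallel
tokens_per_batch = global_batch_size * sequence_length
```

One replica occupies 16 GPUs (two 8-GPU nodes), leaving 32 data-parallel replicas, and 512 sequences of 8,192 tokens give roughly 4.2 million tokens per step.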
StarCoderBase We report the carbon footprint (Lacoste et al., 2019) of training StarCoderBase. Based on the total number of GPU hours that training took (320,256) and an average power usage of 280W per GPU, this adds up to 89671.68 kWh of electricity consumed during the training process. Multiplied by the carbon intensity of the energy of the us-west-2 AWS location (0.15495 kgCO2e per kWh) and the average Power Usage Effectiveness of 1.2 across AWS datacenters, this results in 16.68 tonnes of CO2eq emitted.
StarCoder The fine-tuned model adds 3.5% of training time, which translates to an additional estimated emission of 0.58 tonnes of CO2eq.
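The emission estimate is a product of four quantities, which the sketch below recomputes from the figures quoted above. The function name is illustrative.

```python
def training_emissions_tonnes(gpu_hours, watts_per_gpu, kg_co2e_per_kwh, pue):
    """Return (energy in kWh, emissions in tonnes of CO2eq)."""
    energy_kwh = gpu_hours * watts_per_gpu / 1000.0
    emissions_kg = energy_kwh * kg_co2e_per_kwh * pue
    return energy_kwh, emissions_kg / 1000.0


energy_kwh, base_tonnes = training_emissions_tonnes(
    gpu_hours=320_256,
    watts_per_gpu=280,
    kg_co2e_per_kwh=0.15495,
    pue=1.2,
)
# Fine-tuning adds 3.5% of the training time.
finetune_tonnes = 0.035 * base_tonnes
```

This reproduces the 89,671.68 kWh energy figure and yields approximately 16.7 tonnes for StarCoderBase and about 0.58 tonnes for the fine-tuning run, in line with the reported values.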
In this section, we first outline the models we evaluated in addition to StarCoder and StarCoderBase. Then we report on the Python language performance of all models on the HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and DS-1000 (Lai et al., 2022) evaluation benchmarks. Then we cover multi-language evaluation using a variety of benchmarks and tasks.
A Code LM Evaluation Harness To enable reproducible and centralized evaluation of StarCoder and other Code LLMs, we developed a Code LM Evaluation Harness (Ben Allal et al., 2022), inspired by the LM Evaluation-Harness (Gao et al., 2021b). This harness provides a framework for the efficient evaluation of code models, utilizing data parallelism and Docker containers for execution. It supports several benchmarks, including HumanEval, MultiPL-E, and DS-1000.
Other Models Evaluated We compare StarCoder and StarCoderBase to the following models.
Table 12: Comparing StarCoder’s performance (pass@1) on the HumanEval and MBPP Python benchmarks with several other models. StarCoder and StarCoderBase obtain the highest performance of open-access models, and comparable performance to the code-cushman-001 closed-access model.
In this section, we evaluate the performance of StarCoder on Python, comparing it to both open-access and closed-access models. We first report performance on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), which are two widely used benchmarks of Python performance. We also measure performance on DS-1000 (Lai et al., 2022), a code completion benchmark of 1,000 Python data science problems based on StackOverflow questions.
HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) are widely used benchmarks for Code LLMs consisting of hundreds of Python programming problems that use test cases to validate the code produced by a Code LLM. Code LLMs generate code by sampling from their output distribution. We report performance using the pass@k metric (Chen et al., 2021): the total fraction of benchmark problems solved, where a problem is considered solved if any one of k code samples passes every test case. Like Chen et al. (2021), we use sampling temperature 0.2 for pass@1, and temperature 0.8 for k > 1. We generate n = 200 samples for all experiments with open-access models. For API models, we use n = 20 samples, which is enough to estimate pass@1. We focus on the simplest version of pass@k, which is pass@1: the likelihood that a problem is solved in a single attempt by the model.
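For reference, pass@k is commonly computed per problem with the unbiased estimator introduced alongside the metric by Chen et al. (2021), where n samples are drawn and c of them pass all tests. The sketch below assumes that estimator is the one used here; the benchmark score is then the mean over problems.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased per-problem estimate of pass@k from n samples with c correct.

    Computes 1 - C(n - c, k) / C(n, k), the probability that a random
    subset of k samples contains at least one correct sample.
    """
    if n - c < k:
        return 1.0  # every subset of size k contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, which is why a modest number of samples is enough to estimate pass@1.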
12 There had been a code-cushman-002, but it is not available at the time of writing.
Table 13: Performance of open-access and closed-access models on DS-1000. Benchmarks are as follows. All models evaluated at temperature=0.2, top_p=0.5, max_length=1024. Scores reflect mean pass@1 accuracy averaged over 40 samples. ∗: Matplotlib task does not have right sided context, so insertion and completion formats are identical.
Table 12 compares StarCoder (and StarCoderBase) on HumanEval and MBPP to several open-access and closed-access models:
A major limitation of HumanEval and MBPP is that they are simple programming puzzles that are not representative of the code that most programmers write. In contrast, the DS-1000 benchmark (Lai et al., 2022) has a suite of 1,000 realistic and practical data science workflows across seven libraries and evaluates generations in execution against test cases.
DS-1000 supports two evaluation modes: completion and insertion (via FIM). We report completion scores for all models but insertion scores only for models that support it: the StarCoder models and InCoder-6B (Fried et al., 2022). DS-1000 also categorizes problems based on the libraries used: Matplotlib, NumPy, Pandas, SciPy, Scikit-Learn, PyTorch, and TensorFlow. We report pass@1 for each library and an overall score in Table 13 and draw the following conclusions:
Our previous evaluations focus either on closed domains (i.e., primarily built-in Python functions, as in MBPP and HumanEval) or specific domains (e.g., data science, as in DS-1000). To evaluate model ability to generate code on a broader set of Python libraries, we use the ODEX benchmark (Wang et al., 2022) containing 505 open-domain and 440 closed-domain Python coding queries, in four natural languages — English, Spanish, Japanese, and Russian — with test-case-based execution evaluation.
We report the pass@1 metric for StarCoder and baseline models, including Codex (code-davinci-001), CodeGen-16B-Mono, and SantaCoder. In addition to the overall execution accuracy, we also categorize problems by languages and domains, which are: (1) queries in the closed-domain (using only built-in Python functions) and open-domain (using functions from imported libraries), and (2) queries with instructions written in English, Spanish, Japanese, and Russian, respectively. We report overall scores and scores in different domains and languages in Table 14 and draw the following conclusions:
Table 14: Performance on the ODEX benchmark by instruction languages and code domains: open problems use libraries, while closed use only built-in Python functions.
In this section, we focus primarily on StarCoderBase, and evaluate its performance on a variety of programming languages and programming tasks, including producing code from natural language descriptions, documenting code, predicting type annotations, and more. This section also shows that StarCoder, despite being fine-tuned on Python, remains a very capable multi-language Code LLM and even outperforms StarCoderBase on some languages.
Table 15: Comparing StarCoder to multi-language open-access (e.g., CodeGen-16B-Multi) and closed-access models (e.g., code-cushman-001) on 19 programming languages. We report pass@1 on HumanEval (Chen et al., 2021), which we translate from Python to the other languages using MultiPL-E (Cassano et al., 2023).
We evaluate the ability of StarCoder to turn natural language into working code in multiple programming languages using MultiPL-E (Cassano et al., 2023), which translates the HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) Python benchmarks into 18 other programming languages as follows.
MultiPL-E has a set of rule-based compilers that translate Python benchmarks to each target programming language. Each compiler expects a benchmark in the HumanEval format: 1) a natural language description (in a docstring), 2) a function signature (name, arguments, and, potentially, types), and 3) a set of hidden assertions. The MultiPL-E compilers translate the function signature, assertions, and docstring (which may have doctests) into a target language. Thus, MultiPL-E gives us a parallel set of benchmarks derived from HumanEval and MBPP to compare model performance across programming languages.¹³ The MultiPL-E languages include both high and low-resource languages, statically and dynamically typed languages, and a variety of other programming language features.
Table 15 shows how these models perform on 19 programming languages, and from it, we draw the following conclusions:
13 The MultiPL-E prompts are slightly different from the original HumanEval and MBPP prompts. For example, in HumanEval, some ad hoc examples in docstrings are reformatted to be doctests so that they can be translated into examples in each target language. MultiPL-E also omits three HumanEval benchmarks that do not fit the above format. These changes have a small impact on pass rates.
Table 16: Performance on the Asleep at the Keyboard security benchmark (Pearce et al., 2022).
There are several other conclusions that we can draw from the table. For example, CodeGen-16B-Multi performs better than one might expect on some languages that are reportedly not in its training set, including C#, Lua, PHP, and TypeScript. Its performance on TypeScript is less surprising since simple JavaScript functions often type-check with TypeScript by design. Similarly, StarCoder shows high performance on Swift, even though it was not included in its training set, as explained in Section 3.1.
A limitation of Code LLMs is that they can generate code with security vulnerabilities (Pearce et al., 2022). The Asleep at the Keyboard benchmark by Pearce et al. (2022) has 89 security-sensitive scenarios across three evaluation axes: (1) Diversity of Weakness (DoW) covers 18 different vulnerability classes in MITRE’s Common Weakness Enumeration (CWE) taxonomy, with scenarios drawn from the 2021 CWE Top 25 Most Dangerous Software Weaknesses list published by MITRE; (2) Diversity of Prompt (DoP) evaluates the model’s sensitivity to variations in the prompt for a single vulnerability class (SQL injection); (3) Diversity of Domain (DoD) contains security scenarios in the hardware description language Verilog. We focus on the DoW, which contains 54 scenarios (25 in C and 29 in Python) across 18 CWEs. We exclude scenarios that lack an automated test, leaving 40 scenarios (23 in C and 17 in Python).
Pearce et al. (2022) had previously evaluated the security of GitHub Copilot (as of August 2021), and in this paper, we use the same methodology to evaluate StarCoderBase, InCoder-6B, CodeGen-16B-Multi, and OpenAI’s code-cushman-001. We use the original benchmarking methodology: generating 25 completions per scenario at temperature 0.2 (1,000 completions per model). The dataset supports fill-in-the-middle, so we include this configuration on models that support it. The results are shown in Table 16; Valid gives the percentage of solutions that were syntactically valid (using py_compile for Python and gcc for C), and Insecure shows the percentage of valid solutions that contained the vulnerability the scenario tests for. From this table, we draw the following conclusions.
The StarCoder models support fill-in-the-middle (FIM), or infilling, which allows the model to generate code conditioned on prefix and suffix code surrounding the insertion point. Only a handful of recent models support FIM: models from OpenAI (Bavarian et al., 2022), InCoder (Fried et al., 2022), and our prior work on SantaCoder (Ben Allal et al., 2023). FIM opens up the possibility of a variety of tasks that go beyond left-to-right code completion. We evaluate StarCoderBase on four established FIM benchmarks below.
Table 17: Performance on single-line fill-in-the-middle on the FIM benchmark by Ben Allal et al. (2023).
Table 18: Accuracy of Python return type prediction, using Fried et al. (2022)’s adaptation of the Pradel et al. (2020) benchmarks. We report both the overall F1 scores, which include trivial None-type prediction, and the F1 score for non-None types.
Single-Line Infilling for Python, Java, and JavaScript Fried et al. (2022) present a single-line fill-in-the-middle task for Python that masks one line of code from a HumanEval solution and scores the model’s ability to complete the function. They turn every HumanEval solution into several fill-in-the-middle problems by masking each non-blank, non-comment line of code in the solution body in turn. Ben Allal et al. (2023) generalize this benchmark to also support Java and JavaScript, using model-generated solutions from MultiPL-E’s translations. We compare the performance of StarCoderBase, SantaCoder, and InCoder on this task, evaluating using line exact match (Table 17). StarCoderBase significantly outperforms the two smaller models.
Python Return Type Prediction Pradel et al. (2020) introduce methods and datasets for evaluating Python type annotations. Fried et al. (2022) adapt and filter one dataset from this work, consisting of Python functions from GitHub, and use it to evaluate infilling models on function return type prediction. We use this dataset to compare StarCoder, StarCoderBase, and SantaCoder to InCoder on function return type prediction. Our setup follows Fried et al. (2022): each model uses greedy generation to infill return types while conditioning on the imports, body, and signature for each function. We report exact match accuracy on normalized annotations for all functions in the evaluation set and only those with non-None annotations, following Fried et al. (2022). We find that StarCoder and StarCoderBase outperform existing approaches at Python return type prediction (Table 18). However, we note that as the functions in this evaluation set were taken from GitHub repositories, they may overlap with the training data for SantaCoder and the StarCoder models.
TypeScript Type Prediction Yee & Guha (2023) evaluate approaches to neural type prediction for TypeScript. However, instead of measuring accuracy, they argue that benchmarks should measure how many projects or files do not have type errors with predicted types. This approach makes it possible to evaluate type prediction for JavaScript programs that have never been translated to TypeScript, which reduces the likelihood of dataset contamination. We add StarCoderBase to their evaluation framework and compare it to InCoder, which performs best at type prediction in the original work. Table 19 shows that StarCoderBase outperforms InCoder: (1) it produces more packages that type check, (2) across all packages, it produces more files that type check, and (3) it produces fewer trivial type annotations than InCoder.
Table 19: TypeScript type prediction performance using the dataset and methodology from Yee & Guha (2023). We only evaluate JavaScript packages that have never been translated to TypeScript and compare StarCoder to InCoder, the best-performing model by Yee & Guha (2023). StarCoder outperforms InCoder in several ways.
Table 20: Performance on the Python portion of the CodeXGLUE Code Summarization task, evaluating function docstring generation. Models are evaluated zero-shot using their infilling capability.
Python Docstring Generation To evaluate models’ ability to generate documentation for functions, we use the Python subset of the CodeXGLUE code summarization benchmark (Lu et al., 2021). This benchmark is constructed from the CodeSearchNet dataset (Husain et al., 2019), containing functions from public GitHub repositories. Models infill the documentation string (docstring) for each function using greedy decoding, conditioned on the function signature and body. We follow the evaluation scheme of past work: docstrings are evaluated using smoothed 4-gram BLEU (Papineni et al., 2002) against the reference docstring from the original function, using only the first lines of the generated and reference docstrings (removing, e.g., descriptions of function arguments and return types that may appear in later lines). In Table 20, we see that StarCoder and StarCoderBase obtain higher performance than past work on docstring generation. However, we note that there may be an overlap between this evaluation dataset and the data used to train SantaCoder and the StarCoder models.
We evaluate the performance of StarCoderBase at several training checkpoints after every 200B tokens seen out of the total 1000B. Figure 2 (right) shows how performance (pass@1) changes during training for each programming language supported by MultiPL-E. The performance curve for several high-resource programming languages suggests that training longer is likely to improve their performance further.
However, some of the low-resource languages see limited improvement during training or even have a pass@1 decline. For example, R’s pass@1 rate drops significantly between the 800B and 1000B (final) checkpoints. The dependence of pass@1 on data size (Figure 2, left) further supports the hypothesis that this is related to the amount of data available. The slope of the linear fit increases between 800B and 1000B checkpoints while the intercept decreases, i.e., performance improves only for languages with large enough amounts of data (≳ 1 GB).
We manually inspected the R completions generated by the model over several checkpoints to better understand model performance. One might hypothesize that some problems are harder than others, and so the model gains and loses the ability to solve them in R over the 600B, 800B, and 1000B checkpoints, but we find that this is not the case. Instead, we find significant variance in per-problem success rates for several problems (Table D.3). For these problems, the pass rate between different checkpoints varies in what appears to be a completely uncorrelated manner. Moreover, manual inspection shows that the failures are caused by minor mistakes, e.g., not taking the absolute value when computing GCD, not converting a string to a character array, or not checking edge cases.
Figure 2: Performance (pass@1) of StarCoderBase at several training checkpoints by data size (left) and by programming language (right). The lines in the left plot are a linear fit between pass@1 and log-dataset-size for all the points except the leftmost one, where we expect the linear dependence to break due to transfer learning (dashed line). The goodness of fit ranges from $R^2 = 0.399$ for the 600B checkpoint to $R^2 = 0.510$ for the 1000B checkpoint.
StarCoderBase was trained with an 8K token window, allowing conditioning on and generating long code files. To evaluate the ability of the model to benefit from this larger context, we compare its perplexity (Bahl et al., 1983) when using a full window size of 8K tokens versus a window size of 2K tokens (as used in many prior code models).
To ensure no overlap between the training data for StarCoderBase and the perplexity computation data, we downloaded 10 GNU Public License (GPL) repositories from GitHub in each of the languages in Table 21. We compiled all files from the repositories into a single document for each language. We then divided these documents into 8K token chunks and computed perplexity on the last 1K tokens in each chunk¹⁴ in two conditions: (1) the model window only contains the final 2K tokens in the chunk (i.e., the 1K being predicted and the previous 1K), and (2) the model window contains all 8K tokens in the chunk (i.e., the 1K tokens being predicted and the previous 7K). This evaluates the ability of the model to benefit from additional file- and repo-level context when predicting code. In Table 21, we report the average perplexity of the 1K token regions across all chunks. We see that StarCoderBase indeed benefits from the extra token conditioning afforded by its 8K context window, with substantially lower perplexities across all languages.
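The two evaluation conditions differ only in where the model window starts, which the following sketch makes explicit. It assumes that 8K, 2K, and 1K mean 8,192, 2,048, and 1,024 tokens, that a trailing partial chunk is dropped, and that perplexity is the exponential of the mean negative natural-log likelihood; scoring tokens with an actual model is left out.

```python
import math


def eval_windows(num_tokens, chunk=8192, eval_len=1024, context=2048):
    """Yield (context_start, eval_start, eval_end) token offsets per full chunk.

    The model sees tokens [context_start, eval_end) and is scored only on
    [eval_start, eval_end), the last eval_len tokens of the chunk.
    """
    for chunk_start in range(0, num_tokens - chunk + 1, chunk):
        eval_end = chunk_start + chunk
        eval_start = eval_end - eval_len
        yield (eval_end - context, eval_start, eval_end)


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


short_context = list(eval_windows(16_384, context=2048))
full_context = list(eval_windows(16_384, context=8192))
```

Both conditions score exactly the same token positions, so any difference in perplexity comes from the additional 6K tokens of preceding context.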
Although the StarCoder models are principally developed to be Code LLMs, they have also been trained on a significant amount of natural language text. Roughly 20% of their training tokens are natural language data: 7% GitHub issues, 10% Markdown, 2% Jupyter notebooks, and 4% HTML. In this section, we evaluate StarCoderBase on several natural language tasks: natural language reasoning and understanding tasks that might benefit from the combination of code and text training data; and natural language generation tasks that evaluate the model’s tendencies to produce undesirable text outputs, e.g., in a documentation generation or interactive assistant setting.
14 We evaluate perplexity on the final 1K tokens in each 8K chunk so that both conditions have the same evaluation tokens, and to avoid overly penalizing the 2K condition, as tokens at the beginning of a window tend to have higher perplexity as there is less context available to predict them.
Table 21: Perplexity of StarCoderBase on evaluation regions (of size 1K tokens) when using a window size of 2K or 8K tokens across repositories from 10 languages. The larger window size substantially reduces perplexity, demonstrating a benefit of StarCoder’s 8K token window.
Table 22: 8-shot accuracy on the GSM8K math-reasoning benchmark. Samples are generated with greedy decoding. maj1@k denotes a majority vote over k generations. For the majority vote, we instead generate samples using nucleus sampling with p = 0.95 and temperature 0.7, following Gao et al. (2022). We use “—” when a model was not evaluated on a given metric, or the metric is not supported in Language Model Evaluation Harness. The LLaMA CoT numbers are from Touvron et al. (2023).
Recent work has shown that Code LLMs can be effective arithmetic and symbolic reasoners by using a technique called Program-Aided Language models (PAL; Gao et al., 2022). With PAL, the LLM reads the reasoning problem and generates Python programs as the intermediate reasoning steps, which are then executed by the Python interpreter to produce the answer. In contrast, the Chain-of-Thought method (CoT; Wei et al., 2022) prompts the LLM to produce the reasoning steps in natural language before generating the answer.
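To make the PAL setup concrete, the sketch below executes a few generated programs and takes a majority vote over their answers. The word problem, the sample programs, and the convention that each program stores its result in a variable named `answer` are all hypothetical; real model output should be executed in a sandbox rather than with a bare `exec`.

```python
from collections import Counter


def run_program(source):
    """Execute a generated program and return its `answer` variable.

    Returns None if the program fails to parse or raises an error.
    """
    scope = {}
    try:
        exec(source, scope)  # unsafe for untrusted code; illustration only
    except Exception:
        return None
    return scope.get("answer")


def majority_vote(answers):
    """Return the most frequent non-None answer, or None if there is none."""
    valid = [a for a in answers if a is not None]
    return Counter(valid).most_common(1)[0][0] if valid else None


# Hypothetical generations for the problem:
# "Ann has 3 boxes of 4 pens and gives away 5. How many pens are left?"
samples = [
    "pens = 3 * 4\nanswer = pens - 5",
    "answer = 3 * 4 - 5",
    "answer = 3 + 4 - 5",  # faulty reasoning
    "answer = 3 * (4 - ",  # does not parse
]
answers = [run_program(s) for s in samples]
final_answer = majority_vote(answers)
```

The arithmetic is delegated to the interpreter, so the model only has to produce the right program; voting over several samples then filters out occasional faulty or malformed programs.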
We investigate the reasoning capabilities of StarCoderBase on GSM8K (Cobbe et al., 2021), a set of middle school math word problems. We compare with the two CodeGen-16B models (Nijkamp et al., 2023) and the family of LLaMA models (Touvron et al., 2023). The results of our evaluation are presented in Table 22, where we provide both CoT and PAL results for StarCoderBase and LLaMA.
In line with previous results comparing PAL to CoT on Code LLMs (Gao et al., 2022), we find that StarCoderBase performs better with PAL (21.5%) than with CoT (8.4%). StarCoderBase substantially outperforms CodeGen-16B-Mono and CodeGen-16B-Multi, which achieve 13.1% and 8.6% with PAL, respectively. These differences carry over to the setting where majority voting is applied. The difference between CoT and PAL is much smaller for the LLaMA models, although we observe that CoT performs slightly better for the 7B and 13B LLaMA models. Interestingly, we find that StarCoderBase outperforms LLaMA-13B (17.8%) on this reasoning benchmark. However, its performance still lags behind LLaMA-33B (38.7%).
Table 24: Zero-shot accuracy on the CoQA question answering challenge.
MMLU (Hendrycks et al., 2020) is a massive multitask language understanding benchmark, covering multiple-choice questions in 57 knowledge domains, including the humanities, STEM, and social sciences. CoQA (Reddy et al., 2019) is a large-scale dataset for Conversational Question Answering systems, measuring the model’s ability to process a text passage and answer a series of interconnected questions. We compare StarCoderBase and StarCoder with CodeGen-16B-Multi (Nijkamp et al., 2023), GPT-NeoX (Black et al., 2022), LLaMA-7B, and LLaMA-13B (Touvron et al., 2023).
We present the 5-shot accuracy for MMLU in Table 23, and the zero-shot F1 scores for CoQA in Table 24. On MMLU, StarCoderBase outperforms CodeGen-16B-Multi significantly (34.2% vs. 27.8%), and even outperforms GPT-NeoX (32.9%) by a small margin. Nevertheless, both LLaMA models outperform StarCoderBase. On CoQA, StarCoderBase performs better than CodeGen-16B-Multi but is outperformed by LLaMA and GPT-NeoX.
When generating open-ended text such as code documentation or technical dialogue, a Code LLM (similarly to text-only LLMs) might produce harmful outputs. We compare StarCoderBase to previous Code LLMs on benchmarks that measure social bias and toxicity in model-produced text.15
Recent work has highlighted that LLMs often capture social biases and stereotypes from their pre-training corpora (Kurita et al., 2019; May et al., 2019; Hutchinson et al., 2020; Meade et al., 2023). To quantify social bias within our model, we use StereoSet (Nadeem et al., 2021).
StereoSet consists of a collection of fill-in-the-blank-style tests for measuring social biases within language models.16 Each example in StereoSet consists of an incomplete sentence (e.g., our housekeeper is BLANK) alongside three possible completions. Of these completions, one is stereotypical (e.g., Mexican), another is anti-stereotypical (e.g., Italian) and a third is unrelated (e.g., computer). StereoSet defines three metrics: a stereotype score, a language modeling score, and an ICAT score. The stereotype score is the percentage of examples for which a model prefers the stereotypical completion for a sentence over the anti-stereotypical completion. The language modeling score is the percentage of examples for which a model prefers a meaningful completion (stereotype or anti-stereotype) over an unrelated completion. Finally, Nadeem et al. (2021) define an idealized context association test (ICAT) score that combines these two metrics: $\mathrm{ICAT} = \mathrm{lms} \cdot \min(\mathrm{ss},\, 100 - \mathrm{ss}) / 50$, where lms is the language modeling score and ss is the stereotype score. An ideal model, with lms = 100 and ss = 50, obtains an ICAT score of 100.
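The three StereoSet metrics can be computed as in the sketch below. It is an illustration rather than the evaluation code referenced in this section: the input format (one model score per completion) and the function name `stereoset_scores` are assumptions, and the language modeling score is simplified to compare the better of the two meaningful completions against the unrelated one. The ICAT combination follows the definition of Nadeem et al. (2021).

```python
def stereoset_scores(examples):
    """Compute (stereotype score, language modeling score, ICAT).

    `examples` is a list of dicts holding the model's score (for instance
    a log-likelihood) for the 'stereo', 'anti', and 'unrelated' completion
    of each sentence. All three returned values are on a 0-100 scale.
    """
    n = len(examples)
    # Share of examples where the stereotypical completion is preferred.
    ss = 100.0 * sum(e["stereo"] > e["anti"] for e in examples) / n
    # Share of examples where a meaningful completion beats the unrelated one.
    lms = 100.0 * sum(
        max(e["stereo"], e["anti"]) > e["unrelated"] for e in examples
    ) / n
    # ICAT is maximal (100) when lms = 100 and ss = 50.
    icat = lms * min(ss, 100.0 - ss) / 50.0
    return ss, lms, icat


toy_examples = [
    {"stereo": -1.0, "anti": -2.0, "unrelated": -9.0},
    {"stereo": -3.0, "anti": -1.5, "unrelated": -8.0},
    {"stereo": -2.0, "anti": -2.5, "unrelated": -7.0},
    {"stereo": -4.0, "anti": -1.0, "unrelated": -6.0},
]
ss, lms, icat = stereoset_scores(toy_examples)
```

In this toy input the model prefers the stereotype in half of the examples and never prefers the unrelated completion, which is the idealized case.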
15 Code for the evaluations is available here: https://github.com/McGill-NLP/StarCoderSafetyEval
16 We only evaluate against the intrasentence task in this work.
Table 25: StereoSet intrasentence results for gender, professional, racial, and religious bias. Stereotype scores close to 50% are best. Language modeling scores and ICAT scores close to 100% are best.
We report StereoSet results for StarCoderBase, alongside LLaMA-13B and CodeGen-Multi-16B, in Table 25. Across all four bias domains, we find that StarCoderBase obtains the lowest stereotype scores while maintaining competitive language modeling scores. This suggests that StarCoderBase’s lower stereotype scores are not simply due to worse language modeling (Meade et al., 2022), as is also indicated by its high ICAT scores.
We also evaluate StarCoderBase against Crowdsourced Stereotype Pairs (CrowS-Pairs; Nangia et al. 2020) and refer readers to Table D.4 for results.
To evaluate toxicity in responses generated from our model, we use RealToxicityPrompts (Gehman et al., 2020), a collection of sentence-level prompts that often elicit undesirable responses from language models. We generate responses to 10K examples from RealToxicityPrompts using StarCoderBase with a minimum length of one token and a maximum length of 128 tokens. We use nucleus sampling (Holtzman et al., 2020) with p = 0.95 to generate all of our responses.
Table 26: RealToxicityPrompts response toxicity results. We report the percentage of responses flagged as toxic using a toxicity classifier and an offensive word list. Lower scores are indicative of less toxic generations.
Table 27: Model results on natural language reasoning tasks in the HELM benchmark, with models ordered by their average rank on the tasks. We use “—” when a model was not evaluated on a given metric, or has runtime errors logged in HELM (e.g., “unmapped prediction” for the code-davinci-002 and code-cushman-001 models on LSAT and Legal Support). StarCoder generally substantially outperforms other open-access models, and often outperforms much larger models.
We use two methods for automatically evaluating toxicity in responses: (i) a RoBERTa-based (Liu et al., 2019) toxicity classifier (Vidgen et al., 2021) and (ii) a list of potentially offensive words.17 For the toxicity detector, we report the percentage of responses flagged toxic using a threshold of 0.5. For the offensive word list, we report the percentage of responses which contain an offensive word. We note that while the offensive word list can potentially falsely flag responses, it may provide a crude measure of blatant toxicity. We report our results in Table 26.
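The two reported percentages can be computed as in the following sketch. It assumes the classifier scores are already available as one probability per response; the function name `toxicity_rates`, the strict `>` comparison against the threshold, and the simple word-level matching are assumptions made for illustration.

```python
import re


def toxicity_rates(responses, classifier_scores, offensive_words, threshold=0.5):
    """Return (% flagged by the classifier, % containing a listed word)."""
    n = len(responses)
    # A response is flagged toxic when its score exceeds the threshold.
    flagged = sum(score > threshold for score in classifier_scores)
    vocab = {word.lower() for word in offensive_words}
    # Word-level matching is crude: it ignores context, so it can
    # falsely flag harmless uses of a listed word.
    with_word = sum(
        any(token in vocab for token in re.findall(r"[a-z']+", text.lower()))
        for text in responses
    )
    return 100.0 * flagged / n, 100.0 * with_word / n


# Placeholder data; "darn" stands in for an entry of the word list.
toy_responses = ["you are a darn fool", "hello world", "nice code", "Darn"]
toy_scores = [0.9, 0.1, 0.4, 0.6]
classifier_rate, word_list_rate = toxicity_rates(toy_responses, toy_scores, ["darn"])
```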
In general, we observe that CodeGen-16B-Multi and StarCoderBase both appear to generate less toxic responses than LLaMA-13B. For instance, 1.43% of LLaMA-13B’s responses contain potentially offensive tokens compared to the 1.12% of StarCoderBase. We also note that CodeGen-16B-Multi appears to generate less toxic responses than StarCoderBase.
17 https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
We evaluate StarCoderBase with HELM (Liang et al., 2022), an evaluation suite aiming to increase the transparency of LLMs by reporting their performance on a wide range of tasks. We evaluate the ability of the model to leverage its natural language and code pretraining for natural language reasoning tasks from HELM (excluding code tasks, because of our own extensive code evaluations). At the time of writing, the HELM benchmark does not include the CodeGen, CodeGeeX, and LLaMA models. Therefore, we compare StarCoderBase with the largest and/or most recent model from each family of “limited” or “open” access models, as classified on the HELM model list,18 that had been evaluated on a majority of these HELM reasoning tasks as of May 1, 2023. In Table 27 we report the results. We compute each model’s ranking on each task, and order models in the table by their average ranking across tasks. StarCoderBase generally obtains substantially stronger performance than all other models with released weights and often performs comparably to or better than much larger models. We speculate that the mixture of code and natural language in the training data contributes to the model’s strong performance on these reasoning tasks.
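The ordering by average rank can be sketched as follows. This is a minimal illustration with hypothetical model and task names; it assumes that higher scores are better, that a model is skipped on tasks where it has no result, and that ties are broken by sort order rather than by shared ranks.

```python
def order_by_average_rank(scores):
    """Order models by their mean per-task rank (rank 1 is best).

    `scores` maps model -> {task: score or None}; None marks a task on
    which the model was not evaluated.
    """
    tasks = {task for per_model in scores.values() for task in per_model}
    ranks = {model: [] for model in scores}
    for task in tasks:
        evaluated = [
            (model, per_model[task])
            for model, per_model in scores.items()
            if per_model.get(task) is not None
        ]
        evaluated.sort(key=lambda pair: pair[1], reverse=True)
        for position, (model, _) in enumerate(evaluated, start=1):
            ranks[model].append(position)
    mean_rank = {m: sum(r) / len(r) for m, r in ranks.items() if r}
    return sorted(mean_rank.items(), key=lambda pair: pair[1])


toy_scores = {
    "model_a": {"task_1": 0.9, "task_2": 0.5},
    "model_b": {"task_1": 0.8, "task_2": 0.7},
    "model_c": {"task_1": 0.1, "task_2": None},
}
ordering = order_by_average_rank(toy_scores)
```

Averaging ranks rather than raw scores keeps tasks with different metrics and scales comparable.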
In Appendix E, we highlight several interesting interactions we had with StarCoderBase. We hope these serve as a starting point for researchers and developers interested in further exploring the model’s capabilities. We provide examples of how to elicit interesting model behavior using the templates for Git commits, GitHub issues, and Jupyter notebooks in Section E.1. In Section E.2, we demonstrate how to prompt StarCoder to act as a technical assistant without any instruction-tuning. In Section E.3 we find that it is also possible to prompt the model using a combination of meta-data and natural language to obtain higher pass@1 performance on the HumanEval benchmark.
As generative language tools become more ubiquitous and data-intensive, the need to understand and inspect the massive amounts of text they were trained on becomes more pressing, both to understand the failure modes of models as well as provide transparent data governance feedback in the form of attribution tracing and provenance management of a model’s generated output. This pressing need for understanding data (Mitchell et al., 2022) is being increasingly recognized and operationalized in the form of dataset inspection tools and toolkits (Akiki et al., 2023; Marone & Van Durme, 2023; Piktus et al., 2023). It is from this vantage point that we are releasing two such data inspection tools: a membership-checking tool and a BM25 search index. These complement the existing “Am I in The Stack” tool which operates at the level of GitHub repository names. The two new tools index only the files used for training and allow for matches on file content. These tools are available as standalone sites but are also integrated into our VSCode demo. This helps users identify parts of the model output that may have been copied from the training data. By utilizing the search index, users can locate the corresponding source file and repository of the copied snippets.
Marone & Van Durme (2023) propose documenting datasets with membership testing artifacts deemed Data Portraits. They provide one specific implementation, based on Bloom Filters (Bloom, 1970), that offers fast and lightweight membership inference. We build a Bloom-filter-based portrait on strings of length 50 characters from the training data. This artifact takes 26 GB, ∼3% of the data size. The inference tool is hosted publicly to complement other documentation artifacts.19
Generations from the model can be quickly checked to approximately assess the degree of overlap with the training corpus. The VSCode extension supports using this as a rapid, first-pass attribution method. However, this requires that matching strings are longer than a minimum size and does not attempt to filter common or generic code snippets. After the first pass check, users can use the full search index to further assess attribution.
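A toy version of such a membership check is sketched below. It is not the Data Portraits implementation: the hashing scheme, the filter size, and the choice to index every 50-character window are assumptions made to keep the example small, and the names `BloomFilter`, `windows`, and `overlap_fraction` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Set membership with possible false positives, no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def windows(text, width=50):
    """All contiguous substrings of exactly `width` characters."""
    return [text[i:i + width] for i in range(len(text) - width + 1)]


def overlap_fraction(generation, portrait, width=50):
    """Fraction of the generation's windows found in the portrait."""
    spans = windows(generation, width)
    if not spans:
        return 0.0  # shorter than the minimum match length
    return sum(span in portrait for span in spans) / len(spans)


training_file = (
    "def add(a, b):\n    return a + b  # simple helper used across the module\n"
)
portrait = BloomFilter(num_bits=1 << 16, num_hashes=4)
for span in windows(training_file):
    portrait.add(span)
```

The sketch also shows the two caveats noted above: text shorter than 50 characters cannot match at all, and a match says nothing about whether the snippet is generic boilerplate.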
We index the training dataset using Elasticsearch (version 7.17)20 and provide two search tools to query it: one focused on the Python subset and one covering the entire dataset. The code itself is preprocessed using a lowercase filter and Lucene’s ASCIIFoldingFilter, tokenized using a 3-gram tokenizer, and indexed using the default Lucene implementation of BM25 as a similarity function. We further index the username and license fields as keyword fields allowing for easy filtering and lookup based on these specific metadata fields. Both indexes are currently running in single-node mode on one virtual machine.
20 https://www.elastic.co/guide/en/elasticsearch/reference/7.17
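An index definition matching this description might look like the sketch below, written as the Python dictionary that would be sent as the index-creation request body. It is an assumption-laden illustration, not the project's actual configuration: the analyzer name `code_analyzer`, the tokenizer name `trigram`, and the field name `content` are hypothetical. No similarity is configured because BM25 is the default in Elasticsearch 7.x. The `analyze` helper only approximates the analysis chain in pure Python so that its effect can be inspected without a running cluster.

```python
import unicodedata

INDEX_BODY = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # Character 3-grams over the raw file content.
                "trigram": {"type": "ngram", "min_gram": 3, "max_gram": 3}
            },
            "analyzer": {
                "code_analyzer": {
                    "type": "custom",
                    "tokenizer": "trigram",
                    "filter": ["lowercase", "asciifolding"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "content": {"type": "text", "analyzer": "code_analyzer"},
            # Keyword fields support exact-match filtering and lookup.
            "username": {"type": "keyword"},
            "license": {"type": "keyword"},
        }
    },
}


def analyze(text):
    """Rough local emulation: lowercase, fold to ASCII, emit 3-grams."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = decomposed.encode("ascii", "ignore").decode("ascii")
    return [folded[i:i + 3] for i in range(len(folded) - 2)]


tokens = analyze("Café")
```

Character 3-grams let a query match code fragments regardless of identifier boundaries, at the cost of a larger index than word-level tokenization.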
Open-science and open-governance StarCoder is an output of a community research project. The project is conducted in the spirit of Open Science (Woelfle et al., 2011), focused on the responsible development and use of Code LLMs. Through open-governance practices conducted throughout the project, priority in decision-making has always yielded to the more responsible option even if this meant introducing limitations that might impact adoption or future research. For example, the Legal, Ethics, Governance Working Group decided to remove and not release a dataset of identified malicious code, even though this data might be useful for future security research.
Openness and safety risks Solaiman (2023) explains how the degree of openness in the LLM development process is connected to the potential risks associated with a model release. When systems are developed in a fully closed manner, it is more likely for power to become concentrated among high-resourced organizations, and the small development team may not fully comprehend the impact and long-term consequences of the model being deployed. In addition, closed-development systems are often less auditable by external experts and can impede scientific progress since researchers cannot build upon each other’s work. On the other hand, fully open development allows for community research, democratizes access to the models, and enables audits throughout the whole development process. However, without appropriate guardrails, open LLM development poses a higher risk of misuse, as increased model access also increases the likelihood of harm caused by the model. Even though a released API can be shut down, once the model weights are released, it is nearly impossible to retract them. Discussing and implementing responsible AI practices has, therefore, been front and center during the development of our project’s LLMs.
Dataset and data licensing StarCoder was trained on a subset of The Stack v1.2 dataset. This dataset has been filtered using a license detector to only include permissively licensed source code. Nevertheless, the license detector might have incorrectly classified a number of repositories. See Kocetkov et al. (2022) for more details on this license detection process.
Opt-out process Although The Stack offers a way to remove developer code, its opt-out process only applies to individual repositories and could benefit from further enhancements. For example, when code is licensed under a permissive or copy-left license, it can be duplicated to another repository, making it challenging to eliminate such copies if the copyright owner chooses to opt out. More work is necessary to create better data control and consent mechanisms for large-scale training sets of LLMs.
PII detection Despite our best efforts to remove PII (Section 4), StarCoder may still produce PII (however, note that the model license restricts use that aims to generate or disseminate PII with the purpose of harming others). As mentioned in Section 4.2, we trained an encoder-only model to detect PII for both code- and text-related tasks and noted that there is a possibility of false positives and negatives, which could lead to unintended consequences when processing sensitive data. Moreover, the PII detection model’s performance may vary across different data types and programming languages, necessitating further validation and fine-tuning for specific use cases. The PII annotations are only available to approved individuals, and researchers and developers who are granted access are expected to uphold ethical standards and data protection measures. By making it accessible, our aim is to encourage further research and development of PII redaction technology.
Malicious code On the Hugging Face platform, where The Stack is hosted, a malicious code detection tool identified 654 files as unsafe. With the help of our community, we removed these files ahead of the release of The Stack v1.2. Nevertheless, The Stack may contain undetected malicious code, and StarCoder might be able to generate malware. The StarCoder OpenRAIL-M license, therefore, includes a use restriction against generating and/or disseminating malware (including, but not limited to, ransomware) or any other content that can be used to harm electronic systems.
Model limitations StarCoder is subject to typical limitations of LLMs, including the potential to generate content that is inaccurate, offensive, misleading, discriminatory towards age or gender, or reinforces other stereotypes. Please refer to Section 7.3 for an investigation into such safety concerns. Deployments of StarCoder need to further challenge and adapt the model to prevent such behavior, e.g., through red-teaming (Perez et al., 2022), adversarial testing (Wan et al., 2023), and/or by adding a robust safety layer (OpenAI, 2023b). The model is released with an OpenRAIL-M license that places enforceable use restrictions that apply to the model and its modifications, and to applications using the model.
English-only evaluations We evaluated the performance of StarCoder solely on English-based benchmarks to understand its coding capabilities and natural language understanding. To make these models more accessible to a wider audience, future research should investigate the performance and limitations of Code LLMs on other natural languages.
Code attribution tools The StarCoder membership-checking tool and BM25 search index are limited to dataset inspection against the subset of The Stack that was used for training and, as such, will not find matches to code that was not included or that was removed from the dataset for this project. The Portraits-based membership testing tool uses hash matching and thus may have false positives. It also has a minimum resolution and requires a certain amount of context to trigger a match. Neither attribution tool attempts to distinguish between generic code (e.g., boilerplate) and protected content. However, we hope that these tools will support ongoing research on the responsible development of LLMs.
Code LLMs We expect Code LLMs to enable people from diverse backgrounds to learn to write higher-quality code and develop low-code applications (Leinonen et al., 2023). Mission-critical software could become easier to maintain as professional developers are guided by code-generating systems on how to write more robust and efficient code. However, the security implications should also be carefully considered (Sandoval et al., 2023). While the social impact is intended to be positive, the increased accessibility of Code LLMs comes with certain risks such as over-reliance on the generated code and long-term effects on the software development job market. We refer the reader to Chen et al. (2021, Section 7) for a broader impact analysis of Code LLMs, as well as Khlaaf et al. (2022) for an in-depth risk assessment and hazard analysis of this emerging technology.
Data annotation It was important for the project to only use reputable data annotation services. It was also important to balance the constraints of costs (fair compensation), time (the timing and time to complete the work were on the critical path for the project), and quality (to ensure that PII Detection Model training was not impacted). While traditional data annotation services using salaried employees were considered, the decision to work with Toloka crowd-workers was taken after a review of service providers and their compensation practices — most would not provide sufficient transparency and guarantees about worker compensation. Our determination of compensation took into consideration different minimum wage rates across countries and their corresponding purchasing power. We limited annotation eligibility to countries where the hourly pay rate of $7.30 was equivalent to the highest minimum wage in the US ($16.50) in terms of purchasing power parity.
Feedback opt-out form During the first stage of the opt-out process, individuals were asked to specify the reasons for wanting their code to be excluded from the dataset. The recurring concerns we heard from the individuals who wished to opt out are:
The opt-out form thus provided an opportunity to directly engage with content creators and learn about the impact of our work on them.
Community feedback on opt-out process We conducted community research with individuals at specific organizations whose data is used in The Stack (The Alan Turing Institute and The Turing Way) and contributed to two open, international workshops (Open Data Day 2023 and Mozilla Festival 2023 with a session titled ‘Designing for Data Rights in the AI Production Pipeline’). These qualitative interviews and participatory co-design workshops included 50 participants, primarily from North America and Europe, with roles including research scientist, community manager, software engineer, and principal investigator (PI).
The outcomes from the community research can be summarized as follows: when it comes to governance of LLM datasets, participants feel that it is both better to know and better to have a choice. Most participants had neutral to positive feelings about their permissively licensed data being used to train LLMs. While all had positive impressions of the “Am I in The Stack” tool, none of the interviewees expressed a desire to actually opt out. The main takeaway seemed to be that participants found the most value in the project’s governance tools for their ability to raise awareness of data practices and to empower individuals and communities to take action based on their specific needs. These initial conversations also highlighted the importance of bringing governance discussions and decisions directly to impacted communities, an important direction of future work that should extend community research beyond North America and Europe. Participants in the workshops also raised examples of new groups to center in data rights considerations, including artists, data miners, and future generations. The co-created outputs can be viewed on this MozFest Miro Board.
In this technical report, we described the efforts of the BigCode community in creating StarCoderBase and StarCoder, open-access 15.5B parameter large language models trained on code. We provided full transparency on all aspects of the research and development process, including the training data, the data curation process, the PII redaction pipeline, and the model training. We conducted the most extensive evaluation of Code LLMs to date, finding that StarCoder outperforms other Code LLMs like CodeGen (Nijkamp et al., 2023) and CodeGeeX (Zheng et al., 2023), and matches or outperforms the closed-access code-cushman-001 model from OpenAI. By releasing the StarCoder models with an Open Responsible AI Model license, and by open-sourcing all code repositories for building the model on GitHub, we aim to increase access, reproducibility, and transparency of Code LLMs in the research and developer communities. The model license includes use restrictions to ensure that modifications of the model and applications using the model adhere to our principles of responsible AI. In addition, we released a novel set of attribution tools to help end-users of Code LLMs to detect and locate model generations that may have been copied from the training set. We hope these measures contribute towards a safe model release, ensuring that the strong-performing StarCoder models remain a force for good.