[배경 및 문제 정의]
최근 대규모 언어모델(LLM)들은 자연어 처리(NLP)와 코드 생성에서 인상적인 성과를 보여주고 있습니다. 특히 Python 코드 생성을 위한 Code-LLM들은 다양한 벤치마크를 통해 평가되어 왔습니다. 그러나 이런 벤치마크들은 종종 한정된 프로그래밍 개념과 난이도를 다루며, 실제 다양한 프로그래밍 문제를 폭넓게 평가하는 데에는 한계가 있습니다.
[선행 연구 검토]
기존의 코드 생성 모델들은 HumanEval, MBPP와 같은 데이터셋을 통해 평가되었습니다. 이 데이터셋들은 주로 간단한 문제를 포함하고 있어, 모델의 일반화 능력과 다양한 프로그래밍 개념에 대한 이해를 제대로 평가하기 어렵습니다. 이에 따라 보다 체계적이고 다양한 프로그래밍 문제를 포함한 새로운 벤치마크의 필요성이 대두되었습니다.
[방법]
[데이터셋 구성]
새로운 벤치마크 PythonSaga는 다음과 같은 구성으로 제안되었습니다.
[평가 방법]
[결과]
PythonSaga를 통한 평가에서 대부분의 모델들은 낮은 성능을 보였으며, 특히 개념을 다루는 문제에서 두드러진 성능 저하가 관찰되었습니다. 이는 기존 모델들이 실제 복잡한 프로그래밍 문제를 해결하는 데에 아직 충분치 않다는 것을 시사합니다.
[결론]
PythonSaga는 다양한 프로그래밍 개념과 난이도를 균형 있게 포함하고자 노력하였습니다.
The rapid advancement of large language models (LLMs), such as Gemini (Anil et al., 2023a), GPT-4 (OpenAI, 2023), LLaMA (Touvron et al., 2023), and PaLM (Anil et al., 2023b), has achieved near-human or even superhuman performance (Bowman, 2023) across a wide range of NLP tasks. This surge has also prompted the development of tailor-made code generation models, such as Codex (Chen et al., 2021), STARCODER (Li et al., 2023), Code-Gen (Nijkamp et al., 2022), and CodeGeeX (Zheng et al., 2023). These specialized models, hereafter collectively referred to as “Code-LLMs”, harness the capabilities of LLMs for automated code generation from human descriptions. Figure 1 shows a toy example with an input description from a human and an expected Python code generated by a Code-LLM.
The prevalence of Python as the dominant programming language has significantly influenced a majority of Code-LLMs to showcase their code-generation capabilities on Python-specific benchmarks. Consequently, HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), APPS (Hendrycks et al., 2021), and DS-1000 (Lai et al., 2023) have emerged as prominent benchmarks, leveraging data curated from popular coding platforms like GitHub (GitHub, 2024), LeetCode (GeeksForGeeks, 2023), and Codeforces (Codeforces, 2024) and crowdsourcing efforts. These benchmarks offer a diverse range of programming challenges, with sizes spanning from a few hundred instances in HumanEval (Chen et al., 2021) to several thousand instances in datasets like APPS (Hendrycks et al., 2021) and MBPP (Austin et al., 2021).
Code generation benchmarks, like their NLP counterparts (Kiela et al., 2021), are reaching saturation, revealing limitations in their ability to evaluate models comprehensively. Figure 2 reports pass@1 score of recent Code-LLMs on two popular benchmarks, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021). $\text{pass@k} $ measures if at least one of the k code samples generated by the model passes every test case. Detailed formal definition is present in Appendix A.1.
Figure 2: Performance comparison arranged in ascending order of (pass@1) of popular Code-LLMs on two Python benchmarks, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021). pass@1 scores are taken verbatim as reported in STARCODER (Li et al., 2023), Code Llama (Roziere et al., 2023), and Gemini (Anil et al., 2023a). GPT-4, Gemini Pro, and Gemini Ultra do not report performance scores on MBPP dataset.
This progress prompts two critical questions: (1) Have Code-LLMs attained the generalization ability to solve any programming problem? (2) What programming concepts remain challenging for them, hindering their ability to solve specific problems? Surprisingly, despite their widespread use, existing benchmarks lack a comprehensive evaluation of their diversity in terms of programming concepts and difficulty level.
In this paper, we introduce a comprehensive hierarchical classification of programming concepts, categorizing them into basic, intermediate, and advance levels (see Section 3). We then rigorously evaluate two benchmarks, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), on two dimensions: diversity of programming concepts and user-perceived difficulty. Our findings reveal a significant bias towards a small subset (<53%) of programming concepts, leaving the vast majority underrepresented. Additionally, over 80% of the problems are perceived as easy, raising concerns about the benchmarks’ generalizability and effectiveness (see Section 4). To address these limitations, in Section 5, we propose a novel code generation benchmark, PythonSaga, featuring a balanced representation of 38 programming concepts across three difficulty levels in the form of 185 manually crafted problems. Surprisingly, our experiments show poor pass@1 scores by the majority of the existing open (<4%) and closed-source (<13%) Code-LLMs on PythonSaga. Furthermore, detailed analysis unveils significant disparities in their capacity to handle different programming concepts and difficulty levels.
NLP for Programming: Over the years, various programming tasks, including clone detection (Roy et al., 2009) (assessing semantic similarity between code fragments), defect detection (Tabernik et al., 2020) (identifying potential flaws within source code), code completion (Hindle et al., 2016) (predicting subsequent tokens based on code context), automated code repair (Arcuri and Yao, 2008) (improving code by automatically addressing bugs), code search (Sachdev et al., 2018) (gauging semantic relevance between textual descriptions and code snippets), and code summarization (Allamanis et al., 2016) (generating natural language comments for code), have been extensively investigated and discussed within the NLP community. This exploration has led to the development of several datasets such as GitHub Java Corpus (Allamanis and Sutton, 2013), BigCloneBench (Svajlenko et al., 2014), POJ-104 (Mou et al., 2016), PY150 (Raychev et al., 2016), Devign (Zhou et al., 2019), Bugs2Fix (Tufano et al., 2019), CodeSearchNet (Husain et al., 2019), CT-max/min (Feng et al., 2020), MBPP by Austin et al. (2021), CodeXGLUE by Lu et al. (2021), CodeNet by Puri et al. (2021), HumanEval by Chen et al. (2021), XLCoST by Zhu et al. (2022), MultiPL-E by Cassano et al. (2022), and HumanEval-X by Zheng et al. (2023). These datasets and associated benchmarks span multiple programming languages, including Java, C, C++, PHP, Ruby, Go, and Python, among others.
Code Generation Models: The remarkable surge in the popularity of large language models (LLMs) has also been accompanied by significant advancements in code-generation LLMs (Code-LLMs). These models exhibit the capability to generate code in designated programming languages, guided by instructions presented in the form of prompts, functions, or docstrings. Prominent examples of such Code-LLMs include but are not limited to, Codex (Chen et al., 2021), Code-Gen (Nijkamp et al., 2022), Code Llama (Roziere et al., 2023), STARCODER (Li et al., 2023) and CodeGeeX (Zheng et al., 2023). These Code-LLMs are largely multilingual, capable of handling multiple programming languages, and their parameter sizes range from 1 billion to 35 billion. Their training datasets encompass popular programming websites and code repositories such as GitHub, LeetCode, and GeeksForGeeks. All popular Code-LLMs primarily focus on Python programs due to their widespread usage in ML and AI applications.
Python-based Evaluation Benchmarks: Recent thrust in Python code generation models also led to the development of several benchmark datasets. The PY150 dataset (Raychev et al., 2016), consisting of 150,000 Python source files from GitHub, serves as a valuable tool for LLM evaluation. The APPS dataset (Hendrycks et al., 2021) features 10,000 problems from platforms like Codewars, AtCoder, Kattis, and Codeforces. HumanEval (Chen et al., 2021) comprises 164 handwritten problems. The MBPP dataset (Austin et al., 2021) contains 974 entry-level problems. Additionally, the MathQA-Python dataset (Austin et al., 2021), with 23,914 problems, assesses code synthesis from complex textual descriptions.
Limitations in Existing Benchmarks: Current datasets for evaluating Large Language Models (LLMs) often lack transparency and comprehensiveness in problem selection and categorization. This opacity hinders assessments of the generalizability and representativeness of the benchmarks, potentially leading to overestimation of LLM performance on code generation tasks. To address this issue, this paper proposes a comprehensive problem categorization by outlining recommended concepts for problem inclusion, aiming to establish a rigorous and transparent benchmarking framework.
The concepts encompass language-specific constructs like variables, data types, control flow, and conditions to generic constructs like Algorithms, OOPs, etc. We, therefore, propose a hierarchy of programming concepts where a complex concept might require knowledge of several basic concepts. For example, sorting algorithms like Quicksort or Mergesort require a thorough understanding of data structures such as arrays and linked lists, as well as proficiency in algorithmic analysis and time complexity. Each programming concept is an intrinsic feature of a problem. We next describe the proposed hierarchy:
We curate a list of 38 programming concepts from three popular coding platforms (GeeksForGeeks, 2023; LeetCode, 2023; hackerearth, 2023). We further assign each concept to one of the three hierarchy levels. Table 1 presents the curated concepts and the proposed hierarchy.
An annotator, with their expertise and experience in programming, can perceive a programming problem as belonging to one of three difficulty levels: Easy, Medium, or Hard (Hendrycks et al., 2021). Thus, difficulty level is an extrinsic feature of a problem. This subjective assessment is based on a complex combination of factors, such as knowledge of programming concepts, problem-solving skills, experience with similar problems, and coding proficiency. It is important to note that this perceived difficulty is subjective and can vary significantly between annotators. A problem considered easy by one annotator due to their prior experience and knowledge might be deemed challenging by another who lacks those same advantages. Furthermore, the perceived difficulty of a problem can also evolve over time as an annotator develops their skills and knowledge. A problem that initially seemed challenging may become easier with practice and exposure to similar problems. Conversely, an annotator may encounter a problem that initially appears straightforward but then find themselves struggling due to hidden complexities or unforeseen challenges.
In this paper, we focus on Python Programming language and conduct human experiments with two popular Python-based code generation benchmarks to showcase extensive selection bias and poor diversity in the curated problems. The following section describes the human experiments in detail.
This study is grounded on the two most widely recognized Python code generation benchmarks: (i) HumanEval (Chen et al., 2021) and (ii) MBPP (Austin et al., 2021). Recent Code-LLMs including STARCODER (Li et al., 2023), LLaMA (Touvron et al., 2023), METAGPT (Hong et al., 2023), Code Llama (Roziere et al., 2023), SANTACODER (Allal et al., 2023), and CodeGeeX (Zheng et al., 2023) have employed these two benchmarks to assess their performance. We next briefly describe these benchmarks.
Both benchmarks evaluate model performances against one of the most popular metrics pass@k. We formally define pass@k in Section A.1.
Next, we conducted two human annotation studies to gain insights into the diversity in programming concepts and difficulty levels of the two proposed benchmarks. Each study involved the recruitment of the same set of five annotators. Each annotator is a postgraduate student in Computer Science with at least three years of experience in Python programming and competitive programming. It is noteworthy that each participant willingly volunteered throughout the entire duration of the experiment, and no remuneration was provided. Internet access was prohibited during the entire annotation period. Annotators were encouraged to utilize any brute-force technique they considered appropriate without prioritizing optimized solutions. No time constraints were imposed to prevent hasty or fatigue-induced decisions. Each annotator was presented with 164 problems from HumanEval and a randomly selected set of 164 problems from MBPP. We next describe the two annotation studies:
Diversity in the Programming Concepts: In this section, we report the proportion of problems assigned to a specific concept averaged over five annotators. We find five predominant concepts, Mathematics, Control Flow and Conditions, Basic Data Structures, Variables and Data Types, and In-Built Functions, which comprise 72.1% and 77.3% problems in HumanEval and MBPP, respectively. Surprisingly, we found a complete absence of 14 (=37.8%) concepts. Notable exclusions include OOPs, Linked-lists, Tree, Graph, and Backtracking. Figure 3 presents concept-wise proportions in both the benchmarks. Further analysis suggests that, on average, the Basic category comprises approximately 78% of problems in both HumanEval and MBPP. The Intermediate category comprises 20.24% and 18.04% problems in HumanEval and MBPP, respectively. Finally, the Advance category contains 1.09% and 3.04% problems in HumanEval and MBPP, respectively.
Diversity in the Difficulty level: Here, we report the difficulty level assigned to a problem using majority voting among the annotators. In HumanEval, 84.8% of the problems were classified as Easy, 14.6% as Medium, and only 0.6% as Hard. Whereas in MBPP, 89.6% and 10.4% of problems were categorized as Easy and Medium, respectively. No problem in MBPP was labeled as Hard. Here, we achieved significant consensus among the annotators. For example, in HumanEval, we find complete agreement among five annotators on 39% of the problems. Whereas we miss complete agreement by a single vote in 29.2% problems. In the case of MBPP, the 40.2% problems resulted in a complete agreement, with 42.1% problems missing the complete agreement by one vote.
Overall, we observe significant selection bias towards easy problems in both benchmarks.
We now introduce PythonSaga, a new Python code generation benchmark that addresses the limitations of existing benchmarks with respect to diversity in concepts and difficulty level. PythonSaga contains 185 prompts, close to equal representation from each of the 38 programming concepts with varied levels of difficulty (described in Section 3.2).
Aligned with the problem curation strategies employed in established benchmarks (Hendrycks et al., 2021; Lai et al., 2023; Zhu et al., 2022), this study leverages problems from two prominent coding platforms: GeekForGeeks (GeeksForGeeks, 2023) and LeetCode (LeetCode, 2023). To comprehensively represent each proposed programming concept (detailed in Section 3.1), we curated five problems per concept. This diverse set comprises one Easy problem, two Medium problems, and two Hard problems, ensuring a balanced distribution across difficulty levels (20%, 40%, and 40%, respectively) within the PythonSaga Dataset.
To enhance human-friendliness and ground the problems in realistic contexts, each shortlisted problem statement undergoes a manual rephrasing process without any aid from AI tools. Furthermore, a comprehensive description of input and output formats, accompanied by relevant examples, is supplied with each problem statement to ensure a thorough understanding of the task by Code-LLM. This multi-step approach aims to retain the core knowledge and essential solution steps while integrating them into relatable real-world scenarios. This reconstruction involves reformulating the entire problem statement while preserving its fundamental functionality. This deliberate transformation enhances the challenge for Code-LLMs, requiring them to move beyond simple pattern matching and grasp the nuanced context embedded within the prompt to devise a solution effectively. For example, the problem statement “Given a string str, find the length of the longest substring without repeating characters.” is paraphrased as “Let’s say you attend a car show where cars of different brands are showcased in a row. Find the length of the longest stretch where no two cars are of the same brand. Take the input from the user for the brands of the cars in the order they are placed in the row. Print the length of the longest stretch where no two cars are of the same brand”.
Overall, PythonSaga comprises five problem instances from each programming concept, resulting in a total size of 185 problems. Each problem is associated with a maximum of four test cases, with an average of 3.7 test cases per problem. PythonSaga’s structure resembles HumanEval and MBPP, wherein each problem comprises a function signature, docstring, body, and multiple unit tests. A representative example is present in Appendix A.2.
Next, we benchmark several open and closed-source LLMs on PythonSaga. Open-source models include three Llama variants, Code Llama (Roziere et al., 2023), Code Llama Python (Roziere et al., 2023), and Code Llama Instruct (Roziere et al., 2023), Mistral-Instruct (Jiang et al., 2023), StarCoderBase (Li et al., 2023) and two Deepseek variants, Deepseek Coder (Guo et al., 2024) and Deepseek Coder Instruct (Guo et al., 2024). Except for Mistral-Instruct, the rest are the Code-LLMs. In addition, we benchmark two closed-source models, including GPT variants GPT-3.5 (OpenAI, 2022) and GPT-4 (OpenAI, 2023). While larger open-source options exist, our selection was restricted to models with 7B parameters due to computational resource limitations, which were limited to a single Tesla V100 in our case.
We evaluate model performances using pass@k metric. Adhering to previous studies like HumanEval (Chen et al., 2021), StarCoder (Li et al., 2023), Deepseek Coder (Guo et al., 2024), etc, we primarily utilized $ k = 1 $ signifying that a model is considered successful if at least one of its generated solutions passes the defined evaluation criteria. However, we additionally explored $ k = 10 $ to analyze model consistency across larger sets of responses. Notably, unlike prior works that varied the number of sampled responses $(n)$, we consistently generated samples from both open-source and closed-source models for a consistent evaluation. $ n = 20 $
Table 2 compares the above models against pass@1 and pass@10 metrics. In consistent with the latest trends, closed-source models performed considerably better than open-source models. Among open-source models, Deepseek Coder (Guo et al., 2024) performed best, whereas, among closed-source models, GPT-4 (OpenAI, 2023) performed best. Notably, the performance of closed-source models on PythonSaga is significantly lower than the respective performances in HumanEval and MBPP benchmarks (see Figure 2 for more details).
Figure 4 illustrates the performance of each LLM on problems within specific programming concepts in the PythonSaga. We consider a model has successfully solved a problem if any one of the n(=20) generated samples passes all the test cases. As anticipated, all models exhibited better performance in solving problems associated with basic concepts compared to intermediate or Advance concepts. For example, Deepseek Coder, solved 21.1%, 25%, and 8.2% of problems in these categories, respectively. Whereas, GPT-4 solved 42.3%, 46.6%, and 32.8% of problems, respectively. In contrast to open-source models, closed-source models have successfully solved at least one problem from a majority of the concepts. Interestingly, none of the models could successfully solve any problem within five specific concepts, Basic Data Structures, Hashing, Context Managers, Concurrency and Parallelism, and Max Flow. Notably, closed-source models exhibited a more consistent performance across categorization compared to open-source models, suggesting a potential advantage in handling diverse problem complexities.
This study emphasizes the crucial need for a more balanced and comprehensive evaluation framework to ensure a fair and accurate assessment of large language models (LLMs) capable of generating code from human inputs. We address this gap by proposing an extensive categorization and hierarchy of programming concepts. Subsequent analysis of two prominent Python code generation benchmarks reveals limited diversity in both programming concepts and difficulty levels. Notably, we introduce a novel benchmark characterized by a uniform representation of concepts and difficulty, offering a more robust assessment paradigm. Our findings suggest that existing benchmarks potentially overestimate LLM performance on code generation tasks. This work lays the groundwork for the future development of diverse and representative Python code generation benchmarks, paving the way for similar studies in other programming languages.