Contents
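1. Introduction
Large language models (LLMs) have recently drawn attention in NLP for their improved performance. In particular, with the introduction of techniques such as Chain-of-Thought (CoT), Self-Consistency (SC), and ReAct, LLMs have shown that they can perform specific tasks without hundreds or thousands of training examples. However, research on self-correction indicates that these models are limited in their ability to fix their own logical and reasoning errors without external feedback.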
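2. BIG-Bench Mistake Dataset
2.1 Dataset Description
The BIG-Bench Mistake dataset contains CoT-style traces generated with PaLM 2, each annotated with the location of the first logical mistake. It covers a range of tasks: word sorting, tracking shuffled objects, logical deduction, multi-step arithmetic, and Dyck languages. Each step of a trace was generated separately, and answer correctness is measured by exact match.
2.2 Annotation
Annotation was carried out by human annotators and automated tooling. Human annotators reviewed each trace to identify mistakes, while the Dyck languages task was annotated mostly automatically using pattern matching.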
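3. Benchmark Results
3.1 Model Performance
Models such as GPT-4 and GPT-3.5 struggled to identify mistake locations accurately on the BIG-Bench Mistake dataset. The experiments used three prompting methods: direct trace-level, direct step-level, and CoT step-level prompting. These results suggest that LLMs are limited in their ability to automatically identify and correct mistakes that arise during reasoning.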
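4. Backtracking
4.1 Backtracking Method
Backtracking is a technique that, given mistake location information, regenerates the step of the original trace at which the mistake was identified. It is used to improve performance by correcting erroneous outputs. Experiments were run both with gold-standard mistake location labels and with simulated reward models.
4.2 Reward Model
The reward model is trained separately to improve mistake identification on a given task. During backtracking, it supplies the mistake location information that is used to make the output more accurate.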
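5. Related Work
Identifying and correcting mistakes in LLM outputs is an important research topic, and a variety of datasets and techniques have been proposed. This paper presents a new method for evaluating and improving the reasoning ability of LLMs.
Mathematical logic and background. The dataset and experimental method used in this study focus on evaluating the logical consistency of each step and identifying mistakes. The backtracking technique follows the rule below.
\[\begin{aligned} &\text{If } \text{error\_found} = \text{True} \text{ at step } n, \\ &\text{then re-generate step } n \text{ with high variability (temperature} = 1\text{).} \end{aligned}\]
This rule states that when a mistake is found, the affected step is regenerated in a setting that can produce diverse outputs. The output with the highest log-probability is then selected to revise the original trace. The process depends on a reward model to supply the mistake location.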
Large Language Models (LLMs) have dominated the field of NLP in recent years, achieving state-of-the-art performance in a large variety of applications. In particular, LLMs have demonstrated the ability to solve tasks with zero- or few-shot prompting, giving rise to prompting methods such as Chain-of-Thought (CoT) (Wei et al., 2022), Self-Consistency (SC) (Wang et al., 2023), and ReAct (Yao et al., 2022).
Recent literature on few- or zero-shot prompting has focused on the concept of self-correction, i.e. having an LLM correct its own outputs (Shinn et al., 2023; Miao et al., 2023; Madaan et al., 2023; Chen et al., 2023; Saunders et al., 2022). (See Pan et al. (2023) for a review of the literature.)
However, Huang et al. (2023) note that while self-correction may prove effective for improving model outputs in terms of style and quality, there is limited evidence that LLMs can identify and fix their own reasoning and logical errors without external feedback. For example, Reflexion (Shinn et al., 2023) and RCI (Kim et al., 2023) both use ground truth correctness as a signal to halt the self-correction loop. Initially observed by Madaan et al. (2023) on a math reasoning dataset, Huang et al. (2023) further demonstrate this shortcoming of self-correction in 2 additional datasets.
While previous work typically presents self-correction as a single process, we divide it into mistake finding and output correction.
Mistake finding is a fundamental reasoning skill that has been studied and utilised extensively in philosophy, psychology, and mathematics, spawning concepts such as critical thinking, and logical and mathematical fallacies. One might expect that the ability to find mistakes should also be an important requirement for LLMs. However, our results show that state-of-the-art LLMs currently cannot find mistakes reliably.
Output correction involves partially or completely changing previously generated outputs. In the context of self-correction, this is typically done with outputs generated by the same model (see Pan et al. (2023) for an overview of different strategies). Despite LLMs’ inability to find mistakes, our results show that they can correct outputs using our backtracking method, if given information about the mistakes, for example via a small, supervised reward model.
Our contributions for this paper are as follows: 1. With Chain-of-Thought prompting, any task can be turned into a mistake-finding task. We collect and release to the research community BIG-Bench Mistake, a dataset of CoT-style traces (see footnote 2) generated using PaLM 2, and annotated according to where the first logical mistake is. To our knowledge, BIG-Bench Mistake is the first dataset of its kind that goes beyond problems in mathematics.
2 In this paper, we refer to a set of CoT reasoning steps as a trace.
BIG-Bench Mistake consists of 2186 sets of CoT-style traces. Each trace was generated by PaLM 2-L-Unicorn, and annotated with the location of the first logical error. An example trace is shown in Table 1, where the mistake location (see footnote 3) is the 4th step. Our traces span a set of 5 tasks (see footnote 4) from the BIG-bench dataset (Srivastava et al., 2023): word sorting, tracking shuffled objects, logical deduction, multi-step arithmetic, and Dyck languages. CoT prompting is used to prompt PaLM 2 to answer questions from each task. As we wanted to separate our CoT traces into distinct steps, we follow the method used by Yao et al. (2022) and generate each step separately, using the newline as a stop token. In this dataset, all traces are generated with temperature = 0. The correctness of answers is determined by exact match. Prompts can be found at https://github.com/WHGTyen/BIG-Bench-Mistake.
3 As some traces may not contain mistakes, we use the term mistake location as a multi-class label that can refer to either the integer N where the N th step contains the first mistake, or that there are no mistakes.
4 These 5 tasks were selected because 1) Anil et al. (2023) demonstrate that PaLM 2 performs poorly on these tasks, so it is likely to generate mistakes in CoT traces; 2) any mistakes that may occur are likely to be unambiguous, therefore minimising subjectivity during annotation; and 3) identifying mistakes for these tasks does not require expert knowledge of a specific domain.
Each generated trace is annotated with the first logical error. We ignore any subsequent errors as they may be dependent on the original error.
Note that traces can contain a logical mistake yet arrive at the correct answer. To disambiguate the two types of correctness, we will use the terms correctans and incorrectans to refer to whether the final answer of the trace is correct. Accuracyans would therefore refer to the overall accuracy for the task, based on how many final answers are correct. To refer to whether the trace contains a logical mistake (rather than the correctness of the final answer), we will use correctmis and incorrectmis.
For 4 of the 5 tasks, we recruit human annotators to go through each trace and identify any errors. Annotators have no domain expertise but are given guidelines to complete the task.
Before annotation, we sample a set of 300 traces for each task, where 255 (85%) are incorrectans, and 45 (15%) are correctans. Since human annotation is a limited and expensive resource, we chose this distribution to maximise the number of steps containing mistakes and to prevent over-saturation of correct steps. We also include some correctans traces because some may contain logical errors despite the correct answer, and to ensure that the dataset included examples of correct steps that are near the end of the trace. This also prevents annotators from feeling forced to find a mistake in all traces. To account for this skewed distribution, results in section 4 are split according to whether the original trace is correctans or incorrectans.
Following Lightman et al. (2023), annotators are instructed to go through each step in the trace and indicate whether the step is correct or not (binary choice). Annotators submit their answers only once all steps have been annotated or one incorrect step has been found. If an incorrect step is identified, the remaining steps are skipped. This is done to avoid ambiguities where a logically correct deduction is dependent on a previous mistake. We make our annotation guidelines available at https://github.com/WHGTyen/BIG-Bench-Mistake, and we include a screenshot of the user interface in Figure 3.
Each trace has been annotated by at least 3 annotators. If there are any disagreements, we take the majority label. We calculate Krippendorff’s alpha (Hayes and Krippendorff, 2007) to measure inter-rater reliability (see Table 2).
Table 2: Inter-rater reliability for the human-annotated tasks, measured by Krippendorff’s alpha.
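The annotation protocol and label aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, "no mistake" is represented as `None`, and raising an error on a tie is an assumption, since the text does not say how cases without a majority are resolved.

```python
from collections import Counter
from typing import List, Optional


def first_mistake(step_is_correct: List[bool]) -> Optional[int]:
    """Map per-step binary judgements to a mistake location.

    Returns the 1-based index of the first incorrect step,
    or None when every annotated step is correct.
    """
    for index, is_correct in enumerate(step_is_correct, start=1):
        if not is_correct:
            return index
    return None


def majority_label(labels: List[Optional[int]]) -> Optional[int]:
    """Return the most frequent mistake location among annotators.

    Assumption: a tie for first place raises ValueError, because the
    source does not specify a tie-breaking rule.
    """
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise ValueError("no majority label among annotators")
    return ranked[0][0]


# Three annotators label the same trace; two of them flag step 4.
annotations = [
    first_mistake([True, True, True, False]),
    first_mistake([True, True, True, False]),
    first_mistake([True, True, True, True, True]),
]
print(majority_label(annotations))  # 4
```

Because annotators stop at the first incorrect step, each judgement list ends either at the first `False` or at the final step of the trace.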
For Dyck languages, we opt for mostly automatic annotation instead of human annotation as the traces show limited variation in phrasing and solution paths.
For each trace, we generate a set of standard steps based on the format used in the prompt examples. Using pattern matching, we can identify whether each model-generated step also conforms to the same format. If so, we compare the two and assume that the trace is incorrect if the symbols do not match. Additionally, we also account for edge cases such as where the final two steps are merged into one, or variations in presentation where some symbols are placed in quotes and some are not. We release the code at https://github.com/WHGTyen/BIG-Bench-Mistake along with our dataset.
Table 4 shows the accuracy of GPT-4-Turbo, GPT-4, and GPT-3.5-Turbo on our mistake-finding dataset. For each question, the possible answers are either that there are no mistakes, or, if there is a mistake, the number N indicating the step in which the first mistake occurs. A model’s output is only considered correct if the location matches exactly, or the output correctly indicates that there are no mistakes.
All models are given the same 3-shot prompts (see footnote 6).
We use three different prompting methods: direct trace-level prompting, direct step-level prompting, and CoT step-level prompting.
6 Prompts can be found at https://github.com/WHGTyen/BIG-Bench-Mistake.
Table 3: Breakdown of correctness and mistake distribution in our dataset. Correctnessans is based on exact matching. Dyck languages (sampled) refers to the set of traces which have been sampled so that the ratio of correctans to incorrectans traces matches the other tasks.
Figure 1 demonstrates this trade-off.
All three models appear to struggle with our mistake finding dataset. GPT-4 attains the best results but only reaches an overall accuracy of 52.87 with direct step-level prompting.
Our findings are in line with and build upon results from Huang et al. (2023), who show that existing self-correction strategies are ineffective on reasoning errors. In our experiments, we specifically target the models’ mistake finding ability and provide results for additional tasks. We show that state-of-the-art LLMs clearly struggle with mistake finding, even in the simplest and most unambiguous cases. (For comparison, humans can identify mistakes without specific expertise, and have a high degree of agreement, as shown in Table 2.)
We hypothesise that LLMs’ inability to find mistakes is a main contributing factor to why LLMs are unable to self-correct reasoning errors. If LLMs are unable to identify mistakes, it should be no surprise that they are unable to self-correct either.
Note that the mistakes in our dataset are generated using PaLM 2 L (Unicorn), and traces were sampled according to whether the final answer was correct or not. Therefore, we expect that using PaLM 2 itself to do mistake finding will produce different and likely biased results. Further work is needed to elucidate the difference between cross-model evaluation and self-evaluation.
As we compare results across the three methods, we find that the accuracy on traces with no mistakes goes down considerably from direct, trace-level prompting to more complex methods.
Note that the traces in BIG-Bench Mistake are sampled to contain more incorrectans traces than correctans traces (and therefore more incorrectmis traces than correctmis traces), so the overall mistake location accuracy appears higher for per-step prompting in Table 4, despite the poor accuracy for correctmis traces. For a full set of splits by correctness, see Figure B.
We hypothesise that this is due to the number of outputs generated by the model. Our three methods involve generating increasingly complex outputs, starting with direct, trace-level prompting requiring a single token, then direct, step-level prompting requiring one token per step, and finally CoT step-level prompting requiring several sentences per step. If each generation call has some probability of identifying a mistake, then the more calls made on each trace, the more likely the model will identify at least one mistake.
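One way to make this hypothesis concrete is to assume, purely for illustration, that each of the n generation calls made on a trace flags a mistake independently and with the same probability p. Neither the independence assumption nor the symbols p and n come from the text.

```latex
\[
P(\text{at least one mistake flagged}) = 1 - (1 - p)^{n}
\]
```

Under this assumption, with p = 0.05 the probability of flagging a mistake is 0.05 for a single call (n = 1) but about 0.40 for n = 10 calls, since 0.95^10 ≈ 0.599. A trace with no mistakes therefore becomes harder to label correctly as the number of calls per trace grows.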
Figure 1: Graph of mistake location accuracies for each prompting method (excluding GPT-4-Turbo which we do not have CoT step-level results for). Blue bars show accuracies on traces with no mistakes, so the model must predict that the trace has no mistake to be considered correct; orange bars show accuracies on traces with a mistake, so the model must predict the precise location of the mistake to be considered correct.
In this section, we investigate whether our prompting methods can reliably determine the correctnessans of a trace rather than the mistake location. Our motivation was that even humans use mistake finding as a strategy for determining whether an answer is correct or not, such as when going through mathematical proofs, or working through argumentation. Additionally, one might think that directly predicting the correctnessans of a trace may be easier than having to pinpoint the precise location of an error.
Table 4: Mistake finding accuracy across 5 tasks. The average number of steps in the CoT reasoning traces in each task is indicated in brackets. Unless otherwise indicated, the number of traces in each task is shown in Table 3. We also provide scores split by correctnessans of the original trace in Figure B. † indicates that traces were sampled to contain 15% correctans and 85% incorrectans traces (see Table 3). * indicates that traces were sampled to contain 45 correctans and 255 incorrectans traces to reduce costs.
We calculate averaged F1 scores based on whether the model predicts that there is a mistake in the trace. If there is a mistake, we assume the model prediction is that the trace is incorrectans. Otherwise, we assume the model prediction is that the trace is correctans.
Table 5: Weighted average F1 scores for predicted correctnessans of traces across 5 tasks. Baseline is 78 if we only select the incorrectans label. As in Table 4, traces for the Dyck languages task have been sampled to match the ratio of correctans to incorrectans traces of the other tasks. See Table 3 for a full breakdown.
Note that the baseline of predicting all traces as incorrect achieves a weighted F1 average of 78.
The weighted F1 scores show that prompting for mistakes is a poor strategy for determining the correctness of the final answer. This is in line with our previous finding that LLMs struggle to identify mistake locations, and also builds upon results from Huang et al. (2023), who demonstrate that improvements from Reflexion (Shinn et al., 2023) and RCI (Kim et al., 2023) are only from using oracle correctnessans information.
Madaan et al. (2023) and Huang et al. (2023) both demonstrate that self-correction is only effective with external feedback – for example, both Shinn et al. (2023) and Kim et al. (2023) rely on oracle labels for improvements – but there is often no external feedback available in many real-world applications.
In Table 5, we average the F1s using correctans and incorrectans as the positive label, weighted according to the number of times each label occurs.
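The weighted F1 average and the all-incorrect baseline can be reproduced with a short sketch. The function names are illustrative, and the label counts (45 correctans and 255 incorrectans traces per task) are taken from the sampling described earlier in the text.

```python
from typing import List


def f1_for_label(gold: List[str], pred: List[str], positive: str) -> float:
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(gold: List[str], pred: List[str]) -> float:
    """Average per-label F1, weighted by how often each gold label occurs."""
    total = len(gold)
    return sum(
        gold.count(label) / total * f1_for_label(gold, pred, label)
        for label in sorted(set(gold))
    )


# 300 traces per task: 255 with an incorrect final answer, 45 with a correct one.
gold = ["incorrect"] * 255 + ["correct"] * 45
all_incorrect = ["incorrect"] * 300

baseline = weighted_f1(gold, all_incorrect)
print(round(100 * baseline, 1))  # 78.1
```

Predicting incorrectans for every trace gives precision 0.85 and recall 1 on that label (F1 ≈ 0.919) and an F1 of 0 on the correctans label, so the weighted average is 0.85 × 0.919 ≈ 0.78, which matches the stated baseline.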
As an alternative, we explore the possibility of replacing external feedback with a lightweight classifier trained on a small amount of data. Analogous to reward models in conventional reinforcement learning, this classifier detects any logical errors in a CoT trace, which is then fed back to the generator model to improve on the output. This can be done over multiple iterations to maximise improvements. We propose a simple backtracking method to improve model outputs based on the location of logical errors:
Our backtracking method provides several benefits over existing self-correction methods:
Backtracking with mistake location information from a reward model can be construed as a lightweight RL method. However, unlike conventional deep reinforcement learning:
As an initial experiment, we use labels from BIG-Bench Mistake to test if an LLM is able to correct logical errors using backtracking, independent of its inherent ability to identify these errors or any other reward model.
For example, if the mistake location is in step 4, we use backtracking to regenerate that step and continue the rest of the chain. If the mistake location is that there are no logical mistakes, we do not backtrack and use the original result.
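The regenerate-and-continue procedure can be sketched as below. This is a sketch under stated assumptions rather than the authors' implementation: `sample_step` and `is_final` are hypothetical interfaces to a generator model, the default of 8 samples is an assumption, discarding candidates identical to the flagged step is an assumption, and continuing the chain at temperature 0 mirrors how the original traces were generated. Resampling at temperature 1 and keeping the candidate with the highest log-probability follow the summary at the start of this document.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical generator interface: (steps so far, temperature, n) ->
# n candidate next steps as (text, log_probability) pairs.
Sampler = Callable[[List[str], float, int], List[Tuple[str, float]]]


def backtrack(
    trace: List[str],
    mistake_location: Optional[int],  # 1-based step index, or None for "no mistake"
    sample_step: Sampler,
    is_final: Callable[[str], bool],
    n_samples: int = 8,   # assumption: sample count is not given in the text
    max_steps: int = 50,  # safety bound for the continuation loop
) -> List[str]:
    if mistake_location is None:
        return list(trace)  # no logical mistake: keep the original result

    index = mistake_location - 1
    prefix = list(trace[:index])
    flagged_step = trace[index]

    # Resample the flagged step with high variability (temperature = 1).
    candidates = sample_step(prefix, 1.0, n_samples)
    # Assumption: drop candidates that simply repeat the flagged step.
    fresh = [c for c in candidates if c[0] != flagged_step]
    if not fresh:
        return list(trace)
    best_text, _ = max(fresh, key=lambda c: c[1])

    # Continue the rest of the chain greedily from the replaced step.
    new_trace = prefix + [best_text]
    while not is_final(new_trace[-1]) and len(new_trace) < max_steps:
        next_text, _ = sample_step(new_trace, 0.0, 1)[0]
        new_trace.append(next_text)
    return new_trace


def toy_sampler(prefix: List[str], temperature: float, n: int) -> List[Tuple[str, float]]:
    """Stand-in for a real model, used only to exercise the control flow."""
    step_no = len(prefix) + 1
    if temperature > 0:
        options = [
            (f"step {step_no} (wrong)", -0.1),
            (f"step {step_no} (fixed)", -0.5),
            (f"step {step_no} (alt)", -2.0),
        ]
        return options[:n]
    text = "answer: 42" if step_no >= 4 else f"step {step_no}"
    return [(text, 0.0)]


original = ["step 1", "step 2 (wrong)", "step 3", "answer: 41"]
revised = backtrack(original, 2, toy_sampler, lambda s: s.startswith("answer:"))
print(revised)
```

In the toy run, the flagged second step is replaced by the most likely candidate that differs from it, and the remaining steps are regenerated from that point rather than reused.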
The results are shown in Table 6. To show that performance increases are not due to randomly resampling outputs, we compare our results to a random baseline, where a mistake location (see footnote 9) is randomly selected for each trace and we perform backtracking based on the random location.
8 Having no logical errors in incorrectans traces is much rarer but does exist, for example when the answer is correct but is not captured by exact match, or if the original question is faulty and has multiple possible answers.
9 As described above, the mistake location can be either the number representing the step, or that there are no mistakes. If there are no mistakes, we do not use backtracking and simply use the original trace.
Table 6: Absolute differences in accuracyans before and after backtracking. “With mistake location” indicates that backtracking was done using oracle mistake locations from the dataset; “With random location” indicates that backtracking was done based on randomly selected locations. ∆accuracy ✓ refers to differences in accuracyans on the set of traces whose original answer was correctans; ∆accuracy ✗ for traces whose original answer was incorrectans. The average number of steps in a trace is shown to demonstrate the likelihood of randomly selecting the correct mistake location in the random baseline condition.
Note that Table 6 separates results into numbers for the correct set and the incorrect set, referring to whether the original trace was correctans or not. This gives a clearer picture than the overall accuracyans, which would be skewed by the proportion of traces that were originally correctans (15%) and incorrectans (85%).
Scores represent the absolute differences in accuracyans. We perform backtracking on both correctans and incorrectans traces, as long as there is a mistake in one of the steps.
∆accuracy ✓ refers to differences in accuracyans on the set of traces whose original answer was correctans. Note that we take losses here because, despite the correct answer, there is a logical mistake in one of the steps. Therefore, the answer may change to an incorrect one when we backtrack.
∆accuracy ✗ is the same but for incorrectans traces, so the answers may have been corrected, hence increasing accuracyans.
For example, for the word sorting task, 11.11% of traces that were originally correctans became incorrectans, while 23.53% of traces that were originally incorrectans became correctans.
The scores show that the gains from correcting incorrectans traces are larger than losses from changing originally correct answers. Additionally, while the random baseline also obtained improvements, they are considerably smaller than if the true mistake location was used. Note that tasks involving fewer steps are more likely to improve performance in the random baseline, as the true mistake location is more likely to be identified.
While our numbers do show that our gains are higher than our losses, it should be noted that changes in the overall accuracy depend on the original accuracy achieved on the task. For example, if the original accuracy on the tracking shuffled objects task was 50%, the new accuracy would be 68.6%. On the other hand, if the accuracy was 99%, the new accuracy would drop to 92.8%. As our dataset is highly skewed and only contains 45 correctans traces per task, we leave a more comprehensive assessment of the effectiveness of backtracking to future work.
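The dependence on the original accuracy can be written out explicitly. The sketch below assumes the new accuracy is the original accuracy reduced by the loss rate on originally correct traces, plus the gain rate applied to originally incorrect traces. The function names are illustrative, and the rates used are the word sorting figures quoted above (11.11% loss, 23.53% gain), since the Table 6 values for other tasks are not reproduced here.

```python
def accuracy_after_backtracking(original_accuracy: float, gain: float, loss: float) -> float:
    """Overall accuracy after backtracking.

    gain: fraction of originally incorrect traces that become correct.
    loss: fraction of originally correct traces that become incorrect.
    """
    return original_accuracy * (1.0 - loss) + (1.0 - original_accuracy) * gain


def break_even_accuracy(gain: float, loss: float) -> float:
    """Original accuracy at which gains and losses cancel out exactly."""
    return gain / (gain + loss)


# Word sorting rates quoted in the text.
GAIN = 0.2353
LOSS = 0.1111

print(round(accuracy_after_backtracking(0.50, GAIN, LOSS), 4))  # 0.5621
print(round(accuracy_after_backtracking(0.99, GAIN, LOSS), 4))  # 0.8824
print(round(break_even_accuracy(GAIN, LOSS), 3))
```

With these rates, backtracking helps when the original accuracy is 50% but hurts when it is 99%. The crossover sits where `original_accuracy * loss` equals `(1 - original_accuracy) * gain`.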
We show in subsection 4.1 that backtracking can be used to correct CoT traces using gold mistake location labels. To explore what level of reward model accuracy is needed when gold labels are not available, we use backtracking with simulated reward models, designed to produce labels at different levels of accuracy. We use accuracyRM to refer to the accuracy of the simulated reward model at identifying mistake locations.
For a given reward model at X% accuracyRM , we use the mistake location from BIG-Bench Mistake X% of the time. For the remaining (100 − X)%, we sample a mistake location randomly. To mimic the behaviour of a typical classifier, mistake locations are sampled to match the distribution found in the dataset. We also ensure that the sampled location does not match the correct location.
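A simulated reward model of this kind might look like the following sketch. The function name is hypothetical, and two details are assumptions: the gold label is kept with probability X for each trace independently (rather than for exactly X% of traces), and a trace keeps its gold label if the dataset contains no alternative location to sample.

```python
import random
from collections import Counter
from typing import List, Optional


def simulate_reward_model(
    gold_locations: List[Optional[int]],  # None stands for "no mistake"
    accuracy_rm: float,                   # target accuracy in [0, 1]
    rng: random.Random,
) -> List[Optional[int]]:
    """Return noisy mistake locations at roughly `accuracy_rm` accuracy."""
    # Empirical distribution of mistake locations in the dataset.
    counts = Counter(gold_locations)
    predictions: List[Optional[int]] = []
    for gold in gold_locations:
        if rng.random() < accuracy_rm:
            predictions.append(gold)  # use the gold label
            continue
        # Sample a wrong location, weighted by dataset frequency and
        # excluding the correct location.
        options = [loc for loc in counts if loc != gold]
        if not options:
            predictions.append(gold)  # assumption: nothing else to sample
            continue
        weights = [counts[loc] for loc in options]
        predictions.append(rng.choices(options, weights=weights, k=1)[0])
    return predictions


gold = [1, 2, None, 3, 2, None, 1, 2]
noisy = simulate_reward_model(gold, 0.7, random.Random(0))
agreement = sum(p == g for p, g in zip(noisy, gold)) / len(gold)
print(noisy, agreement)
```

On a small sample the realised agreement will fluctuate around the target; it approaches accuracyRM as the number of traces grows.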
Results are shown in Figure 2. We can see that the losses in ∆accuracy ✓ begin to plateau at 65%. In fact, for most tasks, the gains in ∆accuracy ✗ already outweigh the losses in ∆accuracy ✓ at around 60-70% accuracyRM. This demonstrates that while higher accuracies produce better results, backtracking is still effective even without gold standard mistake location labels.
We perform a preliminary investigation of whether mistake finding can benefit from a dedicated reward model, and whether learning to find mistakes in a set of tasks can transfer to finding mistakes in out-of-distribution tasks. We fine-tune a PaLM 2-XS-Otter model on our available data for 20k steps and choose the checkpoint with the best validation results. We hold out one task for evaluation while training the reward model on the other 4 tasks.
Note that the reward model we train is significantly smaller than our inference model. We show the relative improvements and losses in Table 7 vs. a zero-shot baseline on PaLM 2-L-Unicorn. We see gains for 4 out of 5 of the tasks. This provides an initial indication that it may be possible to train separate reward model classifiers to assist in backtracking, and that these reward models do not have to be large. Further, a reward model can work on mistakes that are out-of-distribution. However, we believe more data may be necessary to improve results across the board on all tasks. We leave the collection of this larger dataset and a more rigorous investigation of the trade-offs of model size vs. performance of the reward model to future work.
We also leave for future work the effect of backtracking iteratively with a reward model: for example, the generator model may make another mistake after backtracking for the first time, which can then be identified and corrected again.
Datasets. To our knowledge, the only publicly available dataset containing mistake annotations in LLM outputs is PRM800K (Lightman et al., 2023), which is a dataset of solutions to Olympiad-level math questions. Our dataset BIG-Bench Mistake covers a wider range of tasks to explore the reasoning capabilities of LLMs more thoroughly. Additionally, the generator LLM used in PRM800K has been fine-tuned on 1.5B math tokens as well as a dataset of step-by-step math solutions. For this paper, we wanted to explore few-shot in-context learning methods, which are typically used in real-world applications with API-based LLMs.
Self-correction. Pan et al. (2023) present a plethora of self-correction methods in recent literature. While their list includes training-time correction strategies such as RLHF (Ouyang et al., 2022) and self-improve (Huang et al., 2022), our backtracking method falls into the category of post-hoc correction, where the correction process is applied to outputs that have already been generated.
Our paper focuses on correction of logical and reasoning errors, rather than stylistic or qualitative improvements. Previous post-hoc correction methods that are applied to reasoning errors include Reflexion (Shinn et al., 2023) and RCI (Kim et al., 2023), both of which cause performance deterioration when the oracle label is not used (Huang et al., 2023). Other methods such as Self-Refine (Madaan et al., 2023) and iterative refinement (Chen et al., 2023) focus on qualitative or stylistic improvements rather than correcting logical errors.
In this paper, we describe and release our dataset BIG-Bench Mistake for mistake finding, and propose a backtracking method to correct logical errors in CoT-style traces. We show that LLMs generally struggle with finding logical errors without external feedback, but argue that this feedback can come from a reward model instead. Finally, we demonstrate the effectiveness of backtracking, both with gold standard labels as well as with simulated reward models at lower levels of accuracy.
Figure 2: ∆accuracy ✓ and ∆accuracy ✗ on each dataset as accuracyRM increases.
Limitations
One main limitation of our dataset is that it features tasks that are artificial and unrealistic for real-world applications. We made this choice to minimise ambiguity and subjectivity during the mistake finding process, but further work needs to be done to determine the effectiveness of backtracking in a more realistic setting.
Another limitation is that our paper does not evaluate backtracking on the original datasets on BIG-Bench, only showing results on the limited set that we sampled in a skewed manner, in order to maximise the value of the human annotators’ time. We leave the full evaluation to future work.