Reasoning | Premise Order Matters in Reasoning

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-02-15

Premise Order Matters in Reasoning with Large Language Models

  • url: https://arxiv.org/abs/2402.08939
  • pdf: https://arxiv.org/pdf/2402.08939
  • abstract: Large language models (LLMs) have accomplished remarkable reasoning performance in various domains. However, in the domain of reasoning tasks, we discover a frailty: LLMs are surprisingly brittle to the ordering of the premises, despite the fact that such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting the premises in the same order as the ground truth proof in the prompt (as opposed to random ordering) drastically increases the model’s accuracy. We first examine the effect of premise ordering on deductive reasoning on a variety of LLMs, and our evaluation shows that permuting the premise order can cause a performance drop of over 30%. In addition, we release the benchmark R-GSM, based on GSM8K, to examine the ordering effect for mathematical problem-solving, and we again observe a significant drop in accuracy, relative to the original GSM8K benchmark.


TL;DR


Investigating the mathematical reasoning ability of large language models and the effect of premise order

  • The reasoning performance of large language models (LLMs) depends heavily on the order in which the premises are presented.
  • Reordering the premises is known to produce substantial differences in reasoning accuracy.
  • This study systematically analyzes how premise order affects the reasoning ability of LLMs.

1. Introduction

Recent studies show that large language models (LLMs) deliver impressive performance on a variety of reasoning benchmarks, including mathematics, science, and coding problems. However, these models have been found to suffer large performance drops under the simple manipulation of reordering the premises. This means that LLMs are sensitive to premise order during reasoning; the preferred ordering mirrors human preferences, although humans are far less affected by such reordering. This work systematically investigates this property of LLMs and analyzes how premise order affects reasoning outcomes.


2. Background and Theoretical Basis

2.1 Motivation

LLMs are used across many areas of natural language processing, but tasks that require complex logical reasoning still leave room for improvement. In particular, a model's reasoning performance varies considerably with the order in which the premises are presented, which suggests that the model relies too heavily on the structure of its training dataset.

2.2 Theoretical Basis

To analyze the effect of premise order on the reasoning process, consider the following formalization. In a logical derivation, the order of the premises must not affect the conclusion. For example, in the logical formula below,

\[(A \rightarrow B) \wedge (B \rightarrow C) \wedge A \Rightarrow C\]

reordering the premises over $A$, $B$, and $C$ still leaves the conclusion $C$ true. An LLM's reasoning accuracy, however, can vary with the order in which the premises are presented. The goal is to analyze this phenomenon formally and to understand why models exhibit this behavior.
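
As a quick illustration (not from the paper), the brute-force check below verifies that the entailment above holds under every permutation of the three premises; the premise encoding and function names are ours.

```python
from itertools import permutations, product

# Encode each premise as a predicate over a truth assignment (a, b, c).
premises = [
    lambda a, b, c: (not a) or b,  # A -> B
    lambda a, b, c: (not b) or c,  # B -> C
    lambda a, b, c: a,             # A is True
]

def entails_c(ordered_premises):
    """C must hold in every truth assignment that satisfies all premises."""
    for a, b, c in product([False, True], repeat=3):
        if all(p(a, b, c) for p in ordered_premises) and not c:
            return False
    return True

# The logical conclusion is invariant to premise order; only LLM accuracy is not.
assert all(entails_c(order) for order in permutations(premises))
print("C follows from the premises under all 6 orderings.")
```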


3. Method

3.1 Experimental Design

To evaluate the reasoning ability of LLMs, a set of logic problems is prepared, and the premises of each problem are presented in different orders to observe how the models respond. For the data, an existing benchmark such as GSM8K is modified to create a new dataset (R-GSM) in which only the order of the premises is changed.
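
A minimal sketch of such an evaluation loop, assuming hypothetical `ask_llm` and `check_answer` helpers that wrap the model call and the answer check:

```python
import random

def accuracy_by_order(problems, orderings, ask_llm, check_answer):
    """Present the same premises in several orders and compare accuracy."""
    scores = {name: 0 for name in orderings}
    for premises, question, answer in problems:
        for name, reorder in orderings.items():
            prompt = "\n".join(reorder(list(premises))) + "\n" + question
            scores[name] += int(check_answer(ask_llm(prompt), answer))
    return {name: hits / len(problems) for name, hits in scores.items()}

orderings = {
    "original": lambda ps: ps,
    "reversed": lambda ps: ps[::-1],
    "shuffled": lambda ps: random.sample(ps, len(ps)),
}
```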

3.2 Dataset and Benchmark

The R-GSM dataset is built by restructuring problems from the GSM8K dataset so that the same problem is presented with different orderings of its statements. Each problem is provided in its original order and in a reordered version, and the pair is used to evaluate how the model responds to the change.


4. Results and Discussion

4.1 Main Results

The experiments confirm that LLM reasoning performance varies considerably with the order in which the premises are presented. In particular, the models perform best when the premises are arranged in the order of the logical derivation, and performance degrades when the order is reversed or randomized.

4.2 Discussion

These results suggest that LLMs rely on particular patterns learned during training and that new training methods are needed to overcome this dependence. Further research is also needed on how models perceive and process the order of premises.


5. Conclusion and Future Work

This work establishes that LLM reasoning ability depends on the order of the premises and analyzes this phenomenon systematically. Future work should focus on developing new model architectures and training methods that minimize the effect of premise ordering.


1. Introduction

Large language models (LLMs) have demonstrated impressive performance across a variety of reasoning tasks (Austin et al., 2021; Chen et al., 2021; Cobbe et al., 2021; Hendrycks et al., 2021; Wei et al., 2022). In particular, recent state-of-the-art LLMs have reached or even surpassed human performance on multiple reasoning benchmarks, including STEM problem-solving and code generation (Bubeck et al., 2023; Gemini, 2023; Li et al., 2022). However, recent works show that LLMs exhibit failure modes that align with human-like cognitive bias (Berglund et al., 2023; Hagendorff et al., 2023; Jones and Steinhardt, 2022; McCoy et al., 2023; Shi et al., 2023). For example, Berglund et al. (2023) revealed the Reversal Curse; i.e., LLMs trained on “A is B” tend to fail to infer that “B is A.” Distractibility is another failure mode (Jones and Steinhardt, 2022; Shi et al., 2023), where the LLM performance drastically decreases when irrelevant context is included in the task description.

In this work, we investigate the effect that premise order has on LLM reasoning. Specifically, in deductive reasoning, changing the order of premises alone does not change the conclusion. Consider the following illustrative example:

  1. If A then B.
  2. If B then C.
  3. A is True.

We can derive that 𝐶 is True regardless of the order of these 3 premises. While some studies show that humans have a preference on the premise order to facilitate their reasoning (Dekeyser et al., 2000; Girotto et al., 1997), the premise order does not drastically affect human performance, especially for problems that only involve modus ponens (if P then Q; P; therefore Q), which are relatively straightforward for humans.

In contrast to humans, we observe that for LLMs, the premise order has a significant impact on reasoning performance. In particular, LLMs reach the best performance when the premises are arranged in the same order as they appear in the ground-truth proof. Taking the illustrative problem above as an example, we observe two phenomena:

  1. Presenting “If A then B” before “If B then C” in the prompt generally achieves a higher accuracy compared to the reversed order.
  2. The performance gap is more significant when the number of premises increases.

Intuitively, such a preference on the premise order aligns with human preference (Dekeyser et al., 2000) because in the preferred order, each derivation step can be done on-the-fly while looking at premises one by one, without needing to look back and forth across all premises at each step.
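
A small sketch (ours, not the authors' code) of this intuition: when the rules follow the proof order, a single left-to-right pass suffices, whereas a shuffled order forces repeated passes over the premises.

```python
def passes_needed(facts, rules, goal):
    """Count left-to-right passes over `rules` until `goal` is derived."""
    known = set(facts)
    passes = 0
    while goal not in known:
        passes += 1
        fired = False
        for body, head in rules:              # read rules in the given order
            if body in known and head not in known:
                known.add(head)
                fired = True
        if not fired:
            return None                        # goal is not derivable
    return passes

forward  = [("A", "B"), ("B", "C")]            # matches the ground-truth proof
shuffled = [("B", "C"), ("A", "B")]            # reverse of the proof order
print(passes_needed({"A"}, forward, "C"))      # 1 pass
print(passes_needed({"A"}, shuffled, "C"))     # 2 passes
```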

We conduct a systematic study on the premise order effect using a variety of SoTA LLMs, including GPT-4-turbo, GPT-3.5-turbo (OpenAI, 2023), PaLM 2-L (Google, 2023), and Gemini Pro (Gemini, 2023). Our primary focus is deductive reasoning, and we benchmark all LLMs on problems that only involve modus ponens (if P then Q; P; therefore Q), where all LLMs in our evaluation at least achieve decent performance with a small number of premises. We show that the accuracy decrease caused by different orderings can be more than 30%. The ordering effect is further amplified when irrelevant premises (i.e., premises that are not needed to derive a conclusion) are presented in the prompt. Figure 1 illustrates a failure case, where all LLMs fail to generate the proof after changing the order of relevant rules. Interestingly, while all LLMs perform best when the premise order follows the ground truth proof, they reveal different preferences on other alternative orderings. Specifically, compared to randomly ordering the premises, GPT-4-turbo and GPT-3.5-turbo generally achieve better performance when the premise order is exactly the reverse of the ground truth proof, which enables LLMs to perform derivation via backward chaining. On the other hand, PaLM 2-L generally achieves the worst performance with such a reversed order. Besides logical reasoning, we construct R-GSM to further investigate the ordering effect on mathematical reasoning. Specifically, we build R-GSM on top of a subset of GSM8K problems, where we change the order of sentences in the problem description and manually verify that the ground truth answer remains the same. Our experiments again show that the performance of all LLMs notably drops, especially on longer problems that require more reasoning steps.

Our evaluation highlights that even in reasoning domains where the premise order does not matter, premise order does matter in LLM reasoning. Specifically, the premise ordering effect indicates that LLMs are more comfortable reasoning via reading left-to-right instead of back-and-forth, which can be attributed to the auto-regressive model design or the reasoning bias learned from the training corpus. We leave proposing new training and modeling techniques to mitigate the premise order effect as future work.

2. Benchmarks

2.1. Logical Reasoning

Prior work has revealed the weaknesses of LLMs in logical reasoning (Han et al., 2022; Saparov and He, 2022; Saparov et al., 2023; Wan et al., 2024; Xu et al., 2023), especially when the proof is long and requires the knowledge of multiple deduction theorems. To isolate the effect of premise orders, we focus on a confined problem space adapted from SimpleLogic (Zhang et al., 2022), which only includes propositional logic problems with definite clauses. Specifically, each problem includes: (1) a set of facts $A_1,\ldots,A_n$ that hold true; (2) a set of rules of the form “If $X$, then $Y$”, “If $X_0$ and $X_1$, then $Y$”, or “If $X_0$ and $X_1$ and $X_2$, then $Y$”; and (3) a conclusion “$C$ is True” to be proved. As opposed to SimpleLogic — which formulates the problem as a binary classification task (i.e., indicate whether the conclusion is True or False) — in our benchmark, every problem has a ground-truth label of True, and we consider the prediction to be correct only when the generated proof is completely valid. With these strict criteria, the LLM is required to produce the step-by-step deduction that leads to the conclusion, and any hallucination of non-existent facts and rules is considered erroneous.

The key characteristic of our benchmark is that for each logical reasoning problem, we synthetically generate variants with different premise orders. Specifically, we denote the order that conforms to the ground truth proof with forward chaining as the forward order, where the rule applied in each derivation step is sequentially presented in the problem description. Intuitively, presenting premises in the forward order simplifies the problem for humans, as this allows us to write the proof on-the-fly while reading the premises. Conversely, a premise ordering that is more random increases the task difficulty, since carrying out the derivation requires us to repetitively look for premises for each reasoning step. Motivated by this intuition, we categorize different premise orders based on their Kendall tau distance $\tau$ (Cicirello, 2019; Sen, 1968) to the forward order, normalized into the range $[-1, 1]$. Specifically, $\tau = 1$ is the forward order, and we denote the order with $\tau = -1$ as the backward order, which is the reverse of the forward order and aligns with the proof via backward chaining. $\tau \approx 0$ suggests that there is no strong correlation between the premise order in the problem description and the proof. To thoroughly investigate the LLM preference on different premise orders, we evaluate the model performance on $\tau = 0.5$, $0$ and $-0.5$, in addition to the forward ($\tau = 1$) and backward ($\tau = -1$) orders. We present examples with $\tau = 1$ and $0$ in Figure 1, and defer examples with other $\tau$ values to Figure 11 in Appendix B.
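
For concreteness, here is a small sketch (ours) of how such a score can be computed with SciPy's Kendall tau, so that the forward order gets $\tau = 1$ and the backward order gets $\tau = -1$:

```python
from scipy.stats import kendalltau

def order_tau(premise_order, forward_order):
    """Normalized Kendall tau of a presented order against the proof order."""
    presented_rank = {p: i for i, p in enumerate(premise_order)}
    x = [presented_rank[p] for p in forward_order]   # ranks in the prompt
    y = list(range(len(forward_order)))              # ranks in the proof
    tau, _ = kendalltau(x, y)
    return tau

proof = ["r1", "r2", "r3", "r4"]
print(order_tau(proof, proof))                       # 1.0  (forward)
print(order_tau(proof[::-1], proof))                 # -1.0 (backward)
print(order_tau(["r2", "r1", "r4", "r3"], proof))    # ~0.33 (partially shuffled)
```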

We measure the premise order effect by varying the following two factors:

  • Number of rules required in the proof. It is expected that the premise order effect is more significant with more rules. For our benchmark, we generate problems whose numbers of rules range from 4 to 12.
  • Number of distracting rules (i.e., rules that are not useful for the proof) presented in the problem. The presence of distracting rules also complicates the problem, as premise selection itself is challenging (Ferreira and Freitas, 2020; Irving et al., 2016; Wang et al., 2017), and LLMs are shown to be easily distracted by irrelevant context (Shi et al., 2023). We include problem variants with 0, 5 and 10 distracting rules.

We generate 200 problems for each number of required rules. Considering different premise orders and numbers of distracting rules, each problem includes 15 variants, resulting in a total of 27K problems in our benchmark.
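
A back-of-the-envelope check of these counts (the breakdown into 9 rule counts, 5 orderings, and 3 distractor settings is our reading of the description above):

```python
rule_counts = range(4, 13)           # 4 to 12 required rules -> 9 settings
taus = [1, 0.5, 0, -0.5, -1]         # premise orderings evaluated
distractor_counts = [0, 5, 10]       # numbers of distracting rules

variants_per_problem = len(taus) * len(distractor_counts)        # 15
total_problems = len(rule_counts) * 200 * variants_per_problem   # 27,000
print(variants_per_problem, total_problems)
```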

2.2. R-GSM for Mathematical Reasoning

Figure 2. R-GSM example where the original problem can be correctly solved by all LLMs in our evaluation, but all of them failed on the reordered one. Different calculation steps and their corresponding problem statements are annotated in light blue. Specifically, the reasoning steps of the original problem follow the ordering of problem statements, while the reordered problem does not.

To further assess the effect of premise orders beyond logical reasoning, we construct the R-GSM dataset based on GSM8K (Cobbe et al., 2021), which is a popular benchmark of grade school math word problems. Specifically, we first select GSM8K test problems with at least 5 sentences in the problem description, then filter out those problems for which every alternative ordering changes the ground truth answer, e.g., problems whose statements follow the causal order of an event series. For each of the remaining problems, we keep the last sentence untouched and rewrite the problem description with a different ordering of the other sentences. Minor word-level edits are allowed to ensure the grammatical correctness of the problem description. To facilitate the annotation process, for each problem, we write a simple function to enumerate all alternative orderings of problem statements until an ordering that causes an LLM prediction failure is discovered, which can be used for our manual rewriting if the alternative ordering found in the enumeration process happens to preserve the ground truth answer. In total, our R-GSM benchmark contains 220 pairs of problems, including both the original GSM8K problem description and the manually rewritten one with a different ordering of problem statements. Although over 60% of the problems in R-GSM only have 5 sentences, and all problems have at most 8 sentences, our evaluation shows that all LLMs still perform considerably worse on rewritten problems. Figure 2 presents an example in R-GSM where all LLMs correctly solve the original problem but not the rewritten one. Specifically, the reasoning steps for the original problem follow the ordering of problem statements, while for the rewritten problem, the second calculation step in the correct solution should refer to the second-to-last sentence instead of the second sentence in the problem description. We provide a more detailed case study in Section 3.3, and present the full dataset statistics in Appendix A.
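
A rough sketch of the annotation aid described above; the helper name and the `solves_correctly` predicate (an LLM call plus an answer check) are ours, not the authors':

```python
from itertools import permutations

def find_breaking_order(sentences, solves_correctly):
    """Enumerate reorderings (last sentence fixed) until one breaks the model."""
    body, question = tuple(sentences[:-1]), sentences[-1]
    for perm in permutations(body):
        if perm == body:
            continue                               # skip the original ordering
        candidate = " ".join(perm + (question,))
        if not solves_correctly(candidate):
            return candidate                       # starting point for manual rewriting
    return None                                    # no reordering fooled the model
```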

3. Experiments

3.1. Experimental Setup

We evaluate the premise ordering effect on GPT-4-turbo, GPT-3.5-turbo, PaLM 2-L and Gemini Pro. We perform greedy decoding with temperature 0, and apply zero-shot prompting in all experiments. On R-GSM, the model input only contains the problem description without additional instructions. For logical reasoning, as shown in Figure 1, we add an instruction in the prompt to ask for a derivation that specifies which premise is used in each step.
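
A minimal sketch of this setup, assuming the OpenAI Python client; the model name and the instruction wording are illustrative rather than the authors' exact prompt:

```python
from openai import OpenAI

client = OpenAI()

def query(problem: str, logical_reasoning: bool) -> str:
    if logical_reasoning:
        # Logical reasoning: ask for a derivation that cites the premise used per step.
        prompt = (problem + "\n\nDerive the conclusion step by step, "
                  "stating which premise is used in each step.")
    else:
        # R-GSM: the input is the problem description alone, with no extra instruction.
        prompt = problem
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # greedy decoding
    )
    return response.choices[0].message.content
```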

3.2. Logical Reasoning

Figure 3. Logical reasoning without distracting rules. See Table 5 in Appendix D for accuracy numbers.

Figure 4. Logical reasoning with distracting rules. See Tables 6 and 7 for accuracy numbers.

Figure 3 presents the results with different numbers of relevant rules included in ground truth proofs, where the problem does not contain distracting rules, and the shuffled accuracy is the aggregation of results with 𝜏 = 0.5, 0 and -0.5. Across different LLMs, the forward order consistently achieves the best performance, which aligns with the human preference. The performance drop caused by alternative orderings becomes more significant when the number of rules increases. Meanwhile, models with weaker reasoning capabilities are also more sensitive to different premise orders. Specifically, while the accuracy decrease of GPT-4-turbo and PaLM 2-L is up to 20−30%, with Gemini-Pro and GPT-3.5-turbo, changing the premise order from the forward order can degrade the accuracy from over 65% to below 25%, with an accuracy decrease of more than 40%.

Figure 5. Results on different 𝜏 without distracting rules. See Table 8 for accuracy numbers.

Figure 6. Results on different 𝜏 with distracting rules. See Tables 9 and 10 for accuracy numbers.

Breakdown on different premise orders. We present a fine-grained breakdown of premise orderings in Figure 5, where the orders are categorized based on Kendall tau distance 𝜏 as described in Section 2.1. Interestingly, while the top preference of all LLMs is the forward order, their preferences among the other orders differ. Specifically, GPT-4-turbo generally prefers the backward order over other orders, and the overall performance decreases with a smaller absolute value of 𝜏. This observation is also consistent with the human reasoning pattern, as backward chaining is another well-established inference method. On the other hand, PaLM 2-L generally performs the worst with the backward order. As 𝜏 decreases (i.e., as the premise order deviates further from the forward order), its accuracy drops. The preferences of Gemini Pro and GPT-3.5-turbo are less consistent, but they still prefer the backward order more often than other non-forward premise orders.

Effect of distracting rules. We assess the effect of distracting rules on GPT-4-turbo and PaLM 2-L, which reach decent performance in the absence of distracting rules. Figures 4 and 6 show that adding distracting rules further decreases the reasoning performance and magnifies the effect of different premise orders. Still, the overall preferences of both LLMs remain the same as in the scenario without distracting rules. Specifically, both LLMs again achieve the best performance with the forward order, and GPT-4-turbo prefers the backward order over other non-forward orders, while PaLM 2-L performance decreases with a smaller 𝜏.

Error analysis. In Table 1, we present the breakdown on prediction errors with different premise orders. We consider the following error categories:

    1. wrong refutation: the LLM wrongly claims that the conclusion cannot be proved;
    2. rule hallucination: the LLM generates rules that do not exist in the problem;
    3. fact hallucination: the LLM generates facts that do not exist in the problem and are unproven.

We observe that for all LLMs, fact hallucination is typically the most common error pattern, and this error type escalates dramatically with the decrease of 𝜏. The main reason is that LLMs are inclined to use the rules in the sequential order as they appear in the problem, so when the next rule in the problem is not yet applicable, LLMs might still hallucinate facts to complete the proof step. Simultaneously, we observe that the percentage of wrong refutation is generally lower for 𝜏 = −1 than for |𝜏| < 1. We present an example of wrong refutation in Figure 1, and we include more examples of rule and fact hallucination in Figure 10 of Appendix B.

Table 1. Error analysis for logical reasoning with 12 relevant rules and no distracting rules.

3.3. R-GSM for Mathematical Reasoning

Table 2. Results on the R-GSM dataset: (a) accuracies on the full dataset; (b) for each model, the accuracies on the R-GSM subset where the original problems are correctly solved, thus the initial accuracy is 100% for all models.

Figure 7. R-GSM results with different numbers of reasoning steps in the ground truth. See Table 11 in Appendix E for accuracy numbers.

Figure 8. R-GSM results with different problem lengths. See Table 12 for accuracy numbers.

Table 2a presents the overall results on R-GSM. Again, all LLMs achieve lower performance on R-GSM. Note that the original GSM8K problems are not necessarily written in the most preferable way, and thus sometimes the manual rewriting facilitates the reasoning and allows the model to correctly solve a reordered version whose original version it fails on. Therefore, in Table 2b, for each LLM, we also present the accuracy on the subset of problems whose original descriptions the model solves correctly. We show that all LLMs fail on at least 10% of reordered problems that they are initially able to solve, and this performance degradation is more than 35% with GPT-3.5-turbo.

Breakdown of problem complexity. Figures 7 and 8 present the results broken down by the number of reasoning steps and the number of problem sentences, respectively. Unsurprisingly, across all LLMs, accuracy drops on problems that require more reasoning steps and contain a greater number of sentences. Overall, the gap between the accuracies on original and rewritten problems is more significant with more reasoning steps and longer problems for both GPT-4-turbo and Gemini Pro, while the gap remains similar across different numbers of reasoning steps and problem lengths for PaLM 2-L and GPT-3.5-turbo.

Error analysis. To further understand the failure modes, for each LLM, we analyze those error cases where the original problems can be correctly solved but not the reordered ones, and we categorize the common error types in Table 3. Similar to our observation in logical reasoning experiments, the prediction errors in R-GSM are primarily due to the LLMs blindly using numbers in the sequential order of their appearances in the problem. Specifically, the most common error case for all LLMs is their tendency to overlook temporal order. Figure 2 presents such an example, where the prediction failure is because some earlier events are described in the later part of the problem. Another category of errors occurs when some quantities are not specified while processing the problem in the sequential order, which introduces unknown variables for calculation. Take, for example, the problem in Figure 9. In the original problem, the number of each animal can be directly calculated based on its preceding sentence. However, in the reordered problem, the number of gerbils cannot directly be computed based on the preceding sentences, since the number of fish remains unknown up to that point, and the LLM must read the remaining sentences and calculate the number of fish first. However, the prediction from GPT-3.5-turbo instead uses the number calculated in the previous step (i.e., the number of rabbits) to calculate the number of gerbils, resulting in an error. Such a failure mode is less common with PaLM 2-L, but still constitutes a non-negligible proportion of prediction errors for the other LLMs. We present more examples of model predictions in Appendix C.

Table 3. Error analysis on R-GSM. “Temporal” refers to the temporal order, and “Unknown” refers to the unknown variables.

Figure 9. R-GSM example where the original problem can be correctly solved by all LLMs, but GPT-3.5-Turbo fails on the reordered version while all the other LLMs still solve it correctly.

4. Related Work

Failure modes of LLMs. The premise order effect in this work is connected to several failure modes of LLMs in the literature, including the reversal curse (Berglund et al., 2023), distractibility (Shi et al., 2023), and limited capability of logical reasoning (Han et al., 2022; Saparov and He, 2022; Saparov et al., 2023; Wan et al., 2024; Xu et al., 2023; Zhu et al., 2023). Specifically, Shi et al. (2023) show that including irrelevant context in the problem statement leads to a considerable performance drop on GSM8K and other reasoning benchmarks, revealing that LLMs are distractible. This finding is in line with our evaluation on logical reasoning, where we observe that adding irrelevant rules not only degrades the overall logical reasoning performance, but also escalates the premise order effect. The Reversal Curse (Berglund et al., 2023) unveils another perspective of the order effect, where they show that an LLM that recognizes “A is B” does not necessarily learn that “B is A.” While their work studies the order effect between two entities within a single factual statement, our work focuses on reasoning problems with multiple premises, without restrictions on the number of (or relationship between) entities. In particular, for logical reasoning, we demonstrate that random permutations of premises often result in worse accuracy than the purely backward order.

Order effect for human logical reasoning. Although the premise order does not matter in deductive reasoning, several studies show that the premise order can impact human reasoning performance (Dekeyser et al., 2000; Girotto et al., 1997). Dekeyser et al. (2000) described co-reference as a human preference of premise order; i.e., humans prefer the premises to be presented in an order where they can draw immediate conclusions after seeing each one. In this work, we show that LLMs also have such a preference, and they achieve the best performance when the ordering of rules follows the ground truth proof. Girotto et al. (1997) studied how the premise order affects logical reasoning for humans, and found that the premise order has a significant effect in solving modus tollens problems (i.e., if P, then Q; not Q; therefore, not P), but not modus ponens problems (i.e., if P, then Q; P; therefore, Q). However, differing from our work, they studied the influence of different ordering between rules and facts, e.g., their experiments on modus tollens problems show that presenting negation statements (not Q) before rules (if P, then Q) improves the performance over the reverse order. On the other hand, our work focuses on modus ponens problems that are easier for both humans and LLMs, and we show that the LLM performance is still quite sensitive to the ordering of the premises.

Order effect of language models. Some prior works show that language models are able to understand permuted texts to some extent, i.e., after a random permutation of words, models usually preserve a reasonable performance (Abdou et al., 2022; Sinha et al., 2020). Moreover, Cao et al. (2023) shows that even when a large fraction of words are scrambled, GPT-4 still achieves decent performance on several reasoning benchmarks. In contrast to permuted texts in these works that are typically unnatural and nonsensical, our premise order permutations do not alter the semantic meaning and remain syntactically valid (we manually verify this). Nevertheless, we demonstrate that LLM reasoning performance is highly brittle to the ordering of the premises.

5. Conclusion

In this work, we show that the premise order significantly affects LLMs’ performance on reasoning tasks, even when the premise order does not change the underlying task itself. Our comprehensive evaluation demonstrates that LLM tendencies resemble human preference w.r.t. premise order, i.e., LLMs achieve the best performance when the premise order follows the intermediate reasoning steps to solve the problem. Conversely, LLMs face difficulties when the reasoning problem requires the model to read the problem description back-and-forth, resulting in a performance drop of over 30%. We further extend the study to mathematical reasoning and present the R-GSM benchmark, and again experimentally confirm the ordering effect.

While humans also have a preference of premise orders for reasoning problems, LLMs are much more susceptible to such ordering effects. We can attempt to ascribe the premise order effect to several candidate factors, such as the auto-regressive model design, training objectives, and training data mixture. However, we leave proposing theoretical explanations of this limitation and developing new techniques towards addressing the premise order effect as future work.
