
CoT | Chain-of-Thought Without Prompting*

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-02-16

Chain-of-Thought Reasoning Without Prompting

  • url: https://arxiv.org/abs/2402.10200
  • pdf: https://arxiv.org/pdf/2402.10200
  • abstract: In enhancing the reasoning capabilities of large language models (LLMs), prior research primarily focuses on specific prompting techniques such as few-shot or zero-shot chain-of-thought (CoT) prompting. These methods, while effective, often involve manually intensive prompt engineering. Our study takes a novel approach by asking: Can LLMs reason effectively without prompting? Our findings reveal that, intriguingly, CoT reasoning paths can be elicited from pre-trained LLMs by simply altering the *decoding* process. Rather than conventional greedy decoding, we investigate the top-k alternative tokens, uncovering that CoT paths are frequently inherent in these sequences. This approach not only bypasses the confounders of prompting but also allows us to assess the LLMs’ *intrinsic* reasoning abilities. Moreover, we observe that the presence of a CoT in the decoding path correlates with a higher confidence in the model’s decoded answer. This confidence metric effectively differentiates between CoT and non-CoT paths. Extensive empirical studies on various reasoning benchmarks show that the proposed CoT-decoding substantially outperforms the standard greedy decoding.

Contents

TL;DR


A study on improving the mathematical reasoning ability of large language models

  • Large language models (LLMs) show improved performance across a variety of reasoning benchmarks.
  • This work proposes a new decoding approach to further improve the reasoning ability of LLMs.
  • The proposed approach, CoT (chain-of-thought) decoding, induces the LLM to generate reasoning paths on its own.

1. Introduction

Large language models have recently shown remarkable results on a wide range of mathematical and logical reasoning problems. These models have the potential to exhibit reasoning ability directly from their pre-trained state, without additional tuning, simply by adjusting how decoding is performed. The paper develops a new decoding technique that, using only the standard question-answering (QA) format, lets the model carry out its own reasoning process.


2. Background

2.1. Motivation

Prior work mainly induces reasoning by giving the model task-specific prompts. This approach limits how directly the model internalizes the reasoning process, and it lacks generality because prompts must be designed for each task. The paper therefore proposes CoT decoding, which lets the model derive the reasoning process on its own.

2.2. Mathematical Principle of CoT Decoding

CoT decoding selects, among the multiple reasoning paths the model can generate, the one with the highest confidence in the final answer. At each step the top-$k$ candidate tokens are considered, and the path with the highest confidence is chosen based on the tokens' selection probabilities.

\[\Delta(x) = \frac{1}{n}\sum_{t=1}^{n} \big(\max p(x_t) - \text{second}\max p(x_t)\big)\]

Here $x_t$ is the answer token at step $t$, and $\max p(x_t)$ and $\text{second}\max p(x_t)$ are the highest and second-highest probabilities at that step. $\Delta(x)$ is the model's average confidence margin over the $n$ answer tokens; the larger this value, the more confident the model is in that path.
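
A toy numerical reading of the score above; the probabilities are invented for illustration and are not taken from the paper:

```python
# Made-up top-1 / top-2 probabilities for a three-token answer span.
top1 = [0.90, 0.80, 0.95]
top2 = [0.05, 0.10, 0.02]
delta = sum(p1 - p2 for p1, p2 in zip(top1, top2)) / len(top1)
print(round(delta, 3))  # 0.827 -> a larger margin means a more confident answer
```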


3. Method

3.1. Experimental Design

The effectiveness of CoT decoding is validated on a benchmark of mathematical reasoning problems (GSM8K, Cobbe et al., 2021). The experiments use the PaLM-2 models to compare conventional greedy decoding against the proposed CoT decoding.

3.2. Datasets and Evaluation

The dataset contains a variety of math word problems, and reasoning performance is measured by answer accuracy. Models using CoT decoding achieve higher accuracy than the greedy baseline, with especially pronounced gains on problems that require complex, multi-step reasoning.


4. Results and Discussion

4.1. Key Findings

With CoT decoding, the model generates reasoning paths on its own, and the confidence it shows while doing so reflects how accurately it has derived the answer. The results suggest that large language models can exercise reasoning ability without prompting, which carries important implications for future model design and training methods.

4.2. Future Directions

Future work could validate the generality of CoT decoding across a broader range of problem types and develop more refined decoding algorithms to further improve models' reasoning ability.


1. Introduction

Large language models (LLMs) have demonstrated remarkable performance on various complicated reasoning benchmarks (Anil et al., 2023; Brown et al., 2020; Chowdhery et al., 2023; Gemini, 2023; OpenAI, 2023; Romera-Paredes et al., 2023). These reasoning capabilities of LLMs are typically elicited by prompting techniques (Brown et al., 2020), which can be few-shot prompting with demonstration exemplars augmented with intermediate steps (Chen et al., 2023b; Gao et al., 2022; Nye et al., 2021; Wei et al., 2022; Yao et al., 2023; Zhou et al., 2023a), or zero-shot prompting with specific instructions that ask the model to show certain intermediate steps (Kojima et al., 2022; Yasunaga et al., 2023). The other prevalent strategy for eliciting LLM reasoning is through model training or instruction tuning using a substantial amount of chain-of-thought (CoT) reasoning data (Chung et al., 2022; Cobbe et al., 2021b; Ling et al., 2017; Nye et al., 2021).

In this work, we aim to elicit the reasoning ability of LLMs by exploring a different perspective and ask: Can LLMs reason effectively without prompting? And to what extent can they reason? We find that, perhaps surprisingly, there exists a task-agnostic way to elicit CoT reasoning from pre-trained LLMs by simply altering the decoding procedure. Figure 1 illustrates our new decoding approach: given a reasoning question, the LLM generates a wrong answer via the standard greedy decoding path, yet inspecting the alternative top-𝑘 tokens unveils inherent CoT paths (e.g., decoding paths 2 and 4) that accurately resolve the query. This decoding modification bypasses CoT prompting and is entirely unsupervised, requiring no model tuning.

In more detail, we formulate the input using the standard question-answer (QA) format: “Q: [question]\nA:”.1 While most existing work suggests that LLMs falter in such direct-QA scenarios on reasoning (Cobbe et al., 2021a; Kojima et al., 2022; Nye et al., 2021; Wei et al., 2022), our findings reveal a nuanced picture. We observe that LLMs indeed struggle with reasoning when relying solely on greedily decoded paths. However, when we consider alternative paths among the top-𝑘 tokens, CoT reasoning patterns emerge naturally within the decoding trajectories of LLMs. In addition, we have observed an interesting pattern: the model demonstrates increased confidence in the final answer when a CoT reasoning path is present in the decoding process. As illustrated in Figure 1, this is evident where paths 2 and 4 show heightened certainty in arriving at the correct answer “8”, contrasting sharply with the high uncertainty in paths that lead to the incorrect “5”. Leveraging this phenomenon, we develop a method to sift through the top-𝑘 decoding paths, which we refer to as CoT-decoding, thereby isolating the most reliable paths for model output.

1 The QA format is only needed because without it a pre-trained language model will continue the question instead of answering. It is also the most basic formatting employed in existing works for pre-trained language models.

Figure 1. Illustration of CoT-decoding. Pre-trained LLMs are capable of inherent reasoning without prompting by considering alternative top-𝑘 tokens, rather than solely relying on the top-1 greedy decoding path. Moreover, these models tend to display higher confidence in decoding the final answer (indicated by a darker shaded color) when a CoT reasoning path is present.

CoT-decoding offers an alternative way to elicit reasoning capabilities from pre-trained LLMs without explicit prompting. Moreover, it bypasses the confounders introduced by prompting, enabling a more accurate assessment of the models’ intrinsic reasoning abilities. In our experiments, we demonstrate that CoT-decoding spontaneously reveals CoT reasoning paths during decoding, significantly enhancing models’ reasoning capabilities over greedy decoding across various benchmarks. We also observe that these paths are more prevalent in tasks frequently represented in the pre-training data and less so in complex, synthetic tasks, where advanced prompting might still be necessary to trigger those reasoning paths. This aligns with findings in (McCoy et al., 2023; Prystawski et al., 2023; Razeghi et al., 2022). We also observe that in such scenarios, few-shot CoT demonstrations play a larger “teaching” role in guiding how models solve a task, with models primarily mimicking the format of these prompts to generate accurate reasoning paths.

Our contributions are summarized as follows:

• Our study reveals that pre-trained language models inherently possess reasoning capabilities, as evidenced by their generation of CoT reasoning paths when examining alternative top tokens during decoding, rather than relying on greedy decoding. This finding contrasts with prior research focused on improved prompting for reasoning, highlighting that a mere change in decoding strategy can effectively elicit model reasoning.

• We find that the language model’s confidence in its final answers increases when a CoT is present in its decoding path. Leveraging this increased confidence, we propose CoT-decoding to select more reliable decoding paths, demonstrating significant improvements over greedy decoding across various reasoning benchmarks.

2. Chain-of-Thought (CoT) Decoding

2.1. The Presence of CoT Paths during Decoding

We investigate whether pre-trained language models inherently exhibit CoT reasoning during decoding, without explicit prompts. In Table 1, we show example decoding paths across a diverse set of reasoning tasks, including math (GSM8K, Cobbe et al. (2021a)), commonsense (year parity, Allen-Zhu and Li (2023)), and symbolic reasoning (coin flip, Wei et al. (2022)). We employ the pre-trained PaLM-2 large model (Anil et al., 2023) to compare its standard greedy decoding path ($k = 0$), predominantly used in state-of-the-art LLMs for reasoning tasks, with alternative decoding paths ($k > 0$), where $k$ represents the choice of the $k$-th token at the first decoding step.

[GSM8K] Kylar went to the store to buy glasses for his new apartment. One glass costs $5, but every second glass costs only 60% of the price. Kylar wants to buy 16 glasses. How much does he need to pay for them?

Greedy path:

  • $k = 0$: 60.00 (0.029)

Alternative top-$k$ paths:

  • $k = 1$: 60 (0.058)
  • $k = 2$: Kylar needs to pay 60 for 16 glasses. (0.058)
  • . . .
  • $k = 7$: If Kylar buys 16 glasses, he will pay 60. (0.032)
  • $k = 9$: We can calculate the price of 16 glasses by multiplying the price of one glass by 16. However, we need to remember that every second glass costs only 60% of the price. Therefore, we need to multiply the price of one glass by 16 and then subtract 40% of the price of 8 glasses. 16 × 5 = 80, 8 × 5 = 40, 40 × 0.4 = 16, 80 − 16 = 64. Kylar needs to pay 64 for 16 glasses. (0.994)

Table 1. Examples of greedy decoded paths and alternative top-$k$ paths across various tasks, over the PaLM-2 Large model. The model’s confidence scores are highlighted in blue (See §2.2 for details).

In examining reasoning problems, we observe that models employing greedy decoding often do not produce a CoT path, opting to solve problems directly. This tendency may stem from the model’s skewed perception of problem difficulty, shaped by its pre-training on predominantly simpler questions. Consequently, the model is predisposed to immediate problem-solving. This observation aligns with findings in (Cobbe et al., 2021a; Kojima et al., 2022; Nye et al., 2021; Wei et al., 2022), which show that direct-answer prompts generally result in low accuracy on reasoning tasks.

Contrastingly, an intriguing phenomenon emerges when exploring alternative top-$k$ ($k > 0$) tokens at the first decoding step. Continuing with greedy decoding from this point reveals natural CoT reasoning in many cases. For instance, in the GSM8K question (Table 1), a valid CoT emerges at $k = 9$. Similarly, in the year parity task, greedy decoding attempts to directly answer the parity question at $k = 0$, leading to a random choice between “even” and “odd” which often results in an incorrect answer. However, when exploring $k > 0$, the model naturally generates CoT paths at $k = 3$ and $k = 7$, where it first determines the year before resolving the parity.
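
To make the procedure concrete, here is a minimal sketch of first-token branching written against the Hugging Face transformers API. The model name, the function name, and the values of k and max_new_tokens are illustrative assumptions, not the authors' implementation (the paper itself evaluates PaLM-2 and Mistral-7B):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # any causal LM works; chosen for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def cot_decode_paths(question, k=10, max_new_tokens=128):
    """Branch over the top-k tokens at the first decoding step, then decode greedily."""
    prompt = f"Q: {question}\nA:"                      # standard QA format, no CoT prompt
    input_ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        first_logits = model(input_ids).logits[0, -1]  # distribution over the first answer token
    topk = torch.topk(first_logits, k)
    paths = []
    for tok_id in topk.indices:
        branch = torch.cat([input_ids, tok_id.view(1, 1)], dim=-1)
        out = model.generate(branch, do_sample=False,   # greedy continuation after the branch
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tok.eos_token_id)
        paths.append(tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True))
    return paths
```

Each returned continuation can then be scored with the confidence measure described in §2.2.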

2.2. CoT-Decoding for Extracting CoT Paths

Despite the natural occurrence of chain-of-thought paths, extracting them from the top-$k$ decoded paths is still an unresolved challenge. Table 1 illustrates that CoT paths do not consistently outrank non-CoT ones in the model’s probability assessment. Moreover, they often do not represent the predominant answer among all paths, rendering methods like self-consistency (Wang et al., 2023a) inapplicable. For instance, in the GSM8K question, the prevalent answer “60”, which aligns with the greedy decoding result, fails to serve as a reliable indicator for identifying the correct path.

Interestingly, upon examining the model’s logits, we found that the presence of a CoT path typically leads to a more confident decoding of the final answer, characterized by a significant probability disparity between the top and secondary tokens:

\[\Delta = \frac{1}{n} \sum_{t \in \text{answer}} \big( P(x_{1,t}) - P(x_{2,t}) \big)\]

where $x_{1,t}, x_{2,t}$ represent the top two tokens at each decoding step $t$ in the $k$-th decoding path, chosen for their maximum post-softmax probabilities from the vocabulary, given $x_t$ being part of the answer tokens. The model’s overall confidence in decoding the final answer is approximated by averaging these probability differences for all relevant $x_t$ tokens, where $n$ is the total number of answer tokens. For example, for the GSM8K question in Table 1, given the answer “60”, we average the probability differences for all tokens in that answer, i.e., “6” and “0”.
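
As a sketch, this confidence margin can be computed from the per-step probability distributions of a decoded path once the answer positions are known. The function and argument names below are assumptions for illustration; with the transformers API, the per-step vectors can be collected via `model.generate(..., output_scores=True, return_dict_in_generate=True)`:

```python
import torch

def answer_confidence(step_logits, answer_positions):
    """Average gap between the top-1 and top-2 probabilities over the answer tokens.

    step_logits[t]  : vocabulary logit vector produced at decoding step t of one path
    answer_positions: steps whose emitted tokens form the final answer span
    """
    gaps = []
    for t in answer_positions:
        probs = torch.softmax(step_logits[t], dim=-1)
        top2 = torch.topk(probs, 2).values
        gaps.append((top2[0] - top2[1]).item())
    return sum(gaps) / len(gaps)
```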

This method, referred to as CoT-decoding, aims to extract such CoT paths among the various decoded paths from the language models. As illustrated in Table 1, each decoding path is marked with its corresponding $\Delta$ value in blue (the answer tokens are bolded). It is evident that paths with a CoT component exhibit a significantly higher $\Delta$, highlighting the model’s increased confidence, as opposed to paths without CoT.

An additional heuristic involves selecting the decoding path based on its length, with the intuition that longer decoding paths more likely contain CoTs. We empirically find this heuristic works to a certain degree for math reasoning questions, but its general applicability across reasoning tasks, such as the year parity task, is limited (refer to the example in Table 1 where the model’s decoding paths exhibit comparable lengths). Alternatively, one can employ the model’s probability score normalized by length. This approach similarly introduces a length bias, favoring longer decoding paths when the probabilities are closely aligned. Consequently, its effectiveness diminishes in reasoning tasks where the decoding paths are of similar lengths.

Identify the answer spans. There are multiple ways to identify the answer span in a model’s response. One straightforward approach is to extract the last numerical value in math reasoning tasks, or the final option in set-based reasoning tasks, as the answer, following the Tülu evaluation (Ivison et al., 2023; Liu et al., 2024; Wang et al., 2023b). This simple method works in most cases but can be less precise when there are distractive numbers/options following the correct answer in open-ended responses.

A slightly more principled approach, proposed in Kojima et al. (2022), involves extending the model’s output with the prompt “So the answer is”, and then we can align these continuations with spans in the model’s decoding path. This alignment can be done with the token ids directly, without decoding those ids into strings. This technique is versatile, suitable for tasks encompassing mathematical and natural language reasoning. Note it is crucial to calculate the Δ over the answer spans from the original decoding path, not those following “So the answer is”, to avoid reinforcing incorrect probabilities due to flawed reasoning. Intuitively, the Δ in the original decoding path represents the confidence of the model generating the answer based on the reasoning path, while the Δ in the answer following “So the answer is” only represents the confidence of retrieving that answer from the original decoded path.
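
A rough sketch of that alignment, assuming the transformers API and a naive contiguous-match search (the helper name, the trimming of EOS tokens, and the fallback are assumptions, not the paper's code):

```python
import torch

def locate_answer_span(model, tok, prompt_ids, path_ids, max_answer_tokens=16):
    """Probe with "So the answer is", then align the produced token ids back into the path.

    prompt_ids and path_ids are (1, L) tensors: the QA prompt and one decoded path.
    """
    suffix_ids = tok(" So the answer is", add_special_tokens=False,
                     return_tensors="pt").input_ids
    probe = torch.cat([prompt_ids, path_ids, suffix_ids], dim=-1)
    out = model.generate(probe, do_sample=False, max_new_tokens=max_answer_tokens,
                         pad_token_id=tok.eos_token_id)
    answer_ids = [i for i in out[0, probe.shape[1]:].tolist() if i != tok.eos_token_id]
    path = path_ids[0].tolist()
    # Naive alignment: last contiguous occurrence of the answer ids inside the path.
    for start in range(len(path) - len(answer_ids), -1, -1):
        if path[start:start + len(answer_ids)] == answer_ids:
            return list(range(start, start + len(answer_ids)))
    return []  # no alignment found; fall back to e.g. the last number in the decoded text
```

The returned positions index into the original path, so the Δ is computed over the original decoding steps, as the text above requires.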

In cases where the answers are more open-ended, utilizing the probability differences of the top two tokens as an indicator of how models prefer one answer over another could be less precise. If the answer options are explicitly defined, like “yes” or “no”, we could modify the calculation of Δ slightly by aggregating the probability mass over “yes” (and equivalent options like “Yes/YES”), then compute the probability differences between the aggregated mass on “yes” and “no”. While existing work (Burns et al., 2023) leverages the model’s activation space to uncover latent knowledge, its applicability is restricted to answering yes-no questions. We hope that future research can address this limitation by delving deeper into the model’s internal representation across a broader, more open-ended answer space.
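
For fixed yes/no answers, one way to implement the aggregation described above is to sum probability mass over a few surface forms at the answer position. The variant strings and the single-token assumption below are illustrative, not prescribed by the paper:

```python
import torch

def yes_no_margin(step_logits, tok, answer_position):
    """Aggregate probability over "yes"-like vs "no"-like first tokens at one decoding step."""
    probs = torch.softmax(step_logits[answer_position], dim=-1)
    # First-token ids of a few surface variants; real tokenizers may split these differently.
    yes_ids = [tok(v, add_special_tokens=False).input_ids[0] for v in (" yes", " Yes", " YES")]
    no_ids = [tok(v, add_special_tokens=False).input_ids[0] for v in (" no", " No", " NO")]
    return probs[yes_ids].sum().item() - probs[no_ids].sum().item()
```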

Branching at other decoding steps. CoT-decoding considers alternative tokens at the first decoding step. This leads to a natural question: is branching viable at later decoding stages? In Figure 2, we present a qualitative analysis of decoding paths, highlighting the impact of alternative token consideration in subsequent decoding steps. It is evident that early branching, e.g., at the first decoding step, significantly enhances the diversity of potential paths. Conversely, later-stage branching is significantly influenced by previously generated tokens. For instance, initiating with the token “5” greatly decreases the likelihood of rectifying an erroneous path. Nonetheless, the optimal branching point may vary with the task; in the year parity task, for instance, mid-path branching can effectively yield correct CoT paths.

Figure 2. We present an analysis of the decoded paths by considering alternative tokens at various decoding steps. Task-dependent challenges arise: at times, the model encounters difficulty recovering from incorrect paths when branching at later tokens. For certain tasks, multiple branching positions may exist, all leading to the correct reasoning path.

Sampling under the standard QA format. CoT-decoding adopts greedy decoding for the entire decoding path except branching on the first token. A natural question arises: can sampling achieve a similar effect and unveil the CoT reasoning paths? We found that, although sampling works well under few-shot CoT prompting (Wang et al., 2023a), it does not exhibit the desired behaviour when the model is queried with the standard QA format. We conduct a study over the first 50 questions of GSM8K and apply temperature sampling (Ackley et al., 1985; Ficler and Goldberg, 2017) with a temperature of 0.7 to sample 10 responses for each question, and found that it is much less sample-efficient compared to our CoT-decoding procedure: less than 30% of the sampled responses contain a correct CoT path. In most cases, the model tends to provide a direct answer because the first token is sampled based on the model’s probability distribution, heavily influenced by the model’s tendency to output a direct answer rather than taking a less-direct route. In addition, the rest of the tokens are sampled, leading to more frequent incorrect final answers. For instance, for our question in Table 1, temperature sampling yields responses like “10 apples”, “5 apples”, “5”, none of which contains the correct CoT paths.
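
For comparison, the sampling baseline described above can be reproduced roughly as follows (again against the transformers API; the parameter values mirror the text, while the helper itself is an assumption):

```python
def sample_paths(model, tok, question, n=10, temperature=0.7, max_new_tokens=128):
    """Sample n continuations under the bare QA prompt instead of branching on the first token."""
    ids = tok(f"Q: {question}\nA:", return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, temperature=temperature,
                         num_return_sequences=n, max_new_tokens=max_new_tokens,
                         pad_token_id=tok.eos_token_id)
    return [tok.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in out]
```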

3. Experiments

We evaluate the CoT-decoding approach across a range of reasoning benchmarks, demonstrating its capability to successfully recover CoT reasoning paths during decoding, all without the need for specialized prompting.

Experiment Setup. For all experiments, the default input to the model is the standard QA format of Q: [question]\nA:, where [question] is filled with the actual question depending on the task, and we ask the model to continue the generation given that prefix. During decoding, we use 𝑘 = 10 as the default for the alternative top-𝑘 tokens at the first decoding position. We show ablation studies with respect to different choices of 𝑘 in Section §3.1.

Models. We investigate our method using both (1) the PaLM-2 pre-trained model family (Anil et al., 2023) at different scales: X-Small, Small, Medium, and Large; and (2) the open-sourced model Mistral-7B (Jiang et al., 2023). Our experiments primarily focus on pre-trained models, but we also include experiments with instruction-tuned models (denoted as “inst-tuned”, or “IT”).

To identify the answer spans, we extract the last numerical value or the available options (e.g., “even” or “odd” for the year parity task) for the Mistral model. For the PaLM-2 model family, we extend the model’s output with the prompt “So the answer is” and align the continuations with the original decoding path as the answer. Please refer to Appendix §C for more details on experiment settings and answer parsing.
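
A minimal version of the extraction heuristic used for the Mistral model (the regular expression and the fallback order are assumptions):

```python
import re

def extract_answer(text, options=()):
    """Return the last matching option (e.g. "even"/"odd") if options are given, else the last number."""
    if options:
        hits = re.findall("|".join(re.escape(o.lower()) for o in options), text.lower())
        if hits:
            return hits[-1]
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None
```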

3.1. Mathematical Reasoning Tasks

We use the following datasets for math reasoning: grade-school math problems (GSM8K; Cobbe et al., 2021a) and the multi-step arithmetic dataset MultiArith (Roy and Roth, 2015). The results over the PaLM-2 models are shown in Table 2, demonstrating that CoT-decoding significantly enhances models’ reasoning ability over the greedy decoding approach, consistently across all model scales. For example, over GSM8K, CoT-decoding achieves +26.7% absolute accuracy compared to greedy decoding over the PaLM-2 Large model. In addition, we observe that CoT-decoding partially closes the gap between the pre-trained model and the instruction-tuned model (e.g., on the large model size), demonstrating that instruction-tuning with sufficient CoT data (Chung et al., 2022) can also be partially achieved by modifying the decoding procedure within pre-trained models.

Notably, we observe that CoT-decoding can further improve the instruction-tuned model. The instruction-tuning procedure (Chung et al., 2022) has already incorporated abundant CoT annotations during the fine-tuning process. Consequently, the model is expected to inherently generate CoT paths when addressing reasoning tasks. However, upon analyzing specific examples, we found that even after instruction-tuning, the model occasionally persists in attempting to directly address a question. In contrast, CoT-decoding can enhance the exploration of alternative paths by triggering a CoT first, consequently leading to more accurate question resolution.

Table 2. Accuracy on math reasoning tasks across the PaLM-2 model family with varying sizes.

Scaling results and choice of 𝑘. In Figure 3, we illustrate how the choice of 𝑘, representing the number of top alternative tokens considered, influences the overall accuracy. Overall we found that higher values of 𝑘 typically result in improved model performance, suggesting that in many cases, the correct CoT paths may indeed exist but are often ranked lower during the model’s decoding. For instruction-tuned models, the effect of 𝑘 is less significant, indicating that the process of instruction-tuning effectively brings forth the majority of CoT paths to the first few decoding paths.

Figure 3. Accuracy on the GSM8K dataset for the PaLM-2 model families, with respect to how many top-𝑘 tokens in decoding are used.

3.2. Natural Language Reasoning Tasks

We investigate the “year parity” task, which recent literature finds large language models still struggle with. The task is to query the model with “Was [person] born in an even or odd year?”, where “[person]” is filled by a random celebrity name. Existing work (Allen-Zhu and Li, 2023; Berglund et al., 2023) shows that even SoTA models like GPT-4 struggle with such tasks, achieving at-chance accuracy (∼50%) when prompted directly. Allen-Zhu and Li (2023) also show that SoTA LLMs achieve almost perfect accuracy in retrieving the year or judging the parity given the correct year, hence the limitation mostly lies in the model’s ability in knowledge manipulation. In this section, we show that CoT-decoding can effectively elicit the correct CoT reasoning from LLMs to solve this task.

Task setup. We curated a list of the top 100 celebrity names from (Berglund et al., 2023).2 We manually extracted and verified their birth years through web searches to establish the ground truth. We evaluate models’ responses against the ground truth (“even” or “odd”) to compute the final accuracy for this task.
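
Scoring this task reduces to a parity check against the verified birth year; a tiny sketch with illustrative helper names:

```python
def parity_label(birth_year):
    """Ground-truth label for the year parity task."""
    return "even" if birth_year % 2 == 0 else "odd"

def parity_accuracy(predictions, birth_years):
    """predictions: model answers ("even"/"odd"); birth_years: verified ground-truth years."""
    correct = sum(p == parity_label(y) for p, y in zip(predictions, birth_years))
    return correct / len(birth_years)
```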

The results on PaLM-2 are shown in Table 3. Notably, when the language model is directly prompted with the question, it demonstrates near chance-level accuracy (57%, even for the largest model). However, when equipped with CoT-decoding, the model can recover the CoT paths in most cases and achieve an accuracy over 90%. Further error analysis shows that most of the errors stem from the model retrieving an incorrect birth year, while the generated CoT paths remain highly consistent between the parity and the model’s retrieved year. Note that for this task, when the model size is smaller, the model becomes incapable of determining the parity even given the correct year. Consequently, the performance does not vary significantly for model sizes equal to or below the “Small” scale.

Table 3. Accuracy of the year parity task on PaLM-2 pre-trained models with varying sizes.

3.3. Symbolic Reasoning Tasks

We consider the following symbolic reasoning tasks: (1) the Coin Flip task from (Wei et al., 2022), with 2, 3, or 4 rounds of potential flips; and two tasks from Big-Bench-Hard (bench authors, 2023; Suzgun et al., 2022): (2) Web of Lies, with 3, 4, or 5 truth/lie statements, and (3) Multi-step Arithmetic with varying depth level 𝑑 and length 𝑙. These tasks, designed with rules through human intervention, allow us to generate task data with diverse difficulty levels, enabling a thorough assessment of the model’s problem-solving capabilities. For each task, we produce 100 examples for each difficulty level, except for Web of Lies (5), for which we use the existing dataset from (Suzgun et al., 2022). For Multi-step Arithmetic we directly query the model with the original input (e.g., “3+5-6=”) without the QA format. We also include two natural-language-based but synthetic tasks from Big-Bench, Sports Understanding and Object Counting, to probe the model’s intrinsic abilities in solving synthetic tasks.

The presence of correct CoT paths depends on the task's prominence in the pre-training distribution. The results are shown in Table 4. We see that the gains of CoT-decoding become smaller when the task complexity increases.

2 https://github.com/lukasberglund/reversal_curse/blob/main/data/celebrity_relations/top_celebrities.txt

Table 4. Accuracy on symbolic reasoning tasks and additional Big-Bench tasks, on the PaLM-2 pre-trained Large model.

Additionally, we observe that models cannot generate accurate CoT paths when the task is highly synthetic, i.e., when it lacks significant representation in the pre-training distribution. This mirrors the finding in (McCoy et al., 2023), where the authors show language models are highly influenced by the distribution they have been trained on. We also found that in these tasks, CoT-prompting-based techniques play a larger “teaching” role in helping the model learn how to solve such tasks. Examples of such tasks include:

• Tasks that require accurate state tracking, e.g., Coin-Flip and Web-of-Lies. We observe that the model can generate CoT paths that simulate the process step-by-step, but it can easily lose track of the states, especially when the task becomes more complex (e.g., Coin-Flip with >= 3 coins, and Web-of-Lies with >= 4 statements). This reveals the model’s intrinsic vulnerability in performing accurate state tracking. The hand-crafted few-shot CoTs in Suzgun et al. (2022), on the other hand, teach the model to perform explicit state tracking in each step to better help the model solve this task.

• Multi-step Arithmetic: We observe that the model tends to perform calculations from left to right in the CoT-decoding paths. Correspondingly, Suzgun et al. (2022) crafted few-shot CoTs that explicitly instruct the model on the correct order of operations during few-shot demonstrations.

• Object counting: During CoT-decoding the model exhibits a tendency to conduct straightforward addition to all mentioned objects. Conversely, the few-shot CoT used in (Suzgun et al., 2022) teaches the model to exclude the objects that do not fit the question before performing the counting.

Figure 4. Accuracy of CoT-decoding (by taking the max path and the aggregated path) on the GSM8K dataset for the PaLM-2 Large model, with respect to how many top-𝑘 tokens in decoding are used. We also compare with the results from few-shot CoT prompting and zero-shot prompting.

Compared to CoT prompting. In Figure 4, we compare CoT-decoding with existing CoT prompting methods, e.g., few-shot CoT prompting (Wei et al., 2022) and zero-shot CoT prompting (Kojima et al., 2022). First, the aggregated-path approach significantly improves the accuracy compared to taking the maximum path only, showing that it can indeed stabilize the results by mitigating the sensitivity to small differences in the model’s logits. Second, the aggregated path results in a performance similar to few-shot CoT prompting, indicating that the model possesses the intrinsic ability to solve this task effectively. The results suggest that few-shot CoT prompting may serve the purpose of surfacing the model’s intrinsic CoT paths to be closer to the top-1 path.
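
One plausible reading of the aggregated-path variant compared in Figure 4 is to sum the Δ values of all decoded paths that reach the same answer and return the answer with the largest total, while the "max path" takes the single most confident path. The sketch below follows that reading under stated assumptions; it is not the authors' code:

```python
from collections import defaultdict

def pick_answer(scored_paths, aggregate=True):
    """scored_paths: list of (answer, delta) pairs, one per decoded path."""
    if not aggregate:                       # "max path": answer of the most confident single path
        return max(scored_paths, key=lambda p: p[1])[0]
    totals = defaultdict(float)             # "aggregated path": sum the deltas per distinct answer
    for answer, delta in scored_paths:
        totals[answer] += delta
    return max(totals, key=totals.get)
```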

Table 5. Example of generated CoTs using different approaches.

In Table 5, we present qualitative examples illustrating the distinctions in the generated CoTs for each method. Overall we observe that CoT-decoding exhibits a more “free-form” CoT generation in comparison to alternative CoT prompting methods. This divergence may be attributed to two factors: (1) we encourage the diversity at the initial decoding step, and (2) the absence of explicit constraints imposed by prompting.

Another noteworthy observation is that CoT-decoding can better reveal LLMs’ intrinsic strategy for solving a problem, without being influenced by external prompts that could be biased by the prompt designers. Taking the last example in Table 5, we see that the few-shot CoT path is heavily influenced by the few-shot prompts. Specifically, the few-shot prompts, sourced from (Suzgun et al., 2022), consistently follow a standard analytical approach – first assessing the person’s profession, followed by an evaluation of whether the profession aligns with the action. This aligns with the standard method of solving this particular task.3 In contrast, CoT-decoding reveals paths that deviate from the conventional problem-solving approach. Despite yielding an incorrect final answer according to the ground truth in some cases, the CoT paths remain valid.

3 https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/sports_understanding

Table 6. Example of the top-𝑘 paths from the Mistral-7B pre-trained model showing a similar behaviour, where CoT paths again exist but are ranked lower during decoding.

3.4. Results across Model Families

We also conduct experiments on other model families, specifically, the open-sourced Mistral-7B model (Jiang et al., 2023). We evaluate both the pre-trained model (“Mistral-7B-v0.1”) and the instruction-tuned variant (“Mistral-7B-Instruct-v0.1”). Table 6 provides an example where the Mistral-7B model attempts to directly solve the question with greedy decoding. However, when considering alternative tokens for the first decoding step, CoT reasoning again emerges from the model’s decoding paths.

The results are shown in Table 7, demonstrating consistent improvements across model families. CoT-decoding significantly improves over greedy decoding without specialized prompting, encompassing tasks such as math reasoning (GSM8K and MultiArith) and natural language reasoning (year parity).

Table 7. Reasoning performance on Mistral-7B pre-trained and instruction-tuned model.

4. Related Work

Chain-of-thought reasoning in large language models. In recent literature, many works have sought to enhance the reasoning abilities of large language models. These works predominantly involve proposing better prompting techniques to better elicit CoT reasoning paths from the model (Kojima et al., 2022; Nye et al., 2021; Wei et al., 2022; Yao et al., 2023; Yasunaga et al., 2023; Zhou et al., 2023a). Despite achieving high performance, few-shot prompting techniques are often task-specific, requiring prompt designs tailored to each task. This limits their generalizability across tasks. Advanced prompting techniques often require manually intensive prompt engineering, and their effectiveness varies depending on the choice of prompts, resulting in inconsistent performance outcomes (Wang et al., 2022; Ye and Durrett, 2022; Zhou et al., 2023b). Efforts to discover improved prompts (Yang et al., 2024; Zhou et al., 2023b) further entail model-specific and task-specific tuning.

In addition, these prompting techniques can subtly alter the vocabulary’s posterior distribution in ways that remain largely elusive (Min et al., 2022; Webson and Pavlick, 2022). Specifically, prompts may assist in task decomposition, induce the model to generate additional tokens, or directly “teach” the model the exact underlying procedure to solve particular problems via manually crafted few-shot demonstrations. Dissecting the distinct influence of each aspect, however, presents a significant challenge. In contrast, our work explores a different perspective within the decoding stage, demonstrating that, even without explicit prompting, the model inherently holds the capability to generate chain-of-thought reasoning paths across a wide set of tasks.

Several recent works propose to improve the CoT generation process via better controlling and verifying the steps generated, e.g., step-by-step verification (Lightman et al., 2023), process-based feedback (Uesato et al., 2022), self-evaluation guided beam search (Xie et al., 2023), and PathFinder (Golovneva et al., 2023). Note all these works still require CoT prompting in order to generate the CoT reasoning paths, while our work completely removes CoT prompting. In addition, these existing works focus on searching and verifying the “steps” produced by the language model, while our work purely searches in the decoding space on the token-level and utilizes the confidence scores when decoding the answer.

Additionally, recent works aim to better understand how chain-of-thought emerges in language models (Feng et al., 2023; Li et al., 2023b; Prystawski et al., 2023). McCoy et al. (2023); Razeghi et al. (2022) demonstrate a similar phenomenon where the pretraining distribution heavily influences the model’s performance in few-shot reasoning.

Instruction-tuning to elicit CoTs in language models. When supervision is allowed, techniques such as instruction-tuning or distillation offer another way to elicit reasoning paths from language models without explicit prompting (Chung et al., 2022; Huang et al., 2023; Magister et al., 2023). However, these approaches typically involve resource-intensive fine-tuning over large language models and require a large set of examples annotated with CoTs, which may not be readily available.

Liu et al. (2024) show that a large language model can be tuned by a proxy using the logits differences between a pair of tuned and untuned small models, and achieves improved performance over some reasoning benchmarks as well. Liu et al. (2024) require a few additional models, and implicitly assume that the tuned model is well-optimized, e.g., on reasoning benchmarks the model needs to be tuned with CoT paths to enable contrasting logits with respect to the base untuned model. In contrast, our approach is entirely unsupervised and examines a model’s intrinsic ability in generating CoT paths, without resorting to fine-tuning or any additional models.

Decoding algorithms for language models. The predominant focus in existing literature on decoding for language models revolves around aspects such as fluency, coherence, reduction of repetitiveness, and diversity in responses. Popular decoding algorithms used for language models include greedy decoding, temperature sampling (Ackley et al., 1985; Ficler and Goldberg, 2017), top-𝑘 sampling (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019), and nucleus sampling (Holtzman et al., 2020). Additionally, there exist refined algorithms such as minimum Bayes risk decoding (Eikema and Aziz, 2020), and typical decoding (Meister et al., 2022). Diverse beam search (Vijayakumar et al., 2018) is another way to explore alternative paths in a model’s generation. However, it emphasizes generation diversity rather than accuracy.

There is relatively little research dedicated to enhancing decoding algorithms specifically for reasoning tasks. Wang et al. (2023a) improve upon CoT prompting by sampling and aggregating over multiple generated responses to improve reasoning. Contrastive decoding (Li et al., 2023a) is another way to improve a model’s generation quality by penalizing the logits from smaller models, and recent work (O’Brien and Lewis, 2023) shows that contrastive decoding can contribute to enhancing reasoning performance. Shi et al. (2023) propose context-aware decoding to improve the faithfulness of language models. These approaches typically require additional information, such as employing additional models to generate contrasting logits or incorporating additional contexts. In contrast, our work relies solely on a single model without the need for supplementary knowledge.

Decoding algorithms for efficiency. In addition to decoding algorithms for improving quality, there is a substantial body of research dedicated to improving decoding efficiency, e.g., speculative decoding (Chen et al., 2023a; Leviathan et al., 2022; Zhou et al., 2024). This line of work is orthogonal to our work as their primary focus is not on improving a model’s reasoning performance. However, these techniques could potentially be leveraged to improve the efficiency of CoT-decoding.

5. Conclusion and Discussion

We investigate the inherent capabilities of large language models in generating CoT reasoning paths during decoding, abstaining from any specialized prompting. Our findings indicate that, contrary to the prevalent practice of exclusively employing greedy decoding, exploring alternative top-𝑘 tokens in the decoding space reveals the natural existence of reasoning paths within these models. Furthermore, our empirical observations highlight that the presence of a CoT reasoning path correlates with increased model confidence in decoding its final answer. Based on this observation, we introduce CoT-decoding to extract more reliable decoding paths from language models, thereby enhancing overall reasoning performance.

The exploration of alternative decoding paths incurs additional computational costs. Future work may leverage the CoT-decoding paths to fine-tune the model to enhance its reasoning capabilities. In addition, our current exploration focuses on branching at the first token because it yields a high diversity in the decoding paths, but for future work one can explore branching at any token and searching for the best possible paths during the decoding phase. The computational cost will be substantially higher though, and how to reliably identify the best token during the search will be an interesting direction to explore.
