
Logic CoT

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-09-27

Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic

  • url: https://arxiv.org/abs/2309.13339
  • pdf: https://arxiv.org/pdf/2309.13339
  • abstract: Recent advancements in large language models have showcased their remarkable generalizability across various domains. However, their reasoning abilities still have significant room for improvement, especially when confronted with scenarios requiring multi-step reasoning. Although large language models possess extensive knowledge, their behavior, particularly in terms of reasoning, often fails to effectively utilize this knowledge to establish a coherent thinking paradigm. Generative language models sometimes show hallucinations as their reasoning procedures are unconstrained by logical principles. Aiming to improve the zero-shot chain-of-thought reasoning ability of large language models, we propose Logical Chain-of-Thought (LogiCoT), a neurosymbolic framework that leverages principles from symbolic logic to verify and revise the reasoning processes accordingly. Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of the enhanced reasoning paradigm by logic.
  • keywords: Large Language Models, Reasoning, Chain-of-Thought, Logic


TL;DR


  • Existing large language models have limited logical inference ability and can generate incorrect information.
  • This paper proposes LogiCoT, a logical inference procedure that detects and corrects errors in a language model's reasoning.
  • The effectiveness of LogiCoT is validated on a variety of datasets and benchmarks.

1. Introduction

Large language models (LLMs) show improved capabilities when performing tasks that require general or specialized knowledge. This success has been confirmed in many fields beyond language processing. However, one major unresolved problem with generative LLMs is their tendency to produce incorrect information in a confident style. To address this, the paper proposes the Logical Chain-of-Thought (LogiCoT) method, which guides an LLM to reason and verify step by step. Based on the principle of reductio ad absurdum, it revises the reasoning chain when necessary to guarantee sound inference.


2. Related Work

To improve the quality of LLM outputs, the importance of well-designed prompts is emphasized. Examples of this approach include the Least-to-Most technique, which splits a problem into parts and solves them in turn, and Chain-of-Thought (CoT) prompting. CoT decomposes a complex problem into several simpler parts so that the model can carry out the inference more easily. Such methods can be further strengthened by a step-wise verification mechanism that guides the model to detect and correct its own errors automatically.


3. Method

3.1 Reductio ad Absurdum

In logic, reductio ad absurdum proves that a claim is true by deriving a contradiction from its negation. For example, to prove that a proposition $P$ is true, we assume its negation $\neg P$ and show that a contradiction follows, thereby establishing that $P$ is true. The paper applies this logical approach to LLMs, verifying and correcting errors that may arise at each inference step.
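As a compact illustration, the contraposition law used later in the paper can itself be proved by reductio ad absurdum; the derivation below is our own sketch and is not quoted from the paper's Appendix A.

```latex
% Goal: from P -> Q, derive the contrapositive ~Q -> ~P by reductio ad absurdum.
\begin{align*}
&\text{Assume } P \to Q \text{ and, for contradiction, } \neg(\neg Q \to \neg P).\\
&\neg(\neg Q \to \neg P) \;\equiv\; \neg Q \wedge P      && \text{negation of an implication}\\
&P,\; P \to Q \;\vdash\; Q                                && \text{modus ponens}\\
&Q \wedge \neg Q \;\vdash\; \bot                          && \text{contradiction}\\
&\therefore\; (P \to Q) \;\vdash\; (\neg Q \to \neg P)    && \text{reductio ad absurdum}
\end{align*}
```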

3.2 Logical Chain-of-Thought (LogiCoT)

LogiCoT provides a structure in which an LLM thinks through a given problem one logical step at a time, reviews each step for possible errors, and revises the reasoning process as needed. This allows the LLM to identify contradictions that may arise during inference and, based on them, reach more accurate conclusions.


4. Experiments

4.1 Experimental Setup

The effectiveness of LogiCoT is validated using language models of various sizes and datasets on diverse topics. For example, on the mathematical reasoning datasets GSM8K and AQuA, LogiCoT shows substantial performance gains over the standard CoT approach. Its effectiveness is also demonstrated on commonsense reasoning tasks such as date understanding.

4.2 Results

Applying LogiCoT improves the models' reasoning ability and yields higher accuracy in detecting and correcting errors. This suggests that LogiCoT goes beyond merely guiding the reasoning process and can improve the quality of the reasoning itself.


1 Introduction

Large language models (LLMs) are expected to be omniscient because of their extraordinary ability to deal with tasks requiring knowledge of common sense or even specialized field knowledge. The success has been established in numerous fields extending beyond the realm of language processing (Bubeck et al., 2023; Yao et al., 2023b; Ahn et al., 2022; Zhao et al., 2023). However, one major problem residing in generative LLMs yet to be solved is their tendency to hallucinate wrong statements in a confident style (Bang et al., 2023). A quick example can be found by asking a non-internet-based LLM about very recent news – it will too easily make up facts without hesitation. An educated human with expertise in logical reasoning can systematically examine words before coming to a conclusion. Unlike logical reasoning by humans, the logical incompetence of the deduction by LLMs makes their decisions untrustworthy. LLMs may have all the logical concepts and tricks available but fail to actively utilize them in an organized manner, which brings the demand for expert guidance. Principles in logic well adapted by humans can also benefit the reasoning ability of language models.

Take a simple logic question as an example: “If Tom plays football outside, then John will also join to play; if John plays football, then Mary won’t go outside. Known Mary is outside. Is Tom playing football?” Nine out of ten answers from ChatGPT¹ will conclude that “we cannot conclude whether Tom is playing football or not”. However, with the help of the knowledge in logic provided to ChatGPT that the contrapositive holds the exact same truth value as the original proposition, we may put it another way and prompt ChatGPT to “use contrapositive”. Then it deduces correctly: “ … Using the contrapositive of the first statement, if John does not join to play (which we have deduced), then it implies that Tom does not play football outside. Therefore, based on the given information and the contrapositives, it can be deduced that Tom is not playing football.” There is no newly introduced knowledge but a prompt of using contrapositive, a special variational expression of the original premise. While the concepts of logic are not new to a large language model, the model initially struggles to incorporate them. Compared to randomly sampling for diverse statements, the one derived from logical equivalence works effectively as it could be expressed quite differently in natural language and may result in a totally different deduction.

Motivated by the reasoning process in logic, we propose Logical Chain-of-Thought (LogiCoT) to further expand the zero-shot reasoning ability of LLMs, which not only lets the LLM think step by step but also verify, step by step, according to the guidance via the principle of Reductio ad Absurdum, and revise the reasoning chain if necessary to guarantee a sound inference (see Fig. 1 for an overview).

¹ https://openai.com/blog/chatgpt
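To make the example concrete, the deduction ChatGPT is prompted to perform can be formalized as follows (our own formalization, not reproduced from the paper):

```latex
% T = "Tom plays football outside", J = "John plays football", M = "Mary goes outside".
\begin{align*}
&\text{Premises: } T \to J, \qquad J \to \neg M, \qquad M\\
&M \to \neg J                                   && \text{contrapositive of } J \to \neg M\\
&M,\; M \to \neg J \;\vdash\; \neg J            && \text{modus ponens}\\
&\neg J \to \neg T                              && \text{contrapositive of } T \to J\\
&\neg J,\; \neg J \to \neg T \;\vdash\; \neg T  && \text{modus ponens}\\
&\therefore\; \neg T \quad \text{(Tom is not playing football)}
\end{align*}
```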

In order to unleash the power of a pre-trained generative language model, the quality of the prompts to interact plays an important role. Summarizing known works, the reasoning procedure benefits from a prompt that guides it to possess the following properties:

  • Relevance The generative model can be easily distracted by irrelevant words in the prompt. A pre-selection of context helps the correctness of reasoning (Creswell et al., 2022; Creswell and Shanahan, 2022; Ling et al., 2023).
  • Decomposition An automatic decomposition of a tough question improves the reasoning reliability, which has been evidenced by the success of Least-to-Most (Zhou et al., 2023), Zero-shot-CoT (Kojima et al., 2022) and many prompting techniques (Yao et al., 2023a; Kojima et al., 2022; Wei et al., 2022).
  • Verification/Grounding External functions, e.g. a third-party calculator for mathematical problems (Schick et al., 2023), external information acquisition from Wikipedia (Yao et al., 2023b), or an affordance evaluation function in robotics (Ahn et al., 2022), can ground the generation to be meaningful. This verification can be triggered under a specified condition or be applied to the reasoning process (Lightman et al., 2023; Ling et al., 2023; Li et al., 2023).
  • Diversity The collective intelligence from a set of reasoning paths (typically, sampling N times) helps produce a final answer that is consistent among these variants. Despite the surging N -times cost, this ensemble approach has been widely adopted to combine with other techniques for higher accuracy (Li et al., 2023; Ling et al., 2023; Yao et al., 2023a; Zheng et al., 2023).
  • Revision Revision (or refinement) can be regarded as a special diversity but is conditioned on the previous generation as hints. It re-examines the words with an extra focus on the quality in terms of, for example, validity and conciseness (Madaan et al., 2023; Zheng et al., 2023; Welleck et al., 2022).

2 Related Work

Chain-of-Thought Prompting. Prior works show that LLMs have the corresponding power for complex tasks but require a proper strategy to unleash, e.g. human-in-the-loop alignment tuning (Ouyang et al., 2022) and Chain-of-Thought prompting (CoT) (Wei et al., 2022). In order to generate a chain of thoughts that decomposes the original problem into several small parts which a language model can easily handle, CoT creates few-shot exemplars of a detailed reasoning path to let the model follow. Least-to-Most (Zhou et al., 2023) explicitly prompts the LLM to divide complex questions into sub-problems and conquer them one by one. Moreover, zero-shot-CoT (Kojima et al., 2022) showcases the impressive effectiveness of simply attaching the sentence “Let’s think step by step.” before any zero-shot reasoning trace starts. We build our approach under a zero-shot setting and integrate zero-shot-CoT as a baseline to improve upon. While existing CoT-based methods focus on encouraging the reasoning steps to be concrete but lack supervision of their faithfulness, we propose a step-by-step verification mechanism. A very recent study by Ling et al. (2023) also addresses this credential concern with double-checking and has reached a positive improvement, which also emphasizes the benefit of per-step verification. However, while their work anticipates autonomous detection of errors by just prompting “Double-check the reasoning process…”, our work is motivated from a logical perspective and empowers the language model to argue different possibilities. Moreover, our method not only suggests verification but also introduces revision of the suspected reasoning steps. When posed with a question, the careful selection of relevant facts to ingest is equally critical in preventing the language model from becoming distracted or potentially misinformed, which might result in hallucinations. This consideration of relevance is even more crucial when the context becomes very long. Previous works typically resort to a language model to evaluate the relevance of facts and infer with the ones contributing to an intermediate reasoning step (Creswell et al., 2022; Ling et al., 2023). Our verification of each reasoning step is conducted by prompting a language model to find relevant premises to deduct from.

Variational Reasoning. A single reasoning trace may be biased. In order to produce a set of reasoning candidates, previous works resort to generating samples several times (Wang et al., 2023) with the same prompt, or create diverse prompts in the beginning for variants (Li et al., 2023). However, this approach is costly and inefficient since many of the reasoning steps are non-controversial and so do not require duplicates. Our method avoids unnecessary checking and only revises reasoning steps deemed implausible, resulting in a reasoning chain growing only when required. It costs more than generating one chain because of the verification and possible revision but is more efficient than a naive ensemble. Besides, LogiCoT can be combined with an ensemble approach to produce a set of verified chains, further increasing the confidence for later majority voting required by the ensemble. Compared to the ensemble-based method that independently samples diverse variants, revision produces a special diversity: it is an iterative generating process conditioned on previous content. Many previous works actually benefit from this manner though it is not explicitly mentioned. For example, Progressive-Hint Prompting (Zheng et al., 2023) generates consistent answers by progressively guiding the LLM with hints of accumulated possible answers. It repeats generation again and again until the answer is deemed consistent with the previous one. Other works generate content conditioned not only on the previous content but also on extra feedback (Madaan et al., 2023). To obtain a revision with high quality, this guiding feedback should be specific and actionable. Our work takes advantage of this property and revises those reasoning steps that fail to pass the verification, during which the post hoc explanation (Jung et al., 2022) will act as a constructive revision suggestion.

Neurosymbolic AI. Neurosymbolic AI combines neural networks with symbolic representations and reasoning techniques. Its success stems from the ability to leverage symbolic (structured) knowledge to enhance learning or reasoning (Sarker et al., 2021; d’Avila Garcez and Lamb, 2020; Nye et al., 2021). Unlike the end-to-end black-box framework, it is more interpretable and explainable because of the transparency of the symbolic framework. There exist works that adopt concepts from symbolic logic (Agler, 2012) to establish a reliable reasoning path (Creswell et al., 2022; Jung et al., 2022). To solve binary question-answering problems, Jung et al. (2022) propose to generate a post hoc explanation graph for a statement and compute the relative relations to formulate a symbolic logic expression. The truth of the statement is thereby assigned by solving the satisfiability problem of this symbolic expression. The LogiCoT framework employs a neurosymbolic methodology, leveraging logical rules and post hoc arguments to enhance error detection.

3 Methodology

As demonstrated in the contraposition example presented in the introduction, when known logical rules are applied to achieve an identical transformation (equivalent in logic but markedly distinct in natural language expression), it affords the LLMs the chance to engage in reasoning from an alternative perspective. A challenge is that the language model has to identify the inherent logical structures first to know whether certain prior knowledge can be effectively applied. Moreover, transforming everything from the real world into a symbolic expression is unrealistic. The applicable scenario is limited because questions in many reasoning fields beyond logic, e.g. mathematics problem solving, can hardly be expressed in symbolic logic. Nevertheless, there is promise in incorporating concepts from logic that contribute to the process of argument proof in order to construct a neurosymbolic framework (d’Avila Garcez and Lamb, 2020; Creswell et al., 2022) that facilitates a causal reasoning trace, i.e. the premises and leading thoughts entail the thoughts behind. Continuing with the success of “let the model talk”, e.g. “let’s think step by step” in zero-shot-CoT (Kojima et al., 2022), we further propose to guide the conversation with logic for more systematic exploration instead of counting on its recklessness.

3.1 Reductio ad Absurdum

When given an argument generated by an LLM, it is difficult for the language model to recognize errors (i.e. to prove falseness) through free double-checking by itself. This is also the case in the field of logic. Many propositions pose challenges when it comes to direct deductive reasoning. One commonly employed technique to establish a claim is known as reductio ad absurdum (reduction to absurdity), which involves an initial assumption and the consequent derivation of absurdity or contradiction. Let $P$ and $Q$ denote two propositions. The relation between a premise and its conclusion can be expressed as $P \vdash Q$. Here “$\vdash$” is a syntactic turnstile which means $Q$ is a syntactic consequence of $P$ (Agler, 2012), i.e. there exists a proof that claims the conclusion $Q$ given the premise $P$. In order to prove $Q$ by means of reductio ad absurdum, let us assume its negation $\neg Q$ is valid and then check the contradiction² of the conjunctive proposition

$$P \wedge \neg Q, \qquad (1)$$

where “$\wedge$” is a binary conjunction operator, meaning the truth of the conjunction requires the truth of both sides. Upon the contradiction of the co-existence of $P$ and $\neg Q$, $P \vdash Q$ is thus proved true, and then we can claim the validation of the conclusion $Q$ given the premise $P$. Many logic principles, e.g. the contraposition mentioned in the introduction section (see Appendix A for a proof), can be derived by deductions following this rule. This thinking paradigm helps humans check arguments carefully before composing a conclusion. As we will demonstrate later, the reasoning ability of LLMs can also be improved by benefiting from this paradigm.

3.2 Logical Chain-of-Thought (LogiCoT)

There is a lot of evidence confirming that a series of coherent explanations helps an LLM to unleash its reasoning power (Wei et al., 2022; Kojima et al., 2022; Zhou et al., 2023), while discouraging its utterance, e.g. with prompts like “just tell me the result without any explanation”, catastrophically hinders the reveal of wisdom. So we continue with this success of having an explicit reasoning process.

A typical $N$-step reasoning trace can be expressed as $\{P, T_1, \cdots, T_N\}$, where $P$ is the known premise and $T_i$ is the $i$-th step of thoughts. Usually, $T_N$ concludes the thoughts and answers the specified question. Unfortunately, LLMs hallucinate. LLMs usually generate content autoregressively, which means the generation of $T_i$ is based on the former content $\{P, \cdots, T_{i-1}\}$. Errors in $T_i$ will propagate and gradually influence $T_{i'}$ for increasing $i' > i$, making the successive deductions and ultimately the final conclusion untrustworthy. Therefore, we propose a verification loop to double-check each reasoning step. Following Eq. 1, this double-check procedure unrolls by checking, for each step, the contradiction of the conjunctive proposition formed by the preceding content $P, \cdots, T_{i-1}$ and the negation $\neg T_i$, i.e. verifying the validity of $P, \cdots, T_{i-1} \vdash T_i$.

Figure 2: A diagram demonstrating the think-verify-revision loop of LogiCoT. The two zoom-in boxes exhibit the processes when a thought passes (top left) and fails (bottom) the verification, respectively. A thought passing the verification is kept in the reasoning trace, while a thought failing the verification is revised and a new chain of thought is generated based on the revision. The meaning of the symbols in this figure is introduced in Sec. 3.2 and Sec. 3.3.

3.3 Chain Growth

A thought that fails the verification is revised, and the subsequent chain of thought is regenerated from the revision (see Fig. 3). Note that this chain grows only when required. See Alg. 1 and Alg. 2 in Appendix B for the pseudo-code of the function to compute the reasoning trace of LogiCoT.

³ A post hoc explanation is an explanation completed by the LLM with a prompt like “$T_i$ is true because” or “$T_i$ is false because”. An LLM is then often biased by the prompt and, as a result, generates an explanation consistent with the prompt. Because of this “compulsory” behavior, once a statement is deemed false in the leading prompt, the LLM tries hard to dig out errors even if they are less obvious. The adopting approach is considered to benefit from this compulsory error-finding behavior.
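Alg. 1 and Alg. 2 live in the paper's appendix and are not reproduced in this post; the snippet below is only a rough sketch of the think-verify-revise loop as described above, assuming a hypothetical `llm()` helper that maps a prompt string to a completion.

```python
from typing import Callable, List

def logicot_trace(question: str, llm: Callable[[str], str], max_steps: int = 10) -> List[str]:
    """Sketch of a LogiCoT-style think-verify-revise loop (our interpretation, not the paper's code)."""
    trace: List[str] = []  # growing chain of thoughts T_1, ..., T_i
    for _ in range(max_steps):
        context = question + "\n" + "\n".join(trace)
        # Think: generate the next reasoning step, zero-shot-CoT style.
        step = llm(context + "\nLet's think step by step. Next step:")
        # Verify: produce two opposing post hoc explanations and let the model adopt one,
        # echoing reductio ad absurdum (assume the step is false and look for a contradiction).
        pro = llm(context + "\n" + step + "\nThis step is true because")
        con = llm(context + "\n" + step + "\nThis step is false because")
        verdict = llm(
            context + "\nStep: " + step
            + "\nReview A (supports the step): " + pro
            + "\nReview B (objects to the step): " + con
            + "\nWhich review is more convincing? Answer 'A' or 'B'."
        )
        if verdict.strip().upper().startswith("B"):
            # Revise: regenerate the step conditioned on the objection as actionable feedback.
            step = llm(context + "\nThe previous step was flawed because " + con + "\nRevised step:")
        trace.append(step)
        if "final answer" in step.lower():  # crude stopping criterion for the sketch
            break
    return trace
```

In the paper's version, a failed verification also triggers regeneration of the subsequent chain from the revised thought (cf. the Figure 2 caption), which this sketch omits for brevity.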

4 Experiments

Since we regard our work as an enhancement of the chain produced by zero-shot-CoT (Kojima et al., 2022), which requires no exemplars, we compare LogiCoT against it as the baseline to demonstrate the benefit of step-wise verification and revision for zero-shot reasoning. We carry out the experiments in a zero-shot setting for the following considerations: 1) zero-shot-CoT has wide task-agnostic application potential, while few-shot prompting requires domain knowledge; 2) few-shot prompts heavily influence performance even on the same dataset, making a fair evaluation hard as the prompt varies. Drawing direct comparisons with other prompting works in the literature is challenging due to variations in task settings and backend language models. Many of these works are specifically under a few-shot setting, which would require additional modifications to adapt them for zero-shot reasoning. However, we acknowledge this as an area for future investigation. We evaluate the accuracy of tasks in various domains as the overall performance measure and also report the worsening and improvement impact of the logical revision on the original reasoning chain.

4.1 Experimental Setup

Dataset. We demonstrate the effectiveness of LogiCoT on diverse language topics: (1) Math reasoning tasks GSM8K (Cobbe et al., 2021) and AQuA (Ling et al., 2017). The GSM8K dataset contains grade school mathematics questions that should be answered numerically; AQuA has more advanced questions but provides several answer options to choose from. (2) Commonsense reasoning tasks DateUnderstanding and OddOneOut (Srivastava et al., 2023). The DateUnderstanding task necessitates the utilization of both common sense and fundamental arithmetic to find the correct date, making it sufficiently challenging to prevent it from being solvable through simple one-step reasoning. The OddOneOut task requires common sense to deduce the unusual object in the context. (3) Causal inference tasks CauseEffect and ShuffledObjects (Srivastava et al., 2023), where both tasks require reasoning from the context for a correct deduction. (4) Symbolic reasoning task LastLetter (Srivastava et al., 2023). In this task, the language model has to extract the last letter of each given candidate and concatenate them in order, which is simple for humans but challenging for language models because of tokenization (Mielke et al., 2021). (5) Social interaction reasoning task SocialQA (Srivastava et al., 2023), which measures the model's emotional and social intelligence in daily human activities. To get a formatted answer that can be directly compared with the ground truth in the aforementioned datasets, a final prompt asking for the final answer is attached after the reasoning trace, e.g. for the GSM8K dataset we simply attach “Therefore, the final numerical answer is:” at the end. For robustness, this answer is matched with a regular expression before comparing it with the ground truth.

Backend LLMs. To evaluate the effectiveness of LogiCoT on language models with different capabilities, we experiment on Vicuna-7b, Vicuna-13b, Vicuna-33b, GPT-3.5-turbo, and GPT-4. This choice of models is motivated by the fact that the CoT technique exhibits distinguishing performance only when the model scale is substantial (Wei et al., 2022; Kojima et al., 2022). The temperature parameter is set to 0.1 to maintain stable results while encouraging error-finding from the model's own statements. The max_token parameter is set to 2048, which is enough for the question-answer datasets.
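The exact regular expression is not given in this excerpt; the following is only an illustrative sketch of the kind of numeric-answer extraction described above (the function name and pattern are our own).

```python
import re
from typing import Optional

def extract_numeric_answer(generation: str) -> Optional[str]:
    """Pull the last number out of a response such as
    'Therefore, the final numerical answer is: 1,250.' (illustrative only)."""
    # Allow an optional sign, thousands separators, and decimals, e.g. -1,234.5
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", generation)
    if not matches:
        return None
    return matches[-1].replace(",", "")  # normalize before comparison with ground truth

# extract_numeric_answer("Therefore, the final numerical answer is: 1,250.")  ->  "1250"
```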

4.2 Does LogiCoT enhance the zero-shot reasoning ability?

To answer the first question, we conduct zero-shot experiments with datasets covering more diverse topics and with language models of different sizes. The LogiCoT-enhanced performance compared with the zero-shot baseline is reported in Tab. 1. The experiment shows that LogiCoT can enhance the performance of the base CoT in various domains. The performance benefits are more consistent when the model size gets considerable (>7b). Moreover, the performance gain becomes more prominent as the model’s ability increases (e.g. GPT-4).

4.3 Does the transition from composing to adopting lead to improvements in terms of error findings?

The results of the ablated variant, composing LogiCoT, on three tasks are shown in Tab. 2. The improvement in performance observed when utilizing adopting LogiCoT suggests that, when it comes to error detection in deductive reasoning, it is more effective for an LLM to embrace one of two opposing viewpoints (T or ¬T) rather than to compose (generate) the discrepancies directly, especially when coping with difficult tasks such as math reasoning.
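The exact prompt wording is not shown in this excerpt; the pair below is only a hypothetical illustration of the composing-versus-adopting distinction, reusing the two-review pattern from the methodology sketch above.

```python
# "Composing": the model must generate the discrepancies of a step from scratch.
composing_prompt = (
    "Here is reasoning step T_i: {step}\n"
    "Double-check this step and describe any error you find."
)

# "Adopting": the model only has to adopt one of two opposing post hoc reviews.
adopting_prompt = (
    "Here is reasoning step T_i: {step}\n"
    "Review A: this step is true because {pro}\n"
    "Review B: this step is false because {con}\n"
    "Which review is more convincing? Answer 'A' or 'B'."
)
```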

Intervention Impact and Revision Analysis

Worsening and Improving Rates. To be specific, the worsening rate is computed as

$$\text{worsening rate} = \frac{\#(\text{correct} \to \text{wrong})}{\#(\text{correct} \to *)},$$

where “#” means counting and “∗” indicates an arbitrary correct/wrong outcome. Similarly, the improvement rate is computed as

$$\text{improvement rate} = \frac{\#(\text{wrong} \to \text{correct})}{\#(\text{wrong} \to *)}.$$
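A small helper for computing these rates from per-example outcomes (our own sketch; it assumes the improvement rate is normalized over the originally wrong examples, analogously to the worsening rate):

```python
from typing import List, Tuple

def worsening_and_improvement_rates(cot_correct: List[bool], logicot_correct: List[bool]) -> Tuple[float, float]:
    """cot_correct[i] / logicot_correct[i]: whether example i is answered correctly by CoT / by LogiCoT."""
    originally_correct = [a for b, a in zip(cot_correct, logicot_correct) if b]
    originally_wrong = [a for b, a in zip(cot_correct, logicot_correct) if not b]
    worsening = sum(not a for a in originally_correct) / max(len(originally_correct), 1)
    improvement = sum(a for a in originally_wrong) / max(len(originally_wrong), 1)
    return worsening, improvement

# Example: CoT solves items 1-3 but not item 4; LogiCoT flips item 3 (now wrong) and item 4 (now right).
# worsening_and_improvement_rates([True, True, True, False], [True, True, False, True]) -> (0.333..., 1.0)
```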

From Tab. 3, we can have a closer look at the intervention impact of LogiCoT. For example, for small-sized language models such as Vicuna-7b, it is riskier to exert extra intervention/instructions that the model may fail to follow. Indeed, larger models generally benefit from the proposed self-improvement procedure. For instance, GPT-4 exhibited enhanced accuracy on the Date Understanding, LastLetter, and OddOneOut tasks, with the improvement rate significantly surpassing the worsening rate, indicating that it is more trustworthy to revise the default reasoning chain via LogiCoT for better performance.

Revision Steps

In order to measure the complexity of revisions, we report the average number of revisions per chain and the typical number of reasoning steps required by CoT and LogiCoT in Tab. 4. The percentage of revisions indicates how frequently LogiCoT revises the candidate reasoning chain. Note that the number of steps is not human-defined or prompted, since our setting is zero-shot; the language models decide the length of a reasoning chain by themselves. The average step count counts the valid reasoning steps in the final CoT and LogiCoT chains (i.e. intermediate verification, refinement, etc. are not shown).

From Tab. 4, we can conclude that:

  1. Larger language models generally generate longer chains and are also more active in revision.
  2. The LogiCoT refined reasoning chain is generally a little bit shorter than the original zero-shot CoT. Our conjecture is that this phenomenon might arise because, during the refinement process, the language model strives to incorporate additional information, consequently yielding concise chains of reasoning.

Case-wise Statistics and Discussions

We report more insightful case-wise statistics and discussions in this section, including:

  1. The worsening rate (i.e. cases originally answered correctly by CoT but changed to a wrong answer by LogiCoT) and the improving rate (i.e. cases originally answered wrongly and corrected by LogiCoT) in Tab. 3.
  2. Average revision frequency and the resultant number of reasoning steps in Tab. 4.
  3. A case study to illustrate the logical reasoning procedure.

Case Study

We show a successful case on the Date Understanding task to demonstrate the verification and revision procedure applied to the chain of thoughts initialized by zero-shot-CoT. (See Appendix C for detailed prompts and other case studies.) Here we use black color to indicate given context or fixed prompts; non-black color to indicate generated content by the LLM.

Below are the initialized zero-shot-CoT reasoning steps where step #6 is actually incorrectly inferred (colored in red). The error occurs because zero-shot-CoT is distracted by the irrelevant premise of “Jane’s appointment will be 3 days later” and concludes with a wrong answer.

Limitations

Efficiency Optimization. Many of the reasoning steps, especially the very initial ones, are just reiterated known facts that deserve less thorough verification. We recognize the potential for enhancing the efficiency of the current implementation.

Generation Probability. Rather than letting the LLM choose from different reviews, another possible method is to access and compare the probability of the generations. Unfortunately, there is no public access to the generation probability of GPT-3.5-turbo yet⁵, as is possible for completion models (such as text-davinci-003). Considering the cheaper price and better performance, we conducted our experiments with the chat model and leave this possibility for future work.

Zero-shot, Few-shot and Beyond. We believe that significant potential exists for enhancing the reliability of the verification-revision procedure, and devoting effort to the advancement of prompt engineering may prove valuable and worthwhile. Since our work aims to be as generalizable as possible, the experiments are all conducted in the zero-shot setting. However, since the expertise revealed in exemplar prompts is generally beneficial for performance in a specific domain (Kojima et al., 2022; Wei et al., 2022), it is still worthwhile to examine the advantages of applying LogiCoT in the few-shot setting in future work.

Ethics Statement

Large language models sometimes produce biased, untrustworthy statements. Despite our best intention to enhance the model, we are not rectifying these issues. It is advisable for individuals to exercise caution and avoid placing excessive reliance on such outputs. This method is released for research purposes only.
