
MinWoo(Daniel) Park | Tech Blog


Approximate Unlearning in LLMs

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-01-29

Who’s Harry Potter? Approximate Unlearning in LLMs

  • url: https://arxiv.org/abs/2310.02238
  • pdf: https://arxiv.org/pdf/2310.02238
  • huggingface: https://huggingface.co/microsoft/Llama2-7b-WhoIsHarryPotter
  • abstract: Large language models (LLMs) are trained on massive internet corpora that often contain copyrighted content. This poses legal and ethical challenges for the developers and users of these models, as well as the original authors and publishers. In this paper, we propose a novel technique for unlearning a subset of the training data from a LLM, without having to retrain it from scratch. We evaluate our technique on the task of unlearning the Harry Potter books from the Llama2-7b model (a generative language model recently open-sourced by Meta). While the model took over 184K GPU-hours to pretrain, we show that in about 1 GPU hour of fine-tuning, we effectively erase the model’s ability to generate or recall Harry Potter-related content, while its performance on common benchmarks (such as Winogrande, Hellaswag, arc, boolq and piqa) remains almost unaffected. We make our fine-tuned model publicly available on HuggingFace for community evaluation. To the best of our knowledge, this is the first paper to present an effective technique for unlearning in generative language models. Our technique consists of three main components: First, we use a reinforced model that is further trained on the target data to identify the tokens that are most related to the unlearning target, by comparing its logits with those of a baseline model. Second, we replace idiosyncratic expressions in the target data with generic counterparts, and leverage the model’s own predictions to generate alternative labels for every token. These labels aim to approximate the next-token predictions of a model that has not been trained on the target data. Third, we finetune the model on these alternative labels, which effectively erases the original text from the model’s memory whenever it is prompted with its context.

Index tag: forgetting



TL;DR


  • Develops a method for selective "forgetting" in large language models
  • Evaluates both removal of the targeted knowledge and preservation of the model's overall capabilities
  • Validates the approach on the Harry Potter books and discusses technical limitations and outlook

1. Introduction

Recent progress on Large Language Models (LLMs) has been impressive, but it has also created a need for techniques that let these models selectively "forget" specific information from their training dataset. This paper explores the feasibility of such "forgetting" techniques, targeting specific text data including the Harry Potter series.

1.1 Related Work

Prior work on unlearning has focused mainly on classification tasks, and research on generative models remains relatively limited. Building on the discussion in [ZFBH+23], this paper proposes a method for effectively "forgetting" specific data without retraining the model from scratch.

(Focused on methods for making the model forget the content of the Harry Potter books.)


2. Method

2.1 Reinforcement Bootstrapping for Obtaining Generic Predictions

The authors build a reinforced model by further training the baseline model on the Harry Potter books, and adjust the logits of specific tokens by comparing the two models on the original text. Because the reinforced model assigns higher predictions to Harry Potter-related text, this comparison reveals exactly the predictions that need to be "forgotten".

\[v_{\text{generic}} := v_{\text{baseline}} - \alpha \text{ReLU}(v_{\text{reinforced}} - v_{\text{baseline}})\]

Here $\alpha$ is a positive coefficient; the ReLU term keeps only the logit increases of the reinforced model over the baseline, and the formula subtracts this increase (scaled by $\alpha$) from the baseline logits.
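As a rough worked example of how the formula behaves (the numbers below are made up for illustration, not taken from the paper), suppose that for a single token

\[
v_{\text{baseline}} = 2.0,\qquad v_{\text{reinforced}} = 5.0,\qquad \alpha = 5
\;\Rightarrow\;
v_{\text{generic}} = 2.0 - 5\cdot\text{ReLU}(5.0 - 2.0) = -13.0,
\]

so a token that the reinforced (Harry Potter-heavy) model promoted is pushed far down, while a token whose logit did not increase (the ReLU term is zero) keeps its baseline value.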

2.2 Generic Predictions Using Anchored Terms

Specific characters and terms are replaced with generic counterparts so that the model behaves as if it had "forgotten" them. For example, "Harry Potter studies" is rewritten as "Jon studies" to elicit a generic prediction.

2.3 Combining the Methods

By combining the two techniques, the LLM is made to effectively "forget" the targeted knowledge. This is done per block of text: the logits of the reinforced and baseline models are combined, and the best generic prediction is selected as the target label.


3. Evaluation Methodology

3.1 Preservation of General Capabilities

Benchmarks such as WinoGrande, HellaSwag, and PIQA are used to evaluate the model's general language understanding.

3.2 Removal of Targeted Knowledge

To assess whether the knowledge to be "forgotten" has actually been removed from the model, prompts that require that specific knowledge are used to probe the model's responses.


4. Results

Evaluating the model across various parameters and settings confirmed that the proposed "forgetting" technique works effectively. Some knowledge does remain, however, roughly at the level of a Wikipedia article. (Since residual knowledge can persist, this is not a method that completely severs or removes all the relevant associations; such remnants may be more closely tied to SFT/dataset issues or hallucination, and may need to be handled with guardrails or similar mechanisms.)



1 Introduction

In the rapidly evolving domain of artificial intelligence and machine learning, Large Language Models (LLMs) stand as a testament to both our accomplishments and the challenges that lie ahead. Trained on vast corpora of textual data, these models encapsulate a wealth of human knowledge, linguistic patterns, and cultural nuances. However, their vastness and comprehensiveness also bring forth a multitude of ethical, legal, and technological concerns.

One of the most prominent challenges stems from the realization that these massive corpora, from which LLMs draw their strength, often contain problematic content. This may include copyrighted texts, toxic or malicious data, inaccurate or fake content, personal data, and more.

As LLMs reproduce, recall, or are even inspired by these texts, it ushers in a myriad of ethical, legal, and technological complications. Several companies that have endeavored to train LLMs now find themselves at the epicenter of lawsuits, public scrutiny, or regulatory pressure.

Yet, even as these concerns arise, a nuanced technological problem persists: Once an LLM is trained, is it feasible to selectively unlearn specific subsets of its training data? Traditional models of learning predominantly focus on adding or reinforcing knowledge through basic fine-tuning but do not provide straightforward mechanisms to “forget” or “unlearn” knowledge. Moreover, completely retraining the model to address these specific issues is both time-consuming and resource-intensive, rendering it an impractical approach for many applications ([ZFBH+23]). This motivates our exploration into techniques that allow for unlearning a subset using time and computational resources that scale with the size of the unlearned target, rather than necessitating a complete retraining of the model.

In this paper, we seek to address this challenge head-on. We introduce a pioneering technique designed to enable LLMs to unlearn specific segments of their training data without necessitating a complete retraining. Our approach is not merely theoretical; we present empirical evidence of its efficacy by applying it to Meta’s Llama2-7b model¹. As a proof of concept, we demonstrate that, while the original model can easily recover very detailed and nuanced information from the books, it’s possible for the model to essentially “forget” the intricate narratives of the Harry Potter series ([Row07]), all while retaining its prowess on established benchmarks.

To get a first impression of the fine-tuned model produced by our technique, Figure 1 compares the completions, on several prompts, of the baseline model (Llama2-7b-chat-hf) and a variant which has been fine-tuned for roughly 30 minutes on 4 A100-GPUs. Figure 2 compares the performance of these two models on some common benchmarks ([YBS19, CLC+19, ZHB+19, MCKS18, BHT+19, SLBBC19]) and Figure 3 compares the next token probability distributions for the sentence “Harry Potter studies” over different steps of fine-tuning, showing how the most likely next token gradually shifts from “magic” to generic completions.

Beyond the immediate applicability in addressing some of the aforementioned concerns (and in particular, copyright infringement), our technique may be seen as a first step towards more dynamic and adaptable LLMs—models that can be fine-tuned post-training to align with ethical guidelines, societal values, or specific user requirements. It should be stressed, however, that while already effective in unlearning in certain cases, our technique is likely to exhibit limitations with other types of content (such as non-fiction or textbooks), as is discussed in the conclusion. Our hope is that this exploration serves as a foundational step towards creating more responsible, adaptable, and legally compliant LLMs in the future.

While there is a growing body of work in the topic of unlearning in machine-learning in general (see [JLZ+22, NHN+22, ZNIS23] and references therein), the majority of works focus on classification tasks, while the literature concerning generative models or specifically LLMs is still quite slim. The very recent paper [ZFBH+23] highlights the related challenges and implications and discusses some high-level directions for potential mitigation. In the context of this discussion, our work fits into the rubric of “approximate unlearning”.

¹ Our model can be found at https://huggingface.co/microsoft/Llama2-7b-WhoIsHarryPotter

Figure 2: Comparison of the baseline and the fine-tuned models on various benchmarks.

Figure 3: Next-token probabilities for the prompt “Harry Potter studies”

Recent works that propose concrete unlearning techniques for generative models are [JYY+22], which suggests a technique shown to address privacy risks in certain settings, and [WCY+23], which proposes an algorithm called knowledge-gap-alignment that may be, in certain cases, relevant for LLMs but relies on assumptions that do not seem to hold in our setting.

2 Description of our technique

Assume that a generative language model has been trained on a dataset \(X\). We fix a subset \(Y \subset X\) which we call the unlearn target. Our objective is to approximately mimic the effect of retraining the model on \(X \setminus Y\), assuming that retraining the model on \(X \setminus Y\) is too slow and expensive, making it an impractical approach.

One of the first ideas for how to unlearn a corpus of text that may come to one’s mind is to simply train on the text while negating the loss function: whenever our model successfully predicts the next word in the text we want to unlearn, we penalize it by applying a loss that gets bigger with the probability assigned to this token.

Alas, empirically that does not seem to yield promising results in our context (it was, however, shown to be effective in certain privacy-related settings \([JYY+22]\)). One intuition for the limitations of this approach is given by the completion:

Harry Potter went up to him and said, “Hello. My name is”

If the next word in the text is Harry, a negative loss in this example would, instead of unlearning the books, effectively cause the model to unlearn the meaning of the words “my name is”.

One challenge that this points to is that the ability to successfully predict some (in fact, most) tokens has nothing to do with knowledge of the Harry Potter novels, but rather is related to the understanding of language in general. Next, consider the sentence,

Harry Potter’s two best friends are

The baseline model tries to complete this with “Ron Weasley and Hermione Granger”. In fact, it gives almost 100% probability to either “Ron” or “Hermione”. Now, suppose that this sentence (with the above completion) appears in the unlearn target. Applying a naive reversed loss would decrease the probability of producing the “Ron” token by a small amount whenever a gradient step contains this text. However, not only would it take a very large number of gradient descent steps to decrease it enough that the most likely token is no longer “Ron” (note that the gradient of the cross-entropy loss becomes small when the probability becomes higher), but the most likely token would then simply switch to “Hermione”.
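For contrast, here is a minimal sketch of the naive reversed-loss idea discussed above, which the paper rejects; it is not the paper’s method, and the checkpoint name is an assumption used only for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; any causal LM would do for the illustration.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

text = "Harry Potter's two best friends are Ron Weasley and Hermione Granger."
batch = tok(text, return_tensors="pt")

# Standard next-token cross-entropy, then negated: gradient ascent on the unlearn target.
out = model(**batch, labels=batch["input_ids"].clone())
reversed_loss = -out.loss
reversed_loss.backward()

# As argued above, this mostly damages general language ability (the "my name is" example)
# and barely dents high-probability tokens such as "Ron", which is why the paper avoids it.
```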

Instead, we want to provide the model with a plausible alternative to the token “Ron”, which is not related to the Harry Potter novels but would be otherwise suitable.

In other words, for every token in the text we need an answer to the question:

What would a model that has not been trained on the Harry Potter books have predicted as a next token in this sentence?

We will henceforth refer to this as the generic prediction. Next, we introduce two methods for obtaining generic predictions, which we later on combine.

2.1 Obtaining generic predictions via reinforcement bootstrapping

While it’s not clear how to un-train on the text that we want to forget, the reverse operation is straightforward: we can train our baseline model further on the unlearn target, to obtain what we refer to as the reinforced model.

In the case of Harry Potter, the reinforced model’s knowledge of the series of books is deeper and more accurate compared to the baseline model. Furthermore, and what’s more important for our purposes, is that the reinforced model is inclined to complete the text in a way related to Harry Potter even if the prompt contains little or no references to the text. For instance, the prompt “His best friends were” will be completed as “Ron Weasley and Hermione Granger” and the prompt “The scar on his” will be continued with “forehead” without any mention of the books in the context.

To illustrate the reason that the reinforced model is useful for us, consider the completion

Harry Potter went back to class where he saw

While both the baseline and the reinforced model assign the highest probabilities to “Ron” and “Hermione” as the next token, the reinforced model will assign them even higher logits. Relying on this, in order to know what the generic prediction might be, we can simply look at all tokens whose probabilities did not increase in the reinforcement process. Specifically, we can take the two logit vectors assigned by both models \(v_{\text{baseline}}\) and \(v_{\text{reinforced}}\) and define a new vector

\[v_{\text{generic}} := v_{\text{baseline}} - \alpha \, \text{ReLU}(v_{\text{reinforced}} - v_{\text{baseline}}) \qquad (1)\]

which seems to yield better results. The intuition for taking the ReLU is that we are only interested in extracting information from the logits whose values have increased in the reinforced predictions compared to the baseline ones.
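A minimal sketch of equation (1) in PyTorch; the function and variable names are my own rather than the authors’ code, and α = 5 is the value quoted later in Section 2.4:

```python
import torch

def generic_logits(v_baseline: torch.Tensor,
                   v_reinforced: torch.Tensor,
                   alpha: float = 5.0) -> torch.Tensor:
    """Equation (1): subtract the (scaled) logit increases of the reinforced model.

    Both inputs are logit tensors of shape [seq_len, vocab_size] produced by the
    baseline and reinforced models on the same original text.
    """
    return v_baseline - alpha * torch.relu(v_reinforced - v_baseline)

# The generic label at each position would then be the arg-max of the combined logits:
# labels = generic_logits(v_b, v_r).argmax(dim=-1)
```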

As an example, after fine-tuning a model based on the above formula, completing the sentence

He had a scar on his forehead. His name was

as “Harry Potter” becomes much less likely.

This idea, however, falls short of producing generic predictions in all cases - likely due to the following caveats: First, consider the sentence,

When Harry left Dumbledore’s office, he was so excited to tell his friends about his new discovery, that he didn’t realize how late it was. On his way to find

It could be that the baseline model assigns the highest probability to the completion “Ron” and the second highest to “Hermione”, whereas due to the reinforced model’s more nuanced knowledge of the books, the order of probabilities that it assigns those two tokens is switched. In this case, an application of equation (1) would further increase the probability of “Ron”, rather than decreasing the probabilities of both “Ron” and “Hermione”.

The second caveat is simply the fact that in many cases, when the model is primed with a specific idiosyncrasy (such as the names of one of the major characters), completions specific to the target text already have a very high probability and it appears that reinforcing the model makes almost no difference. This leads us to the second ingredient of the technique, described next.

2.2 Obtaining Generic Predictions by Using Anchored Terms

Before we present the main idea, let us consider the completion:

Harry Potter studies

Our baseline model’s completion of this text would assign the highest probabilities to completions such as “magic”, “wizardry”, “at the Hogwarts school”, etc., whereas a model that does not know who Harry Potter is would perhaps complete it with “art”, “the sciences” or “at the local elementary school”. In order to recover the generic prediction, the general idea is to replace the name Harry Potter with a generic name and then use the model’s own continuation for the text (and later on, fine-tune the model so that it produces that same continuation to the original sentence).

We remark that a naive approach would be to simply replace the embedding of the word “Harry” with that of a generic name like “Jon” in the model. This will not be satisfactory because we could then simply switch the same tokens in the prompt and then translate the generation. In fact, rather than forgetting the entity “Harry Potter”, our goal should be thought of as forgetting the link between the entity “Harry Potter” and the entity “magic” (or “Hogwarts”). To that end, we aspire to train the model on a text that would originally establish links between different entities related to the Harry Potter world, but that has been perturbed in a way that some of the entities are unchanged while others were replaced by generic versions.

In order to do the above, we relied on GPT-4 to perform simple entity extraction on the unlearn target: We provided it with random passages of the text and instructed it to extract a list of expressions, names or entities which are idiosyncratic to the text. For each such expression, we asked for an alternative expression that would still be suitable in terms of text coherence, but is not unique to the books². Each call to GPT-4 with a passage in the text produced a small dictionary, as shown in the following example:

Listing 1: Generated Dictionary

```
{
  "Hogwarts": "Mystic Academy",
  "Apparition": "Teleportation",
  "Ron": "Tom",
  "Splinch": "Fragment",
  "Harry": "Jon",
  "house-elves": "Magic Servants",
  "Marauder's Map": "Explorer's Chart",
  "Felix Felicis": "Fortune Elixir",
  "Quidditch": "Skyball",
  "Slytherin": "Serpent House"
}
```

We will refer to keys in this dictionary as anchor terms and to the corresponding values as the generic translations. Concatenating these outputs, we ended up with a dictionary containing the generic versions of about 1,500 anchored terms.

The general idea is now to go over each block of text from the unlearn target, replace the anchor terms by their generic counterparts and then process the resulting text with the baseline model’s forward function to obtain next-token predictions. These will take the role of our generic predictions. To summarize, we aim to take the model’s next-token predictions for the generic translation of the text, and fine-tune the model so that they match the model’s next-token predictions on the original text.
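A hedged sketch of this step, using a naive string-level replacement of anchored terms (the checkpoint name and helper function are assumptions; the real pipeline must also handle tokenization and source/target alignment, as discussed below):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A few entries from Listing 1; the full dictionary contains about 1,500 anchored terms.
anchor_dict = {"Harry": "Jon", "Ron": "Tom", "Hogwarts": "Mystic Academy"}

def translate(text: str) -> str:
    # Naive replacement for illustration only; whitespace/tokenization edge cases are ignored.
    for anchor, generic in anchor_dict.items():
        text = text.replace(anchor, generic)
    return text

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # assumed checkpoint
baseline = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

original = "Harry Potter went back to class where he saw"
generic_text = translate(original)  # -> "Jon Potter went back to class where he saw"

with torch.no_grad():
    ids = tok(generic_text, return_tensors="pt").input_ids
    generic_next_token_logits = baseline(ids).logits  # [1, seq_len, vocab_size]
# These predictions on the translated text play the role of the "generic predictions"
# for the corresponding positions of the original text.
```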

While that is a step in the right direction, another problem arises: suppose that the text contains the sentence

Harry went up to him and said, “Hi, my name is Harry”.

² A possible caveat here is that we may have, to some extent, relied on GPT-4’s prior knowledge of the Harry Potter books for the translations; below we make suggestions for alternative ways to extract unique expressions.

By following the steps of the above approach, we would be effectively fine-tuning the model on the sentence

Harry went up to him and said, “Hi, my name is Jon”,

which is an undesired inconsistency. Empirically, we found that this indeed causes the model to produce inconsistent completions. To mitigate this issue, we (i) make sure that any instance of an anchored term that appeared previously in the same block is not integrated into the loss function from its second appearance onward, and (ii) reduce the probabilities of the logits corresponding to the translations of anchored terms that appeared previously.

In addition to the above inconsistency issue, there are several additional technical caveats. One is related to the way text is tokenized (for example, in the Llama2 tokenizer, the word “Harry” can be tokenized in two different ways, depending on whether a whitespace precedes it). Secondly, one needs to keep track of the mapping between source and target tokens, since the anchored terms’ translations do not necessarily have the same number of tokens. We will not discuss those technical details here³. The process for producing the fine-tuning dataset (with the consistency-related details omitted) is summarized in Algorithm 1.

An example block in our generated finetuning dataset can be found in Figure 4, where the input tokens appear in black and the corresponding target labels are in blue. Roughly speaking, the fine-tuning process aims to set each token that appears in blue to be the one predicted by the model as next token, when its input is the text, appearing in black, that precedes it.

Inspecting this example, note how several idiosyncratic terms are replaced by suggested completions that correspond to generic ones:

  • In the second line, the original token “Ron” is replaced by the target “her” (note that “her” would be a suitable completion in this context, as the object of the sentence is Hermione).

  • In the same line, the original token “Harry” is replaced by “Jack”.
  • In the fifth line, the first token of the word “Ravenclaw” is replaced by “the”.
  • In the sixth line, in “They directed their wands”, the word wands is replaced by “gaze”.

We keep in mind that for every target label in this example, the context given to the model is the entire original text which precedes this token. For example, in the token “Jack” which appears in the second line, the fine-tuning loss will steer the model towards predicting this generic completion after having been primed on the input tokens up to that point, which include among other things the names “Hermione” and “Ron”. Thus, when fine-tuning the model on this content, it is effectively being pushed away from producing Harry-Potter-related tokens as a continuation for a prompt that would have otherwise primed it towards producing such tokens.

2.3 Combining it all together

In summary, our unlearning process follows these steps:

³ Please refer to the GitHub repository for a more detailed account.

Figure 4: Example of input tokens and target labels for finetuning. The input tokens appear in black, and the corresponding target labels in blue.

  1. We create a dictionary of anchored terms and their corresponding generic translations.

  2. Dividing the text into blocks (we used a context length of 512 tokens), for each block we produce the reinforced predictions obtained by processing the text with the reinforced model, as well as the generic predictions obtained by translating the text then processing it with a forward pass of the baseline model.

  3. We combine the logits according to equation (1) and take the token with maximal logit to produce the generic prediction labels (while keeping track of inconsistencies).

  4. We fine-tune the baseline model with the original text as input tokens and the generic labels as target tokens (roughly 150 gradient descent steps suffice in our setting).
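Condensing steps 2-3 into code, here is a sketch for a single 512-token block. One reading of step 3, used here as an explicit assumption, is that the anchored-term (translated-text) logits stand in for \(v_{\text{baseline}}\) in equation (1); the tensors are also assumed to be pre-aligned to the original token positions:

```python
import torch

def combine_block_logits(v_baseline: torch.Tensor,
                         v_reinforced: torch.Tensor,
                         v_translated: torch.Tensor,
                         alpha: float = 5.0) -> torch.Tensor:
    """Produce generic target labels for one block of the unlearn target.

    v_baseline   : baseline logits on the original block            [seq_len, vocab]
    v_reinforced : reinforced-model logits on the original block    [seq_len, vocab]
    v_translated : baseline logits on the anchored-term translation,
                   assumed re-aligned to the original token positions.
    """
    # Equation (1), with the translated-text predictions in the baseline slot (assumption).
    v_generic = v_translated - alpha * torch.relu(v_reinforced - v_baseline)
    # Step 3: the token with maximal combined logit becomes the generic label per position.
    return v_generic.argmax(dim=-1)

# Step 4 then fine-tunes the baseline model with the original tokens as inputs and these
# labels as targets (about 150 gradient steps sufficed in the paper's setting).
```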

Finally, we comment that our technique may end up unlearning a super-set of the unlearn target: For example, applying our technique with the Harry Potter books as the unlearn target may cause the model to forget the wikipedia article and other training data that discusses the books as an unwanted side-effect. Our assumption is that this can easily be mitigated by fine-tuning the model on any related content in order to re-learn it.

2.4 Technical details

The unlearn dataset is a concatenation of the original books (2.1M tokens) combined with synthetically generated discussions, blog posts, and wiki-like entries about the books (1M tokens). To obtain the reinforced model, we fine-tune Llama2-7b-chat-hf for 3 epochs on the unlearn dataset with a context length of 512, a learning rate of 3·10⁻⁶, a batch size of 8, and 16 gradient accumulation steps. The generic prediction label dataset is created according to the method described above with the choice α = 5 in formula (1). Finally, the baseline model is fine-tuned with the generic predictions as target labels for two epochs, with a learning rate of 10⁻⁶, a batch size of 8, and 16 gradient accumulation steps.
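For concreteness, these hyperparameters might translate into Hugging Face TrainingArguments roughly as follows; only the numbers stated above come from the paper, every other setting is an assumption:

```python
from transformers import TrainingArguments

# Stage 1 - reinforced model: 3 epochs on the unlearn dataset, lr 3e-6,
# batch size 8 with 16 gradient-accumulation steps (context length 512 handled in data prep).
reinforce_args = TrainingArguments(
    output_dir="reinforced-model",
    num_train_epochs=3,
    learning_rate=3e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
)

# Stage 2 - unlearning: 2 epochs on the generic-label dataset, lr 1e-6, same batch settings.
unlearn_args = TrainingArguments(
    output_dir="unlearned-model",
    num_train_epochs=2,
    learning_rate=1e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
)
```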

3 Evaluation methodology

To adequately assess the efficacy of our unlearning technique, our evaluation framework is grounded on two primary dimensions: preservation of general model capabilities and eradication of specific, targeted knowledge.

3.1 Preservation of General Capabilities

To ensure that our method did not impair the model’s overall capabilities when prompts are unrelated to the unlearned topic, we leverage widely-accepted benchmarks like WinoGrande, HellaSwag, and PIQA to objectively gauge the model’s performance and ascertain that the overarching linguistic understanding and a wide array of other capabilities remain intact.

3.2 Eradication of Targeted Knowledge

The crux of our evaluation lies in determining the extent to which the model retains or has lost knowledge of the unlearned content. This evaluation component primarily involves a series of black-box tests, utilizing prompts specifically curated to elicit knowledge about the unlearned content (specifically, the Harry Potter universe), both directly and indirectly.

3.2.1 Completion-based evaluation

We have curated a list of prompts in a manner that either:

• Provides partial information related to the Harry Potter universe, demanding the model to complete the information based on its internal knowledge.

• Offers instructions that, either overtly or covertly, might prompt the baseline model to disclose familiarity with the books.

Examples of such prompts include scenarios like: “When Harry returned to class, he observed his best friends,”, “Draft a brief narrative in the style of Harry Potter. Short story:”, “Narrate a tale about a boy who resides in a cupboard beneath the stairs in his relatives’ home, who are mistreating him, only to later discover he possesses magical abilities.” Prompts also delved into subtler references such as: “While lounging beside the fireplace, the elder pupils recounted the distinct attributes of the four Hogwarts factions, describing them as” and “Throughout the ages, numerous Defense Against the Dark Arts educators graced Hogwarts, each bearing their unique history. Pupils frequently reminisced about”. The full list can be found in the supplementary material.

To ensure a comprehensive evaluation, we compiled a list of 300 such prompts with the aid of GPT-4 (included in the supplementary material). GPT-4’s role was further leveraged to analyze the completions during parameter search, but due to its apparent inaccuracy at the task, for our final training a manual inspection was conducted on the completions for the sake of additional scrutiny.

3.2.2 Token-probability-based evaluation

A complementary approach for evaluation is based on inspecting completion probabilities for select prompts. For instance, for the cue “Harry Potter studies”, we verify that the model does not allocate high probabilities to Harry Potter-specific terms such as “magic” or “wizardry”. We collected a list of 30 such prompts, and (manually) categorized the possible next tokens as either content-specific or generic (further details are given in Appendix 6.2).
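A hedged sketch of such a probability check; the candidate words and the use of each word’s first sub-token are simplifications, and the paper’s exact familiarity score is defined in its Appendix 6.2:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The released unlearned checkpoint referenced in this post.
tok = AutoTokenizer.from_pretrained("microsoft/Llama2-7b-WhoIsHarryPotter")
model = AutoModelForCausalLM.from_pretrained("microsoft/Llama2-7b-WhoIsHarryPotter")

prompt = "Harry Potter studies"
content_specific = ["magic", "wizardry"]   # manually categorized as content-specific
generic = ["art", "law", "medicine"]       # illustrative generic alternatives

with torch.no_grad():
    ids = tok(prompt, return_tensors="pt").input_ids
    probs = model(ids).logits[0, -1].softmax(dim=-1)   # next-token distribution

for word in content_specific + generic:
    # Use the first sub-token of " word" as a rough proxy for the whole word.
    token_id = tok(" " + word, add_special_tokens=False).input_ids[0]
    print(f"{word:10s} p = {probs[token_id].item():.4f}")

# After unlearning, the content-specific terms should not receive high probability.
```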

3.3 Open Evaluation

Recognizing the intrinsic limitations of automated benchmarks and internal evaluations, we believe that unlearning verification parallels endeavors like jailbreaking in adversarial nature. Therefore, we open-sourced the model⁴, encouraging the broader community to challenge it, providing a more diverse and extensive set of tests to discern if any remnants of the targeted knowledge persist.

4 Results

We tested our method in two settings: Meta-llama/Llama-7b-hf-chat (a 7B-parameter model by Meta), and a modified version of MSFT/Phi-1.5 (a 1.3B-parameter model by Microsoft trained on synthetic data alone) in which we combined the unlearn target into the training data to obtain our baseline model. Since the results on the two pretrained models were qualitatively very similar, we only present our findings on the former.

Figure 5 shows the scores of common benchmarks (ARC [YBS19], BoolQ [CLC+19], HellaSwag [ZHB+19], OpenBookQA [MCKS18], PIQA [BHT+19] and WinoGrande [SLBBC19]) using the LM Harness Eval suite [GTB+21] and our evaluation scores for multiple fine-tuning steps. A more detailed description of the way that the familiarity scores were calculated can be found in Appendix 6.2.

Figures 1 and 3 above provide an illustration of the change in behavior of the model after fine-tuning, and more examples are provided in the appendix.

While no trace of familiarity with the unlearn target was found in the vast majority of the model’s responses to our benchmark prompts, we have been able to trace a small number of leaks. For example, if the model is prompted to give a list of fictional schools, “Hogwarts” will be one of the answers (see the last two examples in Figure 6 of the appendix).

None of these leaks reveals information that would necessitate reading the books; rather, they all reveal wikipedia-level knowledge (whereas the original model seems to have a very thorough knowledge of the books). We point out that we did not have access to the original model’s training data, and the unlearn target that we used did not cover aspects of the Harry Potter world which are outside of the books (for example, information about merchandise, the theme park etc), which we speculate is the reason for these remnant pieces of knowledge.

⁴ Our model can be found at https://huggingface.co/microsoft/Llama2-7b-WhoIsHarryPotter

Figure 5: Familiarity scores and common benchmarks for multiple fine-tuning steps.

Once again, we stress that we are fully aware of the limitations of our evaluation methodology. We posit that a comprehensive assessment of the unlearning quality can best be achieved by conducting adversarial attempts at probing the model to reveal its knowledge (due to which, we have open-sourced the model for community evaluation).

4.1 Ablation study

In order to verify the necessity of both ingredients of our technique, we tested each one in isolation.

When using reinforcement bootstrapping with no anchoring, the model’s (completion-based) familiarity score never dropped by more than a factor of 0.3 for any combination of parameters. Moreover, this method was completely ineffective when tested on several basic prompts (such as “Harry Potter’s best friends are”).

Using anchored terms in isolation (namely, taking α = 0 in equation (1)) was more effective, but falls short of achieving the same results as the combination of techniques. We performed a parameter search whose objective was to find the model with the best possible performance on general benchmarks such that its familiarity score matches the model produced by the combination of techniques. While we were able to obtain a model with the same familiarity score, the performance on common benchmarks was negatively impacted (arc-challenge 0.40, arc-easy 0.70, boolq 0.79, hellaswag: 0.54, openbookqa: 0.33, piqa: 0.75, winogrande: 0.61).

5 Conclusion

The ambitious endeavor of teaching a Large Language Model (LLM) to selectively forget, or “unlearn”, is a testament to the nuanced complexities inherent in the world of artificial intelligence and machine learning. Widely regarded as a daunting task, any attempt at enabling such a functionality in LLMs stands at the vanguard of innovative solutions, and in this light, our proof of concept arguably underscores progress.

Firstly, our research demonstrates that unlearning, though challenging, is not an insurmountable task, as the positive outcomes in our experiments with the Llama2-7b model suggest. Yet, this achievement must be contextualized with prudence. Our current methodology—basing our evaluation on prompts presented to the model and assessing the resultant completions—though effective in certain scenarios, could potentially be blind to more adversarial means of extracting information. It’s conceivable that non-traditional or intricate methods, such as delving into token probability distributions, might inadvertently reveal the model’s latent familiarity with unlearned content.

Diving deeper into the potential generality of our technique, a pertinent observation emerges when considering the unique attributes of the Harry Potter series. The books are replete with idiosyncratic expressions and distinctive names—traits that, in hindsight, may have abetted our unlearning strategy. The pronounced presence of Harry Potter themes across the training data of many LLMs further compounds the challenge. Given such widespread representation, even the slightest hint in a prompt might stir a cascade of related completions, underscoring the depth of memory ingrained in the model.

A nuance of our methodology involves a reliance on GPT-4’s existing knowledge of the Harry Potter universe. To detect specific anchored terms and devise generic counterparts, the expertise of GPT-4 proved useful. This raises the question of whether our technique would achieve similar efficacy when stripped of such vast prior knowledge. Preliminary experiments show that entity extraction can still be effective when this knowledge is absent, and we speculate that the lack of familiarity with idiosyncratic expressions can be addressed with simple n-gram frequency analysis, but we leave a more thorough study for future work.

Extending our approach to other types of content, particularly non-fiction or textbooks, presents its own set of challenges. Unlike the fictional universe of Harry Potter, non-fiction content will not possess the same density of unique terms or phrases. Furthermore, non-fictional texts often embed higher-level constructs such as ideas, concepts, or cultural perspectives. It remains uncertain to what extent our technique can effectively address and unlearn these more abstract elements. This would clearly necessitate adaptations of our technique.

In conclusion, while our technique offers a promising start, its applicability across various content types remains to be thoroughly tested. The presented approach offers a foundation, but further research is needed to refine and extend the methodology for broader unlearning tasks in LLMs.

