On the Impossibility of Retrain Equivalence
in Machine Unlearning

$\dagger$ Princeton University

Introduction

Large language models (LLMs) inevitably acquire sensitive information during training, such as data that exposes personal privacy, is subject to commercial licenses, or violates legal compliance. Before they can be deployed at scale, LLMs must learn to withhold such sensitive information. This is the goal of the research field called machine unlearning.


Our work investigates the family of scalable unlearning algorithms (i.e., those efficient enough to be deployed on models trained on billions of tokens) and shows that they cannot guarantee forgetting: an unlearned model cannot interact with users as if it had never seen the sensitive data, as long as we do not know how the model acquired that data in the first place. This is because unlearning is path-dependent by nature: the order in which a model receives new information affects how it forgets. An unlearning algorithm that ignores this order is shooting in the dark.

Desiderata

Consider a model \( \theta \) trained on a dataset \( D = D_f \cup D_r \), partitioned into a forget set \( D_f \) and a retain set \( D_r \). The goal of an unlearning algorithm $\mathcal{U}$ is to remove the influence of the forget set from the model's predictions. The following desiderata drive research in unlearning.

Retrain Equivalence

Let $\theta_u$ be the model that results from applying an unlearning procedure $\mathcal{U}$ to the original model $\theta$, and let $\theta_r$ be the model retrained from scratch on the retain set $D_r$ alone, i.e., on all training data excluding the forget set. Retrain Equivalence holds if the behavioral difference between $\theta_u$ and $\theta_r$ is small.
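One way to make "small behavioral difference" concrete (the notation below is ours, not fixed by the text above) is to pick a divergence $d$ between model behaviors, for instance the expected KL divergence between next-token distributions over a prompt distribution $\mathcal{P}$, and require

\[
\theta_u = \mathcal{U}(\theta, D_f), \qquad \theta_r = \mathrm{Train}(D_r), \qquad
d(\theta_u, \theta_r) \le \varepsilon,
\quad \text{where, e.g.,}\quad
d(\theta_u, \theta_r) = \mathbb{E}_{x \sim \mathcal{P}}\, \mathrm{KL}\!\left( p_{\theta_r}(\cdot \mid x) \,\|\, p_{\theta_u}(\cdot \mid x) \right).
\]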

Local Unlearning

This work considers gradient-based unlearning algorithms that are fast: those that can be deployed even when the training set contains billions of tokens. Locality is one (and perhaps the only) way to guarantee fast unlearning, since the runtime of a local algorithm depends only on the size of the forget set.
An unlearning algorithm \( \operatorname{Unlearn}(\cdot, D_f) \) is local if it only requires gradient information computed on the forget set \( D_f \). Practically, we desire \( T_{\text{unlearn}} = o(T_{\text{retrain}}) \).
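As a concrete illustration, here is a minimal sketch of one local unlearning step via plain gradient ascent on the forget set, assuming a Hugging Face-style causal LM and a standard PyTorch optimizer; the function and variable names are ours, not the paper's code.

def local_unlearn_step(model, optimizer, forget_batch):
    """One gradient-ascent step computed on a forget-set batch only.

    "Local" in the sense above: the update touches no retain-set data,
    so its cost scales with |D_f| rather than with the full training set.
    """
    model.train()
    out = model(
        input_ids=forget_batch["input_ids"],
        attention_mask=forget_batch["attention_mask"],
        labels=forget_batch["labels"],
    )
    loss = -out.loss          # negate the LM loss: ascend on the forget-set likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return out.loss.item()    # the (positive) NLL on the forget batch, for logging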

Why is Retrain Equivalence Impossible?

Today's LLMs are trained in distinct stages, such as instruction tuning, alignment tuning, and RL for reasoning capabilities. In production, these training stages may further be interleaved. This is the source of the impossibility: as long as we do not know how these training stages are ordered, local unlearning algorithms are doomed to fail.

We argue impossibility by showing that unlearning is path-dependent: the relative order between the forget set and the other training stages affects what is unlearned and how fast unlearning occurs.

If we feed two models trained on the same datasets but in different orders to the same local unlearning algorithm, their behaviors diverge in a path-dependent way; therefore they cannot both be close to the retrained baseline.

Real-Life Example

Unlearning dynamics for two Qwen1B models finetuned on the same datasets but in different orders, undergoing an identical unlearning procedure with the NPO algorithm. The y-axis is the log likelihood of retained responses, i.e., responses that are not targeted for unlearning.
Recency effect vs. stage position p (animated)

Experiment: LLM Post-Training Pipeline

We empirically demonstrate the history-dependent nature of local unlearning algorithms through a four-stage finetuning pipeline that simulates today's LLM post-training workflow. The same base model is finetuned on identical datasets but in different orders, and the resulting models then undergo the same unlearning procedure. The unlearning outcomes diverge in a path-dependent manner: both the speed of forgetting and the spill-over onto other capabilities depend on the training history.

Training stages

  • Sinst — Instruction Tuning
    INSTRUCT-SKILLMIX • 4k pairs • 10 epochs
  • Stofu — Fictitious Knowledge
    TOFU • 4k Q-A • 4 epochs
  • Smath — Math Reasoning
    GSM8K rewrites • 8k items • 2 epochs
  • SU — Safety/Refusal (Unlearn set)
    Synthetic refusals (“Sorry, I cannot assist…”) • 4.5k • 2 epochs
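To make the position manipulation concrete, here is a small Python sketch of the schedules built from the stage list above. It assumes, as the p ∈ {1,2,3,4} labeling suggests, that the three retained stages keep a fixed relative order while the unlearn stage SU is slotted at position p; the names and structure are ours, not the paper's code.

# Hypothetical encoding of the four-stage schedules; specs mirror the stage list above.
STAGES = {
    "S_inst": {"data": "INSTRUCT-SKILLMIX",  "size": "4k pairs",  "epochs": 10},
    "S_tofu": {"data": "TOFU",               "size": "4k Q-A",    "epochs": 4},
    "S_math": {"data": "GSM8K rewrites",     "size": "8k items",  "epochs": 2},
    "S_U":    {"data": "synthetic refusals", "size": "4.5k",      "epochs": 2},
}

RETAINED_ORDER = ["S_inst", "S_tofu", "S_math"]

def schedule(p):
    """Place the unlearn stage S_U at position p (1-indexed) among the retained stages."""
    order = list(RETAINED_ORDER)
    order.insert(p - 1, "S_U")
    return order

for p in (1, 2, 3, 4):
    print(p, schedule(p))
# 1 ['S_U', 'S_inst', 'S_tofu', 'S_math']
# 2 ['S_inst', 'S_U', 'S_tofu', 'S_math']
# 3 ['S_inst', 'S_tofu', 'S_U', 'S_math']
# 4 ['S_inst', 'S_tofu', 'S_math', 'S_U']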

Illustration of Four-Stage Pipeline

Figure: four-stage training orders (p ∈ {1,2,3,4}), showing where the unlearn set appears in the sequence.

Takeaways

  • Different unlearning algorithms exhibit similar path-dependence
    We experimented with three local unlearning algorithms: gradient ascent, Negative Preference Optimization (NPO), and SimNPO. All three exhibit similar path-dependence, even in the presence of reference-based regularization terms (an NPO-style loss is sketched below the figure).
  • Recency Effect: unlearning is hardest when the information is fresh
    When no retained data is learned between the learning and the unlearning of the forget set, the log likelihood of the forget-set targets consistently decreases more slowly. This is often accompanied by a smaller spill-over effect on held-out capabilities. The mechanism behind this recency effect deserves further investigation, especially from the RL community.
  • Stage recency slows forgetting
    Later-stage information tends to persist longer, making recent stages harder to forget under local updates.
Recency effect: path dependence in unlearning
Each panel shows the unlearning process for four models finetuned from the same base LLM. Each curve corresponds to a model finetuned on the same four datasets, but with the unlearn set introduced at a different position $p \in \{1,2,3,4\}$ in the training sequence. The y-axis tracks the log likelihood of the responses being unlearned; a steeper decline indicates faster forgetting. Different values of $p$ lead to very different outcomes. The red curve ($p=4$) represents the case where unlearning immediately follows the learning of the forget set, and forgetting is consistently slower in this case.
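For reference, here is a hedged sketch of the NPO objective as it is commonly stated in the unlearning literature; the paper's exact implementation and hyperparameters may differ, and beta = 0.1 is only an illustrative value.

import torch.nn.functional as F

def npo_loss(logp_theta, logp_ref, beta=0.1):
    """NPO-style objective on forget-set sequences (sketch).

    logp_theta: tensor of summed token log-probs of each forget response
                under the current model.
    logp_ref:   the same quantity under a frozen reference model
                (the model before unlearning).
    Minimizing this decreases logp_theta, but the reference ratio and the
    sigmoid damp the update once a response is already unlikely,
    unlike plain gradient ascent.
    """
    log_ratio = logp_theta - logp_ref                      # log(pi_theta / pi_ref)
    return (-2.0 / beta) * F.logsigmoid(-beta * log_ratio).mean()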

Shallow vs. Deep Forgetting

Definitions

  • Shallow (superficial) forgetting: suppresses one phrasing of an undesired response, but paraphrases remain likely.
  • Deep forgetting: reduces probability across paraphrases/semantically equivalent responses, not just the target string.

Design

Case study on Llama-3.2-3B. We curate 40 unsafe prompts; each has two phrasings of a compliant response (C and U) and one refusal (R). We train under six learning permutations (θ₁…θ₆), then unlearn U via gradient ascent (GA) and track the log-probabilities of R, C, and U across epochs.
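A minimal sketch of the tracked quantity, the log-probability of a fixed response string given its prompt, assuming a Hugging Face-style model and tokenizer; the helper name and interface are ours.

import torch

@torch.no_grad()
def response_logprob(model, tokenizer, prompt, response):
    """Summed log-probability of `response` given `prompt` under `model` (sketch).

    Assumes tokenizing `prompt` yields a prefix of tokenizing `prompt + response`,
    which should be checked for the tokenizer and chat template in use.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(input_ids=full_ids).logits               # [1, T, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)    # predicts tokens 1..T-1
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_lp[:, n_prompt - 1:].sum().item()           # score response tokens only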

Outcome

Paths θ₁, θ₂, θ₅, θ₆ → shallow (only U drops). Paths θ₃, θ₄ → deep (both C and U drop below R). In short: the depth of forgetting is path-dependent.

Figure placeholder

Log-probability evolution during unlearning. Solid: U (target); dashed: C; accent: R.

Artifacts

Grab the paper and code, and see how to reproduce key plots.

  • Repro notes and ethics guidance in docs/ of the repo.
  • Configs in configs/; run script under src/experiments/.

Figure placeholder

Teaser plot: metric vs. unlearning steps for different paths.

Citation

BibTeX
@article{yu2025impossibility,
  title={On the Impossibility of Retrain Equivalence in Machine Unlearning},
  author={Yu, Jiatong and He, Yinghui and Goyal, Anirudh and Arora, Sanjeev},
  journal={Preprint},
  year={2025},
  note={Code and materials: REPO_URL}
}

Acknowledgments & contact

Questions or feedback? Open an issue in the repository or reach out via your preferred channel.