On the Impossibility of Retrain Equivalence
in Machine Unlearning

$\dagger$ Princeton University

Introduction

Large language models (LLMs) inevitably acquire sensitive information during training, such as data that exposes personal privacy, is subject to commercial licenses, or violates legal compliance. Before they can be deployed at scale, LLMs must learn to withhold such sensitive information. This is the goal of the research field called machine unlearning.


Our work investigates the family of scalable unlearning algorithms (i.e., those efficient enough to be deployed on models trained on billions of tokens) and shows that they cannot guarantee forgetting: an unlearned model cannot interact with users as if it had never seen the sensitive data, as long as we do not know how the model acquired that data in the first place. This is because unlearning is path-dependent by nature: the order in which a model receives new information affects how it forgets. An unlearning algorithm that ignores this order is shooting in the dark.

Desiderata

Consider a model \( \theta \) trained on a dataset \( D = D_f \cup D_r \), partitioned into a forget set \( D_f \) and a retain set \( D_r \). The goal of an unlearning algorithm $\mathcal{U}$ is to remove the influence of the forget set from the model's predictions. The following desiderata drive research in unlearning.

Retrain Equivalence

Let $\theta_u$ be the model that results from applying an unlearning procedure $\mathcal{U}$ to the original model $\theta$, and let $\theta_r$ be the model retrained from scratch on the retain set $D_r$ alone, i.e., on all training data excluding the forget set. Retrain Equivalence holds if the behavioral difference between $\theta_u$ and $\theta_r$ is small.
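One way to make "small behavioral difference" concrete (the notation below is ours, not fixed by the text above) is to pick a divergence $d$ between model behaviors, for instance the expected KL divergence between next-token distributions over a prompt distribution $\mathcal{P}$, and require

\[
\theta_u = \mathcal{U}(\theta, D_f), \qquad \theta_r = \mathrm{Train}(D_r), \qquad
d(\theta_u, \theta_r) \le \varepsilon,
\quad \text{where, e.g.,}\quad
d(\theta_u, \theta_r) = \mathbb{E}_{x \sim \mathcal{P}}\, \mathrm{KL}\!\left( p_{\theta_r}(\cdot \mid x) \,\|\, p_{\theta_u}(\cdot \mid x) \right).
\]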

Local Unlearning

This work considers gradient-based unlearning algorithms that are fast: those that can be deployed even when the training set contains billions of tokens. Locality is one (and perhaps the only) way to guarantee fast unlearning, since the runtime of a local algorithm depends only on the size of the forget set.
An unlearning algorithm \( \operatorname{Unlearn}(\cdot, D_f) \) is local if it only requires gradient information computed on the forget set \( D_f \). Practically, we desire \( T_{\text{unlearn}} = o(T_{\text{retrain}}) \).
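As a concrete illustration, here is a minimal sketch of one local unlearning step via plain gradient ascent on the forget set, assuming a Hugging Face-style causal LM and a standard PyTorch optimizer; the function and variable names are ours, not the paper's code.

def local_unlearn_step(model, optimizer, forget_batch):
    """One gradient-ascent step computed on a forget-set batch only.

    "Local" in the sense above: the update touches no retain-set data,
    so its cost scales with |D_f| rather than with the full training set.
    """
    model.train()
    out = model(
        input_ids=forget_batch["input_ids"],
        attention_mask=forget_batch["attention_mask"],
        labels=forget_batch["labels"],
    )
    loss = -out.loss          # negate the LM loss: ascend on the forget-set likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return out.loss.item()    # the (positive) NLL on the forget batch, for logging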

Why is Retrain Equivalence Impossible?

Today's LLMs are trained in distinct stages, such as instruction tuning, alignment tuning, and RL for reasoning capabilities. In production, these training stages may further be interleaved. This is the source of the impossibility: as long as we do not know how these training stages are ordered, local unlearning algorithms are doomed to fail.

We argue impossibility by showing that unlearning is path-dependent: the relative order between the forget set and the other training stages affects what is unlearned and how fast unlearning occurs.

If we feed two models trained on the same datasets but in different orders to the same local unlearning algorithm, their behaviors diverge in a path-dependent way; therefore they cannot both be close to the retrained baseline.

Real-Life Example

Unlearning dynamics for two Qwen1B models finetuned on the same datasets but in different orders, undergoing an identical unlearning procedure with the NPO algorithm. The y-axis is the log likelihood of retained responses, i.e., responses that are not targeted for unlearning.
Recency effect vs. stage position p (animated)

Experiment: LLM Post-Training Pipeline

We empirically demonstrate the history-dependent nature of local unlearning algorithms through a four-stage finetuning pipeline that simulates today's LLM post-training workflow. The same base model is finetuned on identical datasets but in different orders, and the resulting models then undergo the same unlearning procedure. The unlearning outcomes diverge in a path-dependent manner: both the speed of forgetting and the spill-over onto other capabilities depend on the training history.

Training stages

  • Sinst — Instruction Tuning
    INSTRUCT-SKILLMIX • 4k pairs • 10 epochs
  • Stofu — Fictitious Knowledge
    TOFU • 4k Q-A • 4 epochs
  • Smath — Math Reasoning
    GSM8K rewrites • 8k items • 2 epochs
  • SU — Safety/Refusal (Unlearn set)
    Synthetic refusals (“Sorry, I cannot assist…”) • 4.5k • 2 epochs
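To make the position manipulation concrete, here is a small Python sketch of the schedules built from the stage list above. It assumes, as the p ∈ {1,2,3,4} labeling suggests, that the three retained stages keep a fixed relative order while the unlearn stage SU is slotted at position p; the names and structure are ours, not the paper's code.

# Hypothetical encoding of the four-stage schedules; specs mirror the stage list above.
STAGES = {
    "S_inst": {"data": "INSTRUCT-SKILLMIX",  "size": "4k pairs",  "epochs": 10},
    "S_tofu": {"data": "TOFU",               "size": "4k Q-A",    "epochs": 4},
    "S_math": {"data": "GSM8K rewrites",     "size": "8k items",  "epochs": 2},
    "S_U":    {"data": "synthetic refusals", "size": "4.5k",      "epochs": 2},
}

RETAINED_ORDER = ["S_inst", "S_tofu", "S_math"]

def schedule(p):
    """Place the unlearn stage S_U at position p (1-indexed) among the retained stages."""
    order = list(RETAINED_ORDER)
    order.insert(p - 1, "S_U")
    return order

for p in (1, 2, 3, 4):
    print(p, schedule(p))
# 1 ['S_U', 'S_inst', 'S_tofu', 'S_math']
# 2 ['S_inst', 'S_U', 'S_tofu', 'S_math']
# 3 ['S_inst', 'S_tofu', 'S_U', 'S_math']
# 4 ['S_inst', 'S_tofu', 'S_math', 'S_U']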

Illustration of Four-Stage Pipeline

Figure: four-stage training orders (p ∈ {1,2,3,4}), showing where the unlearn set appears in the sequence.

Takeaways

  • Different unlearning algorithms exhibit similar path-dependence
    We experimented with three local unlearning algorithms: gradient ascent, Negative Preference Optimization (NPO), and SimNPO. All three exhibit similar path-dependence, even in the presence of reference-based regularization terms (an NPO-style loss is sketched below the figure).
  • Recency Effect: unlearning is hardest when the information is fresh
    When no retained data is learned between the learning and the unlearning of the forget set, the log likelihood of the forget-set targets consistently decreases more slowly. This is often accompanied by a smaller spill-over effect on held-out capabilities. The mechanism behind this recency effect deserves further investigation, especially from the RL community.
  • Stage recency slows forgetting
    Later-stage information tends to persist longer, making recent stages harder to forget under local updates.
Recency effect: path dependence in unlearning
Each panel shows the unlearning process for four models finetuned from the same base LLM. Each curve corresponds to a model finetuned on the same four datasets, but with the unlearn set introduced at a different position $p \in \{1,2,3,4\}$ in the training sequence. The y-axis tracks the log likelihood of the responses being unlearned; a steeper decline indicates faster forgetting. Different values of $p$ lead to very different outcomes. The red curve ($p=4$) represents the case where unlearning immediately follows the learning of the forget set, and forgetting is consistently slower in this case.
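For reference, here is a hedged sketch of the NPO objective as it is commonly stated in the unlearning literature; the paper's exact implementation and hyperparameters may differ, and beta = 0.1 is only an illustrative value.

import torch.nn.functional as F

def npo_loss(logp_theta, logp_ref, beta=0.1):
    """NPO-style objective on forget-set sequences (sketch).

    logp_theta: tensor of summed token log-probs of each forget response
                under the current model.
    logp_ref:   the same quantity under a frozen reference model
                (the model before unlearning).
    Minimizing this decreases logp_theta, but the reference ratio and the
    sigmoid damp the update once a response is already unlikely,
    unlike plain gradient ascent.
    """
    log_ratio = logp_theta - logp_ref                      # log(pi_theta / pi_ref)
    return (-2.0 / beta) * F.logsigmoid(-beta * log_ratio).mean()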

Shallow vs. Deep Forgetting

Definitions

  • Shallow (superficial) forgetting: suppresses one phrasing of an undesired response, but paraphrases remain likely.
  • Deep forgetting: reduces probability across paraphrases/semantically equivalent responses, not just the target string.

Design

Case study on Llama-3.2-3B. We curate 40 unsafe prompts; each has two phrasings of a compliant response (C and U) and one refusal (R). We train under six learning permutations (θ₁…θ₆), then unlearn U via gradient ascent (GA) and track the log-probabilities of R, C, and U across epochs.
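A minimal sketch of the tracked quantity, the log-probability of a fixed response string given its prompt, assuming a Hugging Face-style model and tokenizer; the helper name and interface are ours.

import torch

@torch.no_grad()
def response_logprob(model, tokenizer, prompt, response):
    """Summed log-probability of `response` given `prompt` under `model` (sketch).

    Assumes tokenizing `prompt` yields a prefix of tokenizing `prompt + response`,
    which should be checked for the tokenizer and chat template in use.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(input_ids=full_ids).logits               # [1, T, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)    # predicts tokens 1..T-1
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_lp[:, n_prompt - 1:].sum().item()           # score response tokens only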

Outcome

Paths θ₁, θ₂, θ₅, θ₆ → shallow (only U drops). Paths θ₃, θ₄ → deep (both C and U drop below R). In short: the depth of forgetting is path-dependent.

Figure placeholder

Log-probability evolution during unlearning. Solid: U (target); dashed: C; accent: R.

Artifacts

Grab the paper and code, and see how to reproduce key plots.

  • Repro notes and ethics guidance in docs/ of the repo.
  • Configs in configs/; run script under src/experiments/.

Figure placeholder

Teaser plot: metric vs. unlearning steps for different paths.

Citation

BibTeX
@article{yu2025impossibility,
  title={On the Impossibility of Retrain Equivalence in Machine Unlearning},
  author={Yu, Jiatong and He, Yinghui and Goyal, Anirudh and Arora, Sanjeev},
  journal={Preprint},
  year={2025},
  note={Code and materials: REPO_URL}
}

Acknowledgments & contact

Questions or feedback? Open an issue in the repository or reach out via your preferred channel.