TRACEALIGN: Tracing Causes Of Alignment Drift In Large Language Models And Defensive Measures

3 main points
✔️ TRACEALIGN is a framework for tracking and explaining LLM alignment drift as belief conflict derived from training data
✔️ The Belief Conflict Index (BCI) quantifies which training spans a dangerous generation is grounded in
✔️ Combining TRACESHIELD, CBD Loss, and Prov-Decode reduces drift by up to 85%

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs
written by Amitava Das, Vinija Jain, Aman Chadha
(Submitted on 4 Aug 2025)
Comments: Published on arxiv.
Subjects: Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Overview

LLMs are fine-tuned to align with human values and policies.
In practice, however, adversarial prompts, paraphrasing, or slight changes to the decoding process frequently cause "alignment drift," in which the model produces unsafe output.

Previous research has relied mainly on extrinsic measures such as refusal rates and the harmfulness of outputs, and has lacked a framework for identifying why models drift in the first place.

In this paper, we propose a comprehensive framework called "TRACEALIGN" to address this issue.
TRACEALIGN explicitly traces which memorized spans of the training data a hazardous generation can be attributed to, and quantifies those sources with a measure called the Belief Conflict Index (BCI).

In addition, the system combines three defenses: "TRACESHIELD," a rejection mechanism at inference time; "Contrastive Belief Deconfliction (CBD) Loss," a penalty applied during training; and "Prov-Decode," a search control applied during decoding, achieving up to 85% drift reduction.
In other words, the significance of this study is that it sheds light on inconsistencies in the "beliefs" held by the model and presents interpretable, reproducible countermeasures grounded in their causes, rather than merely observing the output.

Proposed Methodology

The core of TRACEALIGN is to "trace the training beliefs behind the model outputs."

First, a suffix array-based index called "TRACEINDEX" is used to match substrings (spans) in the generated text with the training corpus.
This makes it possible to identify unambiguously which document fragments the model has memorized and is reusing.
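As a rough illustration of this indexing step, a suffix array over the training text allows binary-search lookup of any generated span. This is a toy sketch: the function names are illustrative, and the paper's TRACEINDEX operates over a tokenized corpus at scale, not a single string.

```python
# Toy TRACEINDEX-style lookup: build a suffix array over a small "training
# corpus" string and check whether spans of generated text occur verbatim.
def build_suffix_array(text: str) -> list[int]:
    # Sort all suffix start positions lexicographically.
    return sorted(range(len(text)), key=lambda i: text[i:])

def span_in_corpus(text: str, sa: list[int], span: str) -> bool:
    # Binary search for the first suffix whose prefix is >= span,
    # then check whether that suffix actually starts with span.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(span)] < span:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(span)] == span

corpus = "the quick brown fox jumps over the lazy dog"
sa = build_suffix_array(corpus)
print(span_in_corpus(corpus, sa, "brown fox"))  # True: span is memorized
print(span_in_corpus(corpus, sa, "brown cat"))  # False: not in the corpus
```

Suffix arrays make each lookup O(|span| · log |corpus|), which is what makes exact span attribution tractable over large training corpora.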

Next, a "Belief Conflict Index" (BCI) is introduced to quantify how rare each matched span is and how far it deviates from the training distribution.
This makes it possible to measure "reactivation of dangerous memories" rather than mere generation. Three interventions are then proposed.
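The paper defines BCI precisely; as a hedged toy version, one could score a matched span by the negative log relative frequency of its n-gram in the training corpus, so that rarer (more surprising) memorized spans score higher. The function below is an illustrative proxy, not the paper's exact formula.

```python
import math
from collections import Counter

def toy_bci(span_tokens: tuple, corpus_tokens: list) -> float:
    # Toy Belief Conflict Index: negative log relative frequency of the
    # span's n-gram in the corpus. Rare memorized spans score high.
    n = len(span_tokens)
    ngrams = Counter(tuple(corpus_tokens[i:i + n])
                     for i in range(len(corpus_tokens) - n + 1))
    total = max(sum(ngrams.values()), 1)
    count = ngrams.get(span_tokens, 0)
    # Add-one smoothing so unseen spans get a finite, maximal score.
    return -math.log((count + 1) / (total + 1))

toks = "the cat sat on the mat the cat ran".split()
# A frequent bigram scores lower than a rare one.
print(toy_bci(("the", "cat"), toks) < toy_bci(("cat", "ran"), toks))  # True
```

The key property this preserves is ordering: the more atypical a reused span is relative to the training distribution, the higher its conflict score, which is what the downstream filters threshold on.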

First, TRACESHIELD is an inference-time filter that immediately rejects responses containing high-BCI spans.
Second, CBD Loss adds a penalty term to DPO training, suppressing generations that draw on dangerous memories.
Third, Prov-Decode vetoes high-risk candidates during decoding, steering generation toward safe text.
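A minimal sketch of the first defense, assuming a span scorer and threshold as stand-ins (the paper's TRACESHIELD, scorer, and refusal behavior are more elaborate):

```python
def traceshield(response: str, score_span, threshold: float = 5.0) -> str:
    # Inference-time filter in the spirit of TRACESHIELD: scan fixed-length
    # spans of the response and refuse if any span's belief-conflict score
    # exceeds the threshold. `score_span` and 5.0 are illustrative stand-ins.
    tokens = response.split()
    for n in (3, 4, 5):  # span lengths to check
        for i in range(len(tokens) - n + 1):
            if score_span(tuple(tokens[i:i + n])) > threshold:
                return "I can't help with that request."
    return response

# Usage with a dummy scorer that flags spans containing a marker token.
scorer = lambda span: 10.0 if "forbidden" in span else 0.0
print(traceshield("this text has forbidden content here", scorer))  # refusal
print(traceshield("this text has ordinary content here", scorer))   # passes through
```

Because the check runs on the finished response, this defense needs no model retraining, at the cost of occasionally rejecting an otherwise complete generation.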

Combined, these transform alignment from "posterior modification" to "belief-derived pre-prevention."
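The decoding-stage control can be sketched the same way: instead of rejecting a finished response, high-risk beam candidates are pruned as they are expanded. The data layout, scorer, and threshold below are illustrative assumptions, not the paper's exact procedure.

```python
def prov_decode_filter(candidates, score_span, threshold: float = 5.0):
    # Decoding-time veto in the spirit of Prov-Decode: among beam
    # candidates (log_prob, token_list), discard any whose most recent
    # span exceeds the belief-conflict threshold, then keep the most
    # probable surviving candidate (None if all are vetoed).
    safe = [(lp, toks) for lp, toks in candidates
            if score_span(tuple(toks[-3:])) <= threshold]
    return max(safe, key=lambda c: c[0]) if safe else None

score = lambda span: 10.0 if "forbidden" in span else 0.0
beams = [(-0.5, ["a", "forbidden", "span"]),   # most probable but unsafe
         (-1.2, ["a", "harmless", "span"])]
print(prov_decode_filter(beams, score))  # (-1.2, ['a', 'harmless', 'span'])
```

Vetoing during search rather than after generation means the decoder backs off to the best safe continuation instead of emitting a blanket refusal, which is consistent with the paper's reported gains in refusal naturalness.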

Experiments

To test the effectiveness of the proposed method, the paper constructs a novel assessment benchmark called the Alignment Drift Benchmark (ADB).
The ADB consists of a total of 5,200 adversarial prompts across five domains: explosives, cybercrime, self-harm, hate speech, and financial fraud. It is designed to elicit risky responses under the superficial guise of an educational or historical context.

Comparative experiments were conducted in this environment using several models, including LLaMA-2, OLMo-2, and NeoX.
The results showed that while hazardous output was observed in over 40% of the prompts at baseline, the combination of the three TRACEALIGN methods reduced the drift rate to 6.2%.

At the same time, scores measuring the naturalness and consistency of refusals also improved, confirming that safety can be greatly enhanced while preserving the model's usefulness.

Furthermore, through ablation experiments in which each of the defenses was applied individually or in combination, we have shown that the three-way combination is the most effective.
This demonstrates that TRACEALIGN is an approach that combines both a theoretical framework and practical effectiveness.

