
AlignGuard-LoRA: A New Regularization Method That Combines Efficient Fine-Tuning And Safety Preservation
3 main points
✔️ LoRA fine-tuning is efficient, but has challenges that can compromise safety and ethical alignment
✔️ Proposed method AlignGuard-LoRA separates updates with Fisher regularization and geodesic distance-based collision avoidance
✔️ Experiments show reduced toxicity and bias while maintaining task performance, with alignment drift suppressed by up to 50%
AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
written by Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
(Submitted on 4 Aug 2025)
Comments: Published on arXiv.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Low-rank adaptation (LoRA), widely used for fine-tuning large language models, has the advantage of being efficient and requiring low computational resources.
At the same time, however, it has a noticeable problem: it can compromise "alignment," the property that preserves a model's safety and ethical constraints.
Specifically, fine-tuned models exhibit more toxic outputs, over-refusal, and worsened bias, all of which reduce the reliability of the model.
In this study, we propose a new framework called "AlignGuard-LoRA" to solve this problem.
AlignGuard-LoRA controls alignment-sensitive directions through regularization using the Fisher information matrix, thus achieving both task adaptation and safety preservation.
Furthermore, it geometrically separates alignment-related updates from task-related updates by stabilizing updates with task-specific regularization and introducing "collision avoidance regularization" based on Riemannian geometry and geodesic distance.
The proposed method suppresses alignment drift by up to 50% compared to conventional LoRA, demonstrating that it improves safety and task performance simultaneously.
Proposed Methodology
AlignGuard-LoRA has a structure that decomposes low-rank updates by LoRA into "alignment-related components" and "task-specific components," and applies different regularizations to each.
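As a rough sketch of this decomposition (in PyTorch): the projector `P_align` onto alignment-sensitive directions is a placeholder here; in the paper, these directions are derived from the Fisher information matrix.

```python
import torch

def decompose_update(delta_w: torch.Tensor, P_align: torch.Tensor):
    """Split a LoRA update into an alignment-critical part and a task part.

    P_align is an orthogonal projector onto alignment-sensitive directions
    (hypothetical here; the paper obtains them from Fisher information).
    """
    delta_align = P_align @ delta_w       # component in the sensitive subspace
    delta_task = delta_w - delta_align    # residual, task-specific component
    return delta_align, delta_task

# Toy example: treat the first row direction as alignment-sensitive.
P = torch.zeros(3, 3)
P[0, 0] = 1.0
dW = torch.randn(3, 4)
dA, dT = decompose_update(dW, P)
```

The two components sum back to the original update, so the decomposition loses nothing; it only lets each part receive its own regularizer.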
First, a penalty based on the Fisher information matrix is added to suppress excessive updates in alignment-sensitive directions.
This facilitates the preservation of safe behaviors such as rejection accuracy and toxicity control.
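A minimal sketch of such a penalty, assuming a diagonal Fisher approximation (the function name and weighting are illustrative, not the paper's exact formulation):

```python
import torch

def fisher_penalty(delta_align: torch.Tensor,
                   fisher_diag: torch.Tensor,
                   lam: float = 1.0) -> torch.Tensor:
    """Diagonal-Fisher quadratic penalty: lam * sum_i F_ii * (dW_i)^2.

    Directions with high Fisher information (alignment-sensitive ones)
    are penalized more strongly, keeping updates there small.
    """
    return lam * (fisher_diag * delta_align.pow(2)).sum()

dW = torch.tensor([1.0, 2.0, 3.0])
F_diag = torch.tensor([1.0, 0.0, 0.0])  # only the first direction is sensitive
penalty = fisher_penalty(dW, F_diag)    # only dW[0] contributes: 1.0
```

Because the Fisher weights are zero for insensitive directions, the task-relevant parts of the update are left free to move.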
Next, "trust-domain regularization" is introduced for the task-specific component to stabilize learning in the low-entropy domain.
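One simple way to realize a trust-region-style constraint is a hinge on the update norm; this is only the idea in miniature, with the radius `tau` as an assumed hyperparameter, not the paper's exact regularizer:

```python
import torch

def trust_region_penalty(delta_task: torch.Tensor,
                         tau: float = 1.0,
                         lam: float = 1.0) -> torch.Tensor:
    """Penalize task updates whose Frobenius norm exceeds a trust radius tau.

    Inside the radius the penalty is zero; outside it grows quadratically,
    discouraging large, destabilizing steps.
    """
    excess = torch.clamp(delta_task.norm() - tau, min=0.0)
    return lam * excess.pow(2)

small_step = trust_region_penalty(torch.zeros(2, 2))  # inside radius: 0
big_step = trust_region_penalty(torch.ones(2, 2))     # norm 2, excess 1
```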
The most important is "collision avoidance regularization."
This combines coordinate-wise interference suppression based on a Riemannian metric with geometric separation of update directions via geodesic distance, preventing the alignment and task updates from interfering with each other.
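The idea can be sketched with two illustrative stand-in terms (not the paper's exact Riemannian and geodesic formulas): a coordinate-wise overlap term, and an angular term that is large when the two update directions point the same way.

```python
import torch
import torch.nn.functional as F

def collision_penalty(d_align: torch.Tensor,
                      d_task: torch.Tensor,
                      mu: float = 1.0,
                      nu: float = 1.0,
                      eps: float = 1e-6) -> torch.Tensor:
    """Penalize overlap between alignment and task updates.

    riem: coordinate-wise interference (magnitude of elementwise products).
    geo:  angular overlap; 1 for parallel directions, 0.5 for orthogonal
          ones, approaching 0 for opposite directions.
    Both are simplified stand-ins for the paper's regularizers.
    """
    riem = (d_align * d_task).abs().sum()
    cos = F.cosine_similarity(d_align.flatten(), d_task.flatten(), dim=0)
    angle = torch.arccos(cos.clamp(-1 + eps, 1 - eps))
    geo = 1.0 - angle / torch.pi
    return mu * riem + nu * geo

dA = torch.tensor([1.0, 0.0])
dT = torch.tensor([0.0, 1.0])   # orthogonal: no coordinate overlap
p = collision_penalty(dA, dT)   # riem = 0, geo = 0.5
```

Minimizing this penalty pushes the two updates toward disjoint coordinates and well-separated directions, which is exactly the "collision avoidance" behavior described above.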
These three regularizations work in a complementary manner, aiming at both task adaptation and safety preservation.
They mitigate the trade-off seen in conventional LoRA, where safety is reduced in exchange for improved task accuracy, and allow fine-tuning that does not disrupt alignment while maintaining efficient learning at low ranks.
Experiments
Experiments compared standard LoRA, the proposed AlignGuard-LoRA, and full fine-tuning of all parameters using the LLaMA 3 (7B) model.
Evaluation used a multifaceted set of criteria: general benchmarks such as GLUE and SuperGLUE; safety and robustness benchmarks such as HELM and AdvGLUE; toxicity (RealToxicityPrompts); refusal behavior (OR-Bench); and bias (CrowS-Pairs, BBQ).
As a result, AlignGuard-LoRA significantly reduced toxicity and bias and retained rejection accuracy compared to standard LoRA.
In particular, the full version, with the addition of collision avoidance regularization, showed performance comparable to or better than full fine-tuning, while also maintaining its superiority in safety metrics.
In addition, sequential ablation experiments confirmed that the Fisher-based, task-specific, and collision avoidance regularizations are each effective on their own, and that they exhibit synergistic effects when combined.
Furthermore, on a new benchmark called DRIFTCHECK, AlignGuard-LoRA reduced safety degradation by 50%, confirming its effectiveness as a fine-tuning method in safety-critical areas.