
AlignGuard-LoRA: A New Regularization Method That Combines Efficient Fine-Tuning And Safety Preservation
3 main points
✔️ LoRA fine-tuning is efficient, but has challenges that can compromise safety and ethical alignment
✔️ Proposed method AlignGuard-LoRA separates updates with Fisher regularization and geodesic distance-based collision avoidance
✔️ Experiments show reduced toxicity and bias while maintaining task performance, with alignment drift suppressed by up to 50%
AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
written by Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
(Submitted on 4 Aug 2025)
Comments: Published on arXiv.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Low-rank adaptation (LoRA), widely used for fine-tuning large language models, has the advantage of being efficient and requiring low computational resources.
At the same time, however, it has a noticeable problem: it can compromise "alignment," the property that preserves a model's safety and ethical constraints.
Specifically, fine-tuned models exhibit more toxic outputs, over-refusal, and worsened bias, all of which reduce the reliability of the model.
In this study, we propose a new framework called "AlignGuard-LoRA" to solve this problem.
AlignGuard-LoRA controls alignment-sensitive directions through regularization using the Fisher information matrix, thus achieving both task adaptation and safety preservation.
Furthermore, it geometrically separates alignment-related updates from task-related updates by stabilizing updates with task-specific regularization and introducing "collision avoidance regularization" based on Riemannian geometry and geodesic distance.
The proposed method suppresses alignment drift by up to 50% compared to conventional LoRA, demonstrating that it improves safety and task performance simultaneously.
Proposed Methodology
AlignGuard-LoRA has a structure that decomposes low-rank updates by LoRA into "alignment-related components" and "task-specific components," and applies different regularizations to each.
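As a rough sketch of this decomposition (in PyTorch): the projector `P_align` onto alignment-sensitive directions is a placeholder here; in the paper, these directions are derived from the Fisher information matrix.

```python
import torch

def decompose_update(delta_w: torch.Tensor, P_align: torch.Tensor):
    """Split a LoRA update into an alignment-critical part and a task part.

    P_align is an orthogonal projector onto alignment-sensitive directions
    (hypothetical here; the paper obtains them from Fisher information).
    """
    delta_align = P_align @ delta_w       # component in the sensitive subspace
    delta_task = delta_w - delta_align    # residual, task-specific component
    return delta_align, delta_task

# Toy example: treat the first row direction as alignment-sensitive.
P = torch.zeros(3, 3)
P[0, 0] = 1.0
dW = torch.randn(3, 4)
dA, dT = decompose_update(dW, P)
```

The two components sum back to the original update, so the decomposition loses nothing; it only lets each part receive its own regularizer.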
First, a penalty based on the Fisher information matrix is added to suppress excessive updates in alignment-sensitive directions.
This facilitates the preservation of safe behaviors such as rejection accuracy and toxicity control.
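A minimal sketch of such a penalty, assuming a diagonal Fisher approximation (the function name and weighting are illustrative, not the paper's exact formulation):

```python
import torch

def fisher_penalty(delta_align: torch.Tensor,
                   fisher_diag: torch.Tensor,
                   lam: float = 1.0) -> torch.Tensor:
    """Diagonal-Fisher quadratic penalty: lam * sum_i F_ii * (dW_i)^2.

    Directions with high Fisher information (alignment-sensitive ones)
    are penalized more strongly, keeping updates there small.
    """
    return lam * (fisher_diag * delta_align.pow(2)).sum()

dW = torch.tensor([1.0, 2.0, 3.0])
F_diag = torch.tensor([1.0, 0.0, 0.0])  # only the first direction is sensitive
penalty = fisher_penalty(dW, F_diag)    # only dW[0] contributes: 1.0
```

Because the Fisher weights are zero for insensitive directions, the task-relevant parts of the update are left free to move.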
Next, "trust-domain regularization" is introduced for the task-specific component to stabilize learning in the low-entropy domain.
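One simple way to realize a trust-region-style constraint is a hinge on the update norm; this is only the idea in miniature, with the radius `tau` as an assumed hyperparameter, not the paper's exact regularizer:

```python
import torch

def trust_region_penalty(delta_task: torch.Tensor,
                         tau: float = 1.0,
                         lam: float = 1.0) -> torch.Tensor:
    """Penalize task updates whose Frobenius norm exceeds a trust radius tau.

    Inside the radius the penalty is zero; outside it grows quadratically,
    discouraging large, destabilizing steps.
    """
    excess = torch.clamp(delta_task.norm() - tau, min=0.0)
    return lam * excess.pow(2)

small_step = trust_region_penalty(torch.zeros(2, 2))  # inside radius: 0
big_step = trust_region_penalty(torch.ones(2, 2))     # norm 2, excess 1
```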
The most important is "collision avoidance regularization."
This combines coordinate-wise interference suppression based on a Riemannian metric with geometric separation of update directions via geodesic distance, preventing the alignment and task updates from interfering with each other.
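The idea can be sketched with two illustrative stand-in terms (not the paper's exact Riemannian and geodesic formulas): a coordinate-wise overlap term, and an angular term that is large when the two update directions point the same way.

```python
import torch
import torch.nn.functional as F

def collision_penalty(d_align: torch.Tensor,
                      d_task: torch.Tensor,
                      mu: float = 1.0,
                      nu: float = 1.0,
                      eps: float = 1e-6) -> torch.Tensor:
    """Penalize overlap between alignment and task updates.

    riem: coordinate-wise interference (magnitude of elementwise products).
    geo:  angular overlap; 1 for parallel directions, 0.5 for orthogonal
          ones, approaching 0 for opposite directions.
    Both are simplified stand-ins for the paper's regularizers.
    """
    riem = (d_align * d_task).abs().sum()
    cos = F.cosine_similarity(d_align.flatten(), d_task.flatten(), dim=0)
    angle = torch.arccos(cos.clamp(-1 + eps, 1 - eps))
    geo = 1.0 - angle / torch.pi
    return mu * riem + nu * geo

dA = torch.tensor([1.0, 0.0])
dT = torch.tensor([0.0, 1.0])   # orthogonal: no coordinate overlap
p = collision_penalty(dA, dT)   # riem = 0, geo = 0.5
```

Minimizing this penalty pushes the two updates toward disjoint coordinates and well-separated directions, which is exactly the "collision avoidance" behavior described above.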
These three regularizations work in a complementary manner, aiming at both task adaptation and safety preservation.
They mitigate the trade-off seen in conventional LoRA, where safety is reduced in exchange for improved task accuracy, and allow fine-tuning that does not disrupt alignment while maintaining efficient learning at low ranks.
Experiments
Experiments compared standard LoRA, the proposed AlignGuard-LoRA, and full fine-tuning of all parameters using the LLaMA 3 (7B) model.
Evaluation used a multifaceted set of criteria: general benchmarks such as GLUE and SuperGLUE; safety and robustness benchmarks such as HELM and AdvGLUE; toxicity (RealToxicityPrompts); refusal behavior (OR-Bench); and bias (CrowS-Pairs, BBQ).
As a result, AlignGuard-LoRA significantly reduced toxicity and bias and retained rejection accuracy compared to standard LoRA.
In particular, the full version, with the addition of collision avoidance regularization, showed performance comparable to or better than full fine-tuning, while also maintaining its superiority in safety metrics.
In addition, sequential ablation experiments confirmed that the Fisher-based, task-specific, and collision avoidance regularizations are each effective on their own, and that they exhibit synergistic effects when combined.
Furthermore, on a new benchmark called DRIFTCHECK, AlignGuard-LoRA reduced safety degradation by 50%, confirming its effectiveness as a fine-tuning method in safety-critical areas.