
FedNano: Lightweight And Efficient Distributed Learning Of Large-scale Multimodal Models
3 main points
✔️ Proposed FedNano, a lightweight federated learning method for large-scale multimodal models
✔️ Trains only NanoAdapters on the client side, significantly reducing communication and computational costs
✔️ Fisher Merging enables highly accurate aggregation even for non-uniform data distribution
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models
written by Yao Zhang, Hewei Gao, Haokun Chen, Weiguo Li, Yunpu Ma, Volker Tresp
(Submitted on 12 Jun 2025)
Comments: 12 pages, 3 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
In recent years, multimodal large language models (MLLMs), which handle multiple modalities such as images and language, have attracted a great deal of attention. While they perform well on advanced tasks such as cross-modal retrieval and visual question answering, their large parameter counts make them difficult to deploy on client devices and to operate in real-world scenarios that require privacy protection. Federated Learning (FL) is a promising approach for training models without centralizing distributed data, but applying it to MLLMs faces many barriers, including limited computational resources, communication load, and non-IID data.
In this paper, a new FL framework, FedNano, is proposed to overcome these challenges. FedNano keeps the computationally intensive large language model (LLM) frozen on the server and performs adaptation on the client side with a lightweight module called NanoEdge. This design reduces client-side storage by more than 95% and cuts the parameters transmitted during communication to under 0.01% of the full model. Furthermore, a Fisher Merging technique maintains high generalization performance even when client data distributions are non-uniform.
Proposed Methodology
The core of FedNano is a "server-centric LLM + lightweight client adaptation" architecture. NanoEdge consists of modality-specific encoders, connectors, and a trainable component called the NanoAdapter. Because the NanoAdapter is designed as a low-rank decomposition based on LoRA (Low-Rank Adaptation), it allows flexible task-specific adaptation while substantially reducing computation and communication.
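Since the NanoAdapter is described as a LoRA-based low-rank design, its essence can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the class name, rank, scaling, and dimensions are all assumptions.

```python
import numpy as np

# Hedged sketch of a LoRA-style low-rank adapter, the design the
# NanoAdapter is described as following. Names and hyperparameters
# (rank, alpha, dimensions) are illustrative assumptions.
class LoRAAdapter:
    def __init__(self, d_in, d_out, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        # A is small random; B starts at zero so the adapter is a
        # no-op at initialization (standard LoRA convention).
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def delta(self):
        # Low-rank weight update: only A and B are trained and
        # communicated, never the frozen base weight.
        return self.scale * (self.B @ self.A)

    def forward(self, W_frozen, x):
        # Frozen base weight plus the low-rank correction.
        return (W_frozen + self.delta()) @ x

d_in, d_out = 768, 768
adapter = LoRAAdapter(d_in, d_out, rank=8)
W = np.eye(d_out, d_in)            # stand-in for a frozen LLM weight
y = adapter.forward(W, np.ones(d_in))

# Trainable vs. frozen parameter counts for this single layer:
trainable = adapter.A.size + adapter.B.size
print(trainable, W.size)  # 12288 vs 589824 (about 2% of one layer)
```

The parameter count comparison shows why only communicating A and B keeps the per-round payload tiny relative to the frozen model.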
In addition, when aggregating the NanoAdapter updates collected from clients, FedNano applies Fisher Merging based on the Fisher Information Matrix (FIM). This mechanism estimates the importance of each client's update and weights it accordingly, effectively integrating information from clients with statistically different data distributions. In this way, FedNano provides scalable, privacy-preserving federated learning of MLLMs in both its model structure and its communication design.
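The importance-weighted aggregation can be illustrated with a diagonal-Fisher merge, a common simplification in Fisher-merging work. This is a hedged sketch with synthetic numbers; in practice each client's Fisher values would be estimated from squared gradients on its own data.

```python
import numpy as np

# Sketch of diagonal Fisher merging for client adapter updates.
# Parameters and Fisher values are synthetic, for illustration only.
def fisher_merge(params, fishers, eps=1e-8):
    """Weight each client's parameters element-wise by its diagonal
    Fisher information, then normalize across clients."""
    params = np.stack(params)    # shape: (num_clients, num_params)
    fishers = np.stack(fishers)  # same shape, non-negative values
    weighted = (fishers * params).sum(axis=0)
    return weighted / (fishers.sum(axis=0) + eps)

# Toy example: two clients disagree on each parameter; the client
# whose data made a parameter more "important" (higher Fisher)
# dominates that coordinate of the merged result.
theta = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
fish  = [np.array([4.0, 1.0]), np.array([1.0, 4.0])]
print(fisher_merge(theta, fish))  # -> [0.8 0.8]
```

Compared with plain FedAvg (which would give [0.5 0.5] here), the Fisher weighting preserves each client's confident updates instead of averaging them away, which is the intuition behind its robustness to non-IID data.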
Experiments
To validate the effectiveness of FedNano, the authors conducted experiments on two representative visual question answering (VQA) tasks, ScienceQA and IconQA. Advanced MLLMs such as MiniGPT-4 and LLaVA-1.5 were used for evaluation, and the data was partitioned among 5 to 10 clients using Dirichlet distributions to simulate non-uniform data environments.
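Dirichlet-based partitioning is the standard way such non-IID splits are constructed; the sketch below shows the general recipe. The concentration parameter alpha, dataset size, and label counts are assumptions, not the paper's exact protocol.

```python
import numpy as np

# Illustrative sketch of Dirichlet-based non-IID data partitioning.
# Smaller alpha -> more skewed per-client class proportions, i.e.
# stronger heterogeneity across clients.
def dirichlet_partition(labels, num_clients=5, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # Sample this class's allocation across clients from
        # Dirichlet(alpha, ..., alpha), then split accordingly.
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients

labels = np.random.default_rng(1).integers(0, 10, size=1000)
parts = dirichlet_partition(labels, num_clients=5, alpha=0.5)
print([len(p) for p in parts])  # uneven client sizes under alpha=0.5
```

Every sample is assigned to exactly one client, but with small alpha each client sees a very different label mixture, reproducing the heterogeneity the experiments stress-test.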
For comparison, FedNano was tested against traditional FL methods such as FedAvg, FedProx, and FedDPA-F, as well as against centralized training (an upper performance bound) and local fine-tuning (a lower bound). The results showed that FedNano achieved the highest average accuracy in all settings, with especially strong robustness under highly heterogeneous data. FedNano-EF, a variant that simplifies the FIM computation, was also evaluated and showed a significant reduction in computational cost in exchange for a slight drop in accuracy. Furthermore, FedNano's scalability and generalization were confirmed in settings with larger numbers of clients and greater heterogeneity across tasks.