MATE: Multi-agent Accessibility-specific Modality Transformation Framework

12/08/2025

3 main points
✔️ Proposed "MATE," an open-source multi-agent system specialized for modality conversion to assist people with disabilities
✔️ Built ModConTT, a dataset for modality conversion task classification, and ModCon-Task-Identifier, a BERT model fine-tuned on it
✔️ The proposed model achieved higher accuracy than existing LLM and ML methods and has potential for application in a wide range of fields

MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications
written by Aleksandr Algazinov, Matt Laing, Paul Laban
(Submitted on 24 Jun 2025 (v1), last revised 15 Jul 2025 (this version, v2))
Comments: Published on arxiv.
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This study proposes MATE (Multi-Agent Translation Environment), an open-source support framework built on a multi-agent system (MAS), to address the lack of accessibility that users with disabilities face in digital environments.

MATE performs translation between different modalities (text, speech, images, video, etc.) in response to user requests, making information easily accessible to people with visual or auditory limitations.
A key feature is the "ModCon-Task-Identifier" model, which analyzes user input and automatically determines the most appropriate conversion task. Supported tasks include text-to-speech (TTS), speech-to-text (STT), image-to-text captioning (ITT), and image-to-audio description (ITA).

In addition, the authors constructed "ModConTT," a dedicated dataset for modality conversion task classification, and used it to compare the proposed model with existing LLMs and machine learning models.
The proposed model achieved high accuracy at low cost and shows potential for application in a wide range of domains such as medicine, education, and transportation.

Proposed Methodology

MATE consists of an "interpreter agent" that interprets user requests and seven types of "specialized agents" that perform specific conversion tasks.

The Interpreter Agent identifies the task type from the input sentence and assigns processing to the corresponding specialized agent.
Each specialized agent leverages existing high-performance models (e.g., Whisper, Stable Diffusion, Tacotron 2, BLIP) to perform one of seven conversions: TTS, STT, TTI (text to image), ITT (image to text), ITA (image to audio), ATI (audio to image), and VTT (video to text).
Task determination is handled by ModCon-Task-Identifier, a BERT model fine-tuned on the authors' ModConTT dataset, which achieves higher accuracy than generic LLMs and classical machine learning models.
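
The interpreter-to-agent routing described above can be sketched in Python as a lookup from task label to agent. This is only an illustration under stated assumptions: `keyword_classifier` is a hypothetical stand-in for ModCon-Task-Identifier (it matches phrases rather than running BERT), and the stub agents return strings instead of calling models such as Whisper or BLIP. None of the function names come from the paper.

```python
from typing import Callable, Dict

# The seven conversion tasks named in the article.
TASKS = ("TTS", "STT", "TTI", "ITT", "ITA", "ATI", "VTT")


def make_stub_agent(task: str) -> Callable[[str], str]:
    """Return a placeholder agent; a real one would wrap a conversion model."""
    def agent(payload: str) -> str:
        return f"[{task}] would convert: {payload}"
    return agent


# One specialized agent per task label.
AGENTS: Dict[str, Callable[[str], str]] = {t: make_stub_agent(t) for t in TASKS}


def keyword_classifier(request: str) -> str:
    """Hypothetical stand-in for the task classifier (phrase lookup, not BERT)."""
    text = request.lower()
    rules = [
        ("read this text aloud", "TTS"),
        ("transcribe", "STT"),
        ("describe this image", "ITT"),
    ]
    for phrase, task in rules:
        if phrase in text:
            return task
    return "UNK"  # request does not match any known task


def route(request: str, payload: str,
          classify: Callable[[str], str] = keyword_classifier) -> str:
    """Classify the request, then hand the payload to the matching agent."""
    task = classify(request)
    agent = AGENTS.get(task)
    if agent is None:
        return "Sorry, this request does not match a supported conversion."
    return agent(payload)
```

Calling `route("Please transcribe this recording", "clip.wav")` sends the payload to the STT stub, while a request that matches no rule falls through to the UNK branch. Passing the classifier as a parameter keeps the dispatch logic independent of whichever model does the classification.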

The system is designed for local execution, offering privacy protection and flexible customization, making it suitable for real-time assistance in the medical and educational fields.

Experiments

In the experiments, the authors first compared several LLMs (GPT-3.5-Turbo, Llama-3.1-70B, and GLM-4-Flash) as interpreters using the ModConTT dataset.
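
As a rough illustration of how a general-purpose LLM can be used as an interpreter, the sketch below builds a zero-shot classification prompt and maps a free-text reply back to a task label. The prompt wording, the function names, and the fallback to UNK are all assumptions made for this example; the article does not describe the prompts the authors used, and no actual LLM API is called here.

```python
import re

# Task labels from the article, plus UNK for unrecognized requests.
LABELS = ("TTS", "STT", "TTI", "ITT", "ITA", "ATI", "VTT", "UNK")


def build_prompt(request: str) -> str:
    """Assemble a hypothetical zero-shot classification prompt."""
    options = ", ".join(LABELS)
    return (
        "Classify the user request into exactly one of these task labels: "
        f"{options}.\nAnswer with the label only.\n\n"
        f"Request: {request}\nLabel:"
    )


def parse_label(reply: str) -> str:
    """Return the first known label found in the reply, or UNK if none."""
    for token in re.findall(r"[A-Z]+", reply.upper()):
        if token in LABELS:
            return token
    return "UNK"
```

The string returned by `build_prompt` would be sent to whichever LLM is being evaluated, and `parse_label` tolerates replies such as "The label is STT." that do not follow the "label only" instruction exactly.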

In task classification on 230 samples, GPT-3.5-Turbo performed well with an accuracy of 0.865, but the fine-tuned BERT model, ModCon-Task-Identifier, scored highest with an accuracy of 0.917 and an F1 score of 0.916.
The proposed model also outperformed classical models such as logistic regression and random forests trained on TF-IDF features and BERT embeddings.
The misclassification analysis showed the highest failure rate in the UNK (unknown task) category, followed by STT and ATV.
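
For readers who want to reproduce this kind of evaluation on their own predictions, the following minimal sketch computes accuracy, F1, and a per-class failure rate in plain Python. It assumes macro-averaged F1, since the article does not say which averaging the authors used, and the function names are illustrative only.

```python
from collections import Counter
from typing import Dict, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def failure_rate_per_class(y_true: Sequence[str],
                           y_pred: Sequence[str]) -> Dict[str, float]:
    """For each true class, the share of its samples that were misclassified."""
    totals = Counter(y_true)
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {label: errors[label] / totals[label] for label in totals}
```

Sorting the output of `failure_rate_per_class` by value gives the kind of ranking reported in the misclassification analysis, with the hardest categories first.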

These results demonstrate the effectiveness of combining a MAS with a specialized classification model for complex modality conversion tasks, and support its utility as an assistive tool in medicine and education.

