Potential Of The Conversation Optimization Tokenizer: A Method To Improve LLM Inference Efficiency By 10%

3 main points
✔️ Current tokenizers are not optimized for conversational text and are less efficient
✔️ Retraining the tokenizer on conversational data reduces the number of tokens by up to 10% or more
✔️ Conversational optimization improves efficiency during inference, but has minimal impact on training performance

Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
Written by Raquel Ferrando, Javier Conde, Gonzalo Martínez, Pedro Reviriego
(Submitted on 23 Jun 2025)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Overview

The computational cost and energy consumption of an LLM grow with the number of tokens it processes, so designing an efficient tokenizer is an important way to keep token counts down. Many current tokenizers are optimized for static, structured corpora such as books and web text. However, chatbots, the primary real-world application of LLMs, deal mainly with conversational text, whose style and structure differ from those corpora.

Focusing on this gap, this study designs "conversation-optimized tokenizers" and examines their effectiveness. Specifically, the tokenizers of several LLMs are retrained on LMSYS-Chat-1M, a corpus of real-world chat data.

The results show token reductions of up to 10% or more, indicating the potential for improved energy efficiency. At the same time, the impact on tokenization of conventional training corpora is limited, so the negative impact on model performance is expected to be minimal.

Proposed Methodology

This study examines whether optimizing existing tokenizers for conversational data can reduce the number of tokens and energy cost during inference.

As a first step, the LMSYS-Chat-1M corpus is split into 80% for training and 20% for evaluation. Three variants of each tokenizer are then retrained: on user requests only, on model responses only, and on both. Retraining uses the same algorithm and settings as each model's original tokenizer to ensure a fair comparison.
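As a rough illustration, the sketch below retrains an existing Hugging Face tokenizer on conversational text with `train_new_from_iterator`, keeping the original vocabulary size. The model name, the tiny in-line corpus, and the batch size are placeholders for illustration, not the paper's actual setup.

```python
from transformers import AutoTokenizer

# Tiny placeholder corpus of conversational text; in practice this would be the
# 80% training split of LMSYS-Chat-1M (user requests, model responses, or both).
conversation_texts = [
    "How do I sort a list of dictionaries by a key in Python?",
    "You can use sorted() with a key argument, e.g. sorted(rows, key=lambda r: r['age']).",
]

# Load an existing (fast) tokenizer; the model name is illustrative only.
base_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def batches(texts, batch_size=1000):
    """Yield batches of texts for the tokenizer trainer."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# Retrain the same tokenization algorithm on conversational data, keeping the
# original vocabulary size so only the learned vocabulary/merges change.
conv_tokenizer = base_tokenizer.train_new_from_iterator(
    batches(conversation_texts),
    vocab_size=base_tokenizer.vocab_size,
)
conv_tokenizer.save_pretrained("conversation-optimized-tokenizer")
```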

The "fertility" (number of tokens per word) and "token reduction rate" were used for evaluation. fertility is particularly useful as an indicator of text compression efficiency. The re-tokenized model showed a consistent trend toward reducing the overall number of tokens compared to the original tokenizer. Optimizations on the response side were found to be particularly effective, also consistent with the fact that chat responses comprise the majority of text.

The authors conclude that this design improves the practical efficiency of the tokenizer without compromising the generality of the model.

Experiments

The effectiveness of the conversation-optimized tokenizers was tested through three experiments.

The first experiment evaluated the existing tokenizers of eight LLMs (GPT-4, GPT-4o, DeepSeek-R1, LLaMA-3.1, Gemma-2, Mistral-7B, BLOOM, and Phi-4). In all models, token efficiency (fertility) was worse on conversational data, suggesting the need for optimization.

The second experiment confirmed that the retrained tokenizers achieve token reductions of roughly 5-10% or more compared to the original tokenizers. Gemma-2, Mistral-7B, and BLOOM in particular showed improvements of more than 10%, and a per-language analysis showed that the reduction was more pronounced for languages with richer data, such as English and Spanish.

The final experiment examined the impact of the retrained tokenizers on a conventional training corpus (C4). In most models the number of tokens increased by only 1-2%, and in some models it even decreased. This suggests that conversational optimization can be adopted without a significant loss of model generality.
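A rough sketch of this kind of generality check is shown below: it streams a small sample of C4 and compares the token counts of the original and retrained tokenizers, reusing `base_tokenizer` and `conv_tokenizer` from the earlier sketch. The sample size and dataset configuration are assumptions, not the paper's settings.

```python
from datasets import load_dataset

# Stream a small sample of the C4 web corpus to avoid downloading it in full.
c4_stream = load_dataset("allenai/c4", "en", split="validation", streaming=True)
texts = [row["text"] for _, row in zip(range(1000), c4_stream)]

orig_count = sum(len(base_tokenizer.encode(t, add_special_tokens=False)) for t in texts)
conv_count = sum(len(conv_tokenizer.encode(t, add_special_tokens=False)) for t in texts)

# A ratio of about 1.01-1.02 would correspond to the 1-2% increase reported above.
print(f"C4 token ratio (retrained / original): {conv_count / orig_count:.3f}")
```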


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
