Potential Of The Conversation Optimization Tokenizer: A Method To Improve LLM Inference Efficiency By 10%

3 main points
✔️ Current tokenizers are not optimized for conversational text and are less efficient
✔️ Retraining the tokenizer on conversational data reduces the number of tokens by up to 10% or more
✔️ Conversational optimization improves efficiency during inference, but has minimal impact on training performance

Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
Written by Raquel Ferrando, Javier Conde, Gonzalo Martínez, Pedro Reviriego
(Submitted on 23 Jun 2025)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Overview

The computational cost and energy consumption of an LLM grow with the number of tokens it processes, so designing an efficient tokenizer is an important way to keep token counts down. Many current tokenizers are optimized for static, structured corpora such as books and web text. However, chatbots, the primary real-world application of LLMs, deal mainly with conversational text, whose style and structure differ from those corpora.

Focusing on this gap, this study designs "conversation-optimized tokenizers" and examines their effectiveness. Specifically, the tokenizers of several LLMs are retrained on LMSYS-Chat-1M, a corpus of real-world chat data.

The results show token reductions of up to 10% or more, indicating the potential for improved energy efficiency. At the same time, the impact on tokenization of conventional training corpora is limited, so the negative impact on model performance is expected to be minimal.

Proposed Methodology

This study examines whether optimizing existing tokenizers for conversational data can reduce the number of tokens and energy cost during inference.

As a first step, the LMSYS-Chat-1M corpus is split into 80% for training and 20% for evaluation. Three variants of each tokenizer are then retrained: on user requests only, on model responses only, and on both. Retraining uses the same algorithm and settings as each model's original tokenizer to ensure a fair comparison.
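As a rough illustration, the sketch below retrains an existing Hugging Face tokenizer on conversational text with `train_new_from_iterator`, keeping the original vocabulary size. The model name, the tiny in-line corpus, and the batch size are placeholders for illustration, not the paper's actual setup.

```python
from transformers import AutoTokenizer

# Tiny placeholder corpus of conversational text; in practice this would be the
# 80% training split of LMSYS-Chat-1M (user requests, model responses, or both).
conversation_texts = [
    "How do I sort a list of dictionaries by a key in Python?",
    "You can use sorted() with a key argument, e.g. sorted(rows, key=lambda r: r['age']).",
]

# Load an existing (fast) tokenizer; the model name is illustrative only.
base_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def batches(texts, batch_size=1000):
    """Yield batches of texts for the tokenizer trainer."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# Retrain the same tokenization algorithm on conversational data, keeping the
# original vocabulary size so only the learned vocabulary/merges change.
conv_tokenizer = base_tokenizer.train_new_from_iterator(
    batches(conversation_texts),
    vocab_size=base_tokenizer.vocab_size,
)
conv_tokenizer.save_pretrained("conversation-optimized-tokenizer")
```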

The "fertility" (number of tokens per word) and "token reduction rate" were used for evaluation. fertility is particularly useful as an indicator of text compression efficiency. The re-tokenized model showed a consistent trend toward reducing the overall number of tokens compared to the original tokenizer. Optimizations on the response side were found to be particularly effective, also consistent with the fact that chat responses comprise the majority of text.

The authors conclude that this design improves the practical efficiency of the tokenizer without compromising the generality of the model.

Experiments

The effectiveness of the conversation-optimized tokenizers was tested through three experiments.

The first experiment evaluated the existing tokenizers of eight LLMs (GPT-4, GPT-4o, DeepSeek-R1, LLaMA-3.1, Gemma-2, Mistral-7B, BLOOM, and Phi-4). In all models, token efficiency (fertility) was worse on conversational data, suggesting the need for optimization.

The second experiment confirmed that the retrained tokenizers achieve token reductions of roughly 5-10% or more compared to the original tokenizers. Gemma-2, Mistral-7B, and BLOOM in particular showed improvements of more than 10%, and a per-language analysis showed that the reduction was more pronounced for languages with richer data, such as English and Spanish.

The final experiment examined the impact of the retrained tokenizers on a conventional training corpus (C4). In most models the number of tokens increased by only 1-2%, and in some models it even decreased. This suggests that conversational optimization can be adopted without a significant loss of model generality.
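A rough sketch of this kind of generality check is shown below: it streams a small sample of C4 and compares the token counts of the original and retrained tokenizers, reusing `base_tokenizer` and `conv_tokenizer` from the earlier sketch. The sample size and dataset configuration are assumptions, not the paper's settings.

```python
from datasets import load_dataset

# Stream a small sample of the C4 web corpus to avoid downloading it in full.
c4_stream = load_dataset("allenai/c4", "en", split="validation", streaming=True)
texts = [row["text"] for _, row in zip(range(1000), c4_stream)]

orig_count = sum(len(base_tokenizer.encode(t, add_special_tokens=False)) for t in texts)
conv_count = sum(len(conv_tokenizer.encode(t, add_special_tokens=False)) for t in texts)

# A ratio of about 1.01-1.02 would correspond to the 1-2% increase reported above.
print(f"C4 token ratio (retrained / original): {conv_count / orig_count:.3f}")
```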


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
