TourBERT, The BERT Model Dedicated To The Tourism Industry, Is Here!

BERT 23/01/2023

3 main points
✔️ Pre-training performed on 3.6M tourism reviews from over 20 countries and about 50,000 tourism service and attraction descriptions
✔️ TourBERT was trained from scratch for 1M steps on the BERT-Base architecture, using a WordPiece tokenizer with a tourism-specific vocabulary built from the crawled corpus at the same vocabulary size as BERT-Base
✔️ Quantitative and qualitative evaluation showed that it outperformed BERT-Base on all tasks

TourBERT: A pretrained language model for the tourism industry
written by Veronika Arefieva, Roman Egger
(Submitted on 19 Jan 2022 (v1), last revised 19 May 2022 (this version, v3))
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

BERT (Bidirectional Encoder Representations from Transformers) has been one of the most important natural language models since Google introduced it in 2018. Through pre-training followed by fine-tuning, it can perform text classification, question answering, sentiment analysis, summarization, and many other tasks.

In addition, existing research has shown that pre-training BERT on a large domain-specific corpus is effective, and various derivatives of BERT have been developed for the financial domain (FinBERT), the medical domain (Clinical BERT), the biomedical domain (BioBERT), and the biomedical and computer science domains (SciBERT).

TourBERT, presented in this paper, is a BERT model that has been pre-trained on 3.6 million tourism reviews and approximately 50,000 descriptions of tourism services and attractions from more than 20 countries around the world to learn vocabulary specific to the tourism industry.

Tourism Industry Historical Background

Tourism is one of the most important economic sectors in the world, and its services are known to have many characteristics that distinguish it from other industries.

Examples include the fact that services in the tourism industry are not tangible, so customers cannot verify in advance whether the trip is really interesting, and tourism services are relatively expensive compared to everyday goods.

In addition, people around the world now share their travel experiences on social media such as Twitter, Facebook, and Instagram, and this information influences other users, so it has become essential for tourism operators to manage such content properly.

Against this backdrop, automatic analysis of text using natural language processing is gaining importance both in academia and in the tourism industry.

TourBERT Overview

TourBERT uses BERT-Base-uncased as its basic architecture and is trained from scratch, rather than starting from an initial checkpoint as was done for FinBERT and BioBERT described above.

The entire corpus was preprocessed by lowercasing the text and then splitting it into sentences at delimiter characters. Pre-training used the BERT-Base architecture with a WordPiece tokenizer and a tourism-specific vocabulary, built from the crawled corpus with the same vocabulary size as BERT-Base, and ran for 1M steps.
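
The preprocessing step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the article does not say which delimiters were used, so the rule below (split after `.`, `!`, or `?` followed by whitespace) and the function name `preprocess` are assumptions.

```python
import re
from typing import List

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# The article does not specify the delimiters, so this rule is illustrative.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def preprocess(text: str) -> List[str]:
    """Lowercase a raw review and split it into sentences."""
    lowered = text.lower().strip()
    # Split after sentence-ending punctuation and drop empty fragments.
    return [sentence for sentence in _SENTENCE_END.split(lowered) if sentence]


# Hypothetical review text, used only to show the call.
for sentence in preprocess("Great hotel! The pool was clean."):
    print(sentence)
```

Each resulting sentence would then be passed to the WordPiece tokenizer. Lowercasing first matches the uncased architecture named above.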

This model is also available on the Hugging Face Hub, where the TourBERT model and tokenizer can be loaded in a few lines of code.
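
A minimal sketch of loading the model with the `transformers` library might look like the following. The repository ID `veroman/TourBERT` is an assumption, since this article does not state it, and should be checked on the Hugging Face Hub before use. The helper name `load_tourbert` is hypothetical.

```python
# Assumption: TourBERT is published under this Hub repository ID.
# The article does not give the ID, so verify it before relying on it.
MODEL_ID = "veroman/TourBERT"


def load_tourbert(model_id: str = MODEL_ID):
    """Download (or load from cache) the TourBERT tokenizer and model."""
    # Imported inside the function so that defining it does not
    # require transformers to be installed.
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = BertModel.from_pretrained(model_id)
    return tokenizer, model
```

Calling `load_tourbert()` fetches the weights on first use, after which the tokenizer and model can be used like any other BERT checkpoint.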

Comparative experiments between BERT and TourBERT

Several quantitative and qualitative experiments were conducted to evaluate TourBERT.

Sentiment Classification

To begin, a sentiment classification task was performed on the following two datasets to demonstrate that TourBERT performs better than regular BERT on travel reviews.

Tripadvisor hotel review dataset (Ray et al. 2021): A dataset of hotel reviews from Tripadvisor, an American online travel company, consisting of a total of 69,308 reviews labeled negative, neutral, or positive (multi-class classification).
515K reviews from Europe hotels dataset: A dataset of reviews scraped from Booking.com, a Dutch online travel agency (binary classification).

The evaluation results for each dataset are shown below.

From the table, we can see that TourBERT scores better than regular BERT on both datasets.

Clustering using tourist photos

Next, a comparison experiment was conducted by clustering tourist photos and visualizing the clusters with TensorBoard Projector.

A dataset of 48 photos showing various tourism activities, such as sports activities, visits to tourist attractions, and shopping, was prepared and manually labeled by 622 people. The photos were then clustered with each model, and the results were visualized with TensorBoard Projector.

The clustering results for both models are shown below. (Top: regular BERT; bottom: TourBERT)

Comparing the two, the results using regular BERT show a scattered mix of photos, whereas the results using TourBERT show proper clustering, with photos of similar content grouped in the same cluster.

Synonym search

A comparison experiment was conducted with both models on the assumption that TourBERT trained on a tourism-specific corpus would perform better than BERT trained on a general corpus in a synonym retrieval task for tourism-related terms.

The search results for both models are shown below. (The first row of each table is the search word, and the rows below it show the 8 words with the highest similarity.)

Regular BERT

TourBERT

Comparing the results of the two models, we can see that regular BERT returns general words such as "choice," "address," and "exit" for the word "destination," whereas TourBERT appropriately captures tourism-specific meanings such as "spot," "attraction," and "itinerary."
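
The synonym search described in this section amounts to a nearest-neighbour lookup over word vectors. The sketch below ranks words by cosine similarity and keeps the top 8. It assumes each word already has an embedding vector from the model. The article does not describe how the vectors were extracted or which similarity measure was used, so cosine similarity, the function names, and the toy 3-dimensional vectors are all illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: str,
    embeddings: Dict[str, Sequence[float]],
    top_k: int = 8,
) -> List[Tuple[str, float]]:
    """Return the top_k words most similar to `query`, best first."""
    query_vec = embeddings[query]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query  # do not return the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors invented purely for illustration; real embeddings
# would come from BERT or TourBERT and have 768 dimensions.
toy_embeddings = {
    "destination": [0.9, 0.1, 0.0],
    "attraction": [0.8, 0.2, 0.1],
    "itinerary": [0.7, 0.3, 0.0],
    "exit": [0.0, 0.1, 0.9],
}

for word, score in most_similar("destination", toy_embeddings, top_k=2):
    print(word, round(score, 3))
```

With real embeddings, the same ranking run once per model would give the two lists being compared.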

Summary

How was it? We introduced TourBERT, a BERT model that learned tourism-specific vocabulary by pre-training on 3.6 million tourism reviews and approximately 50,000 descriptions of tourism services and attractions in over 20 countries around the world.

In various comparative experiments on tourism-related photos and reviews, TourBERT outperformed regular BERT in all experiments. We are very excited to see how it will be used in the tourism industry in the future.

For those interested, a detailed overview of the comparative experiments and results presented here can be found in the paper.