Machine Learning Model Tackles Soccer Match Prediction In Sports Betting

Sports Analytics 29/01/2025

3 main points
✔️ Evolution of soccer data collection due to legalized gambling abroad
✔️ Effectiveness of machine learning models to predict match results
✔️ Importance of hyperparameters and feature selection for improving prediction accuracy

The Evolution of Football Betting- A Machine Learning Approach to Match Outcome Forecasting and Bookmaker Odds Estimation
written by Purnachandra Mandadapu
(Submitted on 24 Mar 2024)
Comments: Published on arxiv.
Subjects: Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Professional soccer has been closely associated with gambling since its birth in 19th century England. Initially just a fun way to watch the game, gambling has over time become a major factor influencing the sport as a whole.

Then, in 1960, The Betting and Gaming Act was enacted by the British Parliament, fully legalizing gambling. This law coincided with a period in which the collection of data on soccer was becoming increasingly important, and it triggered the rapid development of the world of gambling and soccer data.

With the legalization of gambling in the United Kingdom, bookmakers began to collect more accurate and detailed soccer match data in order to set accurate odds, and over the next 60 years the gambling and soccer data industries experienced tremendous growth.

Soccer data was once the domain of enthusiasts who noted the number of passes and goals by hand; gambling operators and soccer clubs have since turned it into a lucrative industry. Sports betting services (such as Stake.com and BeeBet) have expanded rapidly over the past few years.

Data collection in soccer has evolved from a manual process to a sophisticated operation that utilizes state-of-the-art technology. With multiple high-resolution cameras tracking players, sensors embedded in shoes, and microchips in the ball, every moment is recorded and every scenario of the game can be analyzed in detail.

Furthermore, the introduction of artificial intelligence (AI), especially machine learning (ML), has dramatically improved the ability to analyze soccer data. Numerous studies have shown that ML-based analysis is effective in optimizing player deployment and team strategy, improving training, and predicting match outcomes.

In this paper we build a model that accurately predicts the outcome of Premier League matches. Utilizing historical soccer data, we seek to find the best approach to predict match outcomes using ML models.

In addition, we are attempting to reproduce the method of calculating odds for 1 x 2 betting based on the predictions generated by these models, and to calculate the odds from a new perspective. These odds will be used as a basis for evaluating the models' predictions and as a tool for analyzing the factors that influence the outcome of a match from multiple perspectives.

Odds and 1x2 Betting

In sports betting, bookmakers play an important role in setting odds on the outcome of sporting events. Bookmakers use a combination of complex algorithms and expert opinion to determine the odds so that they are properly profitable no matter what the outcome.

Odds are set based on the probability of a particular outcome occurring. For example, if Team A is considered stronger than Team B, the odds against Team A winning are lower.

This paper focuses on the basic wagering strategy, "1x2." With "1" representing a home team win, "X" a draw, and "2" an away team win, "1x2" is the simplest bet type: the bettor chooses whether the home team wins, the match ends in a draw, or the away team wins.

In many soccer leagues, each team plays every other team twice, once at home and once away. The location of these matchups has a significant impact on predictions, and teams are known to perform better in home matches.

The odds are expressed as a number greater than 1 and are calculated using the following formula:

$$\text{odds} = \frac{1}{P}$$

P represents the probability of a particular outcome occurring. For example, if Team A has a 50% (0.5) chance of winning, the odds are 2.00, which means that if Team A wins, the payout is double the amount staked.
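The relationship between probability and odds described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function names and the 1x2 probabilities are hypothetical, and the odds are "fair" odds with no bookmaker margin applied.

```python
def decimal_odds(probability: float) -> float:
    """Fair decimal odds for an outcome with the given probability (odds = 1 / P)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / probability


def implied_probability(odds: float) -> float:
    """Invert the formula: the probability implied by decimal odds."""
    if odds < 1.0:
        raise ValueError("decimal odds must be at least 1")
    return 1.0 / odds


# Hypothetical 1x2 probabilities for one match: home win, draw, away win.
match_probabilities = {"1": 0.50, "X": 0.25, "2": 0.25}

match_odds = {
    outcome: round(decimal_odds(p), 2)
    for outcome, p in match_probabilities.items()
}
print(match_odds)  # {'1': 2.0, 'X': 4.0, '2': 4.0}
```

A probability of 0.5 gives odds of 2.00, matching the example in the text. Real bookmaker odds are typically somewhat lower than these fair values.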

Depending on the bookmaker, these odds vary and are influenced by algorithms and subjective assessments of experts. The odds may also fluctuate due to factors such as player injuries or changes in team composition just prior to the match. Once betting begins, the odds are fixed at that point and do not fluctuate.

However, since bookmakers always operate at a profit, some bookmakers may set odds that are unfavorable to bettors or place limits on the amount wagered. Such methods are naturally subject to criticism.

Dataset

The dataset utilizes detailed statistics from the English Premier League for the 2021-2022 and 2022-2023 seasons.

Using web scraping, we collect match data for all teams that participated in the 2021-2022 and 2022-2023 seasons. We extract the necessary statistics from each team's page, organize them, and compile them into a database. This database forms the basis for the analysis in this paper.

The data collected covers statistics for each team across the 380 Premier League matches of each season and is grouped into nine categories, including scoring, shots, goaltending, and passing. The data for each match is combined into a single data set with information on both the home and away teams.

Finally, a table of 1520 rows and 52 columns is constructed containing 34 statistics and supplementary information. This data set is ready for analysis by ML and is used to predict match results and discover patterns.

It is also important to properly process the data before beginning machine learning. First, the raw data must be organized and made suitable for analysis. Missing data are filled in with default values or with mean and median values, or are predicted with algorithms such as K-nearest neighbor (KNN) and regression analysis. Noise (unwanted variation or errors) in the data is also handled by methods such as binning, regression, and clustering.
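The simpler imputation strategies mentioned above (default, mean, and median values) can be sketched as follows. This is an illustrative helper, not code from the paper: the function name `impute`, the use of `None` to mark missing values, and the sample shot counts are all assumptions.

```python
from statistics import mean, median


def impute(values, strategy="mean", default=0.0):
    """Replace None entries with a default, the mean, or the median of observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "default" or not observed:
        fill = default
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical shots-per-match column with two missing entries.
shots = [12.0, None, 8.0, 15.0, None, 9.0]

shots_mean = impute(shots, "mean")
shots_median = impute(shots, "median")
shots_default = impute(shots, "default", default=0.0)
print(shots_mean)
print(shots_median)
```

The median is less sensitive to a single unusually high or low match than the mean, which is one reason to compare both before choosing a strategy.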

Care must be taken when integrating data from various sources, as data redundancy can occur. Normalization, aggregation, and generalization are used to put the data into a format that is easy to analyze.

In addition, the data must be encoded into a numerical format for the ML algorithm to function properly. For example, in this paper, data on "venues" are converted to numerical values, with 1 for home and 0 for away, and data on "opponents" and "teams" are replaced with integers for the respective team names. In addition, the data for "result," which indicates the match result, is encoded as 1 (W) for a win, 0 (D) for a draw, and 2 (L) for a loss. This process makes the data compatible with the 1x2 bet format.
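The encoding scheme described above might look like the following sketch. The result and venue codes follow the article; the raw labels ("Home", "Away", "W", "D", "L"), the alphabetical assignment of team integers, and the sample rows are assumptions made for illustration.

```python
# Codes taken from the description in the article.
RESULT_CODES = {"W": 1, "D": 0, "L": 2}
VENUE_CODES = {"Home": 1, "Away": 0}


def build_team_codes(rows):
    """Assign an integer to every team name (alphabetical order is an assumption)."""
    names = sorted({row["team"] for row in rows} | {row["opponent"] for row in rows})
    return {name: code for code, name in enumerate(names)}


def encode_row(row, team_codes):
    """Convert one match row from text labels to integers."""
    return {
        "venue": VENUE_CODES[row["venue"]],
        "team": team_codes[row["team"]],
        "opponent": team_codes[row["opponent"]],
        "result": RESULT_CODES[row["result"]],
    }


# Hypothetical rows, one per team per match.
matches = [
    {"team": "Arsenal", "opponent": "Chelsea", "venue": "Home", "result": "W"},
    {"team": "Chelsea", "opponent": "Arsenal", "venue": "Away", "result": "L"},
    {"team": "Everton", "opponent": "Arsenal", "venue": "Home", "result": "D"},
]

team_codes = build_team_codes(matches)
encoded = [encode_row(row, team_codes) for row in matches]
print(encoded[0])  # {'venue': 1, 'team': 0, 'opponent': 1, 'result': 1}
```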

Columns such as "match report," "notes," "referee," "captain," and "information" that are not directly relevant to the analysis have been removed. In addition, match data for the final week of the 2022-2023 season has been replaced with season-wide averages for each team. This ensures that the data is uniformly aligned and improves the accuracy of the analysis.

Experiment Summary

This experiment evaluates the performance of various ML models and searches for the best predictive model. It is important to select appropriate features and hyperparameters depending on the complexity of the data. Here, we compare several ML models, including random forests and KNN, to evaluate which model can most accurately predict the results.

The "features" that ML models deal with are patterns or properties extracted from the data. Understanding the importance of these features and how each model evaluates them is critical to improving prediction accuracy. The selection of appropriate training data is especially important for time-series data such as soccer match histories. The current dataset consists of Premier League match data from the 2021-2022 and 2022-2023 seasons, split and analyzed in a variety of ways.
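One common way to respect the time ordering of match data is to hold out the most recent match weeks as the test set. The sketch below illustrates that idea only; the article does not spell out the paper's exact splitting procedure, and the `week` field, the function name, and the synthetic schedule (38 match weeks of 10 matches) are assumptions.

```python
def split_by_match_week(rows, last_n_weeks):
    """Hold out the final `last_n_weeks` match weeks as the test set."""
    weeks = sorted({row["week"] for row in rows})
    cutoff = weeks[-last_n_weeks]  # first week that belongs to the test set
    train = [row for row in rows if row["week"] < cutoff]
    test = [row for row in rows if row["week"] >= cutoff]
    return train, test


# Synthetic season: 38 match weeks with 10 matches each (380 matches).
season = [{"week": week, "match": m} for week in range(1, 39) for m in range(10)]

train, test = split_by_match_week(season, last_n_weeks=10)
print(len(train), len(test))  # 280 100
```

Splitting chronologically rather than at random keeps later matches out of the training data, so the evaluation does not benefit from information that would not have been available at prediction time.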

Python is used for the implementation, because its simple structure and straightforward syntax make it easy to create reproducible analytical procedures. Jupyter Notebook was chosen as the development environment because of its ability to integrate code, visualization, and text for interactive data exploration.

Although the initial dataset contained many match attributes, the number of features was reduced so that the ML algorithm could process them efficiently. Recursive Feature Elimination (RFE) was used to narrow down the best features. This method finds the optimal feature set by first using all features and then progressively removing the less important ones.
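A minimal sketch of RFE using scikit-learn's `RFE` class is shown here. It assumes scikit-learn is installed and uses a synthetic three-class dataset in place of the match statistics; the choice of a random forest as the ranking estimator, the number of features to keep, and all parameter values are illustrative assumptions, not the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the match statistics: 12 features, 3 outcome classes.
X, y = make_classification(
    n_samples=300,
    n_features=12,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

# The estimator supplies feature importances; RFE refits it after each removal.
estimator = RandomForestClassifier(n_estimators=50, random_state=0)
selector = RFE(estimator, n_features_to_select=5, step=1)
selector.fit(X, y)

# support_ marks the kept features; ranking_ is 1 for kept, higher for earlier removals.
selected = [index for index, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", selected)
print("ranking:", list(selector.ranking_))
```

With `step=1`, one feature is dropped per iteration, which mirrors the "start with everything, then progressively remove the least important" procedure described above.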

"Hyperparameters" play an important role in tuning ML models. These parameters control the learning process of the model and are set prior to training. Various combinations of hyperparameters are tried using methods such as grid search and random search to select the optimal settings. Through these methods, we maximize the performance of the ML model.
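Grid search can be illustrated with a small pure-Python example that tries every combination of two KNN hyperparameters: the number of neighbors `k` and the Minkowski distance order `p`. This is a sketch under simplifying assumptions. It scores each combination on a single validation split rather than with cross-validation, and the toy data points, labels, and grid values are hypothetical.

```python
from collections import Counter
from itertools import product


def knn_predict(train, query, k, p):
    """Majority vote among the k nearest training points (Minkowski distance of order p)."""
    def distance(a, b):
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, valid, k, p):
    """Fraction of validation points classified correctly."""
    hits = sum(knn_predict(train, x, k, p) == y for x, y in valid)
    return hits / len(valid)


def grid_search(train, valid, grid):
    """Try every (k, p) combination and keep the first one with the best score."""
    best_params, best_score = None, -1.0
    for k, p in product(grid["k"], grid["p"]):
        score = accuracy(train, valid, k, p)
        if score > best_score:
            best_params, best_score = {"k": k, "p": p}, score
    return best_params, best_score


# Toy data: two clusters plus one mislabeled point near the "W" cluster.
train = [
    ((1.0, 1.0), "W"), ((1.2, 0.9), "W"), ((0.9, 1.3), "W"),
    ((3.0, 3.0), "L"), ((3.2, 2.9), "L"), ((2.8, 3.1), "L"),
    ((1.1, 1.1), "L"),  # label noise
]
valid = [((1.08, 1.08), "W"), ((3.1, 3.0), "L"), ((0.9, 1.2), "W")]

best_params, best_score = grid_search(train, valid, {"k": [1, 3, 5], "p": [1, 2]})
print(best_params, best_score)  # {'k': 3, 'p': 1} 1.0
```

With `k=1` the noisy point misleads the classifier, while `k=3` outvotes it, which is the kind of difference a hyperparameter search is meant to surface.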

In addition, Accuracy, Precision, Recall, and F1 score are primarily used to evaluate the models. Using these evaluation metrics, the prediction accuracy of each model is analyzed to select the best model.
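The four metrics can be computed directly from true and predicted labels, as in the sketch below. The function name `evaluate` and the eight sample predictions are hypothetical; Precision, Recall, and F1 are reported per class (W, D, L), as is usual for a three-class problem.

```python
def evaluate(y_true, y_pred, labels):
    """Return overall accuracy and per-class precision, recall, and F1."""
    report = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, report


# Hypothetical match outcomes and model predictions.
y_true = ["W", "W", "W", "D", "D", "L", "L", "L"]
y_pred = ["W", "W", "L", "W", "D", "L", "L", "W"]

accuracy, report = evaluate(y_true, y_pred, labels=["W", "D", "L"])
print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.625
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

In this example the draw class has perfect Precision but only 50% Recall, a pattern that overall Accuracy alone would hide.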

Random Forest

In this paper, we use various machine learning (ML) models to predict the outcome of soccer matches and evaluate their effectiveness. Here we look at the "Random Forest" results.

First, we evaluate the performance of the model based on different data partitions. The results are shown in the table below. When tested on data spanning two seasons (2 Seasons of Data), Random Forest achieves 64.95% Accuracy and shows relatively high Precision and Recall for each class (Win, Draw, and Loss).

However, some classes are misclassified and a bias toward certain outcomes is also evident. When tested with only one season of data (1 Season of Data), Accuracy improves to 67.33%, but the bias remains. Furthermore, when predictions are made using only the most recent match data (10 Match Weeks of Data), Accuracy drops to 47.73%, suggesting the limitations of making predictions based on recent data alone.

Next, we analyze the results for different feature sets (the type of data the model uses to make predictions). The results are shown in the table below. The first model, which includes all features (All Feature Subset), shows balanced results with 68% Accuracy.

Using RFE, a feature selection technique, Accuracy slightly increases to 69%, indicating the usefulness of selecting important features. However, when features are selected based on their correlation with the target variable, Accuracy drops to 62%, revealing the limitations of feature selection that relies solely on correlation.
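Correlation-based selection can be sketched as ranking features by the absolute Pearson correlation with the target and keeping the top few. The example below is an illustration only: the feature names, the values, and the binary target (1 for a win, 0 otherwise) are hypothetical, and the paper's exact correlation criterion is not described in this article.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0


def select_by_correlation(features, target, top_k):
    """Keep the top_k features with the largest absolute correlation to the target."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


# Hypothetical per-match statistics and a binary "win" target.
features = {
    "goals": [3, 0, 2, 1, 2, 0],
    "shots": [10, 9, 12, 8, 9, 11],
    "fouls": [10, 10, 11, 11, 12, 12],
}
target = [1, 0, 1, 0, 1, 0]

selected, scores = select_by_correlation(features, target, top_k=2)
print(selected)  # ['goals', 'shots']
```

Because each feature is scored in isolation, this approach cannot see interactions between features, which is one plausible reason it trails RFE in the results above.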

The model's predictions for individual soccer matches are also examined. The results are shown in the table below. It can be seen that the model shows a strong tendency toward certain outcomes. For example, there is a significant bias toward a single outcome for matches involving Leeds United and Tottenham, indicating that the model is highly confident in its predictions for these matches.

On the other hand, for the Crystal Palace vs. Nottingham Forest match, the model leaned toward a draw, showing how the model captures uncertainty and variability in soccer predictions.

While these results indicate that random forests are effective in predicting the outcome of soccer matches, they also suggest that there are limitations and room for improvement in the model. Predictive bias for specific matches and the impact of data selection methods on accuracy should be further explored in future studies.

Support Vector Machine

In this section, we will look at the results of predicting the outcome of soccer matches using the Support Vector Machine (SVM) model. The results are shown in the table below.

First, when using two seasons of data (2 Seasons of Data), the SVM model achieves an Accuracy of 67%. Using one season of data (1 Season of Data), Accuracy improves to 72.67%, but the model still struggles to predict draws. Furthermore, when using the most recent match data (10 Match Weeks of Data), Accuracy drops significantly to 45%, indicating that prediction from recent data alone is very difficult. This is likely due to the smaller data set and increased variability in match results.

Second, when all features were used, the SVM model showed 72% Accuracy, but still struggled to predict draws.

With RFE, a feature selection method, Accuracy dropped slightly to 70%, and draw prediction did not improve significantly. When only highly correlated features were used, Accuracy dropped to 66.67%, indicating that highly correlated features are not necessarily effective in predicting draws.

Overall, the SVM model performed well, but it consistently struggled to predict draws. These results suggest that there may be complexities inherent in draws and important features that are still being overlooked. To address this issue, more sophisticated selection and engineering of features relevant to draw prediction may be necessary.

Furthermore, although the SVM model showed high Accuracy in some games, it also made errors in predicting games involving teams such as Leeds United, Tottenham, Arsenal, Wolves, Chelsea, and Newcastle United.

This reveals the limitations of statistical analysis alone in predicting matches. There is also a tendency for the model to hesitate in selecting "draws" for some matches, which may be due to subtle differences regarding match dynamics and team strength.

K-Nearest Neighbor

In this section, we will look at the results of predicting the outcome of soccer matches using the K-Nearest Neighbor (KNN) model. The results are shown in the table below.

First, using data from two seasons, the KNN model has an Accuracy of 61.52%, which is slightly lower than the Accuracy of the SVM model. The KNN model shows strength in predicting "away wins," correctly predicting 125 out of 158 cases, but significantly underperforming in predicting "draws," correctly predicting only 9 out of 92 cases. This trend is also seen when one season's worth of data is used, although Accuracy improves slightly to 62.67%.
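The counts quoted above translate into per-class Recall as follows. This is simple arithmetic on the figures in the text; the helper name `class_recall` is hypothetical.

```python
def class_recall(correct, total):
    """Share of actual cases of a class that the model predicted correctly."""
    return correct / total


away_win_recall = class_recall(125, 158)
draw_recall = class_recall(9, 92)

print(f"away wins: {away_win_recall:.1%}")  # away wins: 79.1%
print(f"draws:     {draw_recall:.1%}")      # draws:     9.8%
```

Roughly four in five away wins are identified, compared with about one in ten draws.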

On the other hand, using the most recent match data, the Accuracy of the KNN model drops significantly to 38.64%. This decrease is also reflected in Precision, Recall, and F1 scores for all classes.

Similar to the challenges seen in the SVM model, these results can be attributed to the reduced size of the data set and increased variability in recent matches.

Despite the use of feature selection techniques such as RFE to improve performance, the Accuracy of the KNN model is still low compared to SVM and Random Forest, at 65.33%.

Extreme Gradient Boosting

Here we look at the results of predicting soccer match outcomes using the Extreme Gradient Boosting (XGB) model, which consistently has an Accuracy in the 65% to 70% range when using two seasons and one season of data.

As with the KNN model, the XGB results confirm that Accuracy can be improved significantly by selecting suitable hyperparameters: tuning them improves Accuracy by about 6%. This result underscores the importance of feature selection and hyperparameter settings for the prediction accuracy of the XGB model.

The XGB model has also been shown to have the ability to capture unique patterns in the data, even when the distribution of classes per match is not even.

Summary

The paper evaluates a variety of ML models to predict the outcome of soccer matches, highlighting the importance of explainable artificial intelligence (XAI) in particular. Model interpretability is critical in a field as complex as soccer.

We also analyzed the accuracy of the "1x2" odds calculated by ML and found that further optimization is needed to better handle recent data, as suggested by the variability of predictions due to data splitting.

Possible future improvements include expansion of the data set, improved data preprocessing, extensive exploration of hyperparameters, and the introduction of advanced ML algorithms, including neural networks. Feature selection with different subset sizes and further analysis of match statistics are also important.


Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.