Predicting The Market Value Of Soccer Players, Machine Learning Reveals Future Stars

Decision Trees 09/01/2025

3 main points
✔️ Building machine learning models to predict the market value of soccer players
✔️ Using Boruta for feature selection and SHAP for model interpretation to visualize performance indicators
✔️ GBDT achieved the highest accuracy

Explainable artificial intelligence model for identifying Market Value in Professional Soccer Players
written by Chunyang Huang, Shaoliang Zhang
(Submitted on 8 Nov 2023 (v1), last revised 23 Nov 2023 (this version, v2))
Comments: 13 pages, 6 figures
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Computational Finance (q-fin.CP)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Soccer is one of the most popular sports in the world. Its popularity extends beyond the game itself to support economic activity worth hundreds of billions of dollars. In particular, the market for the transfer of soccer players has a significant economic impact and is an important element in the soccer industry. Accurately assessing the market value of a player has important implications for club management, including transfer negotiations and club financial strategies.

Negotiating high-ticket transfers has a significant impact on a club's reputation and financial success, and its accurate valuation leads to the club's financial stability and long-term success. This evaluation is also critical to a club's business strategy, as a player's market value has a significant impact on salary policies and the club's budget plans.

In recent years, data analysis and machine learning techniques have also played an important role in assessing the market value of players. With the advent of online platforms such as SoFIFA and Transfermarkt, detailed player performance data has become available, and models that predict a player's market value are becoming more accurate.

A study by Mustafa A. AL-ASADI and Sakir Tasdemir used game data from the FIFA 20 soccer game to build machine learning models that predict the market value of players, and reported that Random Forest outperformed traditional statistical models and showed the highest prediction accuracy.

Similarly, in the McHale and Holmes study, XGBoost achieved significantly greater accuracy than traditional statistical models, and random forests produced superior results in market value assessment in the Yang et al. study.

Against this backdrop, the study uses an ensemble machine learning model and the SHAP (SHapley Additive exPlanations) method to conduct a detailed analysis of the factors that influence a player's market value. The method provides a clear visualization of player valuations from both local and global perspectives and identifies key performance indicators.

This research is expected to contribute to the decision-making process in the sports economy by providing a new perspective on the evaluation of the market value of soccer players. Visualizing how excellent soccer players are evaluated is also an essential element for soccer fans to enjoy the game.

Technique

Data-Set

This is a detailed analysis of the data available on SoFIFA, a well-known website for soccer fans. The website contains a wealth of statistics about players, including player ratings, team composition, position, and dominant foot.

The analysis covers data on approximately 12,000 players registered with SoFIFA as of January 5, 2023. The dataset contains a total of 34 characteristics, including player name, market value, salary, overall rating, and potential. Twenty-nine of these characteristics relate to field players, while five are specific to goalkeepers. The table below lists the items.

The data preparation phase includes data cleansing to impute missing values and to split the players into two categories: field players and goalkeepers. The market values of the players used in the analysis are widely distributed, ranging from €15,000 to €190 million, as shown in the figure below.

The distribution shows that many players are concentrated in areas of low market value, while a small number of players with high market value have a significant impact on the distribution. The so-called "superstar effect" is manifested, showing that some popular players have very high market value.

However, since data on these players with high market values would affect the performance-centered analysis, we exclude data on about 3% of the players with market values above 25 million euros, as shown in the figure below.

Because of the skewed distribution of the data, a Box-Cox transformation is used to improve the accuracy of the statistical model. This transformation improves the symmetry of the data as shown in the figure below.
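
As a rough illustration of what this transformation does, the sketch below applies the one-parameter Box-Cox formula and its inverse in plain Python. The lambda value (0.2) and the sample market values are assumptions made for illustration only; the article does not state the lambda used in the paper.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def inverse_box_cox(y, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)

# Hypothetical market values in euros (not taken from the paper).
values = [15_000, 400_000, 2_750_000, 5_310_000, 25_000_000]
LAMBDA = 0.2  # assumed; in practice lambda is usually estimated from the data

transformed = [box_cox(v, LAMBDA) for v in values]
restored = [inverse_box_cox(t, LAMBDA) for t in transformed]
```

The transformed values span a far narrower range than the raw euro amounts, which is why the distribution becomes more symmetric. The inverse function recovers the original values, so a prediction made on the transformed scale can be converted back to euros.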

Feature Selection and Model Selection

The dataset contains 29 features related to the performance of soccer players, not all of which may be useful for the model's predictions. Too many features can not only be computationally time consuming, but can also negatively affect the accuracy of the predictions.

In this study, Boruta is used for feature selection. This algorithm is a random forest-based method for identifying important features. It compares the importance of each real feature with that of randomly shuffled copies, called shadow features, and repeats this comparison over many iterations to decide which features are genuinely important. The best features are selected while maintaining computational efficiency.
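
The shadow-feature idea can be sketched in a few lines. Note the simplifications: this is not the Boruta algorithm itself. Absolute Pearson correlation is swapped in for random forest importance, and a fixed hit-rate threshold replaces Boruta's statistical test. The function names, the threshold, and the synthetic data are all assumptions for illustration.

```python
import random

def pearson_abs(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / (vx * vy) ** 0.5

def shadow_select(features, target, n_iter=20, min_hit_rate=0.9, seed=0):
    """Keep features that beat the best shuffled 'shadow' copy in most rounds."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        shadow_scores = []
        for column in features.values():
            shadow = column[:]       # copy, then shuffle to destroy any real signal
            rng.shuffle(shadow)
            shadow_scores.append(pearson_abs(shadow, target))
        threshold = max(shadow_scores)
        for name, column in features.items():
            if pearson_abs(column, target) > threshold:
                hits[name] += 1
    return [name for name, h in hits.items() if h / n_iter >= min_hit_rate]

# Synthetic data: 'ball_control' drives the target, 'noise' does not.
rng = random.Random(42)
ball_control = [rng.uniform(40, 95) for _ in range(200)]
noise = [rng.uniform(0, 1) for _ in range(200)]
value = [3.0 * b + rng.gauss(0, 1.0) for b in ball_control]

selected = shadow_select({"ball_control": ball_control, "noise": noise}, value)
```

A feature is kept only if it repeatedly scores higher than the strongest shadow feature, which is the core comparison the paragraph above describes.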

The study also evaluates several learning algorithms to select the best model for predicting the market value of players. These include Random Forest, AdaBoost, LightGBM, GBDT, CatBoost, and XGBoost.

In addition, this study employs an approach that integrates multiple models using ensemble learning. Ensemble learning has the advantage of combining predictions from multiple models, which results in higher accuracy than a single model. This approach is expected to reduce model bias and variance and improve overall forecast performance.

Development and Evaluation of Forecasting Models

In developing the predictive model, we first randomly split the data set, allocating 80% for training and validation and the remaining 20% for testing. In addition, missing-value imputation and feature selection are performed only on the training set, so that no information from the test set leaks into these steps.
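
A minimal sketch of this split, assuming mean imputation (the article does not say which imputation method the paper uses). The function name and the toy rows are hypothetical. The key point is that the fill-in values are computed from the training rows only and then reused for the test rows.

```python
import random

def split_and_impute(rows, test_fraction=0.2, seed=0):
    """Shuffle, split 80/20, then fill missing values using training means only."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    n_features = len(rows[0])
    train_means = []
    for j in range(n_features):
        observed = [r[j] for r in train if r[j] is not None]
        train_means.append(sum(observed) / len(observed))

    def fill(row):
        # Missing entries get the training mean, even for test rows.
        return [train_means[j] if v is None else v for j, v in enumerate(row)]

    return [fill(r) for r in train], [fill(r) for r in test], train_means

# Hypothetical rows of (sprint_speed, stamina); None marks a missing value.
rows = [[60 + i, None if i % 7 == 0 else 50 + i] for i in range(50)]
train, test, means = split_and_impute(rows)
```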

To maximize the performance of each ensemble learning model, the hyperparameters are tuned using a combination of five-fold cross-validation and grid search.
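
The tuning loop can be sketched as follows. To keep the example dependency-free, a one-parameter ridge regression stands in for the ensemble models, and the candidate grid is made up; neither comes from the paper. Every grid value is scored by its average validation RMSE over five folds, and the value with the lowest score is kept.

```python
def k_fold_indices(n, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def fit_ridge_slope(xs, ys, alpha):
    """One-parameter ridge regression through the origin (toy model)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def cv_rmse(xs, ys, alpha, k=5):
    """Average validation RMSE of the toy model across k folds."""
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        w = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], alpha)
        mse = sum((ys[i] - w * xs[i]) ** 2 for i in val) / len(val)
        scores.append(mse ** 0.5)
    return sum(scores) / len(scores)

# Synthetic data with an exact linear relationship y = 2x.
xs = [float(i) for i in range(1, 51)]
ys = [2.0 * x for x in xs]

grid = [0.0, 1.0, 10.0, 100.0]            # hypothetical candidate values
results = {alpha: cv_rmse(xs, ys, alpha) for alpha in grid}
best_alpha = min(results, key=results.get)
```

With real data the rows would normally be shuffled before the folds are formed; contiguous folds are used here only to keep the sketch short.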

We also measure the accuracy of the forecasting models with several metrics. In particular, we use the coefficient of determination (R²) and the root mean squared error (RMSE) to evaluate predictive performance from multiple perspectives. R² indicates how well the independent variables explain the variation in the dependent variable, and RMSE indicates the magnitude of the prediction error. The combination of these indicators provides an overall assessment of the model's accuracy.
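
For reference, both metrics can be computed directly from their standard definitions. The numbers below are made up to keep the arithmetic easy to follow.

```python
def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual variation / total variation."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values: three perfect predictions and one that is off by 2.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

error = rmse(y_true, y_pred)       # sqrt(4 / 4) = 1.0
score = r_squared(y_true, y_pred)  # 1 - 4 / 5, approximately 0.2
```

RMSE is expressed in the same unit as the target, which is why the RMSE values reported for market value are in the millions, while R² is unitless.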

Interpretation of Predictive Models

Machine learning models are treated as black boxes, and it is difficult to understand which factors influence their predictions, especially when evaluating the market value of a player. To solve this problem, Lundberg and Lee proposed an approach called "SHAP (SHapley Additive exPlanations)". It uses "Shapley values" based on game theory to reveal how the model is making predictions, allowing visual interpretation of the impact of each feature.

The study first uses the SHAP beeswarm plot and feature importance measures for global interpretation. The beeswarm plot visually shows how each feature affects the forecast and ranks the features by importance. In the plot, features are lined up on the y-axis and SHAP values are displayed on the x-axis. Red indicates high feature values and blue indicates low feature values, allowing the user to see at a glance how much positive or negative impact each has on the forecast.

For local interpretation, a "SHAP force plot" is used to explain the predicted market value of individual players. The force plot visualizes how each feature contributed to the final forecast result and graphically represents the flow from the base value (the average value of the forecast) to the final forecast. Features that push the forecast higher are shown in red and features that push it lower are shown in blue, allowing for a detailed understanding of which factors affected the player's market value and how.
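
The additive structure behind a force plot can be shown with an exact Shapley computation on a tiny model. This is a sketch, not the SHAP library: the model, the ratings, and the single "average player" reference point are all hypothetical, and real SHAP explainers average over a background dataset and use faster algorithms for tree models.

```python
from itertools import permutations

def shapley_values(predict, x, background):
    """Exact Shapley values; 'absent' features are replaced by background values."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(background)          # start with every feature absent
        prev = predict(current)
        for i in order:
            current[i] = x[i]               # reveal feature i
            new = predict(current)
            phi[i] += new - prev            # marginal contribution in this ordering
            prev = new
    return [p / len(orderings) for p in phi]

# Hypothetical model with an interaction term; inputs are made-up ratings.
def predict(f):
    ball_control, reactions, sprint_speed = f
    return 0.05 * ball_control + 0.03 * reactions + 0.0002 * ball_control * sprint_speed

background = [65.0, 62.0, 66.0]   # assumed "average player" reference point
player = [88.0, 85.0, 78.0]

base_value = predict(background)
phi = shapley_values(predict, player, background)
```

Here `base_value + sum(phi)` equals `predict(player)`, which is the flow from base value to final forecast that a force plot draws, with one segment per feature.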

We also use a "Partial Dependence Plot" (PDP) to better understand the impact of each feature on the forecast results. The PDP shows how the value of a particular feature affects the forecast, averaged over the other features, and thus assesses independently how strongly that feature is involved in the market value. It can further clarify how a particular factor contributes to a player's valuation.
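
The averaging step behind a PDP is simple enough to write out by hand. In this sketch the model and the three data rows are hypothetical: the feature of interest is forced to each grid value in every row, and the predictions are averaged.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value   # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical model and data: columns are (ball_control, sprint_speed).
def predict(row):
    return 2.0 * row[0] + 0.5 * row[1]

rows = [[60.0, 70.0], [75.0, 80.0], [90.0, 65.0]]
grid = [50.0, 60.0, 70.0, 80.0]
curve = partial_dependence(predict, rows, feature_index=0, grid=grid)
```

Because this toy model is linear, the curve rises by the same amount for each step of 10 in ball control. For a tree ensemble such as GBDT the curve is generally not a straight line, which is what makes the plot informative.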

In this way, we have devised a way to use SHAP to interpret the internals of the model and to gain a more detailed understanding of the factors that influence the market value of players.

Experiment

The comprehensive design, which includes data collection, feature selection, model development, validation, and model evaluation and interpretation, is shown in the figure below.

Note that the paper uses Kaoru Mitate's face image as the subject of the analysis, but the text gives the name as "Teruki Miyamoto". This is considered an error, because the market valuation shown is significantly lower than Kaoru Mitate's actual value.

Feature Selection

The feature selection phase of the analysis uses 29 features related to the performance of soccer players. Boruta is run through the BorutaShap package in Python to automatically select the features that are important to the model. As a result, 22 features are selected. These 22 features are evaluated as the factors that most influence the market value of the players and are indicated by the green bars in the figure below.

The selected characteristics are acceleration, heading accuracy, defensive awareness, vision, volley shooting, sprint speed, long passes, positioning, standing tackles, dribbling, free kick accuracy, short passes, interceptions, penalties, finishing, reaction, ball control, stamina, crossing, strength, shooting ability, and sliding tackles.

These characteristics contribute to machine learning models as important indicators for accurately assessing player performance and market value.

Model Evaluation

The results of the cross-validation analysis and the evaluation of the test set are shown in the table below. The best performing of the six learning algorithms was the Gradient Boosting Decision Tree (GBDT) model, with the highest value of R²=0.889. The CatBoost model came in second with R²=0.887, followed by LightGBM in third place with R²=0.885. Random Forest and XGBoost recorded R²=0.877 and R²=0.861, respectively, while AdaBoost showed the lowest result with R²=0.773.

The RMSE results also show that the GBDT model performs the best, with the lowest RMSE of 3221632.175. CatBoost (RMSE=4715039.662), LightGBM (RMSE=3249280.179), Random Forest (RMSE=3505068.837), and XGBoost (RMSE=3320149.832) followed, with AdaBoost showing the largest error at RMSE=4442839.041.

In particular, on the test set, the GBDT model maintains a high predictive performance of R²=0.901 and RMSE=3221632.175, showing very high accuracy and reliability in predicting the market value of players. The GBDT model outperforms other models in predicting market value.

Model Interpretation

This study applies SHAP beeswarm plots and feature importance to the GBDT model to identify the features that most influence a player's market value.

As a result, nine characteristics were identified as particularly important: ball control, reaction, short passing, sprint speed, finishing, interception, dribbling, sliding tackle, and acceleration. These factors have been shown to have a significant impact in predicting a player's market value.

Furthermore, a detailed study of the market value projections for Ángel Fabián and Ivan Perišic confirms that the GBDT model is accurate compared to the actual situation, as shown in the figure below.

For example, Ángel Fabián's projected market value after Box-Cox conversion is approximately €6 million, very close to the actual market value of €5.31 million. Ivan Perišic also has a projected market value of approximately €2.5 million, close to the actual value of €2.75 million.

The results of the Partial Dependence Plot (PDP) analysis for these characteristics are shown in the figure below. The PDP confirms that characteristics such as ball control, reaction, and sprint speed have a significant impact on the prediction as a player's market value increases. This indicates that these characteristics are important factors directly related to a player's market valuation.

Summary

This study builds an ensemble machine learning model that focuses on the most important factors affecting player performance. Traditional statistical methods and machine learning models have limited prediction accuracy and make it difficult to understand in detail how each feature contributes to a prediction. This study improves on both points.

Using publicly available data from SoFIFA, the study develops a highly accurate ensemble machine learning model based on the features selected by the Boruta algorithm, and uses SHAP to reveal the internal structure of the model and the importance of each feature. Assessing the impact of these features on the market value of players and on transfers provides very important information for club management.

It also evaluates players based on three key traits: skill, fitness, and cognition. In terms of skill, ball control, short passing, and finishing are identified as important factors, while in fitness, sprint speed and acceleration are found to have a significant impact. Additionally, in cognition, reaction has been identified as the most influential trait. Clubs can gain important clues to make more accurate decisions in player evaluation and transfers.

In forecasting market values, the model also produces very accurate results that closely match actual values. However, because a Box-Cox transformation is used to improve the accuracy of the forecasts, there is some complexity involved in interpreting the forecast results. It should be noted that an inverse transformation must be performed to convert a forecast back to the original market value.

This study demonstrates new possibilities for player evaluation using machine learning, especially the Gradient Boosting Decision Tree (GBDT) model, which has been confirmed to have high predictive accuracy. It reveals a method for evaluating the market value of players based on important characteristics such as skill, fitness, and cognition, and is expected to be used in the future.

Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.