Meta Achieves Unexpected Improvements In Bayesian Optimization

Bayesian Optimization 19/02/2024

3 main points
✔️ Bayesian optimization is an efficient optimization technique for systems with high evaluation costs
✔️ Computing the expected amount of improvement (EI) can reduce the trial-and-error required to improve the solution
✔️ Solving EI defects when parameters are high-dimensional achieves unexpected improvements

Unexpected Improvements to Expected Improvement for Bayesian Optimization
written by Sebastian Ament, Samuel Daulton, David Eriksson, Maximilian Balandat, Eytan Bakshy
(Submitted on 31 Oct 2023 (v1), last revised 18 Jan 2024 (this version, v2))
Comments: NeurIPS 2023 Spotlight
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Bayesian optimization is an efficient optimization technique for systems with high evaluation costs. It has applications in aerospace engineering, biology, pharmacy, materials engineering, robotics, hyperparameter tuning in machine learning, and many other areas.

Here, high cost of evaluation means that the evaluation takes a long time, costs a lot of money, or requires a lot of time and effort to evaluate. For example, if it takes one week to make and evaluate a prototype of a certain product, it will take one week for each evaluation, the money to make the prototype, and the time and effort of the person making and evaluating it. Repeated trial and error like this will pay a lot of costs.

Therefore, Bayesian optimization is an optimization technique that meets the need to arrive at a better product with as few evaluations as possible.

In Bayesian optimization, the technology involves creating a prediction model that predicts the correspondence between prototype conditions and evaluation results, estimating the goodness of the prototype conditions based on the prediction model, actually evaluating the prototype conditions estimated to be the best at the current stage, and repeatedly incorporating that data into the prediction model.

The function that represents the goodness of the prototype conditions is called the acquisition function in Bayesian optimization. A typical example of such an acquisition function is the Expected Improvement (EI).

In the paper described in this issue, we identified a problem in which the value of EI and its gradient disappear when there are many values to be set as prototype conditions and when good evaluation results are concentrated around certain conditions, and showed that this gradient disappearance stops solution improvement. Researchers at Meta (formerly Facebook) have proposed a logEI that avoids this gradient loss, and numerical experiments have shown that it improves performance.

Incidentally, the title of this article, "Unexpected Improvement," is derived from the title of the paper we are explaining here, UNEXPECTED IMPROVEMENTS. Expected improvement is "Expected Improvement" in English, and since the improvement of EI this time resulted in unexpected performance improvement, I think I tried to make the title more impactful by using the word "Expected Improvement".

In this section, we will explain the conventional method of EI and its problems, and our proposed method of logEI, which solves these problems, and its evaluation results.

Conventional method Expected improvement (EI)

As mentioned in the introduction, Bayesian optimization uses an acquisition function to calculate the goodness of the prototype conditions (hereafter referred to as parameters). A typical example of such an acquisition function is the Expected Improvement (EI).

For a given parameter value, the Bayesian optimization forecasting model returns two values: a prediction value and a prediction uncertainty. For example, there can be "a parameter with good predictive value and low predictive uncertainty" and "a parameter with good predictive value but high predictive uncertainty".

At first glance, the parameters that are "good predictors" and have "low prediction uncertainty" appear to be the best conditions. This is true if you have to choose a final condition to apply to your product right now.

But what if it is not yet time to narrow down the parameters, but rather to explore possibilities to see if there is room for significant improvement? The term "low forecast uncertainty" has both a negative aspect that the forecast is unreliable and may actually be a bad assessment result, and a positive aspect that there is a lot of room for growth.

Therefore, it seems efficient to evaluate the prototype condition that is most likely to improve the current best evaluation result on average, without looking only at the growth potential or the goodness of the current prediction value. The expected amount of improvement is the average value of the amount of improvement from the best current evaluation result, calculated from the prediction model.

Due in part to its clarity as an indicator, the expected improvement is a very popular Bayesian optimization acquisition function.

Failure of expected improvement EI when parameters are high dimensional

When using EI as the acquisition function, we look for parameter values that maximize EI, and an optimization method called the gradient method is used to perform this maximization. The gradient method is also called the mountain climbing method and can be compared to mountain climbing.

In other words, if we think of optimization as a method for reaching the top of a mountain called the acquisition function, the direction in which the slope of the mountain is the most severe from the current point is the direction in which we can advance to the highest direction (that is, the direction in which we can reach the top of the mountain most quickly), and if the slope is zero, that is the top of the mountain. When the gradient reaches zero, that is the top of the mountain.

Since we rely on this gradient to reach the top of the mountain, it is important to be able to calculate the gradient properly.

However, this paper identifies a problem in which the gradient of the EI disappears when the parameters are of high dimension and the parameter values with good evaluation results are concentrated in certain regions.

One of the artificial problems used to evaluate Bayesian optimization methods is the Ackley function. This function is exactly the kind of function that concentrates the parameter values that yield good evaluation results in a certain region.

Figure 1 shows the percentage of cases where the gradient is 0 (defined in this paper as less than or equal to the -10th power of 10) when the values of the parameters of this function are set to random values. In the figure, d is the dimensionality of the parameters (the number of parameters to be set), the horizontal axis is the number of training data for the forecast model, and the vertical axis is the percentage of cases where the gradient is zero.

Figure 1: Disappearance of the gradient of EI

As shown in the figure, the greater the number of training data and the greater the number of dimensions, the greater the percentage of gradients that are zero. In particular, when the number of training data is 80 or more and d is 4 or more, the gradient becomes 0 almost 1 (100%) of the time.

Figure 2 shows how the gradient of LogEI for the usual EI and the proposed method varies with parameter x. This is a simple optimization problem with one-dimensional parameters. Note that the horizontal axis is the parameter value and the vertical axis is log(EI) or LogEI. log(EI) and LogEI are different, with log(EI) being the conventional and LogEI being the proposed method.

Figure 2: Difference between normal EI and LogEI

The value of LogEI (solid green line) for the proposed method is not zero, but EI (dotted orange line) for the conventional method is zero in the pink region of the figure (since EI is zero, a value not defined by the logarithmic function log is entered, and the dotted orange line cannot be drawn).

LogEI solves this conventional EI gradient disappearance problem (where EI becomes zero over a wide range).

Proposed Method LogEI

The proposed method, LogEI, is quite simple: it is a composite function that substitutes the function of EI as an input for the logarithmic function log. Therefore, mathematically, the value can be considered as just a logarithmic calculation of EI.

We could not find any reference in this paper to why we focused on Log, but an intuitive reason would seem to be the following.

A logarithmic function is a function whose output swings sharply in a negative direction as the input gets smaller, and whose value increases very slowly as the input gets larger. It can be thought of as a function that expands small values and suppresses large values. Therefore, we can think of it as a magnifying glass that magnifies small gradients, and as a result, mitigates gradient disappearance.

However, a simple numerical implementation such as log(EI) will result in EI being 0 when the forecast model has almost no improvement in forecasting and the forecast uncertainty is very small. In this case, log(EI) becomes undefinable because the logarithmic function is not defined for input 0. Mathematically, EI cannot be zero, but numerically it can be zero.

Therefore, in order to compute log(EI) numerically, stably and accurately, this paper proposes an implementation of Equation 1 as LogEI.

This is mathematically equivalent to log(EI). For log_h(z) in Equation 1, the equation for z > -1 corresponds to a simple implementation of log(EI). Incidentally, what is entered as z is the "amount of improvement" = "predicted value µ(x) - current best value "" as the uncertainty of the prediction (also called the prediction standard deviation The values in log are the standardized normal density function and cumulative distribution function, respectively.

The simple implementation does not have problems in the case z>-1, but problems arise in the case -1≥z, which is why the equations in this paper are seemingly very complex calculations. The log1mexp and erfxc in this are numerically stable implementations of log(1-exp(z)), exp(z*z)erfc(z), where erfc is the complementary error function.

Evaluation results from numerical experiments

The artificial problem (objective function to be optimized) for evaluating the proposed method is shown in Equation 2.

Equation 2. artificial problem for method evaluation (objective function to be optimized)

It is a sum of squares optimization problem with a parameter dimension of 10. This problem is a very simple optimization problem.

This time, 20 evaluation experiments were conducted from 1024 random initial values.

The evaluation results are shown in Figure 3. The horizontal axis is the number of evaluations, and the vertical axis is the reglet (the difference between the evaluation result of the true optimal solution and the tentative best evaluation result obtained by the method, the smaller the better). The solid line in the figure is the average value of the 20 evaluation experiments, and the light shading indicates ±2 standard deviations from the average.

Figure 3: Evaluation results from numerical experiments of LogEI.

As shown in the figure, EI shows no improvement in riglet after the number of evaluations exceeds 75. In contrast, the riglet of the proposed method, LogEI, continues to decrease as the number of evaluations increases.

Even very simple optimization problems have been shown to be cases where normal EI does not optimize well and LogEI does.

Conclusion

In this issue, we described a paper that identifies the issue of expected improvement EI, a typical Bayesian optimization acquisition function, when the parameters are high-dimensional, and solves the problem simply by LogEI, which computes the logarithm of EI.

Identifying issues is very important in improving our methodology. Once the issues are identified, the problem can be solved simply, as in this paper. Although we have focused on EI in this paper, the paper introduced here also mentions other derived methods based on expected improvement, which has a very broad impact and is information that you should know when using Bayesian optimization.

In addition, this time the topic is closer to implementation due to a problem that is mathematically the same as the conventional expected improvement, but occurs when trying to compute it numerically, i.e., on a computer. This suggests that practical success requires not only mathematical thinking in algorithms, but also careful thinking in numerical calculations.

Categories related to this article

Yokota