Machine Suggestion Of Optimal Strategies: A System That Recommends Strategies That Meet Advertisers' Objectives Is Now Available

Reinforcement Learning

3 main points
✔️ A prototype of the Strategic Recommender System has been deployed on the Taobao (a Chinese online shopping site) display advertising platform.
✔️ The prototype is further enhanced by explicitly learning advertisers' preferences over various ad performance metrics, and their optimization goals, from their adoption of recommended ad strategies.
✔️ The designed algorithm was shown to effectively optimize advertiser strategy adoption rates.

We Know What You Want: An Advertising Strategy Recommender System for Online Advertising
written by Liyi Guo, Junqi Jin, Haoqi Zhang, Zhenzhe Zheng, Zhiye Yang, Zhizhuang Xing, Fei Pan, Lvyin Niu, Fan Wu, Haiyang Xu, Chuan Yu, Yuning Jiang, Xiaoqiang Zhu
(Submitted on 25 May 2021 (v1), last revised 13 Jun 2021 (this version, v3))
Comments: Published on arXiv.

Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


For online advertising to succeed, the advertising platform must offer advertisers strategies that fit their goals. Taobao (a Chinese online shopping site) deployed a strategy recommender system to improve advertiser performance and platform revenue. The study makes online advertising more effective by learning advertisers' preferences and recommending advertising strategies accordingly. Using a contextual bandit algorithm, it learns advertiser preferences and maximizes strategy adoption at the same time.


Advertising is a major source of revenue for e-commerce platforms, and Taobao operates an intelligent display advertising system. However, advertisers are often unsure which strategy is optimal for them, and new advertisers tend to leave the platform. This study develops a strategy recommender system for advertisers that adopts an intuitive approach focused on matching products with users. It takes advertisers' preferences into account and suggests advertising strategies based on predicted performance. The prototype embedded in the Taobao platform uses a new learning algorithm.

Related Research

Recommender Systems

Recommender systems have been studied extensively as a way to reduce information overload and provide personalized service to users. Collaborative filtering, content-based filtering, and hybrid filtering model the relevance between users and items. Recommending to advertisers differs in two ways. First, the items are abstract advertising strategies, which are difficult to analyze with conventional feature extraction. Second, advertiser-side recommendations must not only reduce information overload but also optimize ad performance and deliver actual performance gains.

Real-time bidding

In real-time bidding, automated bidding algorithms are being studied to serve advertisers' differing advertising goals. In display advertising, however, advertisers have a wide range of goals, and there is currently no clear understanding of which goals are actually effective for them. This limits the performance of real-time bidding algorithms.

System design

First, the existing ad strategy recommender system is presented, and advertiser preferences and optimization goals are defined explicitly. Next, additional features of the prototype system are described. Finally, the advertising strategy recommendation problem is formulated as a contextual bandit, and an efficient solution is given.

Prototype recommendation system

The Taobao display advertising platform has deployed a recommender system that helps advertisers optimize their advertising strategies. The system includes a bid optimization module and a target user optimization module, which recommend bids and specific users to target. A/B testing in 2020 showed a 1.2% increase in average revenue per user (ARPU). However, the system is still at an early stage, and its recommendations need to reflect advertiser preferences and objectives more closely. The proposed strategy recommender system aims to focus on ad performance and provide a personalized advertising experience.

Advertising Strategy Recommendation System

To improve on the prototype, the strategy recommendation problem is formulated and the design of a new system is described. Specifically, ad campaign performance is measured by several Key Performance Indicators (KPIs), and an advertiser's preferences are represented as a weight vector over these KPIs. To recommend the best bidding strategy, the recommendation module learns advertiser preferences and uses real-time bidding algorithms. The new framework learns both advertiser preferences and optimization goals from interactions with advertisers. This allows the platform to provide personalized ad strategies and to refine them through advertiser feedback.
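As a minimal sketch of the weight-vector idea, the snippet below scores candidate strategies by the weighted sum of their predicted KPIs and picks the highest-scoring one. The strategy names, the KPI values, and the helper names `utility` and `best_strategy` are hypothetical and chosen only for illustration; the source does not give the exact scoring rule.

```python
from typing import Dict, List


def utility(preference: List[float], kpis: List[float]) -> float:
    """Weighted sum of KPI values under an advertiser's preference vector."""
    if len(preference) != len(kpis):
        raise ValueError("preference and KPI vectors must have the same length")
    return sum(w * k for w, k in zip(preference, kpis))


def best_strategy(preference: List[float], candidates: Dict[str, List[float]]) -> str:
    """Return the name of the candidate strategy with the highest utility."""
    return max(candidates, key=lambda name: utility(preference, candidates[name]))


# Hypothetical, normalized KPI predictions per strategy: (impressions, clicks, GMV).
candidates = {
    "broad_reach": [0.9, 0.4, 0.2],
    "click_focus": [0.5, 0.8, 0.4],
    "sales_focus": [0.3, 0.5, 0.9],
}

gmv_oriented = [0.1, 0.2, 0.7]    # advertiser who mostly cares about GMV
reach_oriented = [0.8, 0.1, 0.1]  # advertiser who mostly cares about impressions

print(best_strategy(gmv_oriented, candidates))
print(best_strategy(reach_oriented, candidates))
```

Two advertisers looking at the same candidate strategies can receive different recommendations purely because their weight vectors differ, which is why the system needs to learn the vector rather than assume one.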

Contextual bandit modeling

The contextual bandit problem is introduced to model ad strategy recommendation. The agent (the ad strategy recommender system) estimates a preference vector for each advertiser's ad campaign and suggests the corresponding bidding strategy together with its predicted ad performance. The advertiser and campaign information serves as the state, the estimated preference vector as the action, and the advertiser's response (whether the recommendation is adopted) as the reward. The goal is to predict and maximize advertiser adoption, so that the agent keeps recommending strategies and learning from the feedback.
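The mapping onto a contextual bandit can be sketched as a simple interaction loop. Everything below is an illustrative assumption rather than the paper's implementation: the `Interaction` record, the `toy_policy` and `toy_environment` functions, and the two-dimensional context are invented to show how state, action, and reward fit together.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class Interaction:
    context: Vector  # state: advertiser and campaign features
    action: Vector   # action: preference vector estimated by the agent
    reward: int      # reward: 1 if the recommendation was adopted, else 0


def run_rounds(policy: Callable[[Vector], Vector],
               environment: Callable[[Vector, Vector], int],
               contexts: Sequence[Vector]) -> List[Interaction]:
    """Play one recommendation round per context and record the feedback."""
    history = []
    for context in contexts:
        action = policy(context)
        reward = environment(context, action)
        history.append(Interaction(context, action, reward))
    return history


def adoption_rate(history: Sequence[Interaction]) -> float:
    """Fraction of rounds in which the advertiser adopted the recommendation."""
    if not history:
        return 0.0
    return sum(step.reward for step in history) / len(history)


def toy_policy(context: Vector) -> Vector:
    # Hypothetical rule: use the first feature as the weight on GMV.
    gmv_weight = min(max(context[0], 0.0), 1.0)
    return [1.0 - gmv_weight, gmv_weight]


def toy_environment(context: Vector, action: Vector) -> int:
    # Hypothetical advertiser: adopts when the proposed GMV weight is within
    # 0.2 of a hidden true weight stored in the second feature.
    return 1 if abs(action[1] - context[1]) <= 0.2 else 0


contexts = [[0.9, 0.8], [0.2, 0.9], [0.5, 0.5], [0.1, 0.2]]
history = run_rounds(toy_policy, toy_environment, contexts)
print(adoption_rate(history))
```

A learning agent would replace `toy_policy` with a model that is updated from `history`, which is the part the algorithm design addresses.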

Algorithm Design

Standard contextual bandit algorithms are difficult to apply to advertising strategy recommendation. They typically handle a finite set of discrete actions, whereas here the action (the preference vector) lies in a high-dimensional, continuous space, and searching it is computationally expensive. To address this problem, the reward learning process is split into two steps.

The first step is to build a relationship between advertiser information and preferences. This is done using a multilayer perceptron model to obtain a preference vector based on advertiser adoption behavior. Next, the relationship between ad performance and preference vectors is established and modeled.

With this method, the model learns the relationship between advertiser adoption rates and preference vectors, and the action fed to the network (the preference vector w) is updated by gradient descent. This makes it possible to recommend advertising strategies in a complex continuous space.
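The idea of optimizing the action itself by gradient steps can be sketched as follows. This is only an illustration under stated assumptions: `toy_adoption_model` stands in for the trained network, the gradient is taken numerically instead of by backpropagation, and the update is written as gradient ascent on the predicted adoption probability (equivalent to gradient descent on its negative). `HIDDEN_PREFERENCE` and all function names are hypothetical.

```python
import math
from typing import Callable, List


def numerical_gradient(f: Callable[[List[float]], float],
                       w: List[float], eps: float = 1e-5) -> List[float]:
    """Central-difference gradient of f at w (a stand-in for backpropagation)."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def normalize(w: List[float]) -> List[float]:
    """Clip weights at zero and rescale so that they sum to one."""
    clipped = [max(x, 0.0) for x in w]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(w)] * len(w)
    return [x / total for x in clipped]


def optimize_preference(adoption_model: Callable[[List[float]], float],
                        w0: List[float], lr: float = 0.2,
                        steps: int = 200) -> List[float]:
    """Move the preference vector toward higher predicted adoption."""
    w = normalize(w0)
    for _ in range(steps):
        grad = numerical_gradient(adoption_model, w)
        w = normalize([x + lr * g for x, g in zip(w, grad)])
    return w


# Hypothetical stand-in for the learned reward model: predicted adoption is
# highest when w matches a hidden preference over (impressions, clicks, GMV).
HIDDEN_PREFERENCE = [0.1, 0.2, 0.7]


def toy_adoption_model(w: List[float]) -> float:
    return math.exp(-sum((x - t) ** 2 for x, t in zip(w, HIDDEN_PREFERENCE)))


w_star = optimize_preference(toy_adoption_model, [1 / 3, 1 / 3, 1 / 3])
print([round(x, 3) for x in w_star])
```

Searching by gradient steps avoids enumerating a continuous action space, which is the difficulty that rules out standard discrete-action bandit algorithms here.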

Action selection strategy

In advertising strategy recommendation, "exploration" means suggesting strategies based on new preference vectors, while "exploitation" means suggesting strategies based on preference vectors that have already been learned. Thompson sampling is an effective Bayesian method for managing this trade-off. Specifically, dropout is used to represent the uncertainty of the neural network: each forward pass with a random dropout mask acts as a randomly sampled hypothesis, so the system suggests advertising strategies while accounting for model uncertainty.
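A rough sketch of dropout used as approximate Thompson sampling is shown below. A single linear layer stands in for the neural network, and the weights, candidate features, and function names are hypothetical. In each round one dropout mask is sampled (one "hypothesis"), every candidate is scored under that mask, and the best-scoring candidate is recommended.

```python
import random
from typing import List, Sequence


def sample_mask(size: int, rate: float, rng: random.Random) -> List[float]:
    """Inverted-dropout mask: drop each unit with probability `rate`."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("dropout rate must be in [0, 1)")
    keep = 1.0 - rate
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(size)]


def predicted_score(weights: Sequence[float], mask: Sequence[float],
                    features: Sequence[float]) -> float:
    """Score of one candidate under one sampled mask."""
    return sum(w * m * x for w, m, x in zip(weights, mask, features))


def thompson_select(weights: Sequence[float],
                    candidates: Sequence[Sequence[float]],
                    rate: float, rng: random.Random) -> int:
    """Sample one mask, score all candidates with it, return the best index."""
    mask = sample_mask(len(weights), rate, rng)
    scores = [predicted_score(weights, mask, c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])


weights = [1.0, 1.0]                   # hypothetical learned weights
candidates = [[1.0, 0.0], [0.0, 0.9]]  # hypothetical candidate features

rng = random.Random(0)
greedy_picks = [thompson_select(weights, candidates, 0.0, rng) for _ in range(200)]
sampled_picks = [thompson_select(weights, candidates, 0.5, rng) for _ in range(200)]
print(sorted(set(greedy_picks)), sorted(set(sampled_picks)))
```

With a dropout rate of zero the selection is purely greedy and always returns the same candidate; with a positive rate the second candidate is also tried occasionally, which is the exploration behavior the dropout rate controls.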


The prototype recommender system was first validated in an online evaluation, and the proposed advertising strategy recommendation method was then evaluated extensively in simulation. The results are as follows.

(1) Online evaluations show the potential benefits of ad strategy recommendations to advertisers, and the recommender system helps advertisers optimize performance and increase platform revenue.
(2) The designed neural network was effective in accurately predicting advertiser preferences and optimizing adoption rates.
(3) The dropout trick effectively balanced exploiting existing preference information and exploring new preferences.
(4) Ablation studies confirmed the generalization ability of the bandit algorithm.

Online Evaluation

A prototype recommender system has been running on the Taobao display advertising platform since February 2020. The system consists of a bid optimization module, a target user optimization module, and an ad auction simulator, and it recommends strategies based on the bandit algorithm in response to advertisers' requests. From May 14 to 27, 2020, an online evaluation with A/B testing was carried out to measure the system's performance. The results showed that advertisers adopted the recommended strategies, which increased ARPU by 1.2% and improved ad campaign performance. However, some challenges remain, in particular that advertisers still need to select among the recommendations themselves, and there is room to improve ad performance further.

Simulation Settings

Because evaluating bandit algorithms is more difficult and costly than evaluating conventional machine learning models, many studies validate them in simulation environments. In the simulation used here, the bidding module optimizes ad performance under budget constraints, and the advertiser module simulates advertiser preferences. Advertiser adoption behavior follows a conditional logit model, so the probability of adoption rises with the utility of the recommended strategy. The paper also details the evaluation metrics and training parameters, including the optimization goal of the contextual bandit algorithm and how model performance is measured.
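The conditional logit model assigns each option a choice probability proportional to the exponential of its utility. The sketch below assumes, purely for illustration, that the advertiser chooses between the recommended strategy and keeping the current one; the exact choice set and utility definition used in the paper's simulator are not given in this article.

```python
import math
from typing import List


def conditional_logit(utilities: List[float]) -> List[float]:
    """Choice probabilities P(i) = exp(u_i) / sum_j exp(u_j)."""
    shift = max(utilities)  # subtract the maximum for numerical stability
    exps = [math.exp(u - shift) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def adoption_probability(recommended_utility: float, current_utility: float) -> float:
    """Probability that the advertiser adopts the recommended strategy
    (assumed two-option choice set: adopt or keep the current strategy)."""
    return conditional_logit([recommended_utility, current_utility])[0]


print(round(adoption_probability(0.0, 0.0), 3))  # equal utilities
print(round(adoption_probability(1.0, 0.0), 3))  # recommendation is more useful
```

When the two utilities are equal the advertiser adopts with probability 0.5, and the probability grows smoothly as the recommended strategy becomes more useful than the current one.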

Experimental results

・Survey of advertisers' advertising performance areas

This experiment briefly examines advertisers' advertising performance in Taobao's online advertising environment. Maximizing total impressions, total clicks, and GMV were chosen as typical advertiser goals, and each advertiser's preferences were represented as a vector. The results show that optimizing these goals in the ad auction simulator significantly improved each advertiser's performance, which demonstrates that understanding advertiser preferences is important for optimizing ad performance.

・Results of Comparative Experiments

Comparative experiments on models with different dropout rates, and with no dropout, demonstrate the effectiveness of the proposed contextual bandit algorithm. A baseline that recommends strategies using random preference vectors is also implemented. In each experiment, the agent interacts with the environment for 2000 rounds, and the cumulative expected regret and cumulative adoption rate are recorded periodically. The results are presented in Table 3.

Table 3 shows that recommending with random preferences causes a significant drop in performance, which motivates taking advertiser preferences into account when recommending strategies. Even without the dropout trick, a model that explicitly learns advertiser preferences reduces the cumulative expected regret by 25.71% compared with a recommendation strategy that has no learning module. The contextual bandit algorithm with dropout is more effective than the one without. As the dropout rate increases (from 20% to 40%, 60%, and 80%), model performance first improves and then declines, for two reasons:

(1) If the dropout rate is low, the model adopts a conservative exploration strategy.
(2) If the dropout rate is high, the model explores the action space too often, so it underuses what it has learned and performance degrades.

Figure 4 shows the cumulative expected regret and cumulative adoption rate curves for models with various dropout rates as the number of interactions grows. To make differences in performance after convergence easier to see, the cumulative regret is rescaled with the function y = log(x + 1). Figure 4 shows that the models converge to different local optima, that the growth of cumulative expected regret slows in every model until it converges, and that the cumulative adoption rate rises gradually before converging. Compared with the recommendation strategy without a learning module, all of these models improve the recommender system by learning at least part of the advertisers' preferences.

In the ablation experiment, a model with a 40% dropout rate was compared with the same model without preference-related information (using only past adoption information); the results are shown in Figure 5. The model with preference-related information does better on both cumulative expected regret and cumulative adoption rate, suggesting that the model learns advertiser preferences and that this improves its generalization performance.
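The two quantities plotted in Figure 4 can be sketched as follows, assuming the usual definition of regret as the gap between the expected reward of the best action and that of the chosen action in each round. The per-round values are made up for illustration and are not results from the paper.

```python
import math
from itertools import accumulate
from typing import List, Sequence


def cumulative_regret(optimal: Sequence[float], obtained: Sequence[float]) -> List[float]:
    """Running sum of (optimal expected reward - obtained expected reward)."""
    return list(accumulate(o - r for o, r in zip(optimal, obtained)))


def log_scale(values: Sequence[float]) -> List[float]:
    """Rescale with y = log(x + 1), which compresses large cumulative values."""
    return [math.log(v + 1.0) for v in values]


# Hypothetical expected adoption probabilities over three rounds.
optimal = [1.0, 1.0, 1.0]
obtained = [0.5, 1.0, 0.25]

regret = cumulative_regret(optimal, obtained)
print(regret)
print([round(v, 3) for v in log_scale(regret)])
```

Because cumulative regret can only grow, the logarithmic rescaling keeps late-stage differences between models visible instead of letting the early, steep part of the curve dominate the plot.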


This study focused on strategy recommendation for online advertising and demonstrated, through A/B testing, the benefit of recommending strategies to advertisers. It proposed an approach that uses advertisers' adoption behavior to learn their preferences and optimize strategy adoption rates, and it employed a dropout trick to address the bandit problem. Simulation experiments showed that the system succeeds in optimizing adoption rates.

The study takes a very interesting approach and demonstrates the benefit of strategy recommendation for online advertising through A/B testing. In particular, learning advertisers' preferences and using them to optimize strategy adoption rates is a promising route to more effective advertising optimization.

Taking advertisers' adoption behavior into account is realistic for actual business settings and makes the system more flexible in adapting to advertisers' intentions. Also noteworthy is the use of the dropout trick to address the bandit problem, which allows flexible and effective decision making in complex situations.

The simulation results, which show that the proposed method optimizes the adoption rate, are promising for practical deployment. However, future research and practice will need to demonstrate through real-world A/B testing how effective the method is in actual advertising environments. It is hoped that the research will provide practical insights and open new avenues for optimizing advertising strategies.
