Automated Creative Optimization for E-Commerce Advertising
Abstract.
Advertising creatives are ubiquitous in e-commerce, and aesthetically appealing creatives may improve the click-through rate (CTR) of products. Smart advertising platforms now provide the service of compositing creatives from source materials supplied by advertisers. Since a great number of creatives can be generated, it is difficult to predict their CTR accurately given a limited amount of feedback. Factorization machines (FM), which model the inner-product interaction between features, can be applied to CTR prediction for creatives. However, interactions between creative elements may be more complex than the inner product, and the FM-estimated CTR may have high variance due to limited feedback. To address these two issues, we propose an Automated Creative Optimization (AutoCO) framework to model complex interactions between creative elements and to balance exploration and exploitation. Specifically, motivated by AutoML, we propose one-shot search algorithms for finding effective interaction functions between elements. We then develop stochastic variational inference to estimate the posterior distribution of the parameters based on the reparameterization trick, and apply Thompson Sampling to efficiently explore potentially better creatives. We evaluate the proposed method on both a synthetic dataset and two public datasets. The experimental results show that our method outperforms competing baselines with respect to cumulative regret. An online A/B test shows that our method leads to a 7% increase in CTR compared to the baseline.
1. Introduction
Online advertisements are ubiquitous in everyday life and create considerable revenue for many e-commerce companies. As a common advertising medium, advertising creatives, as shown in Fig. 1, deliver rich product information to users quickly and visually. Appealing creatives improve the visual experience and may lead to an increase in click-through rate (CTR), as evidenced by (Azimi et al., 2012; Cheng et al., 2012). For merchants and e-commerce companies, an increase in CTR can be read as an indicator of increased revenue. Therefore, much attention has been paid to creative design for improving the visual experience.

Traditionally, advertisers have to employ expert designers to produce attractive creatives and then submit them to platforms as complete images. Each time new products are announced or old products are updated, many creatives of different sizes and styles must be designed and submitted to advertising platforms, which is expensive for many advertisers. To reduce the cost of this repetitive but professional design work, several high-tech companies have set up intelligent advertising platforms (Hua, 2018) that provide instant production services for advertising creatives and remarkably lighten the burden on advertisers. Advertisers only need to provide basic materials to the platform, such as product pictures and textual information. Based on these source materials, the production system composites designated elements, such as templates, text colors, and picture sizes, to produce advertising creatives automatically.
To ensure the quality of generated creatives, on one hand, they should satisfy basic visual constraints, which is not the focus of this paper. On the other hand, they should be clicked with high probability (i.e., achieve a high click-through rate) when they are advertised. The latter is intrinsically an optimal selection problem, which faces the following challenges. First, the combinatorial composition of elements leads to an exponential explosion in the number of candidate creatives. Second, because of limited advertising budgets, each product is usually displayed only a few times a day; apportioned across its many generated creatives, the feedback becomes extremely sparse. Furthermore, creatives in e-commerce change frequently over time, so cumulative feedback for out-of-date products may no longer be useful. According to our statistics, a popular advertising position sees more than 4 million new creatives per day. It is therefore extremely difficult to estimate the click-through rate of each generated creative accurately.
It is possible to apply factorization machines (FM) (Rendle, 2010) to predict the click-through rate (CTR) of each creative. FM models interactions between creative elements via the inner product, so creatives with similar composited elements have similar representations. As a consequence, FM can alleviate the sparsity issue to some extent. However, interactions between creative elements may be much more complex than the inner product; indeed, we empirically observe that the inner product does not work best for modeling interactions between elements. Moreover, the estimated CTR of each creative may have high variance due to the extremely sparse feedback. Greedily advertising the creative with the maximal predicted CTR is usually suboptimal, so it is essential to efficiently explore potentially better creatives by exploiting both the CTR prediction and its uncertainty.
To address these two issues, we propose an Automated Creative Optimization (AutoCO) framework to model complex interactions between creative elements and to strike a balance between exploration and exploitation. In particular, inspired by Automated Machine Learning (AutoML) (Hutter et al., 2019), we propose one-shot search algorithms for efficiently searching for effective interaction functions between creative elements. The family of interaction functions to search is defined by extending the multiply operator in the inner product to the operator set {max, min, plus, multiply, concat} over operation-aware embeddings, and replacing the sum pooling operation with a fully connected network. Then, following the reparameterization trick of VAEs (Kingma and Welling, 2014), we develop stochastic variational inference to estimate the posterior distribution of the parameters. Armed with this posterior, we apply Thompson Sampling (Blundell et al., 2015) to efficiently explore potentially better creatives.
The contributions in this paper are summarized as follows.
• We propose an Automated Creative Optimization (AutoCO) framework for the optimal selection of composited e-commerce creatives. The framework simultaneously models complex interactions between creative elements and strikes a balance between exploration and exploitation, successfully addressing the sparsity of feedback.

• We empirically observe that the inner product is suboptimal for modeling interactions between elements, and accordingly propose a one-shot search algorithm for efficiently searching for effective interaction functions.

• Based on the reparameterization trick, we develop stochastic variational inference to estimate the posterior distribution of the parameters, making the model amenable to Thompson Sampling for efficiently exploring potentially better creatives.

• Experiments on both a synthetic dataset and two public datasets show that our method outperforms competing baselines in terms of cumulative reward and regret, indicating the effectiveness of complex interaction modeling. An online A/B test shows that our method yields a 7% increase in CTR, confirming its superiority over the baselines.
2. Related Work
2.1. Similar Tasks
With the rapid development of the Internet, recommender systems have been proposed to alleviate information overload in scenarios such as online advertising (Zhou et al., 2018) and point-of-interest recommendation (Lian et al., 2020c). Creatives are ubiquitous in online advertising. Some works have addressed CTR prediction for ad creatives (Chen et al., 2016; Mo et al., 2015; Liu et al., 2020a; Yang et al., 2019; Zhao et al., 2019) by extracting expressive visual features. However, few works consider the optimization of advertising creatives given limited feedback. In classic recommender systems, negative samplers (Lian et al., 2020a) are utilized to select informative samples to combat data sparsity, and product quantization methods (Lian et al., 2020b) have been used for lightweight recommendation.
For online advertising and recommendation, several similar tasks have been studied. LinUCB (Li et al., 2010), an efficient contextual bandit, achieved great success in personalized news recommendation. Whole-page optimization, which aims to select the best layout of a home page, is usually formulated as a bandit problem (Wang et al., 2016). An efficient multivariate bandit algorithm (Hill et al., 2017) was proposed for this problem and shows superior performance in real online settings. These tasks all exploit bandit algorithms to address the exploration-exploitation dilemma.
2.2. Success of AutoML for recommendation
Automated machine learning (AutoML) has achieved some success in recommender systems. Unlike regular AutoML tasks such as image classification, the search space here is more diverse than neural architectures alone (Zhong et al., 2018; Zoph and Le, 2016; Zoph et al., 2018). Searching for embeddings of categorical input features has been exploited for large-scale recommendation (Joglekar et al., 2020), and searching for high-order features has been shown to improve CTR prediction (Liu et al., 2020b). Furthermore, searching for interaction functions in collaborative filtering reveals the improvements brought by searching over operation functions between user and item embeddings (Yao et al., 2020a).
3. Preliminaries
Before introducing our proposal in detail, we briefly describe the investigated problem and the classical method, and then provide an overview of our solution.
In this work, only composited creatives are investigated, rather than whole images designed by professional designers. At the time of impression, a product sends a request to the advertising platform, and the platform instantly selects a creative from the candidates for display. There are $m$ elements to be composited, and the $i$-th element is a collection of $n_i$ alternatives.
The task investigated in this work is to recommend the optimal composited creative for each product at the time of impression, aiming to increase the overall CTR given limited impressions. Note that the generation of creatives under aesthetic metrics (Li et al., 2018; Vempati et al., 2019) is not the focus of this paper.
3.1. Classical Method
Typically, each candidate creative is delivered to the advertising platform and user feedback is collected for CTR prediction with point-wise (Chen et al., 2016) or pair-wise (He and McAuley, 2016) methods. The creative with the maximal predicted CTR is then displayed to increase the overall click-through rate. However, the multiple elements are composited combinatorially, resulting in an exponential explosion in the number of creatives. For example, given 4 templates, 10 fonts, 10 colors, and 5 picture sizes, 2,000 creatives can be composited from a single product image. Collecting sufficient samples for model training is time-consuming and expensive, so there is a trade-off between exploration and exploitation.
The multi-armed bandit is a classical approach to creative optimization, where the composited creatives correspond to arms. Such algorithms generally proceed as follows. At discrete time $t$, a creative $c_t$ is selected and revealed with a reward $r_t$. In the advertising industry, an impression is a scenario in which an ad is shown to a user, and $r_t = 1$ indicates that the user clicked the selected creative. Within limited impressions, the multi-armed bandit aims to minimize the cumulative regret

$$R_T = \sum_{t=1}^{T} \big( \mathbb{E}[r(c^*)] - \mathbb{E}[r(c_t)] \big),$$

where $c^*$ is the optimal creative with the maximal expected reward. In this work, the expected reward corresponds to the value of CTR.
Composited creatives often share common elements, so connecting them can help estimate the expected rewards of many creatives jointly. Linear models of such connections have been successfully applied in this setting (Chu et al., 2011), where the expected reward of a creative is assumed to be a linear function of its features, augmented by an upper confidence bound. Linear models enjoy theoretical regret bounds, but their expressiveness is limited: they cannot efficiently capture commonality over a significantly large search space. This limitation motivates us to design a more effective algorithm for the huge creative space under extremely sparse feedback.
3.2. Solutions Overview
A bandit problem has two key components: (1) the model of the expected reward and (2) the exploration methodology. For both parts, we design effective algorithms to improve the overall CTR over the huge creative space. The proposed framework consists of the following phases:
(1) CTR Estimation: We focus on the interactions between different elements to leverage commonality among numerous creatives, and search for interaction functions that capture suitable representations using AutoML methods.

(2) Efficient Exploration: To reduce the high variance caused by sparse feedback, we exploit Thompson Sampling, a classical and efficient exploration method. The posterior under complex interaction functions is approximated by variational inference in an end-to-end manner.
4. Interaction Function Search for CTR Estimation
4.1. Interaction Function Analysis
In a similar task, the optimal selection of layouts for web pages (Hill et al., 2017), interactions between different elements are captured as in the Poly2 model (Chang et al., 2010), but the parameter space grows rapidly with the number of features. One solution is factorization machines (FMs) (Rendle, 2010), which use the inner product as the interaction function and have achieved success in CTR prediction (Guo et al., 2017). The inner-product term over pairwise feature interactions is

$$\sum_{i=1}^{m} \sum_{j=i+1}^{m} \langle \mathbf{v}_i, \mathbf{v}_j \rangle,$$

where $\mathbf{v}_i = \mathbf{E}_i \mathbf{x}_i \in \mathbb{R}^d$ is the low-dimensional embedding vector for the $i$-th feature, $\mathbf{E}_i$ is the embedding matrix of the $i$-th feature field, and $\mathbf{x}_i$ denotes the one-hot encoding vector of the $i$-th feature (multi-hot or continuous-valued vectors are also possible). Here $d$ is the embedding dimension and the width of $\mathbf{E}_i$ is the size of the feature space.
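To make the second-order term concrete, here is a minimal NumPy sketch (the field count, dimension, and variable names are our own illustration, not the paper's code) that computes the pairwise inner-product score for one creative, together with the $O(md)$ identity from Rendle (2010):

```python
import numpy as np

m, d = 5, 8                       # number of feature fields, embedding dimension
rng = np.random.default_rng(0)
V = rng.normal(size=(m, d))       # row i plays the role of v_i = E_i x_i

# Naive O(m^2 d) sum of inner products over all pairs i < j.
pairwise = sum(V[i] @ V[j] for i in range(m) for j in range(i + 1, m))

# Equivalent O(m d) form: 0.5 * (||sum_i v_i||^2 - sum_i ||v_i||^2).
fast = 0.5 * ((V.sum(axis=0) ** 2).sum() - (V ** 2).sum())
assert np.isclose(pairwise, fast)
print(pairwise)
```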
However, the inner product may not yield the best performance in many recommendation tasks because of the complex nature of interactions. Inspired by the recent work SIF (Yao et al., 2020a), which searches for interaction functions between user and item embeddings, we aim to capture the complicated interactions between different features. For simplicity, we state the problem of searching for interaction functions between feature fields in our task as follows.
Definition 1 (Interaction Function Search). Find an interaction function $f^{(i,j)} \in \mathcal{O}$ for each pair of embedding vectors $(\mathbf{v}_i, \mathbf{v}_j)$, such that the loss over the observed data is minimized.

Here $\mathcal{O}$ is the operation set containing the candidate interaction functions, and $|\mathcal{O}|$ denotes the number of operations in $\mathcal{O}$. There are $m$ embedding vectors, so $m(m-1)/2$ interaction functions must be selected, and the number of possible selections is $|\mathcal{O}|^{m(m-1)/2}$. For example, with $m = 5$ feature fields and $|\mathcal{O}| = 5$ operations, there are $5^{10} \approx 9.8 \times 10^{6}$ possible configurations. In the following sections, we introduce the search space and the search algorithm respectively.
4.2. Search Space
Table 1. Offline AUC of fixed interaction functions, LR, interaction function search (IFS, Alg. 1), and FM; improvements are relative to FM.

| | Multiply | Concat | Plus | Max | Min | LR | IFS (Alg. 1) | FM |
|---|---|---|---|---|---|---|---|---|
| Average AUC | 0.6039 | 0.6044 | 0.6037 | 0.6039 | 0.6023 | 0.5285 | 0.6050 | 0.5947 |
| Standard Error | 0.0017 | 0.0004 | 0.0007 | 0.0013 | 0.0008 | 0.0019 | 0.0003 | 0.0004 |
| Relative Improvement | 1.54% | 1.62% | 1.50% | 1.53% | 1.27% | -11.13% | 1.71% | - |
We pick the following simple and popular operations, as in previous work (Yao et al., 2020a): (1) Concat: $[\mathbf{v}_i; \mathbf{v}_j]$; (2) Multiply: $\mathbf{v}_i \odot \mathbf{v}_j$; (3) Plus: $\mathbf{v}_i + \mathbf{v}_j$; (4) Max: $\max(\mathbf{v}_i, \mathbf{v}_j)$; (5) Min: $\min(\mathbf{v}_i, \mathbf{v}_j)$. Concat is a vector-wise operation and the other four are element-wise. Similar to SIF, we apply a fully connected (FC) layer after each function; the FC layer keeps the output sizes of the different operations consistent.
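As a concrete illustration, the following PyTorch sketch (dimension and variable names are our own, not the released code) implements the five operators and the per-operator FC layer that aligns their output sizes:

```python
import torch
import torch.nn as nn

d = 8  # embedding dimension

# The five candidate operators: concat is vector-wise, the rest element-wise.
OPS = {
    "concat":   lambda a, b: torch.cat([a, b], dim=-1),
    "multiply": lambda a, b: a * b,
    "plus":     lambda a, b: a + b,
    "max":      torch.maximum,
    "min":      torch.minimum,
}

# One FC layer per operator maps every output back to a common size d.
fc = {name: nn.Linear(2 * d if name == "concat" else d, d) for name in OPS}

a, b = torch.randn(d), torch.randn(d)
for name, op in OPS.items():
    out = fc[name](op(a, b))
    print(name, tuple(out.shape))  # every operator yields a d-dimensional output
```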
Before focusing on the efficient search for interaction functions, we conduct several offline experiments to demonstrate that different interaction functions between feature fields yield different performances. The baselines are the factorization machine and logistic regression (LR) without interactions.
As shown in Table 1, we study the different operations on a collection of about one thousand products and their 300 thousand composited creatives, for which nearly three million samples were collected. We use standard CTR prediction with the binary cross-entropy loss for optimization. All five operation functions outperform the FM model, to different degrees. This result provides evidence that interaction functions more appropriate than the inner product exist between the different elements.
Considering all interactions between different embedding vectors, the search space grows to $|\mathcal{O}|^{m(m-1)/2}$ and becomes much more complicated. Varying the operators between different embedding vectors may enhance the expressiveness of the CTR predictor and thus lead to better performance than FM-based models.
4.3. Efficient Search Strategy
A straightforward way to obtain the optimal interactions between feature fields is to traverse all combinations. However, exhaustive search costs $O(|\mathcal{O}|^{m(m-1)/2})$ time, which is unacceptable in our situation. To deal with this issue, we apply an efficient search algorithm inspired by previous AutoML algorithms.

4.3.1. One-Shot Search Algorithm
Neural architecture search was first proposed in (Zoph and Le, 2017), where reinforcement learning is employed to search architectures. Since then, more search-based algorithms have been proposed (e.g., evolutionary algorithms (Liu et al., 2018b), greedy search (Liu et al., 2018c)). However, these algorithms require long search times due to the huge, discrete search space. Recently, one-shot architecture search algorithms such as DARTS (Liu et al., 2018a), SNAS (Xie et al., 2018), and NASP (Yao et al., 2020b) have achieved great success by relaxing the search space to be continuous, which greatly shortens search time. These algorithms jointly optimize the weights over candidate operators and the model parameters by stochastic gradient descent.
Following SIF (Yao et al., 2020a), we relax the choice among operations into a sparse weight vector in a continuous space. For each pair of feature fields $i$ and $j$, $\boldsymbol{\alpha}^{(i,j)} \in \mathbb{R}^{|\mathcal{O}|}$ denotes the weight vector, whose $k$-th element $\alpha^{(i,j)}_k$ corresponds to the weight of the $k$-th operator $o_k \in \mathcal{O}$. The operation between the feature pair $i$ and $j$ is then formulated as:

$$f^{(i,j)}(\mathbf{v}_i, \mathbf{v}_j) = \sum_{k=1}^{|\mathcal{O}|} \alpha^{(i,j)}_k \, o_k(\mathbf{v}_i, \mathbf{v}_j).$$

The constraint $\|\boldsymbol{\alpha}^{(i,j)}\|_0 = 1$ ensures that only one operator is selected, and the constraint $0 \le \alpha^{(i,j)}_k \le 1$ makes the algorithm more stable. Thus, the search problem is defined as:

$$\min_{\boldsymbol{\alpha}, \boldsymbol{\theta}} \; \mathcal{L}(\boldsymbol{\alpha}, \boldsymbol{\theta}) \quad \text{s.t.} \quad \|\boldsymbol{\alpha}^{(i,j)}\|_0 = 1, \;\; 0 \le \alpha^{(i,j)}_k \le 1, \tag{1}$$

where $\boldsymbol{\alpha} = \{\boldsymbol{\alpha}^{(i,j)}\}$ denotes the operation weights for all interactions, $\mathcal{L}$ is the binary cross-entropy loss in this work, and $\boldsymbol{\theta}$ represents the parameters of the embedding matrices and the fully connected layers. For the sake of simplicity, we temporarily omit the first-order term and the bias of FM models. Following previous successful work with proximal steps (Yao et al., 2020a, b), we utilize the same optimization method for efficient searching.
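Below is a minimal sketch of this relaxation for a single field pair. Note that we use a plain softmax over $\boldsymbol{\alpha}$ for readability, whereas the paper follows NASP-style proximal steps to enforce the constraints above, so this is an approximation rather than the paper's optimizer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation for one field pair: a weighted sum over operators."""
    def __init__(self, d: int):
        super().__init__()
        self.names = ["concat", "multiply", "plus", "max", "min"]
        self.fc = nn.ModuleList(
            nn.Linear(2 * d if n == "concat" else d, d) for n in self.names
        )
        self.alpha = nn.Parameter(torch.zeros(len(self.names)))  # operation weights

    def forward(self, a, b):
        outs = [
            torch.cat([a, b], dim=-1), a * b, a + b,
            torch.maximum(a, b), torch.minimum(a, b),
        ]
        # Softmax keeps the weights in [0, 1]; a proximal step toward a
        # one-hot alpha (as in NASP/SIF) would replace this softmax.
        w = F.softmax(self.alpha, dim=0)
        return sum(w[k] * self.fc[k](o) for k, o in enumerate(outs))

mix = MixedOp(d=8)
out = mix(torch.randn(8), torch.randn(8))  # gradients flow to both alpha and theta
```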
4.3.2. Operation-aware Embedding
An interesting observation from FFM (Juan et al., 2016) and FwFM (Pan et al., 2018), that features from one field often interact differently with features from other fields, suggests that operators may likewise influence feature embeddings differently. Given the joint optimization of the operation weights and embedding vectors, we design an Operation-Aware Embedding module, abbreviated as OAE, which assigns a separate embedding to each operator. Fig. 2 illustrates the interactions of the different operations.
With operation-aware embedding, $f^{(i,j)}$ is redefined as:

$$f^{(i,j)}(\mathbf{v}_i, \mathbf{v}_j) = \sum_{k=1}^{|\mathcal{O}|} \alpha^{(i,j)}_k \, o_k\big(\mathbf{v}^{(k)}_i, \mathbf{v}^{(k)}_j\big), \tag{2}$$

where $\mathbf{v}^{(k)}_i$ denotes the embedding vector dedicated to the $k$-th operation. In the original Eq. (1), the shared embedding matrices receive gradients of different magnitudes as the dominant operator changes; a simple example is the difference between the plus and multiply operations. These frequent changes disturb the gradient computation and result in poor convergence. Operation-aware embedding removes the interference of other operations when optimizing the embeddings and may lead to better learning. The final algorithm for interaction function search with the OAE module is described in Alg. 1.
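The following sketch shows one way to realize operation-aware embeddings, with one embedding table per operator (the vocabulary size, class name, and interface are illustrative assumptions):

```python
import torch
import torch.nn as nn

class OperationAwareEmbedding(nn.Module):
    """One embedding table per operator, so each operator's gradient updates
    its own embeddings only (our reading of the OAE module)."""
    def __init__(self, vocab_size: int, d: int, num_ops: int):
        super().__init__()
        self.tables = nn.ModuleList(
            nn.Embedding(vocab_size, d) for _ in range(num_ops)
        )

    def forward(self, feature_ids, k):
        return self.tables[k](feature_ids)  # embedding dedicated to operator k

oae = OperationAwareEmbedding(vocab_size=100, d=8, num_ops=5)
ids = torch.tensor([3])
v_for_multiply = oae(ids, k=1)  # v_i^(1), used only by the "multiply" operator
v_for_plus = oae(ids, k=2)      # a different vector v_i^(2) for "plus"
```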
After model training, the interaction function for each pair is selected as $o_{k^*}$ with $k^* = \arg\max_k \alpha^{(i,j)}_k$. The expected reward of a candidate creative then follows as:

$$\hat{r} = \sigma\Big( \sum_{i=1}^{m} \sum_{j=i+1}^{m} f^{(i,j)}(\mathbf{v}_i, \mathbf{v}_j) \Big), \tag{3}$$

where $\sigma$ is the sigmoid function. The creative with the highest expected reward is then selected and revealed with a reward.
5. Efficient Exploration via Thompson Sampling
As mentioned before, the platform can composite hundreds of creatives per product, but the limited impressions of each product make the feedback extremely sparse. Each candidate creative is displayed only a few times, so the CTR predictor comes with high variance. We therefore introduce an efficient exploration method to reduce this variance.
5.1. Variational Inference for Thompson Sampling
Thompson Sampling (Thompson, 1933) is a popular bandit algorithm for balancing exploration and exploitation. The main idea is to place a simple prior over the reward distribution of each candidate and, at any time $t$, recommend a candidate in proportion to its posterior probability of being optimal. As shown in (Agrawal and Goyal, 2013), instead of maintaining the posterior over reward distributions directly, Thompson Sampling for contextual bandits samples the model parameters. A general round includes the following steps (a minimal sketch of one round follows the list):

(1) Sample model weights $\boldsymbol{\theta}_t$ from the posterior $p(\boldsymbol{\theta} \mid \mathcal{D}_{t-1})$;

(2) Recommend the creative $c_t = \arg\max_{c} \hat{r}(c; \boldsymbol{\theta}_t)$;

(3) Receive the reward $r_t$;

(4) Update the observations $\mathcal{D}_t = \mathcal{D}_{t-1} \cup \{(\mathbf{x}_t, r_t)\}$ and the posterior $p(\boldsymbol{\theta} \mid \mathcal{D}_t)$.
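A minimal sketch of one such round, where all interfaces (function names and signatures) are our own illustrative assumptions:

```python
import numpy as np

def thompson_round(sample_posterior, predict_ctr, candidates, observe):
    """One round of the four steps above; interfaces are illustrative."""
    theta = sample_posterior()                    # (1) sample model weights
    scores = [predict_ctr(c, theta) for c in candidates]
    chosen = candidates[int(np.argmax(scores))]   # (2) recommend the argmax
    reward = observe(chosen)                      # (3) receive the reward
    return chosen, reward                         # (4) caller updates D, posterior

# Toy usage with a fixed-variance parameter sampler and random rewards.
rng = np.random.default_rng(0)
chosen, reward = thompson_round(
    sample_posterior=lambda: rng.normal(size=4),
    predict_ctr=lambda c, th: float(th @ c),
    candidates=[rng.normal(size=4) for _ in range(10)],
    observe=lambda c: int(rng.random() < 0.1),
)
```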
For linear models with Gaussian priors, the posterior has a closed-form solution following Bayes' rule (Agrawal and Goyal, 2013; Li et al., 2012). However, once complex operations between feature fields are introduced, the posterior distribution becomes intractable. Thanks to previous successes in posterior approximation for neural networks (Blundell et al., 2015; Kingma and Welling, 2014), we can exploit variational inference to approximate the posterior for Thompson Sampling.
Assuming the model parameters follow Gaussian distributions, variational learning finds the parameters $\boldsymbol{\phi}$ of an approximate posterior $q(\boldsymbol{\theta} \mid \boldsymbol{\phi})$ by minimizing the Kullback-Leibler (KL) divergence between the approximate posterior and the true posterior:

$$\boldsymbol{\phi}^* = \arg\min_{\boldsymbol{\phi}} \mathrm{KL}\big[\, q(\boldsymbol{\theta} \mid \boldsymbol{\phi}) \,\|\, p(\boldsymbol{\theta} \mid \mathcal{D}) \,\big],$$

where $\mathcal{D}$ is the set of observed data and each sample consists of a feature vector $\mathbf{x}_t$ and the corresponding reward $r_t$. Following Bayes' rule, the KL divergence can be rewritten as:

$$\mathrm{KL}\big[\, q(\boldsymbol{\theta} \mid \boldsymbol{\phi}) \,\|\, p(\boldsymbol{\theta} \mid \mathcal{D}) \,\big] = \mathrm{KL}\big[\, q(\boldsymbol{\theta} \mid \boldsymbol{\phi}) \,\|\, p(\boldsymbol{\theta}) \,\big] - \mathbb{E}_{q(\boldsymbol{\theta} \mid \boldsymbol{\phi})}\big[ \log p(\mathcal{D} \mid \boldsymbol{\theta}) \big] + \log p(\mathcal{D}).$$

The component $\log p(\mathcal{D})$ is independent of the learning parameters, so the objective equals minimizing:

$$\mathcal{F}(\boldsymbol{\phi}) = \mathrm{KL}\big[\, q(\boldsymbol{\theta} \mid \boldsymbol{\phi}) \,\|\, p(\boldsymbol{\theta}) \,\big] - \mathbb{E}_{q(\boldsymbol{\theta} \mid \boldsymbol{\phi})}\big[ \log p(\mathcal{D} \mid \boldsymbol{\theta}) \big].$$

According to Equation (3), the likelihood is Bernoulli:

$$p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{t} \hat{r}(\mathbf{x}_t)^{r_t} \big(1 - \hat{r}(\mathbf{x}_t)\big)^{1 - r_t},$$

and its logarithm simplifies to:

$$\log p(\mathcal{D} \mid \boldsymbol{\theta}) = \sum_{t} \Big[ r_t \log \hat{r}(\mathbf{x}_t) + (1 - r_t) \log\big(1 - \hat{r}(\mathbf{x}_t)\big) \Big],$$

i.e., the negative binary cross-entropy loss. Stochastic variational inference is applied to estimate $\boldsymbol{\phi}$. Supposing the variational posterior is a diagonal Gaussian with $\boldsymbol{\phi} = (\boldsymbol{\mu}, \boldsymbol{\sigma})$, we reparameterize the model parameters, which yields a posterior sample $\boldsymbol{\theta} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, where the noise $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ has the same size as $\boldsymbol{\mu}$ and $\odot$ is point-wise multiplication. Finally, the objective function is formulated as:

$$\mathcal{F}(\boldsymbol{\mu}, \boldsymbol{\sigma}) = \mathrm{KL}\big[\, q(\boldsymbol{\theta} \mid \boldsymbol{\mu}, \boldsymbol{\sigma}) \,\|\, p(\boldsymbol{\theta}) \,\big] - \sum_{t} \Big[ r_t \log \hat{r}(\mathbf{x}_t) + (1 - r_t) \log\big(1 - \hat{r}(\mathbf{x}_t)\big) \Big], \tag{4}$$

where the prior $p(\boldsymbol{\theta})$ is assumed to be Gaussian in this work. The KL divergence between two Gaussians has an analytical expression, so stochastic gradient descent can be used to minimize the objective. At each iteration, we draw Gaussian noise $\boldsymbol{\epsilon}$ and evaluate Equation (4) and its derivatives w.r.t. $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$. The approximate posterior is thus estimated via parameter updates over the training iterations.
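For concreteness, here is a minimal Bayes-by-backprop-style sketch of Equation (4) for a single Gaussian layer; the layer shape, standard-normal prior, and per-batch KL scaling are our own simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Diagonal-Gaussian weight posterior with the reparameterization trick."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.log_sigma = nn.Parameter(torch.full((d_out, d_in), -3.0))

    def forward(self, x):
        sigma = self.log_sigma.exp()
        eps = torch.randn_like(sigma)   # epsilon ~ N(0, I)
        w = self.mu + sigma * eps       # theta = mu + sigma (point-wise) eps
        return x @ w.t()

    def kl(self):
        # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for diagonal Gaussians.
        s2 = (2 * self.log_sigma).exp()
        return 0.5 * (s2 + self.mu ** 2 - 1 - 2 * self.log_sigma).sum()

layer = BayesianLinear(8, 1)
x, y = torch.randn(32, 8), torch.randint(0, 2, (32, 1)).float()
logits = layer(x)
# Eq. (4): KL term (scaled per sample here) plus binary cross-entropy.
loss = layer.kl() / 32 + F.binary_cross_entropy_with_logits(logits, y)
loss.backward()
```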
Integrated with the interaction function search, the whole optimization procedure with Thompson Sampling is described in Alg. 2. The proposed approach is updated in an end-to-end manner, without rerunning training from scratch after the interaction functions are updated. As mentioned in (Blundell et al., 2015; Kong et al., 2020), non-Bayesian approaches that train multiple models to estimate the expected reward are prohibitively expensive in practice. Variational inference thus facilitates quick responses in a live environment.
5.2. Automated Creative Optimization for E-Commerce Advertisement
Following the above Thompson Sampling method, we sample a new set of model parameters to estimate the expected rewards and pick the creative with the highest value for display. The creative is then delivered and a reward is observed. Given the new observations, we search for the interaction functions and update the embedding matrices simultaneously, as shown in Alg. 2. This end-to-end optimization enables both efficient search and efficient exploration for creative optimization. The overall framework of automated creative optimization is shown in Alg. 3; the algorithm converges as observations accumulate.
In summary, we perform an automated interaction function search to improve the expressiveness of the model, and utilize Thompson Sampling via variational inference for efficient exploration, reducing the high variance caused by limited feedback.
6. Experiments
In this section, experiments¹ are conducted on a synthetic dataset and two public datasets to answer the following questions:

• Q1: How does the automated interaction function search improve performance compared with the FM model?

• Q2: Can the proposed Thompson Sampling with variational inference for posterior approximation explore efficiently given limited feedback?

¹ https://github.com/alimama-creative/AutoCO
To further examine the performance of our proposed approach, we conduct the experiment on a live production system.
6.1. Synthetic Results
6.1.1. Simulation Data
Following the synthetic experiments in prior work (Hill et al., 2017), we produce synthetic data from an assumed CTR predictor and simulate rewards for the bandit procedure. There are five elements for compositing creatives: template, picture size, text font, background blur, and background color. We generate the expected reward over 167 products using the representation in Equation (3). On average, each product has 67 composited creatives satisfying visual constraints. The operators between different embedding vectors are initialized randomly and the parameters of the embedding matrices are sampled from Gaussian distributions. Finally, each creative has an expected reward (CTR), which serves as the success probability of the Bernoulli trials simulating user feedback. The average value of the generated CTR is 0.0885.
At each trial, we sample a product and the algorithm recommends the creative with the maximal estimated CTR. Every 10,000 trials are collected for updating and training, simulating the delayed feedback in real production systems. Each algorithm is run multiple times over 20 batches to reduce the bias caused by randomness. The empirical studies run on a Linux system with a Tesla P100 GPU for acceleration.
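A sketch of this batched-feedback simulation protocol follows; the Beta(2, 20) prior over true CTRs is our own assumption, chosen only to roughly match the reported average CTR of 0.0885:

```python
import numpy as np

rng = np.random.default_rng(42)
true_ctr = rng.beta(2, 20, size=67)   # assumed per-creative CTRs for one product

def run_batch(select, batch_size=10_000):
    """Simulate one delayed-feedback batch: pick creatives, draw Bernoulli clicks."""
    logs = []
    for _ in range(batch_size):
        arm = select()                        # index of the recommended creative
        click = rng.random() < true_ctr[arm]  # Bernoulli trial with the true CTR
        logs.append((arm, int(click)))
    return logs                               # the model updates after the batch

logs = run_batch(select=lambda: rng.integers(len(true_ctr)))
print(sum(c for _, c in logs) / len(logs))    # empirical CTR of a uniform policy
```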
6.1.2. Baselines
We choose the following typical algorithms for comparison, with parameters tuned for the best performance in terms of overall cumulative CTR.
• Uniform: The creative is selected uniformly at random.

• Egreedy: Recommend a creative at random with probability $\epsilon$ and recommend the creative with the best statistical reward with probability $1 - \epsilon$.

• LinUCB (Li et al., 2010): Each candidate creative has a corresponding linear model over the observed contextual information. LinUCB recommends the creative according to the estimated reward plus an upper confidence bound.

• LinTS: Similar to LinUCB, the interactions are modeled linearly, but LinTS uses Thompson Sampling for exploration based on Bayesian linear regression.

• MVT (Hill et al., 2017): MVT captures the interactions with a linear model and uses the Thompson Sampling method proposed in that work for exploration.
We add two further FM-based algorithms, FM and FM-TS: FM recommends the creative with the maximal estimated CTR without exploration, while FM-TS adds Thompson Sampling as described in Section 5.1.
Our proposed method AutoCO automatically searches interaction functions between different feature fields and utilizes the same exploration method as FM-TS.

6.1.3. Metrics
Bandit algorithms are generally evaluated in terms of cumulative regret (Sutton and Barto, 2018), which represents the cost paid before finding the optimum. In practice, the regret is the difference between the expected reward of the optimal strategy and the empirical performance of the algorithm; lower regret is better. The cumulative reward is complementary to the cumulative regret, and higher is better. We also report the normalized cumulative reward, which is exactly the CTR in online advertising.
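These metrics are straightforward to compute from a run's logs; a small sketch with illustrative values:

```python
import numpy as np

def cumulative_regret(optimal_ctr, chosen_ctrs):
    """Sum over trials of (expected reward of the optimum - of the chosen arm)."""
    return float(np.sum(optimal_ctr - np.asarray(chosen_ctrs)))

chosen = [0.05, 0.07, 0.09, 0.09]        # expected CTRs of the creatives shown
print(cumulative_regret(0.09, chosen))   # 0.06: the cost paid while exploring
print(float(np.mean(chosen)))            # normalized cumulative reward, i.e. CTR
```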
6.1.4. Learning Effectiveness
We run 20 batches, i.e., 2 million impressions, for each algorithm, and report the average performance over 5 repeated experiments in Fig. 3. Our empirical studies yield several findings.
Finding 1: Modeling the interactions between different elements better leverages commonality and leads to better CTR prediction. The linear models (e.g., LinUCB and LinTS), which only build linear relationships between elements, converge to worse creatives despite comparable performance at the beginning. Expressive representations of the interactions help model the commonality among numerous creatives and improve the overall CTR.
Finding 2: The proposed AutoCO shows superior performance throughout the whole process. The significant improvement comes from two key components: (1) automated interaction function search; (2) efficient exploration with variational Thompson Sampling. To separate the contributions of these two parts, we conduct two experiments.

First, we compare AutoCO with each fixed single operation from its search space; results are shown in Fig. 4. All algorithms use Thompson Sampling for exploration. Among the five fixed operation functions, concat performs best, while AutoCO outperforms even the concat interaction function. We infer that better interaction functions for the different feature fields are found via the AutoML method.
Table 2. Overall CTR of different exploration methods on the synthetic data.

| | FM | AutoCO* |
|---|---|---|
| Greedy | 0.1030 | 0.1073 |
| Egreedy | 0.1024 (-0.58%) | 0.1086 (+1.21%) |
| Thompson Sampling | 0.1055 (+2.43%) | 0.1102 (+2.70%) |

\* AutoCO here denotes only the interaction function search method, for simplicity of understanding.
Second, different exploration methods are tested to compare their efficiency against the Thompson Sampling used in this paper. For both the FM model and AutoCO, experiments are run five times and the average performance is reported in Table 2. The variational Thompson Sampling clearly outperforms Greedy and Egreedy. The trivial exploration method, Egreedy, always recommends a predefined fraction $\epsilon$ of creatives at random even after a large number of impressions have been collected, wasting impressions on bad creatives. In contrast, Thompson Sampling recommends each creative in proportion to its probability of being optimal, so eventually only the optimal creative is displayed. Moreover, the variational inference for posterior approximation fits naturally into Thompson Sampling and enables straightforward training for a variety of base models.
In conclusion, AutoCO is superior in both the accuracy of the CTR predictor and the efficiency of exploration on the synthetic data, where better expressiveness is obtained through interaction function search.
6.1.5. Effectiveness of OAE
We conduct an ablation study on the simulated data to analyze the effect of the operation-aware embedding module, testing AutoCO with and without OAE. The result is reported in Table 3. The OAE module yields a slight improvement in CTR over shared embeddings across operations, indicating the benefit of independent embeddings. During training, we also find that OAE contributes to more stable learning.
Table 3. Ablation of the OAE module on the simulated data.

| Framework | Overall CTR |
|---|---|
| AutoCO w/o OAE | 0.1097 ± 0.0006 |
| AutoCO | 0.1102 ± 0.0014 |
Table 4. Cumulative regret (Uniform normalized to 100) of AutoCO and fixed interaction functions with Thompson Sampling on the public datasets.

| Dataset | AutoCO | Multiply-TS | Concat-TS | Plus-TS | Min-TS | Max-TS | Uniform |
|---|---|---|---|---|---|---|---|
| Mushroom | 3.02 ± 0.51 | 5.44 ± 1.05 | 6.12 ± 2.37 | 3.40 ± 0.30 | 3.62 ± 0.35 | 4.95 ± 2.00 | 100.00 ± 0.38 |
| Adult | 35.40 ± 0.03 | 49.28 ± 0.36 | 35.79 ± 0.06 | 37.19 ± 0.39 | 38.34 ± 2.01 | 36.55 ± 1.18 | 100.00 ± 0.15 |

Table 5. Cumulative regret (Uniform normalized to 100) of different exploration strategies on the public datasets.

| Dataset | AutoCO | AutoCO w/o TS | FM | FM-TS | Egreedy | Uniform |
|---|---|---|---|---|---|---|
| Mushroom | 3.02 ± 0.51 | 5.58 ± 0.51 | 7.70 ± 1.44 | 5.64 ± 1.70 | 63.38 ± 0.47 | 100.00 ± 0.38 |
| Adult | 35.40 ± 0.03 | 40.07 ± 0.40 | 43.62 ± 3.32 | 38.71 ± 2.30 | 58.85 ± 0.15 | 100.00 ± 0.15 |
6.2. Experiments on Public Datasets
Empirical studies are also conducted on two public datasets to verify the performance of our proposed AutoCO.
6.2.1. Datasets
• Mushroom: The Mushroom dataset (Schlimmer, 1981) contains 22 attributes and 8,124 samples. The mushrooms are divided into two categories: safe and poisonous. We follow previous settings (Blundell et al., 2015; Riquelme et al., 2018) to formulate the bandit problem, where the algorithm must decide whether to eat a given mushroom. Eating a safe mushroom yields a reward of +5; eating a poisonous mushroom yields +5 with probability 0.5 and -35 otherwise. If the agent does not eat the mushroom, no reward is delivered (a sketch of this reward scheme appears after the list). We sample 50,000 pieces of data from the original dataset.

• Adult: The Adult dataset (Kohavi, 1996) contains 14 attributes of personal information. The task is to determine whether a person makes over $50K a year. A correct prediction yields a reward of 1; otherwise, no reward is given. The dataset contains 48,842 samples. We pick the 9 categorical features from the 14 original mixed-type features for our experiments.
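For clarity, here is a sketch of the Mushroom reward scheme described above (the function name and interface are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def mushroom_reward(is_poisonous: bool, eat: bool) -> float:
    """Reward scheme from the Mushroom bullet above."""
    if not eat:
        return 0.0               # passing yields no reward
    if not is_poisonous:
        return 5.0               # safe mushrooms always pay +5
    return 5.0 if rng.random() < 0.5 else -35.0  # poisonous: a risky gamble

print(mushroom_reward(is_poisonous=True, eat=True))
```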
6.2.2. Effectiveness of interaction function search
Compared with single fixed interaction functions, AutoCO attains the lowest cumulative regret on both public datasets, as shown in Table 4. All algorithms use Thompson Sampling for exploration. The good performance of AutoCO indicates that complex interaction functions, rather than a single operation, better model the relationships between the different attributes.
6.2.3. Efficiency of exploration methods
To verify the efficiency of Thompson Sampling via variational inference, we compare it with greedy strategies. The results, shown in Table 5, confirm that Thompson Sampling via variational inference indeed improves performance.
6.3. Online Experiment
Now, we examine the performance of our proposed algorithm on the online system.

6.3.1. Experimental Design
We deploy AutoCO to recommend composited creatives on an online advertising platform. The composited elements include template, picture size, text font, background blur, and background color. Our experiments are conducted on one display advertising position of a well-known e-commerce company. We generate more than 290,000 creatives satisfying visual constraints, covering 1,000 popular products. In addition to the element features mentioned before, we use the category feature of products as well as contextual information from product images; the image feature embeddings are extracted with a ResNet.
We performed the experiments over six consecutive days, with traffic randomly and equally split among the algorithms. The experiments were conducted on non-holidays to avoid traffic fluctuations. Each algorithm received more than 150,000 impressions per day. The model was updated every hour using the data collected from the beginning of the experiment to the latest hour.
Baseline algorithms: (1) Uniform: composited creatives are delivered at random; (2) Context-free Thompson Sampling: the expected reward of each creative is assumed to follow a Beta distribution with initialized $\alpha$ and $\beta$; (3) FM-Egreedy: FM is the CTR predictor, combined with $\epsilon$-greedy exploration; (4) FM-TS: the proposed Thompson Sampling method applied to the FM model. In our offline experiments, we found that the selected interaction functions change infrequently after several epochs, as also reported in (Yao et al., 2020b). To reduce the burden of online updating, we therefore do not search the interaction functions frequently.
6.3.2. Result
The six-day online experiment shows the superiority of our approach, AutoCO, as shown in Fig. 5. The context-free TS algorithm yields almost no improvement over the Uniform baseline, which demonstrates the sparsity of feedback: on average, each creative receives less than one impression per day. The contextual algorithms perform better because they model the interactions between features to capture the commonality between different creatives.
In the online A/B test, although AutoCO is merely comparable in the first two days, a significant improvement is achieved on day 3, and AutoCO steadily improves over the following three days. This indicates the quick convergence of the proposed method, which is essential for online advertising. Compared with the FM-based models, our approach remains competitive; we infer that the search over interaction functions finds more appropriate interactions and improves expressiveness.
7. Conclusion
In this paper, we propose an automated creative optimization framework to model the complex interactions between creative elements and to strike a balance between exploration and exploitation. For modeling complex interactions, we propose a one-shot search algorithm that efficiently searches for effective interaction functions in a customized function space. For balancing exploration and exploitation, we develop stochastic variational inference for posterior approximation based on the reparameterization trick, and then apply Thompson Sampling to effectively explore potentially better creatives. Both offline and online experiments show that our approach outperforms the competing baselines, verifying the effectiveness of the proposed algorithms.
Acknowledgements.
The work was supported by Alibaba Group through the Alibaba Innovative Research Program. Defu Lian is supported by grants from the National Natural Science Foundation of China (Grant No. 61976198, 62022077), the National Key R&D Program of China under Grant No. 2020AAA0103800, and the Fundamental Research Funds for the Central Universities. Kai Zheng is supported by NSFC (No. 61972069, 61836007, 61832017) and the Sichuan Science and Technology Program under Grant 2020JDTD0007.

References
- Agrawal and Goyal (2013) Shipra Agrawal and Navin Goyal. 2013. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning. 127–135.
- Azimi et al. (2012) Javad Azimi, Ruofei Zhang, Yang Zhou, Vidhya Navalpakkam, Jianchang Mao, and Xiaoli Fern. 2012. The impact of visual appearance on user response in online display advertising. In Proceedings of the 21st International Conference on World Wide Web. 457–458.
- Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37. 1613–1622.
- Chang et al. (2010) Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. 2010. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research 11, 4 (2010).
- Chen et al. (2016) Junxuan Chen, Baigui Sun, Hao Li, Hongtao Lu, and Xian-Sheng Hua. 2016. Deep ctr prediction in display advertising. In Proceedings of the 24th ACM international conference on Multimedia. 811–820.
- Cheng et al. (2012) Haibin Cheng, Roelof van Zwol, Javad Azimi, Eren Manavoglu, Ruofei Zhang, Yang Zhou, and Vidhya Navalpakkam. 2012. Multimedia features for click prediction of new ads in display advertising. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. 777–785.
- Chu et al. (2011) Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. 2011. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 208–214.
- Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1725–1731.
- He and McAuley (2016) Ruining He and Julian McAuley. 2016. VBPR: visual Bayesian Personalized Ranking from implicit feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 144–150.
- Hill et al. (2017) Daniel N Hill, Houssam Nassif, Yi Liu, Anand Iyer, and SVN Vishwanathan. 2017. An efficient bandit algorithm for realtime multivariate optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1813–1821.
- Hua (2018) Xian-Sheng Hua. 2018. Challenges and Practices of Large Scale Visual Intelligence in the Real-World. In Proceedings of the 26th ACM international conference on Multimedia. 364–364.
- Hutter et al. (2019) Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. 2019. Automated machine learning: methods, systems, challenges. Springer Nature.
- Joglekar et al. (2020) Manas R Joglekar, Cong Li, Mei Chen, Taibai Xu, Xiaoming Wang, Jay K Adams, Pranav Khaitan, Jiahui Liu, and Quoc V Le. 2020. Neural input search for large scale recommendation models. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2387–2397.
- Juan et al. (2016) Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems. 43–50.
- Kingma and Welling (2014) Diederik P Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. stat 1050 (2014), 1.
- Kohavi (1996) Ron Kohavi. 1996. Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. 202–207.
- Kong et al. (2020) Lingkai Kong, Jimeng Sun, and Chao Zhang. 2020. SDE-Net: Equipping Deep Neural Networks with Uncertainty Estimates. arXiv:cs.LG/2008.10546
- Li et al. (2018) Jianan Li, Jimei Yang, Aaron Hertzmann, Jianming Zhang, and Tingfa Xu. 2018. LayoutGAN: Generating Graphic Layouts with Wireframe Discriminators. In International Conference on Learning Representations.
- Li et al. (2012) Lihong Li, Wei Chu, John Langford, Taesup Moon, and Xuanhui Wang. 2012. An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2. 19–36.
- Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web. 661–670.
- Lian et al. (2020a) Defu Lian, Qi Liu, and Enhong Chen. 2020a. Personalized ranking with importance sampling. In Proceedings of The Web Conference 2020. 1093–1103.
- Lian et al. (2020b) Defu Lian, Haoyu Wang, Zheng Liu, Jianxun Lian, Enhong Chen, and Xing Xie. 2020b. Lightrec: A memory and search-efficient recommender system. In Proceedings of The Web Conference 2020. 695–705.
- Lian et al. (2020c) Defu Lian, Yongji Wu, Yong Ge, Xing Xie, and Enhong Chen. 2020c. Geography-aware sequential location recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2009–2019.
- Liu et al. (2020b) Bin Liu, Chenxu Zhu, Guilin Li, Weinan Zhang, Jincai Lai, Ruiming Tang, Xiuqiang He, Zhenguo Li, and Yong Yu. 2020b. AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’20). Association for Computing Machinery, New York, NY, USA, 2636–2645. https://doi.org/10.1145/3394486.3403314
- Liu et al. (2018c) Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. 2018c. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV). 19–34.
- Liu et al. (2020a) Hu Liu, Jing Lu, Hao Yang, Xiwei Zhao, Sulong Xu, Hao Peng, Zehua Zhang, Wenjie Niu, Xiaokun Zhu, Yongjun Bao, et al. 2020a. Category-Specific CNN for Visual-aware CTR Prediction at JD. com. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2686–2696.
- Liu et al. (2018b) Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. 2018b. Hierarchical Representations for Efficient Architecture Search. In International Conference on Learning Representations.
- Liu et al. (2018a) Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018a. DARTS: Differentiable Architecture Search. In International Conference on Learning Representations.
- Mo et al. (2015) Kaixiang Mo, Bo Liu, Lei Xiao, Yong Li, and Jie Jiang. 2015. Image feature learning for cold start problem in display advertising. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
- Pan et al. (2018) Junwei Pan, Jian Xu, Alfonso Lobos Ruiz, Wenliang Zhao, Shengjun Pan, Yu Sun, and Quan Lu. 2018. Field-weighted factorization machines for click-through rate prediction in display advertising. In Proceedings of the 2018 World Wide Web Conference. 1349–1357.
- Rendle (2010) Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
- Riquelme et al. (2018) Carlos Riquelme, George Tucker, and Jasper Snoek. 2018. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. In International Conference on Learning Representations.
- Schlimmer (1981) Jeff Schlimmer. 1981. Mushroom records drawn from the audubon society field guide to north american mushrooms. GH Lincoff (Pres), New York (1981).
- Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
- Thompson (1933) William R Thompson. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 3/4 (1933), 285–294.
- Vempati et al. (2019) Sreekanth Vempati, Korah T Malayil, Sruthi V, and Sandeep R. 2019. Enabling Hyper-Personalisation: Automated Ad Creative Generation and Ranking for Fashion e-Commerce. arXiv:cs.IR/1908.10139
- Wang et al. (2016) Yue Wang, Dawei Yin, Luo Jie, Pengyuan Wang, Makoto Yamada, Yi Chang, and Qiaozhu Mei. 2016. Beyond ranking: Optimizing whole-page presentation. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. 103–112.
- Xie et al. (2018) Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. 2018. SNAS: stochastic neural architecture search. In International Conference on Learning Representations.
- Yang et al. (2019) Xiao Yang, Tao Deng, Weihan Tan, Xutian Tao, Junwei Zhang, Shouke Qin, and Zongyao Ding. 2019. Learning Compositional, Visual and Relational Representations for CTR Prediction in Sponsored Search. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2851–2859.
- Yao et al. (2020a) Quanming Yao, Xiangning Chen, James T Kwok, Yong Li, and Cho-Jui Hsieh. 2020a. Efficient neural interaction function search for collaborative filtering. In Proceedings of The Web Conference 2020. 1660–1670.
- Yao et al. (2020b) Quanming Yao, Ju Xu, Wei-Wei Tu, and Zhanxing Zhu. 2020b. Efficient Neural Architecture Search via Proximal Iterations.. In AAAI. 6664–6671.
- Zhao et al. (2019) Zhichen Zhao, Lei Li, Bowen Zhang, Meng Wang, Yuning Jiang, Li Xu, Fengkun Wang, and Weiying Ma. 2019. What You Look Matters? Offline Evaluation of Advertising Creatives for Cold-start Problem. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2605–2613.
- Zhong et al. (2018) Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. 2018. Practical block-wise neural network architecture generation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2423–2432.
- Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.
- Zoph and Le (2016) Barret Zoph and Quoc Le. 2016. Neural Architecture Search with Reinforcement Learning. (2016).
- Zoph and Le (2017) Barret Zoph and Quoc V. Le. 2017. Neural Architecture Search with Reinforcement Learning. arXiv:cs.LG/1611.01578
- Zoph et al. (2018) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8697–8710.