A Concept Knowledge-Driven Keywords Retrieval Framework for Sponsored Search

Yijiang Lian, Yubo Liu, Zhicong Ye, Liang Yuan, Yanfeng Zhu, Min Zhao, Jianyi Cheng, and Xinwei Feng (Baidu)

Abstract.

In sponsored search, retrieving synonymous keywords for the exact match type is important for accurately targeted advertising. Data-driven deep learning-based methods have been proposed to tackle this problem. An apparent disadvantage of these methods is their poor generalization performance on entity-level long-tail instances, even though such instances might share similar concept-level patterns with frequent instances. With the help of a large knowledge base, we find that most commercial synonymous query-keyword pairs can be abstracted into meaningful conceptual patterns through concept tagging. Based on this fact, we propose a novel knowledge-driven conceptual retrieval framework to mitigate this problem, which consists of three parts: data conceptualization, matching via conceptual patterns, and concept-augmented discrimination. Both offline and online experiments show that our method is very effective. This framework has been successfully applied to Baidu’s sponsored search system, where it yields a significant improvement in revenue.

Sponsored Search; Keyword Matching; Keyword Retrieval; Paraphrase Pattern; Knowledge Graph

1. Introduction

Sponsored search advertising refers to the placement of ads on search result pages above or next to organic search results. As these sponsored ads directly target user’s query intention, they usually have a much higher conversion ratio. In the past few years, search advertising has become one of the most popular forms of digital advertising worldwide.

One of the most critical modules of the sponsored search system is the keyword matching module, which matches users’ queries to advertisers’ bidding keywords. (The term keyword is used specifically for queries purchased by advertisers.) Mainstream search engine companies provide a structured bidding language, with which advertisers can specify how their purchased keywords should be matched to online queries. Generally, three match types are provided (see https://support.google.com/google-ads/answer/7478529?hl=en): exact, phrase, and broad. Under the exact match type, ads are eligible to appear when a user searches for the specific keyword or its synonymous variants. For the phrase match type, the matched queries should contain the keyword or its synonymous variants. The broad match type further relaxes the matching restrictions to the level of semantic relevance. Because it targets precise traffic, most customers prefer the exact match type, which accounts for a large portion of the keyword revenue in most search engine companies. In this article, we focus on the problem of synonymous matching under the exact match type.

Since the synonymous query-keyword relationship is quite scarce, while the volumes of industrial queries and keywords are extremely large, the traditional boolean retrieval framework, which works well in broad match scenarios, is quite inefficient in the exact match scenario (Lian et al., 2020). Recently, data-driven deep learning methods have been applied in this scenario (Lian et al., 2020), where a translation model is trained on a high-quality paraphrasing dataset and is used to generalize and link more synonymous queries and keywords.

One problem of the data-driven deep learning-based approach is its poor generalization performance on long-tail instances (Hendrycks et al., 2020). Considering that nearly 70% of queries are entity related (Guo et al., 2009), we especially focus on entity-level long-tail cases. For example, if double-fold eyelid operation and Los Angeles are commonly observed in the training data, the translation model can generate many high-quality paraphrases for queries like the price of double-fold eyelid operation in Los Angeles; however, it would fail on queries like the price of liposuction in Denver if liposuction and Denver are rare in the training data, even though these two queries share the same abstract conceptual pattern the price of [aesthetic surgery] in [location]. The same is true for the discriminant model. With a limited amount of synonymous training data in hand and a huge number of queries to be addressed, industrial models regularly encounter long-tail cases. Making our retrieval system robust on these long-tail cases is therefore an important and urgent problem.

Conceptual abstraction and reasoning are common in the human learning process. Given a synonymous training instance (how much does double-fold eyelid operation cost in Los Angeles = the price of double-fold eyelid operation in Los Angeles), if we know that double-fold eyelid operation belongs to the concept [aesthetic surgery] and Los Angeles belongs to the concept [location], we can easily abstract the original paraphrase instance into a conceptual pattern form (how much does [aesthetic surgery] cost in [location] = the price of [aesthetic surgery] in [location]). When a long-tail query like the price of liposuction in Denver arrives, if we know that liposuction belongs to [aesthetic surgery] and has the alias "lipo", we can effortlessly infer a new paraphrase instance: (how much does lipo cost in Denver = the price of liposuction in Denver). In this way, the paraphrasing capability is transferred to the long-tail instances.

Inspired by this idea, we propose a concept knowledge-driven keyword retrieval framework, which comprises three phases: data conceptualization, matching via conceptual patterns, and concept-augmented discrimination. In the first phase of data conceptualization, the original paraphrasing data is transformed into the conceptual pattern form by concept tagging, where each token in the sentence would be labeled with a concept label. Core commercial concepts in the pattern are selected for further generalization usage and others are ignored. Secondly, a deep neural translation model is directly trained on this conceptualized data to capture the synonymous pattern expression’s variations. This model is used to link query patterns with keyword patterns. Finally, a concept-augmented discrimination model is utilized to filter out nonsynonymous cases.

Our offline experiments revealed that this conceptual retrieval framework can significantly boost the retrieval performance for long-tail cases. Besides, this framework has been successfully applied in Baidu’s sponsored search retrieval system, which yields a remarkable revenue growth.

2. Related Work

Patterns are widely used in task-oriented dialogue systems (Chen et al., 2017) (Li et al., 2017) (Bordes et al., 2016). Given an utterance, the natural language understanding module of the dialogue system parses it into semantic entity types (known as slots) and determines the user intents. Usually, slots and intents are pre-defined in a system. Patterns have also been studied in paraphrase generation and identification tasks (Lin and Pantel, 2001) (Ravichandran and Hovy, 2002) (Pang et al., 2003) (Szpektor et al., 2004). Pang et al. (2003) describe a syntax-based algorithm that automatically builds paraphrase patterns from semantically equivalent translation sets. Zhao et al. (2008) proposed a pivot approach for extracting paraphrase patterns from bilingual parallel corpora.

Knowledge graphs are commonly used in information retrieval systems (Wise et al., 2020) (Liu et al., 2018) (Eder, 2012) and recommendation systems (Wang et al., 2019a) (Sun et al., 2018) (Wang et al., 2019b) (Cao et al., 2019). Microsoft built its concept graph, named Probase (Wu et al., 2012), to optimize web search and document understanding services. Mustafa et al. (2008) used Resource Description Framework triples to compute thematic similarity for information retrieval. The Wikidata knowledge graph (Haase et al., 2017) was applied to Amazon’s virtual assistant Alexa to offer better answers to factual questions. A large-scale cognitive concept net was constructed by Luo et al. (2020) to better define user needs and offer a more intelligent shopping experience on their e-commerce platform. Liu et al. (2019) utilized query logs and the search click graph to discover user-centered concepts, at a more precise granularity, that can represent users’ interests.

3. Method

Our method consists of three parts: data conceptualization, matching via conceptual patterns, and concept-augmented discrimination.

3.1. Data Conceptualization

In this step, the original query/keyword is transformed into a conceptual pattern form based on concept tagging, which means assigning a domain concept label to each token in the sentence. Unlike the traditional named entity recognition (NER) task, which focuses on recognizing entities in a small number of categories such as organizations and locations, our concept tagging task covers entities and function words from all domains. The taxonomy is constructed based on Baidu’s Knowledge Graph, which covers 5 billion entities and 550 billion facts. To expand the conceptual pattern’s application scope, we prefer coarse concepts over refined concepts. For example, the entity double eyelid surgery belongs to the refined concept [eye plastic], which has a coarse hypernym concept named [aesthetic surgery]. Considering that most paraphrase patterns related to eye plastic can also be applied in aesthetic surgery scenarios, the coarse concept [aesthetic surgery] is selected.

Figure 1. The process of commercially customized concept tagging.

As is shown in Figure 1, the concept tagging procedure consists of three steps:

• Firstly, each token in the sentence is assigned with a concept label based on concept tagging.
• Secondly, core concepts are selected for further generalization and the remaining concepts are neglected. The core concept set is a manually selected subset of the ontology, which focuses on commercial entity related categories such as education, traveling, cosmetology, food, and so on, while function words like adverb and preposition are overlooked.
• Finally, the original sentence is extracted into a pattern form mixed with concept slots and normal texts, and slots’ corresponding entities are represented as slot values.

Therefore, the query in the figure how much does liposuction cost in Denver will be conceptualized as how much does [aesthetic surgery] cost in [location] with slot values [aesthetic surgery: liposuction] and [location: Denver].
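The three steps above can be illustrated with a minimal sketch. It assumes a tiny hand-written lexicon (`ENTITY_CONCEPTS`) and core concept set (`CORE_CONCEPTS`) in place of the knowledge-graph-based tagger, and uses greedy longest-match lookup over whitespace tokens; all names here are hypothetical and are not part of the deployed system.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: lowercase entity surface form -> coarse concept label.
ENTITY_CONCEPTS: Dict[str, str] = {
    "liposuction": "aesthetic surgery",
    "double-fold eyelid operation": "aesthetic surgery",
    "denver": "location",
    "los angeles": "location",
}
# Hypothetical core (commercial) concept set; everything else stays as plain text.
CORE_CONCEPTS = {"aesthetic surgery", "location"}


def conceptualize(sentence: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (pattern, slot_values) using greedy longest-match tagging."""
    tokens = sentence.split()
    max_len = max(len(entity.split()) for entity in ENTITY_CONCEPTS)
    pattern_tokens: List[str] = []
    slot_values: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            concept = ENTITY_CONCEPTS.get(span.lower())
            if concept in CORE_CONCEPTS:
                pattern_tokens.append(f"[{concept}]")
                slot_values.append((concept, span))
                i += n
                matched = True
                break
        if not matched:
            # Non-core tokens (e.g. function words) are kept as normal text.
            pattern_tokens.append(tokens[i])
            i += 1
    return " ".join(pattern_tokens), slot_values
```

Calling `conceptualize("how much does liposuction cost in Denver")` on this toy lexicon yields the pattern and slot values described in the paragraph above.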

3.2. Matching via Conceptual Patterns

The basic idea of our framework is to transform queries and keywords into conceptual patterns and conduct the synonymous query-keyword matching through conceptual pattern matching.

3.2.1. Pattern-to-Pattern Model Training

The matching between the query pattern and the keyword pattern is based on a pattern-to-pattern translation model. Given a high-quality paraphrase dataset, data conceptualization is performed on both sides to obtain parallel pattern training data. For example, the original instance (how much does double eyelid surgery cost in Los Angeles = the price of double-fold eyelid operation in Los Angeles) is changed into (how much does [Aesthetic Surgery] cost in [Location] = the price of [Aesthetic Surgery] in [Location]). For the sake of high precision, we further clean the conceptualized parallel pattern data by requiring that the left pattern be strictly aligned with the right pattern: not only the number of concept slots but also the corresponding entities should be the same.
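A minimal sketch of this cleaning rule follows. It assumes patterns are strings with bracketed slots and that slot values are (concept, entity) pairs compared as lowercased strings; a production system might instead compare knowledge-graph entity identifiers, so that different surface forms of one entity still align. The function name is hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple

SLOT_RE = re.compile(r"\[([^\]]+)\]")
Slots = List[Tuple[str, str]]  # (concept, entity surface form)


def is_strictly_aligned(left_pattern: str, left_slots: Slots,
                        right_pattern: str, right_slots: Slots) -> bool:
    """Keep a pattern pair only if both sides carry the same slots and entities."""
    left_concepts = Counter(s.lower() for s in SLOT_RE.findall(left_pattern))
    right_concepts = Counter(s.lower() for s in SLOT_RE.findall(right_pattern))
    if left_concepts != right_concepts:
        return False  # different number or type of concept slots

    def normalize(slots: Slots) -> Counter:
        # Assumption: string equality after lowercasing stands in for entity identity.
        return Counter((concept.lower(), entity.lower()) for concept, entity in slots)

    return normalize(left_slots) == normalize(right_slots)
```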

A transformer-based translation model is then trained on the conceptualized parallel data. In detail, concept slots are treated as normal tokens and added to the model’s vocabulary. The structure of the model follows the common sequence-to-sequence framework (Sutskever et al., 2014), where the encoder encodes the source sentence into a list of hidden states and the decoder generates words one by one until a special end symbol is generated.

3.2.2. Pattern-Based Keyword Matching

When an ad-hoc query comes, the following four steps are carried out to generate its synonymous candidates.

The first step is query conceptualization. It follows the same procedure as data conceptualization mentioned above.

The second step is pattern matching. The pattern translation model is utilized to link the query’s pattern with the keyword pattern repository, which is a conceptualized version of the original keyword repository. A prefix-tree-based targeted decoding trick is utilized (Lian et al., 2019) to ensure that all the decoded hypotheses are valid keyword patterns.
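The prefix-tree constraint can be sketched as follows. This is only an illustration of the general idea, not the implementation of (Lian et al., 2019): the trie is a nested dictionary, `score` is a hypothetical stand-in for the translation model’s per-token log-probability, and each concept slot is written as a single token such as `[aesthetic_surgery]`.

```python
from typing import Callable, Dict, List, Sequence, Tuple

END = "</s>"  # end-of-sequence symbol


def build_trie(patterns: Sequence[str]) -> Dict:
    """Build a nested-dict prefix tree over whitespace-tokenized keyword patterns."""
    root: Dict = {}
    for pattern in patterns:
        node = root
        for token in pattern.split() + [END]:
            node = node.setdefault(token, {})
    return root


def allowed_next(trie: Dict, prefix: Sequence[str]) -> List[str]:
    """Tokens that keep the hypothesis inside the keyword pattern repository."""
    node = trie
    for token in prefix:
        node = node.get(token)
        if node is None:
            return []
    return sorted(node)


def constrained_beam_search(trie: Dict,
                            score: Callable[[Tuple[str, ...], str], float],
                            beam_size: int = 2,
                            max_len: int = 12) -> List[str]:
    """Beam search in which only trie-permitted continuations are expanded."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    finished: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len + 1):
        candidates = []
        for prefix, total in beams:
            for token in allowed_next(trie, prefix):
                item = (prefix + (token,), total + score(prefix, token))
                (finished if token == END else candidates).append(item)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda x: -x[1])[:beam_size]
    finished.sort(key=lambda x: -x[1])
    return [" ".join(tokens[:-1]) for tokens, _ in finished[:beam_size]]
```

Because every expansion is drawn from the trie, each returned hypothesis is guaranteed to be an entry of the keyword pattern repository.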

The third step is keyword pattern instantiation. The concept slots in the previously retrieved keyword patterns are replaced with the query’s corresponding entities. To increase recall, the entities’ aliases from the knowledge database are also utilized. Since the instantiated patterns might not be real keywords, we further join them with the original keyword repository to remove invalid results.

The final step is synonymous keyword expansion. Since some synonymous relationships cannot be strictly expressed as aligned pattern forms, a synonymous expansion process is performed at the end. In detail, synonymous keywords are clustered in advance using a keyword-to-keyword synonymous retrieval method (Lian et al., 2020). If a previously retrieved keyword belongs to a keyword cluster, the whole cluster is merged into the final candidate queue.

The whole matching process is illustrated in Figure 2. When an ad-hoc query How much does liposuction cost in New York arrives, it is first conceptualized as How much does [Aesthetic Surgery] cost in [Location] with slot values [Aesthetic Surgery: liposuction] and [Location: New York]; meanwhile, entity aliases such as liposuction (lipo) are fetched from the knowledge database. Then the pattern translation model is utilized to link the query’s pattern with the conceptual keyword repository, which produces the following two synonymous keyword patterns: The price of [Aesthetic Surgery] in [Location] and What is the price of [Aesthetic Surgery] in [Location]. In the next step, concept slots in these keyword patterns are replaced with the corresponding query entities and their aliases, which generates three sentences: The price of liposuction in New York, The price of lipo in New York, and What is the price of liposuction in New York. We then join these sentences with the real keyword repository to remove invalid keywords; in this example, What is the price of liposuction in New York is removed. Finally, synonymous keyword expansion is performed.
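The instantiation, repository join, and expansion steps can be sketched as below. The data structures are assumptions made for illustration: `slot_values` maps a concept to the query entity, `aliases` maps an entity to its aliases, `keyword_repo` is a set of real keywords, and `clusters` maps a keyword to the members of its precomputed synonym cluster. The sketch enumerates every entity/alias combination, so it may generate more candidates than the walkthrough above before the join removes the invalid ones.

```python
from itertools import product
from typing import Dict, List, Set


def instantiate_patterns(patterns: List[str],
                         slot_values: Dict[str, str],
                         aliases: Dict[str, List[str]]) -> List[str]:
    """Fill concept slots with the query's entities and their aliases."""
    slots = list(slot_values)
    options = [[slot_values[s]] + aliases.get(slot_values[s], []) for s in slots]
    results: List[str] = []
    for pattern in patterns:
        for combo in product(*options):
            text = pattern
            for slot, surface in zip(slots, combo):
                text = text.replace(f"[{slot}]", surface)
            if text not in results:
                results.append(text)
    return results


def retrieve_keywords(patterns: List[str],
                      slot_values: Dict[str, str],
                      aliases: Dict[str, List[str]],
                      keyword_repo: Set[str],
                      clusters: Dict[str, Set[str]]) -> List[str]:
    """Instantiate, drop sentences that are not real keywords, then expand clusters."""
    valid = [k for k in instantiate_patterns(patterns, slot_values, aliases)
             if k in keyword_repo]
    final: List[str] = []
    for keyword in valid:
        members = [keyword] + sorted(clusters.get(keyword, set()) - {keyword})
        for member in members:
            if member not in final:
                final.append(member)
    return final
```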

3.3. Concept-Augmented Discrimination

To guarantee the synonymous quality of the final query-keyword pairs, an end-to-end concept-augmented discrimination is performed based on a domain fine-tuned BERT (Devlin et al., 2018) model.

Since the discriminant model also faces the long-tail instance generalization problem, directly using it might misjudge lots of conceptually retrieved cases. Therefore, we augment the fine-tuning data by replacing the entities with same-concept entities. In detail, for the original synonymous query-keyword instances, we replace the aligned concept slot of query and keyword with the same rare entity to get augmented positive cases and we replace them with different rare entities to get augmented negative cases, where literally confusable entities are particularly used. For the original negative query-keyword instances, with a probability of 50%, we replace the aligned concept slots with the same rare entity, and with a probability of 50%, we replace them with different rare entities. Offline experiments show that this method greatly improves the model’s robustness for long-tail instances.
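The augmentation rule can be sketched as follows. The sketch assumes a single aligned slot per pair and draws replacement entities uniformly from a hypothetical list of rare same-concept entities; the preference for literally confusable entities is not modeled, and original negative pairs are assumed to keep label 0 after replacement.

```python
import random
from typing import List, Tuple

Example = Tuple[str, str, int]  # (query, keyword, label)


def augment_pair(query_pattern: str, keyword_pattern: str, slot: str, label: int,
                 rare_entities: List[str], rng: random.Random) -> List[Example]:
    """Create augmented examples by filling the aligned slot with rare entities."""

    def fill(query_entity: str, keyword_entity: str, new_label: int) -> Example:
        return (query_pattern.replace(slot, query_entity),
                keyword_pattern.replace(slot, keyword_entity),
                new_label)

    same = rng.choice(rare_entities)
    first, second = rng.sample(rare_entities, 2)  # two distinct entities
    if label == 1:
        # Same entity on both sides stays synonymous; different entities do not.
        return [fill(same, same, 1), fill(first, second, 0)]
    # Original negatives: 50% same entity, 50% different entities (label stays 0).
    if rng.random() < 0.5:
        return [fill(same, same, 0)]
    return [fill(first, second, 0)]
```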

4. Offline Experiment

4.1. Matching via Conceptual Patterns

4.1.1. Dataset for The Translation Model

Data conceptualization is performed on one day’s query-keyword exact matching weblog, and 22 million paraphrasing pattern pairs $\widetilde{D}^{gen}$ are sampled from it. This pattern pair dataset is split into two parts: $\widetilde{D}^{gen}_{train}$ for training and $\widetilde{D}^{gen}_{dev}$ for development. The corresponding original datasets are denoted as $D^{gen}_{train}$ and $D^{gen}_{dev}$. The baseline translation model $M^{gen}$ and the conceptual translation model $\widetilde{M}^{gen}$ are trained on $D^{gen}_{train}$ and $\widetilde{D}^{gen}_{train}$, respectively. To evaluate the translation model’s generalization ability on entity-level long-tail cases, we construct a query test set $D^{gen}_{test}$ by joining all the queries in $D^{gen}_{train}$ with one week’s query weblog based on their contained entities. For simplicity, we only consider queries having one entity. Queries in $D^{gen}_{test}$ are grouped into 4 buckets according to the frequency $f$ of their entities in $D^{gen}_{train}$, using four frequency intervals: $1<f\leq 10$, $10<f\leq 100$, $100<f\leq 1000$, and $f>1000$. The queries in the first bucket are considered as the long-tail query set $D^{gen}_{test0}$.

4.1.2. Implementation Details

The translation model’s word embeddings are randomly initialized. The vocabulary contains 200,000 frequent words and 284 concept slots. The word embedding dimension and the number of hidden units are both set to 512. The maximum sequence length is limited to 12 at the word level. Both the encoder and the decoder are implemented with transformers having 4 layers and 8 heads. The model’s cross-entropy loss is minimized by Adam (Kingma and Ba, 2015) with an initial learning rate of $5\times 10^{-5}$ and a batch size of 64.

4.1.3. Results

With the beam size set to 50, $M^{gen}$ and $\widetilde{M}^{gen}$ are used to decode the queries in $D^{gen}_{test}$. For $\widetilde{M}^{gen}$, the concept slots in the decoded patterns are replaced with the original slot values in the query. For each bucket in $D^{gen}_{test}$, we sample 500 generated cases for binary human evaluation of synonymy. As shown in Table 1, $\widetilde{M}^{gen}$ performs much better than $M^{gen}$ on the lower-frequency buckets, outperforming it by 26 absolute points on the long-tail query dataset $D^{gen}_{test0}$. We can also see that the raw translation model’s performance is significantly influenced by the entity frequency in the training dataset. In contrast, the performance of $\widetilde{M}^{gen}$ is largely insensitive to the entity frequency.

Frequency	$M^{gen}$	$\widetilde{M}^{gen}$
1-10	37.4%	63.4%
10-100	53%	63.8%
100-1000	65.6%	66.4%
1000 or more	68.8%	68.6%

Table 1. Accuracy of the two translation models on $D^{gen}_{test}$. The frequency denotes the occurrence frequency of the query’s entity in $D^{gen}_{train}$.

4.2. Concept-Augmented Discrimination

4.2.1. Dataset for The Discriminant Model

460,000 query-keyword pairs $D^{dis}$ are sampled from the sponsored matching weblog for human evaluation of synonymy, covering all three match types (exact, phrase, and broad) in a ratio of 2:1:1. $D^{dis}$ is further split into three parts: $D^{dis}_{train}$ for training, $D^{dis}_{dev}$ for development, and $D^{dis}_{test}$ for testing. The concept augmentation procedure described in subsection 3.3 is then performed on $D^{dis}_{train}$. 55,200 augmented cases are sampled, which accounts for 12% of $D^{dis}_{train}$, and merged with $D^{dis}_{train}$ to get $\hat{D}^{dis}_{train}$. $D^{dis}_{test}$ can be considered a test set from a global perspective. However, since we care more about the discriminant model’s generalization ability on the conceptually retrieved long-tail cases, another test set $\bar{D}^{dis}_{test}$ is constructed by sampling 1000 cases from the cases generated by $\widetilde{M}^{gen}$ on $D^{gen}_{test0}$ for human evaluation.

4.2.2. Implementation Details

The paraphrase discriminant model is a binary classifier implemented with BERT. It takes a query-keyword pair separated by a special token as input and predicts 1 if the pair is synonymous and 0 if not. The model contains 24 layers and 16 self-attention heads, and its hidden dimension size is 1024. The parameters are initialized with ERNIE (Sun et al., 2019), a well-known Chinese pre-trained transformer published by Baidu. The baseline model $M^{dis}$ and the concept-augmented model $\hat{M}^{dis}$ are fine-tuned on $D_{train}^{dis}$ and $\hat{D}^{dis}_{train}$, respectively. The fine-tuning loss is minimized by Adam with an initial learning rate of $5\times 10^{-6}$ and a batch size of 128.

4.2.3. Results

We focus on two indicators: AUC (area under the curve) and the recall under 95%/70% precision. Table 2 shows that the concept-augmented model outperforms the baseline model by 16.24 points in recall under 70% precision on $\bar{D}^{dis}_{test}$. Since the data distributions of $\bar{D}^{dis}_{test}$ and $D_{test}^{dis}$ differ considerably, with long-tail cases occupying only a small proportion of the general test dataset $D_{test}^{dis}$, the indicators on $D_{test}^{dis}$ are only slightly improved.

Model	AUC-G	Recall-G	AUC-L	Recall-L
$M^{dis}$	97.53%	75.86%	59.85%	61.21%
$\hat{M}^{dis}$	97.53%	76.73%	65.31%	77.45%

Table 2. Results of different discriminant models on $D_{test}^{dis}$ and $\bar{D}_{test}^{dis}$. AUC-G and Recall-G denote the AUC value and recall ratio under the precision of 95% on the global test set $D_{test}^{dis}$. AUC-L and Recall-L denote the AUC value and recall ratio under the precision of 70% on the long-tail test set $\bar{D}_{test}^{dis}$.
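For readers unfamiliar with the recall-under-precision indicator, the following sketch shows one common way to compute it: sweep the decision threshold from the highest score downward and report the largest recall among thresholds whose precision meets the target. The function name and the tie handling are assumptions; the exact evaluation script behind Table 2 is not described in this paper.

```python
from typing import List


def recall_at_precision(labels: List[int], scores: List[float],
                        min_precision: float) -> float:
    """Largest recall over thresholds whose precision is at least min_precision."""
    total_positive = sum(labels)
    if total_positive == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    best_recall = 0.0
    true_positive = 0
    for rank, index in enumerate(order, start=1):
        true_positive += labels[index]
        # Only evaluate at real threshold boundaries, so tied scores stay together.
        if rank < len(order) and scores[order[rank]] == scores[index]:
            continue
        precision = true_positive / rank
        if precision >= min_precision:
            best_recall = max(best_recall, true_positive / total_positive)
    return best_recall
```

In this sketch, Recall-G corresponds to `min_precision=0.95` on the global test set and Recall-L to `min_precision=0.70` on the long-tail test set.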

Table 3 shows the performance of the discriminant model $\hat{M}^{dis}$ with different proportions of augmented data. We can see that the model’s recall on $\bar{D}_{test}^{dis}$ grows as the proportion increases. However, too much augmented data distorts the original data distribution, which hurts performance on the global test set $D_{test}^{dis}$. As the table shows, the performance on $D_{test}^{dis}$ peaks at a proportion of around 12%.

Proportion	Recall-G	Recall-L
8%	76.14%	72.18%
10%	76.58%	76.95%
12%	76.73%	77.45%
16%	76.17%	81.72%

Table 3. The performance of the discriminant model $\hat{M}^{dis}$ with different proportions of augmented data.

Query	Pattern	Entity Freq
海马丘比特质量怎么样	[品牌]质量怎么样	3
(How about the quality of the Haima Qiubite)	(How about the quality of [Brand])
欧米伽3哪里有	[食物营养素]哪里有	2
(Where can we buy OMEGA-3)	(Where can we buy [Food Nutrients])
炮附片的功效	[植物类中药材]的功效	1
(The effect of Paofu tablets)	(The effect of [Plant Chinese Medicinal Materials])
哈蜜瓜孕妇能吃么	[饮食]孕妇能吃么	1
(Can pregnant women eat cantaloupe)	(Can pregnant women eat [Food])
伦敦糕的做法与配方	[饮食]的做法与配方	1
(The practice and recipe of London cake)	(The practice and recipe of [Food])

Table 4. Some typical long-tail queries and their conceptual patterns, where entities are underlined and concepts are enclosed in square brackets. Entity Freq denotes the entity’s frequency in $D^{gen}_{train}$.

Query	$M^{gen}$	$\widetilde{M}^{gen}$
海马丘比特质量怎么样	海马好吗	海马丘比特这车怎么样
(How about the quality of the Haima Qiubite)	(How is Haima)	(How about the car named Haima Qiubite)
	海马下垂怎么样	海马丘比特质量好不好
	(How about hippocampus sagging)	(Is the quality of Haima Qiubite good)
	海马法语怎么样	海马丘比特质量如何
	(How about Haima French)	(How is the quality of Haima Qiubite)
哈蜜瓜孕妇能吃么	孕妇能吃橙子吗	怀孕了能吃哈蜜瓜吗
(Can pregnant women eat cantaloupe)	(Can pregnant women eat oranges)	(Can I eat cantaloupe while pregnant)
	孕妇可以吃益母草吗	孕妇可以吃哈蜜瓜吗
	(Can pregnant women eat leonotis)	(Can pregnant women eat cantaloupe)
	孕妇能吃黄芪吗	怀孕期间可以吃哈蜜瓜吗
	(Can pregnant women eat astragalus)	(Can I eat cantaloupe during pregnancy)

Table 5. The translation results of typical cases.

Query	Bidword	$M^{dis}$	$\hat{M}^{dis}$	Label
血HCG检查多少钱	血疮检查费用	0.6	0.03	0
(How much is the blood HCG test)	(Blood sore inspection fees)
花牛苹果什么科	嘎拉苹果属于什么科	0.67	0.07	0
(What family is Huaniu Apple)	(What family does Gala Apple belong to)
道奇汽车是哪国生产	道奇汽车产地在哪里	0.75	0.95	1
(Where are Dodge cars made)	(Where is the origin of Dodge cars)

Table 6. The prediction scores of different discriminant models on some typical long-tail query-keyword cases. The label 1 denotes that the query-keyword pair is synonymous, while 0 stands for the opposite.

4.3. Case Study and Discussion

4.3.1. Case Study

Table 4 shows some typical long-tail queries and their corresponding conceptual patterns. Table 5 further shows the top 3 results decoded by $M^{gen}$ and $\widetilde{M}^{gen}$. We can see that the traditional translation model performs badly on the long-tail queries and cannot preserve the source entities in the decoding phase. For example, in the case of Can pregnant women eat cantaloupe, the decoding results of $M^{gen}$ change the entity cantaloupe into oranges or other nonsynonymous items. In contrast, the concept-based decoding precisely retains the entity.

Table 6 shows the discriminant model’s inference scores on some typical long-tail cases. We can see that the model’s predicted scores have been well adjusted, where the new model predicts much higher scores for positive cases and much lower scores for negative ones after fine-tuning on the conceptually augmented data.

4.3.2. Pattern Coverage

The coverage of synonymous patterns plays a crucial role in the practical application of our method. Our human evaluation statistics show that nearly 70% of the existing commercial paraphrases can be transformed into conceptual forms. In fact, people’s commercial intentions are relatively concentrated in the sponsored search scenario, and most of these paraphrases can be expressed as stable pattern forms. For example, the price of goods is one of the most common commercial intentions, and it can be easily abstracted as how much is [Goods]. Table 4 shows some typical commercial patterns related to product quality, food recipes, and medicinal effects.

5. Online Experiment

As illustrated in Figure 3, we deployed a new module named CKBR (Concept Knowledge Based Retrieval) on Baidu’s sponsored search engine for retrieving exact-match-type keywords.

For latency concerns, the pattern matching phase is implemented under an online-offline mixed architecture. For frequent query patterns, the matched keyword patterns are computed ahead of time with a complex transformer model, and the results are saved in a lookup table for fast online retrieval. This offline process is repeated periodically to follow changes in the query population and keyword supply over time. For infrequent queries, a phrase-based statistical machine translation (PBSMT) framework is utilized to generate keyword patterns from scratch. Although neural machine translation (NMT) has outperformed PBSMT in many translation tasks, we find that PBSMT is still a cost-effective choice for industry: compared with NMT, it is much faster and requires no GPU. In our tests, PBSMT decodes 1200 synonymous variations in less than 60 ms, whereas a simple two-layer GRU-based (Chung et al., 2014) NMT model decodes only 30 synonymous variations in the same time window. (The performance test is conducted on a machine equipped with a 12-core Intel(R) Xeon(R) E5-2620 v3 clocked at 2.40 GHz, 128 GB of RAM, and 16 Tesla K40m GPUs. PBSMT runs with 10 CPU threads and a stack size of 100. NMT runs on a single GPU with a beam size of 30.) Another advantage of SMT is that all of the alignments are stored in an editable and readable phrase table, which makes it explainable and controllable.
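The online dispatch logic of this mixed architecture can be sketched in a few lines. The names are hypothetical: `lookup_table` stands for the periodically refreshed offline results, and `fallback_decoder` stands for the online PBSMT decoder, which is represented here only as a callable.

```python
from typing import Callable, Dict, List


def match_keyword_patterns(query_pattern: str,
                           lookup_table: Dict[str, List[str]],
                           fallback_decoder: Callable[[str], List[str]]) -> List[str]:
    """Serve frequent patterns from the offline table; decode the rest online."""
    cached = lookup_table.get(query_pattern)
    if cached is not None:
        # Frequent pattern: results were precomputed offline by the large model.
        return cached
    # Infrequent pattern: generate keyword patterns from scratch at request time.
    return fallback_decoder(query_pattern)
```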

The online discriminant model is implemented with a transformer having 2 layers and 4 heads, with a hidden size of 128. This model is distilled from the concept-augmented discriminant model $\hat{M}^{dis}$ described above. Due to space limitations, we do not elaborate on the details.

Platform	SHOW	CPM	CTR	SHOW-EXACT
Mobile	0.54%	1.51%	0.92%	9.14%
Desktop	1.58%	2.52%	1.43%	8.42%

Table 7. The online A/B test results of the CKBR module, which indicate the relative improvements over the current system.

A 10-day online A/B test was run on two platforms of Baidu’s sponsored search engine, corresponding to query flows from desktops and mobile devices. We focus on the following metrics.

• SHOW denotes the total number of ads shown to users.
• CTR = $\frac{\rm{CLICK}}{\rm{SEARCH}}$ denotes the average click-through rate. CLICK denotes the number of all ad clicks and SEARCH denotes the number of all searches.
• CPM = $\frac{\rm{REVENUE}}{\rm{SEARCH}}\times 1000$ denotes the average revenue per one thousand searches received by the search engine.
• SHOW-EXACT denotes the number of ads shown under the exact match type.
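As a worked illustration of these definitions, the sketch below computes CTR, CPM, and the relative improvement of a treatment group over a control group. The `Traffic` container and the function names are hypothetical, and any numbers passed in are purely illustrative; they are not taken from the experiment.

```python
from dataclasses import dataclass


@dataclass
class Traffic:
    """Aggregate counters for one arm of an A/B test (illustrative container)."""
    searches: int
    shows: int
    clicks: int
    revenue: float


def ctr(t: Traffic) -> float:
    """CTR = CLICK / SEARCH."""
    return t.clicks / t.searches


def cpm(t: Traffic) -> float:
    """CPM = REVENUE / SEARCH * 1000."""
    return t.revenue / t.searches * 1000


def relative_improvement(treatment: float, control: float) -> float:
    """Relative change of a metric, e.g. 0.05 means a 5% improvement."""
    return (treatment - control) / control
```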

As shown in Table 7, the CKBR module improves SHOW-EXACT by 8.42% on desktop flows and 9.14% on mobile flows. It also yields CPM growth of 2.52% on desktop flows and 1.51% on mobile flows. Meanwhile, a quality evaluation was conducted, in which 600 query-keyword cases under the exact match type were sampled from the system’s weblog and sent for binary judgment of synonymy. The evaluation result shows that the synonym accuracy increased by 0.7%.

6. Conclusions

In this paper, we have developed a novel knowledge-driven framework for addressing the synonymous keyword retrieval problem in sponsored search. Under this framework, the synonymous transformation can be understood as a combination of pattern transformation and alias replacement. Based on a large Chinese knowledge graph, we are able to conceptualize most commercial synonymous query-keyword pairs into abstract patterns. A deep translation model is trained on this conceptualized data to capture the synonymous pattern variations. Our offline experiments show that the new framework performs much better on long-tail cases. The whole framework has been implemented in Baidu’s keyword retrieval system based on an NMT/SMT mixed architecture, and it has yielded a significant improvement in revenue without degrading the quality of the user experience. The method’s application scope is not limited to synonymous retrieval under exact match; exploration of phrase match and broad match is in progress. We hope our method will shed some light on the future design of industrial sponsored search systems.

References

Bordes et al. (2016) Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683 (2016).
Cao et al. (2019) Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. 2019. Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences. In The world wide web conference. 151–161.
Chen et al. (2017) Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter 19, 2 (2017), 25–35.
Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Eder (2012) Jeffrey Scott Eder. 2012. Knowledge graph based search system. US Patent App. 13/404,109.
Guo et al. (2009) Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. 2009. Named entity recognition in query. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. 267–274.
Haase et al. (2017) Peter Haase, Andriy Nikolov, Johannes Trame, Artem Kozlov, and Daniel M Herzig. 2017. Alexa, Ask Wikidata! Voice Interaction with Knowledge Graphs using Amazon Alexa. In International Semantic Web Conference (Posters, Demos & Industry Tracks).
Hendrycks et al. (2020) Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. 2020. Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100 (2020).
Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR) (2015).
Li et al. (2017) Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008 (2017).
Lian et al. (2019) Yijiang Lian, Zhijie Chen, Jinlong Hu, Kefeng Zhang, Chunwei Yan, Muchenxuan Tong, Wenying Han, Hanju Guan, Ying Li, Ying Cao, et al. 2019. An end-to-end Generative Retrieval Method for Sponsored Search Engine–Decoding Efficiently into a Closed Target Domain. arXiv preprint arXiv:1902.00592 (2019).
Lian et al. (2020) Yijiang Lian, Zhenjun You, Fan Wu, Wenqiang Liu, and Jing Jia. 2020. Retrieve Synonymous keywords for Frequent Queries in Sponsored Search in a Data Augmentation Way. arXiv preprint arXiv:2008.01969 (2020).
Lin and Pantel (2001) Dekang Lin and Patrick Pantel. 2001. Discovery of inference rules for question-answering. Natural Language Engineering 7, 4 (2001), 343–360.
Liu et al. (2019) Bang Liu, Weidong Guo, Di Niu, Chaoyue Wang, Shunnan Xu, Jinghong Lin, Kunfeng Lai, and Yu Xu. 2019. A User-Centered Concept Mining System for Query and Document Understanding at Tencent. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1831–1841.
Liu et al. (2018) Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2018. Entity-duet neural ranking: Understanding the role of knowledge graph semantics in neural information retrieval. arXiv preprint arXiv:1805.07591 (2018).
Luo et al. (2020) Xusheng Luo, Luxin Liu, Yonghua Yang, Le Bo, Yuanpeng Cao, Jinghang Wu, Qiang Li, Keping Yang, and Kenny Q Zhu. 2020. AliCoCo: Alibaba E-commerce Cognitive Concept Net. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 313–327.
Mustafa et al. (2008) Jibran Mustafa, Sharifullah Khan, and Khalid Latif. 2008. Ontology based semantic information retrieval. In 2008 4th International IEEE Conference Intelligent Systems, Vol. 3. IEEE, 22–14.
Pang et al. (2003) Bo Pang, Kevin Knight, and Daniel Marcu. 2003. Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. 181–188.
Ravichandran and Hovy (2002) Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 41–47.
Sun et al. (2019) Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced Representation through Knowledge Integration. CoRR abs/1904.09223 (2019). arXiv:1904.09223 http://arxiv.org/abs/1904.09223
Sun et al. (2018) Zhu Sun, Jie Yang, Jie Zhang, Alessandro Bozzon, Long-Kai Huang, and Chi Xu. 2018. Recurrent knowledge graph embedding for effective recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. 297–305.
Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3104–3112.
Szpektor et al. (2004) Idan Szpektor, Hristo Tanev, Ido Dagan, and Bonaventura Coppola. 2004. Scaling web-based acquisition of entailment relations. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 41–48.
Wang et al. (2019b) Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019b. Multi-task feature learning for knowledge graph enhanced recommendation. In The World Wide Web Conference. 2000–2010.
Wang et al. (2019a) Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019a. KGAT: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 950–958.
Wise et al. (2020) Colby Wise, Vassilis N Ioannidis, Miguel Romero Calvo, Xiang Song, George Price, Ninad Kulkarni, Ryan Brand, Parminder Bhatia, and George Karypis. 2020. COVID-19 knowledge graph: accelerating information retrieval and discovery for scientific literature. arXiv preprint arXiv:2007.12731 (2020).
Wu et al. (2012) Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q Zhu. 2012. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. 481–492.
Zhao et al. (2008) Shiqi Zhao, Haifeng Wang, Ting Liu, and Sheng Li. 2008. Pivot approach for extracting paraphrase patterns from bilingual corpora. In Proceedings of ACL-08: HLT. 780–788.