Online Matching: A Real-time Bandit System for Large-scale Recommendations
Abstract.
The last decade has witnessed many successes of deep learning-based models for industry-scale recommender systems. These models are typically trained offline in a batch manner. While effective in capturing users’ past interactions with recommendation platforms, batch learning suffers from long model-update latency and is vulnerable to system biases, making it hard to adapt to distribution shift and explore new items or user interests. Although online learning-based approaches (e.g., multi-armed bandits) have demonstrated promising theoretical results in tackling these challenges, their practical real-time implementation in large-scale recommender systems remains limited. First, the scalability of online approaches in serving massive online traffic while ensuring timely updates of bandit parameters poses a significant challenge. Additionally, exploring uncertainty in recommender systems can easily result in an unfavorable user experience, highlighting the need for devising intricate strategies that effectively balance the trade-off between exploitation and exploration. In this paper, we introduce Online Matching: a scalable closed-loop bandit system that learns from users’ direct feedback on items in real time. We present a hybrid offline + online approach for constructing this system, accompanied by a comprehensive exposition of the end-to-end system architecture. We propose Diag-LinUCB – a novel extension of the LinUCB algorithm – to enable distributed updates of bandit parameters in a scalable and timely manner. We conduct live experiments in YouTube and show that Online Matching is able to enhance the capabilities of fresh content discovery and item exploration in the present platform.
1. Introduction
Recommender systems have become indispensable for users to explore and access content in the era of information overload. Matching users with items that fulfill their information needs by leveraging user and item data is a fundamental task in recommender systems. Compared to other machine learning domains such as language and vision where data is static, recommender systems are distinguished by dynamic user-system interactions that can give rise to a feedback loop, because new system policies are trained primarily on data generated from users’ past interactions. This feedback loop can further lead to the rich-get-richer problem, i.e., future recommendations may excessively prioritize items with high past engagement, emphasizing the need for exploration. Furthermore, real-world applications must be capable of processing substantial amounts of fresh content. For instance, millions of new videos are uploaded to YouTube on a daily basis. As users demand up-to-date information, it is important for recommender systems to explore their preferences for new items and promptly incorporate their feedback into the decision-making process.
In recent years, deep learning-based models have emerged as a dominant class of approaches for building recommender systems. Notably, the majority of such models still rely on supervised learning from users’ past interactions in batch mode, and their ability to explore and adapt in real time is limited at best. Recently, a line of work (e.g., (Gilotte et al., 2018; Swaminathan and Joachims, 2015; Thomas and Brunskill, 2016; Chen et al., 2022)) applied offline reinforcement learning (RL) to deep learning-based models to address system bias in the training data. Nonetheless, such offline RL approaches suffer from large latency in model updates, since they still rely on learning from logged user feedback in the batch training framework, rather than through real-time interactions with the environment. As a result, offline RL is typically inadequate for exploring new items in a timely and efficient manner, or for responding quickly to system changes such as shifts in user preferences.
In this paper, we focus on building a system with the capabilities of real-time learning and exploration. While online learning (e.g., multi-armed bandits) is well suited for the aforementioned considerations, its applications to large-scale real-world systems are much less explored than batch modeling. Practically, the challenges are mainly threefold:
- Large exploration space. The exploration space typically grows linearly with the number of candidate items in the system, which poses a significant challenge for most real-world systems that need to retrieve from millions of items. In the absence of an appropriately designed space pruning technique, exploration can come at a high cost to the user experience.
- Limited exploration traffic. Compared to traditional recommender systems that focus on exploiting known learnings, exploration by its nature can lead to short-term user experience degradation/regret. Therefore it is crucial to perform online exploration in a cost-efficient manner, typically within a limited budget such as 1% of randomly shuffled daily user traffic.
- Lack of real-time learning systems. Existing systems are mainly designed for training models in batch mode and serving them online with fixed parameters. It is certainly non-trivial to build a custom, highly scalable system for real-time learning with large data throughput.
This paper aims to tackle these challenges by introducing a real-time bandit system that explores items by learning users’ direct feedback in a closed loop. The proposed system, called Online Matching, adopts a hybrid offline and online learning approach in order to make the exploration system efficient, scalable, and responsive. The offline learning component of Online Matching is designed to prune the large exploration space and enable cost-efficient exploration. In particular, we train a two-tower neural network model to co-embed users and items into the same space, similar to existing neural retrieval systems such as (Yi et al., 2019). We then discretize the embedding space into a specified number of “user clusters” using off-the-shelf embedding clustering approaches. Finally, we build a sparse bipartite graph between user clusters and the items to be explored based on their embedding similarities. The key insight is that for a particular user cluster, items nearby in the embedding space are good candidates to explore. Note that all the aforementioned steps are performed offline. The user cluster-item edges in the sparse bipartite graph encode the most promising user cluster-item pairs to explore. The online learning component of Online Matching is based on our novel extension of the classic LinUCB algorithm (Chu et al., 2011), called Diag-LinUCB, which uses a user’s real-time distribution over the user clusters as the context. With this algorithmic design, our bandit parameters simply correspond to user cluster-item edges in the sparse bipartite graph and can be updated in real time in a fully distributed manner. The resulting online system boasts a very low policy update latency, i.e., the amount of time it takes from a user interaction to incorporating that feedback into the decision-making process.
Online Matching has been successfully deployed to YouTube for the use cases of fresh content discovery and item exploration. In the first case, our goal is to quickly identify high-quality fresh items through exploration with a small amount of traffic, and amplify their impact through exploitation with the majority of traffic. For item exploration, we aim to substantially increase the discoverable corpus by leveraging large exploration traffic while incurring minimal regret on user experience. We provide the experimental frameworks for running these two experiments, and demonstrate the effectiveness of Online Matching through improved topline metrics in both cases.
In a nutshell, our contributions are:
- Algorithmic methods. We provide a novel algorithmic framework that combines offline batch learning and online learning for building large-scale real-time bandit systems. Particularly, we propose Diag-LinUCB, a novel bandit algorithm that enables distributed bandit updates, allowing the system to scale and serve billions of users and millions of items.
- System design. We present a complete system design, from the offline learning pipeline for creating a sparse bipartite graph to the real-time bandit parameter aggregation performed by an online agent.
- Experiment design. We study two use cases of Online Matching – Fresh Content Discovery (Type-I) and Corpus Exploration (Type-II) – and present the experiment frameworks for impact measurement in both cases. This includes the application of a user-corpus partition framework to measure the growth of the discoverable item corpus.
- Live experiments. We conduct live experiments for both Type-I and Type-II use cases in YouTube with significant top-line metric gains. We demonstrate the value of Online Matching in improving user experience through discovering high-quality fresh videos and growing the discoverable item corpus.
2. Related Work
2.1. Neural Recommenders and Off-policy Learning
Neural modeling has emerged as the dominant approach for constructing large-scale recommender systems in a range of industry applications, including video discovery (Covington et al., 2016), news recommendations (Okura et al., 2017), and social networks (Liu et al., 2017; Zhai et al., 2017). These models cover a broad range of tasks, from retrieval (Yi et al., 2019) to ranking (Cheng et al., 2016; Zhao et al., 2019). The main focus of this paper is the retrieval problem, where the goal is to find a few related items from a large item corpus. This problem is also known as deep retrieval and has been extensively studied in the past. For example, a line of work (Yi et al., 2019; Yang et al., 2020; Chang et al., 2020; Yao et al., 2021) studied two-tower models for learning user and item representations in the same embedding space, allowing the retrieval problem to be converted to maximum-inner-product search (MIPS) (Guo et al., 2020) with sub-linear complexity. Compared to matrix factorization and extreme classification models (Covington et al., 2016), the two-tower architecture can leverage content features of items, making it more suitable for generating reasonable representations of fresh items with little user engagement. Although a large amount of training data is often available in real-world applications, neural models (including the aforementioned two-tower models) are typically trained offline on logged user feedback in a supervised fashion, making them biased toward existing recommendation policies and slow to adapt to system changes. In this paper, we propose an efficient exploration-space pruning strategy that uses two-tower models as a key part of our offline learning framework. Benefiting from the generalization capabilities of two-tower models, our strategy makes online learning feasible under strict latency and traffic constraints.
To alleviate the aforementioned bias problem, there has been a body of work bringing offline reinforcement learning (RL) techniques to neural recommender systems. The idea of off-policy learning is to estimate the value of a target policy, or to train it, using data collected by a different behavior policy. A series of works (see e.g. (Gilotte et al., 2018; Swaminathan and Joachims, 2015; Thomas and Brunskill, 2016)) focused on developing off-policy estimators through inverse-propensity weighting. For target policy learning, Chen et al. (Chen et al., 2022) applied off-policy correction to training a neural sequence model for video retrieval. There are also various papers (see e.g. (Joachims et al., 2017; Zhao et al., 2019)) studying how to learn unbiased ranking models from biased logged feedback. More recently, Ma et al. (Ma et al., 2020) proposed off-policy learning for two-stage recommender systems. One challenge in off-policy learning is that data collected from the existing behavior policy might not accurately represent the target policy. This is especially true for fresh items with minimal or no engagement data, where inverse propensity weighting can have high variance. Offline RL, like traditional batch-trained neural recommenders, is also subject to significant delays in data logging, model training, and deployment, as it is still based on the batch training paradigm.
2.2. Online Learning and Exploration
Online learning provides a useful framework for building recommenders that are adaptable to user feedback. One commonly used family of online learning algorithms is bandit algorithms, such as UCB (Auer et al., 2002), Thompson Sampling (Chapelle and Li, 2011), and LinUCB (Chu et al., 2011). These algorithms are designed to balance the exploration of new options with the exploitation of known good options, allowing the system to learn and adapt to user behavior over time. The application of bandits to recommenders dates back to the seminal work by Li et al. (Li et al., 2010) on using contextual bandits for news recommendation. The work in (Chapelle and Li, 2011) provided an evaluation of Thompson Sampling on real-world datasets. Mary et al. (Mary et al., 2015) studied bridging matrix factorization and bandits to solve the cold-start problem in recommenders. Beyond the use of bandits for exploring new items, Song et al. (Song et al., 2022) more recently proposed creating a hierarchical representation of the item space to help explore new user interests. To the best of our knowledge, most work on bandits for recommendation conducts experiments on offline synthetic or real-world datasets with simulated environments, and how to scale an online learning system to billions of users and millions of items with real-time updates is not well studied. In contrast, a key contribution of this paper is scaling in a real-world environment, and we provide the engineering details of how we built the Online Matching system.
Figure 1. Online Matching interacting with users and the environment: the agent makes recommendations to users and observes rewards.
2.3. Real-time Recommenders
Building real-time systems that can quickly process user feedback is critical for recommenders, especially in applications where new items are added quickly. StreamRec (Chandramouli et al., 2011) is a demonstration of event-driven fast data processing for training recommender models. Recently, Monolith (Liu et al., 2022) was introduced as a system for fast model training with a collisionless embedding table. In addition to real-time update capability, our Online Matching system is built with explicit exploration strategies to address the feedback loop problem (Chaney et al., 2018), which is not addressed in this stream of work.
3. Methods
3.1. Background
We aim to build Online Matching by learning users’ feedback on items in a closed loop; see Fig. 1 for an illustration. One naive approach is to apply multi-armed bandits per user. That is, for each user, we maintain a set of bandit parameters for all the explored items. This approach is not scalable: when there is a large number of explored items, many trials are needed for each user, making it hard for the bandit parameters to converge, especially when new items are continuously added. Therefore, we consider contextual bandits, where user context is modeled to allow the use of context features and across-user learning. Formally, let $t$ denote the current time step, $a_t$ denote the action selected at time $t$ from the entire action space $\mathcal{A}$, and $r_{t,a_t}$ denote the reward obtained by selecting action $a_t$ at time $t$. The goal is to learn a policy $\pi$ that maps the observed user feature $x_t$ to an action $a_t = \pi(x_t)$, such that the expected cumulative reward is maximized over a sequence of $T$ time steps:
(1) $\max_{\pi} \; \mathbb{E}\left[\sum_{t=1}^{T} r_{t,\pi(x_t)}\right]$
In contextual linear bandits, the reward function is assumed to be linear in the feature vector $x_t$, i.e.,

(2) $\mathbb{E}[r_{t,a} \mid x_t] = x_t^\top \theta_a$

for all $a \in \mathcal{A}$, where $\theta_a$ is an unknown weight vector associated with action $a$. The goal is to estimate the weight vectors $\theta_a$ ($a \in \mathcal{A}$) and use them to select actions that maximize the expected reward. A more generic formulation (Chu et al., 2011) assumes a single unknown weight vector $\theta^*$, with expected reward $\mathbb{E}[r_{t,a} \mid x_{t,a}] = x_{t,a}^\top \theta^*$, where $x_{t,a}$ represents both user and item features. We choose not to model item features in this work because we find the item id to be a very important feature for reflecting the finest differences among items, and adding it to $x_{t,a}$ would make this vector very high-dimensional. The setup in (2) is also called disjoint linear models, as studied in a few papers (Li et al., 2010; Deshpande and Montanari, 2012; Wang et al., 2017).
The LinUCB algorithm (Li et al., 2010) maintains an estimate of the unknown parameter vector $\theta_a$ for each action $a$ at each time step $t$. Let $A_{t,a}$ and $b_{t,a}$ be the $d$-by-$d$ positive definite matrix and the $d$-dimensional vector that represent the covariance matrix and the reward-weighted mean vector of the context features up to time $t$, respectively. Then the estimate of $\theta_a$ is given by:
(3) $\hat{\theta}_{t,a} = A_{t,a}^{-1} b_{t,a}$
LinUCB selects the arm with the highest upper confidence bound, namely:
(4) $a_t = \arg\max_{a \in \mathcal{A}} \left( x_t^\top \hat{\theta}_{t,a} + \alpha \sqrt{x_t^\top A_{t,a}^{-1} x_t} \right)$
where $\alpha > 0$ is a hyperparameter that controls the exploration-exploitation tradeoff. If action $a$ is chosen, $A_{t,a}$ and $b_{t,a}$ are then updated using the observed reward $r_t$ and context vector $x_t$ through:
(5) $A_{t+1,a} = A_{t,a} + x_t x_t^\top, \qquad b_{t+1,a} = b_{t,a} + r_t x_t$
For any action $a$ not chosen, we have $A_{t+1,a} = A_{t,a}$ and $b_{t+1,a} = b_{t,a}$. See Algorithm 1 for a detailed description.
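To make the algorithm concrete, here is a minimal Python sketch of disjoint LinUCB; the class and function names are ours and not the production implementation. It makes explicit the per-arm covariance inversion of Eqs. (3)-(4) that becomes the bottleneck at scale.

```python
import numpy as np

class LinUCBArm:
    """Per-arm state for disjoint LinUCB (a sketch of Algorithm 1)."""

    def __init__(self, d: int, alpha: float):
        self.alpha = alpha
        self.A = np.eye(d)    # covariance matrix A_a, initialized to identity
        self.b = np.zeros(d)  # reward-weighted context sum b_a

    def ucb(self, x: np.ndarray) -> float:
        A_inv = np.linalg.inv(self.A)  # O(d^3) inversion; cannot be cached under streaming updates
        theta_hat = A_inv @ self.b     # Eq. (3)
        return float(x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x))  # Eq. (4)

    def update(self, x: np.ndarray, r: float) -> None:
        self.A += np.outer(x, x)  # Eq. (5); requires arm-level synchronization
        self.b += r * x

def select_action(arms: list, x: np.ndarray) -> int:
    # Scores every arm per request, which is infeasible for millions of items.
    return max(range(len(arms)), key=lambda a: arms[a].ucb(x))
```

The scaling problems discussed below map directly onto this sketch: the loop in select_action (large action space), the np.linalg.inv call (covariance inversion), and the in-place updates of A and b (synchronous updates).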
Scaling problems of LinUCB. There are several practical issues that prevent us from directly using LinUCB for serving online traffic and getting real-time updates:
- Large Action Space. The computation of the UCB scores in Eq. (4) is over all actions or items. In our application, even after narrowing the corpus down to fresh items, there are still millions of items worth exploring.
- Covariance Inversion. Computing $\hat{\theta}_{t,a}$ and the confidence bound depends on calculating the inverse of the covariance matrix $A_{t,a}$, which has a size of $d$-by-$d$. Note that the inverse cannot be precomputed or cached due to the streaming updates. The online computational cost could be prohibitively high when facing large user traffic.
- Synchronous Updates. When one item is exposed to users, feedback can be received from many users at once. The streaming updates of $A_{t,a}$ and $b_{t,a}$ require item-level synchronization, which poses a challenge for maintaining a high throughput under large traffic. As LinUCB converges to exploiting fewer top items, synchronization cost can become a real bottleneck.
3.2. Sparse bipartite graph
To overcome the aforementioned challenges, we propose a novel variant of LinUCB. Before diving into it, we take a step back to build some intuition for our solution by looking at the problem of user-item matching from a graph perspective. As shown in Fig. 2(a), suppose there is a dense bipartite graph between users and items; the main goal of recommendation is to find good edges in the graph that connect users and items. Naively, we could explore all items for each user disjointly, but this would clearly be inefficient. Our idea is to group users into clusters and then reduce the exploration space within each cluster. As illustrated in Fig. 2(b), we build user clusters where each cluster can be considered a user cohort. For each cluster, representing users with a certain type of interest, we only consider a small subset of items to explore, since it is not worth exploring the many unrelated items in the corpus. Each user is assigned to multiple top clusters, and the cluster weights are used when deciding how to connect the user to the items in the corresponding clusters. The edges in the sparsified graph can then be converted to bandit parameters learnt online. Beyond these intuitions, we provide a principled bandit framing in Section 3.3.
Figure 2. Comparison of dense (a) and sparse (b) bipartite graphs: our bipartite graph is built on user clusters, and the links from user clusters to items are much sparser.
Embedding-based graph construction. With the graph view in place, we now describe how the graph is constructed offline. The idea is to leverage neural modeling. We first train a two-tower model to co-embed users and items, similar to the way many batch retrieval models (Yi et al., 2019) are built nowadays. Specifically, we train two neural networks, denoted by $f_u$ and $f_v$, to encode user and item features into embeddings $u$ and $v$, respectively. Given a batch of $B$ positive user-item pairs $\{(u_i, v_i)\}_{i=1}^{B}$, we use the batch softmax loss to train the two-tower model:

(6) $\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{e^{\langle u_i, v_i \rangle / \tau}}{\sum_{j=1}^{B} e^{\langle u_i, v_j \rangle / \tau}}$

where $\tau$ is a hyperparameter representing the softmax temperature. In practice, using normalized embeddings, i.e., $\|u\| = \|v\| = 1$, can greatly improve trainability. As a result, we need to scale the logits by $1/\tau$ to obtain meaningful softmax probabilities, as the raw logits are limited to the range of -1 to 1. We omit further details here and refer interested readers to (Yi et al., 2019), as the offline modeling is not the main focus of this paper.
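As an illustration, the following is a minimal NumPy sketch of the in-batch softmax loss in Eq. (6), assuming L2-normalized user and item embedding matrices U and V for a batch of B positive pairs; the names are ours.

```python
import numpy as np

def batch_softmax_loss(U: np.ndarray, V: np.ndarray, tau: float) -> float:
    """In-batch softmax loss of Eq. (6); U, V have shape [B, dim], rows L2-normalized."""
    logits = (U @ V.T) / tau                     # [B, B]; raw dot products lie in [-1, 1]
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the positive (u_i, v_i) pairs; the other in-batch
    # items serve as sampled negatives.
    return float(-np.mean(np.diag(log_probs)))
```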
Next, we apply off-the-shelf clustering algorithms to discretize the embedding space into a set of user clusters, based on a large sample of user embeddings. For each user cluster, we choose the set of items with the highest embedding similarity, measured by the dot product between the cluster centroid embedding and item embeddings. This step largely narrows down the exploration space, allowing the online learning algorithm to focus on the items that have a higher probability of success. More details can be found in Algorithm 2. Note that various clustering algorithms can be applied in our proposed framework, though we adopt k-means in our system for simplicity.
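Below is a hedged sketch of this offline step in the spirit of Algorithm 2 (which we do not reproduce exactly), using scikit-learn's k-means; n_clusters and items_per_cluster are illustrative values, not production settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_sparse_graph(user_emb: np.ndarray, item_emb: np.ndarray,
                       n_clusters: int = 100, items_per_cluster: int = 50):
    """Cluster user embeddings, then link each cluster to its nearest items."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(user_emb)
    centroids = kmeans.cluster_centers_          # [K, dim]
    scores = centroids @ item_emb.T              # dot-product similarity, [K, n_items]
    # Keep only the top items per cluster: these (cluster, item) pairs are the
    # edges of the sparse bipartite graph handed to the online agent.
    top_items = np.argsort(-scores, axis=1)[:, :items_per_cluster]
    edges = {c: set(top_items[c].tolist()) for c in range(n_clusters)}
    return centroids, edges
```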
3.3. Sparse Linear Bandits and Diagonal LinUCB
Now that we have explained the graph sparsification, we can proceed to introduce our bandit learning algorithm. Note that if we assigned each user to only one cluster during online learning, we could view the items in each cluster as distinct arms in a multi-armed bandit problem and utilize the UCB algorithm. Nonetheless, limiting each user to a single cluster could lead to a considerable loss of information from the user embeddings. To tackle this issue, we instead assign each user to the closest $m$ (e.g., 10) clusters and also incorporate the cluster weights as part of the context representation. The question now becomes how to design an effective exploration-exploitation strategy that handles the many-to-many mapping from both users to clusters and items to clusters. We propose a principled framing called sparse linear bandits, where the cluster weights can be effectively treated as a sparse context representation of a user in the high-dimensional space $\mathbb{R}^K$.
Sparse linear bandits. Given $K$ clusters, let $x_t \in \mathbb{R}^K$ represent the weights of the top-$m$ clusters for the query from user $u_t$. Note that $x_t$ is very sparse because $K$ is large and $m \ll K$. On the other hand, suppose for each item $a$ we have the “ground-truth” parameter $\theta_a \in \mathbb{R}^K$, where the $j$-th coordinate $\theta_a[j]$ represents the quality or value of item $a$ for cluster $j$. Based on the sparse graph in Fig. 2(b), we can assume that $\theta_a[j] = 0$ if there is no edge between item $a$ and cluster $j$. Accordingly, $\theta_a$ is also sparse, and the supports of $x_t$ and $\theta_a$ are disjoint for most items. It is possible that some items (e.g., the popular ones with a large audience) can belong to many clusters, but we can always control the sparsity of $\theta_a$ by setting a maximum degree per item. Similar to linear bandits, the reward is assumed to satisfy $\mathbb{E}[r_{t,a} \mid x_t] = x_t^\top \theta_a$. Based on our sparsity assumption, we can see that $x_t^\top \theta_a$ is 0 for most (user, item) pairs, and such pairs won’t be explored in our algorithm. In other words, the Large Action Space problem of LinUCB is largely mitigated by our graph sparsification. Now we introduce a novel approximation, Diagonal LinUCB, to avoid computing the expensive Covariance Inversions and, at the same time, to address the Synchronous Updates problem.
Diagonal LinUCB. The key idea is to maintain and utilize only the diagonal terms of the covariance matrix $A_{t,a}$ for item $a$ at step $t$. This is inspired by adaptive gradient descent methods (e.g., Adagrad and Adam) where only the diagonal terms of the Hessian matrix are used; indeed, the covariance matrix is essentially the Hessian matrix for solving linear regression. From this point forward, we omit the subscript $t$ to simplify notation. Let the vector $d_a$ denote the diagonal terms of $A_a$ at some step. Let $x[j]$ be the $j$-th coordinate of $x$, and let $d_a[j]$ be the $j$-th coordinate of $d_a$. For a vector $z$, let $\mathrm{supp}(z)$ denote its support, i.e., the set of indices with non-zero entries. Then the update rules for $d_a$ and $b_a$ become:
(7) $d_a[j] \leftarrow d_a[j] + x[j]^2, \qquad b_a[j] \leftarrow b_a[j] + r\, x[j], \qquad \forall j \in \mathrm{supp}(x) \cap \mathrm{supp}(\theta_a)$

where $\mathrm{supp}(\theta_a)$ is given by the sparse graph from Algorithm 2. Furthermore, the UCB score becomes

(8) $\mathrm{UCB}_a(x) = \sum_{j \in \mathrm{supp}(x) \cap \mathrm{supp}(\theta_a)} x[j]\, \frac{b_a[j]}{d_a[j]} \;+\; \alpha \sqrt{\sum_{j \in \mathrm{supp}(x) \cap \mathrm{supp}(\theta_a)} \frac{x[j]^2}{d_a[j]}}$
In exploitation mode, we drop the confidence bound term, and the estimated reward becomes
(9) $\hat{r}_a(x) = \sum_{j \in \mathrm{supp}(x) \cap \mathrm{supp}(\theta_a)} x[j]\, \frac{b_a[j]}{d_a[j]}$
In the experiment section below, we will discuss how the exploitation mode is used in the use case of fresh content discovery. Putting things together, our detailed algorithm is shown in Algorithm 3.
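To make Eqs. (7)-(9) concrete, here is a minimal sketch of a Diag-LinUCB agent under our own simplifications (it does not reproduce Algorithm 3 exactly). It assumes clusters_of[a] gives supp(theta_a), i.e., the clusters linked to item a in the sparse graph, and that the context x is a dict mapping the user's top-m cluster ids to weights.

```python
import math
from collections import defaultdict

class DiagLinUCB:
    def __init__(self, clusters_of: dict, alpha: float):
        self.clusters_of = clusters_of     # item -> set of linked clusters, supp(theta_a)
        self.alpha = alpha
        # One scalar pair (d, b) per (cluster, item) edge: fully distributed,
        # no item-level matrix to invert or synchronize.
        self.d = defaultdict(float)        # diagonal covariance terms, Eq. (7)
        self.b = defaultdict(float)        # reward-weighted context sums, Eq. (7)

    def ucb(self, a, x: dict) -> float:
        mean, var = 0.0, 0.0
        for j, w in x.items():
            if j not in self.clusters_of[a]:
                continue                   # edge absent from the sparse graph
            d = self.d[(j, a)]
            if d == 0.0:
                return math.inf            # never-explored edge: infinite confidence bound
            mean += w * self.b[(j, a)] / d # exploitation term, Eq. (9)
            var += w * w / d
        return mean + self.alpha * math.sqrt(var)  # Eq. (8)

    def update(self, a, x: dict, r: float) -> None:
        for j, w in x.items():
            if j in self.clusters_of[a]:   # restrict to supp(x) ∩ supp(theta_a)
                self.d[(j, a)] += w * w    # Eq. (7)
                self.b[(j, a)] += r * w
```

Each update touches only a handful of (cluster, item) scalars, which is what allows the parameters to live in a distributed store and be written without item-level locks.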
Discussion on Eq. (7). Now we take a closer look at the parameter updates in Eq. (7). Given a reward $r$, the update of $b_a$ gives higher weights to clusters closer to the user embedding. In other words, for a user who is more familiar with an explored item, their feedback on the item is trusted more. It is worth noting that when $m = 1$, Eq. (7) reduces to multiple UCB bandits that run separately for each cluster. As shown in our experiments, incorporating cluster weights outperforms treating all clusters equally. Compared to LinUCB, the updates in Diag-LinUCB are much more lightweight and, more importantly, do not require item-level synchronization, since the updates are fully distributed over the edges in the sparse graph. This property eases the design of the online learning infrastructure and significantly improves system throughput.
Context vector. There are a few options for computing the context vector $x_t$ from the user embedding $u_t$. The naive way is to let $x_t[j] = \langle u_t, c_j \rangle$, where $c_j$ is the embedding of centroid $j$. As mentioned in Section 3.2, we apply embedding normalization and a softmax temperature when learning the two-tower model, which usually leads to the top clusters having similar weights with small numerical differences. This is expected because small differences in logits are amplified by the temperature and the softmax function during training. Directly using the logits as the context vector therefore does not reflect the user’s true preferences over the various clusters. To address this issue, one option is to apply a softmax transform similar to the one used in the training stage, namely
(10) $x_t[j] = \frac{e^{\langle u_t, c_j \rangle / \tau'}}{\sum_{j' \in \text{top-}m} e^{\langle u_t, c_{j'} \rangle / \tau'}}, \qquad j \in \text{top-}m \text{ clusters}$
where $\tau'$ is another hyperparameter. One limitation of this approach is that $\tau'$ must be tuned empirically. Another option is to train a multi-class classification model to directly predict the distribution over user clusters. However, this requires a much more intricate pipeline that involves training both the two-tower model and the user clusters. In this paper, we choose the simple approach in Eq. (10) and leave the second option to future work.
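For completeness, here is a small sketch of this computation, assuming a normalized user embedding u and a centroid matrix C; m and the temperature value are illustrative.

```python
import numpy as np

def context_vector(u: np.ndarray, C: np.ndarray, m: int = 10, tau_c: float = 0.05) -> dict:
    """Sparse context x_t of Eq. (10): softmax over the user's top-m clusters."""
    sims = C @ u                     # logits: dot products with all K centroids
    top = np.argsort(-sims)[:m]      # indices of the m closest clusters
    z = np.exp(sims[top] / tau_c)    # a small temperature sharpens near-tied logits
    w = z / z.sum()
    return {int(j): float(wj) for j, wj in zip(top, w)}
```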
4. System Overview
In this section, we provide an overview of the Online Matching system. We introduce the end-to-end workflow, followed by a detailed overview of the online agent in Section 4.2.
4.1. End-to-end workflow
The entire Online Matching workflow consists of an offline pipeline and an online agent responsible for the closed-loop learning, as shown in Fig. 3. The offline pipeline produces a sparse bipartite graph, as introduced in Section 3.2, which is consumed by the online agent. The pipeline has the following components:
Figure 3. The end-to-end Online Matching system. The offline pipeline consists of an offline trainer that exports a two-tower model daily, and a clustering & graph construction module that exports sparse graphs hourly. The online system takes the sparse graphs generated by the offline pipeline and uses an online agent to learn from real-time user feedback.
- Two-tower model trainer. We train an offline two-tower model by sequentially consuming a large amount of logged user feedback over time. The sequential training ensures that the model can adapt to the distribution change in the latest batch of data. As mentioned earlier, this model encodes item features, allowing it to generate meaningful embeddings even for newly added items. The two-tower model is exported on a daily basis, with both towers used by the downstream components to create the sparse bipartite graph. The user tower is also used by the online system to generate the user embedding $u_t$ and the context vector $x_t$.
- Candidate selection. This component creates a corpus of items eligible for exploration. Multiple filters are applied to ensure the selected candidates satisfy our strict trust-and-safety criteria. In this paper, our system is mainly focused on exploring fresh videos, therefore a rolling time window that covers a few days is used for item selection. We also apply various quality thresholds to balance the quality and size of the corpus.
- Clustering and graph building. Once clustering is finished based on the most recently exported two-tower model, the graph builder is triggered to build the sparse graph according to Algorithm 2. The graph building process is executed in both batch and real-time modes concurrently. In batch mode, the graph builder takes the output of the candidate selection step and exports a new graph every few hours. Real-time mode complements batch mode by incrementally updating the sparse graph with newly eligible items, ensuring a small latency for items to enter the exploration pool.
The online agent is the system that runs the bandit algorithm and aggregates user feedback in real time. It takes the sparse graph produced by the above pipeline as input. Whenever the sparse graph is updated, the bandit parameters in the online agent are synchronized with low latency, as sketched below: new edges are added with an infinite confidence bound (so that they are prioritized for future exploration), and old edges that appear only in the previous graph version are removed. In the next section, we delve into the specifics of the online agent.
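A sketch of this synchronization step, assuming params maps (cluster, item) edges to their (d, b) state and new_edges is the edge set of the freshly built graph; the names are illustrative.

```python
def sync_graph(params: dict, new_edges: set) -> dict:
    """Align bandit parameters with a new sparse-graph version."""
    # Keep state for edges that survive into the new graph version.
    synced = {e: params[e] for e in new_edges if e in params}
    # Fresh edges start with d = 0, which yields an infinite confidence bound
    # in Eq. (8), so they are prioritized for exploration.
    for e in new_edges - params.keys():
        synced[e] = {"d": 0.0, "b": 0.0}
    # Edges present only in the previous graph version are dropped implicitly.
    return synced
```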
4.2. Online Agent
Figure 4. The online agent aggregates users’ feedback and accommodates newly eligible videos. User feedback flows through the log processor, the feedback aggregation processor (which exchanges information with Bigtable), the cluster-to-candidates lookup service, and finally the recommender service. The graph building processor monitors the video metadata database and updates the cluster-to-candidates lookup service with newly eligible videos.
As shown in Fig. 4, the key components of the online agent are chained in a closed loop. Explored items from Online Matching are shown at a fixed position in the UI, so that users’ direct feedback on them can be measured without being affected by existing ranking policies. The log processor incrementally generates various kinds of engagement signals on the explored items, in a format compatible with the downstream jobs. The feedback aggregation processor is built on top of Bigtable (Chang et al., 2006) and is responsible for aggregating pairwise (cluster, video) bandit parameters according to Eq. (7). As illustrated in Table 1, each row in the Bigtable represents one cluster, and each column represents one of that cluster’s items in the sparse graph. Conceptually, Bigtable is a sparsely populated table that can scale to billions of rows and columns, which matches the fully distributed edge-level updates of the proposed Diag-LinUCB algorithm.
| Column | Cell value |
| --- | --- |
| feedback:⟨video 1⟩ | { ⋯: ⋯, ⋯: ⋯, ⋯: ⋯ } |
| feedback:⟨video 2⟩ | { ⋯: 57.6, ⋯: 144.6, ⋯: 1.8 } |
| feedback:⟨video 3⟩ | { ⋯: 76, ⋯: 547.1, ⋯: 2.6 } |
Bandit parameters in Bigtable are frequently pushed to the cluster-to-candidates lookup service. The recommender service obtains a user’s clusters and looks up the candidates and their bandit parameters from the lookup service. It then ranks all the candidates according to the UCB score in Eq. (8), or according to the estimated reward in Eq. (9) if the goal is to exploit high-quality candidates.
It is worth noting that, compared to the classic bandit setup with a fixed set of arms, Online Matching always has fresh items arriving as new arms. In particular, fresh items are continuously injected into the bipartite graph through the graph building pipeline. New items that have never been explored have an infinite confidence bound in Eq. (8) and are therefore prioritized for exploration in the recommender service. Due to the batch addition of new items, we can observe spikes of infinite UCB scores, as shown in Fig. 5. The spikes usually disappear quickly, demonstrating that users’ feedback on new items is quickly incorporated into the bandit parameters.
Figure 5. The number of candidates with infinite UCB scores over time. The two spikes correspond to batch additions of new items and disappear quickly.
4.3. System Performance
To demonstrate the real-time performance of Online Matching, we measure the following two types of update latency:
- Policy update latency: the period of time from the point a user sees the explored item to the point the user’s feedback on the item is incorporated into the bandit parameters held by the lookup service in Fig. 4. Since our primary application is video recommendation, this latency also includes the user’s watch time on the video (capped at a certain value) as a major part of the latency in the log processor. In fact, sessionizing user feedback in the log processor contributes most of the policy update latency.
- Corpus update latency: the period of time from the point a fresh item becomes eligible for exploration to the point the item is added to the sparse graph.
The median and the 95th percentile of both latencies are summarized in Table 2. Besides low latency, our system achieves high throughput: it can handle millions of bandit updates per second, allowing it to scale to billions of users.
| | P50 (minutes) | P95 (minutes) | Throughput (updates/second) |
| --- | --- | --- | --- |
| Policy update latency | 45 | 74 | O(1M) |
| Corpus update latency | 41.1 | 60.1 | O(1K) |
In particular, the low policy update latency is critical for recommendation quality, since the expected regret grows as the feedback delay increases (Joulani et al., 2013). To verify this empirically, we added artificial latency to the aggregation processor in Fig. 4. As shown in Table 3, as latency was introduced, the agent became less capable of identifying low-performing bandits, resulting in a decrease in CTR and total rewards on explored items.
| | CTR | Total Rewards |
| --- | --- | --- |
| Baseline with no artificial delay injected | - | - |
| 20 min delay added | -2.82% | -11.82% |
| 40 min delay added | -4.4% | -22.84% |
5. Experiment Results
We conduct our experiments on one major recommendation surface of YouTube – one of the world’s largest video content discovery, creation, and sharing platforms. Similar to many industry-scale recommender systems, the recommendation surface we consider consists of multiple stages, including candidate generation and a multi-task ranking system (Zhao et al., 2019). Given a user and the corresponding context, the candidate generation stage employs multiple video retrieval systems to generate a few hundred to a few thousand candidate videos, with a primary focus on optimizing recall. The ranking stage then takes the set of retrieved videos as input and generates the final ranking. In the following two use cases, Online Matching is added as a new retrieval source in the candidate generation stage.
5.1. Use Cases
We primarily consider two use cases of Online Matching:
- Fresh Content Discovery (Type-I). Content creators upload millions of videos to YouTube on a daily basis. These videos vary greatly in quality, and traditional batch-learning-based recommender systems are limited in their capability to identify high-quality fresh videos and recommend them to the appropriate audience in a timely manner. Intuitively, an efficient exploration system for fresh content can supplement traditional recommender systems by identifying and leveraging quality fresh content to improve user experience, particularly satisfied engagement with fresh content. To compensate for the quality loss from exploration, the idea is to use a very small slice of traffic for exploration, and to conduct exploitation (i.e., amplifying high-quality candidates) on the rest of the traffic.
- Corpus Exploration (Type-II). Real-world recommender systems typically have a long-tail distribution over the recommendation corpus. Traditional recommender systems are primarily exploitation-based, due to a feedback loop that reinforces the existing “winners” in the corpus. Exploring and leveraging the long-tail portion of the corpus is challenging given its vast size, but also rewarding. In this case, the goal is to grow the discoverable corpus. A larger discoverable corpus is argued to bring long-term value to recommenders, e.g., by reducing system uncertainty in sparse data regions (Chen, 2021). It can also provide torso and long-tail content creators more opportunities to showcase their content to a wider audience. However, similar to Type-I, explicitly exploring long-tail and fresh items can lead to a short-term loss of user engagement and satisfaction. Therefore, the goal of this use case is to enlarge the discoverable corpus while having minimal regret on short-term engagement.
Online Matching is a fitting solution for the above use cases thanks to the efficiency of both its exploration algorithm and its system. On one hand, Online Matching trims the exploration space through offline learning, making it possible to use only a small amount of traffic for exploration in the first use case. On the other hand, the real-time nature of the Online Matching system helps minimize exploration errors, so that it can effectively enlarge the discoverable corpus while maintaining a small regret on short-term engagement in the second use case.
5.2. Fresh Content Discovery
5.2.1. Experiment Setup.
We select up to a fixed number of fresh videos uploaded within the past several days (referred to as X days in the rest of this paper) as the exploration corpus. Multiple offline and online filters are applied to make sure the explored videos are safe and satisfy minimum quality requirements. We use a small slice of frequently shuffled user traffic (1-2%) to explore this corpus and learn the bandit parameters. In exploration mode, to obtain unbiased user feedback, the candidate picked by Online Matching bypasses the ranking layer and is shown at a fixed position in the UI. We refer to one experiment running in this exploration mode as an exploration slot. The reward we use is a combination of multiple signals representing the user’s happiness with the recommended videos. If any video is filtered during exploration, the filtering information is also used as part of the reward in the bandits. It is worth noting that, besides hard filtering, the sparse graph construction also controls candidate quality through the offline-learnt embeddings.
To amplify the learnt good fresh videos to all users, we set up another online agent that works in an “exploitation” mode on the rest of the traffic (98-99%). In particular, it reuses the corpus and bandit parameters from the aforementioned exploration agent and ranks videos only by the estimated reward in Eq. (9). In exploitation, since there is no need to collect user feedback, multiple top candidates by mean reward are passed to the ranking layer together with candidates from other sources. The exploration corpus is updated on a rolling basis: newly eligible videos are continuously injected into the corpus, and videos uploaded more than X days ago continuously graduate from it.
Top-k randomization. Unlike the theoretical bandit model, where feedback arrives instantly, our system has a nontrivial policy update latency. This latency can cause the system to over-explore certain items before collecting their rewards. To address this issue, we introduce a randomization mechanism that uniformly samples a video from the top-$k$ videos ranked by UCB score. We use a fixed $k$ in the following experiments.
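A sketch of this mechanism, where scored is a list of (item, ucb_score) pairs produced by the recommender service; the function name is ours.

```python
import random

def pick_with_topk_randomization(scored: list, k: int):
    """Sample uniformly from the k highest-UCB candidates instead of the argmax,
    so that delayed feedback does not cause a single item to be over-explored."""
    top_k = sorted(scored, key=lambda p: p[1], reverse=True)[:k]
    return random.choice(top_k)[0]
```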
5.2.2. Results.
We ran an online user-diverted A/B test for a week to understand the effectiveness of our explore-and-amplify framework. For both exploration and exploitation modes, the control is the production recommender system without the Online Matching agent as an additional candidate generator.
Gains from exploitation. The results of Online Matching’s exploitation mode are shown in Table 4. As the baseline, we use a setup in which all user clusters are treated equally, to show the value of our user context modeling. We report two Diag-LinUCB arms, where the second one uses a larger corpus and more clusters. Note that with the larger graph, we had to increase exploration traffic from 1% to 2% to avoid under-exploration. Overall, we see an improved topline satisfied-engagement metric from Online Matching and significant gains on the fresh-item slice. To study long-term impact, we ran a holdback for several weeks. As Fig. 6 shows, there is a +0.04% improvement in daily active users, a long-term metric that is much more difficult to improve than short-term user engagement.
| | Satisfied user engagement | Engagement with fresh content | # clusters | Fresh corpus size | Graph size (# edges) |
| --- | --- | --- | --- | --- | --- |
| Equal-weight Bandit | +0.03% | +3.61% | 15k | 1x | 4M |
| Diag-LinUCB | +0.08% | +5.25% | 15k | 1x | 4M |
| Diag-LinUCB (Larger Graph) | +0.15% | +8.33% | 30k | 3x | 20M |
Figure 6. Daily Active Users percentage change over the one-month holdback period, showing a slight upward trend.
Cost of exploration. We observed -0.16% and -0.19% satisfied user engagement losses from the 1% and 2% exploration slots for the 2nd and 3rd rows in Table 4, respectively. Discounting the engagement loss by the small traffic proportion, it is clear that the value added by exploitation far outweighs the cost of exploration.
Figure 7. Percentage change in daily discoverable corpus size (unique videos) across impression thresholds. Improvement is observed across all thresholds, with a peak around the threshold of 1000.
5.3. Corpus Exploration
5.3.1. Experiment Setup
In the traditional A/B testing setup, user traffic is randomly assigned to control and treatment groups. This user-diverted setup enables measuring user-side metric changes. In this use case, however, we want to measure the growth of the discoverable corpus. A user-diverted experiment is not an appropriate method for measuring corpus changes, because the control and treatment groups share the same corpus; any treatment effect on the corpus can leak into the control group. To correctly measure corpus changes, we employ a user-corpus co-diverted setup, where we partition the entire corpus (e.g., by hashing the item id) into multiple disjoint slices and expose each slice to a fraction of user traffic in one experiment arm.
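A sketch of the corpus partition, assuming items are hashed by id into NUM_SLICES disjoint slices; an experiment arm only serves items from its own slice, so corpus-level effects cannot leak across arms.

```python
import hashlib

NUM_SLICES = 10  # the fresh corpus is divided into 10 disjoint slices

def corpus_slice(item_id: str) -> int:
    """Deterministically map an item id to one of the disjoint corpus slices."""
    digest = hashlib.sha256(item_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SLICES

def eligible(item_id: str, arm_slice: int) -> bool:
    """An experiment arm may only recommend items from its assigned slice."""
    return corpus_slice(item_id) == arm_slice
```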
To broadly explore the long-tail portion of our corpus, we curate a large corpus of fresh videos, i.e., videos uploaded within the past X days. We divide the entire corpus into 10 slices and explore one slice in each experiment arm, each of which serves a fraction of user traffic. The Type-II experiment shares many configurations with Type-I, such as the two-tower model and the number of clusters, except that the exploration corpus is much larger and the online agent runs only in exploration mode.
5.3.2. Experiment Results
We compare the number of daily unique videos exceeding different impression-threshold values to measure the discoverable corpus size across the control and treatment groups, each of which contains 1/10 of the entire fresh corpus. Fig. 7 shows the relative corpus size improvement across multiple impression thresholds. Note that one could naively boost the impressions of more unique videos regardless of their performance, but overly showing low-quality videos to users can cause a very negative user experience. Overall, we observed only a -0.05% topline user engagement loss together with the corpus size gains in Fig. 7. Compared to Type-I, the cost of exploration here is much lower.
This is expected: in one exploration slot, Type-II has a slightly smaller exploration corpus than Type-I, but its exploration user traffic is significantly larger. Intuitively, this means that in Type-II the bandits can discover high-quality items more quickly.
6. Conclusion
In this paper, we introduced Online Matching, a large-scale real-time bandit system for item exploration and recommendation. To scale up to serving billions of users and millions of items, we proposed an algorithm-system co-design that combines offline and online learning, introduces a novel Diagonal LinUCB algorithm, and devises a high-throughput online agent system. Empirically, we applied Online Matching to two important use cases in YouTube and demonstrated its effectiveness for increasing fresh content adoption and growing discoverable corpus.
We hope our exercise of building a real-world bandit system can provide some valuable insights for future endeavors to address practical challenges. For example, we used a mixture of exploration and exploitation modes for our Fresh Content Discovery use case. In contrast to the classic bandit setup where minimizing regret in a single mode is the primary goal, more research needs to be done to study the mixture setup. In addition, real-world applications often have to deal with a dynamic exploration corpus, where items are being constantly added or removed in a streaming fashion. Designing principled bandit algorithms for dynamic arms remains an interesting open research problem.
References
- Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. Finite-Time Analysis of the Multiarmed Bandit Problem. Mach. Learn. 47, 2–3 (may 2002), 235–256. https://doi.org/10.1023/A:1013689704352
- Chandramouli et al. (2011) Badrish Chandramouli, Justin Levandoski, Ahmed Eldawy, and Mohamed F. Mokbel. 2011. StreamRec: A Real-Time Recommender System. In ACM SIGMOD International Conference on Management of Data (SIGMOD 2011). https://www.microsoft.com/en-us/research/publication/streamrec-a-real-time-recommender-system/
- Chaney et al. (2018) Allison J. B. Chaney, Brandon M. Stewart, and Barbara E. Engelhardt. 2018. How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility. In Proceedings of the 12th ACM Conference on Recommender Systems (Vancouver, British Columbia, Canada) (RecSys ’18). Association for Computing Machinery, New York, NY, USA, 224–232. https://doi.org/10.1145/3240323.3240370
- Chang et al. (2006) Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2006. Bigtable: A Distributed Storage System for Structured Data. In 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 205–218.
- Chang et al. (2020) Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. Pre-training Tasks for Embedding-based Large-scale Retrieval. In ICLR 2020.
- Chapelle and Li (2011) Olivier Chapelle and Lihong Li. 2011. An Empirical Evaluation of Thompson Sampling. In Proceedings of the 24th International Conference on Neural Information Processing Systems (Granada, Spain) (NIPS’11). Curran Associates Inc., Red Hook, NY, USA, 2249–2257.
- Chen (2021) Minmin Chen. 2021. Exploration in Recommender Systems. In Proceedings of the 15th ACM Conference on Recommender Systems (Amsterdam, Netherlands) (RecSys ’21). Association for Computing Machinery, New York, NY, USA, 551–553. https://doi.org/10.1145/3460231.3474601
- Chen et al. (2022) Minmin Chen, Can Xu, Vince Gatto, Devanshu Jain, Aviral Kumar, and Ed Chi. 2022. Off-Policy Actor-Critic for Recommender Systems. In Proceedings of the 16th ACM Conference on Recommender Systems (Seattle, WA, USA) (RecSys ’22). Association for Computing Machinery, New York, NY, USA, 338–349. https://doi.org/10.1145/3523227.3546758
- Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems (DLRS 2016).
- Chu et al. (2011) Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. 2011. Contextual Bandits with Linear Payoff Functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 15), Geoffrey Gordon, David Dunson, and Miroslav Dudík (Eds.). PMLR, Fort Lauderdale, FL, USA, 208–214. https://proceedings.mlr.press/v15/chu11a.html
- Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. New York, NY, USA.
- Deshpande and Montanari (2012) Yash Deshpande and Andrea Montanari. 2012. Linear bandits in high dimension and recommendation systems. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). 1750–1754. https://doi.org/10.1109/Allerton.2012.6483433
- Gilotte et al. (2018) Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B Testing for Recommender Systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (Marina Del Rey, CA, USA) (WSDM ’18). Association for Computing Machinery, New York, NY, USA, 198–206. https://doi.org/10.1145/3159652.3159687
- Guo et al. (2020) Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. https://arxiv.org/abs/1908.10396
- Joachims et al. (2017) Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (Cambridge, United Kingdom) (WSDM ’17). Association for Computing Machinery, New York, NY, USA, 781–789. https://doi.org/10.1145/3018661.3018699
- Joulani et al. (2013) Pooria Joulani, Andras Gyorgy, and Csaba Szepesvari. 2013. Online Learning under Delayed Feedback. In Proceedings of the 30th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 28), Sanjoy Dasgupta and David McAllester (Eds.). PMLR, Atlanta, Georgia, USA, 1453–1461. https://proceedings.mlr.press/v28/joulani13.html
- Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web. ACM. https://doi.org/10.1145/1772690.1772758
- Liu et al. (2017) David C. Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C. Ma, Zhigang Zhong, Jenny Liu, and Yushi Jing. 2017. Related Pins at Pinterest: The Evolution of a Real-World Recommender System. In WWW 2017.
- Liu et al. (2022) Zhuoran Liu, Leqi Zou, Xuan Zou, Caihua Wang, Biao Zhang, Da Tang, Bolin Zhu, Yijie Zhu, Peng Wu, Ke Wang, and Youlong Cheng. 2022. Monolith: Real Time Recommendation System With Collisionless Embedding Table. arXiv:2209.07663 [cs.IR]
- Ma et al. (2020) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, and Ed H. Chi. 2020. Off-Policy Learning in Two-Stage Recommender Systems. In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW ’20). Association for Computing Machinery, New York, NY, USA, 463–473. https://doi.org/10.1145/3366423.3380130
- Mary et al. (2015) Jérémie Mary, Romaric Gaudel, and Philippe Preux. 2015. Bandits and Recommender Systems. In Machine Learning, Optimization, and Big Data, Panos Pardalos, Mario Pavone, Giovanni Maria Farinella, and Vincenzo Cutello (Eds.). Springer International Publishing, Cham, 325–336.
- Okura et al. (2017) Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-Based News Recommendation for Millions of Users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD ’17). Association for Computing Machinery, New York, NY, USA, 1933–1942.
- Song et al. (2022) Yu Song, Shuai Sun, Jianxun Lian, Hong Huang, Yu Li, Hai Jin, and Xing Xie. 2022. Show Me the Whole World: Towards Entire Item Space Exploration for Interactive Personalized Recommendations. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (Virtual Event, AZ, USA) (WSDM ’22). Association for Computing Machinery, New York, NY, USA, 947–956. https://doi.org/10.1145/3488560.3498459
- Swaminathan and Joachims (2015) Adith Swaminathan and Thorsten Joachims. 2015. The Self-Normalized Estimator for Counterfactual Learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (Montreal, Canada) (NIPS’15). MIT Press, Cambridge, MA, USA, 3231–3239.
- Thomas and Brunskill (2016) Philip S. Thomas and Emma Brunskill. 2016. Data-Efficient off-Policy Policy Evaluation for Reinforcement Learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 (New York, NY, USA) (ICML’16). JMLR.org, 2139–2148.
- Wang et al. (2017) Huazheng Wang, Qingyun Wu, and Hongning Wang. 2017. Factorization Bandits for Interactive Recommendation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI ’17).
- Yang et al. (2020) Ji Yang, Xinyang Yi, Derek Zhiyuan Cheng, Lichan Hong, Yang Li, Simon Xiaoming Wang, Taibai Xu, and Ed H. Chi. 2020. Mixed Negative Sampling for Learning Two-Tower Neural Networks in Recommendations. In Companion Proceedings of the Web Conference 2020 (Taipei, Taiwan) (WWW ’20). Association for Computing Machinery, New York, NY, USA, 441–447. https://doi.org/10.1145/3366424.3386195
- Yao et al. (2021) Tiansheng Yao, Xinyang Yi, Derek Zhiyuan Cheng, Felix Yu, Ting Chen, Aditya Menon, Lichan Hong, Ed H. Chi, Steve Tjoa, Jieqi (Jay) Kang, and Evan Ettinger. 2021. Self-Supervised Learning for Large-Scale Item Recommendations. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 4321–4330. https://doi.org/10.1145/3459637.3481952
- Yi et al. (2019) Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi. 2019. Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations. In RecSys 2019.
- Zhai et al. (2017) Andrew Zhai, Dmitry Kislyuk, Yushi Jing, Michael Feng, Eric Tzeng, Jeff Donahue, Yue Li Du, and Trevor Darrell. 2017. Visual Discovery at Pinterest. In WWW 2017.
- Zhao et al. (2019) Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. 2019. Recommending What Video to Watch next: A Multitask Ranking System. In RecSys 2019.