Online Learning for Recommendations at Grubhub
Abstract.
We propose a method to easily modify existing offline Recommender Systems to run online using Transfer Learning. Online Learning for Recommender Systems has two main advantages: quality and scale. Like many Machine Learning algorithms in production if not regularly retrained will suffer from Concept Drift. A policy that is updated frequently online can adapt to drift faster than a batch system. This is especially true for user-interaction systems like recommenders where the underlying distribution can shift drastically to follow user behaviour. As a platform grows rapidly like Grubhub, the cost of running batch training jobs becomes material. A shift from stateless batch learning offline to stateful incremental learning online can recover, for example, at Grubhub, up to a 45x cost savings and a +20% metrics increase. There are a few challenges to overcome with the transition to online stateful learning, namely convergence, non-stationary embeddings and off-policy evaluation, which we explore from our experiences running this system in production.
1. Introduction
Grubhub is an online network of localized marketplaces that connect diners with local restaurants for meal delivery or pickup. Grubhub runs a large-scale recommender system that surfaces relevant restaurants based on a user profile and real-time context.
Recommendations drive 80% of revenue at Grubhub, so the quality of its recommender systems has a material impact on the business. Additionally, Grubhub serves many millions of diners and restaurants and is growing fast, so scaling recommender systems efficiently, in both time and money, is equally important.
1.1. Data Drift
Machine Learning models perform best right after training. In production, models degrade quickly because of concept drift. In Recommender Systems this drift is often realized as item-catalog updates, shifts in user interests and preferences, or, most often, current events such as the recent pandemic or regional holidays like Cinco de Mayo.
To demonstrate this phenomenon, we launched two experiments against a no-retraining baseline: one with weekly retraining and one with daily retraining. Results are shown in Table 1 as the relative increase in Purchase Through Rate (PTR) over the no-retraining baseline.
Table 1. Relative PTR increase over the no-retraining baseline.

| Retraining Method | PTR Increase (%) |
| --- | --- |
| No Retraining (Baseline) | 0.0 |
| Weekly Retraining | +2.5 |
| Daily Retraining | +20.3 |
These results, in Table 1, suggest that updates at a daily cadence, if not more frequent, are required to use machine learning in production for our recommender use-case. The cause is drift. This realization, that drift affects business metrics, motivates and necessitates not just model retraining, but frequent model retraining, so that our system can rapidly adapt to changes in the data distribution.
1.2. Retraining
One practical approach to periodic retraining is to move a fixed-size sliding window over a range of dates as new data becomes available. For example, one may choose to train on n=4 days of data, every day. This is illustrated below in Figure 1a.
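As a minimal sketch of this stateless pattern (the `load_events` feature store below is a synthetic stand-in, and the linear learner is merely illustrative):

```python
from datetime import date, timedelta

import numpy as np
from sklearn.linear_model import SGDClassifier

WINDOW_DAYS = 4  # n: size of the sliding window, in days

# Synthetic stand-in for a real feature store: (features, labels) per day.
rng = np.random.default_rng(0)
def load_events(day: date):
    X = rng.normal(size=(100, 8))
    return X, (X.sum(axis=1) > 0).astype(int)

def retrain_stateless(today: date) -> SGDClassifier:
    """Rebuild the model from scratch over the last n days. Yesterday's
    model is discarded, so n-1 of the days loaded here were already
    processed in yesterday's run -- the redundancy we want to remove."""
    days = [load_events(today - timedelta(days=i)) for i in range(WINDOW_DAYS)]
    X = np.concatenate([d[0] for d in days])
    y = np.concatenate([d[1] for d in days])
    return SGDClassifier(loss="log_loss").fit(X, y)

model = retrain_stateless(date.today())
```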


In theory this may seem reasonable; in practice, because many organizations rent compute from the public cloud, it is expensive in both time and money, as days of data are redundantly recomputed every retraining session. If we could remove the redundant computations from our retraining routine, we could achieve a corresponding time and cost savings. For the example in Figure 1a, such a savings could be realized. Concretely, in early 2021 a p3.8xl on AWS cost $12/hour, and a representative retraining job at Grubhub takes about 3 hours to learn on 30 days of data. That comes out to approximately $1080/month, or 3x wasted compute. Adding state to our retraining system can solve this cost issue by removing redundant computation over data.
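For reference, the arithmetic behind that estimate:

```python
hourly_rate    = 12.0  # USD/hour, p3.8xl on-demand (early 2021)
hours_per_job  = 3     # one retraining pass over 30 days of data
runs_per_month = 30    # retraining every day

print(hourly_rate * hours_per_job * runs_per_month)  # 1080.0 USD/month
```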
2. Stateful Updates
Traditional supervised learning pipelines are stateless, meaning that no state is preserved between rounds of retraining. This is simpler from an ops perspective; however, it comes at the cost of computation over redundant data.
Out-of-core/incremental algorithms can be updated stochastically, meaning they can be updated without having all the data at once. Parametric models such as neural networks are one example; the family of decision tree algorithms, by contrast, is not (Domingos and Hulten, 2000). Most out-of-core models can be trivially migrated to be updated online. The key is to maintain state between training sessions, which practically means deserializing the model before each update and serializing it after. This is illustrated in Figure 1b, where the running example is adapted to be stateful: the model is updated incrementally, only on new data.
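A minimal sketch of that deserialize/update/serialize cycle, using scikit-learn's out-of-core `partial_fit` API (the checkpoint path is a hypothetical placeholder; any incrementally updatable model works the same way):

```python
import pickle
from pathlib import Path

import numpy as np
from sklearn.linear_model import SGDClassifier

STATE = Path("model_state.pkl")  # hypothetical checkpoint location

def daily_update(X_new: np.ndarray, y_new: np.ndarray) -> SGDClassifier:
    """Stateful retraining: restore yesterday's weights, take stochastic
    updates on today's data only, and checkpoint again. No historical
    day is ever recomputed."""
    if STATE.exists():
        model = pickle.loads(STATE.read_bytes())   # deserialize state
    else:
        model = SGDClassifier(loss="log_loss")     # cold start
    model.partial_fit(X_new, y_new, classes=np.array([0, 1]))
    STATE.write_bytes(pickle.dumps(model))         # serialize state
    return model
```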
To demonstrate this idea, we retrained two models daily, both on 80 days of data, where one was stateless and one stateful. At a minimum, our hypothesis was that the incremental model would recover the batch model. Figure 2 shows the incremental approach converges faster and to a higher maximum. In other words, the stateful approach does indeed recover the stateless approach, i.e., Figures 2a and 2b are identical as far as cross-validation is concerned.


3. Off-Policy Evaluation & Pre-training
An analogy can be drawn to the Computer Vision field, where it is not uncommon to take a pre-trained ImageNet model and then fine-tune it on a custom task/domain. Through this lens, retraining for drift can be viewed as domain adaptation via transfer learning. In our case, we train offline until convergence and then fine-tune online to rapidly adapt to drift.
In the online setting, the notions of hyper-parameter tuning, cross validation, and even convergence become blurred. However, as mentioned above, we are only fine-tuning online, not learning from scratch online. Log data can be leveraged for pre-training, which allows us to back-test the model offline using a simulator that walks through stochastic updates. Once the model has converged, we release it for online updates.
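A minimal sketch of such a simulator: replay logged batches in chronological order in a test-then-train loop (progressive validation). Here `logged_batches` and `metric` are hypothetical stand-ins for a real log reader and evaluation metric, and `model` is any learner exposing `predict` and `partial_fit`, like the one above:

```python
def backtest(model, logged_batches, metric):
    """Replay log data in time order: score each batch *before* the
    model learns from it, so every prediction is made on unseen data
    and the running score approximates online performance."""
    scores = []
    for features, labels in logged_batches:
        scores.append(metric(labels, model.predict(features)))  # test...
        model.partial_fit(features, labels)                     # ...then train
    return sum(scores) / len(scores)
```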
4. Embeddings in Online Learning
Vocabularies that map categorical features to an integer index in an embedding table require advance knowledge of all possible categories. This is infeasible in the e-commerce/product setting, where products and restaurants come and go in real time. One technique for mapping non-stationary categories to an integer index is hashing.
A hash function can map any given input into one of k buckets at the expense of collisions. The challenge is to tune k with respect to an acceptable collision rate. Double Hashing (Zhang et al., 2020) introduces n = 2 independent hash functions, which reduces the collision rate exponentially with only twice the memory consumption, because a collision can only occur when both hashes collide.
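A minimal sketch of the idea (a simplification, not necessarily the exact formulation in Zhang et al.): keep two embedding tables addressed by two independently salted hashes and sum the selected rows, so two ids share a representation only if both hashes collide. The bucket count and dimension here are illustrative:

```python
import hashlib

import numpy as np

K, DIM = 1_000, 16  # illustrative; production uses millions of buckets

rng = np.random.default_rng(0)
table_a = rng.normal(size=(K, DIM)).astype(np.float32)
table_b = rng.normal(size=(K, DIM)).astype(np.float32)

def bucket(key: str, salt: str) -> int:
    """Distinct salts give (approximately) independent hash functions."""
    return int(hashlib.md5((salt + key).encode()).hexdigest(), 16) % K

def embed(restaurant_id: str) -> np.ndarray:
    """Sum of the two rows; unique unless BOTH hashes collide."""
    return table_a[bucket(restaurant_id, "a")] + table_b[bucket(restaurant_id, "b")]
```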
In order to quantify this trade-off, we map the world-wide restaurant catalog to a range of sequential integers (buckets) using a hash function. We increase the number of buckets from 500k to 4 million and record the respective collision rates below in Figure 3.
Figure 3 shows the collision rate converging to about 7% at 3M buckets. Accordingly, we can be confident that with this encoding scheme we can represent most restaurants in the world and can adapt to new restaurants online, without full retraining. One sacrifice we make in using hashing is that there is no way to compute the inverse transform, which can be a problem when trying to introspect which features are most important to a model.
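The measurement is easy to reproduce in miniature; below is a sketch with a simulated catalog (the real experiment used Grubhub's world-wide restaurant catalog):

```python
import hashlib

def collision_rate(ids, buckets: int) -> float:
    """Fraction of ids that hash into an already-occupied bucket."""
    seen, collisions = set(), 0
    for i in ids:
        b = int(hashlib.md5(i.encode()).hexdigest(), 16) % buckets
        if b in seen:
            collisions += 1
        seen.add(b)
    return collisions / len(ids)

catalog = [f"restaurant_{i}" for i in range(300_000)]  # simulated ids
for k in (500_000, 1_000_000, 2_000_000, 3_000_000, 4_000_000):
    print(f"{k:>9} buckets: {collision_rate(catalog, k):.3%}")
```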

5. Results

Figure 4 shows the A/B test results for the online migration: a +20% increase over the baseline PTR and, additionally, a 45x cost decrease. We attribute these results to faster drift response and decreased cloud usage, respectively. Overall, this amounts to a realization of hundreds of thousands of dollars annually.
We also introduced a pre-training and fine-tuning technique as a bridge between offline and online learning. Traditionally, the technological gap between the two paradigms has been enormous, effectively limiting participation to the largest and best-resourced organizations (Huyen, [n.d.]). This work substantially shifts the trade-off between cost and rapid results by removing some of the traditional constraints that limit organizations to the offline paradigm.
References
- Domingos and Hulten (2000) P. Domingos and G. Hulten. 2000. Mining High-Speed Data Streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Boston, MA). 71–80.
- Guo et al. (2020) Dalin Guo, Sofia Ira Ktena, Pranay Kumar Myana, Ferenc Huszar, Wenzhe Shi, Alykhan Tejani, Michael Kneier, and Sourav Das. 2020. Deep Bayesian Bandits: Exploring in Online Personalized Recommendations. In Fourteenth ACM Conference on Recommender Systems (Virtual Event, Brazil) (RecSys ’20). Association for Computing Machinery, New York, NY, USA, 456–461. https://doi.org/10.1145/3383313.3412214
- Huyen ([n.d.]) Chip Huyen. [n.d.]. Machine learning is going real-time. https://huyenchip.com/2020/12/27/real-time-machine-learning.html
- Zhang et al. (2020) Caojin Zhang, Yicun Liu, Yuanpu Xie, Sofia Ira Ktena, Alykhan Tejani, Akshay Gupta, Pranay Kumar Myana, Deepak Dilipkumar, Suvadip Paul, Ikuhiro Ihara, Prasang Upadhyaya, Ferenc Huszar, and Wenzhe Shi. 2020. Model Size Reduction Using Frequency Based Double Hashing for Recommender Systems. In Fourteenth ACM Conference on Recommender Systems (Virtual Event, Brazil) (RecSys ’20). Association for Computing Machinery, New York, NY, USA, 521–526. https://doi.org/10.1145/3383313.3412227