
An Enhanced Batch Query Architecture in Real-time Recommendation

Qiang Zhang (0009-0005-7590-3567), Xi’an Jiaotong University, Xi’an, Shaanxi, China, [email protected]; Zhipeng Teng (0009-0000-5175-6762), Bilibili, Shanghai, China, [email protected]; Disheng Wu (0009-0008-8278-0093), Bilibili, Shanghai, China, [email protected]; and Jiayin Wang (0000-0002-3862-6557), Xi’an Jiaotong University, Xi’an, Shaanxi, China, [email protected]
(2024)
Abstract.

In industrial recommendation systems on websites and apps, it is essential to recall and predict top-n results relevant to user interests from a content pool of billions within milliseconds. To cope with continuous data growth and improve real-time recommendation performance, we have designed and implemented a high-performance batch query architecture for real-time recommendation systems. Our contributions include optimizing hash structures with a cacheline-aware probing method to enhance coalesced hashing, as well as the implementation of a hybrid-storage key-value service built upon it. Our experiments indicate that this approach significantly surpasses conventional hash tables in batch query throughput, achieving up to 90% of the query throughput of random memory access when incorporating parallel optimization. The support for NVMe, integrating two-tier storage for hot and cold data, notably reduces resource consumption. Additionally, the system facilitates dynamic updates and automated sharding of attribute and feature embedding tables, and introduces innovative protocols for consistency in batch queries, thereby enhancing the effectiveness of real-time incremental learning updates. This architecture has been deployed in the bilibili recommendation system, a video content community with hundreds of millions of users, for over a year, supporting a 10x increase in model computation with minimal resource growth and improving outcomes while preserving the system’s real-time performance.

Key-value storage, Recommender system, Hash-table
journalyear: 2024; copyright: acmlicensed; conference: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, October 21–25, 2024, Boise, ID, USA; booktitle: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM ’24), October 21–25, 2024, Boise, ID, USA; doi: 10.1145/3627673.3680034; isbn: 979-8-4007-0436-9/24/10; ccs: Computer systems organization / Real-time system architecture; ccs: Information systems / Recommender systems; ccs: Information systems / Distributed storage

1. Introduction

In recommendation systems, the rapid computation of item relevance to users, ideally within milliseconds, is crucial for optimal performance, and storage systems capable of managing vast data volumes and facilitating high-speed batch queries are among the most vital components, as illustrated in Figure 1. Whether in the recall or ranking phases, or during offline model training (Li et al., 2014, 2013), more features and faster computation generally lead to better recommendation outcomes. Redis (redis, 2009) is noted for its straightforward architecture and robust efficiency, proving effective in supporting recommendation systems. However, in larger-scale industrial settings, developing storage engines that enhance rapid batch query capabilities for recommendations remains a critical research area.

Refer to caption
Figure 1. Key-Value Storage in Recommendation system

As deep learning becomes the mainstay in the realm of recommendation systems (Covington et al., 2016; Cheng et al., 2016; Gupta et al., 2020; Zhao et al., 2019; Davidson et al., 2010), the size of these models has escalated, now reaching terabyte scale (Lian et al., 2022). Deep learning-based recommendation systems require a vast array of features to accurately depict complex user behaviors, attributes, and preferences. Bilibili primarily distributes video content, typically around 10 minutes in length. Therefore, compared to e-commerce recommendations, features such as key frame image embeddings, video sentiment, and the author’s attitudes and opinions are crucial for content delivery.

In industrial recommendation systems with hundreds of millions of users and billions of items, each record may comprise hundreds of simple floating-point features, potentially leading to embedding table sizes in the terabyte range. Such extensive feature storage and embedding tables cannot be housed in the memory of a single machine. Consequently, optimizing the batch query service of a recommendation system requires addressing three key challenges.

  • High Performance. A storage engine with high-speed batch reading capabilities serves as the carrier for a vast array of feature values and embedding tables, enhancing the throughput during model training and prediction.

  • Scalability. A distributed query service cluster must maintain high throughput even as the system grows, both in the number of features and in the volume of items processed.

  • Flexibility and Consistency. Recent, session-segmented user behaviors are more effective in predicting future actions (Tuan and Phuong, 2017; Lu et al., 2018; Twardowski, 2016; Shen et al., 2022). Rapid updates ensure that real-time recommendation systems remain effective by reflecting users’ current interests, and data inconsistencies during updates must be avoided to prevent losses in recommendation performance.

In this paper, we design and implement a distributed architecture for batch queries that significantly enhances recommendation system performance. Our contributions are as follows. First, we made several design improvements to Coalesced Hashing, resulting in a new hash table structure called NeighborHash, which minimizes the number of cacheline accesses per query to achieve higher query throughput within the constraints of limited memory bandwidth. A comprehensive performance evaluation shows that, in batch query scenarios, NeighborHash achieves up to a 170% performance improvement compared to the most optimized hash tables, and validation experiments confirm the effectiveness of our optimizations. An open-source version of NeighborHash is available at https://github.com/slow-steppers/NeighborHash. Second, building on NeighborHash, we introduce an SSD-based distributed key-value storage service that supports horizontal scaling of feature and model sizes while maintaining low resource consumption. Focusing in particular on the multi-version states of data during real-time learning updates, we optimized the update and query protocols to ensure strong data consistency during batch queries, thereby maintaining stable recommendation performance. Finally, we deployed this architecture in the bilibili recommendation system and achieved significant benefits.

2. System Design

The design of the serving architecture draws significant inspiration from Google Mesa (Gupta et al., 2014). The system comprises the Batch Query Subsystem and the Update Subsystem, outlined in Figure 2. As is typical in recommendation architectures, the Update Subsystem manages data updates, encompassing user behavior properties and parameter updates from recommendation model training. The Query Subsystem caters to extensive data volumes and high-concurrency requirements through a distributed layout with multiple shards and replicas. This setup is designed to handle a significant request volume (peaking at approximately 100k qps) and enormous feature tables that cannot be deployed on a single machine. At its foundation lies our hash table, NeighborHash, which is optimized for batch querying in recommendation systems and described below.

Refer to caption
Figure 2. Batch Query Architecture for Multi-Data Lookup

2.1. NeighborHash

2.1.1. Hashtable structure

NeighborHash, based on Coalesced Hashing, utilizes a flat array to store all buckets, with each bucket containing a key, a value, and the index of the next bucket in the chain (or a null value if the chain ends). To minimize the number of cachelines involved in queries, we adopted the following designs:

Lodger relocation. In traditional Coalesced hashing, hash table buckets are numbered 1 to M’. The first M buckets serve as the hash function’s address region, while the remaining M’–M buckets are reserved exclusively for colliding records, known as the cellar. When the cellar fills up, subsequent colliders must occupy empty buckets in the address region, potentially leading to further collisions with later-inserted records. Hence, the PSL (probing sequence length) is sensitive to the size of the cellar region. To minimize the PSL, we eliminate the cellar region allocation; instead, we place conflicting elements in the address region and adjust them dynamically. The method is as follows: for a record x in bucket i, if Hash(x.key) is i, x is termed a host record; otherwise, it is termed a lodger record. When inserting a new record y with Hash(y.key) resulting in bucket j, if bucket j is occupied by a lodger, a vacant position is sought to relocate the lodger before record y is stored in bucket j. If bucket j is occupied by a host, a vacant position is sought to store record y, which is then appended to the end of the chain.

Lodger relocation can be seen as a dynamic cellar strategy, ensuring that the PSL is minimized (the same as with separate chaining). The drawback is that the insertion process becomes more complex; since query requests dominate the workload of recommendation systems, we believe this trade-off is worthwhile. The process is illustrated in Figure 3, and a simplified sketch of the insertion path is given after the figure.

Refer to caption
Figure 3. Insertion of NeighborHash with Lodger Relocation
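To make the insertion path concrete, the following simplified C++ sketch illustrates lodger relocation under stated assumptions: a power-of-two bucket count, 64-bit keys and values, chain offsets kept in a separate next array, duplicate keys and resizing ignored, and a plain bidirectional scan standing in for the cacheline-aware free-bucket search described next. It is an illustration of the idea, not the open-source NeighborHash implementation.

    #include <cstdint>
    #include <functional>
    #include <vector>

    // Illustrative sketch of lodger relocation (not the open-source NeighborHash).
    // Chain offsets live in a separate 'next' array here; the real structure inlines
    // them into the value word, and free buckets are searched cacheline-first.
    struct Table {
      std::vector<uint64_t> key, val;
      std::vector<bool> used;
      std::vector<int32_t> next;  // index of the next bucket in the chain, -1 = end
      explicit Table(size_t pow2_cap)  // capacity must be a power of two
          : key(pow2_cap), val(pow2_cap), used(pow2_cap), next(pow2_cap, -1) {}

      size_t home(uint64_t k) const { return std::hash<uint64_t>{}(k) & (key.size() - 1); }

      // Stand-in for cacheline-aware neighbor probing: plain bidirectional scan.
      size_t find_free_near(size_t i) const {
        for (size_t d = 1; d < key.size(); ++d) {
          size_t r = (i + d) & (key.size() - 1), l = (i - d) & (key.size() - 1);
          if (!used[r]) return r;
          if (!used[l]) return l;
        }
        return i;  // table full; resizing is omitted in this sketch
      }

      void insert(uint64_t k, uint64_t v) {
        size_t j = home(k);
        if (!used[j]) { place(j, k, v); return; }          // empty home bucket
        if (home(key[j]) != j) {                           // bucket j holds a lodger
          size_t spare = find_free_near(j);
          place(spare, key[j], val[j]);                    // move the lodger away
          next[spare] = next[j];
          relink(home(key[j]), j, spare);                  // fix its predecessor's link
          place(j, k, v);                                  // the new record is the host
          return;
        }
        size_t tail = j;                                   // bucket j holds a host:
        while (next[tail] != -1) tail = next[tail];        // append to the end of its chain
        size_t spare = find_free_near(tail);
        place(spare, k, v);
        next[tail] = static_cast<int32_t>(spare);
      }

     private:
      void place(size_t i, uint64_t k, uint64_t v) {
        key[i] = k; val[i] = v; used[i] = true; next[i] = -1;
      }
      void relink(size_t head, size_t old_pos, size_t new_pos) {
        size_t p = head;                                   // walk the chain from the host
        while (next[p] != static_cast<int32_t>(old_pos)) p = next[p];
        next[p] = static_cast<int32_t>(new_pos);
      }
    };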

Cacheline-aware neighbor probing. NeighborHash strives to place buckets on the same chain within the same cache line to minimize memory bandwidth usage per query. During the search for an available bucket, the algorithm first examines buckets within the same cacheline. Unlike Linear Probing, which probes in only one direction, NeighborProbing probes bidirectionally within the cacheline. If no free bucket is found there, the search expands bidirectionally to identify the available bucket nearest to the head, as illustrated in Figure 4. This approach minimizes cross-cacheline probes during queries. Experimental results show that, on a random dataset of 100 million entries with a load factor of 75%, the average number of cacheline accesses per query is approximately 1.12, compared with 1.47 for Linear Probing. If no available bucket exists within the same cacheline, relocating to nearby cachelines may mitigate TLB cache misses and leads to a smaller relative offset to the previous record in the chain, enabling offset compression.

Refer to caption
Figure 4. Find available node in neighbor cacheline
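The free-bucket search described above can be made explicit with the following sketch. It assumes 16-byte buckets (four per 64-byte cacheline) and a cacheline-aligned bucket array; it is our own illustrative rendering of the probing order, not the exact open-source routine.

    #include <cstddef>
    #include <vector>

    // Search order for an available bucket: first the buckets sharing i's cacheline,
    // then expand bidirectionally. Assumes 16-byte buckets (4 per 64-byte line) and a
    // 64-byte-aligned bucket array; 'used' mirrors the occupancy of the bucket array.
    constexpr size_t kBucketsPerLine = 64 / 16;

    size_t FindFreeBucketNear(const std::vector<bool>& used, size_t i) {
      const size_t n = used.size();
      const size_t line_begin = i & ~(kBucketsPerLine - 1);   // first bucket of i's line
      for (size_t b = line_begin; b < line_begin + kBucketsPerLine && b < n; ++b)
        if (!used[b]) return b;                               // stay inside the cacheline
      for (size_t d = 1; d < n; ++d) {                        // otherwise expand both ways
        size_t r = (i + d) % n, l = (i + n - d) % n;
        if (!used[r]) return r;
        if (!used[l]) return l;
      }
      return i;                                               // table full (ignored here)
    }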

Inline chaining. We utilize the high 12 bits of NeighborHash’s value field to store relative offsets, enabling the conflict chain to be embedded directly in the table. The representable range is -2047 to 2048. Since the conflict allocation strategy primarily seeks a location near the root node, a 12-bit relative offset is more than sufficient; in practice it can sustain a load factor of over 80%. Consequently, NeighborHash stores the actual value in the remaining 52 bits, which in recommendation systems are typically used to hold pointers or offset values, as illustrated in Figure 5.

Refer to caption
Figure 5. Data Structure
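A minimal sketch of this bit layout is shown below; the exact encoding of the offset (biased into 12 bits here) and of the end-of-chain marker are assumptions made for illustration.

    #include <cstdint>

    // Value word layout (illustrative): high 12 bits hold the signed relative offset to
    // the next bucket in the chain (range -2047..2048, biased here into 0..4095), and
    // the low 52 bits hold the payload, e.g. a pointer or an offset into a value file.
    constexpr uint64_t kPayloadBits = 52;
    constexpr uint64_t kPayloadMask = (uint64_t{1} << kPayloadBits) - 1;

    uint64_t PackValue(uint64_t payload, int32_t next_offset /* -2047..2048 */) {
      uint64_t off12 = static_cast<uint64_t>(next_offset + 2047) & 0xFFF;
      return (off12 << kPayloadBits) | (payload & kPayloadMask);
    }

    uint64_t Payload(uint64_t v)    { return v & kPayloadMask; }
    int32_t  NextOffset(uint64_t v) { return static_cast<int32_t>(v >> kPayloadBits) - 2047; }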

Lookup Acceleration. Hash-table lookup is a predominant operation in recommender systems. Vectorization (Polychroniou et al., 2015) and AMAC (Kocberber et al., 2015) have been extensively used to optimize hash-table batch lookups. We implemented inter-query vectorized lookup on NeighborHash using the IMV (Interleaved Multi-Vectorizing) (Fang et al., 2019) method; efficient SIMD compress and expand operations can be employed for the append and fill steps. Vectorization brings significant throughput improvements on small datasets, but on larger datasets memory latency becomes the dominant factor. We therefore also implemented the AMAC method on NeighborHash and conducted experiments, which show that on large datasets AMAC nearly doubles NeighborHash’s throughput compared to its original performance.
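As a rough illustration of the AMAC idea, the sketch below keeps a small group of lookups in flight, prefetching the next bucket of each and switching to another lookup while the cache miss is outstanding. The Table interface (HomeBucket, BucketAddr returning a const char*) and the single-step Probe helper are hypothetical stand-ins, and the real implementation additionally combines this with IMV-style SIMD probing.

    #include <cstdint>
    #include <vector>
    #include <xmmintrin.h>  // _mm_prefetch (x86)

    // Simplified AMAC-style batch lookup: keep group_size lookups in flight and
    // prefetch each one's next bucket before switching to another lookup.
    // Probe(tbl, pos, key, &value, &next_pos) is a hypothetical single-step probe
    // returning true once the key is resolved (found or definitively absent).
    template <typename Table>
    void BatchLookup(const Table& tbl, const std::vector<uint64_t>& keys,
                     std::vector<uint64_t>* values, int group_size = 8) {
      struct State { size_t query; size_t pos; bool active; };
      std::vector<State> st(group_size, State{0, 0, false});
      size_t next_query = 0, finished = 0;
      values->assign(keys.size(), 0);
      while (finished < keys.size()) {
        for (auto& s : st) {
          if (!s.active) {                              // refill this slot with a new query
            if (next_query >= keys.size()) continue;
            s = State{next_query, tbl.HomeBucket(keys[next_query]), true};
            ++next_query;
            _mm_prefetch(tbl.BucketAddr(s.pos), _MM_HINT_T0);
            continue;                                   // give the prefetch time to land
          }
          uint64_t v; size_t next_pos;
          if (Probe(tbl, s.pos, keys[s.query], &v, &next_pos)) {
            (*values)[s.query] = v;                     // resolved: retire and free the slot
            s.active = false;
            ++finished;
          } else {
            s.pos = next_pos;                           // follow the chain, prefetch next hop
            _mm_prefetch(tbl.BucketAddr(s.pos), _MM_HINT_T0);
          }
        }
      }
    }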

2.1.2. NVMe storage

In industrial scenarios with billions of users and items, there exists a vast amount of cold data. Compared to storing all data in memory, it is cost-effective to store cold data on NVMe and hot data in memory (Wan et al., 2021). To maintain request-response latency, we store the index (key to value offset) in memory using the NeighborHash structure and store the value bytes in a two-tier manner. The 52-bit payload in NeighborHash includes 1 bit indicating whether the data resides in memory or on NVMe: the payload of hot data points to memory containing the LRU metadata, while that of cold data points to a file-system offset. Eviction is performed by an asynchronous thread that scans the metadata of hot data, as illustrated in Figure 6. Keeping both hot and cold keys in memory reduces the concurrent read/write overhead on the hash-table associated with traditional LRU methods; additionally, a cache miss typically involves only one or a few NVMe I/O operations. We believe the additional memory overhead is well justified by the savings in CPU usage and the improvement in throughput. Due to space limitations, we do not provide a detailed analysis here.

Refer to caption
Figure 6. NeighborHash structure with NVMe support
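A sketch of how the 52-bit payload can distinguish the two tiers is shown below; the choice of bit position and the hot-entry metadata layout are assumptions made for illustration.

    #include <cstdint>

    // Hot/cold payload split (illustrative): one bit of the 52-bit payload marks whether
    // the value lives in memory (pointer to an entry carrying LRU metadata) or on NVMe
    // (a file offset). Exact bit positions and metadata fields are assumptions here.
    constexpr uint64_t kInMemoryBit = uint64_t{1} << 51;   // top bit of the 52-bit payload

    struct HotEntry {            // in-memory value plus metadata scanned by the evictor
      uint64_t last_access_ns;   // LRU-style metadata used by the asynchronous eviction thread
      uint32_t size;
      const char* bytes;
    };

    bool IsInMemory(uint64_t payload)        { return (payload & kInMemoryBit) != 0; }
    HotEntry* AsHotEntry(uint64_t payload)   { return reinterpret_cast<HotEntry*>(payload & ~kInMemoryBit); }
    uint64_t  AsFileOffset(uint64_t payload) { return payload & ~kInMemoryBit; }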

2.2. Batch Query Subsystem

The Batch Query Subsystem is architected as a multi-sharded, multi-replica framework. Given the unique update cycles, query loads and geographic distribution demands of each table in the business domain, the subsystem is organized based on individual tables. Each table maps to a distinct query service, with specific shard and replica information maintained by a metadata service. This underlying infrastructure is supported by etcd (etcd io, 2017) for robustness.

2.2.1. Automatic Sharding

Sharding data during service has two primary advantages:

  (1) It avoids the challenges linked to overly large service instances, including lengthy start-up times and high data retrieval bandwidth, enhancing system stability. Smaller shards also facilitate quicker migration and recovery.

  (2) Parallel queries across multiple shards reduce the batch query load on individual instances. Coupled with asynchronous processing for immediate tasks like click-through rate estimation, this approach notably decreases user request latency.

The system automatically manages shard creation based on set configuration parameters, ensuring no shard exceeds its designated size. Should the table grow or shrink during updates, re-sharding occurs during the next update cycle, with updated metadata synchronized across the live cluster.
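The following fragment sketches one way such size-driven shard planning and key routing might look. The parameter names and the hash-based routing are illustrative assumptions; in the deployed system the resulting shard map is published through the metadata service.

    #include <cstdint>
    #include <functional>

    // Pick the smallest shard count that keeps every shard under the configured limit,
    // then route keys by hash. Names and routing scheme are illustrative only.
    uint32_t PlanShardCount(uint64_t table_bytes, uint64_t max_shard_bytes) {
      return static_cast<uint32_t>((table_bytes + max_shard_bytes - 1) / max_shard_bytes);
    }

    uint32_t ShardOf(uint64_t key, uint32_t num_shards) {
      return static_cast<uint32_t>(std::hash<uint64_t>{}(key) % num_shards);
    }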

2.2.2. Rolling Update and Query Consistency

Refer to caption
Figure 7. Data Online Updating and Query Interaction

Real-time recommendation systems in production cannot be taken down for updates and maintenance, as they need to be constantly operational, and updating data typically requires additional resources. A straightforward approach deploys a backup query service for each table to handle all traffic and switches to it once it is ready, necessitating double the resources. We instead implement a rolling update approach, updating one replica at a time and requiring only an additional 1/n of the resources. However, this introduces the issue of consistency across different versions over shards while a table is being updated. For some model embedding tables, only values from the same training batch are comparable, which we refer to as strong-version data. For ranking, it is crucial that features for the same batch of items come from the same version to ensure comparability.

Typically, shard and data version information stored on servers is registered with a naming service and updated regularly. Clients consult this naming service for details and request version-specific data from the corresponding servers. However, in industrial-scale distributed systems with thousands of client and server instances, network delays and packet losses can prolong both the updating of server metadata in the naming service and its retrieval by clients. This delay prevents servers from providing service immediately after they are ready and requires explicit version-consistency confirmation between clients and servers before proceeding with the next batch of rolling updates. This significantly extends the total update time, especially for data services that require frequent updates, where clients may not yet have detected the previous version before the server needs to load new data, leading to request failures due to version inconsistencies. To address this issue, we use the naming service only for updates to the server instance endpoints (IP:port), such as additions or deletions, while shard and version metadata are exchanged directly between clients and servers through the query protocol, ensuring strong consistency of shard management on the client side. The data update mechanism and access consistency scheme are illustrated in Figure 7 and Figure 8, respectively. Data compression and asynchronous pipeline processing are also encapsulated within the client library, which additionally provides direct access to embedding tables and feature storage, minimizing network bandwidth overhead compared to a proxy-based approach.

Refer to caption
Figure 8. Query with version control during updating
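The client-side check below sketches how strong-version consistency can be enforced for a batch query fanned out over shards: every shard reply must carry the same data version, otherwise the batch is retried (for example, pinned to a specific version). The message fields and names are illustrative, not the production protocol.

    #include <cstdint>
    #include <string>
    #include <vector>

    // One reply per shard of a fanned-out batch query (illustrative message shape).
    struct ShardReply {
      uint32_t shard_id;
      uint64_t data_version;                 // version of the table snapshot served
      std::vector<std::string> values;
    };

    // Returns true and the common version if all shard replies served the same
    // snapshot; otherwise the caller retries or re-pins the batch to one version.
    bool VersionsConsistent(const std::vector<ShardReply>& replies, uint64_t* version_out) {
      if (replies.empty()) return false;
      uint64_t v = replies.front().data_version;
      for (const auto& r : replies)
        if (r.data_version != v) return false;   // mixed versions detected
      *version_out = v;
      return true;
    }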

3. Experiment and Result

In this section, we first provide a detailed evaluation and comparison of the hashtable optimizations, as these are the key factor in improving query throughput. The second part assesses the latency optimization and resource efficiency of the new query cluster through online comparisons in the bilibili recommendation system, along with the impact of data consistency on performance.

3.1. Evaluation of NeighborHash

Firstly, we designed comparative experiments between NeighborHash and existing scalar hashmaps. The focus was on conducting ablation experiments and analysis between NeighborHash and Coalesced Hashing. Subsequently, we conducted comparisons and analyses of different vectorization methods and implementations.

3.1.1. Experiments Setup

We conducted all experiments on a Linux server configured as outlined in the README of the GitHub repository; the live recommendation system is deployed on machines of the same specification. In the context of recommendation systems, the hash-table workload we focus on features a skewed data distribution, large dataset sizes, a high load factor, a very high read/write skew, and a high successful-lookup ratio. Certain configurations can influence the evaluation results, as detailed below.

  • Hardware prefetch. In certain scenarios, to precisely analyze hash-table efficiency, we disabled hardware prefetching.

  • Hash function. The hash function employed across all tests was absl::Hash<uint64_t>.

  • Size and Load Factor. As mentioned earlier, we tested hash-tables with both keys and values being 64 bits, so each element occupies 16 bytes of memory. We chose datasets of sizes 256KB (16K), 2MB (128K), 16MB (1M), 256MB (16M), 2GB (128M), and 16GB (1G) for testing. The hash-table’s load factor is set to 80%, a practical value.

Data distribution significantly impacts the test results of the Hash-table. Due to space limitations and the typically high memory load in real-world scenarios, our tests and analyses were conducted on uniformly distributed datasets. We also tested skewed data, and the conclusions were consistent with those from the uniformly distributed data in terms of trends and qualitative analysis.

Metrics. We measure hash-table throughput in MOPS (Million Operations Per Second) and monitor memory bandwidth consumption in BPL (Bytes Per Lookup). On Intel platforms, we utilize PCM to track bandwidth and calculate BPL by dividing bandwidth (BPS - Bytes Per Second) by the MOPS. Our evaluation also considers the LLC (Last Level Cache) cache-miss rate. We analyze probing performance using the Average Probing Cache Lines (APCL) metric, indicating the average number of cache lines accessed for each successfully found key. For exploring the upper limit of hash-table lookup operations, we created a hash-table with zero collisions, albeit without guaranteed correctness. Each lookup involved hash calculation and random reading, deemed indispensable minimal operations in hash-table lookup. This imperfect hash-table version is referred to as RA (Random Access). Subsequent experiments showed that NeighborHash achieved 90% of RA’s throughput with parallel optimization.
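For clarity, BPL is derived from the measured bandwidth and throughput as

    BPL = BPS / (MOPS × 10^6).

As a worked example using the 256MB row of Table 2, 37 MOPS at 78.7 BPL corresponds to roughly 37 × 10^6 × 78.7 ≈ 2.9 GB/s of consumed memory bandwidth.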

3.1.2. Scalar Hash-Table Comparison

We evaluate the scalar hash-tables, including Linear Probing, Coalesced hashing, ska::bytell_hash_map (Skarupke, 2018), and NeighborHash. We also include the previously mentioned random access in the comparison to understand the absolute level of query performance. The results are shown in Table 1. Due to the very high throughput of RA on datasets smaller than 256MB, those entries have been omitted from the table. As the dataset size increases, continuous cache misses lead to a sustained decline in throughput. In comparison to other implementations, NeighborHash consistently exhibits performance improvements across all scenarios. At a dataset size of 16GB, NeighborHash demonstrates over 50% higher query throughput than the other implementations. In this test, the successful query rate (SQR) was 90%, consistent with mainstream online recommendation systems. We also conducted tests under a low hit rate (30%), and the results showed entirely consistent trends.

Hashtable              256KB   2MB   16MB   256MB   16GB
Linear probing             38    29     22      12     11
ska::bytell_hash_map       72    53     38      20     19
Coalesced hashing          92    56     48      21     19
NeighborHash              116    74     66      37     36
Random Access               /     /      /       /     67
Table 1. Scalar Hash-table Lookup Performance (MOPS)

To analyze the impact of dataset size on hash-table performance, we conducted a detailed evaluation of NeighborHash, as presented in Table 2. For dataset sizes less than 2MB, most memory loads can be accommodated by the L1 and L2 cache. At this point, the MOPS is more influenced by the number of instructions and their cycles. Starting from a size of 16MB, the LLC miss rate begins to rise, reaching about 34% at 32MB. Simultaneously, MOPS starts to decline, while Bytes-per-lookup increases. As the LLC miss rate increases, Bytes-per-lookup exhibits a rapid growth after 32MB (L2 cache size) and converges at a dataset size of 2GB, indicating that MOPS is almost entirely dominated by memory latency.

Datasets   LLC-LD   LLC-MR   MOPS    BPL
256KB       0.004      25%    116   0.15
2MB            54    0.01%     74   0.15
16MB           82    3.58%     66   2.56
32MB           59   33.98%     48   27.9
256MB        46.6    90.9%     37   78.7
2GB          45.5    98.6%     37   81.4
16GB         44.3    99.2%     39   82.1
Table 2. NeighborHash, SQR=90%

Ablation Analysis. To comprehend the impact of the various design components of NeighborHash on the outcomes, we conducted an ablation analysis of the following three key designs:

  • Lodger relocation: employing only this strategy in Coalesced hashing yields search efficiency equivalent to Coalesced hashing with a perfect cellar, abbreviated as PerfectCellarHash.

  • Cacheline-aware neighbor probing: building upon PerfectCellarHash, this prioritizes searching for available buckets within and near the cacheline of the last node in the conflict chain. Relative offsets are stored in a separate offset array. This implementation is referred to as NeighborProbing.

  • Inline chaining: based on NeighborProbing, this encodes relative offsets into the high 12 bits of the value, completing the full implementation of NeighborHash.

We conducted experiments and analyses on different datasets using the three aforementioned implementations. Datasets larger than 2GB exhibited similar trends. As we are particularly interested in performance on larger datasets, Table 3 presents the results for the 16GB dataset.

Hashmap             MOPS   MOPS-gain   APCL
CoalescedHashing      19        1       1.72
PerfectCellarHash     23        1.21    1.48
NeighborProbing       30        1.30    1.34
NeighborHash          39        1.30    1.14
Table 3. size=16GB, SQR=90%, LF=0.8, Ablation Analysis

Experimental results indicate that the three key designs of NeighborHash contributed throughput gains of 20%, 30%, and 30%, respectively, in terms of Millions of Operations Per Second (MOPS). It is noteworthy that these three designs are not entirely independent: the joint action of lodger relocation and neighbor probing enables offset compression, which, combined with our specific usage scenario, allows the offset to be integrated into the value. The Average Probing Cache Lines (APCL) decreased from 1.72 to 1.14, saving memory bandwidth and achieving the goal of throughput improvement. We also evaluated the APCL of linear probing with lodger relocation, which was 1.24; it can be inferred that bidirectional probing contributes approximately 9% in memory bandwidth efficiency compared to unidirectional probing.

Integration with Optimization. We implemented inter-query vectorization on NeighborHash; the evaluation results are presented in Figure 9, which also includes the results of AMAC combined with SIMD for NeighborHash. For queries on smaller datasets, SIMD optimizations are most effective because most of the data can be held in the L2 cache. As the dataset size increases, AMAC becomes the better choice. In practical system optimization, selecting the appropriate version based on the dataset size yields the best results.

Refer to caption
Figure 9. NB with Vec and/or AMAC,SQR=90%

3.2. Benefits in Industrial Scenario

We deployed the batch query architecture proposed in this paper in the bilibili recommendation system. The significant improvements in computational and storage capabilities have facilitated optimizations across the recommendation models. Here, we present two key online comparison metrics: access latency and the impact of data consistency on effectiveness.

Latency. We compared our service with a RocksDB-based key-value query service, a commonly used solution in industrial recommendation that our system had previously used; the RocksDB baseline exhibited 3 to 6 times higher query latency than our service. In an experiment on a high-traffic storage table of item features (40M items, 1KB per item), with a peak online Key-Seek Per Second (KPS) of around 700k, we found that as the batch size of single-key queries increased, the performance of our query service remained stable without significant degradation, as illustrated in Table 4. CPU profiling with the perf tool showed that hash lookups in NeighborHash consumed only a small portion of the CPU, with about half of the CPU usage devoted to I/O operations such as payload packaging. Conversely, in the RocksDB implementation, with the in-memory table configured as a hashtable and 10GB of memory (the same as NeighborKV), approximately 30% of CPU utilization was spent on memory queries and retrieval. Consequently, as the batch size increased, the performance gap became more noticeable.

Key-value           Batch-size   AVG-latency (ms)
KV (RocksDB)                10               1.11
                           100              10.56
                           500              25.81
KV (NeighborHash)           10               1.05
                           100               1.78
                           500               3.31
Table 4. KV (RocksDB) vs KV (NeighborHash)

CTR Improvements with data consistency. In our online A/B experiments in the bilibili recommendation system, we compare click-through rate differences when enforcing consistent data-version constraints during rolling updates of data shard replicas. The comparison spans a full day, i.e., 24 hours, with results presented in Figure 10. It is evident that the shorter the update interval, the more pronounced the benefit of batch query data consistency, as more inconsistencies occur with frequent updates. We analyzed the scenario where multi-shard updates of the embedding table occur without a consistency protocol, leading to approximately 3% of results being inconsistent; such inconsistency means that a single estimation may use multiple versions of the embedding weights. A detailed analysis of the inconsistent cases reveals that discrepancies among correlated features significantly impair the estimation results. We therefore attribute the improvement in CTR to the enhancement in consistency. While the specific effects may vary across systems and attribute tables, observations across various systems and multidimensional data show significant performance improvements from maintaining consistency.

Refer to caption
Figure 10. CTR Relative increase with consistency policy

Resource Saving. NVMe usage depends on the categorization of data into hot and cold, which varies across applications. A feasible recommendation is to employ NVMe storage for embedding tables and features that exhibit large data volumes but very low Key-Seek Per Second (KPS) rates. The resource advantages mainly come from the improvements in high-performance batch queries: the boost in single-instance throughput allows for fewer replicas, and the rolling update capability reduces the amount of redundant resources. This architecture has saved around 30% of machine resources in our recommendation system. In systems with a long-tail distribution of user activity and item popularity, certain low-traffic embedding tables still require two replicas for fault tolerance; future optimizations could focus on cluster scheduling that shares replicas among low-traffic tables to improve resource efficiency.

4. Related Work

Various evolutionary schemes have been developed based on hash strategies (Pagh and Rodler, 2004; Herlihy et al., 2008; Celis et al., 1985). Coalesced Hashing (Vitter, 1982) is an attempt to combine linear probing and separate chaining, but it lacks consideration for cacheline friendliness and performs poorly in current practical environments; NeighborHash addresses these limitations and demonstrates significant improvements. Notable examples like Absl’s flat_hash_map (Google, 2017) and Facebook’s F14 (Bronson and Shi, 2019) have excelled in optimizing query efficiency through SIMD instruction utilization. Experimental evaluations show that NeighborHash achieves lower Average Probing Cache Lines and higher query throughput compared to linear probing. Monolith (Liu et al., 2022) effectively filters low-frequency and outdated feature IDs in recommendation systems, reducing conflicts and enhancing model performance by utilizing Cuckoo hashing, while (Zhang et al., 2020) gives a hybrid hashing method that combines frequency hashing and double hashing techniques for model size reduction.

Due to variations in scenarios and applications, there is no one-size-fits-all design for key-value storage, and its flexibility leads to ongoing research in optimization tailored to different contexts. Implementations like LevelDB, RocksDB, FlashStore, and SILT (Google, 2011; Facebook, 2021; Dong et al., 2021; Debnath et al., 2010; Lim et al., 2011) provide persistent key-value storage, with LevelDB and RocksDB built on LSM-trees. SlimDB (Ren et al., 2017) employs dynamic compaction and in-memory index optimizations to enhance throughput for semi-sorted data, while F2 (Kanellis et al., 2023) separates hot and cold data domains to boost overall performance under large data skew. In the realm of recommendation systems, feature-store systems such as Uber’s Michelangelo Palette, Google Feast, and Amazon SageMaker Feature Store (Chothani, 2017; Sell, 2019; AWS, 2019) offer solutions for AI applications. RecShard (Sethi et al., 2022) shards embedding tables based on training data distribution and model characteristics to improve model training throughput. EVStore (Kurniawan et al., 2023) implements a three-layer embedding lookup table to enhance recommendation effectiveness and optimize resources. These optimizations differ slightly from the approach in this paper, which significantly enhances online inference throughput within its feature-store architecture by employing shard constraints and ensuring access consistency; integrating these designs could yield greater benefits.

5. Conclusion

This paper presents an enhanced batch query architecture tailored for industrial-grade recommendation systems, providing high-performance throughput for batch queries of user and item features, model parameters and embedding tables.

We introduce NeighborHash, a hash-table optimized for batch point queries. By strategically placing conflicting nodes in neighboring positions, it reduces probe counts and cache misses, significantly enhancing batch query performance. Comparative analysis and ablation experiments demonstrate that NeighborHash significantly outperforms existing hashtable structures in recommendation scenarios. Building upon NeighborHash, we develop a query cluster that efficiently saves resources through NVMe storage support and rolling updates. It ensures the real-time update of model parameters crucial for recommendation effectiveness while maintaining access consistency during updates. We deployed this system in the bilibili recommendation system and achieved a win-win in both resources and performance.

References

  • AWS (2019) AWS. 2019. Create, store, and share features with Amazon SageMaker Feature Store. https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html
  • Bronson and Shi (2019) Nathan Bronson and Xiao Shi. 2019. Open-sourcing F14 for faster, more memory-efficient hash tables. https://engineering.fb.com/2019/04/25/developer-tools/f14/
  • Celis et al. (1985) Pedro Celis, Per-Ake Larson, and J. Ian Munro. 1985. Robin hood hashing. In 26th annual symposium on foundations of computer science (sfcs 1985). IEEE, 281–288.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems. 7–10.
  • Chothani (2017) Paarth Chothani. 2017. Palette Meta Store Journey. https://www.uber.com/en-KR/blog/palette-meta-store-journey/
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems. 191–198.
  • Davidson et al. (2010) James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, et al. 2010. The YouTube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems. 293–296.
  • Debnath et al. (2010) Biplob Debnath, Sudipta Sengupta, and Jin Li. 2010. FlashStore: High throughput persistent key-value store. Proceedings of the VLDB Endowment 3, 1-2 (2010), 1414–1425.
  • Dong et al. (2021) Siying Dong, Andrew Kryczka, Yanqin Jin, and Michael Stumm. 2021. Rocksdb: Evolution of development priorities in a key-value store serving large-scale applications. ACM Transactions on Storage (TOS) 17, 4 (2021), 1–32.
  • etcd io (2017) etcd io. 2017. etcd. http://etcd.io
  • Facebook (2021) Facebook. 2021. RocksDB. https://github.com/facebook/rocksdb
  • Fang et al. (2019) Zhuhe Fang, Beilei Zheng, and Chuliang Weng. 2019. Interleaved multi-vectorizing. Proceedings of the VLDB Endowment 13, 3 (2019), 226–238.
  • Google (2011) Google. 2011. LevelDB. https://github.com/google/leveldb
  • Google (2017) Google. 2017. Abseil. http://abseil.io
  • Gupta et al. (2014) Ashish Gupta, Fan Yang, Jason Govig, Adam Kirsch, Kelvin Chan, Kevin Lai, Shuo Wu, Sandeep Dhoot, Abhilash Kumar, and Ankur Agiwal. 2014. Mesa: Geo-replicated, near real-time, scalable data warehousing. (2014). https://research.google/pubs/pub42851/
  • Gupta et al. (2020) Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Mark Hempstead, Bill Jia, et al. 2020. The architectural implications of facebook’s dnn-based personalized recommendation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 488–501.
  • Herlihy et al. (2008) Maurice Herlihy, Nir Shavit, and Moran Tzafrir. 2008. Hopscotch hashing. In Distributed Computing: 22nd International Symposium, DISC 2008, Arcachon, France, September 22-24, 2008. Proceedings 22. Springer, 350–364.
  • Kanellis et al. (2023) Konstantinos Kanellis, Badrish Chandramouli, and Shivaram Venkataraman. 2023. F2: Designing a Key-Value Store for Large Skewed Workloads. arXiv preprint arXiv:2305.01516 (2023).
  • Kocberber et al. (2015) Onur Kocberber, Babak Falsafi, and Boris Grot. 2015. Asynchronous memory access chaining. Proceedings of the VLDB Endowment 9, 4 (2015), 252–263.
  • Kurniawan et al. (2023) Daniar H Kurniawan, Ruipu Wang, Kahfi S Zulkifli, Fandi A Wiranata, John Bent, Ymir Vigfusson, and Haryadi S Gunawi. 2023. EVStore: Storage and Caching Capabilities for Scaling Embedding Tables in Deep Recommendation Systems. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 281–294.
  • Li et al. (2014) Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on operating systems design and implementation (OSDI 14). 583–598.
  • Li et al. (2013) Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, David G Andersen, and Alexander Smola. 2013. Parameter server for distributed machine learning. In Big learning NIPS workshop, Vol. 6. Lake Tahoe, CA.
  • Lian et al. (2022) Xiangru Lian, Binhang Yuan, Xuefeng Zhu, Yulong Wang, Yongjun He, Honghuan Wu, Lei Sun, Haodong Lyu, Chengjun Liu, Xing Dong, et al. 2022. Persia: An open, hybrid system scaling deep learning-based recommenders up to 100 trillion parameters. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3288–3298.
  • Lim et al. (2011) Hyeontaek Lim, Bin Fan, David G. Andersen, and Michael Kaminsky. 2011. SILT: A memory-efficient, high-performance key-value store. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. 1–13.
  • Liu et al. (2022) Zhuoran Liu, Leqi Zou, Xuan Zou, Caihua Wang, Biao Zhang, Da Tang, Bolin Zhu, Yijie Zhu, Peng Wu, Ke Wang, et al. 2022. Monolith: real time recommendation system with collisionless embedding table. arXiv preprint arXiv:2209.07663 (2022).
  • Lu et al. (2018) Yichao Lu, Ruihai Dong, and Barry Smyth. 2018. Why I like it: multi-task learning for recommendation and explanation. In Proceedings of the 12th ACM Conference on Recommender Systems. 4–12.
  • Pagh and Rodler (2004) Rasmus Pagh and Flemming Friche Rodler. 2004. Cuckoo hashing. Journal of Algorithms 51, 2 (2004), 122–144.
  • Polychroniou et al. (2015) Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. 2015. Rethinking SIMD vectorization for in-memory databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1493–1508.
  • redis (2009) redis. 2009. Redis. https://redis.io
  • Ren et al. (2017) Kai Ren, Qing Zheng, Joy Arulraj, and Garth Gibson. 2017. SlimDB: A space-efficient key-value storage engine for semi-sorted data. Proceedings of the VLDB Endowment 10, 13 (2017), 2037–2048.
  • Sell (2019) Tim Sell. 2019. Introducing Feast: an open source feature store for machine learning. https://cloud.google.com/blog/products/ai-machine-learning/introducing-feast-an-open-source-feature-store-for-machine-learning
  • Sethi et al. (2022) Geet Sethi, Bilge Acun, Niket Agarwal, Christos Kozyrakis, Caroline Trippel, and Carole-Jean Wu. 2022. RecShard: statistical feature-based memory optimization for industry-scale neural recommendation. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 344–358.
  • Shen et al. (2022) Qijie Shen, Hong Wen, Jing Zhang, and Qi Rao. 2022. Hierarchically Fusing Long and Short-Term User Interests for Click-Through Rate Prediction in Product Search. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM ’22). ACM.
  • Skarupke (2018) Malte Skarupke. 2018. A new fast hash table in response to Google’s new fast hash table. https://probablydance.com/2018/05/28/a-new-fast-hash-table-in-response-to-googles-new-fast-hash-table/
  • Tuan and Phuong (2017) Trinh Xuan Tuan and Tu Minh Phuong. 2017. 3D convolutional networks for session-based recommendation with content features. In Proceedings of the eleventh ACM conference on recommender systems. 138–146.
  • Twardowski (2016) Bartłomiej Twardowski. 2016. Modelling contextual information in session-aware recommender systems with neural networks. In Proceedings of the 10th ACM Conference on Recommender Systems. 273–276.
  • Vitter (1982) Jeffrey Scott Vitter. 1982. Implementations for coalesced hashing. Commun. ACM 25, 12 (1982), 911–926.
  • Wan et al. (2021) Hu Wan, Xuan Sun, Yufei Cui, Chia-Lin Yang, Tei-Wei Kuo, and Chun Jason Xue. 2021. FlashEmbedding: storing embedding tables in SSD for large-scale recommender systems. In Proceedings of the 12th ACM SIGOPS Asia-Pacific Workshop on Systems. 9–16.
  • Zhang et al. (2020) Caojin Zhang, Yicun Liu, Yuanpu Xie, Sofia Ira Ktena, Alykhan Tejani, Akshay Gupta, Pranay Kumar Myana, Deepak Dilipkumar, Suvadip Paul, Ikuhiro Ihara, et al. 2020. Model size reduction using frequency based double hashing for recommender systems. In Proceedings of the 14th ACM Conference on Recommender Systems. 521–526.
  • Zhao et al. (2019) Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. 2019. Recommending what video to watch next: a multitask ranking system. In Proceedings of the 13th ACM Conference on Recommender Systems. 43–51.