Response to reviewers’ comments

Summary of changes in the revision

We would like to thank the reviewers for their insightful comments and suggestions. We have addressed them in our responses to the reviewers and revised our manuscript accordingly. For ease of reviewing, we have highlighted the changes in the manuscript in blue. Due to space limitations, some changes are available in the Appendix of our technical report [ZTMC20_arxiv]. We also attach this appendix right after this response letter for reference.

Meta-Reviewer 1

Comments

  1.

    a: Correct Lemma 5.2.

    Response: We have corrected that lemma and addressed the corresponding comment (i.e., Reviewer #3 D6).

  2.

    b: Present an index construction experiment (following W2 of Review#6)

    Response: We have added an index construction experiment in the revised version. Please refer to Reviewer #6 D3 for a detailed explanation.

  3.

    c: Provide an experiment or at least a discussion on more dimensions than 2.

    Response: In the revised version, we have included a discussion on this case in Appendix A.5. Please refer to Reviewer #2 D5 for a detailed explanation.

  4.

    d: Highlight novelty of this work much better, especially over related work such as ”Online Piece-wise Linear Approximation of Numerical Streams with Precision Guarantees”, VLDB 2009.

    Response: Please refer to Reviewer #6 D1 for a detailed explanation of the novelty and technical contribution of our paper.

    In our revised paper, we have cited the above VLDB 2009 paper (on piecewise linear approximation) in the related work section. That paper comes from the time series database community. Our work differs from it in three ways: (i) modeling, (ii) queries with error guarantees, and (iii) the number of supported keys. First, we model the dataset with polynomial functions (rather than linear segments), which offer better approximation and reduce both the number of models and the query response time. Second, our proposed methods can evaluate range aggregate queries with deterministic absolute and relative error guarantees, whereas the above paper cannot support these guarantees. Third, our proposed solution supports range aggregate queries with two keys (in Section 6), whereas the above paper supports a single key only.

Reviewer #2

Comments

  1.

    D5: It is not clear whether or not the proposed solutions can be applied for aggregate queries with more (>>2) keys. The authors need clearly discuss this scalability-related problem.

    Response: Thanks for your comment. In Appendix A.5, we have discussed how to support queries with $d>2$ keys. We have also explained that the performance of PolyFit may degrade when $d$ increases. The main reasons are: (i) the query evaluation involves accessing $O(2^{d})$ precomputed values (see the illustrative sketch at the end of this response), and (ii) in a high-dimensional setting, it is hard to approximate the key-cumulative surface well by using a small number of models.

    In our paper, we focus only on queries with a single key or two keys, as they appear in many applications. For example, in satellite data processing [5], developers need to aggregate a measure (e.g., the units of vegetation) in a given region over huge satellite data. This corresponds to an aggregate COUNT query with 2 keys (on longitude and latitude). In tweet analysis [6], analysts would like to know how many tweets have been posted during a particular time range. This corresponds to a range COUNT query with a single key on timestamp. Stock trading and analysis tools (e.g., tradingview.com) allow users to query the highest (lowest) price during a particular time range. This is an aggregate MAX (MIN) query with a single key on timestamp. In Foursquare, users can check how many restaurants there are within a given region [1], which is an aggregate COUNT query with 2 keys. We have also added these references to the Introduction.
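    To illustrate the $O(2^{d})$ term mentioned above, the following is a sketch (our own illustration, not quoted from the paper) of the standard inclusion-exclusion identity for evaluating a range aggregate from a $d$-dimensional cumulative function. For $d=2$,

    \[
    R_{sum}(\mathcal{D},[l_{1},u_{1}]\times[l_{2},u_{2}]) \;\approx\; CF_{sum}(u_{1},u_{2}) - CF_{sum}(l_{1},u_{2}) - CF_{sum}(u_{1},l_{2}) + CF_{sum}(l_{1},l_{2}),
    \]

    and in general the alternating sum ranges over the $2^{d}$ corners of the query hyper-rectangle, which is the source of the $O(2^{d})$ evaluation cost.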

  2.

    D6: It may be better to contain some toy examples for illustrating some technical details. For instance, a toy example can be added after Eq. (17) for illustrating how built indexes support the generated approximate results.

    Response: Thanks for your comment. For the illustration of range MIN/MAX queries, we use Figure 2 to explain the exact version and Figure 7 to explain our adaptation with polynomial models.

    Our index structure is similar to an aggregate R-tree, except that we store polynomial models (rather than MBRs) in leaf nodes. In the revised version, we have added a discussion at the beginning of Section 5.2 to make this clearer.
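    To make the leaf-level structure concrete, below is a minimal, hypothetical sketch (our own illustration; the class and field names such as LeafEntry, coeffs, and delta are not taken from the paper) of a leaf entry that stores a fitted polynomial model together with its key interval and maximum fitting error, instead of an MBR.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class LeafEntry:
        """One leaf entry: a polynomial model fitted over a key interval.

        The model approximates the underlying function (e.g., DF_max) within
        [k_min, k_max], with maximum absolute fitting error at most delta.
        """
        k_min: float          # left endpoint of the covered key interval
        k_max: float          # right endpoint of the covered key interval
        coeffs: List[float]   # a_0, ..., a_deg of P(k) = sum_j a_j * k^j
        delta: float          # max |P(k) - F(k)| over keys in [k_min, k_max]

        def evaluate(self, k: float) -> float:
            """Evaluate P(k) by Horner's rule."""
            result = 0.0
            for a in reversed(self.coeffs):
                result = result * k + a
            return result

        def bounds(self, k: float):
            """Lower and upper bounds on the true value F(k) at key k."""
            p = self.evaluate(k)
            return p - self.delta, p + self.delta

    Intuitively, such per-entry bounds can play the role that the stored per-entry maxima play in a standard aggregate max-tree.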

Reviewer #3

Comments

  1.

    D1: In the preliminaries, the authors should explicit state whether the assumptions that ”key values are distinct” and that ”every $m_{i}$ is non-negative” is without loss of generality, and explain why this is the case or not. It is also implicitly assumed that ”measure” values are numbers (which is not required for COUNT queries) and that key values come from a totally ordered domain (how else could we have ranges?). Even though this is understandable from the context, it should be stated explicitly in the interest of completeness and clarity.

    Response: Thanks for your comment. We have revised Section 3.1 accordingly. In Appendix A.3 and A.4, we have discussed how to support repeated keys and negative measure values, respectively.

  2.

    D2: The ”relational algebra” syntax used in Equation 1 is confusing, wrong and unnecessary. Firstly, it is not well formed because a selection on $\mathcal{D}$ returns a set of key-measure pairs (k,m) whereas the aggregation function $\mathcal{G}$ takes as input bag of (measure) values. Secondly, in a more standard, logic-based notation for relational algebra, the selection condition ”$k\in[l_{q},u_{q}]$” would be expressed as ”$k\geq l_{q}\land k\leq u_{q}$”. Thirdly, there is no need to resort to relational algebra at all; just define Equation 1 as $R_{\mathcal{G}}(\mathcal{D},[l_{q},u_{q}])=\mathcal{G}(V)$ and say that V is the bag consisting of measure values m for which the pair (k,m), with $l_{q}\leq k\leq u_{q}$, belongs to $\mathcal{D}$. For additional clarity, one should stress that V is a multiset because the same measure m can occur for different key values. Similar issues apply to the equation for the 2-dimensional case in Definition 6.1.

    Response: Thanks for your comment. We have revised Equation 1 and Definition 6.1 accordingly.

  3.

    D3: Equation 6 does not mention the dataset $\mathcal{D}$, whose key values $k_{i}$ the definition of $DF_{max}$ depends on. Moreover, $DF_{max}(k)$ is well defined only if those key values are considered in ascending order, with $k_{1}$ being the smallest. This should be stated explicitly; there is a mention to ordering the dataset $\mathcal{D}$ by ascending key values in the paragraph before 3.2.2, but this refers to the computation of $CF_{sum}$, not $DF_{max}$. Moreover, from the right-hand side of Equation 6 (which, by the way, should start with ”$m_{1}$ if …”), it is not clear when $DF_{max}(k)=m_{n}$, that is, the measure associated with the ”last” key value in $\mathcal{D}$. According to the current formulation it would be ”$m_{n}$ if $k_{n}\leq k<k_{n+1}$”, but there is no ”$k_{n+1}$” in the dataset; should it then be ”$m_{n}$ if $k_{n}<=k$”? This also does not look entirely satisfactory. The authors should clarify the definition of $DF_{max}$.

    Response: Thanks for your comment. We have revised the description in Section 3.2.2 and Equation 6 in our paper.

  4.

    D4: The example in Figure 2(a) is not good: I would expect $DF_{max}$ to have a staircase shape as in Figure 3. Why is it smooth here? Also, what are $N_{1}$, $N_{2}$, …? I do understand they are key intervals, but it should be made explicit and clear how they relate to the definition of $DF_{max}$ given in Equation 6.

    Response: Thanks for your comment. In the revised version, we have modified the curve in Figure 2(a) to have a staircase shape, which is in line with the definition of $DF_{max}$ (cf. Equation 6). $N_{1}$, $N_{2}$, … denote the internal nodes of the aggregate max-tree (cf. Figure 2b). To avoid misunderstanding, we replace $N_{1}$, $N_{2}$, … with the intervals $I_{1}$, $I_{2}$, …, respectively, in Figure 2a and clearly illustrate in Figure 2b that each node stores two entries, where each entry stores an interval and the maximum measure within that interval.

  5.

    D5: In the proof of Theorem 4.3 it should be defined and explained what an ”ascending sequence of intervals is”, that is, a sequence $I^{(1)},...,I^{(k)}$ of intervals where each $I^{(i)}$, for $i=1,...,k-1$, is such that $I^{(i)}.k_{max}<$ (is equality allowed here?) $I^{(i+1)}.k_{min}$. The min and max key of an interval I could also be simply denoted as I.min and I.max, respectively (instead of $I.k_{min}$ and $I.k_{max}$). Moreover, since the proof is by induction, it should be properly formatted as an induction, with a base case and an inductive step.

    Response: Thanks for your comment. We have revised Theorem 4.3 and the proof accordingly.

  6.

    D6: In the proof of Lemma 5.2, the step from Equation 15 to Equation 16 is not appropriately justified and I am not sure it is correct. What do the authors mean by ”rearranging terms”? Where did the absolute value go? This should be clarified.

    Response: Thanks for your comment. We can derive Equation 16 (Equation 17 in this version) from Equation 15 (Equation 16 in this version) with simple mathematical steps, shown as follows.

    \begin{align*}
    |\tilde{A}_{sum}-R_{sum}(\mathcal{D},[l_{q},u_{q}])| &\leq 2\delta\\
    -2\delta \;\leq\; \tilde{A}_{sum}-R_{sum}(\mathcal{D},[l_{q},u_{q}]) &\leq 2\delta\\
    \tilde{A}_{sum}-2\delta \;\leq\; R_{sum}(\mathcal{D},[l_{q},u_{q}]) &\leq \tilde{A}_{sum}+2\delta
    \end{align*}

    Therefore, Equation 16 (Equation 17 in this version) is obtained from the left-hand side of the above inequality, which is correct. We omit the inequality on the right-hand side, since we do not need it in our proof. The derivations of these two equations are similar to the proof of Lemma 5.1.

    To avoid confusion, we have explicitly stated that Equation 16 (Equation 17 in this version) follows from Equation 15 (Equation 16 in this version) by simple mathematical derivations.

  7.

    D7: In Equation 9, we minimize the max absolute difference between $F$ and $\mathbb{P}$. This makes sense, but one could also try to minimize the area between these two functions, which would be useful to detect cases in which the functions are overall quite close except for an isolated point with a high peak difference, as opposed to the situation where the functions have a significant difference (below the maximum, but much greater than 0) across many data points. Did the authors consider this possibility?

    Response: Thanks for your comment. This is an interesting idea for further exploration in future work. In this paper, we aim to answer queries with an absolute error guarantee $\epsilon_{abs}$ (cf. Problem 1). To fulfill this guarantee, we need to minimize the maximum absolute difference in our equation. If the objective were to minimize the area between $F$ and $\mathbb{P}$, we could not provide the aforementioned error guarantee.
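    As an illustration of this point (a sketch of the reasoning behind Lemmas 5.1-5.2, not quoted from the paper, and assuming for simplicity that a single model $\mathbb{P}$ covers both query endpoints), bounding the maximum pointwise deviation by $\delta$ immediately yields an absolute guarantee for an answer evaluated at the two endpoints:

    \[
    |\mathbb{P}(k)-CF_{sum}(k)|\leq\delta \ \ \text{for all } k
    \;\Longrightarrow\;
    \big|\,(\mathbb{P}(u_{q})-\mathbb{P}(l_{q}))-(CF_{sum}(u_{q})-CF_{sum}(l_{q}))\,\big|\leq 2\delta,
    \]

    whereas a bound on the area between the two curves does not constrain the deviation at the specific points $l_{q}$ and $u_{q}$, so no such guarantee would follow.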

  8.

    D8: I understand that MRTree, aR-tree and RMI are not competitive w.r.t. query response time, but they should be included in the experiment of Figure 19 nevertheless, because they might still offer a good trade-off in terms of index size (i.e., high response time, but very small index size). Omitting these methods here does not give us the full picture.

    Response: Thanks for your comment. We have included these three methods in Figure 19 and revised the paragraph accordingly. Observe that our method PolyFit outperforms them in terms of both query response time and index size in most cases.

  9.

    D9: Typos and other minor comments:

    - Throughout the paper: the error is denoted with $\varepsilon$ in some places (e.g., Section 5.2) and $\epsilon$ in others (e.g., Section 5.3); please be consistent.

    - Abstract, last but one line: ”Experiment results” \rightarrow ”Experimental results”.

    - Introduction, line 3: ”to a attribute” \rightarrow ”to an attribute”.

    - Page 1, right, third paragraph: ”Another difference […]” \rightarrow What was the first one?

    - Related Work, last but one line of first paragraph: ”learned index, and times series database,” \rightarrow ”learned indexes, and times series databases,” (plural).

    - Page 5, left, line 5: ”calls LP solver” \rightarrow ”calls the LP solver” or ”calls an LP solver”.

    - Page 5, left, line below Algorithm 1: use ”$O(n\times\ell^{2.5})$” for consistency of notation.

    - Proof of Lemma 4.2, line 3: ”Since the $S_{l}$” \rightarrow ”Since $S_{l}$”.

    - Section 4.3, line 2: ”range from” \rightarrow ”ranges from”.

    - Section 5, line 5: ”We then discuss” \rightarrow ”We discuss”.

    - Line after Equation 14: use ”noindent” at the beginning.

    - Sentence after Equation 14: $l_{q}$ and $u_{q}$ are values, how do they ”overlap” an interval? You should rephrase the sentence as ”where $I_{l}$ and $I_{u}$ denote the intervals of (the domain of) $\mathbb{P}$ that contain the values $l_{q}$ and $u_{q}$, respectively”. A similar comment applies to the sentence after Equation 17.

    - Lines after Equation 16: ”dividing with” \rightarrow ”dividing by”; put a colon at the end of the sentence instead of a period.

    - Algorithm 2, input: remind the reader that $Seq_{\mathbb{P}}$ is the output of Algorithm 1.

    - Algorithm 2, lines 1-2: ”cover” \rightarrow ”includes”.

    - Equation 17 spills into the right margin of the page and ”noindent” should be used at the beginning of the line immediately after.

    - Page 6, right, line -9: ”of aggregate R-tree” \rightarrow ”of the aggregate R-tree”. More importantly, at this point it is not clear what R-tree you are referring to here, because this is only said a few lines after when you talk about ”the overall query algorithm”.

    - Page 6, right, line -8: ”whose intervals are covered by $[l_{q},u_{q}]$” \rightarrow do you perhaps mean ”whose intervals cover $[l_{q},u_{q}]$”?

    - Page 6, right, line -6: what are ”$U_{I_{l}}$” and ”$L_{I_{u}}$”?

    - Page 7, left, line -3: ”could be approximated by either” \rightarrow ”could be approximated, among others, by either”.

    - Page 7, right, lines 8-9 of How to tune $\delta$? ”Algorithm 2, 3” \rightarrow ”Algorithms 2 and 3”.

    - Page 8, left, 7 lines before Figure 10: ”for small-scale dataset” \rightarrow ”for small-scale datasets” (plural).

    - Page 8, left, 3 lines after Figure 10: ”theoretical guarantee” \rightarrow ”theoretical guarantees” (plural).

    - Lemma 6.3: ”A” should be ”$\tilde{A}_{count}$”.

    - Lemma 6.4: ”$A_{rel}$” and ”A” should both be ”$\tilde{A}_{count}$”.

    - List before Table 4: ”absolute error guarantee” \rightarrow ”absolute error guarantees”; ”the relative error guarantee” \rightarrow ”relative error guarantees”.

    - Page 9, left, line 9: ”Due to the space limitation” \rightarrow ”Due to space limitations”.

    - Page 9, left, line 25: ”time series databases respectively” \rightarrow ”time series databases, respectively”.

    - Section 7.2, line 2: ”method PolyFit,” \rightarrow ”PolyFit method,” or ”method, PolyFit,”.

    - The last sentence of 7.2.1 is confusing; please rephrase to ”By default, in subsequent experiments, we choose deg = 2 for the COUNT query with single key, and deg = 3 for the COUNT query with two keys, and for the MAX query.”

    - Paragraph 7.2.2, line 5: ”single key and two keys respectively” \rightarrow ”single key and two keys, respectively”.

    - Page 10, right, 6 lines below Figure 16: ”all methods for COUNT query […] have the logarithmic” \rightarrow ”all methods for the COUNT query […] have logarithmic”.

    - Page 10, right, last line: ”OSM (with 100M)” \rightarrow ”OSM (with 100M records)”.

    - Page 11, left, 11 lines below Figure 18: ”The reason is the smaller” \rightarrow ”The reason is that smaller”.

    - Page 11, left, 13 lines below Figure 18: ”generated by GS method” \rightarrow ”generated by the GS method”.

    - Page 11, left, line -2: ”cannot support COUNT query” \rightarrow ”cannot support COUNT queries”.

    - Figures on pages 10-11: The layout is poor, causing some lines of text to be ”trapped” between figures (e.g., Figures 17-18 and 19-20). It would perhaps be better to have all figures in a single, separate page.

    Response: Thanks for your comment. We have fixed these typos and minor comments accordingly. Regarding the last point, since we have many figures in this paper, we cannot put all of them on a single separate page. Instead, we put all figures of Sections 7.2 and 7.3 on page 10 and all figures of Sections 7.4 and 7.5 on page 11.

Reviewer 4

Comments

  1.

    There are similar approaches not mentioned by the authors in the related work section by Peter Triantafyllou. For example Fotis Savva, Christos Anagnostopoulos, Peter Triantafillou: Aggregate Query Prediction under Dynamic Workloads. BigData 2019: 671-676

    Response: Thanks for suggesting this reference. We have cited this paper in the Related Work section (please refer to the third paragraph in Section 2). There are two main differences between that work and our method. First, PolyFit supports efficient range aggregate queries with deterministic absolute and relative error guarantees, whereas that work aims to minimize the expected quantization error. Second, their solution can be regarded as a query-driven approach, whereas PolyFit does not need a query workload.

  2.

    The various trees used in your approach ideally would need a figure or an example for the paper to be self-contained.

    Response: Thanks for your comment. In our paper, we utilize the well-known B+-tree (specifically the STX B-tree, Ref. [6] in our paper) and the Quad-tree to index the polynomial models for COUNT/SUM queries with one key and two keys, respectively, as summarized in Figure 4 and Section 4.3 (see also the illustrative sketch below). Due to space limitations and the popularity of these two index structures, we omit figures for them in this paper. For MAX/MIN queries, please refer to the response to Reviewer #2 D6.
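    As an illustration only (our own hypothetical sketch, not the paper's code; the names Model, find_model, and approx_range_sum are ours), the snippet below shows how, once the fitted models are available, the model covering each query endpoint can be located through a sorted structure (here a simple binary search standing in for the B+-tree) and combined into an approximate COUNT/SUM answer.

    import bisect
    from collections import namedtuple

    # A polynomial model fitted to the cumulative function over [k_min, k_max].
    Model = namedtuple("Model", ["k_min", "k_max", "coeffs"])

    def evaluate(model, k):
        """Evaluate P(k) = sum_j coeffs[j] * k**j by Horner's rule."""
        result = 0.0
        for a in reversed(model.coeffs):
            result = result * k + a
        return result

    def find_model(models, k):
        """models are sorted by k_min and jointly cover the key domain."""
        i = bisect.bisect_right([m.k_min for m in models], k) - 1
        return models[i]

    def approx_range_sum(models, l_q, u_q):
        """Approximate SUM/COUNT over [l_q, u_q] from the cumulative models;
        under a maximum fitting error of delta per model, the true answer is
        within 2 * delta of this value (cf. Lemma 5.1)."""
        return evaluate(find_model(models, u_q), u_q) - evaluate(find_model(models, l_q), l_q)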

  3.

    The evaluation of the approach is really solid and I detailed. I only was expecting to see larger more complex datasets. Along with this line the following question comes up:What about more than 2 dimensions? At least a discussion should be provided here.

    Response: Thanks for your comment. We have checked the characteristics of the real datasets used in recent related works (Ref. [24], [41] in our paper). They use numeric datasets with 1-2 attributes, and the number of records in the largest dataset is in the order of 100 million. Thus, we consider it reasonable to use the OSM dataset (with 100 million records and two keys) in our experimental study. Regarding the comment on more than 2 dimensions, please refer to the response to Reviewer #2 D5.

Reviewer 6

Comments

  1.

    D1: In terms of the technical contribution of the paper, it is not clear which ideas are novel. The query answering approaches are standard to anyone that has used, for example, prefix sum arrays. The fact that polynomial functions are used for the approximation, instead of linear functions as in prior works, does not change the way that queries can be answered. Regarding the number of produced intervals, previous works using piecewise linear approximations can also similarly produce minimal numbers of intervals. Thus, it is not clear how large the novelty and technical contribution of the paper is, and clarifying it would make the paper much stronger.

    Response: Thanks for your comment. There are two main differences between our work and the existing works.

    First, all the existing learned index methods focus on supporting range queries, including RMI (Ref. [41] in our paper) and PGM (Ref. [24] in our paper), or point queries, including FITing-tree (Ref. [25] in our paper), rather than range aggregate queries (the focus of this work), as pointed out in the introduction (paragraph 2), the related work (paragraph 4) and the experimental section (Table 4). Even though the existing learned-index methods (Ref. [24], [25], [41] in our paper) can be extended to support range aggregate queries with both absolute and relative error guarantees, based on the techniques in Section 3.2.1, they cannot efficiently support MIN/MAX queries or the setting of two keys. Compared with the existing approaches, we develop a unified method, called PolyFit, which builds index structures for different types of polynomial functions, e.g., $\mathbb{P}(k)=\sum_{j=0}^{deg}a_{j}k^{j}$ (cf. Equation 8) and $\mathbb{P}(u,v)=\sum_{i=0}^{deg}\sum_{j=0}^{deg}a_{ij}u^{i}v^{j}$ (cf. Section 6), to efficiently handle range COUNT/SUM/MIN/MAX queries with one/two keys. We believe that none of the existing learned index methods (Ref. [24], [25], [41] in our paper) can be easily extended to handle all these combinations efficiently.

    Second, we have formulated the polynomial fitting problem (cf. Equations 9 and 10), which minimizes the maximum error between the polynomial curve (one key) or surface (two keys) and the key-cumulative ($CF_{sum}$) and key-measure ($DF_{max}$) functions; a minimal code sketch of such a minimax fit is given after this response. This approach reduces the number of models under both the single-key and two-key settings. By adopting the lemmas for query evaluation (Lemmas 5.1 - 5.4, 6.3 - 6.4), we can construct a compact index that provides both absolute and relative error guarantees for range COUNT/SUM/MIN/MAX queries with one/two keys.

    We have already summarized the above technical contributions in the introduction (the first two bullets).
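    For concreteness, the following is a minimal sketch (our own illustrative code, not the paper's implementation; the function name fit_minimax_poly and the use of SciPy are our assumptions) of how a degree-deg polynomial minimizing the maximum absolute error over a set of (key, value) pairs can be obtained with an off-the-shelf LP solver, in the spirit of Equations 9 and 10. The LP variables are the coefficients $a_{0},\ldots,a_{deg}$ and the error bound $\delta$.

    import numpy as np
    from scipy.optimize import linprog

    def fit_minimax_poly(keys, values, deg):
        # Variables x = [a_0, ..., a_deg, delta]; minimize delta subject to
        #   P(k_i) - v_i <= delta  and  v_i - P(k_i) <= delta  for every i.
        keys = np.asarray(keys, dtype=float)
        values = np.asarray(values, dtype=float)
        n = len(keys)
        V = np.vander(keys, N=deg + 1, increasing=True)       # V[i, j] = keys[i] ** j
        ones = np.ones((n, 1))
        A_ub = np.vstack([np.hstack([V, -ones]),              #  P(k_i) - delta <= v_i
                          np.hstack([-V, -ones])])            # -P(k_i) - delta <= -v_i
        b_ub = np.concatenate([values, -values])
        c = np.zeros(deg + 2)
        c[-1] = 1.0                                           # objective: minimize delta
        bounds = [(None, None)] * (deg + 1) + [(0.0, None)]   # coefficients free, delta >= 0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return res.x[:deg + 1], res.x[-1]                     # coefficients, achieved delta

    The returned delta is the maximum absolute deviation achieved by the fitted model, which is the quantity that enters the absolute and relative error guarantees.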

  2.

    D2: In terms of the experimental evaluation, there is no comparison of PolyFit to other past work in terms of the construction time.

    Response: Thanks for your comment. In this version, we have added Section 7.5 and Figure 20 to compare the construction time of all methods.

  3.

    D3: The construction time is quite high. This technique can be applied to multi-dimensional data, when the queries are either one-dimensional or two-dimensional. These types of data are quite large, much larger than the data sizes used in the experiments, and the running time seems prohibitive. 30 million records required 41 minutes, which is too long, considering that 30M tuples are too small (typically fit in main memory) compared to the sizes expected in many applications.

    Response: Thanks for your comment. Since range aggregate queries are mainly used to support data analytics tasks, as stated in the introduction, the datasets are normally static in this setting (few or no changes to the datasets). Therefore, it is acceptable to develop algorithms with a higher construction time for these tasks. As an example, both DBest and Hist also support data analytics tasks and exhibit high construction times (even higher than PolyFit, cf. Figure 20).

    In this version, we have explicitly stated in Section 7.5 that the construction time of PolyFit is reasonable in practice. Moreover, we have also discussed how to utilize parallel computation to reduce the construction time in Appendix A.6.

  4.

    D4: The running time of PolyFit is the best aspect of the approach. It would be nice to increase the motivation of the paper though, as for interactive queries it is not clear if it makes a large difference to have an approach that is faster, but in the order of nanoseconds, to other techniques. The paper could try to motivate things using applications with too many queries by a very large number of users, where the performance per query could make a difference.

    Response: Thanks for your comment. We use the following two real examples to illustrate why it is important to reduce the response time to the $\mu$s or ns level for supporting range aggregate queries.

    As stated in [3], 55% of adults in the United States invest in the stock market, which indicates that there are roughly 140 million active users in the US stock market (based on the statistics of the US population [Population_US]). Since these users can perform different data analytics tasks [OCED19, YC14, HB04, CD97], which are built on top of (multiple) range aggregate queries (e.g., SUM, MIN and MAX queries), the system needs to handle huge numbers of such queries per second. Even if only 10% of the active users perform data analytics tasks, there can be at least 14 million queries per second. As such, it is necessary to reduce the response time of range aggregate queries to the ns level.

    In addition, Foursquare, which helps users find the number of specific POIs (e.g., restaurants) within a given region, i.e., a COUNT query with two keys [1], has more than 50 million monthly users (in 2014) [FSL15] and 105 million POIs [1]. To support such a huge number of users, it is also essential to reduce the response time to the $\mu$s/ns level.

    Furthermore, selectivity estimation [YSB20, ACASVS19, HKC13] is a frequently used module in database systems. As selectivity estimation can be cast as range COUNT queries, our proposal can help database systems obtain accurate estimates rapidly.

    Due to space limitations, we only add the second example (Foursquare) to the Introduction (cf. Section 1).

  5.

    D5: Why is dividing Equation 15 with Equation 16 valid, considering that the left part of Equation 16 may be a small positive number, while the right part of Equation 16 may be negative?

    Response: Thanks for your comment. To avoid misunderstanding, we revise the condition in Lemma 5.2 to be $\tilde{A}_{sum}\geq 2\delta(1+\frac{1}{\varepsilon_{rel}})$. This condition implies $\tilde{A}_{sum}>2\delta$ (since $\varepsilon_{rel}>0$), which means that the right part of Equation 16 (Equation 17 in this version) must be positive. Therefore, we can ensure that the $\leq$ relation remains valid after we divide Equation 15 (Equation 16 in this version) by Equation 16 (Equation 17 in this version). We have also revised the proof of Lemma 5.2 accordingly. A sketch of the resulting derivation is given below.
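    Concretely (a sketch of the argument using the notation of Lemmas 5.1 - 5.2, not quoted verbatim from the paper): the revised condition gives $\tilde{A}_{sum}-2\delta\geq\frac{2\delta}{\varepsilon_{rel}}>0$, and together with $R_{sum}(\mathcal{D},[l_{q},u_{q}])\geq\tilde{A}_{sum}-2\delta$ (Equation 17 in this version) we obtain

    \[
    \frac{|\tilde{A}_{sum}-R_{sum}(\mathcal{D},[l_{q},u_{q}])|}{R_{sum}(\mathcal{D},[l_{q},u_{q}])}
    \;\leq\;
    \frac{2\delta}{\tilde{A}_{sum}-2\delta}
    \;\leq\;
    \frac{2\delta}{2\delta/\varepsilon_{rel}}
    \;=\;
    \varepsilon_{rel},
    \]

    so the division is applied only to positive quantities and the relative error guarantee follows.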

References