

Assessing the Feasibility of Web-Request
Prediction Models on Mobile Platforms

Yixue Zhao, University of Massachusetts Amherst, Amherst, USA
Siwei Yin, Beijing University of Posts and Telecommunications, Beijing, China
Adriana Sejfia, University of Southern California, Los Angeles, USA
Marcelo Schmitt Laser, University of Southern California, Los Angeles, USA
Haoyu Wang, Beijing University of Posts and Telecommunications, Beijing, China
Nenad Medvidovic, University of Southern California, Los Angeles, USA
Abstract

Prefetching web pages is a well-studied solution to reduce network latency by predicting users’ future actions based on their past behaviors. However, such techniques are largely unexplored on mobile platforms. Today’s privacy regulations make it infeasible to explore prefetching with the usual strategy of amassing large amounts of data over long periods and constructing conventional, “large” prediction models. Our work is based on the observation that this may not be necessary: Given previously reported mobile-device usage trends (e.g., repetitive behaviors in brief bursts), we hypothesized that prefetching should work effectively with “small” models trained on mobile-user requests collected during much shorter time periods. To test this hypothesis, we constructed a framework for automatically assessing prediction models, and used it to conduct an extensive empirical study based on over 15 million HTTP requests collected from nearly 11,500 mobile users during a 24-hour period, resulting in over 7 million models. Our results demonstrate the feasibility of prefetching with small models on mobile platforms, directly motivating future work in this area. We further introduce several strategies for improving prediction models while reducing the model size. Finally, our framework provides the foundation for future explorations of effective prediction models across a range of usage scenarios.

Index Terms:
Prediction, Web Requests, Mobile Platform, Network Latency, Empirical Study

I Introduction

Prefetching network requests is a well-established area aimed at reducing user-perceived latency [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. This work can be classified into two categories based on how predictions are made [12]: content-based techniques predict future network requests by analyzing application content, such as web page structure, to anticipate “sub-resources” (e.g., images, JavaScript files) that will be needed by a web page; history-based techniques predict future requests by analyzing past requests.

Recent research has highlighted the opportunity to apply prefetching on mobile platforms [13]. However, existing work has primarily explored content-based strategies that carry several limitations [7, 6, 14]. First, their accuracy is impacted by the complexity of the program analyses they employ. For instance, PALOMA [7] and APPx [6] rely on static analysis that is known to be unsound [15] and is thus guaranteed to miss certain prefetchable requests. Second, they rely on apps’ logic and thus may not be effective for certain apps. For example, PALOMA analyzes the app to determine a request’s URL one callback before the request is issued, and then prefetches it; if a URL is unknown one callback ahead, PALOMA will never prefetch it. Finally, content-based techniques do not generalize across different frameworks (e.g., Android, iOS) or app types (e.g., native apps vs. mobile browsers).

By contrast, history-based prefetching does not require program analysis and is platform-independent: it only relies on past requests. However, despite the large body of work targeting history-based prefetching in browsers [16, 1, 17, 2, 9, 10, 18, 19, 11, 20], such techniques remain largely unexplored on mobile platforms.

This gap is caused by a key challenge in applying history-based prefetching today: the user data is hard to obtain due to the profusion of data-privacy regulations introduced and regularly tightened around the world [21, 22, 23, 24, 25]. Although such regulations vary across regions, the global trend is to restrict data collection to only the necessary amount with user consent (e.g., see GDPR [23]). Furthermore, today’s users are increasingly protective of their data and much more likely to agree to its collection if it is restricted to small amounts [26, 27, 28]. This makes the traditional strategy applied in this area, which has relied on “large” data, impractical today. For example, a study published in 2011 reported on data collected from 25 iPhone users over the course of an entire year [29] and inspired a number of follow-up studies [8, 30, 31, 32, 12, 33, 34]. A decade later, such protracted data collection would be difficult to imagine, both because it would likely fall afoul of legal regulations introduced in the meantime and because today’s end-users are more keenly aware and protective of their data.

Given these constraints, an obvious strategy would be to limit the amount of data on which the prefetching relies. However, there is no evidence that it is feasible to predict future requests based on small amounts of data. This appears to have been an important factor that has discouraged the exploration of history-based prefetching in recent years. We believe that this has been a missed opportunity. Namely, the previously reported mobile-device usage patterns—repetitive activities in brief bursts [35, 36, 37, 38, 39]—lead us to hypothesize that history-based prefetching may work effectively with small prediction models trained on mobile-user requests collected during short time periods. If borne out in practice, this would open new research avenues.

To evaluate our hypothesis, and to facilitate future explorations in this area, we first construct a tailorable framework, HipHarness, which provides several customizable components to automatically assess prediction models across a range of scenarios. For example, HipHarness allows comparing the effectiveness of different prediction algorithms by running them side-by-side on the same data, measuring the impact of different training-data sizes on accuracy, and so on.

HipHarness enables us to flexibly assess prediction models built with any algorithm of interest. In this paper, we specifically customize HipHarness to analyze models built with the three most widely employed history-based prediction algorithms from the traditional browser domain [10, 9, 40], as well as a fourth algorithm we introduce to serve as the evaluation baseline. Our study uses the mobile-network traffic obtained from nearly 11,500 users at a large university during a 24-hour period. The selection of this time period was guided by previously made observations of repetitive mobile-user behaviors during a single day [36, 37, 38, 39]. The closest study to ours [8] only evaluates one algorithm [40] with mobile-browser data collected over a year from 25 iPhone-using undergraduates at Rice University [29]. By comparison, our study evaluates four algorithms on both mobile-browser and mobile-app data, relying on ≈400× more users from a much more diverse user base, but during a ≈400× shorter time frame.

Our dataset comprises over 15 million HTTP requests from nearly 31,000 Internet domains, allowing us to explore orders-of-magnitude more models compared to prior work. We use HipHarness to assess over 7 million models tailored to each user based on their past usage, varying from a single request to over 200,000 requests. While HipHarness allows the exploration of various research questions, this paper targets (1) the repetitiveness of user requests during short time periods as the key prerequisite for the feasibility of small prediction models; (2) the effectiveness of existing prediction algorithms on mobile platforms to provide insights for future algorithms; and (3) strategies for reducing training-data sizes without sacrificing model accuracy as a way of yielding even smaller models.

This paper makes the following contributions.

  1. We design and implement HipHarness, a customizable framework for automatically exploring history-based prefetching across a wide range of scenarios.

  2. We conduct the first extensive study of history-based prefetching on mobile platforms, providing empirical evidence on the feasibility of small prediction models, and in turn, opening up a new research direction.

  3. We demonstrate that existing prediction algorithms are promising starting points for developing prefetching solutions, and provide insights on their further optimization.

  4. We develop concrete strategies for bounding the training-data sizes without sacrificing the models’ accuracy.

  5. We provide our artifacts to foster future studies in this area, including HipHarness’s source code, the evaluation results of over 7 million prediction models we built, and the scripts for analyzing the results [41].

Section II describes the related work and the foundation of our study. Section III introduces our framework, HipHarness. Section IV overviews our empirical study enabled by HipHarness. Section V presents our results and lessons learned. Sections VI and VII discuss the broader implications of our findings and the study’s validity threats. Section VIII concludes the paper.

II Related Work and Foundation

This section discusses the related work and background of history-based prefetching, followed by the rationale behind the selection and evaluation of the algorithms used in our study.

II-A Related Work

Prefetching has yielded a large body of work since the birth of the Internet [16, 1, 17, 2, 3, 4, 5, 6, 7, 8, 9, 10, 18, 19, 11, 20], and has been shown promising on mobile platforms recently [13]. Related work includes identifying performance bottlenecks [42, 43, 44], balancing trade-offs between prefetching benefits and waste [45, 46], and content-based techniques  [7, 6, 14, 5] as discussed earlier.

History-based prefetching has remained largely unexplored on mobile platforms. Prior work has mainly focused on mobile browsers [47, 48, 49, 50, 51, 32], while missing mobile apps where users spend over 80% of their time [52]. The few techniques targeting mobile apps are restricted to specific domains, such as social media [53], video streaming [54, 55], and ads [31]. We believe this restriction is caused by the already mentioned challenge of accessing user data, limiting history-based prefetching to specific applications and/or domains with public data (e.g., Twitter [53]). To mitigate this challenge, our work explores a novel strategy of assessing history-based prefetching across different domains, by relying on small amounts of user data. The closest work to ours is the already discussed evaluation of one history-based prediction algorithm (MP [40]) using data collected during one year from 25 iPhone users [8]. By contrast, our conclusions are based on four algorithms, including MP, and a much larger user base but over a much shorter time period.

II-B Prefetching Workflow

History-based prefetching consists of three phases: training the prediction model based on the historical data; predicting future requests based on the trained model; and prefetching requests based on the prediction results and runtime conditions.

In the training phase, a prediction model is trained with past requests to capture their relationship based on a prediction algorithm (e.g., the probability of making a given request next). One prediction algorithm can be used to produce multiple prediction models by taking as input different training data, such as different numbers of past requests. In the predicting phase, the trained model is used to predict future requests based on the current context (e.g., the current request) and the prediction algorithm. Finally, in the prefetching phase, certain predicted requests are prefetched based on runtime conditions, such as battery life and cellular-data usage [45].

II-C Prediction Algorithms

Our study’s goal is to assess the applicability of existing prediction algorithms that are most likely candidates for effective use on mobile platforms. The algorithm selection is guided by our focus on small models. We thus exclude algorithms that rely on large amounts of training data, such as those using neural networks (e.g., LSTM [56]). We specifically include the MP algorithm [40] since it is used in the closest related study [8]. Finally, we select two additional algorithms—DG [10] and PPM [9]—that are widely used to predict web requests [57, 58, 59, 60, 61, 62, 63]. Note that there are multiple variations of the three selected algorithms [57, 58, 64, 59, 60, 61, 65, 66, 11]; a detailed summary can be found in a survey [12]. The variations in question target specific characteristics of traditional browsers, such as web-page structure [64], and do not carry over to mobile platforms. We thus rely on the originally defined algorithms.

Most-Popular (MP) [40] maintains a list of the most-popular subsequent requests for each request in the training set. In the training phase, MP adds next requests within a specified window to the current request’s list and stores their occurrences. In the predicting phase, MP predicts the most-popular requests.
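To make MP’s training and prediction steps concrete, the following is a minimal Python sketch; the window size, the number of predictions returned, and all names are illustrative assumptions, not the original implementation.

from collections import defaultdict

class MostPopular:
    """Minimal Most-Popular (MP) sketch; window and top_n are assumed parameters."""
    def __init__(self, window=2, top_n=3):
        self.window = window
        self.top_n = top_n
        # request -> {subsequent request within the window: occurrence count}
        self.popularity = defaultdict(lambda: defaultdict(int))

    def train(self, requests):
        for i, req in enumerate(requests):
            # record every request that follows req within the window
            for nxt in requests[i + 1:i + 1 + self.window]:
                self.popularity[req][nxt] += 1

    def predict(self, current_req):
        # return the most popular subsequent requests for current_req
        ranked = sorted(self.popularity[current_req].items(),
                        key=lambda kv: kv[1], reverse=True)
        return [url for url, _ in ranked[:self.top_n]]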

Dependency Graph (DG) [10] trains a directed graph indicating the dependencies among the requests, where nodes represent requests and arcs indicate that a target node is visited after the original node within a window. Each arc has a weight that represents the probability that the target node will be visited next. In the predicting phase, DG predicts future requests based on their probabilities stored in the dependency graph.
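A minimal sketch of DG’s graph construction and probability-based prediction follows; the window size and probability threshold are assumptions, chosen only to illustrate how the arc weights are computed.

from collections import defaultdict

class DependencyGraph:
    """Minimal Dependency-Graph (DG) sketch; window and threshold are assumed."""
    def __init__(self, window=2, threshold=0.1):
        self.window = window
        self.threshold = threshold                                # minimum arc weight to predict
        self.node_counts = defaultdict(int)                       # node -> visit count
        self.arc_counts = defaultdict(lambda: defaultdict(int))   # node -> {target: count}

    def train(self, requests):
        for i, req in enumerate(requests):
            self.node_counts[req] += 1
            # an arc req -> nxt means nxt was visited after req within the window
            for nxt in requests[i + 1:i + 1 + self.window]:
                self.arc_counts[req][nxt] += 1

    def predict(self, current_req):
        visits = self.node_counts.get(current_req, 0)
        if visits == 0:
            return []
        # arc weight = probability that the target is visited next
        return [target for target, count in self.arc_counts[current_req].items()
                if count / visits >= self.threshold]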

Prediction-by-Partial-Match (PPM) [9] uses a high-order Markov model that is context-sensitive since it considers the order of requests. In the training phase, it builds a trie structure [67] that indicates the immediate-followed-by relationship of the past requests. In the predicting phase, it predicts future requests whose parents match the previous N requests.
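The sketch below approximates PPM’s context-sensitive prediction; for brevity, the trie is flattened into a dictionary keyed by request contexts, and the Markov order and ranking are our own assumptions.

from collections import defaultdict

class PPM:
    """Minimal order-k Prediction-by-Partial-Match sketch (trie flattened into a dict)."""
    def __init__(self, order=2):
        self.order = order
        # context (tuple of preceding requests) -> {next request: count}
        self.successors = defaultdict(lambda: defaultdict(int))

    def train(self, requests):
        for i, req in enumerate(requests):
            # register req as the successor of every context of length 1..order ending at i-1
            for k in range(1, self.order + 1):
                if i - k < 0:
                    break
                context = tuple(requests[i - k:i])
                self.successors[context][req] += 1

    def predict(self, recent, top_n=3):
        # prefer the longest context matching the most recent requests
        for k in range(min(self.order, len(recent)), 0, -1):
            context = tuple(recent[-k:])
            if context in self.successors:
                ranked = sorted(self.successors[context].items(),
                                key=lambda kv: kv[1], reverse=True)
                return [url for url, _ in ranked[:top_n]]
        return []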

II-D Study Focus

We focus on the accuracy of prediction models built in the first two phases of the workflow, i.e., training and predicting. The third, prefetching phase involves trading off runtime conditions, studied by a complementary body of research [45, 46]. We exclude such runtime factors since they would taint the measured accuracy of the models (e.g., failing to prefetch requests due to a third-party server’s errors). Moreover, the expiration of the prefetched requests may vary depending on when the experiments are conducted, which would introduce additional bias into our results. Currently, determining whether a request has expired remains an open challenge as the HTTP headers [68] (e.g., Cache-Control, Expires) are not trustworthy [13]. Thus, to eliminate runtime variations and fairly assess the models’ accuracy, we assume ideal runtime conditions: each predicted request will be prefetched and will not expire.

As mentioned earlier, our prediction models are tailored to individual users based on their past behaviors. This is motivated by today’s stringent regulations limiting personal data access. This has shifted recent research to “on-device” prediction, to avoid exporting sensitive user data to servers [69, 70, 71]. At the same time, this trend of client-side prediction highlights the need for small prediction models since mobile devices are resource-constrained. We thus also assess the resource consumption of different prediction models trained on-device.

III The HipHarness Framework

This section describes the design of HipHarness, a tailorable framework for automatically assessing history-based prediction models, followed by the details of its instantiation.

III-A HipHarness’s Design

Figure 1 depicts HipHarness’s workflow, comprising six customizable components that can be reused, extended, or replaced as needed. For instance, one can explore how much training data to use with a given prediction algorithm by varying Training Selection and reusing the remaining components.

Figure 1: HipHarness’s workflow for assessing history-based prediction models

HipHarness takes Historical Requests as input. Each component (shaded box) is either reused from an existing implementation or provided anew if unavailable. Historical Requests are past requests used to generate the intermediate outputs of Training Requests, Trigger Requests, and Test Requests based on the logic defined in Training Selection, Trigger Selection, and Test Selection, respectively. Training Requests are used to build the Prediction Model. Trigger Requests trigger the prediction of subsequent requests; for example, the current request can serve as a trigger. Finally, Test Requests are the future requests used to evaluate the Prediction Model; they represent the “ground truth”.

The three intermediate outputs are used to make the predictions and produce the final Test Results that evaluate the Prediction Model. The Training Engine implements the algorithm used to train a specific Prediction Model, such as the dependency graph in DG [10], based on the selected Training Requests. Optionally, the trained Prediction Model can be updated dynamically (dashed arrow) while being evaluated: Test Requests become historical requests after being tested, and can be used to train the Prediction Model. Prediction Engine implements the algorithm that predicts subsequent requests based on Trigger Requests and the trained Prediction Model. Finally, when evaluating the Prediction Model, Test Engine iteratively invokes Prediction Engine with a series of Test Requests. Prediction Engine selects N requests that immediately precede each Test Request to obtain a set of Predicted Requests. Test Engine then compares the Predicted Requests with the corresponding Test Requests (i.e., “ground truth”) and outputs the Test Results that contain the information needed to calculate the evaluation metrics of interest (see Section IV-C).

HipHarness supports “plug and play” by tuning and/or replacing each of the components to explore different research questions. The Training Selection, Trigger Selection, and Test Selection components can be customized based on different selection strategies. For example, different parts of Historical Requests can be chosen based on desired ratios or time periods, to produce the corresponding Training Requests, Trigger Requests, and Test Requests. Likewise, Training Engine and Prediction Engine can be tailored with specific algorithms to train the models based on the selected Training Requests. Finally, Test Engine can be customized to evaluate different prediction models with various testing strategies. For instance, as discussed earlier, Test Engine may enable updating the prediction model dynamically; by plugging in Test Engines that implement different dynamic-update strategies, HipHarness can isolate the impact of those strategies by producing their side-by-side Test Results.
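As an illustration of this “plug and play” design, the following Python sketch wires the six components together; the function signatures are our own assumptions and do not reflect HipHarness’s actual interfaces.

from dataclasses import dataclass
from typing import Callable, List, Sequence

Selector = Callable[[Sequence[str]], List[str]]   # Historical Requests -> selected subset

@dataclass
class HipHarnessRun:
    training_selection: Selector
    trigger_selection: Selector
    test_selection: Selector
    training_engine: Callable[[List[str]], object]           # Training Requests -> Prediction Model
    prediction_engine: Callable[[object, str], List[str]]    # (model, trigger) -> predicted requests
    test_engine: Callable[[object, Callable, List[str], List[str]], dict]  # -> Test Results

    def run(self, historical_requests: Sequence[str]) -> dict:
        # each component can be swapped independently to explore a new research question
        training = self.training_selection(historical_requests)
        triggers = self.trigger_selection(historical_requests)
        tests = self.test_selection(historical_requests)
        model = self.training_engine(training)
        return self.test_engine(model, self.prediction_engine, triggers, tests)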

III-B HipHarness’s Instantiation

We have instantiated HipHarness by implementing several variations of its six components. Instances of Training Selection and Test Selection are implemented to select requests based on different ratios, such as using the first 80% of the Historical Requests as Training Requests and the remaining 20% as Test Requests. Trigger Selection is implemented to select one current request to trigger the prediction of subsequent requests. Four pairs of Training Engine and Prediction Engine are implemented based on the three algorithms introduced in Section II—DG [10], PPM [9], and MP [40]—and a fourth baseline algorithm we will describe in Section V. Finally, Test Engine’s implementation is detailed in Algorithm 1.

Input: PredictionEngine PE, TestRequests test_reqs
Output: cache.Size, hit_set, miss_set, #prefetch, #hit, #miss

1   cache = ∅, hit_set = ∅, miss_set = ∅
2   #prefetch = 0, #hit = 0, #miss = 0
3   foreach current_req ∈ test_reqs do
4       predicted_reqs ← PE.Predict(current_req.pre)
5       foreach candidate ∈ predicted_reqs do
6           if ¬IsCached(candidate) then
7               Prefetch(candidate)
8               cache.Put(candidate)
9               #prefetch ← #prefetch + 1
10      if IsCached(current_req) then
11          #hit ← #hit + 1
12          hit_set.Put(current_req)
13      else
14          #miss ← #miss + 1
15          miss_set.Put(current_req)
16      PE.update_Model(current_req)

Algorithm 1: Test Engine

Test Engine’s objective is to output information needed for evaluating Prediction Engine’s (PE) results. For each request current_req in Test Requests test_reqs, PE predicts the potential current requests based on current_req’s previous request current_req.pre (Line 4) and prefetches the predicted requests that are not already in the cache (Lines 5-9).

As discussed in Section II-D, we assume that all predicted requests should be prefetched and will not expire. We thus place the predicted requests in the cache without sending them to the server to avoid tainted results caused by spurious runtime variations. The cache size is unbounded in our study: since we focus on small amounts of historical data guided by our hypothesis (recall Section I), the required cache size is negligible compared to the available storage on mobile devices.

Test Engine then compares current_req with the cached requests to determine whether it can be reused, and updates the corresponding information (Lines 10-15). Finally, PE dynamically updates the Prediction Model by adding current_req to train the model (Line 16) since current_req becomes a past request as Test Engine moves to the next request in test_reqs.

In the end, Test Engine outputs the size of the cache (cache.Size); unique prefetched requests that were subsequently used (hit_set); unique requests in Test Requests that were not in the cache when requested (miss_set); the total number of prefetched requests (#prefetch); and the numbers of times a request was in the cache (#hit) vs. not in the cache (#miss) when requested. We use this information to define metrics that evaluate the accuracy of prediction models in Section IV.
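For readers who prefer an executable form, the following Python rendition of Algorithm 1 works under the same idealized assumptions (unbounded cache, no expiration); the prediction-engine object with predict and update_model methods, and the use of the preceding request as the trigger, are illustrative assumptions rather than HipHarness’s actual API.

def run_test_engine(pe, test_reqs, trigger=None):
    """Sketch of Algorithm 1; pe is assumed to expose predict() and update_model()."""
    cache, hit_set, miss_set = set(), set(), set()
    num_prefetch = num_hit = num_miss = 0
    prev = trigger                        # request immediately preceding the first test request
    for current_req in test_reqs:
        predicted = pe.predict(prev) if prev is not None else []
        for candidate in predicted:
            if candidate not in cache:    # "prefetch" by caching, without touching the network
                cache.add(candidate)
                num_prefetch += 1
        if current_req in cache:
            num_hit += 1
            hit_set.add(current_req)
        else:
            num_miss += 1
            miss_set.add(current_req)
        pe.update_model(current_req)      # current_req is now a past request
        prev = current_req
    return len(cache), hit_set, miss_set, num_prefetch, num_hit, num_miss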

IV Empirical Study Overview

This section provides the details of our empirical study enabled by HipHarness, including the research questions we focus on, the data collected, and the evaluation metrics used.

IV-A Research Questions

An overarching hypothesis frames our study: History-based prefetching may work effectively with small prediction models trained on mobile-user requests collected during short time periods. To evaluate it, we focus on three research questions.

  • RQ1 – To what extent are mobile users’ requests repetitive during short time periods?

  • RQ2 – How effective are the existing prediction algorithms when applied on mobile platforms using small prediction models?

  • RQ3 – Can the training-data size be reduced without significantly sacrificing the prediction models’ accuracy?

RQ1 – Repetitiveness of User Requests

Since history-based techniques can only predict requests that have appeared in the past, our study is centered on repeated requests [32], i.e., identical requests issued by a user. Requests are considered identical when they have the same URLs, including GET parameters [72]. We assume the requests will not expire by design as discussed in Section II-D. Specifically, we explore the extent to which mobile users send repeated requests during a day, aiming to understand the “ceiling” that can be achieved by any history-based techniques with small models. Note that prior work has only reported the repetitiveness of coarse-grained behaviors (e.g., phone calls [38], mobility patterns [73, 74, 75]).
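As a concrete illustration of how per-user repetitiveness can be measured, the sketch below treats every occurrence of a URL after its first as a repeated request; the exact counting used in our study may differ in details.

from collections import Counter

def repeated_request_stats(user_requests):
    """Count repeated requests (identical URLs, including GET parameters) for one user."""
    counts = Counter(user_requests)
    repeated = sum(c - 1 for c in counts.values() if c > 1)   # re-occurrences only
    total = len(user_requests)
    return repeated, (repeated / total if total else 0.0)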

RQ2 – Effectiveness of Prediction Algorithms

In the closest study, Wang et al. [8] found history-based algorithms ineffective. However, their study was conducted a decade ago, had a limited scope, and relied on the conventional, large prediction models. We re-assess their conclusions using small models, while extending the study’s scope in three dimensions. First, Wang et al. analyzed 25 iPhone users who were undergraduates; we rely on 11,476 users with different types of mobile devices and diverse occupations. Second, their dataset only included requests collected from the Safari browser; we rely on both mobile-browser and app data to more comprehensively cover mobile-user behaviors. Finally, only one history-based algorithm (MP [40]) was previously investigated; our conclusions are based on the three most widely-employed algorithms, including MP, and an additional baseline algorithm.

RQ3 – Reducing Training-Data Size

We explore whether the amount of training data can be reduced without sacrificing accuracy. Guided by the Pareto principle [76], we posit that the top 20% of the training data is likely to account for 80% of the results, and explore three strategies to select the training data: (1) Most Occurring Requests – Given a request sequence in the training set, we group repeated requests, and use the requests in the largest 20% of the groups to train the models. (2) Most Accessed Domains – We group requests that belong to the same domain (e.g., google.com/*), and train the models with the requests in the largest 20% of the groups. (3) Most Suitable Domains – Domains that tend to contain repeated requests are potentially good candidates for prefetching. Thus, we again group the requests in the same domain together. This time, we rank the groups by the proportion of repeated requests, and use the top 20% of the groups to train the prediction models.
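A minimal sketch of the three pruning strategies follows; grouping by URL/domain and the 20% cut are taken from the description above, while the ranking details and helper names are illustrative assumptions.

from collections import defaultdict
from urllib.parse import urlparse

def prune_training_data(training_reqs, strategy, keep_ratio=0.2):
    """Keep only the requests in the top keep_ratio of groups, per the chosen strategy."""
    groups = defaultdict(list)
    if strategy == "MOR":                      # group identical requests
        for r in training_reqs:
            groups[r].append(r)
        rank = lambda g: len(g)                # largest groups = most occurring requests
    elif strategy == "MAD":                    # group requests by domain
        for r in training_reqs:
            groups[urlparse(r).netloc].append(r)
        rank = lambda g: len(g)                # largest groups = most accessed domains
    elif strategy == "MSD":                    # rank domains by proportion of repeated requests
        for r in training_reqs:
            groups[urlparse(r).netloc].append(r)
        rank = lambda g: (len(g) - len(set(g))) / len(g)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    top = sorted(groups.values(), key=rank, reverse=True)
    top = top[:max(1, int(len(top) * keep_ratio))]
    kept = {r for group in top for r in group}
    return [r for r in training_reqs if r in kept]   # preserve the original request order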

We further investigate whether there is a lower-bound on the training-data size that yields the smallest model without sacrificing the resulting accuracy. To that end, we designed the Sliding-Window approach (see Section V) to study the impact of different numbers of requests in the training set on the prediction accuracy, and to identify the lower-bound.

IV-B Data Collection

Data Collection Process. The network traces were collected at the gateway between the Beijing University of Posts and Telecommunications campus network and the Internet. The measurement servers were placed at the campus gateway router, connected via optical splitters to the Gigabit access links. The HTTP headers were captured by the servers along with the timestamps. The authentication information (User ID) identifies the traffic from the same user.

Ethical Considerations. Our study was approved by the university, and was guided by the agreement the authors signed. The study strictly followed the Research Data Management Policy of the university, including data storage, sharing, and disposal. All collected raw data was processed by the university’s network center. All recorded IP addresses and authentication information were anonymized. The authors did not have access to any of the raw data at any time. Unlike prior work that selected 25 users [8, 29], our data includes all users who accessed the campus network without any selection.

Dataset Overview. Our study aims to assess small models trained on mobile-user requests collected during much shorter time periods compared to the conventional weekly or monthly models [8, 17, 2, 77, 1, 16]. To that end, we were given access to the network traffic collected by the university, spanning the 24 hours of May 28, 2018 (a randomly selected date). This included the traffic from nearly 11,500 accounts, representing all users who accessed the campus network via a mobile device during that time: students, faculty, staff, contract employees, residents, outside vendors, and visitors. (PCs and mobile devices use different clients for authentication; we only collected the mobile-network traffic based on the authentication information.) These users exhibit various behavior patterns and form a diverse group for our study. We further filtered the mobile-network traffic to include only the requests involving the HTTP GET method [78]: GET requests are considered “safe” for prefetching in that they do not have any side-effects on the server [72, 79, 13]. Ultimately, we collected 15,143,757 GET requests from 11,476 users. Each request is identified by its URL, including GET parameters if any.

IV-C Evaluation Metrics

As discussed in Section II-D, we focus on the accuracy of the prediction models. We thus leverage two widely adopted accuracy metrics in the prefetching literature, and introduce a new accuracy metric. The metrics are computed based on the information output by our Test Engine (recall Algorithm 1).

Static Precision measures the percentage of correctly predicted unique requests, i.e., the ratio of prefetched requests that are subsequently used to the number of all prefetched requests [57]. This metric is often referred to as precision or hit ratio in the literature [40, 18, 80, 8, 9, 81, 57].

Static Precision = |hit_set| / #prefetch

Static Recall is a new metric we introduce as the counterpart to Static Precision. Static Recall measures the ratio of unique requests that were previously prefetched (and have been cached) to the total number of unique requests made by a user.

Static Recall = |hit_set| / (|hit_set| + |miss_set|)

Dynamic Recall measures the ratio of previously-prefetched requests to all requests a user issues [57]. This metric is often referred to as recall or usefulness in the prefetching literature [40, 18, 80, 8, 9, 81, 57].

Dynamic Recall = #hit / (#hit + #miss)

We do not use a dynamic counterpart to Static Precision because it is not meaningful. To measure the ratio of correctly predicted requests dynamically, this metric needs to reward the predicted requests (hits) each time they are accessed, and penalize “useless” requests that were prefetched but never used. However, the reward for hits would be potentially unbounded. Moreover, the metric would allow cases in which a model with a low Static Precision has an artificially high Dynamic Precision: repeated hits of the same request would progressively diminish the penalty for arbitrarily many useless requests.
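The three metrics translate directly into code from Test Engine’s outputs; the sketch below is a plain transcription of the formulas above, with zero-denominator guards added as an assumption.

def accuracy_metrics(hit_set, miss_set, num_prefetch, num_hit, num_miss):
    """Compute Static Precision, Static Recall, and Dynamic Recall."""
    static_precision = len(hit_set) / num_prefetch if num_prefetch else 0.0
    unique_total = len(hit_set) + len(miss_set)
    static_recall = len(hit_set) / unique_total if unique_total else 0.0
    dynamic_total = num_hit + num_miss
    dynamic_recall = num_hit / dynamic_total if dynamic_total else 0.0
    return static_precision, static_recall, dynamic_recall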

V Results and Lessons Learned

This section presents the results of our empirical study performed using HipHarness, and its takeaways.

As mentioned earlier, our study is based on over 15 million HTTP GET requests, reflecting the mobile traffic collected from 11,476 users at a large university. Each user sent 1,320 requests on average during the single day. This is over 100× more than what was reported a decade ago, when each user sent 4,665 requests over an entire year [8, 29], reinforcing the impracticality of traditional large models for mobile platforms. For instance, the strategy of building monthly models [8] would encompass ≈40,000 requests if applied to our dataset. Starting with this initial user set, we filtered out outlier users, identified using the box-and-whisker plot method [82] based on the number of requests they send, since these users do not represent typical mobile-device usage. This process resulted in the final dataset of 9,900 subject users, as shown in Table I.

TABLE I: Requests per user
             Initial 11,476 Users        Final 9,900 Users
             Min     Avg     Max         Min     Avg     Max
             1       1,320   235,837     10      981     3,923
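For concreteness, the following sketch shows one way the box-and-whisker filtering could be applied to per-user request counts; the 1.5×IQR fences and quartile method are standard choices we assume here, not details reported in the paper.

import statistics

def filter_outlier_users(requests_per_user):
    """Drop users whose request counts fall outside the box-and-whisker (IQR) fences."""
    counts = sorted(requests_per_user.values())
    q1, _, q3 = statistics.quantiles(counts, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {user: c for user, c in requests_per_user.items() if lo <= c <= hi}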

V-A RQ1 – User-Request Repetitiveness

  • RQ1 – To what extent are mobile users’ requests repetitive during short time periods?

TABLE II: Repeated requests across users
             Percentage                     Number
             Min    Avg    Max    SD        Min    Avg    Max     SD
             0%     28%    98%    17%       0      293    3,225   353

To answer RQ1, we calculate both the number and percentage of repeated requests in each user’s data, as they indicate different aspects of predictability. The percentage provides insights about the potential cost: a low percentage indicates a high proportion of non-predictable requests, increasing the cost of building the model. The number sets the upper-bound on requests that can be prefetched.

Table II shows this data across the 9,900 subject users. The minimums indicate that certain users did not access the same URL more than once. 222 (2%) of our users fall in this category. Further investigation uncovered that these users do not access the network frequently enough to show repetitive behaviors: the average number of the network requests sent by these users is 32, i.e., ≈1 request every 80 minutes during the 24 hours.

On average, 28% of the requests are repeated. An average user sends 293 repeated requests, which can be prefetched and reused from a cache. The maximum percentage (98%) and number (3,225) show that history-based prefetching can especially benefit certain users. We also find a large variation across different users (see standard deviation in Table II). This indicates that individual users exhibit markedly different behaviors, which reinforces our choice of building personalized prediction models for each user on the client side (recall Section II-D).

To provide further insights for future techniques, we study the characteristics of frequently repeated requests by computing how many times each repeated request occurs per user. Due to space constraints, we highlight two findings. (1) The average numbers of certain users’ repeated requests are unusually high, with the maximum of 779. Further investigation showed that these users tended to send large numbers of requests to obtain the same WiFi configuration files from a specific domain. Since the data available to us was sanitized, we can only hypothesize that this was due to server-side flaws. (2) The maximum occurrence of a repeated request across all users was 2,634. This and other high values corresponded to continually obtaining information from certain servers based on the given users’ unchanging location coordinates, as indicated in GET requests’ parameters. Such domains may also have flaws that require resending the user location even when it remains the same. Both instances clearly point to opportunities for caching.

V-B RQ2 – Prediction-Algorithm Effectiveness

  • RQ2 – How effective are the existing prediction algorithms when applied on mobile platforms using small prediction models?

RQ1’s results indicate that our 24-hour dataset is amenable to prefetching, motivating us to apply the existing prediction algorithms on it. The rationale behind our algorithm selection was discussed in Section II-C. We further develop an additional algorithm (Naïve) to serve as the evaluation baseline. Naïve assumes that each request that appeared in the past will appear again, and caches each such request. Naïve thus guarantees the upper-bound on the number of predictable requests, and provides a baseline for measuring any other prediction algorithm’s recall.

To answer RQ2, we use HipHarness to obtain the Test Results (recall Figure 1) of the prediction models built for each user. As discussed in Section III-B, we implemented four pairs of Training Engine and Prediction Engine in HipHarness, based on the three existing algorithms discussed in Section II-C and Naïve. In the end, Test Engine outputs the Test Results needed for evaluating the models’ accuracy (recall Algorithm 1), using the three accuracy metrics discussed in Section IV-C.

For all four algorithms, we follow the common approach of selecting the first 80% of Historical Requests as Training Requests, and the remaining 20% as Test Requests [83]. The thresholds of each prediction algorithm are set to the same values used in the original techniques. Recall that our goal is to show the feasibility of small models on mobile platforms. In turn, this will enable fine-grained customization of the models to improve accuracy, e.g., tuning the thresholds, pre-processing the training data, and considering additional context.

Due to the complexity induced by PPM’s context-sensitivity (recall Section II), it was unable to output results when using our entire dataset. To enable a fair comparison among the algorithms, we thus excluded certain domains that displayed low percentages of repeated requests to reduce the dataset. We explored multiple cut-off points for the percentage of repeated requests below which a domain is excluded, and found that a 10% cut-off was sufficient to enable PPM. This eliminated certain users since all of their requests were excluded, resulting in 9,751 users and 39,004 corresponding models built with the four algorithms. Interestingly, this process uncovered a potential correlation between the cut-off point’s size and the increase in the models’ accuracy. This suggests a smaller, better-tailored training set can yield even more accurate models, directly motivating RQ3.

V-B1 Accuracy of Prediction Models

Figure 2 shows the average accuracy results of the 39,004 models. Overall, DG outperforms PPM and MP along all three measures, while PPM and MP trade off precision and recall.

As discussed above, Naïve is our baseline and it achieves 100% recall. On the other hand, its precision is poor since it aggressively prefetches every request that has appeared in the past. A trade-off between precision and recall should clearly be considered when deciding on a prefetching strategy, based on the specific scenario. For instance, if the cache is sufficiently large, Naïve can be used to maximize recall.

Figure 2: Average values of the three accuracy metrics across the four algorithms

DG’s Static Precision is comparable to the results of traditional large models from the browser domain, where a model is considered to perform well with a precision between 40% and 50% [17, 40, 80, 81]. This shows DG’s potential on mobile platforms since it achieved comparable precision using much smaller models. On the other hand, MP achieved poorer Static Precision than DG and PPM. This was counter-intuitive since MP only uses the most popular requests, which should have yielded smaller model sizes and smaller denominators in the Static Precision formula (recall Section IV-C). However, our results suggest that the most popular requests stop reappearing at some point, indicating the need to optimize MP by tuning what training data to use and for how long.

The Static Recall and Dynamic Recall follow similar trends, meaning that the numbers of hit and miss requests do not significantly affect the usefulness of the prediction models. All three existing algorithms showed improvements in Dynamic Recall compared to Static Recall, indicating that the hit requests they predicted are usually accessed multiple times. This directly motivates future work on designing effective caching strategies.

V-B2 Efficiency of Prediction Models

Due to our study’s scale, we needed to train over 7 million models (detailed in Section V-C), which required the use of a powerful computing environment. Our experiments were run in parallel on a server with 32 2.60GHz Intel Xeon E5-2650 v2 CPUs and 125GB RAM. The average running times of DG, PPM, MP, and Naïve were 42ms, 431ms, 4ms, and 9ms per model, respectively. Note that PPM is at least an order-of-magnitude slower than other algorithms. This confirmed our earlier observation regarding the scalability issues introduced by PPM’s context-sensitivity.

Figure 3: Resource consumption of the four algorithms with ten sets of different-sized models trained on a mobile device

To get further insights into the models’ efficiency, we trained models of 10 different sizes from ≈1,000 to ≈10,000 requests, and measured their resource consumption on a mobile device (Honor Play 3 running Android 9.0). The sizes were selected empirically, as the resource consumption below 1,000 requests is negligible and training 10,000 requests is already expensive. For each size, we selected 10 users whose numbers of requests are the closest to the size and built models with all four algorithms, yielding the results of 400 prediction models.

Figure 3 presents the trends (at logarithmic scale) of the models’ energy consumption (in milliampere-hours) and runtimes (in seconds) as the number of requests increases. The energy consumption is measured by the built-in system tool. Each data point represents the average resource consumption when training the model of a specific size 10 times (the need to train a model multiple times is discussed in Section V-C). Our results show that DG, MP, and Naïve are practical since the largest resource consumption among them is 36mAh (Naïve) and 81s (DG) when training with ≈10,000 requests; for reference, modern mobile devices’ battery capacity is ≈2,000–5,000mAh [84, 85]. MP is by far the most efficient algorithm: it consumes <3mAh and <10s in the worst case. By contrast, PPM is not practical except with the smallest models: it begins to surpass the other algorithms’ worst-case when trained with ≈2,000 requests, while its own worst-case consumption is prohibitive at 1,335mAh and 4,740s.

Recall from Table I that users from our initial dataset average 1,320 requests. At first blush, this may suggest that the above analysis was unnecessary and that all four algorithms should be efficient enough. However, given the wide variations across individual behaviors, many users would likely benefit from larger models. Thus, our analysis can provide insights for future work that targets models of varying sizes for specific user groups.

V-C RQ3 – Training-Data Size Reduction

  • RQ3 – Can the training-data size be reduced without significantly sacrificing the prediction models’ accuracy?

To answer RQ3, we first investigate the three data-pruning strategies introduced in Section IV-A: Most Occurring Requests (MOR), Most Accessed Domains (MAD), and Most Suitable Domains (MSD). We then explore whether there is a lower-bound on the training-data size that yields the smallest prediction models without sacrificing accuracy.

V-C1 Data-Pruning Strategies

We apply each of the three pruning strategies to the training data of all 39,004 models studied in RQ2, and use HipHarness to evaluate the 117,012 pruned models. Table III shows the average training-data size reduction (Size Red.) and the average values of our accuracy metrics—Static Precision (SP), Static Recall (SR), and Dynamic Recall (DR)—after applying each pruning strategy across all four algorithms; since the SR and DR values are always 1 for our baseline algorithm Naïve, we omit them. In all but six cases, accuracy is improved after data pruning (shaded cells).

Figure 4 overlays the average values from Table III on top of the original results from Figure 2. Each set of three bars corresponds to the results of the given algorithm after applying MOR (left), MAD (middle), and MSD (right). For example, the leftmost three bars (in red) represent the MOR-, MAD-, and MSD-yielded values for DG’s Static Precision.

All three strategies show promising results across all accuracy metrics and algorithms while reducing the training-data sizes. The largest accuracy drop is only 0.06 (DG’s Dynamic Recall after applying MSD). By contrast, accuracy is significantly improved in most cases, with the largest boost of 0.29 (MP’s Static Precision after applying MSD). Note that the largest accuracy drop and boost are both achieved by MSD, suggesting that a pruning strategy may have highly variable impact on different metrics and/or algorithms. This is confirmed by the other two strategies: MOR and MAD produce both improvements and drops for different cases.

TABLE III: Average training-data size reduction after applying the MOR, MAD, and MSD pruning strategies, and the resulting Static Precision (SP), Static Recall (SR), and Dynamic Recall (DR)

Pruning   Size            DG                      PPM                     MP              Naïve
Strategy  Red.    SP     SR     DR       SP     SR     DR       SP     SR     DR       SP
MOR       54%     0.453  0.368  0.386    0.468  0.128  0.173    0.452  0.404  0.368    0.212
MAD       27%     0.391  0.390  0.437    0.302  0.097  0.150    0.212  0.302  0.325    0.069
MSD       62%     0.484  0.379  0.376    0.432  0.174  0.194    0.453  0.443  0.409    0.194
Figure 4: Average accuracy values across the DG, PPM, MP, and Naïve algorithms after applying MOR (left), MAD (middle), and MSD (right), overlayed on top of the original accuracy values from Figure 2

MOR (left) and MSD (right) outperform MAD (middle) in most cases, while achieving significantly larger size reductions (Table III). However, it would be inappropriate to select the “best strategy” a priori. For instance, if one aims to maximize DG’s Dynamic Recall, MAD may in fact be the best pruning strategy. If reducing the data size (i.e., resource consumption) is critical, then MSD should be selected, with DG or MP as candidate choices of algorithm. Our results provide empirical data to guide future techniques on selecting suitable pruning strategies and prediction algorithms.

Among the four algorithms, DG still achieves competitive accuracy, but it benefits the least from data pruning. A possible reason is that DG tracks dependencies among requests when building the model and pruning may cause the loss of dependency information. By contrast, MP’s accuracy is markedly improved by pruning in all cases. MP consistently benefits the most from the MSD strategy that also achieves the largest size reduction (recall Table III). This suggests that MSD’s grouping of requests in a domain may have aligned especially well with MP’s notion of popular requests. Finally, it is notable that after the pruning, MP achieves comparable results to DG (the best-performing algorithm identified in RQ2); MP’s Static Recall even surpasses DG’s after applying either MOR or MSD. This observation—that an algorithm may outperform its previously superior competitor if the training set is reduced—has not been made previously [8, 32, 53, 16, 11].

V-C2 Lower-Bound Identification

Figure 5: A schematic of the Sliding Window approach with window size 5, sliding distance 1, and training ratio 0.8

Motivated by the above findings, we dove deeper into the relationship between the number of requests in a training set and the accuracy of the resulting prediction models, aiming to explore the lower-bound of the training-data size that yields the smallest model without sacrificing its accuracy. To do so, we developed a Sliding Window (SW) approach for selecting a subset of training data that, unlike the pruning discussed above, is independent of the data’s contents. SW tailors HipHarness’s Training Selection and Test Selection components. It assesses models of various sizes over different time slices, accounting for various user behaviors throughout the day and ensuring the freshness of Historical Requests used to train the models. SW allowed us to expand our dataset from the 39,004 models studied in RQ2 to over 7 million models with different training-data sizes of interest.

Figure 5 illustrates SW on 10 Historical Requests. We define a Request Window (RW) as a subset of Historical Requests that is used to train and evaluate a prediction model of a certain size. Each RW has a corresponding window size that indicates the number of requests in the RW. In Figure 5, RW has a window size of 5. Given Historical Requests and a window size x, we first build a model using the first x requests, and then slide RW by a sliding distance y to build the next model. The sliding distance is adjustable. In our study, we set it to be the length of the test set, so that all Test Requests from the previous RW can be included in the training set of the next model. In Figure 5, the sliding distance is 1. We iteratively build models of the same window size until the Historical Requests no longer have sufficient requests to form a complete RW of size x.

We use SW to explore various training ratios and window sizes. For brevity, we report the results with the training ratio of 0.8 and 11 window sizes; other values for the two parameters yielded qualitatively similar results. Specifically, we use window sizes of 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, and 1000 requests. This choice of parameters resulted in 1,788,648 prediction models for individual users with each of the four prediction algorithms, placing the total number of prediction models we studied at over 7.1 million.
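The sliding-window splitting reduces to a few lines of code; the sketch below follows the description above (sliding distance equal to the test-set length), with names that are our own.

def sliding_window_splits(historical_requests, window_size, train_ratio=0.8):
    """Yield (training, test) request splits for every Request Window of the given size."""
    train_len = int(window_size * train_ratio)
    test_len = window_size - train_len          # sliding distance = length of the test set
    start = 0
    while start + window_size <= len(historical_requests):
        window = historical_requests[start:start + window_size]
        yield window[:train_len], window[train_len:]
        start += test_len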

Figure 6: The averages for Static Precision (top), Static Recall (middle), and Dynamic Recall (bottom). Y-axes capture the metrics’ values; X-axes indicate the different data points corresponding to window sizes

To more closely track the trends of different accuracy metrics as the window sizes increase, we grouped the models of each window size and calculated the mean accuracy values in each group. Figure 6 shows the trends of the three metrics across the four algorithms. Notably, all trends converge at a certain window size. Furthermore, every trend is monotonic, suggesting that there may exist a cut-off point after which including more training data will not affect the given model’s accuracy.

To confirm this finding with statistically-significant evidence, we conducted pairwise-comparison analysis using the ANOVA post-hoc test based on the Games-Howell method [86] since it does not require groups of equal sample size. This test analyzes whether each pair of groups’ means has a statistically-significant difference. In our case, 11 window-size groups yield the results of 55 possible pairs for each of the three metrics calculated based on a given algorithm. For instance, the 55 pairwise comparison results of DG’s Static Precision show that there is a statistically-significant difference among the pairs with window sizes ≤ 400, but not for sizes > 400. Therefore, DG’s Static Precision converges at window size 400. This is consistent with DG’s plot in Figure 6’s top diagram.

TABLE IV: Pairwise comparison result summary
Algorithm Metric Cut-off Point Trend
DG Static Precision 400 Positive
DG Static Recall 500 Positive
DG Dynamic Recall 800 Positive
PPM Static Precision 300 Positive
PPM Static Recall 500 Positive
PPM Dynamic Recall 800 Positive
MP Static Precision 400 Positive
MP Static Recall 800 Negative
MP Dynamic Recall 600 Negative
Naïve Static Precision 500 Negative

Table IV summarizes all the pairwise comparisons, including the cut-off points and trends. The trends refer to the directionality of the relationship between the means and window sizes: a trend is positive if the mean is higher for larger window sizes, and negative otherwise. For cases with positive trends, a cut-off point corresponds to the amount of training data that yielded the highest accuracy; adding more data beyond this point did not improve the model’s predictive power. In the three cases with negative trends, the amount of data that yielded the highest accuracy corresponds to the smallest window size (50).

All cut-off points are lower than the largest window size (1000 requests). In fact, none of our models needed to be trained with more than 800 requests; on average, the models needed to be trained by fewer than 400 requests to achieve results comparable to those trained by up to 1000 requests. This goes against the conventional wisdom that a prediction model should invariably perform better with more training data, and directly supports our hypothesis stated above.

VI Broader Implications

To our knowledge, our work is the first to provide evidence on the feasibility of history-based prefetching on mobile platforms, using sufficiently small models. We have demonstrated the effectiveness of existing algorithms when properly configured, directly challenging the previous conclusion that history-based prefetching is ineffective on mobile platforms (avg. 16% precision and 1% recall) [8]. Our study thus motivates re-opening this research area and highlights the opportunity to revisit, and possibly improve, existing prediction algorithms. We now discuss the insights gained from our study.

Even though DG yielded the best accuracy and MP the lowest resource consumption, we argue that neither should be used without a suitable data-pruning strategy since pruning had the dual-benefit of reducing the training-data size and improving the models’ accuracy. We showed that, with an effective pruning strategy, MP can outperform DG, achieving comparable accuracy while maintaining superior resource consumption.

Note that the existing algorithms were not built with mobile users in mind, and we applied them as-is. Our results thus can be treated as the “floor” achievable by the algorithms, with a range of possible improvements based on mobile-user characteristics, such as by applying our data-pruning strategies. The MSD strategy was the standout overall, with greatest training-data size reduction (avg. 62%) and accuracy improvement (avg. 84%). At the same time, our data showed that certain users benefited more from the MOR or MAD strategies. This is consistent with the diverse user behaviors we observed, suggesting future work to categorize those behaviors and tailor existing or devise new prefetching strategies accordingly. Although this paper aims to draw general conclusions based on a large user base, our dataset contains the detailed Test Results of each individual subject user [41], providing a starting point to explore user-behavior categorization.

The identified cut-off points for different prediction models (recall Table IV) can serve as a guide for exploring suitable model sizes in future techniques. For example, to maximize the values of all three accuracy metrics in the two best-performing algorithms—DG and MP—no more than 800 requests are needed. Recalling Figure 3, this indicates that both algorithms will be able to train hundreds (DG) or even thousands (MP) of models “on-device” with negligible resource consumption. In turn, this directly facilitates training multiple models throughout the day, necessary to maintain the models’ “freshness”.

DG and MP, the best-performing history-based algorithms, achieved comparable accuracy to the state-of-the-art content-based technique PALOMA (avg. precision 0.478) [7]; other content-based techniques did not report accuracy results [14, 6]. Together with the reported limitations of content-based approaches (recall Section I), this highlights a potential advantage of history-based prefetching, as it is applicable to any app. However, our data indicates that certain users may not benefit from history-based prefetching since they tend not to send sufficient numbers of repeated requests. This suggests an opportunity to combine history-based and content-based approaches to address their respective limitations. For instance, a content-based technique can analyze the program structure to determine possible subsequent requests when the historical data is limited; on the other hand, a history-based technique can personalize the content-based technique to only prefetch the most likely requests based on an individual user’s past behaviors.

Besides the lessons learned from our study, the HipHarness framework provides a novel, reusable, and tailorable foundation for automatically exploring various aspects of history-based prefetching with any dataset of interest. Those aspects include identifying suitable training-data sizes, trading-off different metrics, exploring data-pruning strategies, assessing different prediction algorithms, fine-tuning various thresholds, and so on. We have already demonstrated how HipHarness can be used to explore some of these aspects in a flexible manner via “plugging and playing” its customizable components. In fact, HipHarness’s applicability is not limited to mobile users: it can be applied in other settings by simply replacing its input (i.e., Historical Requests) with any historical data of interest.

VII Threats to Validity

Our dataset was collected from a university’s network with 11,476 users, suggesting that our findings may not hold for all settings and types of users. However, universities have large and diverse populations, spanning students, faculty, administrative staff, contract employees, outside vendors, and visitors. The university also provides on-campus housing, dining, working, and entertainment venues, which captured user behaviors in different environments. Our user base is also over 400 times larger than the closest comparable study [8].

Our data was collected over a single day, raising the possibility that our findings may not apply to historical data collected over longer periods. However, this was done by design: we aim to investigate the feasibility of small prediction models on mobile platforms, to mitigate the challenge of obtaining large amounts of user data as discussed in Section I.

The collected data includes network traffic from both mobile apps and mobile browsers across different device types. Our results may thus fail to capture specific characteristics of mobile apps, mobile browsers, or certain devices. However, our goal is to demonstrate the feasibility of history-based prefetching on mobile platforms in general. As discussed in Section V-C, a smaller but better-tailored model may yield better results; our study thus “underapproximates” the models’ achievable accuracy. While we may have missed certain users who never connected to the university network, we had access to data from a large number of diverse users, which allowed us to obtain statistically significant evidence for our results.

We assumed that the cache size is unbounded and cached requests do not expire. This is guided by our objective to assess the accuracy of small prediction models (recall Section II-D).

VIII Conclusion and Future Work

This paper presents the first attempt to investigate the feasibility of history-based prefetching using small models on mobile platforms. We did so by developing HipHarness, a tailorable framework that enabled us to automatically assess over 7 million models built from real mobile users’ data. Our results provide empirical evidence of the feasibility of small prediction models, opening up a new avenue for improving mobile-app performance while meeting stringent privacy requirements. We further show that existing algorithms from the browser domain can produce reasonably accurate and efficient models on mobile platforms, and provide several insights on how to improve them. For example, we developed several strategies for reducing the training-data size while maintaining, or even increasing, a model’s accuracy. Finally, HipHarness’s reusability and customizability provide a flexible foundation for subsequent studies to further explore various aspects of history-based prefetching.

While this initial study focused on identifying general trends that span our large user base, tracking personalized network-usage patterns across different time periods (e.g., morning vs. night, weekday vs. weekend, work vs. vacation) is likely to result in even more accurate models. This may require access to significantly more user data, however—in volume, variety, and geographic span—than we have currently been granted. We will work on overcoming this challenge, and on the related critical issue of user privacy inherent in studies such as ours.

Acknowledgment

This work is supported by the U.S. National Science Foundation under grants 1717963, 1823354, and 2030859 (the Computing Research Association for the CIFellows Project), U.S. Office of Naval Research under grant N00014-17-1-2896, and the National Natural Science Foundation of China under grants 62072046 and 61702045.

References

  • [1] B. de la Ossa, A. Pont, J. Sahuquillo, and J. A. Gil, “Referrer graph: A low-cost web prediction algorithm,” in Proceedings of the 2010 ACM Symposium on Applied Computing.   ACM, 2010, pp. 831–838.
  • [2] Ş. Gündüz and M. T. Özsu, “A web page prediction model based on click-stream tree representation of user behavior,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2003, pp. 535–540.
  • [3] V. Ruamviboonsuk, R. Netravali, M. Uluyol, and H. V. Madhyastha, “Vroom: Accelerating the mobile web with server-aided dependency resolution,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication.   ACM, 2017, pp. 390–403.
  • [4] J. W. Mickens, J. Elson, J. Howell, and J. R. Lorch, “Crom: Faster web browsing using speculative execution.” in NSDI, vol. 10, 2010, pp. 9–9.
  • [5] X. S. Wang, A. Krishnamurthy, and D. Wetherall, “Speeding up web page loads with shandian,” in 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016, pp. 109–122.
  • [6] B. Choi, J. Kim, D. Cho, S. Kim, and D. Han, “Appx: an automated app acceleration framework for low latency mobile app,” in Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies.   ACM, 2018, pp. 27–40.
  • [7] Y. Zhao, M. S. Laser, Y. Lyu, and N. Medvidovic, “Leveraging program analysis to reduce user-perceived latency in mobile applications,” in Proceedings of the 40th International Conference on Software Engineering.   ACM, 2018, pp. 176–186.
  • [8] Z. Wang, F. X. Lin, L. Zhong, and M. Chishtie, “How far can client-only solutions go for mobile browser speed?” in Proceedings of the 21st international conference on World Wide Web.   ACM, 2012, pp. 31–40.
  • [9] T. Palpanas and A. Mendelzon, Web prefetching using partial match prediction.   Citeseer, 1998.
  • [10] V. N. Padmanabhan and J. C. Mogul, “Using predictive prefetching to improve world wide web latency,” ACM SIGCOMM Computer Communication Review, vol. 26, no. 3, pp. 22–36, 1996.
  • [11] Z. Ban, Z. Gu, and Y. Jin, “An online ppm prediction model for web prefetching,” in Proceedings of the 9th annual ACM international workshop on Web information and data management.   ACM, 2007, pp. 89–96.
  • [12] W. Ali, S. M. Shamsuddin, A. S. Ismail et al., “A survey of web caching and prefetching,” Int. J. Advance. Soft Comput. Appl, vol. 3, no. 1, pp. 18–44, 2011.
  • [13] Y. Zhao, P. Wat, M. S. Laser, and N. Medvidović, “Empirically assessing opportunities for prefetching and caching in mobile apps,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering.   ACM, 2018, pp. 554–564.
  • [14] I. Malavolta, F. Nocera, P. Lago, and M. Mongiello, “Navigation-Aware and Personalized Prefetching of Network Requests in Android Apps,” 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), pp. 17–20, May 2019.
  • [15] Y. Wang, H. Zhang, and A. Rountev, “On the unsoundness of static analysis for android guis,” in Proceedings of the 5th ACM SIGPLAN International Workshop on State Of the Art in Program Analysis.   ACM, 2016, pp. 18–23.
  • [16] B. de la Ossa, J. A. Gil, J. Sahuquillo, and A. Pont, “Web prefetch performance evaluation in a real environment,” in Proceedings of the 4th international IFIP/ACM Latin American conference on Networking.   ACM, 2007, pp. 65–73.
  • [17] B. D. Davison, “Predicting web actions from html content,” in Proceedings of the thirteenth ACM conference on Hypertext and hypermedia.   ACM, 2002, pp. 159–168.
  • [18] T. I. Ibrahim and C.-Z. Xu, “Neural nets based predictive prefetching to tolerate www latency,” in Proceedings 20th IEEE International Conference on Distributed Computing Systems.   IEEE, 2000, pp. 636–643.
  • [19] J. Pitkow and P. Pirolli, “Mining longest repeating subsequences to predict World Wide Web surfing,” in Proc. USENIX Symposium on Internet Technologies and Systems, 1999, p. 1.
  • [20] Q. Yang, T. Li, and K. Wang, “Building association-rule based sequential classifiers for web-document prediction,” Data mining and knowledge discovery, vol. 8, no. 3, pp. 253–273, 2004.
  • [21] “What’s Data Privacy Law In Your Country? - Privacy Policies,” Apr 2020. [Online]. Available: https://www.privacypolicies.com/blog/privacy-law-by-country
  • [22] E. Leu, “Data Privacy Trends Across The Globe,” Clym, Jan 2019. [Online]. Available: https://www.clym.io/articles/data-privacy-trends-across-the-globe
  • [23] Consumers International, “The state of data protection rules around the world,” 2018. [Online]. Available: https://www.consumersinternational.org/media/155133/gdpr-briefing.pdf
  • [24] PricewaterhouseCoopers, “Top Policy Trends 2020: Data privacy,” Apr 2020. [Online]. Available: https://www.pwc.com/us/en/library/risk-regulatory/strategic-policy/top-policy-trends/data-privacy.html
  • [25] Congressional Research Service, “Data protection law: An overview,” 2019. [Online]. Available: https://fas.org/sgp/crs/misc/R45631.pdf
  • [26] S. Ovide, “Just Collect Less Data, Period.” N.Y. Times, Jul 2020. [Online]. Available: https://www.nytimes.com/2020/07/15/technology/just-collect-less-data-period.html
  • [27] M. Swant, “People Are Becoming More Reluctant To Share Personal Data, Survey Reveals,” Forbes, Aug 2019. [Online]. Available: https://www.forbes.com/sites/martyswant/2019/08/15/people-are-becoming-more-reluctant-to-share-personal-data-survey-reveals/?sh=c7375c21ed15
  • [28] “How to restrict the amount of data apps collect about you,” Mar 2018, [Online; accessed 26. Jan. 2021]. [Online]. Available: https://newatlas.com/data-collection-and-privacy/53959
  • [29] C. Shepard, A. Rahmati, C. Tossell, L. Zhong, and P. Kortum, “Livelab: measuring wireless networks and smartphone users in the field,” ACM SIGMETRICS Performance Evaluation Review, vol. 38, no. 3, pp. 15–20, 2011.
  • [30] T. Yan, D. Chu, D. Ganesan, A. Kansal, and J. Liu, “Fast app launching for mobile devices using predictive user context,” in Proceedings of the 10th international conference on Mobile systems, applications, and services, 2012, pp. 113–126.
  • [31] P. Mohan, S. Nath, and O. Riva, “Prefetching mobile ads: Can advertising systems afford it?” in Proceedings of the 8th ACM European Conference on Computer Systems.   ACM, 2013, pp. 267–280.
  • [32] D. Lymberopoulos, O. Riva, K. Strauss, A. Mittal, and A. Ntoulas, “Pocketweb: instant web browsing for mobile devices,” ACM SIGPLAN Notices, vol. 47, no. 4, pp. 1–12, 2012.
  • [33] R. LiKamWa, Y. Liu, N. D. Lane, and L. Zhong, “Can your smartphone infer your mood,” in PhoneSense workshop, 2011, pp. 1–5.
  • [34] A. Rahmati, C. Tossell, C. Shepard, P. Kortum, and L. Zhong, “Exploring iphone usage: the influence of socioeconomic differences on smartphone adoption, usage and usability,” in Proceedings of the 14th international conference on Human-computer interaction with mobile devices and services, 2012, pp. 11–20.
  • [35] Y. Guo, M. Liu, and X. Chen, “Looxy: Web access optimization for mobile applications with a local proxy,” in 2017 IEEE 85th Vehicular Technology Conference (VTC Spring).   IEEE, 2017, pp. 1–5.
  • [36] O. K. Shoukry and M. B. Fayek, “Evolutionary content pre-fetching in mobile networks,” in 2013 12th International Conference on Machine Learning and Applications, vol. 1.   IEEE, 2013, pp. 386–391.
  • [37] T. D. W. Wilcockson, D. A. Ellis, and H. Shaw, “Determining Typical Smartphone Usage: What Data Do We Need?” Cyberpsychology, Behavior, and Social Networking, Jun 2018.
  • [38] I. H. Sarker, A. Colman, M. A. Kabir, and J. Han, “Individualized Time-Series Segmentation for Mining Mobile Phone User Behavior,” Comput. J., vol. 61, no. 3, pp. 349–368, Mar 2018.
  • [39] E. Bulut and B. K. Szymanski, “Understanding user behavior via mobile data analysis,” 2015 IEEE International Conference on Communication Workshop (ICCW), pp. 1563–1568, Jun 2015.
  • [40] C. Bouras, A. Konidaris, and D. Kostoulas, “Predictive prefetching on the web and its potential impact in the wide area,” World Wide Web, vol. 7, no. 2, pp. 143–179, Jun. 2004. [Online]. Available: https://doi.org/10.1023/B:WWWJ.0000017208.87570.7a
  • [41] “Hipharness’s website that contains our source code and data used in this study,” 2020. [Online]. Available: https://github.com/felicitia/HiPHarness
  • [42] J. Nejati and A. Balasubramanian, “An in-depth study of mobile browser performance,” in Proceedings of the 25th International Conference on World Wide Web.   International World Wide Web Conferences Steering Committee, 2016, pp. 1305–1315.
  • [43] X. S. Wang, A. Balasubramanian, A. Krishnamurthy, and D. Wetherall, “Demystifying page load performance with wprof,” in Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), 2013, pp. 473–485.
  • [44] J. Vesuna, C. Scott, M. Buettner, M. Piatek, A. Krishnamurthy, and S. Shenker, “Caching doesn’t improve mobile web performance (much),” in 2016 USENIX Annual Technical Conference (USENIX ATC 16), 2016, pp. 159–165.
  • [45] B. D. Higgins, J. Flinn, T. J. Giuli, B. Noble, C. Peplin, and D. Watson, “Informed mobile prefetching,” in Proceedings of the 10th international conference on Mobile systems, applications, and services.   ACM, 2012, pp. 155–168.
  • [46] P. Baumann and S. Santini, “Every byte counts: Selective prefetching for mobile applications,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 2, p. 6, 2017.
  • [47] S. K. Kovvali, C. Boyle, and M. Networks, “Content pre-fetching and CDN assist methods in a wireless mobile network,” Jul 2011. [Online]. Available: https://patents.google.com/patent/US8799480B2/en
  • [48] B. T. Weber, “Mobile map browsers: anticipated user interaction for data pre-fetching,” 2010.
  • [49] B. Jin, S. Tian, C. Lin, X. Ren, and Y. Huang, “An integrated prefetching and caching scheme for mobile web caching system,” in Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD 2007), vol. 2.   IEEE, 2007, pp. 522–527.
  • [50] J. W. Branch, H. Chen, H. Lei, and I. B. M. Corp, “Data pre-fetching based on user demographics,” 2011. [Online]. Available: https://patents.google.com/patent/US8509816B2/en
  • [51] A. J. G. Brown, “Pre-caching web content for a mobile device,” 2016, US Patent 9,275,162.
  • [52] D. Chaffey, “Mobile marketing statistics compilation,” January 2018. [Online]. Available: https://www.smartinsights.com/mobile-marketing/mobile-marketing-analytics/mobile-marketing-statistics/
  • [53] Y. Wang, X. Liu, D. Chu, and Y. Liu, “Earlybird: Mobile prefetching of social network feeds via content preference mining and usage pattern analysis,” in Proceedings of the 16th ACM International Symposium on Mobile Ad Hoc Networking and Computing.   ACM, 2015, pp. 67–76.
  • [54] C. Koch and D. Hausheer, “Optimizing mobile prefetching by leveraging usage patterns and social information,” in 2014 IEEE 22nd International Conference on Network Protocols.   IEEE, 2014, pp. 293–295.
  • [55] V. A. Siris, M. Anagnostopoulou, and D. Dimopoulos, “Improving mobile video streaming with mobility prediction and prefetching in integrated cellular-wifi networks,” in International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services.   Springer, 2013, pp. 699–704.
  • [56] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” 2014.
  • [57] J. Domènech, J. A. Gil, J. Sahuquillo, and A. Pont, “Web prefetching performance metrics: A survey,” Performance Evaluation, vol. 63, no. 9-10, pp. 988–1004, 2006.
  • [58] J. Domenech, J. A. Gil, J. Sahuquillo, and A. Pont, “Using current web page structure to improve prefetching performance,” Computer Networks, vol. 54, no. 9, pp. 1404–1417, 2010.
  • [59] B. de la Ossa, J. A. Gil, J. Sahuquillo, and A. Pont, “Improving web prefetching by making predictions at prefetch,” in 2007 Next Generation Internet Networks, May 2007, pp. 21–27.
  • [60] J. Liu, C. Fang, and N. Ansari, “Request dependency graph: A model for web usage mining in large-scale web of things,” IEEE Internet of Things Journal, vol. 3, no. 4, pp. 598–608, 2016.
  • [61] A. Nanopoulos, D. Katsaros, and Y. Manolopoulos, “A data mining algorithm for generalized web prefetching,” IEEE transactions on knowledge and data engineering, vol. 15, no. 5, pp. 1155–1169, 2003.
  • [62] A. Gellert and A. Florea, “Web prefetching through efficient prediction by partial matching,” World Wide Web, vol. 19, no. 5, pp. 921–932, 2016.
  • [63] Y. Cui, H. Chen, and L. Zhu, “Improved post-copy live migration with memory page prefetching.” International Journal of Performability Engineering, vol. 16, no. 5, 2020.
  • [64] J. Domenech, J. A. Gil, J. Sahuquillo, and A. Pont, “Ddg: An efficient prefetching algorithm for current web generation,” in 2006 1st IEEE Workshop on Hot Topics in Web Systems and Technologies, Nov 2006, pp. 1–12.
  • [65] X. Chen and X. Zhang, “Popularity-based ppm: An effective web prefetching technique for high accuracy and low storage,” in Proceedings International Conference on Parallel Processing.   IEEE, 2002, pp. 296–304.
  • [66] Q. Liu, Web latency reduction with prefetching.   University of Western Ontario, 2009.
  • [67] Contributors to Wikimedia projects, “Trie - Wikipedia,” Jun 2020. [Online]. Available: https://en.wikipedia.org/wiki/Trie
  • [68] “HTTP/1.1: Header Field Definitions,” Sep 2004, [Online; accessed 11. Aug. 2020]. [Online]. Available: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
  • [69] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage, “Federated Learning for Mobile Keyboard Prediction,” arXiv, Nov 2018. [Online]. Available: https://arxiv.org/abs/1811.03604
  • [70] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” arXiv preprint arXiv:1909.11875, 2019.
  • [71] S. Ramaswamy, R. Mathews, K. Rao, and F. Beaufays, “Federated learning for emoji prediction in a mobile keyboard,” arXiv preprint arXiv:1906.04329, 2019.
  • [72] R. Fielding, “Hypertext Transfer Protocol – HTTP/1.1 (RFC 2616), Section 9: Method Definitions,” 1999. [Online]. Available: https://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
  • [73] V. S. Tseng and K. W. Lin, “Efficient mining and prediction of user behavior patterns in mobile web systems,” Information and software technology, vol. 48, no. 6, pp. 357–369, 2006.
  • [74] W.-J. Hsu, T. Spyropoulos, K. Psounis, and A. Helmy, “Modeling time-variant user mobility in wireless mobile networks,” in IEEE INFOCOM 2007-26th IEEE International Conference on Computer Communications.   IEEE, 2007, pp. 758–766.
  • [75] J. Yang, Y. Qiao, X. Zhang, H. He, F. Liu, and G. Cheng, “Characterizing user behavior in mobile internet,” IEEE transactions on emerging topics in computing, vol. 3, no. 1, pp. 95–106, 2015.
  • [76] J. Chappelow, “Pareto Principle,” Investopedia, Aug 2019. [Online]. Available: https://www.investopedia.com/terms/p/paretoprinciple.asp
  • [77] N. J. Tuah, M. Kumar, and S. Venkatesh, “Performance modelling of speculative prefetching for compound requests in low bandwidth networks,” in Proceedings of the 3rd ACM international workshop on Wireless mobile multimedia.   ACM, 2000, pp. 83–92.
  • [78] “Hypertext Transfer Protocol – HTTP/1.1 (RFC 2616), Section 5.1.1.” [Online]. Available: https://tools.ietf.org/html/rfc2616#section-5.1.1
  • [79] R. Fielding, “Hypertext Transfer Protocol – HTTP/1.1 (RFC 2616), Section 9: Method Definitions,” 1999. [Online]. Available: https://tools.ietf.org/html/rfc2616#section-9
  • [80] E. P. Markatos and C. E. Chronaki, “A top-10 approach to prefetching on the web,” 1996.
  • [81] A. Nanopoulos, D. Katsaros, and Y. Manolopoulos, “Effective prediction of web-user accesses: a data mining approach,” Aug 2001.
  • [82] J. W. Tukey, Exploratory Data Analysis.   Addison-Wesley, 1977.
  • [83] R. Baeza-Yates, D. Jiang, F. Silvestri, and B. Harrison, “Predicting the next app that you are going to use,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 2015, pp. 285–294.
  • [84] P. Michaels, “Best phone battery life in 2020: The longest lasting smartphones,” Tom’s Guide, Apr 2020. [Online]. Available: https://www.tomsguide.com/us/smartphones-best-battery-life,review-2857.html
  • [85] A. Rioja, “17 Best Smartphones with Largest Battery Capacity [2019 List],” Flux Chargers, Jan 2018. [Online]. Available: https://www.fluxchargers.com/blogs/flux-blog/best-smartphones-largest-battery-capacity-life
  • [86] P. A. Games and J. F. Howell, “Pairwise multiple comparison procedures with unequal n’s and/or variances: A monte carlo study,” Journal of Educational Statistics, vol. 1, no. 2, pp. 113–125, 1976. [Online]. Available: https://doi.org/10.3102/10769986001002113