Sub-trajectory Similarity Join with Obfuscation
Abstract.
User trajectory data is becoming increasingly accessible due to the prevalence of GPS-equipped devices such as smartphones. Many existing studies focus on querying trajectories that are similar to each other in their entirety. We observe that trajectories partially similar to each other contain useful information about users’ travel patterns which should not be ignored. Such partially similar trajectories are critical in applications such as epidemic contact tracing. We thus propose to query trajectories that are within a given distance range from each other for a given period of time. We formulate this problem as a sub-trajectory similarity join query named as the STS-Join. We further propose a distributed index structure and a query algorithm for STS-Join, where users retain their raw location data and only send obfuscated trajectories to a server for query processing. This helps preserve user location privacy which is vital when dealing with such data. Theoretical analysis and experiments on real data confirm the effectiveness and the efficiency of our proposed index structure and query algorithm.
1. Introduction
Trajectory data is being captured by GPS-equipped devices such as smartphones. Such data can be used to query people’s travel patterns. In this paper, we are interested in a type of query named as the trajectory join query, which returns all trajectory pairs from two trajectory sets that satisfy a given join predicate, e.g., finding people with similar commute routes for ride-sharing matches. Most existing trajectory join queries (Shang et al., 2018; Shang et al., 2017; Yuan and Li, 2019) compute trajectories that are similar in their entirety, i.e., their join predicates are defined on the full trajectories.
We observe that trajectories that are partially similar to each other also offer useful information and should not be ignored. Such partially similar trajectories are gaining importance in applications such as contact tracing for managing epidemics, e.g., to find people in close contact with confirmed cases of COVID-19 for a duration of over 15 minutes111www.dhhs.vic.gov.au/victorian-public-coronavirus-disease-covid-19. Another example is to compute partially similar trajectories to find matches to form multi-hop goods delivery or car-pooling arrangements that allow transits (Xu et al., 2020).
Motivated by these applications, we propose a trajectory join query that, given two sets of trajectories, computes every pair of trajectories that are within a distance threshold (e.g., 5 meters) lasting for a certain time span (e.g., 15 minutes). Our query is defined on sub-trajectory similarity and hence is named as the STS-Join.

Fig. 1 illustrates our join predicate. There are three trajectories in different colors. The sampled points on the trajectories are marked by the dots and are labelled with their time. Existing studies on full trajectory similarity may return the black and the blue trajectories as a similar pair, because they are very close in both space and time. The existing studies may lose sight of similarity of portions. Within the two green colored areas, the red trajectory is close to the blue and the black ones, respectively, even though their entire movements are quite dissimilar. STS-Join aims to find out both partially and fully similar trajectories.
We assume a distributed (i.e., client-server) system architecture to process STS-Join queries. Each client device (e.g., a user’s mobile phone) stores a user’s own accurate trajectories, while the server only stores a modified version of the trajectories for privacy considerations. To showcase the feasibility of processing STS-Join queries over modified trajectories, we consider trajectory obfuscation. A user trajectory is obfuscated automatically where every sampled point on the trajectory is shifted with a bounded-distance noise before the trajectory is sent to the server. All users’ obfuscated trajectories are then maintained on the server in a spatio-temporal index for fast query retrieval. We note that there are limitations in using only obfuscation for trajectory privacy protection, and more advanced techniques exist in the literature such as geo-indistinguishability (Andrés et al., 2013). However, the core theme and contribution of our study is not to propose another privacy protection technique. Thus, we just use obfuscation for its simplicity and leave more advanced privacy protection techniques for future studies.
We propose a query algorithm for STS-Join based on a traversal over our index structure. To take advantage of the characteristic that segments on a trajectory are connected sequentially, we design a backtracking-based method to reduce node access of the traversal. It can avoid querying each segment individually. Note that this query method is applicable to any spatial indices that divide the space in a non-overlapping manner. Further, we derive an upper bound of the similarity between a query trajectory and an original user trajectory based on the similarity between the query trajectory and the corresponding obfuscated user trajectory. This enables additional pruning on the server, which reduces the number of query trajectories to be sent to the clients for final similarity checks and saves communication costs.
To sum up, we make the following contributions:
-
(1)
We define a new trajectory join predicate STS-Join that focuses on sub-trajectory similarity. Two similar trajectories can be very different, but they can contain parts that are related in both space and time. This similarity is especially applicable to bioinformatic datasets that are used for contact tracing and in computational transport science with shared-economy-based transportation systems.
-
(2)
We propose an efficient spatio-temporal index to manage trajectory data dynamically and a backtracking-based algorithm to process STS-Join queries. We further propose a similarity upper bound that is computed on obfuscated trajectories to enable data pruning and more efficient STS-Join processing. Trajectories in our index are not required to be accurate, but our join results are still correct.
-
(3)
We conduct experiments on real datasets. The proposed join algorithm outperforms adapted state-of-the-art methods by up to three orders of magnitude in running time.
2. Related Work
Our study is related to studies on trajectory similarity measurements, trajectory join queries, and trajectory privacy.
2.1. Trajectory Similarity Measurement
Most trajectory similarity measurements are either spatial distance based or spatio-temporal distance based.
Spatial-based measurements (Shang et al., 2018; Buzan et al., 2004; Chen et al., 2005; Sakurai et al., 2005; Salvador and Chan, 2007) focus on the spatial distance between two trajectories. They aggregate distance between aligned point pairs form two trajectories, such as dynamic time warping (DTW)(Shang et al., 2018) , longest common subsequence (LCSS)(Buzan et al., 2004) and edit distance on real sequence (EDR)(Chen et al., 2005).
Spatio-temporal-based measurements (Vlachos et al., 2002; Yuan and Raubal, 2014; Nanni and Pedreschi, 2006; Shang et al., 2017; Wu et al., 2016) consider the distance in both space and time. For example, LCSS and EDR have been extended to incorporate temporal thresholds (Vlachos et al., 2002; Yuan and Raubal, 2014). Nanni and Pedreschi (Nanni and Pedreschi, 2006) assume a constant moving speed on each segment of a trajectory, which is computed as the length of the segment divided by its time span. They then compute the distance between two trajectories as the average distance between two users who travel along the two trajectories with the constant speed of each segment. Shang et al. (Shang et al., 2017) sum point-to-trajectory distances from one trajectory to the other as the distance between two trajectories, where the summed distance is a weighted sum of the spatial and temporal distances. Wu et al. (Wu et al., 2016) consider points from two trajectories to be compatible if their spatial distance over time difference is within a velocity threshold.
These measurements compute similarity based on full trajectories which are different from our sub-trajectory-based metric.
2.2. Trajectory Join
Studies leverage distributed structures to join trajectories, such as DITA (Shang et al., 2018) and DISON (Yuan and Li, 2019). DITA supports a variety of trajectory distance functions based on full trajectories, while DISON measures trajectory similarity by counting the length of common road segments among the trajectories. Our trajectory join procedure supports distributed processing in two senses: (i) our join refinement procedure is distributed to client machines, and (ii) our index is based on non-overlapping space/time partitioning, which can be easily distributed.
There are also studies on sub-trajectory join (Alamdari et al., 2020; Tampakis et al., 2020; Bakalov and Tsotras, 2006). CSTJ (Bakalov and Tsotras, 2006) joins trajectories online. It returns two trajectories once they stay as close-distance pairs for a given time threshold. Its distance measure is based on points in trajectories rather than segments as in our study. DTJr (Tampakis et al., 2020) also measures point distance. It finds the longest sub-trajectory pair satisfying both spatial and temporal distance thresholds, while we find all pairs that satisfy a sub-trajectory similarity metric.
ALSTJ (Alamdari et al., 2020) is the closest work to ours. Its similarity metric is based on the spatial span of sub-trajectory pairs that are within a given distance threshold. The main differences between ALSTJ and our STS-Join are in three aspects: (i) ALSTJ does not consider the temporal factor and may join trajectories generated at different times, while STS-Join requires the trajectories to be close in both space and time so as to be joined; (ii) ALSTJ may have false negatives in its query result due to its trajectory simplification procedure, while our STS-Join guarantees accurate query results; and (iii) ALSTJ does not consider user privacy while we do.
2.3. Trajectory Privacy
Trajectory privacy has been studied extensively in the last decade (Chow and Mokbel, 2011; You et al., 2007; Andrés et al., 2013; Jiang et al., 2013; Naghizade et al., 2014; Monreale et al., 2010). For example, a study (You et al., 2007) introduces position dummy to hide a user’s location by mixing it with fake locations. Another study (Chow and Mokbel, 2007) extends -anonymity to trajectory data. Studies (Andrés et al., 2013; Jiang et al., 2013) leverage differential privacy for release of trajectory data. Geo-indistinguishability (Andrés et al., 2013) generalizes differential privacy to user location data. SDD (Jiang et al., 2013) applies the exponential-based randomized mechanism to trajectory data by sampling a rational distance and direction with noises between locations in the trajectory. There are also studies focusing on semantic privacy of trajectories (Naghizade et al., 2014; Monreale et al., 2010). Monreale et al. (Monreale et al., 2010) present a place taxonomy based method to preserve the trajectory semantic privacy. It guarantees that the probability of inferring the sensitive stops of a user is below a threshold. Naghizade et al. (Naghizade et al., 2014) propose an algorithm to protect the semantic information of a trajectory by substituting sensitive stops of a trajectory with less sensitive ones.
As mentioned earlier, our aim is not to propose a new privacy protection scheme but to show that it is feasible to process STS-Join queries with a client-server architecture where the server only stores a modified version of the user trajectories. We use obfuscation for its simplicity. Other privacy protection methods (e.g., position dummy) can be applied in our STS-Join if the modification on the trajectory points can be bounded by a distance threshold, which helps guarantee no false negative query results.
Symbol | Description |
---|---|
An existing trajectory set | |
A query trajectory set | |
, | Trajectory join distance and time thresholds |
Simplification threshold | |
Maximum obfuscated shifting distance | |
A trajectory | |
A point in trajectory | |
A segment in trajectory | |
The close-distance duration of and | |
The CDD similarity between and |
3. Preliminaries
Given two sets of trajectories (known trajectory set) and (query trajectory set), STS-Join returns pairs of trajectories with sub-trajectories within distance for at least time, where and are query parameters. Below, we present a few basic concepts and formulate STS-Join. We list the frequently used symbols in Table 1.
A trajectory is formed by a sequence of sampled points . A point is a triple : was generated by a user at location (in Euclidean space) at time . Two adjacent points and form a segment .
Following previous studies (Alamdari et al., 2020; Frentzos et al., 2007), we consider a constant speed on each trajectory segment. This is valid as real-world trajectories have high sampling rates, e.g., 4.5 seconds in our experiments. The speed may not vary much in such short time frames. Such a constant-speed setting enables computing a user’s location at any time , denoted as , given a trajectory , by linear interpolation:
(1) |
In Fig. 2, there are two trajectories (the black polyline) and (the red polyline). The solid points in the trajectories represent the sample points, and they are labeled with their timestamps, e.g., is recorded at 7:00. Using Equation 1, we can derive a user’s location on (or ), e.g., at 7:03, the user should be at .
Now we can measure the distance between and , denoted by , at any time as the Euclidean distance. We call this distance the point distance. STS-Join computes the time duration where this distance is within a given threshold .

Trajectories may be generated at different time spans and with different sampling rates. To compute the point distance between and , we need to first define the overlapping time span of segment pairs. Let be the overlapping time span between and , i.e., , where . Denote the length of the overlapping time span as . Then, in Fig. 2, is [7:02, 7:10], and the length is 8 minutes.
Close-distance duration. When , we call and two time-overlapping segments. Given such segments, we compute the time range where , i.e.,
(2) |
To solve this inequality, we expand Equation 2 with Equation 1 and compute the square of both sides of the inequality:
(3) | ||||
The resultant quadratic inequality has just one variable . It can be solved straightforwardly by letting the inequality be equal and computing the roots for the equation with the quadratic formula. We omit the detailed computation for conciseness.
In Fig. 2, the distance threshold is represented by the dotted lines. The distance between the first segments of the two trajectories first decreases and then increases. At time 7:03 and 7:07, . which yields the first time range [7:03, 7:07] that satisfies . The distance between the second segments of the two trajectories keeps decreasing, which reaches at 7:13. This yields the second time range [7:13, 7:14] that satisfies .
We call the length of the close-distance duration (CDD), denoted as . We define the similarity between two trajectories as their total CDD across all segments.
Definition 3.1 (Close-distance duration similarity).
Given a point distance threshold , the close-distance duration similarity (CDDS) of two trajectories and , , is the sum of of every time-overlapping segment pair .
(4) |
CDDS sums up the duration of all close segments including the partial ones. This differs from existing trajectory similarity metrics that require the full trajectories to be close. In Fig. 2, equals to the total length of the two time ranges [7:03, 7:07] and [7:13, 7:14], i.e., 5 minutes.
Problem definition. Now we can formulate our STS-Join.
Definition 3.2 (STS-Join query).
Given two trajectory datasets and , a point distance threshold , and a close-distance duration similarity threshold , STS-Join returns every trajectory pair such that .
4. Index Structure
We assume set to be known (e.g., user trajectory dumps) and set to be given at query time (e.g., trajectories of newly confirmed COVID-19 cases). We build an index named STS-Index over such that can be STS-Joined with efficiently. We use a client-server architecture to protect location privacy. On a client, a user’s trajectories are stored in their original form, which are obfuscated and sent to the server. The obfuscated trajectories from all clients together are indexed in a tree structure on the server that considers both their spatial and temporal features. Next, we detail the index structures on the clients and the server, respectively.
4.1. On Client Side
A user’s original trajectories are stored on the client side. We simplify and obfuscate an original trajectory before sending it (together with the client ID and a local trajectory ID) to the server in order to reduce the communication and protect user’s privacy. We index the trajectories by their local IDs (e.g., using a sorted array or a B-tree) for fast retrieval at the refinement stage of query processing.
Trajectory simplification. First, we simplify an original trajectory by reducing the number of sampled points. This reduces the storage space and improves the query efficiency later. Our simplification algorithm is adapted from the Douglas–Peucker algorithm (DP) (Douglas and Peucker, 1973). The native DP algorithm ignores the temporal dimension. Consider two sampled points and on . For all other sampled points between and , if their perpendicular distances to the segment between and are within a simplification threshold (an empirical parameter), then these points are all removed from by DP.
In our case, since we interpolate user locations on the trajectory segments, we require the user location on the simplified segment to be within distance from that on the original segments at every time point . This guarantees no false negatives in STS-Join over the simplified trajectories. Fig. 3 shows an example. The original (black) trajectory has three segments, which is simplified to just one (red) segment . We compute and on at times and (i.e., the time points of and ), respectively. To ensure valid simplification, the distance between and (i.e., ) and that between and (i.e., ) must both be within . This contrasts to the native DP that examines the perpendicular distances of and (i.e., and ), which are shorter and may lead to false negatives at query processing.

Trajectory obfuscation. Our simplified trajectories retain a subset of the trajectory points. To protect privacy, we further adapt the bounded Laplace mechanism (i.e., an algorithm) to obfuscate the simplified trajectories as inspired by previous studies (Dwork et al., 2006; Holohan et al., 2020).
The bounded Laplace mechanism adds a noise from the bounded Laplace distribution to the output of a (query) function. In our problem, we add a bounded noise to each dimension of a point on a simplified trajectory to protect users’ location privacy. The probability distribution function (PDF) of the added noise is (Holohan et al., 2020):
(5) |
where is a constant, is the bias of the distribution, is the obfuscated distance threshold. Since the domain of the distribution is bounded in , the integral of PDF should be 1 for . This yields the value of :
(6) |
We then leverage the inverse cumulative distribution function to generate random noises that satisfy the PDF. Firstly, we derive the CDF from the PDF by computing the integral of the PDF for and respectively:
(7) |
Then, we can obtain the inverse CDF from Equation 7:
(8) |
Here, variable is a random number in . We generate and use it to obtain noise , which is then added to each point coordinate of the simplified trajectory. Parameter is determined by and (Equation 6), while is the distribution bias.
4.2. On Server Side
The obfuscated trajectories are stored in a tree structure on the server that indexes both the spatial and temporal features of the trajectories (in the form of segments). As Fig. 4 shows, the top levels of the structure together can be seen as a B-tree that indexes time intervals hierarchically, and the node capacity (which is a system parameter) of the example is 20. The time span of an entry in the leaf nodes is 3 minutes which is determined by the time span of the trajectory segments, and every entry points to a spatial index for the trajectory segments in that interval. If a segment spans across the intervals of multiple entries (which is rare as the segments are usually short even after simplification), we break the segment into multiple segments to suit the entry intervals. For example, if a segment starts at 8:29:50 and ends at 8:30:03, we break the segment at 8:30:00 into two segments that suit the time intervals [8:27:00, 8:30:00) (i.e., ) and [8:30:00, 8:33:00) of the leaf entries.

We create a quasi-quadtree as the spatial index to obtain nodes with non-overlapping minimum bounding rectangles (MBR). We use endpoints of the segments to build this tree. First, we insert the endpoints into the root. Once the root node capacity is reached, we partition the space into four quadrants each forming a child node. Every endpoint is moved to the child node enclosing it, while the root node now stores pointers to the child nodes. The process is done recursively until every node is within its capacity. In a leaf node, we further store the segments overlapped by the node MBR (a segment may be in multiple nodes), together with the IDs of the corresponding trajectories and clients.
When the data distribution is skewed, we may have many segments in the same leaf node. We add bitmaps to help query such segments. The bitmaps correspond to disjoint intervals that together cover the time interval of the segments in the node. The number of bitmaps can be empirically determined based on the sampling rate. Each bit value denotes whether a segment overlaps with a time interval. In Fig. 4, we build a bitmap for with 10-second intervals, where both the first and the seventh segments in overlap with the interval [8:27:00, 8:27:10).
Update handling. When there are new obfuscated trajectories, their segments (and the endpoints of the segments) are added to the index by a top-down traversal to find the nodes to be inserted into (following the query procedure in the next section). Trajectory deletion can be also done by first a tree traversal to locate the trajectory segments to be deleted. Then, the segments are removed, together with any empty nodes.
5. STS-Join Query Processing
We illustrate how to join a query trajectory set with a known trajectory set indexed in our STS-Index. Set is not obfuscated, e.g., the trajectories of confirmed COVID-19 cases are reported to the authority for contact tracing. Firstly, we present our STS-Join query algorithm and optimize it with a backtracking technique. Then, we illustrate how to prune dissimilar pairs on the server by bounding the actual trajectory similarity based on obfuscated trajectories. Finally, we analyze the algorithm cost.
5.1. The STS-Join Query Algorithm
A straightforward algorithm is to query every segment from independently over STS-Index, identify the data segments satisfying the query, and send the query trajectory to the corresponding clients for verification against the original trajectories.
We observe that, segments of a trajectory are connected end to end. Points of adjacent segments are likely to be in the leaf nodes that are in short-hop distance from each other in the index. Besides, our quasi-quadtrees divide the space without overlaps. By leveraging these features, we propose an efficient backtracking-based join algorithm that starts from the inner nodes instead of the root for querying each segment of a trajectory.
MBR expansion for accurate query processing. To ensure no false dismissals, we expand the MBRs of query segments to cover the simplified-and-obfuscated trajectories, as well as the query distance threshold . Overall, we expand the MBR of a query segment by on each side. This ensures no false negatives, because by definition, the distance between a point on an original trajectory segment and the corresponding point on its simplified-and-obfuscated version cannot be greater than in any dimension. We use to denote the expanded MBR of . In Fig. 5, is the expanded MBR of the segment .

Pivots. We use pivots to help locate the adjacent segments in our algorithm. Pivots are vertices of the expanded MBR of a query segment that are close to the endpoints of the segment. Using a pivot instead of an endpoint helps avoid false negative query results, since the query MBR is expanded. In Fig. 5, and are the pivots of query segment .
On server side. As summarized in Algorithm 1, our server-side algorithm iteratively searches for similar segments for the query trajectories (lines 1 to 20). For every query trajectory , we split its segments according to the time intervals of the root nodes of the quasi-quadtrees (line 2), like we did in index construction. Then, we search for data segments similar to every segment in the split query trajectory (lines 3 to 20). We first update and its corresponding leaf node . This is only needed for the first segment in or when we move on to a segment in a new time interval (lines 4 to 7), which rarely happens. Function locates node whose MBR covers the pivot in the quasi-quadtree by a point query (line 7). Then, we can leverage node to backtrack with function that starts from and recursively traces back to the ancestor node whose MBR covers . For each query segment, we stop the backtracking at the ancestor node that covers the upper pivot of the expanded MBR of the current query segment (lines 8 and 9). Then, we search for similar segment pairs from for the current query segment (lines 10 to 20). For pruning, only tree nodes whose MBRs overlap with the expanded MBR of the query segment are visited, which are stored in a queue to support the traversal (lines 14 to 16). When the traversal reaches a simplified-and-obfuscated segment , we add to the result set (line 18). Meanwhile, we verify each leaf node for whether it contains the lower pivot of the next query segment which will be used at the stage of backtracking in the next segment query (line 19). After all query trajectories have been processed, we update the query segments in to their corresponding non-time-interval-split segments (line 21). Then, we group the pairs in by the client that generated , and send the pairs to the clients based on the client IDs of the segments (line 22). STS-Join verifies the trajectory similarity on the clients, since the server only stores simplified-and-obfuscated trajectories. After each client has computed the trajectory similarity, it returns the result set to the server.
On client side. On each client, the trajectories corresponding to the segments received from the server are retrieved (by ID lookups using the local trajectory IDs of the segments). Then, we compute the CDD of the segment pairs from the server and add up the CDDS for every trajectory pair formed by the segment pairs. The pairs satisfying the time threshold are returned as the query result to the server. This guarantees no false positives.
Discussion. Backtracking is not limited to quasi-quadtrees. It is applicable to all space-partitioning indices in which the process can stop at a common parent node. A query starting from such parent nodes will not have false negatives, since there are no overlaps among MBRs on the same level in the tree. That means one location can be covered only by one MBR at each level. In addition, backtracking can be applied in index construction, insertion, and deletion, because segments are inserted or deleted sequentially, and we can leverage the common parent node approach to reduce the node accesses when operating on the next segment.
5.2. CDDS-Based Pruning
Algorithm 1 sends all segment pairs that may satisfy the query distance threshold to the clients for further verification and CDDS computation. In this subsection, we compute an upper bound of the actual CDDS between a query trajectory and a known trajectory using and the simplified-and-obfuscated segments of stored on the server. We only send the segment pairs to the corresponding client when the upper bound exceeds , thus saving both communication costs between the server and the clients and computation costs on the clients. This essentially adds a subprocedure to prune the segment pairs before Line 22 of Algorithm 1 using an upper bound of . For simplicity, we only describe the CDDS-based pruning procedure below but do not repeat the full pseudocode of the STS-Join algorithm powered by it.
CDDS-based pruning procedure. We group the segment pairs that satisfy the query distance threshold (i.e., the segment pairs in set at Line 21 of Algorithm 1) by their client IDs, local trajectory IDs, and query trajectory IDs. The segment pairs that share the same client ID, local trajectory ID, and query trajectory ID all come from the same pair of simplified-and-obfuscated known and query trajectories for CDDS upper bound computation. Let the set of such segment pairs be and the original known trajectory corresponding to be , respectively. Then, we compute an upper bound of the close-distance duration (i.e., CDD) for every pair of segments in . Summing up these upper bounds for all segment pairs in yields our upper bound of , since these pairs are the only ones in (and hence ) that satisfy the query distance threshold by definition. The set is pruned from being sent to a client if the CDDS upper bound is less than . Next, we detail our CDD upper bound computation.
CDD upper bound. Recall that the CDD between a known segment of and a query segment of , , is the time duration when their spatial distance is within . We further define the CDD between a simplified-and-obfuscated segment and a query segment as the time duration where their spatial distance is within , denoted as . Then, for all known segments corresponding to , . This is because, given a point on a known segment, its corresponding point on the simplified-and-obfuscated segment is within a distance of by definition. Thus, if the distance between (points on) and is within , the distance between (points on) and must be within . Therefore, we use as our CDD upper bound of the original known trajectories.
We formulate the CDD upper bound and show its correctness with the following lemma.

Lemma 5.1.
Given a query segment , a sequence of known segments , and their corresponding simplified-and-obfuscated segment , we have:
(9) |
where is the CDD with distance threshold .
Proof.
We use Fig. 6 to help illustrate the proof, where known segments can be and correspond to a simplified-and-obfuscated segment , and the query segment is shown as . Besides, any two points connected by a dash line have the same timestamp.
Given a known segment and a query segment , their CDD can be non-zero only if they overlap in their time span, i.e., . Further, if the CDD is non-zero, it is defined by the two roots (i.e., two time points) of the quadratic equation (cf. Equation 3). Let the two roots be and . Then, . Note that, if either or is outside the range of , we replace it with the corresponding boundary value of to meet the overlapping time span requirement of the segments. Let the points corresponding to on and be and , respectively. Similarly, let the points corresponding to on the two segments be and , respectively. In Fig. 6, these four points on and are , , , and (, , and ).
Then, we analyze the distance between a known segment and its corresponding simplified-and-obfuscated segment . Note that a sequence of known segments can be simplified into a single segment with a simplification threshold , while the simplified segment is further obfuscated by shifting the endpoints with a maximum shifting distance along each dimension. Thus, the distance between any point on and its corresponding point (i.e., the point at the same time as that of ) on , is bounded within . This applies to the distance between points (at any given time ) on () and in Fig. 6.
Next, we derive the distance between the query segment and the simplified-and-obfuscated segment by leveraging the distance relationship above. By definition, the distance between the known segment and the query segment does not exceed when , while the distance between and its corresponding simplified-and-obfuscated segment does not exceed Therefore, the distance between the query segment and the simplified-and-obfuscated known segment does not exceed , when . In Fig. 6, we locate a point on whose timestamp is (, , and ). Then, the distance between and does not exceed , while the distance between and is as shown above. Thus, the distance between and is within .
Since there exists a time range where the distance between and does not exceed , the quadratic distance equation must have two roots, which are denoted as and , respectively. Also, since this quadratic function has a non-negative quadratic coefficient (cf. Equation 3), and the distance does not exceed when , we can derive that . Since and , we have when . Recall that is CDD computed with distance threshold . In Fig. 6, is the upper boundary of , and the corresponding points on the query segment and the simplified-and-obfuscated segment are and (, , and ) respectively. Meanwhile, when , the distance between and does not exceed . Thus, we have . Such an inequality is also applicable to other CDD boundaries.
Finally, we sum up the CDD of each segment in with , and we sum up that corresponds to different time ranges in where . Since the inequality is satisfied on each separate time range, such inequality is also satisfied for the sums, i.e., + in Fig. 6. This completes the proof. ∎
Given Lemma 5.1, we have the following lemma to bound the CDDS for the original known trajectories.
Lemma 5.2.
Given a query trajectory , a known trajectory , and its corresponding simplified-and-obfuscated version , we have:
(10) |
where is CDDS with distance threshold .
Proof.
The proof is straightforward and hence is omitted. ∎
5.3. Cost Analysis
We analyze the cost of Algorithm 1 in terms of the number of node accesses. If a query segment spans the whole spatio-temporal space (worst case), all index nodes may be visited, with or without backtracking. However, this rarely happens. In many cities, points of interest are distributed in a polycentric structure (Deng et al., 2019; Cai et al., 2017). Trajectories are likely to gather near local centers, where backtracking helps query the sub-spatial index of the centers.
The algorithm iterates through the query segments. In each iteration, the costs are spent on finding the common parent node of two adjacent points (and hence adjacent segments) in the query trajectory and on traversing the index to reach leaf nodes.


First, we derive the backtracking distance (i.e., the number of tree levels) between the leaf nodes and the common parent node of two adjacent points. We use the expectation to measure the average cost, which is obtained by summing up different backtracking lengths weighted by their probabilities.
We use Fig. 7 to illustrate how to derive with a space division in the quasi-quadtree. The width of the expanded MBR covering the (red) query segment is and the height is , where . Let the side length of the space be and that of a cell be . Here, each cell corresponds to a leaf node. In our quasi-quadtree, once the node capacity is exceeded, a node will be divided into four sub-nodes. The space covered by the node is split into four quadrants. We call the vertical and horizontal boundaries (the black dash lines) between the sub-spaces the splitting lines.
Different splitting lines divide the space at different index levels. The backtracking distance can be derived by the levels of the splitting lines that intersect with the expanded MBR. To derive whether an expanded MBR intersects with a splitting line, we compare the location of the upper left vertex of the MBR with the splitting line. In Fig. 7a, when the upper left vertex of the MBR falls in the pink area, the query segment intersects with a splitting line at level 0. The common parent node is the root of the quasi-quadtree, and is the height of this tree, i.e., . Fig. 7b shows the level-1 splitting lines, where the backtracking distance is if the expanded MBR intersects with them.
Next, we consider the stop condition of segment intersection. When and are large, a query segment may intersect with splitting lines at multiple lower levels, while common parent nodes cannot be at such levels. Let the lowest feasible level of the common parent node be , . This is because, given a pivot of a query segment in a node, we need at most a common parent node with an MBR side length of to also cover the other pivot of the segment. In Fig. 7b, if and , the query segment intersects with splitting lines at levels 1 to 4. We only need to consider the case where the query segment intersects with the splitting lines at level 0.
The probabilities of different backtracking lengths are derived by the ratio of the data space occupied by the pink area of different levels. We cut off the grey area in Fig. 7, as the query segment is outside the space when its expanded MBR is in this area.
Then, we compute by summing up the product of the probability and the backtracking distance at every level:
(11) |
where the fraction part is the probability determined by the size of the pink area, and is the distance between a leaf node and the common parent node. We let and then expand Equation 11:
(12) | ||||
Then, we expand the two terms in square brackets in Equation 12 separately. For the first term, by summing up the geometric series, we can have:
(13) | ||||
The second term can be expanded as:
(14) | ||||
Lastly, we plug Equation 13, 14 into Equation 12:
(15) | ||||
So far, we have that is up to .
Total cost. To query a trajectory, STS-Join iteratively takes node accesses for backtracking, since only one node is visited at each level, where is the number of segments in a query trajectory. Besides, it visits nodes while searching for similar segments of a query segment. In total, the cost of STS-Join with backtracking is .
6. Experiments
6.1. Experimental Setup
All experiments are conducted on a 64-bit machine running Ubuntu 20.04 LTS with a 6-core AMD Ryzen 5 CPU, 32 GB RAM, and a 500GB SSD. All algorithms are implemented in C++ and run in main memory.
Datasets. To the best of our knowledge, there is no public contact-tracing datasets. We use two real trajectory datasets instead: DiDi (url, 2016) and GeoLife (url, 2012). DiDi contains vehicle trajectories in Chengdu, a capital city in China. We randomly sample trajectories recorded in the first week of November, 2016 to generate a dataset with 500k trajectories and 67.7 million segments (this is limited by our memory size). The trajectories come from 201k users (each is considered a client in the experiments), i.e., there are 2.5 trajectories per client on average. There are 135 segments per trajectory on average. The average length and time span of a segment are 30.0 meters and 4.5 seconds, respectively.
GeoLife contains trajectories of different transport modes (e.g., walking, driving, and cycling) recorded mainly in Beijing, China. We only keep the trajectories within the Fifth Ring Road of Beijing. There are some 11k trajectories in this area and very few outside. We randomly sample 10k trajectories from them to form the GeoLife dataset used in the experiments. We shift the trajectory times into the same week without changing their time span to increase the density in time. The trajectories come from 182 users (clients), i.e., there are 55 trajectories per client on average. There are 9.0 million segments in these trajectories and 896 segments per trajectory on average. The average length and time span of a segment are 12.2 meters and 9.2 seconds, respectively. The two datasets are summarized in Table 2. We further generate their random subsets for scalability tests. See Table 3 for the sizes of the subsets.
Dataset | Configuration | |||
---|---|---|---|---|
# traj. | # seg. | Avg. seg. length | Avg. seg. time span | |
DiDi | 500k | 67,654k | 30.0 meters | 4.5 seconds |
GeoLife | 10k | 8,981k | 12.2 meters | 9.2 seconds |
Algorithms. There is no existing algorithm for STS-Join. We adapt two techniques which result in three baseline algorithms. Meanwhile, we replace the join predicates in these algorithms with ours, so all algorithms return the same accurate query results. We test three variants of our proposed STS-Join algorithm to confirm the effectiveness of the proposed backtracking-based query algorithm and the CDDS-based pruning strategy.
-
•
3DR: The 3DR-tree (Pfoser et al., 2000) extends the R-tree by adding time ranges as an extra dimension. We use it to index segments in and run an R-tree join like algorithm for queries.
- •
-
•
ALSTJ-T: The original ALSTJ index does not consider the time dimension. We improve it by adding a B-tree-like temporal index at its top levels like our STS-Index.
-
•
STS: STS denotes the naive STS-Join without backtracking or CDDS-based pruning as described at the start of Section 5.1.
-
•
STS-BT: STS-BT is the backtracking-based STS-Join (no pruning), described in Algorithm 1.
-
•
STS-BTB: STS-BTB further prunes matched trajectory segment pairs from being sent to the clients by a CDDS upper bound as described in Section 5.2.
We set the node capacity of the baselines and the B-trees in STS-Index to 100. We set the B-tree and the bitmap time interval lengths in STS-Index to 1,800 and 30 seconds, respectively. The client-server model of STS-Join is simulated, since it aims to enhance privacy but not computation capacity.
Parameter settings. Table 3 summarizes parameters tested in the experiments, where the default values are in bold.
We follow the closest work (Alamdari et al., 2020) to compare the algorithm response times, and we also measure the I/O cost that is a widely used indicator in spatial indexing studies (Comer, 1979; Qi et al., 2018, 2020) if using external memory based implementation. Besides, we study the communication cost by recording the number of segments sent to the clients for final verification and the number of clients receiving such segments. All algorithms return the same result set under the same query setting, which verifies the correctness of our methods.
Parameter | Settings |
---|---|
(DiDi) | 100k, 200k, 300k, 400k, 500k |
(GeoLife) | 2k, 4k, 6k, 8k, 10k |
(DiDi) | 100, 200, 300, 400, 500 |
(GeoLife) | 2, 4, 6, 8, 10 |
10, 20, 30, 40, 50 (meters) | |
0, 10, 20, 30, 40, 50 (seconds) | |
0, 10, 20, 30, 40, 50 (meters) | |
0, 10, 20, 30, 40, 50 (meters) |
6.2. Join Performance
We set the default sizes of the known DiDi and GeoLife datasets to 200k and 4k, respectively. The query datasets are randomly sampled from the original DiDi and GeoLife datasets with the same time ranges as the known datasets . Their sizes are set to be proportional to those of the respective known DiDi and GeoLife datasets. The default sizes of DiDi and GeoLife query datasets are 200 and 4 (i.e., of ), respectively. The GeoLife query datasets might seem small, but there are some 900 segments per trajectory on average, which are sufficient to show the performance difference of the algorithms.




Varying the known dataset size . Fig. 8 shows the query costs, which increase with . Our STS-BTB algorithm with both backtracking and CDDS-based pruning outperforms all competitors consistently in response time. It also takes the fewest node accesses among all algorithms except our STS-BT, which has the same node accesses as STS-BTB since the CDDS-based pruning does not impact node accesses. The advantage is up to three orders of magnitude. Comparing STS-BTB with STS-BT and STS, it achieves 14% and 18% lower running times, which confirms the effectiveness of our CDDS-based pruning to reduce the computational costs. STS has more node accesses (24% on average) than STS-BT, confirming the effectiveness of our backtracking strategy to reduce node accesses. STS has very similar running times to those of STS-BT, as the algorithms are running in memory, where the time saved on node accesses is lost in the more complex backtracking procedure of STS-BT. STS also outperforms the baselines in running time, while it has slightly more node accesses than ALSTJ-T when 200k on DiDi datasets. This is because ALSTJ-T has a much larger node capacity in its index, i.e., 100, while we use 4 in our quasi-quadtrees. The response time of ALSTJ-T is still up to 6.7 times larger than those of STS and STS-BT. This is because of its worse pruning power – its index allows overlapping node MBRs while ours does not. ALSTJ is even worse, i.e., up to three orders of magnitude slower and two orders of magnitude more node accesses than STS-BTB, because of its lack of pruning power in the time dimension. 3DR considers the time dimension but is not optimized for trajectories. It runs up to 240 times slower and has up to 43 times more node accesses than STS-BTB.




Varying the query dataset size . Next, we vary the query dataset size . Fig. 9 shows that the query costs increase with , which is expected. The relative performance among the algorithms is very similar to that when varying . STS, STS-BT, and STS-BTB outperform all competitors on both datasets in response time (up to three orders of magnitude), while ALSTJ-T has a slightly smaller number of node accesses than that of STS. These confirm the superiority of the proposed algorithms and the effectiveness of the proposed backtracking and CDDS-based pruning techniques to reduce the algorithm costs.
Since the query performance patterns are similar on both DiDi and GeoLife data, we omit the figures for GeoLife in the following query experiments for conciseness.




Varying the query distance threshold . Fig. 10 shows the algorithm performance as increases from 10 to 50 meters. The query costs of all algorithms increase with , because more trajectory pairs satisfy a larger and need to be returned. The relative performance between the proposed STS algorithms and the competitors are similar to those observed above. Like before, STS-BTB is up to 22% and 14% faster than STS and STS-BT, respectively, while STS-BT and STS-BTB both have the fewest node accesses. We also study the impact of on the communication cost, which is reflected by the number of segments that are sent to the clients and the number of clients receiving at least one segment, as reported in Figs. 10c and 10d, respectively. We see that, while the communication costs of STS-BTB increase with which is expected, they are at least 3.1 and 4.8 times lower than those of STS-BT (for ) in terms of the number of segments sent and the number of clients receiving the segments, respectively. Here, STS has been omitted since it has the same communication costs as STS-BT, and the same applies to the figures below. The results further confirm the effectiveness of our CDDS-based pruning.




Varying the query duration threshold . Figs. 11a and 11b show that the join query costs are relatively stable as the query duration threshold increases from 0 to 50 seconds. This is because all algorithms compare the CDDS of trajectories with at almost their final stages. The value of has little impact on the time and node access costs. The relative performance among the algorithms is again very similar to that in the previous figures. From Figs. 11c and 11d, we see that the communication costs of STS-BT is relatively stable, while those of STS-BTB decrease quickly as grows. This is because a larger makes the query predicate more difficult to be satisfied, even for the CDDS used in pruning which has been inflated to guarantee no false negatives. This indicates that our CDDS-based pruning strategy may gain more pruning power in application scenarios with larger values.
Varying the simplification threshold . We also study the impact of the simplification threshold on STS-Join. As shown in Figs. 12a and 12b, when increases from 0 to 50 meters, the query costs first drop and then increase for all STS algorithm variants. The drop is because introducing simplification helps reduce the number of segments in STS-Index and hence the query costs. However, as grows, the data segments become longer which leads to larger node MBRs (and greater MBR expansion to ensure no false negatives) and reduced the index pruning power. This explains for the increase in query costs. STS-BTB and STS-BT have the same node accesses since the CDDS-based pruning has no impact on node accesses. Meanwhile, STS-BTB again outperforms STS-BT in the communication costs, thus showing the robustness of STS-BTB (cf. Figs. 12c and 12d).




Varying the shifting distance . Fig. 13 shows that the query costs of the STS algorithm variants increase with , as expected. A larger may enlarge the node MBRs and hence leads to higher query costs. STS-BTB still takes the smallest time costs, number of node accesses, and communication costs, which have similar trends to those when varying , further confirming the robustness of the algorithm and the optimization techniques.




When other STS-Index parameters are varied (the B-tree, and the bitmap time interval lengths), STS-BTB consistently outperforms STS-BT and STS. We omit the figures for conciseness.
6.3. Index Construction Performance
We show the costs of index construction in Fig. 14. Since STS-BT and STS-BTB have the same process for index construction, we omit STS-BTB from the figure. STS-BT index is the fastest to build, which takes up to 15% less time than ALSTJ-T. STS-BT index also takes fewer node accesses to build than ALSTJ-T index does except when 400k (recall that our index has smaller node capacities). STS takes more time and node accesses to build without backtracking, while ALSTJ suffers in its lack of pruning power in the time dimension for data insertion. 3DR takes the largest costs to build, for that it is not optimized for trajectories.



We report the index sizes (RAM space) in Fig. 14c. STS and STS-BT indices have the same size, so we only report for STS. STS, ALSTJ, and ALSTJ-T indices are larger than those of 3DR, because they store simplified and original trajectories while 3DR only stores original trajectories. STS indices take slightly (8.5%) more space than ALSTJ-T does. This extra cost comes from the bitmap structure and more index nodes, which is justified by our much faster query performance. ALSTJ indices are smaller than ALSTJ-T’s as they do not index the time dimension.
Effectiveness of the join predicate. To show the differences between join queries, we run the original ALSTJ algorithm (with its own join predicate (Alamdari et al., 2020)) and collect the resultant trajectory pairs. We find that this misses 36% of the trajectory pairs returned by STS-Join, confirming the effectiveness of STS-Join in identifying trajectory pairs that may be missed by existing queries.
7. Conclusions
We propose a trajectory join query named STS-Join that returns pairs of trajectories within a given distance threshold for a time longer than a given time threshold. Unlike existing join queries that require the full trajectories to be similar, STS-Join can find trajectories that are similar for a time period while being quite different overall. We propose a client-server index named STS-Index and an efficient backtracking-based algorithm and a similarity time period based pruning strategy to support queries while supporting privacy protection. Our cost analysis and experiments on real data confirm the superiority of the proposed algorithm – it outperforms adapted state-of-the-art join algorithms consistently and by up to three orders of magnitude in the query time.
For future work, we plan to extend STS-Join to the road network, and allow temporal shifts in the similarity metric. We also intend to further study the privacy protection on trajectory similarity queries from a theoretical perspective.
Acknowledgements.
This work was sponsored in part by the Australian Research Council under the Discovery Project Grant DP180103332 and by the NSF under Grants IIS-18-16889 and IIS-20-41415.References
- (1)
- url (2012) 2012. GeoLife. www.microsoft.com/en-us/download/details.aspx?id=52367.
- url (2016) 2016. DiDi. outreach.didichuxing.com/research/opendata/en/.
- Alamdari et al. (2020) Omid Isfahani Alamdari, Mirco Nanni, Roberto Trasarti, and Dino Pedreschi. 2020. Towards in-memory sub-trajectory similarity search. In EDBT/ICDT Workshops.
- Andrés et al. (2013) Miguel E. Andrés, Nicolás E. Bordenabe, Konstantinos Chatzikokolakis, and Catuscia Palamidessi. 2013. Geo-indistinguishability: Differential privacy for location-based systems. In CCS. 901–914.
- Bakalov and Tsotras (2006) Petko Bakalov and Vassilis J Tsotras. 2006. Continuous spatiotemporal trajectory joins. In International Conference on GeoSensor Networks. 109–128.
- Buzan et al. (2004) Dan Buzan, Stan Sclaroff, and George Kollios. 2004. Extraction and clustering of motion trajectories in video. In International Conference on Pattern Recognition. 521–524.
- Cai et al. (2017) Jixuan Cai, Bo Huang, and Yimeng Song. 2017. Using multi-source geospatial big data to identify the structure of polycentric cities. Remote Sensing of Environment 202 (2017), 210–221.
- Chen et al. (2005) Lei Chen, M Tamer Özsu, and Vincent Oria. 2005. Robust and fast similarity search for moving object trajectories. In SIGMOD. 491–502.
- Chow and Mokbel (2007) Chi-Yin Chow and Mohamed F Mokbel. 2007. Enabling private continuous queries for revealed user locations. In SSTD. 258–275.
- Chow and Mokbel (2011) Chi-Yin Chow and Mohamed F. Mokbel. 2011. Trajectory privacy in location-based services and data publication. SIGKDD Explorations Newsletter 13, 1 (2011), 19–29.
- Comer (1979) Douglas Comer. 1979. Ubiquitous B-tree. Comput. Surveys 11, 2 (1979), 121–137.
- Deng et al. (2019) Yue Deng, Jiping Liu, Yang Liu, and An Luo. 2019. Detecting urban polycentric structure from POI data. ISPRS International Journal of Geo-Information 8, 6 (2019), 283.
- Douglas and Peucker (1973) David H. Douglas and Thomas K. Peucker. 1973. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization 10, 2 (1973), 112–122.
- Dwork et al. (2006) Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. 2006. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT. 486–503.
- Frentzos et al. (2007) Elias Frentzos, Kostas Gratsias, and Yannis Theodoridis. 2007. Index-based most similar trajectory search. In ICDE. 816–825.
- Holohan et al. (2020) Naoise Holohan, Spiros Antonatos, Stefano Braghin, and Pól Mac Aonghusa. 2020. The bounded Laplace mechanism in differential privacy. Journal of Privacy and Confidentiality 10, 1 (2020).
- Jiang et al. (2013) Kaifeng Jiang, Dongxu Shao, Stéphane Bressan, Thomas Kister, and Kian-Lee Tan. 2013. Publishing trajectories with differential privacy guarantees. In SSDBM. 1–12.
- Monreale et al. (2010) Anna Monreale, Roberto Trasarti, Chiara Renso, Dino Pedreschi, and Vania Bogorny. 2010. Preserving privacy in semantic-rich trajectories of human mobility. In SIGSPATIAL. 47–54.
- Naghizade et al. (2014) Elham Naghizade, Lars Kulik, and Egemen Tanin. 2014. Protection of sensitive trajectory datasets through spatial and temporal exchange. In SSDBM. 1–4.
- Nanni and Pedreschi (2006) Mirco Nanni and Dino Pedreschi. 2006. Time-focused clustering of trajectories of moving objects. Journal of Intelligent Information Systems 27, 3 (2006), 267–289.
- Pfoser et al. (2000) Dieter Pfoser, Christian S. Jensen, Yannis Theodoridis, et al. 2000. Novel approaches to the indexing of moving object trajectories. In VLDB. 395–406.
- Qi et al. (2018) Jianzhong Qi, Yufei Tao, Yanchuan Chang, and Rui Zhang. 2018. Theoretically optimal and empirically efficient r-trees with strong parallelizability. PVLDB 11, 5 (2018), 621–634.
- Qi et al. (2020) Jianzhong Qi, Yufei Tao, Yanchuan Chang, and Rui Zhang. 2020. Packing R-trees with space-filling curves: Theoretical optimality, empirical efficiency, and bulk-loading parallelizability. TODS 45, 3 (2020), 1–47.
- Sakurai et al. (2005) Yasushi Sakurai, Masatoshi Yoshikawa, and Christos Faloutsos. 2005. FTW: Fast similarity search under the time warping distance. In PODS. 326–337.
- Salvador and Chan (2007) Stan Salvador and Philip Chan. 2007. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis 11, 5 (2007), 561–580.
- Shang et al. (2017) Shuo Shang, Lisi Chen, Zhewei Wei, Christian S. Jensen, Kai Zheng, and Panos Kalnis. 2017. Trajectory similarity join in spatial networks. PVLDB 10, 11 (2017), 1178–1189.
- Shang et al. (2018) Zeyuan Shang, Guoliang Li, and Zhifeng Bao. 2018. DITA: A distributed in-memory trajectory analytics system. In SIGMOD. 1681–1684.
- Tampakis et al. (2020) Panagiotis Tampakis, Christos Doulkeridis, Nikos Pelekis, and Yannis Theodoridis. 2020. Distributed subtrajectory join on massive datasets. ACM Transactions on Spatial Algorithms and Systems 6, 2 (2020), 1–29.
- Vlachos et al. (2002) Michail Vlachos, George Kollios, and Dimitrios Gunopulos. 2002. Discovering similar multidimensional trajectories. In ICDE. 673–684.
- Wu et al. (2016) Huayu Wu, Mingqiang Xue, Jianneng Cao, Panagiotis Karras, Wee Siong Ng, and Kee Kiat Koo. 2016. Fuzzy trajectory linking. In ICDE. 859–870.
- Xu et al. (2020) Yixin Xu, Lars Kulik, Renata Borovica-Gajic, Abdullah Aldwyish, and Jianzhong Qi. 2020. Highly efficient and scalable multi-hop ride-sharing. In SIGSPATIAL. 215–226.
- You et al. (2007) Tun-Hao You, Wen-Chih Peng, and Wang-Chien Lee. 2007. Protecting moving trajectories with dummies. In MDM. 278–282.
- Yuan and Li (2019) Haitao Yuan and Guoliang Li. 2019. Distributed in-memory trajectory similarity search and join on road network. In ICDE. 1262–1273.
- Yuan and Raubal (2014) Yihong Yuan and Martin Raubal. 2014. Measuring similarity of mobile phone user trajectories–a Spatio-temporal Edit Distance method. International Journal of Geographical Information Science 28, 3 (2014), 496–520.