Finding a Taxi with Illegal Driver Substitution Activity via Behavior Modeling
Abstract
In urban life, Illegal Driver Substitution (IDS) activity for a taxi is a grave unlawful activity in the taxi industry, possibly causing severe traffic accidents and painful social repercussions. Currently, IDS activity is manually supervised by law enforcers, i.e., law enforcers empirically choose a taxi and inspect it. The pressing problem of this scheme is the dilemma between the limited number of law enforcers and the large volume of taxis. In this paper, motivated by this problem, we propose a computational method that helps law enforcers efficiently find the taxis which tend to have the IDS activity. Firstly, our method converts the identification of the IDS activity into a supervised learning task. Secondly, two kinds of taxi driver behaviors, i.e., the Sleeping Time and Location (STL) behavior and the Pick-Up (PU) behavior, are proposed. Thirdly, multiple-scale pooling on self-similarity is proposed to encode the individual behaviors into universal features for all taxis. Finally, a Multiple Component-Multiple Instance Learning (MC-MIL) method is proposed to handle the deficiency of the behavior features and to align the behavior features simultaneously. Extensive experiments on a real-world data set show that the proposed behavior features have a good generalization ability across different classifiers, and the proposed MC-MIL method surpasses the baseline methods.
Index Terms:
Illegal Driver Substitution Activity, Behavior Modeling, Multiple Scale, Taxi Supervision, Self-Similarity, Pooling
I Introduction
Taxis play a pivotal role in urban transportation, offering dynamic, convenient, and time-efficient door-to-door services. Yet, managing taxi operations presents unique challenges due to their highly flexible routes, passenger demands, and operational hours, distinguishing them from more static transportation systems like buses and subways. A notable issue within the taxi industry is Illegal Driver Substitution (IDS), where:
Definition 1 (Definition of IDS)
A taxi is operated by someone other than the legally registered driver, in violation of their contractual agreement.
Local regulations stipulate that a legal taxi driver must possess a vocational license and a formal contract with a taxi company. IDS can manifest in two primary forms:
- Operation of a taxi by an individual lacking the necessary vocational license;
- Operation by a licensed driver who is not officially registered to the vehicle in question.
Thus, IDS occurs whenever a taxi is used by unauthorized personnel to provide transportation services, often motivated by the illicit objective of maximizing revenue through the subcontracting of taxi services to unregistered individuals. This illicit practice poses significant risks, including severe traffic incidents and criminal acts (e.g., robbery, murder), undermining the safety and integrity of the taxi industry and local governance.
Currently, law enforcement’s manual checks on taxis are insufficient due to the sheer volume of taxis compared to the limited number of inspectors. This situation prompts a vital question for transportation safety officials: how can we more effectively identify taxis engaged in IDS?



A promising strategy involves leveraging Global Positioning System (GPS) data, taximeter records, and driver registration information. GPS data can reveal driver behavior patterns, such as resting habits or meal times [1] [2], while taximeter records offer insights into service income and passenger pickup/drop-off locations [3] [4]. Additionally, driver profiles (e.g., education, experience) could aid in identifying IDS activities, given the variability and inconsistency of IDS events among different drivers. This variability highlights the importance of monitoring for changes in driver behavior as a key indicator of IDS.
To unlock the potential of taxi service data, it’s crucial to transform individual driver behaviors into Common and Consistent features applicable across the fleet. This entails:
- Commonness. GPS data and taximeter records reflect individual behaviors, but for IDS analysis, extracting shared patterns across taxis is more valuable. This approach simplifies classifier design and enhances our understanding of IDS by focusing on commonalities rather than individual variances.
- Consistency. The variability in driver data, due to inaccuracies or deceptive practices, necessitates a focus on consistent behaviors for feature design. This ensures our models are built on reliable data, crucial for detecting IDS activities accurately.
This paper delves into the challenge of identifying taxis engaging in IDS activities through behavioral modeling. Understanding the research challenges is a key first step in this exploration.
I-A Research Challenges
Unreliable Driver Profile. The profiles that drivers supply to taxi companies poorly predict involvement in IDS activities. The analysis depicted in Fig. 1 shows minimal correlation between the profiles of IDS-engaged taxis and those of "taxi role models." Notably, attributes such as education, age, and the timing of vocational licensing offer no reliable indicators of IDS participation. Research suggests that the key determinants of crime-age profiles are more closely associated with personal circumstances, including living arrangements, family interactions with law enforcement, and truancy rates [5].
Misaligned Behaviors. The driving patterns among taxi drivers vary significantly. For instance, some drivers exhibit a preference for nocturnal shifts, whereas others opt for daytime hours. This diversity is underscored by a study analyzing taxi traces in Beijing, which found substantial variability in driver behavior [6].
Data Imbalance. The incidence of IDS within the taxi community, while serious, is infrequent. A focused study revealed that only a small fraction (0.19%) of taxis are involved in IDS, which makes the resulting data set heavily imbalanced.
I-B Our Contribution
To address these challenges, we propose two efficient and effective driver behaviors from the traces of taxis and the records of taximeters. Our technique takes advantage of the following two critical observations:
- (I) Compared with the registered driver, the illegal one tends to have a different sleeping pattern. Especially for one-shift taxis, a driver tends to have a consistent sleeping behavior since a person usually has a relatively fixed domicile in a city. Besides, the sleeping pattern (e.g., the duration of sleeping) tends to be different for different drivers.
- (II) The operating schemes (e.g., patterns from the distributions of PUs or DOs) of taxi drivers reflect their different driving behaviors. Both empirical and social behavior studies (e.g., [7]) demonstrate that, over a sufficiently long period of time, a person's behavior exhibits a surprisingly high level of consistency. Moreover, the spatio-temporal distribution of the PU points reveals the individual profit-hunting scheme, which typically reflects the behaviors of different drivers [4].
Utilizing insights from Observation (I), we model the Sleeping Time and Location (STL) of a taxi driver via the Fisher Vector (FV) [8] from the following information: a) the GPS coordinates of a driver's sleeping locations, and b) the start working time and the sleeping duration. Based on (II), we utilize Latent Dirichlet Allocation (LDA) [9] to learn the spatio-temporal PU behavior. Therefore, based on both (I) and (II), we propose two kinds of individual-wise driver behaviors.
The individual behaviors are the personal description of each driver. To efficiently discover the taxis with the IDS activity, the individual-wise driver behaviors should be further encoded into discriminative features shared by all taxis. This paper proposes to combine the Self-Similarity (SS) approach and pooling to discover the taxis with the IDS activity. To align these SS-based features over a long time range, we propose Multiple Component-Multiple Instance Learning to handle possibly missing behavior features and to align the behavior features over time.
This paper makes several significant contributions to the field, as outlined below:
1. Introduction to the IDS Problem. This study is pioneering in exploring the detection of one-shift taxis engaged in IDS activities, utilizing GPS data and taximeter records. We approach this challenge as a supervised learning problem and achieve effective solutions for identifying taxis with IDS activities.
2. Modeling Driver Behavior. We have developed the STL and PU behaviors, offering insights into drivers' rest patterns and profitability strategies, respectively. These are instrumental in characterizing the nuanced activities of each driver.
3. Multi-Scale Pooling (MSP) on Self-Similarity (SS). Acknowledging the variable timing of IDS activities among taxis, we introduce SS to pinpoint IDS occurrences and MSP to standardize them into a uniform feature vector dimension. The synergy of SS and MSP effectively translates diverse individual driver behaviors into a common feature space applicable across all taxis.
4. Multiple Component-Multiple Instance Learning (MC-MIL). MC-MIL addresses the challenges of missing behavior features and aligns IDS-related features. Additionally, we evaluate the performance of deep-learning-based methods (i.e., Long Short-Term Memory (LSTM) and the Transformer) on this imbalanced and small-scale dataset. Our findings reveal MC-MIL as the superior method among these classifiers for detecting IDS activities.

II Background and Related Work
II-A Background
As depicted in Fig.2, leveraging GPS and PU/DO data allows a data center to identify and notify law enforcers of taxis engaging in IDS activities directly on their mobile devices. By setting a specific inspection area on their devices, law enforcers can strategically locate and manually verify suspect taxis through the electronic map, bypassing the need for random checks. Thus, the efficacy of the system in Fig.2 hinges on its ability to accurately and swiftly pinpoint taxis with IDS activities.
II-B Related Work
Pattern Analysis from GPS Traces: A variety of research issues have been addressed by leveraging large-scale GPS traces, e.g., urban human mobility understanding [10] [11] [3], urban planning [12] [13], traffic prediction [14], anomalous trajectory detection [15], and urban region function identification [16]. For the intelligent transportation community, previous studies have addressed numerous research issues; e.g., [17] identifies unusual driving patterns from taxi GPS traces, with applications in fraud detection and monitoring urban road networks. In [18], a space-time visualization method is proposed to analyze Beijing taxi GPS data by daily operation time, driver residence, and operating patterns for understanding Beijing taxi operations. [19] proposes a real-time method to detect anomalous trajectories as well as to identify which parts of a trajectory are responsible for its anomalous behaviour.
Pattern Analysis from Records of Taximeters: The records of taximeters have been widely used to improve taxi drivers' performance, e.g., identifying popular pickup areas [20] [21] [22] [23] and optimizing passenger-hunting routes [6] [24] [20]. [25] utilizes a heuristic algorithm to create maximum fraudulent trajectories from the dataset. [26] examines the choices made by New York taxi drivers at JFK airport, determining pick-ups versus cruising after trips. In [27], the author makes the critical observation that fraudulent taxis manipulate taximeters, inflating service distances and reported speeds. In [28], deep learning is exploited to model and analyze driver behaviors.
Summarizing: To our knowledge, this paper is the first to leverage GPS traces and records of taximeters to identify IDS activities. Therefore, we have briefly reviewed the related works that exploit GPS traces and logs of taximeters for other tasks.
III Data Pre-Processing
TABLE I: The items of a pre-processed datum.

| Taxi Trace Data (A Set of GPS Points) ||| Taxi Service Data (PU/DO Points) ||||
|---|---|---|---|---|---|---|
| Longitude | Latitude | Time Stamp | Longitude | Latitude | 0/1† | Time Stamp |

† 0 means a PU point, 1 denotes a DO point.
The dataset comprising taxi GPS and taximeter records was sourced from the Beijing Transportation Information Center (http://www.btic.org.cn/xxzx/). Specifically, each taxi is equipped with a GPS device that transmits real-time data including longitude, latitude, time stamps, and instantaneous velocity. Concurrently, taximeter records capture service details such as time stamps for pickups (PU) and drop-offs (DO), service income, taxi occupancy status ("occupied" or "vacant"), and service distance. Both data types are relayed to a central data center via telecommunication networks.
Adherence to local taxi regulations is mandatory, requiring: 1) continuous GPS connectivity; 2) accurate GPS time stamps; and 3) taximeters in optimal condition. Non-compliance, indicated by erroneous data transmission, prompts law enforcement to notify taxi companies for corrective actions.
Data pre-processing involves two critical steps:
1. Differentiating between two-shift and one-shift taxis: This study focuses on identifying IDS activities in one-shift taxis, given the negligible proportion of two-shift taxis in Beijing (approximately 13 percent) and their complex behavioral patterns, such as taxi transfers and passenger pickups [1].
2. Geo-locating PU/DO events by matching time stamps: With accurate time stamps from both taximeters and GPS devices, a PU/DO event's location is determined by assigning it the GPS coordinates of the nearest time-stamped GPS point.
There is no specific data filtering except that we remove erroneous GPS records whose coordinates have fewer than six digits after the decimal point. The items of a pre-processed datum are listed in Table I.
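As an illustration of the second pre-processing step, the following minimal sketch matches each PU/DO event to the nearest time-stamped GPS point. The record layouts and function names are assumptions made for illustration, not the authors' implementation.

```python
from bisect import bisect_left

def locate_events(gps_points, events):
    """Assign each PU/DO event the coordinates of the GPS point whose
    time stamp is closest to the event's time stamp.

    gps_points: list of (timestamp, lon, lat), sorted by timestamp.
    events:     list of (timestamp, flag) with flag 0 = PU, 1 = DO.
    Returns a list of (timestamp, lon, lat, flag).
    """
    times = [t for t, _, _ in gps_points]
    located = []
    for ev_time, flag in events:
        i = bisect_left(times, ev_time)
        # Compare the two neighbouring GPS points and keep the nearer one.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        j = min(candidates, key=lambda k: abs(times[k] - ev_time))
        _, lon, lat = gps_points[j]
        located.append((ev_time, lon, lat, flag))
    return located
```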
IV Methodology
IV-A Problem Definition
Consider a taxi operating within an urban environment. Let $\phi_{g}(\cdot)$ and $\phi_{t}(\cdot)$ represent the feature extraction functions applied to the GPS trace data set and the taximeter record data set, respectively. The process of identifying taxis engaged in IDS activities is formalized as follows:
Definition 2 (Identification of IDS Activities)
Given the feature extractors $\phi_{g}$ and $\phi_{t}$, the objective is to determine a function $F(\cdot)$ that classifies a taxi's involvement in IDS activities, i.e.,

$$y = F\big(\phi_{g}(\mathcal{G}),\ \phi_{t}(\mathcal{T})\big), \qquad y \in \{1, 0\}, \qquad (1)$$

where $\mathcal{G}$ and $\mathcal{T}$ denote the GPS traces and the taximeter records of the taxi, and $y=1$ signifies the presence and $y=0$ the absence of IDS activities in a taxi.
Equation (1) frames the identification of IDS activities as a supervised learning challenge. This study distinguishes between positive and negative samples for classification, leveraging Beijing's "taxi role model" initiative to define role model taxis as negative samples ($y=0$) and taxis exhibiting IDS activities as positive samples ($y=1$).
IDS Discovery Framework: The proposed methodology encompasses three main stages:
- Step 1. Heterogeneous Driver Behavior Modeling: The STL (Sleeping Time and Location) and PU (Pick-Up) behaviors are computed to encapsulate individual driver behaviors.
- Step 2. Multi-Scale Pooling from Self-Similarity: Individual behaviors are translated into a unified feature space via Self-Similarity (SS) and Multi-Scale Pooling (MSP).
- Step 3. Supervised Learning for IDS Activity Identification: Leveraging the insights from the initial stages, the IDS detection problem is approached through supervised classification techniques.
V Heterogeneous Driver Behavior Modeling
V-A Modeling Sleeping Behavior of a Driver


Definition 3 (Definition of STL)
STL for a taxi driver is characterized by a period of inactivity, where the taxi’s location remains unchanged, and it is unoccupied. This period, denoted as the driver’s sleep, is defined by three parameters: the location of rest, the starting time of this inactive period, and its duration. Hence, STL is formally represented as a triple: sleeping location, start time, and sleep duration.
STL can be straightforwardly derived for each taxi based on Definition 3. Fig. 3 depicts the sleep locations of two one-shift taxis in Beijing, highlighting the distinct, concentrated sleeping patterns that reflect personal domiciles. Conversely, the sleeping behaviors of two-shift taxi drivers are more varied, including breaks at random locations and rest within the taxi, making STL modeling less applicable.
Given a taxi's start time $t_{s}$, sleeping coordinates $(lon, lat)$, and sleeping duration $\Delta t$, STL is vectorized as $x = [t_{s}, lon, lat, \Delta t]^{\top}$. For efficient and accurate classification, the Fisher Vector (FV) method is employed to transform STLs into discriminative features, a technique proven effective in image recognition tasks [29].
Encoding STL by Fisher Vector. In a nutshell, FV assumes that the gradient of the log-likelihood describes the contribution of the parameters with respect to the generation of data in a parameter space. Concretely, FV firstly fits a Gaussian Mixture Model (GMM). The resulting FV descriptor integrates the deviations of the parameters from GMM, providing a robust feature [30]. FV’s efficacy spans a range of imaging applications, including image classification, face recognition, object detection, and texture analysis [31][32][33] [34].
Let $X=\{x_{n}\}_{n=1}^{N}$ be a set of STLs from a taxi. FV is modeled as the gradients of a probability density function $p(X\mid\lambda)$ as follows:

$$\Phi(X) = \frac{1}{N}\,\nabla_{\lambda}\log p(X\mid\lambda), \qquad (2)$$

where $N$ is the number of STL data in a time bucket, since some one-shift taxis occasionally have no sleep time. In practice, $N$ is almost constant in our experiment.
In practice, following [8], we choose $p(x\mid\lambda)$ to be a GMM, which approximates with arbitrary precision any continuous distribution: $p(x\mid\lambda)=\sum_{k=1}^{K}w_{k}\,g_{k}(x\mid\mu_{k},\Sigma_{k})$ with $\sum_{k=1}^{K}w_{k}=1$, in which the parameters are $\lambda=\{w_{k},\mu_{k},\Sigma_{k}\}_{k=1}^{K}$, where $w_{k}$, $\mu_{k}$ and $\Sigma_{k}$ are respectively the mixture weight, the mean vector and the covariance matrix of the $k$-th Gaussian component $g_{k}$:

$$g_{k}(x\mid\mu_{k},\Sigma_{k}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_{k}|^{1/2}}\exp\!\Big(-\tfrac{1}{2}\,(x-\mu_{k})^{\top}\Sigma_{k}^{-1}(x-\mu_{k})\Big). \qquad (3)$$

We assume that the covariance matrix $\Sigma_{k}$ is a diagonal matrix since the computational cost of diagonal covariances is much lower than the cost involved by full covariances. Hereafter, we use the notation $\sigma_{k,d}^{2}$ to represent the $d$-th element on the diagonal of the $k$-th covariance matrix. For the weight parameters $w_{k}$, we adopt the soft-max formalism in [8] and define $w_{k}=\frac{\exp(\alpha_{k})}{\sum_{j=1}^{K}\exp(\alpha_{j})}$, where the re-parametrization using $\alpha_{k}$ avoids enforcing explicitly the constraints on the weights in GMM. Consequently, the gradient of the log-likelihood with respect to the mean $\mu_{k,d}$ and the standard deviation $\sigma_{k,d}$ is respectively as follows:

$$\frac{\partial \log p(x_{n}\mid\lambda)}{\partial \mu_{k,d}} = \gamma_{n}(k)\,\frac{x_{n,d}-\mu_{k,d}}{\sigma_{k,d}^{2}}, \qquad (4)$$

$$\frac{\partial \log p(x_{n}\mid\lambda)}{\partial \sigma_{k,d}} = \gamma_{n}(k)\left[\frac{(x_{n,d}-\mu_{k,d})^{2}}{\sigma_{k,d}^{3}}-\frac{1}{\sigma_{k,d}}\right], \qquad (5)$$

where $\gamma_{n}(k)$ denotes the probability of a STL point $x_{n}$ being generated from the $k$-th Gaussian component,

$$\gamma_{n}(k)=\frac{w_{k}\,g_{k}(x_{n}\mid\mu_{k},\Sigma_{k})}{\sum_{j=1}^{K}w_{j}\,g_{j}(x_{n}\mid\mu_{j},\Sigma_{j})}. \qquad (6)$$

Consequently, following the principle in (2), the FV of a set $X$ is the concatenation of the partial derivatives with respect to the means and the standard deviations as follows:

$$\Phi(X)=\Big[\,\ldots,\ \tfrac{1}{N}\textstyle\sum_{n}\tfrac{\partial \log p(x_{n}\mid\lambda)}{\partial \mu_{k,d}},\ \tfrac{1}{N}\textstyle\sum_{n}\tfrac{\partial \log p(x_{n}\mid\lambda)}{\partial \sigma_{k,d}},\ \ldots\,\Big]^{\top}, \qquad (7)$$

where $D$ is the dimension of the STL vectors. By (7), a STL point is encoded into a FV vector with $2KD$ dimensions. Following the $\ell_{2}$-normalization [29], we finally use the normalized FV.
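The STL encoding can be sketched as follows with scikit-learn. The diagonal-covariance GMM, the $1/\sqrt{w_{k}}$ scaling of the improved FV formulation [8], and all variable names are assumptions of this illustration rather than the exact implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_stl_gmm(stl_train, n_components=8):
    """Fit a diagonal-covariance GMM on STL vectors (start time, lon, lat, duration)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(stl_train)
    return gmm

def fisher_vector(stl_bucket, gmm):
    """Encode the STL vectors of one time bucket into a 2*K*D Fisher Vector,
    i.e., gradients w.r.t. the GMM means and standard deviations (Eqs. (4)-(7))."""
    X = np.atleast_2d(stl_bucket)                         # (N, D)
    N, _ = X.shape
    gamma = gmm.predict_proba(X)                          # (N, K) posteriors, Eq. (6)
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sigma = np.sqrt(var)
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # (N, K, D)
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (N * np.sqrt(w)[:, None])
    g_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    return fv / (np.linalg.norm(fv) + 1e-12)              # l2-normalization
```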
V-B Modeling the PU Behavior of a Driver
Developing the Latent Pick-Up (PU) Behavior Model: It is well documented that individuals exhibit consistent spatio-temporal patterns, a concept that extends to the predictable behaviors of taxi drivers regarding passenger pickups [11]. The underlying principle is that drivers' quest for profitability leads to the emergence of distinct pickup patterns. Notably, drivers with IDS activities tend to operate within specific zones known for high ride service demand. This behavior contrasts with the registered drivers, whose pickup locations are generally more dispersed and appear random [7] [35].
We represent a pickup event with a tuple $pu=(\text{longitude}, \text{latitude}, \text{time stamp})$, capturing the essential details of when and where passengers are picked up. This study employs LDA to model the latent structures within the pickup data, drawing parallels between the distribution of words in documents and pickup points in urban spaces:
- The aggregation of pickup points resembles the collection of words within a document.
- The compilation of pickups over a specific period forms our corpus, analogous to the textual content of a document.
- The entirety of pickups by a single taxi is viewed as an individual document in this analogy.
This framework allows us to employ latent topics to elucidate the patterns in drivers' pickup behaviors, with the specific analogy detailed in Table II. LDA leverages a Dirichlet prior to model the distribution of topics within documents, illustrating its versatility in representing texts. Documents can thus embody a mix of multiple topics, enhancing the model's descriptive power [36] [37].
Consider $\mathcal{D}=\{d_{1},\ldots,d_{M}\}$, a collection of such "documents", each "document" $d$ being a series of "words", or in another notation, $d=(w_{1},\ldots,w_{N})$. These "documents" are presumed to arise from a distribution over a set of topics.
To delineate the "words" representing PU behavior, this work utilizes the $k$-means algorithm to cluster the pickup points from all taxis into a designated number of clusters. Each cluster represents a "word" $w$, and the words of a taxi collected over one day form a "document" that characterizes the pickup behavior of its driver.
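A minimal sketch of building the PU "words" and daily "documents" with scikit-learn's k-means follows; the number of words and the function names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_pu_words(all_pu_points, n_words=200):
    """Cluster the (lon, lat) coordinates of all PU points into 'words'."""
    kmeans = KMeans(n_clusters=n_words, n_init=10)
    kmeans.fit(np.asarray(all_pu_points))
    return kmeans

def pu_document(day_pu_points, kmeans):
    """Turn one taxi-day of PU points into a bag-of-words count vector."""
    words = kmeans.predict(np.asarray(day_pu_points))
    return np.bincount(words, minlength=kmeans.n_clusters)
```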
TABLE II: The analogy between a text corpus and the PU points.

| Notation | Text Corpus | PU Points |
|---|---|---|
| $z$ | Topics | The PU behaviors |
| $d$ | A document | A time bucket |
| $\theta$ | Topic proportions | Behavior proportions |
The joint distribution of a topic mixture $\theta$, a set of topics $\mathbf{z}$, and a set of words $\mathbf{w}$ is given as follows:

$$p(\theta,\mathbf{z},\mathbf{w}\mid\alpha,\beta) = p(\theta\mid\alpha)\prod_{n=1}^{N}p(z_{n}\mid\theta)\,p(w_{n}\mid z_{n},\beta), \qquad (8)$$

where $\theta$ follows a Dirichlet distribution with parameter $\alpha$ on the simplex (i.e., $\theta_{k}\ge 0$ and $\sum_{k}\theta_{k}=1$), $p(z_{n}\mid\theta)$ is simply $\theta_{i}$ for the unique $i$ such that $z_{n}^{i}=1$, and $p(w_{n}\mid z_{n},\beta)$ is a multinomial probability conditioned on the topic $z_{n}$.
We aim to determine the latent topics based on the PU points in a time bucket. This is equivalent to computing the posterior distribution of the hidden variables given a document:

$$p(\theta,\mathbf{z}\mid\mathbf{w},\alpha,\beta) = \frac{p(\theta,\mathbf{z},\mathbf{w}\mid\alpha,\beta)}{p(\mathbf{w}\mid\alpha,\beta)}, \qquad (9)$$

which is approximately computed based on the Markov Chain Monte Carlo (MCMC) technique [9], where the idea is to generate posterior samples from its conditional distribution. Therefore, the latent topics of the PU behavior for a taxi in a day are defined as the estimated topic proportions of that day's "document":

$$b^{PU} = \hat{\theta} = \mathbb{E}\big[\theta\mid\mathbf{w},\alpha,\beta\big]. \qquad (10)$$
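A sketch of learning the latent PU topics per taxi-day is given below. It uses scikit-learn's LatentDirichletAllocation, which relies on online variational Bayes rather than the MCMC sampler mentioned above, so it only approximates the described procedure.

```python
from sklearn.decomposition import LatentDirichletAllocation

def fit_pu_topics(daily_documents, n_topics=60):
    """daily_documents: (num_taxi_days, num_words) bag-of-words count matrix."""
    lda = LatentDirichletAllocation(n_components=n_topics)
    lda.fit(daily_documents)
    return lda

def pu_behavior(day_document, lda):
    """Topic proportions of one taxi-day, used as the PU behavior (cf. Eq. (10))."""
    return lda.transform(day_document.reshape(1, -1))[0]
```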
Dealing with location inaccuracy: In many cases, the location information may be inaccurate because the signal transmission is blocked by high buildings or overpasses. Therefore, a smoothing mechanism over the degree of PU word membership should be applied when computing the words for a taxi.
A simple and yet reasonable way is to use the Nadaraya-Watson kernel regression with a Gaussian kernel to approximate the probabilistic word membership [38]. The probabilistic membership of a PU point to the $j$-th word is as follows:

$$p(w_{j}\mid pu) = \frac{\exp\big(-d_{j}^{2}/(2\sigma^{2})\big)}{\sum_{j'}\exp\big(-d_{j'}^{2}/(2\sigma^{2})\big)}, \qquad (11)$$

where $d_{j}$ is the distance between the PU point and the center of the $j$-th PU word, and $\sigma$ denotes the uncertain range of a sensor.
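A minimal sketch of this kernel-smoothed word membership follows, assuming PU coordinates projected to meters and an illustrative sensor uncertainty sigma.

```python
import numpy as np

def soft_word_membership(pu_point, word_centers, sigma=100.0):
    """Probabilistic word membership of a single PU point (cf. Eq. (11)).

    pu_point:     (x, y) of the pickup, projected to meters.
    word_centers: (n_words, 2) cluster centers from k-means, in the same units.
    sigma:        assumed positional uncertainty of the sensor, in meters.
    """
    d2 = np.sum((word_centers - np.asarray(pu_point)) ** 2, axis=1)
    weights = np.exp(-d2 / (2.0 * sigma ** 2))
    return weights / weights.sum()
```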
V-C Multi-Scale Pooling for Self-Similarity Based Features
It is posited that individual taxi drivers exhibit distinct, yet relatively stable, behavioral patterns. Anomalies in these patterns, especially sharp deviations observed over short intervals (e.g., daily), may signal the IDS activities. Identifying such anomalies necessitates addressing two primary challenges:
- Detecting Significant Behavioral Changes: Variability in a driver's behavior can naturally occur due to unforeseen events. It is crucial, therefore, to establish a baseline against which significant deviations indicative of potential IDS activities can be measured.
- Consolidating IDS Indicators Across Taxis: IDS activities, when present in multiple taxis, may not manifest simultaneously. As depicted in Fig. 4(a), aligning these activities within a unified feature vector framework is essential.


For the first problem, the Self-Similarity (SS) approach is proposed to detect the occurrence of the IDS activity. Concretely, given a sequential state of behaviors $\{b_{t}\}_{t=1}^{T}$ (which are either the STL behaviors $b_{t}^{STL}$ or the PU ones $b_{t}^{PU}$) of a taxi, where $b_{t}$ is the behavior on the $t$-th day in a time bucket of length $T$, SS computes the difference of the behaviors between two consecutive days as follows:

$$s_{t} = \mathcal{K}\big(b_{t},\,b_{t+1}\big), \qquad t=1,\ldots,T-1, \qquad (12)$$

There are two important parameters in (12): the function $\mathcal{K}(\cdot,\cdot)$ and the time bucket length $T$. The function $\mathcal{K}$ can be any function that measures the similarity between two features. For instance, mutual information is used for two distributions, and cosine distance for two normalized vectors. The time bucket $T$ in (12) corresponds to the time window in which the IDS activity may occur. Ideally, the larger the length of the time bucket is, the easier it is to cover the IDS activity. This paper chooses the cosine distance as the similarity function, and the bucket length is 30 days.
For the second problem, taking "Taxi 1" in Fig. 4(a) as an example, if the change of the driver behaviors is detected by the SS approach, the maximal change can be used to indicate the occurrence of the IDS activity. Therefore, given a time bucket $T$, the occurrences of the IDS activity are discovered by max pooling as follows:

$$f_{\max} = \max_{1\le t\le T-1} s_{t}. \qquad (13)$$

Meanwhile, the driver behaviors, especially the PU behaviors, are not always consistent due to unexpected events, for instance, picking up a passenger via a car-hailing service or changing domicile because the rent is due. To describe this observation, the minimal change is extracted to calibrate the maximal change as follows:

$$f_{\min} = \min_{1\le t\le T-1} s_{t}. \qquad (14)$$

The combination of (13) and (14) jointly describes the intense change of the behaviors.
As illustrated in Fig. 4(a), although the changes of the driver behaviors occur on different days, both max pooling (13) and min pooling (14) encode the IDS activity into the same bin of the feature vector. Ideally, by combining the SS approach and the pooling method, the individual-wise behaviors are aligned into a universal feature space for all taxis.
When the length of the time bucket is very large, some unexpected events tend to occur, and both max pooling and min pooling would capture the unexpected events rather than the IDS activity. Therefore, Multiple time-Scale Pooling (MSP) is further proposed to increase the discriminative power of the aligned features. Concretely, Fig. 4(b) shows that a time bucket is divided into multiple smaller ones at several scales, and then the aligned feature vectors from the different scales are concatenated into a MSP-SS vector as follows:

$$f_{MSP} = \big[\,f_{\max}^{(1)},\,f_{\min}^{(1)},\,\ldots,\,f_{\max}^{(M)},\,f_{\min}^{(M)}\,\big], \qquad (15)$$

where $M$ is the number of the divided time buckets.
Intuitively, when an unexpected event occurs in the $m$-th time bucket, the top scale, i.e., scale 1 in Fig. 4(b), may mistake the unexpected event for the IDS activity, because the pooling operations in (13) and (14) discard the structural information. In contrast, if we introduce more scales (e.g., scale 2 and scale 3 in Fig. 4(b)), the other time buckets (e.g., the $m'$-th) still capture the occurrence of the IDS activity. As a result, MSP increases the discriminative power of the feature.
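A compact sketch of SS and MSP as described by (12)-(15) is given below; the cosine distance and the scale list are illustrative, since the exact scale lengths used in the paper are not reproduced here.

```python
import numpy as np

def self_similarity(behaviors):
    """Cosine distance between the behaviors of consecutive days (Eq. (12))."""
    b = np.asarray(behaviors, dtype=float)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return 1.0 - np.sum(b[:-1] * b[1:], axis=1)

def msp_feature(behaviors, scales=(1, 2, 4)):
    """Multi-Scale Pooling over the self-similarity sequence (Eqs. (13)-(15)).

    Each scale splits the time bucket into that many sub-buckets and keeps the
    max and min SS value of every sub-bucket.  Assumes the bucket holds at
    least as many days as the finest scale.
    """
    ss = self_similarity(behaviors)
    feats = []
    for n_parts in scales:
        for part in np.array_split(ss, n_parts):
            feats.extend([part.max(), part.min()])
    return np.asarray(feats)
```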
V-D Multiple Component Learning for Aligning IDS Behavior
Two problems arise in classifying the taxis with IDS behavior:
- How to align and classify the behavior over a long-range time bucket is key to finding the taxis with IDS, although MSP can align the behavior within a short-range time bucket (e.g., a week).
- If one of the STL-based feature, the PU-based one, or the combination of the two is deficient, how can a taxi with the IDS activity still be identified?
This paper proposes a hybrid approach to align and leverage features in a long-range time bucket by combining Multiple Instance Learning (MIL) and Multiple Component Learning (MCL) [39].
Let the behaviors of a taxi in a long-range time bucket be represented as a super feature set $\mathcal{X}=\{X^{1},\ldots,X^{C}\}$, where $C$ is the number of the behaviors. In this paper, $C=2$: the feature set $X^{1}=\{x_{j}^{1}\}_{j=1}^{J_{1}}$ collects the STL-based MSP-SS features and the set $X^{2}=\{x_{j}^{2}\}_{j=1}^{J_{2}}$ collects the PU-based ones, in which $x_{j}^{1}$ and $x_{j}^{2}$ are the features of the $j$-th time bucket of the time ranges $T_{1}$ and $T_{2}$, respectively. The numbers of time buckets $J_{1}$ and $J_{2}$ are not necessarily equal. The proposed Multiple Component-Multiple Instance Learning (MC-MIL) is as follows:

$$p\big(y=1\mid\mathcal{X}\big) = 1-\prod_{c=1}^{C}\Big(1-p_{c}\big(y=1\mid X^{c}\big)\Big), \qquad (16)$$

where $C$ is the number of classifiers and the function $p_{c}(\cdot)$ outputs the probability of a taxi with the IDS activity according to the $c$-th component. In our work, the two components are the classifiers built from the STL feature and from the PU feature, respectively. (16) is the result-fused approach which combines the results of the $C$ classifiers into $p(y=1\mid\mathcal{X})$. The advantage of (16) is that if any one of the behavior features is missing, (16) still robustly predicts a result.
The function $p_{c}(\cdot)$ is the MIL function that aligns the IDS behavior as follows:

$$p_{c}\big(y=1\mid X^{c}\big) = 1-\prod_{j=1}^{J_{c}}\Big(1-p\big(y=1\mid x_{j}^{c}\big)\Big), \qquad (17)$$

where $j$ is the index of a time bucket used to compute MSP (15). (17) utilizes the noisy-or model to align the IDS behavior within the long-range buckets $X^{1}$ and $X^{2}$, respectively. The probability $p(y=1\mid x_{j}^{c})$ is a logistic regression as follows:

$$p\big(y=1\mid x_{j}^{c}\big) = \frac{1}{1+\exp\big(-H(x_{j}^{c})\big)}, \qquad (18)$$

where $H(\cdot)$ is the additive function as follows:

$$H(x_{j}^{c}) = \sum_{r=1}^{R}\lambda_{r}\,h_{r}(x_{j}^{c}), \qquad (19)$$

where $x_{j}^{c}$ is the SS-based behavior feature in the $j$-th time bucket of a taxi, $h_{r}(\cdot)$ is a weak classifier, and $\lambda_{r}\ge 0$ is a non-negative coefficient.
Under this model, the likelihood assigned to a set of training bags $\{\mathcal{X}_{i}\}$ and their labels $\{y_{i}\}$ is as follows:

$$\mathcal{L} = \prod_{i} p_{i}^{\,y_{i}}\,(1-p_{i})^{1-y_{i}}, \qquad (20)$$

where $p_{i}=p(y_{i}=1\mid\mathcal{X}_{i})$ is computed by (16). Following the AnyBoost approach [40], the weight on each instance is given as the derivative of the cost function with respect to a change in the score of that instance. For clarity, we abbreviate $p(y_{i}=1\mid\mathcal{X}_{i})$ as $p_{i}$, $p_{c}(y_{i}=1\mid X_{i}^{c})$ as $p_{i}^{c}$, $p(y_{i}=1\mid x_{ij}^{c})$ as $p_{ij}^{c}$, and $H(x_{ij}^{c})$ as $H_{ij}^{c}$. The derivative of the log-likelihood is:

$$\frac{\partial\log\mathcal{L}}{\partial H_{ij}^{c}} = \frac{\partial\log\mathcal{L}}{\partial p_{i}}\;\frac{\partial p_{i}}{\partial p_{i}^{c}}\;\frac{\partial p_{i}^{c}}{\partial p_{ij}^{c}}\;\frac{\partial p_{ij}^{c}}{\partial H_{ij}^{c}}, \qquad (21)$$

where the derivatives are computed as follows:

$$\frac{\partial\log\mathcal{L}}{\partial p_{i}} = \frac{y_{i}-p_{i}}{p_{i}\,(1-p_{i})}, \qquad (22)$$

$$\frac{\partial p_{i}}{\partial p_{i}^{c}} = \frac{1-p_{i}}{1-p_{i}^{c}}, \qquad (23)$$

$$\frac{\partial p_{i}^{c}}{\partial p_{ij}^{c}} = \frac{1-p_{i}^{c}}{1-p_{ij}^{c}}, \qquad (24)$$

$$\frac{\partial p_{ij}^{c}}{\partial H_{ij}^{c}} = p_{ij}^{c}\,\big(1-p_{ij}^{c}\big). \qquad (25)$$

Therefore, (21) is simplified as follows:

$$w_{ij}^{c} = \frac{\partial\log\mathcal{L}}{\partial H_{ij}^{c}} = \frac{\big(y_{i}-p_{i}\big)\,p_{ij}^{c}}{p_{i}}. \qquad (26)$$

The coefficient $\lambda_{r}$ is determined using a line search to maximize $\log\mathcal{L}$. During the implementation, Classification And Regression Trees (CARTs) [41] are used to build the weak classifiers $h_{r}(\cdot)$.
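The following sketch summarizes the scoring and instance weighting of MC-MIL under the nested noisy-or formulation reconstructed above; the boosting loop over CART weak learners and the line search over the coefficients are omitted, so it is an illustration rather than the full training procedure.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def bag_probability(instance_scores):
    """Noisy-or over the instances of one component (cf. Eq. (17))."""
    p_ij = sigmoid(np.asarray(instance_scores, dtype=float))
    return 1.0 - np.prod(1.0 - p_ij), p_ij

def mc_mil_score(component_scores):
    """Fuse the components (e.g., STL and PU) with a second noisy-or (cf. Eq. (16))."""
    p_c = [bag_probability(s)[0] for s in component_scores]
    return 1.0 - np.prod(1.0 - np.asarray(p_c))

def instance_weights(component_scores, label):
    """AnyBoost weights of every instance, w_ij = (y - p) * p_ij / p (cf. Eq. (26))."""
    p = mc_mil_score(component_scores)
    weights = []
    for s in component_scores:
        _, p_ij = bag_probability(s)
        weights.append((label - p) * p_ij / max(p, 1e-12))
    return weights
```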
TABLE III: Performance (AUC/AP) of the PU-based SS feature with respect to the number of topics.

| # Topics | LSTM AUC | LSTM AP | Transformer AUC | Transformer AP | RF AUC | RF AP | GBT AUC | GBT AP | MKL AUC | MKL AP | LR AUC | LR AP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.8647 | 0.4851 | 0.8413 | 0.6014 | 0.9080 | 0.6947 | 0.8929 | 0.6355 | 0.9024 | 0.5860 | 0.5726 | 0.2061 |
| 40 | 0.8253 | 0.4526 | 0.8517 | 0.6148 | 0.8783 | 0.6701 | 0.8919 | 0.5865 | 0.8003 | 0.5738 | 0.5445 | 0.3470 |
| 60 | 0.8780 | 0.4719 | 0.8319 | 0.6254 | 0.9196 | 0.7676 | 0.9095 | 0.8010 | 0.8364 | 0.6038 | 0.6647 | 0.4706 |
| 80 | 0.8426 | 0.4613 | 0.8501 | 0.6024 | 0.9085 | 0.7338 | 0.9206 | 0.7685 | 0.8571 | 0.6447 | 0.7778 | 0.5696 |
| 100 | 0.8516 | 0.4698 | 0.8629 | 0.6028 | 0.9306 | 0.6701 | 0.8994 | 0.7044 | 0.8260 | 0.5837 | 0.6687 | 0.2497 |
VI Experiments and Discussion
VI-A Experiment Setup
Real Data. We publicly release the data set used in our experiments; it can be accessed at https://github.com/pangjunbiao/IDS-BJ. The data set, referred to as "IDS@BJ", includes the following information for each individual taxi: (plate ID, origin time, origin longitude, origin latitude, destination time, destination longitude, destination latitude). The period for identifying one-shift taxis with the IDS activity ranges from Jan. 2015 to Sep. 2016. Because the taxis with IDS activity are sparse, manually determining whether a randomly picked taxi has IDS activity depends on a law enforcer's personal experience.
Evaluation metrics: In our experiments, we use two kinds of evaluation criteria to evaluate the effectiveness as follows:
1. Precision and Recall Curve (PRC): The PRC plots the Precision versus the Recall of a classifier at different decision thresholds. The Average Precision (AP) of the PRC is used to numerically indicate the performance of a detection system. The higher the AP is, the better the classifier is [42].
2. Receiver Operating Characteristic (ROC) curve: The ROC curve plots the True Positive Rate (TPR) versus the False Positive Rate (FPR) of a classifier at different decision thresholds. The Area Under the Curve (AUC) is further used to evaluate the performance of a classifier. The higher the AUC is, the better the classifier is.
Due to the imbalance of the test set, AP is more reasonable than AUC to evaluate the performance of classifiers in this paper.
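Both criteria can be computed directly with scikit-learn, e.g.:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(y_true, y_score):
    """AUC of the ROC curve and AP of the precision-recall curve."""
    return {"AUC": roc_auc_score(y_true, y_score),
            "AP": average_precision_score(y_true, y_score)}
```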
Experiment Settings. Both the STL behavior and the PU behavior are extracted for each day. Three scales of MSP are built as in Fig. 4(b). The SS function in (12) is defined as the cosine distance between two features. To better align the IDS behavior over the long-range time bucket, a sliding window with step 4 is used to split it into 26 overlapped SS features; that is, $J_{1}$ and $J_{2}$ are both equal to 26.
The IDS discovering system is implemented in Python 2.7 on a 3.30 GHz machine with 8 GB of RAM.
VI-B Effectiveness on the SS-based Behavior Features
Baseline Classifiers. We evaluate our behavior-based features on the following baseline classifiers:
1. Long Short-Term Memory (LSTM) [43]: LSTM is a type of Recurrent Neural Network (RNN) capable of learning long-term dependencies in sequential data. The LSTM model was configured with a batch size of 64 and the Adam optimizer with weight decay for regularization; a reduce-learning-rate-on-plateau scheduler reduces the learning rate with a patience of 5 epochs.
2. Transformer [44]: The Transformer utilizes self-attention mechanisms and is highly effective for classification tasks that involve complex relationships within the data [45]. We utilized the Adam optimizer with weight decay for regularization. The learning rate scheduler was reduce-learning-rate-on-plateau, with a reduction factor of 0.5 and a patience of 5. The batch size is 32.
3. Logistic Regression (LR): LR is a widely used statistical model that uses a logistic function to model a binary dependent variable.
4. Random Forest (RF) [46]: RF, an ensemble of CARTs [41], is trained with the bagging method. Compared with single CARTs, RF tends to prevent over-fitting by averaging a set of "shallow" CARTs. There are four important hyper-parameters in RF, i.e., the maximum tree depth, the maximum number of leaf nodes, the minimum number of points per leaf node, and the number of trees. These hyper-parameters are chosen by the random searching method [47] (maximum tree depth from 2 to 4; maximum number of leaf nodes from 2 to 8; minimum number of points per leaf node from 1 to 3; number of trees from 50 to 100); a sketch of this search is given after this list. Meanwhile, the number of features to consider per split in a decision tree is the square root of the total number of features [48].
5. Gradient Boosting Trees (GBT) [49]: GBT combines a set of weak CARTs into a single strong learner in an iterative fashion. Compared with RF and CARTs, GBT allows arbitrary differentiable loss functions to be used. In this paper, the exponential loss is used due to its excellent binary classification ability. Specifically, the tree-specific parameters are determined by random searching [47], while the boosting parameters are selected by 5-fold cross validation (the learning rate ranges from 0.1 to 0.5 and the number of sequential trees ranges from 50 to 100).
6. Multiple Kernel Learning (MKL) [50]: MKL linearly combines multiple kernels into Support Vector Machines (SVMs), efficiently fusing multiple types of features into a kernel; besides, MKL can automatically determine which kernels are useful. In this paper, MKL uses 5 different types of kernels, i.e., the polynomial kernel, Gaussian kernel, linear kernel, intersection kernel, and Chi-squared kernel, and 10 kernel matrices are used in MKL.
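As referenced in the RF item above, the random hyper-parameter search over the stated ranges can be sketched as follows; scikit-learn's RandomizedSearchCV is an assumed stand-in for the authors' implementation.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

def tune_random_forest(X, y):
    """Random search over the RF hyper-parameters in the ranges listed above."""
    param_dist = {
        "max_depth": randint(2, 5),            # maximum tree depth: 2-4
        "max_leaf_nodes": randint(2, 9),       # maximum leaf nodes: 2-8
        "min_samples_leaf": randint(1, 4),     # minimum points per leaf: 1-3
        "n_estimators": randint(50, 101),      # number of trees: 50-100
    }
    search = RandomizedSearchCV(
        RandomForestClassifier(max_features="sqrt"),
        param_distributions=param_dist,
        n_iter=30, scoring="average_precision", cv=5)
    search.fit(X, y)
    return search.best_estimator_
```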
We aim to demonstrate that the proposed features are discriminative in identifying the nuanced IDS activities across a range of classifiers from the classical methods to the advanced neural network architectures.
TABLE IV: Performance (AUC/AP) of the STL-based SS feature with respect to the number of Gaussian components.

| # Gaussian | LSTM AUC | LSTM AP | Transformer AUC | Transformer AP | RF AUC | RF AP | GBT AUC | GBT AP | MKL AUC | MKL AP | LR AUC | LR AP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.7607 | 0.4324 | 0.8141 | 0.3119 | 0.8248 | 0.4008 | 0.7468 | 0.3028 | 0.7290 | 0.3095 | 0.8056 | 0.4867 |
| 4 | 0.8357 | 0.4843 | 0.8417 | 0.3450 | 0.7401 | 0.4183 | 0.7495 | 0.3508 | 0.7416 | 0.4155 | 0.7356 | 0.4067 |
| 6 | 0.7972 | 0.4435 | 0.8594 | 0.3295 | 0.7182 | 0.3707 | 0.6875 | 0.3790 | 0.7553 | 0.3570 | 0.7014 | 0.3376 |
| 8 | 0.8795 | 0.4600 | 0.8674 | 0.3197 | 0.7607 | 0.5316 | 0.7698 | 0.4880 | 0.7736 | 0.3281 | 0.7772 | 0.4616 |
| 10 | 0.8226 | 0.4554 | 0.8015 | 0.3018 | 0.7626 | 0.5064 | 0.7281 | 0.3902 | 0.7736 | 0.5054 | 0.7623 | 0.4602 |
TABLE V: Performance (AUC/AP) of the PU-based SS feature with respect to the number of words (each classifier uses its optimal number of topics).

| # Words | RF (#Topics=100) AUC | RF AP | GBT (#Topics=60) AUC | GBT AP | MKL (#Topics=80) AUC | MKL AP | LR (#Topics=80) AUC | LR AP | LSTM (#Topics=120) AUC | LSTM AP | Transformer (#Topics=150) AUC | Transformer AP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 0.9532 | 0.8238 | 0.9492 | 0.8008 | 0.9090 | 0.5321 | 0.6672 | 0.2878 | 0.8217 | 0.4753 | 0.8119 | 0.5813 |
| 100 | 0.9311 | 0.8111 | 0.9467 | 0.7514 | 0.9135 | 0.6966 | 0.7054 | 0.3665 | 0.8014 | 0.4458 | 0.8551 | 0.6048 |
| 200 | 0.9075 | 0.6300 | 0.9165 | 0.7995 | 0.9224 | 0.5860 | 0.7778 | 0.5696 | 0.8419 | 0.4618 | 0.8217 | 0.5973 |
| 400 | 0.8693 | 0.5964 | 0.8869 | 0.6529 | 0.8468 | 0.4327 | 0.7466 | 0.3964 | 0.8668 | 0.4456 | 0.8438 | 0.6152 |
| 600 | 0.8547 | 0.5836 | 0.8773 | 0.6030 | 0.8304 | 0.4306 | 0.5535 | 0.1842 | 0.7935 | 0.4413 | 0.8565 | 0.6025 |
| 800 | 0.8615 | 0.5719 | 0.8572 | 0.5388 | 0.7860 | 0.4664 | 0.7360 | 0.4859 | 0.8169 | 0.4517 | 0.8388 | 0.5914 |
| 1000 | 0.8457 | 0.6016 | 0.8411 | 0.5437 | 0.8211 | 0.4326 | 0.7144 | 0.2372 | 0.8314 | 0.4473 | 0.8253 | 0.5999 |
VI-B1 The STL-based SS Feature
Table IV shows the effectiveness of the number of Gaussian components on the STL-based SS feature. There are two observations from Table IV:
- The optimal number of Gaussian components is almost the same within the neural-network-based methods and within the traditional machine learning methods, respectively. For instance, 4 and 8 Gaussian components nearly obtain the optimal AUC and AP for the neural networks and the traditional methods, respectively.
- RF achieves the best performance in terms of AP. As expected, LSTM and Transformer achieve slightly lower performance in terms of AP, because neural networks usually need more training data than these traditional methods.
In summary, the optimal number of Gaussian components is as follows: 8 components for both RF and GBT, 10 components for MKL, and 4 components for both LSTM and Transformer.
VI-B2 The PU-based SS Feature
Table III shows that GBT achieves the best performance (i.e., 0.9206 AUC and 0.8010 AP) among these baseline classifiers. Interestingly, RF and GBT are more effective than MKL in terms of AP on this feature, which suggests that too many nonlinear kernels tend to overfit the MKL classifier.
Moreover, different classifiers require different numbers of topics to obtain a good performance. Compared with the STL-based SS features in Table IV, the PU-based SS feature is more sensitive to the hyper-parameters. Once the optimal number of topics is determined for each classifier, Table V shows the effectiveness of the classifiers with respect to the number of words. The optimal number of words for different classifiers is slightly different; for instance, it is 50, 50, 100, 200, 70, and 100 for RF, GBT, MKL, LR, LSTM and Transformer, respectively.
TABLE VI: The gains of Self-Similarity and pooling (AUC/AP with RF as the classifier).

| Choices | STL Behavior AUC | STL Behavior AP | PU Behavior AUC | PU Behavior AP |
|---|---|---|---|---|
| Without SS | 0.7480 | 0.4012 | 0.7849 | 0.4182 |
| SS Without Pooling | 0.7266 | 0.4258 | 0.8047 | 0.4257 |
| SS With Pooling | 0.7391 | 0.4600 | 0.8294 | 0.4310 |
In summary, the optimal pairs of (number of words, number of topics) for the PU-based features are (50, 100), (50, 60), (100, 80), (200, 80), (70, 120) and (100, 150) for RF, GBT, MKL, LR, LSTM and Transformer, respectively. Note that, instead of a full cross-validation over the two hyper-parameters, we determine the optimal parameters of the PU-based features by the alternating selection method, which is widely used to determine hyper-parameters [51].
VI-B3 The gains of Self-Similarity and Pooling
Table VI shows how SS and pooling affect the performances of classifiers:
1. "Without SS" means that either the STL behavior or the PU behavior is directly fed into a classifier;
2. "SS Without Pooling" means that the SS values in (12) are first concatenated into a feature vector which is then fed into a classifier;
3. "SS With Pooling" means that the features in (15) are fed into a classifier.
RF is used as a baseline classifier. Table VI shows that the combination of SS and pooling significantly improves the performances of the behavior-based features. For instance, AUC and AP of the PU-based feature are improved from 0.7849 to 0.8294 and from 0.4182 to 0.4310, respectively.
For the STL-based feature, we also notice that the AUC drops from 0.7480 ("Without SS") to 0.7391 ("SS With Pooling"); in contrast, the AP improves from 0.4012 to 0.4600. Overall, combining SS and pooling significantly increases the discriminative ability of the driver behaviors.


Table VI shows that "Without SS" slightly outperforms "SS Without Pooling" in terms of AUC, i.e., 0.7480 v.s. 0.7266. The explanation is that some elements in STL are already very discriminative. To verify this intuition, Fig. 5 shows the distributions of the time when a taxi starts working and of the sleeping duration. As expected, the sleeping duration in Fig. 5(b) is discriminative enough in the behavior space to separate the positive samples from the negative ones. In contrast, the start working time in Fig. 5(a) barely contains any discriminative power in the behavior space.
VI-B4 The combination of the STL-based and the PU-based SS features
The AUC values obtained by combining the STL-based and PU-based features for each classifier are presented in Table VII. Note that, to obtain the best performance of the different classifiers, the optimal parameters determined in Subsections VI-B1 and VI-B2 are adopted.
TABLE VII: AUC of the combination of the STL-based and the PU-based SS features.

| | GBT | LR | MKL | RF | LSTM | Transformer |
|---|---|---|---|---|---|---|
| AUC | 0.9472 | 0.8561 | 0.9488 | 0.9779 | 0.7487 | 0.6491 |
Table VII shows that GBT, MKL, and RF achieve strong performances, whereas the LSTM and Transformer classifiers are relatively low; MKL slightly outperforms GBT in terms of AUC. The explanation is that GBT, RF, and MKL not only efficiently fuse multiple types of features but also have non-linear modeling ability.
In summary, the empirical experiments show that the proposed behavior features have a good generalization ability across different classifiers.
VI-C Effectiveness of MC-MIL
In this subsection, we verify the effectiveness of the proposed MC-MIL by comparing with the following baseline methods:
1. Multiple Instance Boosting (MIL) [52]: MIL aligns the features by iteratively grouping a set of CARTs into a single strong learner. In this paper, the noisy-or model in [52] is used. There are two categories of hyper-parameters in MIL, i.e., the tree-specific parameters and the boosting parameters. In our setting, each instance is the concatenation of the SS-based STL feature (7) and the SS-based PU feature (10) within a time bucket, and a bag contains 7 instances.
2. Multiple Classifier Boosting (MCL) [39]: MCL aligns the samples into clusters in an iterative fashion. There are two categories of hyper-parameters in MCL, the tree-specific parameters and the boosting parameters. These parameters are the same as in MIL.
VI-C1 Effectiveness of MC-MIL
Table VIII shows that MIL and MCL achieve very similar performances. As expected, MC-MIL outperforms both MIL and MCL in terms of AUC and AP. The explanation is that MC-MIL combines the advantages of MIL and MCL: it not only efficiently aligns the behaviors but also fuses multiple features. During training, the noisy-or model in MCL requires that each classifier be discriminative enough to obtain a good result.
In summary, the empirical experiments show that the proposed MC-MIL has a good generalization ability to handle both the alignment problem and the feature fusion one.
TABLE VIII: Comparison between MIL, MCL, and the proposed MC-MIL.

| MIL AUC | MIL AP | MCL AUC | MCL AP | MC-MIL AUC | MC-MIL AP |
|---|---|---|---|---|---|
| 0.8547 | 0.7124 | 0.8457 | 0.7290 | 0.8937 | 0.7978 |
VI-D Efficiency Evaluation
Once the models for the STL behavior and the PU behavior are learned, the time complexity of encoding features per taxi is linear with respect to the number of samples. Table IX shows that the STL-based SS feature and the PU-based SS feature consume 125.2 and 278.2 milliseconds per taxi, respectively. This means that the suspected taxis with IDS activity can be discovered within 7.5 hours on a single PC once the IDS model has been built. Note that the data set here was collected from all taxis in Beijing within 16 days; that is, the list of suspected taxis would be updated every 16 days.
Although the data sizes for discovering taxis with IDS may be prohibitively large for a single PC or a server, the proposed method can still be handled efficiently by the following incremental or online methods:
- The online GMM [53] incrementally updates the STL behavior model as new samples arrive;
- The online LDA [54] scales up to the number of samples from all taxis.
Therefore, if these engineering details are properly implemented, the system would efficiently compute the suspected taxis with the IDS activity.
TABLE IX: Time cost of the proposed method*.

| RF (Dep.=3) Training (ms) | RF (Dep.=3) Testing (ms) | PU Behavior Modeling (s) | PU Encoding by MSP (ms) | STL Behavior Modeling (s) | STL Encoding by MSP (ms) |
|---|---|---|---|---|---|
| 497.0 | 3.9 | 375.4 | 278.2 | 4.5183 | 125.2 |
| 497.0 | 3.9 | 42.3 | 278.2 | 602.44 | 125.2 |

* ms means milliseconds, s means seconds.
VII Conclusions
This paper has described a computational approach to discovering one-shift taxis with the IDS activity. Based on the combination of the GPS traces and the records of taximeters, we propose a framework consisting of three phases: 1) modeling the STL behavior and the PU behavior for a taxi driver; 2) combining SS and MSP to map the behaviors into the IDS activity aligned feature space; 3) identifying the taxis with the IDS activity via the proposed MC-MIL classification. Extensive experiments are conducted on a real-life data set. The results demonstrate that the behavior-based features and the proposed MC-MIL achieve acceptable accuracy and efficiency.
In future work, we aim to enhance our work as follows: 1) the limited number of PU behaviors indicates that a low-rank constraint may yield more efficient features than applying LDA to the PU points; 2) we will attempt to construct a STL dictionary from the STL points, where each element in the dictionary represents a canonical driver behavior; 3) building efficient driver behaviors for two-shift taxis is an interesting direction; and 4) we will try to develop a semi-supervised method to reduce the amount of required training data.
References
- [1] D. Zhang, L. Sun, B. Li, C. Chen, G. Pan, S. Li, and Z. Wu, “Understanding taxi service strategies from taxi gps traces,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 1, pp. 123–135, 2014.
- [2] T. Xu, H. Zhu, X. Zhao, Q. Liu, H. Zhong, E. Chen, and H. Xiong, “Taxi driving behavior analysis in latent vehicle-to-vehicle networks: A social influence perspective,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1285–1294.
- [3] J. Pang, J. Huang, X. Yang, Z. Wang, H. Yu, Q. Huang, and B. Yin, “Discovering fine-grained spatial pattern from taxi trips: Where point process meets matrix decomposition and factorization,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 10, pp. 3208–3219, 2017.
- [4] S. Zhang and Z. Wang, “Inferring passenger denial behavior of taxi drivers from large-scale taxi traces,” PloS one, vol. 11, no. 11, p. e0165597, 2016.
- [5] K. Hansen, “Education and the crime-age profile,” British Journal of Criminology, vol. 43, no. 1, pp. 141–168, 2003.
- [6] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang, “T-drive: driving directions based on taxi trajectories,” in Proceedings of the 18th SIGSPATIAL International conference on advances in geographic information systems, 2010, pp. 99–108.
- [7] R. H. Fazio and M. P. Zanna, “Direct experience and attitude-behavior consistency,” in Advances in experimental social psychology. Elsevier, 1981, vol. 14, pp. 161–202.
- [8] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the fisher vector: Theory and practice,” International journal of computer vision, vol. 105, pp. 222–245, 2013.
- [9] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
- [10] H. Cao, N. Mamoulis, and D. W. Cheung, “Mining frequent spatio-temporal sequential patterns,” in Fifth IEEE international conference on data mining (ICDM’05). IEEE, 2005, pp. 8–pp.
- [11] M. Wang, S. Yang, Y. Sun, and J. Gao, “Human mobility prediction from region functions with taxi trajectories,” PloS one, vol. 12, no. 11, p. e0188735, 2017.
- [12] J. Yuan, Y. Zheng, X. Xie, and G. Sun, “Driving with knowledge from the physical world,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011, pp. 316–324.
- [13] Y. Zheng, Y. Liu, J. Yuan, and X. Xie, “Urban computing with taxicabs,” in Proceedings of the 13th international conference on Ubiquitous computing, 2011, pp. 89–98.
- [14] K. Zhang, D. Sun, S. Shen, and Y. Zhu, “Analyzing spatiotemporal congestion pattern on urban roads based on taxi gps data,” Journal of Transport and Land Use, vol. 10, no. 1, pp. 675–694, 2017.
- [15] F. Giannotti, M. Nanni, D. Pedreschi, F. Pinelli, C. Renso, S. Rinzivillo, and R. Trasarti, “Unveiling the complexity of human mobility by querying and mining massive trajectory data,” The VLDB Journal, vol. 20, pp. 695–719, 2011.
- [16] G. Pan, G. Qi, Z. Wu, D. Zhang, and S. Li, “Land-use classification using taxi gps traces,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 1, pp. 113–123, 2012.
- [17] D. Zhang, N. Li, Z.-H. Zhou, C. Chen, L. Sun, and S. Li, “ibat: detecting anomalous taxi trajectories from gps traces,” in Proceedings of the 13th international conference on Ubiquitous computing, 2011, pp. 99–108.
- [18] Z. Jianqin, Q. Peiyuan, D. Yingchao, D. Mingyi, and L. Feng, “A space-time visualization analysis method for taxi operation in beijing,” Journal of Visual Languages & Computing, vol. 31, pp. 1–8, 2015.
- [19] C. Chen, D. Zhang, P. Samuel Castro, N. Li, L. Sun, and S. Li, “Real-time detection of anomalous taxi trajectories from gps traces,” in International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services. Springer, 2011, pp. 63–74.
- [20] X. Li, G. Pan, Z. Wu, G. Qi, S. Li, D. Zhang, W. Zhang, and Z. Wang, “Prediction of urban human mobility using large-scale taxi traces and its applications,” Frontiers of computer science, vol. 6, pp. 111–121, 2012.
- [21] L. Ding, M. Jahnke, S. Wang, and K. Karja, “Understanding spatiotemporal mobility patterns related to transport hubs from floating car data,” in Proc. Int. Conf. Location-Based Services, 2016, pp. 175–185.
- [22] C. Chen, D. Zhang, N. Li, and Z.-H. Zhou, “B-planner: Planning bidirectional night bus routes using large-scale taxi gps traces,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 4, pp. 1451–1465, 2014.
- [23] C. Peng, X. Jin, K.-C. Wong, M. Shi, and P. Liò, “Collective human mobility pattern from taxi trips in urban area,” PloS one, vol. 7, no. 4, p. e34487, 2012.
- [24] B. Li, D. Zhang, L. Sun, C. Chen, S. Li, G. Qi, and Q. Yang, “Hunting or waiting? discovering passenger-finding strategies from a large-scale real-world taxi dataset,” in 2011 IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops). IEEE, 2011, pp. 63–68.
- [25] X. Zhou, Y. Ding, F. Peng, Q. Luo, and L. M. Ni, “Detecting unmetered taxi rides from trajectory data,” in 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017, pp. 530–535.
- [26] M. A. Yazici, C. Kamga, and A. Singhal, “Modeling taxi drivers’ decisions for improving airport ground access: John f. kennedy airport case,” Transportation Research Part A: Policy and Practice, vol. 91, pp. 48–60, 2016.
- [27] S. Liu, L. M. Ni, and R. Krishnan, “Fraud detection from taxis’ driving behaviors,” IEEE Transactions on Vehicular Technology, vol. 63, no. 1, pp. 464–472, 2013.
- [28] C. Ou, “Deep learning-based driver behavior modeling and analysis,” 2019.
- [29] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
- [30] Y. Song, Q. Li, H. Huang, D. Feng, M. Chen, and W. Cai, “Low dimensional representation of fisher vectors for microscopy image classification,” IEEE transactions on medical imaging, vol. 36, no. 8, pp. 1636–1649, 2017.
- [31] M. Jain, J. C. Van Gemert, and C. G. Snoek, “What do 15,000 object categories tell us about classifying and localizing actions?” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 46–55.
- [32] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Fisher vector faces in the wild.” in British Machine Vision Conference, vol. 2, no. 3, 2013, p. 4.
- [33] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” arXiv preprint arXiv:1405.3531, 2014.
- [34] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613.
- [35] T. Phiboonbanakit and T. Horanont, “How does taxi driver behavior impact their profit? discerning the real driving from large scale gps traces,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, 2016, pp. 1390–1398.
- [36] X. Wang and E. Grimson, “Spatial latent dirichlet allocation,” Advances in neural information processing systems, vol. 20, 2007.
- [37] A. Anandkumar, D. P. Foster, D. J. Hsu, S. M. Kakade, and Y.-K. Liu, “A spectral algorithm for latent dirichlet allocation,” Advances in neural information processing systems, vol. 25, 2012.
- [38] M. I. Shapiai, Z. Ibrahim, M. Khalid, L. W. Jau, and V. Pavlovich, “A non-linear function approximation from small samples based on nadaraya-watson kernel regression,” in 2010 2nd International Conference on Computational Intelligence, Communication Systems and Networks, pp. 28–32.
- [39] T.-K. Kim and R. Cipolla, “Mcboost: Multiple classifier boosting for perceptual co-clustering of images and visual features,” Advances in Neural Information Processing Systems, vol. 21, 2008.
- [40] L. Mason, J. Baxter, P. L. Bartlett, and M. Frean, "Boosting algorithms as gradient descent," in Advances in Neural Information Processing Systems, 1999.
- [41] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen, Classification and Regression Trees. Taylor & Francis, 1984.
- [42] "Precision and recall curve." [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html
- [43] R. C. Staudemeyer and E. R. Morris, “Understanding lstm–a tutorial into long short-term memory recurrent neural networks,” arXiv preprint arXiv:1909.09586, 2019.
- [44] Y. Liu, G. Sun, Y. Qiu, L. Zhang, A. Chhatkuli, and L. Van Gool, “Transformer in convolutional neural networks,” arXiv preprint arXiv:2106.03180, vol. 3, 2021.
- [45] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, “Transformer in transformer,” Advances in Neural Information Processing Systems, vol. 34, pp. 15 908–15 919, 2021.
- [46] L. Breiman, “Random forests,” Machine learning, vol. 45, pp. 5–32, 2001.
- [47] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization.” Journal of machine learning research, vol. 13, no. 2, 2012.
- [48] F. Nan, J. Wang, and V. Saligrama, “Feature-budgeted random forest,” in International conference on machine learning. PMLR, 2015, pp. 1983–1991.
- [49] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001.
- [50] M. Gönen and E. Alpaydın, “Multiple kernel learning algorithms,” The Journal of Machine Learning Research, vol. 12, pp. 2211–2268, 2011.
- [51] L. Liu and P. Fieguth, “Texture classification from random features,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 3, pp. 574–586, 2012.
- [52] C. Zhang, J. Platt, and P. Viola, “Multiple instance boosting for object detection,” Advances in neural information processing systems, vol. 18, 2005.
- [53] P. Jaini and P. Poupart, “Online and distributed learning of gaussian mixture models by bayesian moment matching,” arXiv preprint arXiv:1609.05881, 2016.
- [54] M. Hoffman, F. Bach, and D. Blei, “Online learning for latent dirichlet allocation,” advances in neural information processing systems, vol. 23, 2010.
Junbiao Pang received the B.S. degree and the M.S. degree in computational fluid dynamics and computer science from the Harbin Institute of Technology, Harbin, China, in 2002 and 2004, respectively, and the Ph.D. from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2011. He is currently an Associate Professor with the Faculty of Information Technology, Beijing University of Technology (BJUT), Beijing, China. He has authored or coauthored approximately 20 academic papers in publications such as the IEEE TRANSACTIONS ON IMAGE PROCESSING, ECCV, ICCV, and ACM Multimedia. His research interests include multimedia and machine learning for transportation problems.

Muhammad Ayub Sabir earned his B.S. degree in Information Technology from the University of Sargodha, Pakistan, in 2017, and later completed his M.S. degree in the same field at Government College University Faisalabad, Pakistan, in 2019. Currently, he is pursuing a Ph.D. at Beijing University of Technology in the Department of Control Science and Engineering. His research interests span various areas, including Machine Learning, Computer Vision, and Image Processing.

Anjing Hu received the B.E. degree in electrical engineering and automation from Shanghai University, China, in 2016, and she is currently working towards the M.S. degree in computer science and technology at Beijing University of Technology, Beijing, China. Her research interests include machine learning, image content analysis, and information retrieval.

Zuyun Wang received the B.S. degree in Software Engineering from the Capital Normal University, Beijing, China, in 1992, and the M.S. degree in Computer Application Technology from the Beijing University of Technology, Beijing, China, in 2003. She is currently the director of the Information Center division of the Beijing Municipal Transportation Law Enforcement Corps, Beijing, China. Her research interests include the application of key technologies for intelligent transportation systems.

Xue Yang received the B.S. degree and the M.S. degree in computer science and technology from the North China University of Water Resources and Electric Power, Zhengzhou, China, in 2001, and the Beijing University of Posts and Telecommunications, Beijing, China, in 2005, respectively. She is currently a senior engineer at the Beijing Transportation Information Center, Beijing, China. Her research interests include traffic data analysis for transportation problems.

Haitao Yu received the B.S. degree in management information systems from the Beijing Information Science and Technology University, Beijing, China, in 2005, and the M.S. degree in computer software from Beihang University, Beijing, China, in 2009, and is currently working towards the Ph.D. degree in computer science and technology at Beihang University, Beijing, China. He is currently the associate dean with the Beijing Transportation Information Center, Beijing, China. His research interests include traffic data analysis for the key technologies in traffic information service.

Qingming Huang (M'04 - SM'08) received the B.S. degree in computer science and the Ph.D. degree in computer engineering from the Harbin Institute of Technology, Harbin, China, in 1988 and 1994, respectively. He is currently a Professor with the University of the Chinese Academy of Sciences (CAS), Beijing, China, and an Adjunct Professor with the Institute of Computing Technology, CAS, China. He has authored or coauthored more than 300 academic papers in prestigious international journals including the IEEE TRANSACTIONS ON MULTIMEDIA and the IEEE TRANSACTIONS ON IMAGE PROCESSING, and top-level conferences such as ACM Multimedia, CVPR, AAAI, IJCAI and VLDB. His research interests include multimedia computing, image processing, computer vision, pattern recognition, and machine learning.