Towards Scalable Defense Against Intimate Partner Infiltration
Abstract
Intimate Partner Infiltration (IPI)—a type of Intimate Partner Violence (IPV) that typically requires physical access to a victim’s device—is a pervasive concern in the United States, often manifesting through digital surveillance, control, and monitoring. Unlike conventional cyberattacks, IPI perpetrators leverage close proximity and personal knowledge to circumvent standard protections, underscoring the need for targeted interventions. While security clinics and other human-centered approaches effectively tailor solutions for survivors, their scalability remains constrained by resource limitations and the need for specialized counseling. In this paper, we present AID, an Automated IPI Detection system that continuously monitors for unauthorized access and suspicious behaviors on smartphones. AID employs a two-stage architecture to process multimodal signals stealthily and preserve user privacy. A brief calibration phase upon installation enables AID to adapt to each user’s behavioral patterns, achieving high accuracy with minimal false alarms. Our 27-participant user study demonstrates that AID achieves highly accurate detection of non-owner access and fine-grained IPI-related activities, attaining an end-to-end top-3 F1 score of 0.981 with a false positive rate of 4%. These findings suggest that AID can serve as a forensic tool within security clinics, scaling their ability to identify IPI tactics and deliver personalized, far-reaching support to survivors.
1 Introduction
Intimate Partner Violence (IPV) is a prevalent issue in the United States, affecting approximately 47% of women and 44% of men during their lifetime [20]. Perpetrators of IPV often monitor, control, and harass victims through technology and physical devices that are prolific in our daily lives [6, 26, 15, 4]. For example, an IPV abuser might exploit a home router to monitor a victim’s smartphone [31] and Internet activity [15], plant a GPS tracker to obtain the victim’s real-time location, or hide a camera to spy on the victim’s daily activities [6]. Among these tactics, Intimate Partner Infiltration (IPI) is particularly problematic, as it often requires physical access to the victim’s devices.
IPI-related intrusions are unique compared to typical problems in computer systems security because of the vastly different characteristics of the threat model and perpetrator. As shown in Table 1, IPI perpetrators typically arise from the general population, who may not have the same technical background as traditional intruders. However, the lack of technical expertise does not make defending against intrusive behaviors easier. On the contrary, intimate relationships frequently give abusers physical access to victims’ personal devices, which often bypasses the need for sophisticated hacking. For instance, these abusers are often already registered on shared devices (e.g., fingerprints previously added to the device’s authentication system), or can obtain access through educated guesses and coercion. This unique combination of physical proximity and intimate knowledge of the victim’s behaviors allows abusers to exploit trust rather than technical vulnerabilities, creating challenges that are inherently different from traditional attacks.
There has been extensive research on IPV and IPI across psychology, health, engineering, and computer science. Recently, a class of approaches called clinical computer security [18, 32] has gained traction for mitigating IPV and IPI. These approaches are administered through Security Clinics [5, 32, 18] and leverage a combination of clinical interviews, consultations, and technical support to provide tailored interventions for IPV victims. However, such clinics are difficult to scale. For instance, it is challenging to expand the service into rural areas, where limited human resources constrain its availability.
In addition to logistical barriers, human factors also need careful consideration, further complicating scalability. IPI survivors often arrive at clinics carrying deep emotional scars and traumas from their experiences [25]. These individualized needs necessitate detailed interviews and therapy sessions, making the work labor-intensive for experts and consultants. The emotional and psychological demands of these interactions can also lead to burnout among clinic staff, further limiting capacity.
A few automation tools [2], such as browser wipers or malware scanners, have been adapted to assist with assessing survivors’ situations, offering clinics some relief and more comprehensive evaluations. While these tools can provide a one-time screening, they are inadequate for continuous monitoring or for addressing more sophisticated threats, such as privacy breaches or malicious configurations of personal devices. While existing technologies can augment the services of security clinics, the nuanced and complex nature of IPI cases still requires human expertise for effective intervention.
Therefore, it would be highly beneficial to develop an automated and continuous solution tailored to the IPI context. Such a system must address several critical challenges to ensure effectiveness and user safety. The first challenge is achieving adaptability. Current tools lack the flexibility to accommodate the diverse needs of IPI survivors due to their reliance on one-time evaluations. A scalable solution must adapt to varying scenarios and user-specific contexts, ensuring broad applicability. The second challenge is maintaining stealthiness. The tool must operate discreetly, avoiding detection by abusers. If an abuser becomes aware of its existence, the victim could face heightened risks of harm. Stealthy design is therefore paramount to protecting users from escalation. The third challenge is ensuring high accuracy and minimal false alarms. IPI victims are often under significant emotional distress, making them particularly sensitive to perceived threats or suspicious events. An unreliable system with frequent false alarms could undermine trust in the tool and add unnecessary stress. The fourth challenge is computational efficiency. Designing such a robust system typically demands significant computing resources, which could lead to high power consumption. Without careful design and optimization, this could interfere with the user’s daily experience on the device, further limiting adoption and usability.
In this paper, we propose AID, an Automated IPI Detection system, for continuous monitoring and detection of potential IPI behaviors on mobile smartphones. AID leverages streams of multimodal sensing data and a two-stage architecture to detect 1) non-owner access and 2) precise behaviors indicative of IPI. AID adopts a short, 5-minute calibration phase to adapt to the phone’s owner upon initial installation. We carefully design AID to be “invisible” to attackers through a deceptive UI and by constraining its access to only data streams that can be collected and processed in the background. AID is also privacy preserving, performing inference and keeping private user information locally. Through a 27-person user study, we verify that AID can reliably detect non-owners and fine-grained IPI behaviors with up to a 0.981 F1 score.
Attacker Profile | Traditional | IPI |
Tech skill | Advanced | Less technical |
Physical Access | No | Mostly yes |
Passcode | Limited | Registered or guessable |
Defense | Patches or updates | Security clinics [5, 32, 18] |
We envision AID as a “forensic” tool that can be used by security clinics to gain a deeper understanding about the individualized needs of victims and improve mitigation measures. To summarize, our main contributions are:
-
We taxonomize IPI tactics and map them to both operating system (OS)-level and physical signals. Our analysis reveals that intimate couples or close friends exhibit statistically significant similarities in physical signatures compared to strangers.
-
We propose AID, an Automated IPI Detection system, for continuously monitoring IPI behaviors on personal smartphones. Through careful design of AID’s IPI detection mechanism, AID is privacy preserving by keeping all data and processing local, while remaining “invisible” to attackers by constraining its access to only data streams that can be collected and processed in the background.
-
To perform scalable detection of IPI behaviors, we propose a two-stage architecture that detects 1) phone usage by non-owners and 2) fine-grained activities that could be indicative of IPI behavior. This architecture adopts a short 5-minute calibration phase to adapt to new phones and owners upon first installation.
-
Through a 27-person user study, we demonstrate that AID can detect non-owners and fine-grained IPI-related behaviors with a 0.981 F1 score and up to 97.5% accuracy.
The rest of the paper is structured as follows. Section 2 gives an overview of previous studies on IPV and continuous authentication on mobile smartphones, illustrating their limitations, which motivates our work. Section 3 describes our threat model and how IPI behaviors can be mapped to signals available to the operating system of mobile smartphones. Section 4 introduces the design of AID, focusing on its scalability and safety, which make robust detection of IPI on mobile devices feasible. Section 5 details and analyzes our user studies and experiments. Section 6 summarizes and concludes this work with some viable future research directions.
Category | Action | Subaction
Benign | General | General
Impersonation | Send content | Send emails
| | Send messages
| | Send reviews
| | Comment
Leakage | View account | View account settings
| | Subscription details
| | Inspect order history
| | View browsing history
| | View payment settings
| View content | View emails
| | See one’s post history
| | Watch history
| | View messages
| | Inspect files
| Upload content | Upload photo
| | Upload video
Modification | Alter account settings | Change profile photo
| | Change email
| | Change username
| | Change password
| | Change address
| Modify content | Delete emails
| | Modify music list
| Alter files | Add a file
| | Delete a file
| | Modify a file
Spyware | Software installation | Software installation
2 Background and Motivation
2.1 IPV Attack Vectors
An expanding body of research underscores how abusers exploit technology to harm their intimate partners [31, 32, 8, 14, 15, 16, 23]. The consequences are far-reaching, ranging from financial harm [3] to physical confrontations and even homicide [28].
Among the tactics explored by [15, 31, 35, 4, 6], three distinct types of technology-facilitated IPV have been identified: remote, proximate, and physical access. Remote tactics include actions such as distributed denial-of-service (DDoS) attacks, SMS bombing, unauthorized remote logins, and location tracking [15, 29]. Proximate tactics involve methods requiring closer physical presence, such as the use of spy cameras, router monitoring, or other forms of nearby surveillance [6]. Physical access tactics, on the other hand, encompass direct interactions with devices, such as installing spyware [8] or creepware [27], accessing phone records, or deploying keylogging tools to monitor keystrokes. These tactics rely on distinct attack vectors.
In this work, we introduce a taxonomy specifically focused on the physical access category of IPV, which we term Intimate Partner Infiltration (IPI). Physical access tactics are particularly challenging to detect, especially on smartphones, because abusers often have legitimate access to these devices and may already know or have registered credentials such as facial recognition profiles or passcodes [31]. This inherent proximity and familiarity make detection more complex, as these actions can easily blend into normal usage patterns. Additionally, addressing physical access tactics requires more careful and nuanced design approaches, as confronting such behaviors carries a higher risk of escalating the situation, potentially putting the victim in greater danger [23, 28, 3].
Based on the literature [31, 35, 4, 6], we further refine and taxonomize IPI tactics into categories, actions, and subactions. The Benign category includes general, non-malicious behaviors that resemble normal device usage and serve as a baseline for comparison. Impersonation involves abusers pretending to be the victim by sending emails, messages, reviews, or comments to manipulate perceptions or harm reputations [31, 35]. Leakage refers to accessing or extracting private information, such as account settings, browsing history, messages, or files, and sometimes uploading content like photos or videos to violate privacy [6]. Modification includes altering device settings, changing account credentials, or modifying files, disrupting the victim’s sense of control and potentially causing emotional or psychological harm [4]. Lastly, Spyware involves the installation of software to covertly monitor the victim’s activities, representing one of the most invasive and persistent forms of IPI [35].
This taxonomy enables systems like AID to classify IPI behaviors with increasing granularity, distinguishing between categories (5-class), actions (9-class), and subactions (28-class), supporting more precise detection and intervention strategies.
2.2 Mitigation Methods and Challenges
Prior research on technology-facilitated IPV has mainly provided qualitative insights into victims’ experiences and perceptions of how perpetrators misuse technology. Freed et al.[16, 15] examined the ways in which abusers deploy GPS trackers and audio surveillance tools, gathering details from victims to infer perpetrator methods. Bellini et al. [4] identified online forums and communities as accelerants for Intimate Partner Surveillance (IPS), emphasizing how readily available strategies and tools exacerbate the problem. Tseng et al. [31] provided measurement-based evidence of technology-driven abuses, illustrating the scope of these issues. Other research has also emphasized the need for legal frameworks and defensive tools to combat IPV and address the widespread availability of spy devices [30, 6].
A range of mitigation solutions has emerged in response to these findings. Although basic tools exist to help users remove browser histories and delete digital traces to safeguard their privacy [2], they do not address the wide spectrum of potential IPI abuses. Instead, security clinics [18, 32] provide in-person consultations for survivors, offering tailored support to identify risks and rebuild a sense of safety. While these clinics have demonstrated significant promise, scaling their services beyond local communities presents considerable challenges. One major limitation is their reliance on specific physical locations, which restricts access for survivors outside those areas. Furthermore, these clinics require the involvement of technical experts who possess specialized knowledge to address the complex and evolving nature of technology-facilitated abuse. Recruiting and training such experts, as well as ensuring ongoing support, is resource-intensive and difficult to sustain at scale. This combination of geographic constraints and reliance on highly skilled personnel highlights the barriers to replicating models like the Clinics to End Tech Abuse (CETA) in New York City [9] and the Madison Tech Clinic (MTC) in Madison [22] in broader contexts. These challenges underscore the need for scalable, accessible, and resource-efficient approaches to combat IPI on a wider scale.
Another highly promising yet underexplored approach for detecting IPV is continuous authentication. From a technical standpoint, user authentication on smartphones falls into two categories: one-time and continuous methods. While one-time verification (e.g., PINs, facial recognition) initially prevents casual intruders, it can be insufficient in IPV settings where abusers may already know these credentials [31, 36]. Continuous authentication, drawing on biometric or behavioral data (e.g., motion sensors [12, 7, 21], touch interactions [13, 36, 38], or hybrid approaches [10, 1]), offers stronger security by persistently checking whether the current user matches the legitimate owner. Nonetheless, such systems have been designed primarily for strangers or unauthorized outsiders. When abusers are intimate partners, benign sharing (e.g., letting a child play a game) can resemble malicious infiltration, making it difficult to distinguish IPI behaviors from normal usage.
Our analysis in Table 3 highlights this challenge, revealing that close friends or partners demonstrate significantly more similar behavioral signatures than strangers when using the same device. We derive these results by computing the similarity scores of individuals with close relationships versus those with no connections. These scores are calculated based on the cosine similarity of their behavioral representations, which are extracted using our feature extractor (a pretrained AutoEncoder; see Section 4.3 for implementation details). This metric quantitatively supports our hypothesis that individuals with close relationships, such as partners, exhibit significantly more similar behavioral signatures than strangers when using the same device. This similarity complicates the design of a user-detection system. Moreover, survivors often become wary of technology after experiencing trauma [16], complicating the adoption of new security tools. Consequently, while improvements in continuous authentication and security clinics represent meaningful progress, more nuanced and automatic approaches are needed to address the subtle ways intimate partners can exploit physical access without triggering suspicion or harming the survivor further.
Relationship | Same device (IPI setting) | Different devices
Close relationships | 0.950 (0.020) | 0.850 (0.033)
Strangers | 0.896 (0.076) | 0.820 (0.110)
t-statistic | 3.081 | 0.9426
p-value | 0.0045 | 0.3802
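For concreteness, this comparison could be reproduced along the lines of the sketch below, assuming each user’s behavioral embeddings are available as a NumPy array from the pretrained AutoEncoder (Section 4.3). The helper names and the use of Welch’s t-test are illustrative assumptions, not AID’s exact implementation.

```python
import numpy as np
from scipy import stats
from sklearn.metrics.pairwise import cosine_similarity

def pair_similarity(emb_a, emb_b):
    """Mean cosine similarity between two users' behavioral embeddings.

    emb_a, emb_b: arrays of shape (n_windows, latent_dim) extracted by the
    pretrained AutoEncoder from each user's traces on a given device setting.
    """
    return cosine_similarity(emb_a, emb_b).mean()

def compare_groups(close_pairs, stranger_pairs):
    """Compare similarity scores of close relationships vs. strangers.

    Each argument is a list of (emb_a, emb_b) tuples for one device setting
    (same device or different devices). Returns group means and t-test stats.
    """
    close = np.array([pair_similarity(a, b) for a, b in close_pairs])
    strangers = np.array([pair_similarity(a, b) for a, b in stranger_pairs])
    t_stat, p_value = stats.ttest_ind(close, strangers, equal_var=False)
    return close.mean(), strangers.mean(), t_stat, p_value
```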
3 Threat Model
In this section, we describe the scope and assumptions of our threat model. We focus on IPI behaviors that require physical access to mobile smartphones. IPI abusers cause harm to victims by stealthily accessing and interacting with the victims’ smartphones. As shown in Table 1, unlike traditional cybersecurity attacks, we assume that abusers have authenticated access to victims’ devices, e.g., by registering their biometrics or knowing the passwords, so they can directly interact with the data and applications stored on the victim’s smartphone. Aligned with previous research [15], we assume that abusers have a limited technical background, given that they come from the general population rather than from the ranks of skilled cyber attackers. This means that abusive behaviors are restricted to user-interface interactions, e.g., viewing content on the phone or installing applications. Additionally, we assume that the stealthiness of our system is sufficient to avoid alerting abusers, so the system itself is not targeted by IPI attacks.
Another notable point is that our study does not apply to partners in extreme situations, i.e., cases in which the use of a detection system would escalate IPI to severe physical or psychological violence. As discussed in [14, 18, 32], IPI abusers may escalate and bring more harm to victims if they discover evidence of anti-IPI measures. Although our design goals emphasize stealth and the absence of alerts, the use of our tool should follow an initial assessment by a security clinic. More harmful scenarios require intervention from external agencies, such as police, law enforcement, and clinical centers, and are out of the scope of this work.
Category | Modality | Description
IMU | Motion Data | Gyroscope
| | Accelerometer
| | Linear accelerometer
| | Magnetometer
| | Rotation vector sensor
| Environment Data | Proximity sensor
| | Pressure sensor
| | Light sensor
Systems (SYS) | Network Traffic | Upstream Bandwidth
| | Downstream Bandwidth
| Energy Consumption | Current
| | Voltage
| | Temperature
| Memory Utilization | Memory used
| | #App usages in memory
Interaction (INT) | Screen Interaction | Interaction rate
| | Interaction event
Application (APP) | Application Activeness | Foreground app name
4 AID Design
4.1 Challenges and Solutions
Our vision for AID is an “invisible” digital forensic tool that records a digital footprint of evidence of IPI on the victim’s smartphone. To accomplish this, AID runs continuously in the background, quietly collecting standard system traces from the phone for 1) user identification to verify whether the user is the legitimate owner and 2) behavior detection. Finally, based on our proposed taxonomy (Table 2), AID determines whether or not the behavior is potentially IPI related. If an IPI behavior is detected, the natural next step is to send an alert, but this may be visible to the attacker if s/he is still accessing the phone, or if the notification is not cleared before the attacker accesses the phone again. As such, we do not send any alerts, and instead record the event on the victim’s smartphone, leaving the analysis of detected IPI events for health and IPV professionals in security clinics. However, designing AID presents the following challenges:
IPI behavior detection across different contexts and users. Detecting behaviors that indicate IPI is complex for several reasons. First, behaviors are subjective and can be interpreted differently depending on the context. For instance, sharing devices among family members is common in some countries, while in others it is perceived as a privacy risk. Second, IPI behaviors differ significantly across apps and media, ranging from direct harassment on social media to covertly tracking victims with spyware. The diversity of platforms and the subtlety of some actions require a detection system that accommodates the wide variability in how IPI manifests across different contexts. Lastly, different people often exhibit slight to significant differences when performing the same action, highlighting the need for a system that can adapt to different users.
Solution. To create a solution that scales across different users, we propose a dual-module architecture that performs 1) user identification and 2) detection of potential IPI behaviors by the current user. For identifying the phone’s owner, we propose an encoder-decoder architecture, where the encoder is trained on a diverse set of users to extract salient features conducive to recognizing user identity, while the decoder is fine-tuned to a specific user and their personal device during a short calibration phase. For the behavior detection module, we employ an LSTM-CNN-based classifier trained on a diverse interaction dataset, allowing robust detection of behaviors indicative of potential IPI risks across various apps.
Enabling safety and stealthiness. Because IPV abusers often share the same physical and digital access to spaces and devices as the victim, it is imperative for AID to remain “invisible” to avoid escalation if AID or other anti-IPV measures are discovered [14, 18, 32]. Additionally, data privacy must be taken into consideration. Although we collect no personal information, there is still a possibility of leaking behavioral patterns (e.g., app usage).
Solution. We take the following actions to ensure the system’s stealth and safety. First, we review the sensitive data involved during collection and exclude any that pose privacy risks to our users. Second, we apply UI deception by disguising AID’s entry point with an unrelated but common interface, such as a weather report, to hide AID from potential abusers who might open the app accidentally. Third, we perform all inference and fine-tuning locally to prevent data leakage during transmission to a remote server. Finally, we do not provide any alerts to any user on the phone, unlike most timely applications (e.g., email and messaging). Instead, data is stored securely on the phone, until it can be sent over to a security clinic (e.g., during a consultation), where the traces detected by AID can be analyzed in a safe setting. Details about how we implemented our solution are discussed in Section 4.4.
Efficient computation. During the implementation of AID, we observed significant energy consumption from both 1) sampling at a high rate and 2) performing local inference continuously. This not only drains the battery but also causes the phone to overheat and lag, degrading the user experience.
Solution. To reduce energy consumption and the impact on user experience, we take advantage of our safe design choice of not alerting users and employ an asynchronous detection mechanism. Data signals are collected during the day but processed at night, when the user is sleeping and the phone is not in use. Because AID does not send timely alerts to users (to avoid discovery), there is no need to process data immediately. Moving model inference to nighttime reduces the energy consumption, overheating, and lag that continuous inference would otherwise cause during the daytime, when users are likely interacting with their smartphones.
[Figure 1: Overview of AID’s workflow: data collection, module inference, and analysis report generation.]
4.2 Workflow Overview
Figure 1 shows AID’s workflow, which contains three phases: data collection, module inference, and analysis report generation.
During the collection phase, AID samples and records time-series multimodal data signals from the smartphone in the background (Table 4). AID collects 4 categories of modalities to obtain a comprehensive view of how the user is interacting with the phone. Among these, IMU and SYS play a critical role in the User Identification Module, while INT and APP data are relevant to the Behavior Module for detecting suspicious behaviors associated with IPI. Intuitively, incorporating all data sources should yield the best performance because more information is available. However, our ablation studies (Section 5.5) reveal a clear divide in the importance of each data type for each task; incorporating more information can actually reduce performance, because a small model lacks the expressive power to generalize over the redundant, higher-dimensional inputs that result.
The User Identification Module employs a pretrained AutoEncoder to extract embeddings from user data, which are then used to train a Support Vector Machine (SVM) classifier that adapts to the target user. This adapted module performs identity verification, determining whether the current user is the authorized device owner. Meanwhile, the Behavior Detection Module utilizes a hybrid model combining Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) to recognize and classify user activities into one of 27 IPI subactions, shown in Table 2. To ensure reliability, results from both modules are refined through a post-processing step, which enhances stability and accuracy.
These results are consolidated into a report that provides key details such as the time of detection, the app involved, the user’s status (e.g., owner or non-owner), the most likely behaviors detected (e.g., benign or suspicious actions in Table 2), and the overall risk assessment based on confidence measures output from each module. For instance, the report might highlight that a verified user performed benign activities on one app. At another point in time, the report may detect an anomalous user engaging in actions like "Alter Account Setting," leading to an "IPV Risk Detected" classification. The report remains securely stored on the device until it is analyzed by health, safety, or security professionals.
As we will see later in this section and in the evaluation, detecting phone behaviors in an “invisible” manner is quite different from traditional human activity recognition, which aims to detect physical body movements with the IMU on a smartphone, smartwatch, or other wearable. The signals available to us are limited, can only indirectly capture activities of interest, and different behaviors can leave similar traces; for example, changing an account password and searching for a YouTube video both involve typing and tapping on the screen. As such, AID logs the top-k most probable behaviors from non-owners for security professionals to analyze.
4.3 Enhancing Scalability and Adaptability of the System
Adapting to different users. While similar actions may result in similar system traces, there is variability between different users, much like in traditional Human Action Recognition (HAR). Moreover, it is difficult to train a one-for-all model that can distinguish between any user. As such, we adopt a fine-tuning scheme for user identification that adapts to the phone’s owner.
First, we pretrain an LSTM-based AutoEncoder using a corpus of diverse user data. This pretraining process allows the model to learn general patterns and representations of phone activity and user behavior. AutoEncoders are particularly suitable for this task as they excel at learning patterns and representations from unlabeled and limited amounts of data.
Second, when the owner installs the application for the first time, AID guides the owner through a short calibration procedure of 5 minutes, during which the owner uses the device naturally. This session provides sufficient data for fine-tuning the classifier to adapt to the owner’s unique behavioral patterns. The fine-tuning dataset is constructed by selecting a subset of the newly collected owner data and a subset of data from other users as negative examples. To ensure balance during classifier training, we maintain a 1:1 ratio between the selected samples from the owner and other users. We explore different methods for selecting and constructing the full fine-tuning dataset in our evaluation, specifically Section 5.2.
Finally, the backbone of our user identification module takes inspiration from KedyAuth [19], a state-of-the-art user authentication architecture for mobile platforms. We adopt a similar encoder-decoder structure, as shown in Figure 2(a), except that we replace the encoder backbone with an 8-head LSTM layer (4 units per head) instead of CNN layers, to better suit the multimodal time-series data being processed. The head outputs are concatenated and projected through a dense layer to produce a compact 16-dimensional latent representation. The decoder reconstructs the original input by first expanding the latent representation back to the input shape using a RepeatVector layer, followed by an LSTM layer for the reconstruction task. Once pretrained, the features extracted by the AutoEncoder are fed into a lightweight SVM classifier with a Radial Basis Function (RBF) kernel to distinguish between use by the owner and a non-owner.
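The following is a minimal Keras sketch of this encoder-decoder and the downstream SVM. The multi-head LSTM encoder (8 heads, 4 units each), the 16-dimensional latent vector, the RepeatVector expansion, and the LSTM reconstruction head follow the description above; the activation functions, early-stopping setup, and variable names in the commented usage are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from sklearn.svm import SVC

def build_autoencoder(timesteps, n_features, n_heads=8, head_units=4, latent_dim=16):
    inp = layers.Input(shape=(timesteps, n_features))
    # Multi-head LSTM encoder: each head reads the full window independently.
    heads = [layers.LSTM(head_units)(inp) for _ in range(n_heads)]
    concat = layers.Concatenate()(heads)
    latent = layers.Dense(latent_dim, activation="relu", name="latent")(concat)
    # Decoder: expand the latent vector back into a sequence and reconstruct it.
    expanded = layers.RepeatVector(timesteps)(latent)
    decoded = layers.LSTM(n_features, return_sequences=True)(expanded)
    autoencoder = Model(inp, decoded)
    encoder = Model(inp, latent)
    autoencoder.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return autoencoder, encoder

# Pretraining (self-supervised reconstruction on a diverse user corpus), then
# fitting the per-owner SVM on encoder features collected during calibration.
# X_pretrain, X_calib, y_calib are placeholders for the actual datasets.
# autoencoder, encoder = build_autoencoder(timesteps=10, n_features=16)
# autoencoder.fit(X_pretrain, X_pretrain, epochs=100, batch_size=512,
#                 callbacks=[tf.keras.callbacks.EarlyStopping(patience=5, min_delta=1e-4)])
# svm = SVC(kernel="rbf", probability=True).fit(encoder.predict(X_calib), y_calib)
```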
[Figure 2: Model architectures of (a) the user identification AutoEncoder and (b) the behavior classification LSTM-CNN.]
Scaling to many IPI behaviors. While there is significant work in the area of human activity recognition (HAR) on smartphones, detecting IPI behaviors is significantly different. Rather than viewing phones and wearables as systems that a person interacts with, HAR leverages these devices as general-purpose sensors to detect motions and actions (e.g., step counting, motion tracking) while a person is not directly interfacing with the smartphone. This often involves heavy use of physical sensors on the device (e.g., camera and IMU). In contrast, determining IPI behaviors involves detecting actions a user performs while interfacing directly with the phone (e.g., changing passwords or installing malware). As such, the modalities and techniques used need to be adjusted. To the best of our knowledge, there are no prior works that attempt to detect IPV-related actions users take while interfacing with smartphones.
To detect IPI behaviors, we focus on interaction data (Int.) and app usage signals (App.) captured by the operating system (Figure 1). These modalities provide critical insights into how users engage with the device, such as navigating apps, interacting with the screen, or performing specific actions, which is indicative of IPI. We show through ablation studies in Section 5.5 that these two classes of signals enable the greatest performance, instead of relying solely on motion data, commonly used in traditional HAR.
Given the challenges of obtaining labeled data post-installation, we implement a server-side training architecture while deploying only the detection components on-device. Recent advances in HAR have explored various architectures, from CNNs[33] and LSTMs[24] to Transformers[11]. Inspired by the success of hybrid models in HAR[37], we develop an LSTM-CNN architecture for classifying IPI behaviors. As shown in Figure 2(b), our model processes input sequences through parallel paths: an LSTM layer with 64 units and a one-dimensional convolutional layer (kernel size=3, 64 filters). The CNN output undergoes max pooling and flattening before concatenation with the LSTM output. The combined features pass through two dropout-dense blocks for multi-class behavior classification.
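Below is a minimal Keras sketch of this classifier. The 64-unit LSTM, the 64-filter Conv1D with kernel size 3, max pooling, flattening, and concatenation follow the text; the dropout rate and the width of the dense layers are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_behavior_classifier(timesteps, n_features, n_classes,
                              dropout=0.3, hidden=128):
    inp = layers.Input(shape=(timesteps, n_features))
    # Parallel paths: recurrent and convolutional views of the same input window.
    lstm_out = layers.LSTM(64)(inp)
    conv = layers.Conv1D(filters=64, kernel_size=3, activation="relu")(inp)
    conv = layers.MaxPooling1D(pool_size=2)(conv)
    conv_out = layers.Flatten()(conv)
    merged = layers.Concatenate()([lstm_out, conv_out])
    # Two dropout-dense blocks for multi-class behavior classification.
    x = layers.Dropout(dropout)(merged)
    x = layers.Dense(hidden, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```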
Post Processing. To further enhance the reliability of the system, we introduce a post processing method tailored to the objectives of Module 1 (user identification) and Module 2 (behavior classification).
Module 1 (User Identification): Outputs are refined using a clustering-based post-processing approach. We extract temporal features (rolling mean and standard deviation) from prediction scores, which are input into a k-means clustering model (k = 2) that partitions the prediction space into regions corresponding to true users and potential abusers. A temporal voting mechanism aggregates predictions within a sliding window to ensure classification stability and mitigate transient errors. This approach significantly reduces false positives while maintaining high detection sensitivity, which we will show in our experiments.
Module 2 (Behavior Classification): Post-processing for behavior classification adopts a simpler rolling-window average to smooth fluctuations in predictions and reduce temporal variations. With post-processing applied to each module, the system achieves both high accuracy and stability across diverse tasks.
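The sketch below illustrates both post-processing steps on a sequence of per-window scores. The sliding-window length, the heuristic that orients the two clusters, and the majority-vote rule are assumptions, as the text does not fix these details.

```python
import numpy as np
from sklearn.cluster import KMeans

def rolling_mean(scores, window=5):
    """Rolling average used by both modules to damp transient fluctuations."""
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="same")

def refine_user_predictions(scores, window=5):
    """Module 1 post-processing: rolling stats -> k-means (k=2) -> temporal vote.

    scores: per-window probability that the current user is a non-owner.
    """
    scores = np.asarray(scores, dtype=float)
    mean = rolling_mean(scores, window)
    std = np.array([scores[max(0, i - window + 1):i + 1].std()
                    for i in range(len(scores))])
    feats = np.stack([mean, std], axis=1)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    # Orient the clusters so that label 1 is the higher-score (non-owner) region.
    if scores[labels == 0].mean() > scores[labels == 1].mean():
        labels = 1 - labels
    # Temporal voting: majority label within each sliding window.
    voted = [round(labels[max(0, i - window + 1):i + 1].mean())
             for i in range(len(labels))]
    return np.array(voted, dtype=int)

# Module 2 post-processing simply smooths each class's probability over time:
# smoothed_probs = np.apply_along_axis(rolling_mean, 0, behavior_probs)
```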
4.4 Maintaining Safety and Stealth
Review of risky data sources. As discussed in Section 4.1, AID should be “invisible” to abusers while also being privacy sensitive towards the owner of the phone. As such, we discard data streams only available through APIs that violate these principles. We identified four such signals, which are not present in AID:
1. Touch information, available through the MotionEvent API, provides information about the coordinates of a user’s touch, swipe direction, pressure etc. However, accessing this API in the background places a floating icon on the screen and triggers a mandatory system alert.
2. The Camera of the phone can provide significant amounts of information about the current user and environment. However, the camera records highly sensitive information, such as the user’s face, which, if leaked, could result in severe privacy breaches. Additionally, activating the camera in the background triggers mandatory notifications on the screen, potentially alerting an abuser to the monitoring and escalating the situation. Continuous use of the camera is also impractical due to significant power consumption, leading to device overheating and impacting the user’s normal experience.
3. Audio can be obtained through the MediaRecorder API. However, similar to the camera, using audio through the MediaRecorder API raises significant privacy concerns, as sensitive conversations or environmental sounds could be recorded and potentially leaked. Additionally, activating audio recording triggers mandatory on-screen notifications, compromising the invisibility of the system and potentially alerting an abuser.
4. A user’s position can be obtained from GPS and wireless signals through the LocationManager API. While location tracking via GPS or wireless signals could provide useful information, it can reveal sensitive details about the user’s movements or frequent locations, which could put them at risk. Additionally, like camera and audio recording, system notifications and battery drain from continuous tracking further make this option unsuitable for our needs.
User-interface deception. We crafted our Android application to blend in as a conventional smartphone application (e.g., a weather app), which significantly reduces the chances of discovery by the abuser. The first page of the application is disguised accordingly (e.g., as a weather report).
Second, we leverage the Accessibility Service [17], a built-in function of the Android OS that enables AID to launch without clicking on the application icon. This allows AID to begin running while remaining absent from the “Recent Apps” page, which enhances stealthiness and prevents abusers from terminating AID from there.
Local inference. AID performs inference and fine-tuning on-device, which eliminates the need to send data to a remote server, reduces the chance of data leaks, and enhances data privacy. Furthermore, we take advantage of the IPV context to further reduce AID’s energy footprint. Because we avoid alerting users to prevent discovery by abusers, AID does not need to provide insights continuously and in real time. As such, during the day, the Data Streaming Hub collects and logs traces, while the inference processor analyzes the collected data at night, when the device is idle and connected to power and the user is likely asleep. This reduces the computational load on the phone during the hours when users are more likely to be using it, improving battery life and reducing the overheating and lag that can result from continuous inference.
What if AID is detected by the abuser? Although the average IPV attacker may not have the same technical background as a cyber attacker, there is a chance that abusers may become aware of AID even with the safety measures employed. Here we discuss scenarios reflecting the degree of awareness.
At the most basic level of awareness, the abuser only knows of AID’s existence, but not its location on the device. In this case, AID remains effective, as the abuser cannot prevent AID from running, and it is difficult to deliberately alter one’s own phone-usage habits to evade detection.
A more informed level of awareness is knowing the exact location of AID. The most direct response from the abuser might be to delete AID or to force it to stop by altering background settings. Regardless of the approach, any attempt to interfere with AID already reveals a clear intention of IPV behavior, which can be used to inform victims and security and health professionals.
The most severe scenario, mentioned in Section 3, is a deterioration of the relationship in which physical violence is likely to occur. This scenario is beyond the scope of AID and requires external intervention, such as from police and law enforcement.
4.5 Efficient Computing
Asynchronous detection. Deep learning models are effective for accurate detection but can be computationally demanding, which poses challenges for resource-constrained mobile devices. Running these models continuously may consume excessive computational resources, lead to overheating, and negatively impact the user experience. To maintain invisibility, AID does not send alerts immediately when a suspicious IPV intrusion is detected, as this could notify the abuser. Unlike traditional user authentication systems, which prioritize real-time processing, AID processes data asynchronously during nighttime when the device is idle and charging. This approach eliminates performance lags during the day (the only overhead being logging data from the phone), ensuring an improved user experience.
Few-Shot Learning for Efficient Adaptation. Model training is typically resource-intensive and time-consuming. In our dual-module architecture, the behavior detection module performs inference using a trained model, while the user authentication module requires fine-tuning for user adaptation. To minimize this overhead, we implement a few-shot learning scheme [34] that enables efficient adaptation with minimal data. During pretraining, the model is trained on a diverse dataset, capturing general patterns of mobile interaction. This knowledge enables the model to rapidly adapt to a new user’s specific interaction patterns through fine-tuning. Our evaluation demonstrates that random sampling of just 20 samples during the calibration session is sufficient for fine-tuning, achieving high detection accuracy. This eliminates the need to process the full dataset, significantly reducing computational and time costs. The resulting system is both practical and efficient for real-world deployment.
5 Evaluation
5.1 Dataset and Settings
Data Collection. Interactions were recorded using two smartphones (a Pixel 6A running Android 13 and a Samsung A54 running Android 14). We developed a data collection tool based on AID to facilitate both data collection and labeling. The tool features an intuitive interface that allows participants to select a task type, after which the tool automatically begins collecting data in the background. The logger is power efficient, draining about 0.557% of the battery during a 40-minute session at the maximum sampling rate on the Pixel (around 130 Hz). We gathered usage data from 27 participants (18 male and 9 female). All collection activities were approved by the Institutional Review Board (IRB) at our institution.
Each participant was instructed to interact with commonly used applications—Amazon, Gmail, Instagram, Slack, Spotify, and YouTube—to simulate everyday usage patterns (e.g., searching, changing passwords, or uploading media). Before starting, participants received a task list containing 44 actions (Table 11 in the appendix lists all tasks) designed to guide their interactions with these six applications. The task list aimed to cover a comprehensive range of scenarios typical for smartphone users, including both benign actions (e.g., watching videos or listening to music) and potentially IPI-related actions (e.g., attempting to modify an account password). The list was randomized to enhance generalization and reduce potential bias in the collected data. Before each task, a team member set the appropriate label for the task, after which the participant began completing it. Once the task was finished, the team member assigned the label for the next action, repeating this process until all tasks were completed.
We intentionally introduced flexibility into the task instructions to allow participants to complete tasks in their preferred ways to capture increased behavioral diversity. For example, we did not prescribe specific postures for using the phone or dictate the exact steps to complete a task. In the case of modifying a password, participants could either navigate through multiple settings pages or use the app’s built-in assistant to reach the relevant page directly.
Dataset. We collected the modalities listed in Table 4. All data streams were gathered using standard Android APIs, which are compatible with Android smartphones running Android 9.0 or later, without requiring root access. As previously mentioned, this dataset includes usage data from 27 participants, including 9 pairs (e.g., couples or close friends). On average, each participant took 41.5 minutes to complete all tasks, resulting in a total of 18.7 hours of recorded data streams.
Dataset processing. To process collected data, we first apply min-max normalization to the raw data streams, scaling each feature to the range [0, 1]. Data streams are collected at the highest available sampling rate of up to 130 Hz. To analyze the impact of lower sampling rates, we downsample the data to various frequencies - [1 Hz, 2 Hz, 5 Hz, 10 Hz, 20 Hz]. Next, we apply a sliding window to extract windows of inputs into our models. To evaluate the effect of different window time spans on module performance, we experimented with varying window durations - [1s, 2s, 5s, 10s, 20s].
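A minimal NumPy sketch of this preprocessing pipeline (per-feature min-max scaling, decimation-based downsampling, and sliding-window extraction) is shown below; the non-overlapping window stride is an assumption.

```python
import numpy as np

def min_max_normalize(x, eps=1e-8):
    """Scale each feature (column) of a (time, features) stream to [0, 1]."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo + eps)

def downsample(x, src_hz, dst_hz):
    """Downsample by simple decimation, e.g. 130 Hz -> 10 Hz."""
    step = max(1, int(round(src_hz / dst_hz)))
    return x[::step]

def sliding_windows(x, window_s, hz, stride_s=None):
    """Cut a (time, features) stream into fixed-length windows."""
    win = int(window_s * hz)
    stride = int((stride_s if stride_s is not None else window_s) * hz)
    return np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, stride)])

# Example: 1-second windows from a 10 Hz stream (the best configuration found
# in Section 5.2); raw_stream is a placeholder for one recorded trace.
# stream = downsample(min_max_normalize(raw_stream), src_hz=130, dst_hz=10)
# windows = sliding_windows(stream, window_s=1, hz=10)
```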
For the evaluation of the User Identification Module, we randomly select 95% of the data from 22 users as the pretraining set for the AutoEncoder. The remaining 5% is combined with data from 3 additional users, labeled as ‘abuser’ (positive samples, denoted as class 1), to form the fine-tuning set. Of the remaining two users, one is designated as the owner and the other as the abuser. From the owner, the first 5 minutes of data are added to the fine-tuning set (as described in Section 4.3), labeled as ‘owner’ (negative samples, denoted as class 0), while the owner’s data after the initial 5 minutes and all of the abuser’s data form the test set. For the evaluation of the Behavior Classification Module, we use a leave-one-user-out scheme, where data from 26 users serves as the training set and the remaining user serves as the test set. This approach assesses whether the learned patterns generalize to the behaviors of unseen users.
Evaluation Settings. We train and evaluate the models on a Linux server running on Ubuntu 22.04.5 LTS with an NVIDIA L40 GPU, using Tensorflow 2.17.0 and Python 3.12.7.
For all results, we used K-fold validation to ensure the consistency of our results. This involved setting up multiple permutations of pretraining, tuning, and test user combinations, and averaging the results across these permutations. Specifically, for the User Identification Module, we tested 12 combinations of owner-abuser pairs: 9 involving the previously mentioned participant pairs (e.g., couples or close friends) and 3 involving strangers to increase test diversity. For the Behavior Classification Module, we conducted 12 iterations of leave-one-user-out evaluations and averaged the results.
The AutoEncoder for user identification is pretrained on a self-supervised reconstruction task for a maximum of 100 epochs with a batch size of 512, using the Adam optimizer and a learning rate of 1e-3. Training incorporates early stopping with a patience of 5 epochs and a minimum improvement threshold of 0.0001. Mean Squared Error (MSE) is used as the loss function to reconstruct the input data. Similarly, the Behavior Classification model is trained for 50 epochs, batch size = 512, Adam optimizer, learning rate = 1e-3, and is optimized with Categorical Crossentropy as the loss function.
Evaluation Metrics. The primary metrics used for evaluation include F1 score, recall, precision, false positive rate (FPR), false negative rate (FNR), and accuracy. F1 score balances precision and recall, recall measures the proportion of true positives correctly identified, and precision evaluates the reliability of positive predictions. False positive rate (FPR) is particularly critical in our context as it indicates the proportion of negative samples incorrectly classified as positive. A high FPR could lead to false alarms, which may cause unnecessary interventions or distress in IPV detection scenarios. FNR measures the rate of missed detections, while accuracy represents the overall proportion of correctly classified behaviors.
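For reference, these metrics can be derived from a binary confusion matrix as sketched below, treating non-owner access (or IPI risk) as the positive class; the helper is illustrative rather than part of AID’s implementation.

```python
from sklearn.metrics import confusion_matrix, f1_score

def binary_report(y_true, y_pred):
    """F1, precision, recall, FPR, FNR, and accuracy for a binary task."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "f1": f1_score(y_true, y_pred),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,  # false alarms on owner data
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,  # missed non-owner accesses
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```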
5.2 Effectiveness of User Identification
Frequency \ Window time span | 1 sec | 2 sec | 5 sec | 10 sec | 20 sec
1 Hz | 0.858 | 0.925 | 0.969 | 0.922 | 0.917
2 Hz | 0.900 | 0.939 | 0.968 | 0.974 | 0.944
5 Hz | 0.980 | 0.973 | 0.962 | 0.975 | 0.978
10 Hz | 0.981 | 0.980 | 0.965 | 0.961 | 0.954
20 Hz | 0.978 | 0.980 | 0.980 | 0.976 | 0.976
F1 score across varying sampling rates and window time spans. We trained, fine-tuned, and evaluated our owner identification module with different data sampling rates and input window sizes. Table 5 shows the F1 score for a variety of configurations; the highest score (F1: 0.981) is obtained with a sampling rate of 10 Hz and a window duration of 1 second. Across all configurations, even with lower sampling rates and shorter window sizes, the F1 scores remain consistently high (> 0.85), demonstrating robustness while requiring less computation.
Interestingly, we observe a plateau or slight performance drop with certain configurations, such as a 5 Hz sampling rate and a 5-second window size. We hypothesize that while a larger window size and sampling rate provide more information, they also require a larger and more expressive model to generalize well. Our model has the fewest parameters compared to existing authentication works, which limits its ability to generalize to higher-dimensional inputs. Despite these limitations, our results suggest that a highly expressive model is not strictly necessary. The best configuration performs admirably even at moderately low sampling rates, making the system computationally efficient and power-friendly. Additionally, the stability of F1 scores across varying configurations highlights the model’s adaptability, offering flexibility in deployment scenarios where power and processing constraints are critical.
We opt for the best configuration, i.e., a sampling rate of 10 Hz with a window time span of 1 second. Unless explicitly stated otherwise, all subsequent evaluations of the user identification module are based on this optimal configuration.
Metrics | AuthentiSense | KedyAuth | Ours |
F1 score | 0.922 | 0.875 | 0.981 |
FPR | 0.081 | 0.331 | 0.052 |
FNR | 0.084 | 0.007 | 0.001 |
Model size | 5,373KB | 1,186KB | 214KB |
Parameter # | 685,221 | 268,843 | 28,130 |
Comparison with baseline frameworks. We compared our owner identification framework with KedyAuth [19] and AuthentiSense [12], two state-of-the-art user authentication frameworks for mobile devices, as shown in Table 6. We implemented and trained both methods using the same datasets and settings as AID. AID’s owner identification method outperforms both baselines, with a higher F1 score and lower FPR/FNR, using a model that is an order of magnitude smaller.
Compared to AuthentiSense, AID achieves a 6.4% improvement in F1 score. AuthentiSense employs a larger CNN-based siamese network trained using the triplet mining technique, which is highly demanding in terms of training data and prone to overfitting or inefficiencies. In contrast, AID benefits from its much smaller model size (25× smaller) and a more direct and efficient training approach. By leveraging unsupervised training with an AutoEncoder, AID learns feature representations directly from encoding and reconstructing raw data traces, leading to better generalization despite using significantly fewer parameters.
Similarly, AID outperforms KedyAuth, which also uses an unsupervised AutoEncoder and an SVM classifier. AID achieves a 12.2% improvement in F1 score and a 6× reduction in FPR, which is critical in IPV detection scenarios. KedyAuth’s CNN-based model is optimized for high-frequency data at 100 Hz but struggles to effectively handle lower-frequency data. In contrast, AID leverages a multi-head LSTM AutoEncoder, which excels at processing sequential data and extracting richer patterns from lower-frequency signals. Additionally, AID incorporates a post-processing technique (Section 4.3) that further smooths and stabilizes the results, reducing the FPR to just 5.2%, significantly lower than KedyAuth’s 33.1%.
[Figure 3: User identification F1 score versus the number of fine-tuning windows for the random, similarity-based, and time-based selection schemes.]
Effects of pretraining and fine-tuning. Here, we analyze the impact of pretraining and fine-tuning. As shown in Figure 3, we vary the number of samples used to fine-tune the base model to the phone’s owner (x-axis), collected during the short one-time 5-minute calibration phase when the user first installs AID. We also ran experiments with and without pretraining. Finally, we examined different selection schemes for choosing the windows used during fine-tuning. Ultimately, AID adopts a random selection scheme (i.e., choosing random windows of data provided by the user), which we observed to have the highest performance. We compared it against selecting fine-tuning windows based on chronological time and on similarity.
When selecting based on chronological time, we simply use the first consecutive windows provided by the user. We believe this scheme performed worse than random selection because windows collected at a specific time are likely less diverse than windows randomly selected across the entire calibration session, during which users perform a variety of different actions.
When selecting based on similarity, we select windows from the owner and from other users that are similar to each other (measured by cosine similarity). The intuition is that inputs which are more aligned are harder to distinguish, so incorporating them during the fine-tuning phase could help the model handle these harder cases. To achieve a balance, we select windows in a specific ratio: 30% hard (most similar), 50% mid (moderately similar), and 20% easy (least similar). However, these cases are likely not fully indicative of the owner’s general behavior, and more typical examples are needed to generalize well. Random sampling offers a high chance of selecting diverse samples that both embody the user’s typical behavior and include samples that are difficult to distinguish from other users.
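The three selection schemes can be sketched as follows, assuming the owner’s calibration windows and their AutoEncoder embeddings are available as arrays; the way “hard”, “mid”, and “easy” windows are ranked against other users’ embeddings is our reading of the scheme rather than an exact specification.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def select_random(owner_windows, n, seed=0):
    """AID's default: sample n windows uniformly across the calibration session."""
    idx = np.random.default_rng(seed).choice(len(owner_windows), size=n, replace=False)
    return owner_windows[idx]

def select_chronological(owner_windows, n):
    """Time-based baseline: take the first n consecutive windows."""
    return owner_windows[:n]

def select_by_similarity(owner_windows, owner_emb, other_emb, n):
    """Similarity-based baseline: 30% hard / 50% mid / 20% easy windows,
    ranked by mean cosine similarity to other users' embeddings."""
    sim = cosine_similarity(owner_emb, other_emb).mean(axis=1)
    order = np.argsort(-sim)                      # most similar (hardest) first
    n_hard, n_mid = int(0.3 * n), int(0.5 * n)
    picks = np.concatenate([order[:n_hard],                  # hard
                            order[n_hard:n_hard + n_mid],    # mid
                            order[-(n - n_hard - n_mid):]])  # easy
    return owner_windows[picks]
```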
As shown in Figure 3, the peak F1 score is achieved with our random selection method using 12 windows for fine-tuning and remains consistently high as the number of windows increases. In comparison, similarity-based selection exhibits a similar trend as the random method but plateaus at a lower F1 score. Furthermore, it shows instability as the training set size changes, fluctuating from an F1 score of 0.960 with 20 windows to 0.902 with 24 windows, demonstrating its less stable performance compared to random sampling and highlighting its difficulty in finding diverse yet representative windows for adaptation. Similarly, the time-based method, constrained by its lack of diversity, consistently results in lower F1 scores across different training window sizes.
[Figure 4: Top-k F1 scores of behavior detection at the category, action, and subaction granularity levels.]
5.3 Effectiveness of Behavior Detection
Classification accuracy. Beyond performing user identification, AID detects IPI behaviors at the three granularity levels of our proposed taxonomy (Table 2). The performance is summarized in Figure 4, where we report top-k performance for k = 1, 2, 3, and 5 (i.e., the ground-truth category, action, or subaction is in the list of the top-k most probable behaviors). In the case of top-1, i.e., logging only the most probable behavior, AID performs significantly better than random guessing (F1 score of around 0.6 to 0.7). This far-from-perfect performance illustrates the difficulty of distinguishing IPI behaviors using only a limited set of data streams, constrained by the need to remain invisible to abusers who may access the phone.
However, by slightly broadening our logging to the top-k, the rate at which we successfully “detect” or report the true behavior increases drastically. For instance, reporting the top-5 most probable subactions yielded an F1 score of 0.895, and reporting the top-3 actions boosted the F1 score to 0.923. These results highlight how certain IPI behaviors may exhibit similar patterns across multiple modalities, especially at finer-grained levels (e.g., subactions). This reinforces the need for systems like AID to account for ambiguity by ranking behaviors rather than relying solely on single-point predictions, which still offers investigators or automated systems actionable insights.
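Here, a top-k “detection” simply means that the ground-truth class appears among the k most probable classes output by the model; a minimal sketch:

```python
import numpy as np

def top_k_hit_rate(probs, y_true, k=3):
    """Fraction of windows whose true class lies in the k most probable classes.

    probs: (n_windows, n_classes) softmax outputs; y_true: integer labels.
    """
    topk = np.argsort(-probs, axis=1)[:, :k]
    return float(np.mean([y in row for y, row in zip(y_true, topk)]))
```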
Configuration | F1 | Precision | FPR
TOP-3 | 0.981 | 0.966 | 0.040 |
TOP-3 (No Post Processing) | 0.978 | 0.961 | 0.044 |
TOP-2 | 0.973 | 0.952 | 0.055 |
TOP-2 (No Post Processing) | 0.969 | 0.944 | 0.063 |
TOP-1 | 0.956 | 0.920 | 0.087 |
TOP-1 (No Post Processing) | 0.951 | 0.911 | 0.095 |
5.4 End to End System Performance
The evaluation of the end-to-end (e2e) system focuses on its ability to combine the User Identification Module and the Behavior Classification Module to produce final predictions aligned with the goals of IPI detection. The labeling process for e2e performance is derived by integrating the outputs of these two modules. Specifically, the User Identification Module determines whether the user is the owner or an abuser, while the Behavior Classification Module classifies the corresponding behavior into categories such as benign or IPI-related actions. By combining these outputs, the system assigns a single label to each instance that encapsulates both the identity of the user and the behavioral context.
For example, behaviors classified as "Alter Account" are considered safe when performed by the verified owner but are flagged as IPI risks when executed by an abuser. Similarly, benign behaviors by an intruder, such as general browsing, are marked as non-risk. This labeling framework ensures that the system captures the interplay between identity and behavior, allowing for nuanced and context-aware decision-making that is not possible by analyzing just one or the other.
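This rule-based integration can be summarized as in the sketch below; the label strings are illustrative rather than AID’s internal identifiers.

```python
def e2e_label(is_owner: bool, behavior_category: str) -> str:
    """Combine user identity and behavior category into an end-to-end label."""
    if is_owner:
        return "safe"                 # owner activity is not treated as IPI
    if behavior_category == "Benign":
        return "non-risk"             # e.g., general browsing by a non-owner
    return "IPI risk detected"        # non-owner performing a suspicious action
```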
The rationale behind this labeling approach lies in its structured integration of user identity and behavioral context through explicit rule-based definitions. As mentioned, FPR is critically important given the nature of IPI events: even a small number of false alarms could lead to significant negative consequences for victims, such as undue distress, mistrust in the system, or unintentional exposure to their abusers. Interventions must therefore be handled with care to avoid exacerbating emotional distress, particularly for at-risk populations such as IPV survivors. The e2e labeling framework minimizes such risks by ensuring that both user identity and behavioral context are integrated into the decision-making process.
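As a concrete illustration of this rule-based fusion, the sketch below maps an (identity, action) pair to an e2e label. The examples above ("Alter Account" flagged only for non-owners, general browsing treated as non-risk) are reflected directly; the remaining entries in the sensitive-action set are illustrative assumptions, not AID’s exact rule table.

```python
# Actions treated as sensitive when performed by a non-owner.
# "Alter Account" follows the example in the text; the rest are illustrative assumptions.
SENSITIVE_ACTIONS = {"Alter Account", "Modify Content", "Install Software"}

def e2e_label(is_owner: bool, action: str) -> str:
    """Fuse user-identification and behavior-classification outputs into one label."""
    if is_owner:
        return "safe"        # the verified owner may perform any action
    if action in SENSITIVE_ACTIONS:
        return "ipi-risk"    # sensitive action by a non-owner is flagged
    return "non-risk"        # benign non-owner activity, e.g., general browsing

print(e2e_label(True, "Alter Account"))      # safe
print(e2e_label(False, "Alter Account"))     # ipi-risk
print(e2e_label(False, "General Browsing"))  # non-risk
```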
Table 7 shows the e2e performance of AID at the Action level (5-class) in identifying both the correct user (owner vs. non-owner) and the IPI behavior. The best performance is achieved with top-3 behavior classification: with an F1 score of 0.981 and an FPR of 0.040, the e2e system is highly reliable when allowed to consider the three highest-ranked predictions, effectively minimizing false alarms. Narrowing the scope to top-2 provides more confident predictions, with only a marginal trade-off in F1 score, precision, and FPR. Similarly, top-1 predictions offer the highest certainty but at the cost of a slightly higher FPR, underscoring the inherent trade-off between prediction completeness and certainty. This balance is especially critical in high-risk scenarios, where reducing false positives is paramount to avoid unnecessary distress or harm to victims, yet accurate identification of IPI behaviors remains equally important.
Additionally, the application of post-processing consistently enhances system performance across all configurations. For instance, in the top-1 prediction, post-processing reduces FPR from 0.095 to 0.087 and improves F1 from 0.951 to 0.956, reinforcing the importance of refinement steps to mitigate prediction errors. These results validate the effectiveness of AID in integrating user identity and behavioral context, demonstrating its capability to reliably classify behaviors at the action level.
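The exact post-processing procedure is not detailed in this section; purely to illustrate the kind of refinement that can suppress isolated misclassifications (and hence FPR), the sketch below applies a sliding majority vote over consecutive window-level labels. This is a stand-in, not a description of AID’s actual post-processing step.

```python
from collections import Counter

def majority_smooth(preds, width=3):
    """Sliding majority vote over consecutive window predictions (illustrative only)."""
    out = []
    for i in range(len(preds)):
        lo, hi = max(0, i - width // 2), min(len(preds), i + width // 2 + 1)
        out.append(Counter(preds[lo:hi]).most_common(1)[0][0])
    return out

print(majority_smooth(["safe", "ipi-risk", "safe", "safe", "safe"]))
# ['safe', 'safe', 'safe', 'safe', 'safe']  -- the isolated spurious alarm is removed
```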
5.5 Ablation Study

Module-1: user identification ablation. We evaluate a range of classifiers for user identification. For a comprehensive analysis, we include traditional machine learning approaches such as Support Vector Machines (SVM) and Random Forests (RF), deep learning-based methods such as Long Short-Term Memory (LSTM) networks and dense neural networks, and XGBoost, a widely used ensemble learning algorithm. Specifically, we configure the SVM with an RBF kernel, the Random Forest with 100 estimators, and a multi-head LSTM (6 units per head) that mirrors the AutoEncoder architecture. For the dense network, we attach two dense layers to the AutoEncoder output for prediction. The XGBoost model uses a learning rate of 0.1 (commonly chosen to balance convergence speed and performance), a maximum tree depth of 6, 100 estimators, and the logloss evaluation metric.
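These configurations map directly onto standard library defaults; the following is a minimal sketch assuming scikit-learn and xgboost, with `X_train`/`y_train` standing in for the encoder features and owner/non-owner labels. The deep learning heads (LSTM, dense) are omitted for brevity.

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Candidate classification heads on top of the AutoEncoder embeddings.
backbones = {
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100),
    "XGBoost": XGBClassifier(learning_rate=0.1, max_depth=6,
                             n_estimators=100, eval_metric="logloss"),
}

# X_train: per-window encoder features; y_train: owner (1) vs. non-owner (0).
# for name, clf in backbones.items():
#     clf.fit(X_train, y_train)
```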
Figure 5 shows the performance of AID with different classifier backbones. We adopt the SVM as the classifier backbone since it achieves the highest performance, with an F1 score of 0.98. Among the deep learning-based models, the dense classifier reaches an F1 score of 0.969, slightly below the SVM but above the LSTM classifier. Of all the architectures, the SVM is the least complex and least expressive, but also the least susceptible to overfitting. Its high performance suggests that the encoder extracts highly relevant features from little data to distinguish the phone owner from other users, which is helpful for our few-shot fine-tuning process. More expressive methods, such as dense MLPs and LSTMs, typically require more data, memory, and compute.
Table 8: Top-k behavior classification accuracy with different backbone architectures (best result per row marked with \ul).

| | LSTM-CNN | LSTM | CNN | Transformer |
| Category (5-class) | | | | |
| Top-3 | 0.963 | 0.963 | \ul0.965 | 0.960 |
| Top-2 | 0.887 | \ul0.888 | 0.884 | 0.850 |
| Top-1 | \ul0.674 | 0.666 | 0.667 | 0.630 |
| Action (9-class) | | | | |
| Top-3 | \ul0.923 | 0.904 | 0.917 | 0.882 |
| Top-2 | \ul0.826 | 0.815 | 0.819 | 0.765 |
| Top-1 | \ul0.642 | 0.638 | 0.623 | 0.591 |
| Subaction (28-class) | | | | |
| Top-3 | \ul0.813 | 0.783 | 0.800 | 0.769 |
| Top-2 | \ul0.728 | 0.695 | 0.715 | 0.689 |
| Top-1 | \ul0.567 | 0.545 | 0.558 | 0.527 |
Module-2: behavior classification ablation. Table 8 shows the performance of AID in classifying IPI behaviors with different backbone architectures. Our hybrid LSTM-CNN architecture, which extracts both temporal (LSTM) and spatial (CNN) features, significantly outperforms single-type architectures. Across all scenarios, it also outperforms a transformer architecture; transformers generally excel when data and model size are abundant, but with small models like ours (only hundreds of thousands of parameters), transformer performance typically suffers, which matches our observations. While the hybrid LSTM-CNN does not outperform a purely CNN or LSTM approach at the category (5-class) level, the performance is essentially equal, with a difference of at most 0.002.
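For intuition, the sketch below shows one way to wire such a hybrid model in Keras: a 1D-CNN branch for local patterns and an LSTM branch for temporal dependencies, concatenated before a softmax head. Layer sizes, window length, and feature count are illustrative assumptions, not AID’s exact architecture.

```python
from tensorflow.keras import layers, models

def build_hybrid_classifier(window_len, n_features, n_classes):
    """Minimal hybrid LSTM-CNN sketch; dimensions and layer sizes are assumptions."""
    inp = layers.Input(shape=(window_len, n_features))

    cnn = layers.Conv1D(32, kernel_size=5, padding="same", activation="relu")(inp)
    cnn = layers.GlobalMaxPooling1D()(cnn)   # local (spatial) patterns within the window

    lstm = layers.LSTM(32)(inp)              # temporal dependencies across the window

    merged = layers.concatenate([cnn, lstm])
    out = layers.Dense(n_classes, activation="softmax")(merged)

    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g., a 9-way Action-level classifier over 50-step windows of 16 features
model = build_hybrid_classifier(window_len=50, n_features=16, n_classes=9)
```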
Table 9: User identification F1 score with different input data streams (best marked with \ul).

| | IMU+SYS | IMU | SYS | INT | APP | ALL |
| F1 score | \ul0.974 | 0.968 | 0.675 | 0.079 | 0.561 | 0.938 |
Table 10: Top-k behavior classification accuracy with different input data streams (best marked with \ul).

| | INT+APP | INT | APP | IMU | SYS | ALL |
| Category (5-class) | | | | | | |
| Top-3 | \ul0.963 | \ul0.963 | 0.939 | 0.824 | 0.87 | 0.93 |
| Top-2 | \ul0.887 | 0.881 | 0.79 | 0.645 | 0.688 | 0.806 |
| Top-1 | \ul0.674 | 0.655 | 0.538 | 0.396 | 0.386 | 0.558 |
| Action (9-class) | | | | | | |
| Top-3 | \ul0.923 | 0.898 | 0.813 | 0.57 | 0.615 | 0.811 |
| Top-2 | \ul0.826 | 0.787 | 0.666 | 0.436 | 0.483 | 0.681 |
| Top-1 | \ul0.642 | 0.607 | 0.479 | 0.269 | 0.298 | 0.486 |
| Subaction (28-class) | | | | | | |
| Top-3 | \ul0.813 | 0.786 | 0.629 | 0.248 | 0.316 | 0.657 |
| Top-2 | \ul0.728 | 0.697 | 0.495 | 0.173 | 0.252 | 0.556 |
| Top-1 | \ul0.567 | 0.533 | 0.35 | 0.101 | 0.148 | 0.392 |
Modality ablations. Tables 9 and 10 show the performance of user identification and behavior classification, respectively, when varying the data streams used as input. Incorporating all streams does not yield the best performance; carefully selecting only the most relevant streams works better. Because using 1) the IMU and SYS streams for user identification and 2) the INT and APP streams for behavior classification yields the most promising results, we adopt these two disjoint stream combinations in the respective modules of AID.
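The sketch below illustrates this kind of modality ablation: only the named streams are concatenated as input while everything else stays fixed. The toy data, classifier choice, and stream dimensions are assumptions for illustration, not AID’s evaluation pipeline.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import SVC

def evaluate_streams(streams, subset, labels, train_idx, test_idx):
    """Score a classifier restricted to the named data streams (illustrative)."""
    X = np.concatenate([streams[name] for name in subset], axis=1)
    clf = SVC(kernel="rbf")
    clf.fit(X[train_idx], labels[train_idx])
    return f1_score(labels[test_idx], clf.predict(X[test_idx]), average="macro")

# Toy stand-ins for per-window features from each stream.
rng = np.random.default_rng(0)
streams = {name: rng.normal(size=(200, 8)) for name in ("IMU", "SYS", "INT", "APP")}
labels = rng.integers(0, 2, size=200)
train_idx, test_idx = np.arange(150), np.arange(150, 200)

for subset in (("IMU", "SYS"), ("INT", "APP"), ("IMU", "SYS", "INT", "APP")):
    print(subset, round(evaluate_streams(streams, subset, labels, train_idx, test_idx), 3))
```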
6 Conclusion and Future Work
In this work, we presented AID, an automated Intimate Partner Infiltration detection system that continuously monitors for unauthorized access and suspicious behaviors through OS-level and physical signals on smartphones. Through a two-stage architecture that processes multimodal signals, AID operates stealthily on the device while preserving user privacy. A short calibration phase tailors the system to individual user behaviors, allowing it to accurately distinguish non-owner access attempts and identify fine-grained IPI activities. Our evaluation with 27 participants demonstrated AID's effectiveness, achieving an F1 score of up to 0.981 while maintaining a low false positive rate of 4%. These results highlight the potential of AID to serve as a forensic tool for security clinics, enabling scalable assistance to IPV victims.
Looking ahead, future research could explore the feasibility of AID in both proximate and remote IPV scenarios, where attackers may rely on physical closeness or operate from a distance. This includes examining wireless and wired network signals to identify hidden spy devices—such as covert cameras or compromised audio recorders—that function beyond the immediate scope of the targeted smartphone. Another avenue is to investigate collaborative detection strategies across multiple devices or platforms, further reinforcing AID against the diverse range of tactics used in IPV.
Ethics Considerations
Institutional Review Board (IRB) Approval. This study was reviewed and approved by the Institutional Review Board (IRB) at our institution.
Participant Selection and Psychological Safety. Our study did not involve direct interaction with real IPV abusers or victims. Instead, we recruited couples and friends in healthy relationships. This approach was chosen to avoid causing psychological or physical harm to vulnerable populations while still collecting data with behavioral similarities valuable for the development of IPV detection systems. By doing so, we ensured the safety and well-being of participants.
Use of Deception and Debriefing. At the beginning of the data collection process, we employed partial disclosure by framing the study as a "Smartphone Usage Analysis." This deliberate use of deception was necessary to prevent participants from altering their behavior due to the knowledge of the study’s association with IPV-related research, which could bias the data collection and affect the reliability of general behavioral patterns.
After the data collection phase concluded, participants were fully debriefed. During the debriefing, we disclosed the true purpose of the study, explained why deception was necessary, and provided them with an opportunity to ask questions and voice any concerns.
Importantly, participants were given the right to withdraw their data at any point after the debriefing if they felt uncomfortable with the study or its purpose. This ensured that their autonomy and rights were respected throughout the process.
Privacy and Anonymity. To protect participants’ privacy, we designed the study to inherently anonymize all collected data. No personal or identifiable information was logged during the process. Participants used lab-provided smartphones and lab-provided app accounts, ensuring minimal risk of privacy leakage.
Additionally, the collected data were securely transferred and stored on an encrypted lab server accessible only to authorized personnel. These measures ensured that the risk of privacy breaches was effectively mitigated.
Open Science
The artifacts produced by this work include 1) the source code, models, and binaries necessary to run AID, 2) the 18.7-hour phone-usage dataset collected from our 27-participant user study, and 3) the scripts used to generate the results presented in this paper. Given the physically, mentally, and emotionally harmful nature of IPV, we do not plan to make our code, scripts, or datasets available to the general public, out of safety concerns. We plan to share them only upon request with qualified individuals, entities, and authorities, such as security clinics or other reputable researchers in the IPV space.
References
- [1] Alejandro Acien, Aythami Morales, Ruben Vera-Rodriguez, Julian Fierrez, and Ruben Tolosana. Multilock: Mobile active authentication based on multiple biometric and behavioral patterns. In 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications, pages 53–59, 2019.
- [2] Budi Arief, Kovila PL Coopamootoo, Martin Emms, and Aad van Moorsel. Sensible privacy: how we can protect domestic violence survivors without facilitating misuse. In Proceedings of the 13th Workshop on Privacy in the Electronic Society, pages 201–204, 2014.
- [3] Rosanna Bellini. Paying the price: When intimate partners use technology for financial harm. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–17, 2023.
- [4] Rosanna Bellini, Kevin Lee, Megan A Brown, Jeremy Shaffer, Rasika Bhalerao, and Thomas Ristenpart. The Digital-Safety risks of financial technologies for survivors of intimate partner violence. In 32nd USENIX Security Symposium (USENIX Security 23), pages 87–104, 2023.
- [5] Rosanna Frances Bellini. Abusive partner perspectives on technology abuse: Implications for community-based violence prevention. Proceedings of the ACM on Human-Computer Interaction, 8(CSCW1):1–25, 2024.
- [6] Rose Ceccio, Sophie Stephenson, Varun Chadha, Danny Yuxing Huang, and Rahul Chatterjee. Sneaky spy devices and defective detectors: the ecosystem of intimate partner surveillance with covert devices. In 32nd USENIX Security Symposium (USENIX Security 23), pages 123–140, 2023.
- [7] Mario Parreño Centeno, Yu Guan, and Aad van Moorsel. Mobile based continuous authentication using deep features. In Proceedings of the 2nd international workshop on embedded and mobile deep learning, pages 19–24, 2018.
- [8] Rahul Chatterjee, Periwinkle Doerfler, Hadas Orgad, Sam Havron, Jackeline Palmer, Diana Freed, Karen Levy, Nicola Dell, Damon McCoy, and Thomas Ristenpart. The spyware used in intimate partner violence. In 2018 IEEE Symposium on Security and Privacy (SP), pages 441–458. IEEE, 2018.
- [9] Clinic to End Tech Abuse. Clinic to end tech abuse (ceta). https://ceta.tech.cornell.edu/. Accessed: 2025-01-23.
- [10] Debayan Deb, Arun Ross, Anil K Jain, Kwaku Prakah-Asante, and K Venkatesh Prasad. Actions speak louder than (pass) words: Passive authentication of smartphone users via deep temporal features. In 2019 international conference on biometrics (ICB), pages 1–8. IEEE, 2019.
- [11] Sannara Ek, François Portet, and Philippe Lalanda. Lightweight transformers for human activity recognition on mobile devices. arXiv preprint arXiv:2209.11750, 2022.
- [12] Hossein Fereidooni, Jan König, Phillip Rieger, Marco Chilese, Bora Gökbakan, Moritz Finke, Alexandra Dmitrienko, and Ahmad-Reza Sadeghi. Authentisense: A scalable behavioral biometrics authentication scheme using few-shot learning for mobile platforms. arXiv preprint arXiv:2302.02740, 2023.
- [13] Mario Frank, Ralf Biedert, Eugene Ma, Ivan Martinovic, and Dawn Song. Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication. IEEE transactions on information forensics and security, 8(1):136–148, 2012.
- [14] Diana Freed, Sam Havron, Emily Tseng, Andrea Gallardo, Rahul Chatterjee, Thomas Ristenpart, and Nicola Dell. " is my phone hacked?" analyzing clinical computer security interventions with survivors of intimate partner violence. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–24, 2019.
- [15] Diana Freed, Jackeline Palmer, Diana Minchala, Karen Levy, Thomas Ristenpart, and Nicola Dell. “a stalker’s paradise” how intimate partner abusers exploit technology. In Proceedings of the 2018 CHI conference on human factors in computing systems, pages 1–13, 2018.
- [16] Diana Freed, Jackeline Palmer, Diana Elizabeth Minchala, Karen Levy, Thomas Ristenpart, and Nicola Dell. Digital technologies and intimate partner violence: A qualitative analysis with multiple stakeholders. Proceedings of the ACM on human-computer interaction, 1(CSCW):1–22, 2017.
- [17] Google. Accessibilityservice. https://developer.android.com/reference/android/accessibilityservice/AccessibilityService, 2024. Accessed: 2024-08-13.
- [18] Sam Havron, Diana Freed, Rahul Chatterjee, Damon McCoy, Nicola Dell, and Thomas Ristenpart. Clinical computer security for victims of intimate partner violence. In 28th USENIX security symposium (USENIX Security 19), pages 105–122, 2019.
- [19] Jun Ho Huh, Sungsu Kwag, Iljoo Kim, Alexandr Popov, Younghan Park, Geumhwan Cho, Juwon Lee, Hyoungshick Kim, and Choong-Hoon Lee. On the long-term effects of continuous keystroke authentication: Keeping user frustration low through behavior adaptation. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 7(2):1–32, 2023.
- [20] Ruth W. Leemis, Norah Friar, Srijana Khatiwada, May S. Chen, Marcie-jo Kresnow, Sharon G. Smith, Sharon Caslin, and Kathleen C. Basile. The national intimate partner and sexual violence survey: 2016/2017 report on intimate partner violence. National Center for Injury Prevention and Control, Division of Violence Prevention, Centers for Disease Control and Prevention, page 36, October 2022.
- [21] Shinan Liu, Tarun Mangla, Ted Shaowang, Jinjin Zhao, John Paparrizos, Sanjay Krishnan, and Nick Feamster. Amir: Active multimodal interaction recognition from video and network traffic in connected environments. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 7(1):1–26, 2023.
- [22] Madison Tech Clinic. Madison tech clinic. https://techclinic.cs.wisc.edu/. Accessed: 2025-01-23.
- [23] Tara Matthews, Kathleen O’Leary, Anna Turner, Manya Sleeper, Jill Palzkill Woelfer, Martin Shelton, Cori Manthorne, Elizabeth F Churchill, and Sunny Consolvo. Stories from survivors: Privacy & security practices when coping with intimate partner abuse. In Proceedings of the 2017 CHI conference on human factors in computing systems, pages 2189–2201, 2017.
- [24] Sakorn Mekruksavanich and Anuchit Jitpattanakul. Deep learning approaches for continuous authentication based on activity patterns using mobile sensing. Sensors, 21(22):7519, 2021.
- [25] Lana Ramjit, Natalie Dolci, Francesca Rossi, Ryan Garcia, Thomas Ristenpart, and Dana Cuomo. Navigating traumatic stress reactions during computer security interventions. In 33rd USENIX Security Symposium (USENIX Security 24), pages 2011–2028, 2024.
- [26] Megan M. Rogers, Catherine Fisher, Parveen Ali, Peter Allmark, and Lisa Fontes. Technology-facilitated abuse in intimate relationships: A scoping review. Trauma, Violence, & Abuse, 24(4):2210–2226, Oct 2023. Epub 2022 May 10.
- [27] Kevin A Roundy, Paula Barmaimon Mendelberg, Nicola Dell, Damon McCoy, Daniel Nissani, Thomas Ristenpart, and Acar Tamersoy. The many kinds of creepware used for interpersonal attacks. In 2020 IEEE Symposium on Security and Privacy (SP), pages 626–643. IEEE, 2020.
- [28] Cindy Southworth, Shawndell Dawson, Cynthia Fraser, and Sarah Tucker. A high-tech twist on abuse: Technology, intimate partner stalking, and advocacy. Violence Against Women Online Resources, pages 1–16, 2005.
- [29] Sophie Stephenson, Majed Almansoori, Pardis Emami-Naeini, and Rahul Chatterjee. " it’s the equivalent of feeling like you’re in Jail”: Lessons from firsthand and secondhand accounts of IoT-Enabled intimate partner abuse. In 32nd USENIX Security Symposium (USENIX Security 23), pages 105–122, 2023.
- [30] Kurt Thomas, Devdatta Akhawe, Michael Bailey, Dan Boneh, Elie Bursztein, Sunny Consolvo, Nicola Dell, Zakir Durumeric, Patrick Gage Kelley, Deepak Kumar, et al. Sok: Hate, harassment, and the changing landscape of online abuse. In 2021 IEEE Symposium on Security and Privacy (SP), pages 247–267. IEEE, 2021.
- [31] Emily Tseng, Rosanna Bellini, Nora McDonald, Matan Danos, Rachel Greenstadt, Damon McCoy, Nicola Dell, and Thomas Ristenpart. The tools and tactics used in intimate partner surveillance: An analysis of online infidelity forums. In 29th USENIX security symposium (USENIX Security 20), pages 1893–1909, 2020.
- [32] Emily Tseng, Mehrnaz Sabet, Rosanna Bellini, Harkiran Kaur Sodhi, Thomas Ristenpart, and Nicola Dell. Care infrastructures for digital security in intimate partner violence. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2022.
- [33] Shaohua Wan, Lianyong Qi, Xiaolong Xu, Chao Tong, and Zonghua Gu. Deep learning models for real-time human activity recognition with smartphones. mobile networks and applications, 25(2):743–755, 2020.
- [34] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. Generalizing from a few examples: A survey on few-shot learning. ACM computing surveys (csur), 53(3):1–34, 2020.
- [35] Delanie Woodlock. The abuse of technology in domestic violence and stalking. Violence against women, 23(5):584–602, 2017.
- [36] Hui Xu, Yangfan Zhou, and Michael R Lyu. Towards continuous and passive authentication via touch biometrics: An experimental study on smartphones. In 10th Symposium On Usable Privacy and Security (SOUPS 2014), pages 187–198, 2014.
- [37] Shibo Zhang, Yaxuan Li, Shen Zhang, Farzad Shahabi, Stephen Xia, Yu Deng, and Nabil Alshurafa. Deep learning in human activity recognition with wearable sensors: A review on advances. Sensors, 22(4):1476, 2022.
- [38] Xi Zhao, Tao Feng, and Weidong Shi. Continuous mobile authentication using a novel graphic touch gesture feature. In 2013 IEEE sixth international conference on biometrics: theory, applications and systems (BTAS), pages 1–6. IEEE, 2013.
Appendix A Full Action List
Table 11 lists all the tasks participants were required to complete. Only the Platform and Subaction columns were visible to participants.
Table 11: Full task list for the user study.

| Task ID | Platform | Category | Action | Subaction |
| 1 | Gmail | ACCESS | View content | View emails |
| 2 | Gmail | ACCESS | View account | View account settings |
| 3 | Spotify | ACCESS | View account | View account settings |
| 4 | Spotify | ACCESS | View account | Subscription details |
| 5 | Amazon | ACCESS | View account | Inspect order history |
| 6 | Amazon | ACCESS | View account | View browsing history |
| 7 | Amazon | ACCESS | View account | View payment settings |
| 8 | | ACCESS | View content | See your account’s post history |
| 9 | | ACCESS | Upload content | Post a photograph |
| 10 | | ACCESS | View account | Inspect account info |
| 11 | YouTube | ACCESS | View content | View Watch history |
| 12 | YouTube | ACCESS | Upload content | Upload video |
| 13 | YouTube | ACCESS | View account | Open account settings |
| 14 | Slack | ACCESS | View content | View messages |
| 15 | Slack | ACCESS | View content | Inspect files |
| 16 | Slack | ACCESS | View account | Open account settings |
| 17 | Gmail | Modification | Alter account settings | Change profile photo |
| 18 | Gmail | Modification | Modify content | Delete emails |
| 19 | Spotify | Modification | Alter account settings | Change profile photo |
| 20 | Spotify | Modification | Alter account settings | |
| 21 | Spotify | Modification | Alter account settings | Change account name |
| 22 | Spotify | Modification | Modify content | Add, edit, delete music list |
| 23 | Amazon | Modification | Alter account settings | Change password (guess current password) |
| 24 | Amazon | Modification | Alter account settings | Change address |
| 25 | | Modification | Alter account settings | Change username |
| 26 | | Modification | Alter account settings | Change profile picture |
| 27 | YouTube | Modification | Alter account settings | Change password (ask for current password) |
| 28 | Slack | Modification | Alter files | Upload a file |
| 29 | Slack | Modification | Alter files | Modify/rename/comment on a file in group chat |
| 30 | Slack | Modification | Alter files | Delete a file uploaded by your account |
| 31 | Slack | Modification | Alter account settings | Change password (guess current password) |
| 32 | Gmail | POST | Send content | Send emails |
| 33 | Amazon | POST | Send content | |
| 34 | | POST | Send content | Send direct message to a follower |
| 35 | YouTube | POST | Send content | Comment (neutral/decent comments) |
| 36 | Slack | POST | Send content | Send message |
| 37 | File | SOFTWARE | Install SOFTWARE | |
| 38 | Spotify | GENERAL | GENERAL | Music listening for some time |
| 39 | Amazon | GENERAL | GENERAL | Search items |
| 40 | Amazon | GENERAL | GENERAL | Browse item info for some time |
| 41 | | GENERAL | GENERAL | Read others’ posts for some time |
| 42 | | GENERAL | GENERAL | Like posts |
| 43 | YouTube | GENERAL | GENERAL | Search for others’ videos |
| 44 | YouTube | GENERAL | GENERAL | Watch for some time |