
Automation Slicing and Testing for in-App Deep Learning Models

Hao Wu (National Key Laboratory for Novel Software Technology, Nanjing University), Yuhang Gong (National Key Laboratory for Novel Software Technology, Nanjing University), Xiaopeng Ke (National Key Laboratory for Novel Software Technology, Nanjing University), Hanzhong Liang (National Key Laboratory for Novel Software Technology, Nanjing University), Minghao Li (Harvard University), Fengyuan Xu (National Key Laboratory for Novel Software Technology, Nanjing University), Yunxin Liu (Institute for AI Industry Research (AIR), Tsinghua University), and Sheng Zhong (National Key Laboratory for Novel Software Technology, Nanjing University)
Abstract.

Intelligent Apps (iApps), equipped with in-App deep learning (DL) models, are emerging to offer stable DL inference services. However, App marketplaces have trouble auto-testing iApps because the in-App model is a black box coupled with ordinary code. In this work, we propose an automated tool, ASTM, which enables large-scale testing of in-App models. ASTM takes an iApp as input, and its outputs can replace the in-App model as the test object. ASTM proposes two reconstruction techniques: translating the in-App model into a backpropagation-enabled version and reconstructing the IO processing code for DL inference. With ASTM's help, we perform a large-scale study on the robustness of 100 unique commercial in-App models and find that 56% of the in-App models are vulnerable to robustness issues in our context. ASTM also detects physical attacks against three representative iApps that may cause economic losses and security issues.


1. Introduction

Deep learning (DL) technologies have significantly advanced many fields critical to mobile applications, such as image understanding, speech recognition, and text translation (Voulodimos et al., 2018; Kamath et al., 2019; Singh et al., 2017). In addition, substantial research effort has gone into optimizing DL latency and efficiency (Liang et al., 2021; Berthelier et al., 2021; Gou et al., 2021; Menghani, 2021), paving the way toward local intelligent inference on mobile devices such as smartphones. Recent studies (Xu et al., 2019; Sun et al., 2021; Almeida et al., 2021) indicate that intelligent Apps (iApps), i.e., smartphone Apps using in-App DL models, will become increasingly popular, which is also verified by our own study shown in Section 6.1.

The key difference between an iApp and an ordinary App is a software component performing the local intelligent inference (Figure 1). This component usually consists of two parts: the in-App DL model and the IO processing code. The in-App DL model is commonly optimized for easy deployment and speedy inference, and thus lacks the backpropagation (BP) capability. The IO processing code is tightly associated with the corresponding in-App model; it is responsible for both preparing inference inputs and interpreting inference outputs. Incorrect IO processing will impede the success of DL inference.

Figure 1. A typical structure of an iApp. Compared to ordinary Apps, the key difference is a software component performing the local intelligent inference. This component consists of an in-App DL model optimized for inference and its paired code for input and output processing. App marketplaces so far do not support the auto testing of this component from the AI security perspective.

This key difference in iApps brings trouble to an important routine task of App marketplaces: the security-oriented auto-testing of Apps released to them. This is because, by nature, the testing philosophy, paradigm, and requirements for neural networks differ from those for software code (Gao et al., 2020; Huang et al., 2020; Wang et al., 2019). For an in-App DL model, how the inference reacts to perturbed inputs, including the possible interpretations of those reactions, matters more than behavior changes in code. For example, a road-lane-detection iApp is insecure, and should be delisted from App marketplaces, if it is too easy to make it report different lane lines when adversarial perturbations, say a small mark on the road, are applied. As more and more Apps become intelligent, it is urgent to resolve how to efficiently and comprehensively auto-test massive released iApps from the AI security perspective. However, removing this "dark cloud" over auto-testing requires a new tool design addressing two challenges.

Model Conversion. BP operations on the DL model are not only inevitable in various model testing methods (Madry et al., 2017; Kurakin et al., 2016; Dong et al., 2018; Chen et al., 2018), but also critical to DL interpretability (Zhang and Zhu, 2018; Li et al., 2021). However, it is difficult to convert an in-App DL model into its white-box counterpart on a BP-enabled test platform, because such an in-App model is heavily optimized for inference. For example, the neural operators' attributes are implicit and hidden in the App code, and weight quantization makes gradient computation extremely complicated.

Reliable Slicing. A converted white-box model cannot properly run on the testing platform without the cooperation of its own IO processing code in the App. Therefore, the App marketplace needs to precisely slice out this part of the released App code and re-use it during testing. This code slicing must be efficient and accurate, with an extremely high success rate, so that massive testing of released iApps can be supported without manual effort. For example, IO processing often needs to perform assignment operations between data objects and arrays, which are hard to track accurately in general cases.

Existing testing tools cannot meet the above needs. First, existing model conversion tools, e.g., MMdnn (Liu et al., 2020) and ONNX-based tools (onn, [n.d.]), can translate trainable DL models between frameworks, but they cannot translate inference-only models into corresponding BP-enabled versions. Second, dynamic code slicing (Azim et al., 2019) needs human involvement, while the success rate of static code slicing is not high: we find that the Android slicing tool Jicer (Pauck and Wehrheim, 2021) cannot extract runnable IO processing code for model testing purposes.

In this work, we propose an automated tool, ASTM, which enables large-scale testing of in-App DL models for App marketplaces. ASTM can successfully extract the in-App DL model and its IO processing code from massive real-world iApps, and then prepare ready-to-test versions of the in-App models together with their IO processing code for various model assessments. The whole procedure is efficient and requires no developer cooperation, such as model information or source code. The outputs are capable of BP operations, which are important to many model assessment and interpretation methods. ASTM can thus help App marketplaces auto-test iApps in practice from the AI security perspective.

The key designs of ASTM are two reconstruction techniques, i.e., precise code reconstruction and BP-enabled model reconstruction. The code reconstruction adopts a precise Android static slicing, which can produce runnable processing code by fully considering the App's control flow, invoking context, field access, and branch selection during slicing. It also proposes an IO-processing-oriented code generation to ensure that sliced statements execute in the same order as they do in the original iApp. The BP-enabled model reconstruction converts an inference-only model into a white-box one, which is used for comprehensive testing. It first abstracts the framework-independent computation procedure and then utilizes rule-based completion of operator attributes to recover the information stripped during model conversion.

We implement ASTM and test in-App models at scale. We collect about 15k unique Apps from five marketplaces. Following the in-App model finding method proposed by the work (Xu et al., 2019), we find 3,064 iApps equipped with 800 unique in-App DL models. We then test the robustness of 100 unique in-App models built with the most popular DL frameworks and find that 56% of them are vulnerable to robustness issues (Section 6.3). ASTM also detects physical attacks against three representative in-App models. Details of the measurement are presented in Section 6.

We highlight the key contributions as follows:

(1) ASTM is the first effort to enable in-App DL model testing at scale. It can automatically reconstruct the IO processing code and BP-enabled DL models for powerful white-box testing techniques. ASTM is fully automatic and works without the iApp provider's cooperation.

(2) ASTM proposes two novel reconstruction techniques. The code reconstruction is a precise Android static slicing that produces runnable IO processing code. The model reconstruction rebuilds the BP-enabled model from its inference-only version by establishing an equivalent computation and completing the stripped information.

(3) We perform a robustness assessment on 100 unique commercial in-App models through ASTM. We also successfully detect physical adversarial attacks against commercial iApps, which may cause economic losses and serious security issues.

2. Background and Related Works

2.1. In-App DL Models

iApps equipped with DL models are emerging, and in-App models have been studied preliminarily. The work (Xu et al., 2019) is the first to focus on iApps and proposes a way to find the DL models and frameworks in an App. The work (Sun et al., 2021) then investigates how iApp providers protect in-App DL models. The works (Almeida et al., 2021) and (Zhang et al., 2022) perform comprehensive studies on DL model inference performance. The work (Huang et al., 2021) studies the robustness of in-App DL models.

Existing works do not propose techniques for automated reconstruction of the IO processing code in an iApp: they either do not need the IO processing code or reconstruct it manually (Xu et al., 2019; Almeida et al., 2021). Nor can these works directly reconstruct a BP-enabled model from the in-App model to perform model assessment. For example, the work (Huang et al., 2021) assesses model robustness by exploiting the transferability of adversarial attacks.

2.2. Android Slicing

Program slicing is a technique to extract the statements that may affect a given statement (stmt) and a set of values (V) through data dependency and control dependency analysis. The given statement and the set of values are referred to as a slicing criterion <stmt, V>. There are two well-known frameworks in Android analysis scenarios, i.e., WALA (Sridharan et al., 2007) and Soot (Lam et al., 2011). These frameworks provide the basic functionalities and programmable interfaces to build slicing algorithms. The code reconstruction of ASTM is based on Soot because Soot has better support for Dalvik bytecode and many Android analysis tools are developed on top of it.

The work (Azim et al., 2019) designs a human-involved Android dynamic slicing tool with which users first perform App instrumentation, manually run the instrumented App, and dynamically collect logs to perform Android slicing. AppSlicer (Bhardwaj et al., 2019) is another activity-level dynamic App slicing technique: it first triggers all activities in a simulator, records the code invoked by each activity, and slices the target activity according to the recorded content. Jicer (Pauck and Wehrheim, 2021) performs general Android static slicing, aiming to slice the code of any App with any slicing criterion. However, Jicer fails to slice the processing code from a commercial iApp because it lacks a processing-code-oriented design.

A slicer's accuracy depends on its sensitivity to various program features (Pauck and Wehrheim, 2021). A flow-sensitive slicer takes control dependencies into full consideration. Fully considering from which context a method is called makes a slicer context-sensitive. Handling data dependencies between usages of the same field in different methods makes a slicer field-sensitive. When handling conditional statements, a path-sensitive slicer considers which branch to slice.

The code reconstruction of ASTM is a fully automatic, precise Android static slicing without human involvement. In addition, our reconstruction tool is flow-, context-, field-, and path-sensitive.

2.3. Adversarial Attacks

Adversarial attacks have been widely studied in both the AI and security communities. DL models have been demonstrated to be vulnerable to adversarial examples (Szegedy et al., 2013). Adversarial examples are inputs to DL models that have been intentionally optimized to cause the models to make a mistake. Specifically, given a DL model $f_\theta(\cdot)$ with parameters $\theta$ and an input $x$ with ground truth label $y$, an adversarial example $x'$, which is close to $x$, is produced by optimization. $x'$ causes the DL model to make an incorrect prediction: $f_\theta(x') \neq y$ (untargeted attacks), or $f_\theta(x') = y^*$ for some $y^* \neq y$ (targeted attacks).

Attack scenarios can be classified by the amount of knowledge the adversary has about the model, i.e., white box attacks (Madry et al., 2017) and black box attacks (Brendel et al., 2017). In the white box scenario, the adversary fully knows the model, including model type, model architecture, and the values of all parameters, and can perform gradient-based attacks on it. In the black box scenario, the adversary has limited information about the model and can only attack by probing the model and observing its output.

Adversarial attacks can also be classified into digital attacks (Goodfellow et al., 2014) and physical attacks (Eykholt et al., 2018) according to how the adversary modifies the input data. In a digital attack, the adversary has direct access to the actual data fed into the model and can modify every bit of the input. In a physical attack, the adversary does not have direct access to the digital representation of the input data and can only place objects in the physical environment seen by the camera.

In our work, we assess the robustness of the in-App models through the white box digital attacks. We also perform white box physical attacks on three representative in-App models.

3. Overview

In this section, we first define the security model and design requirements of ASTM. Then we present the high-level design of ASTM and introduce the two key reconstruction techniques.

3.1. Problem Overview

Security issues of in-App models can cause an iApp to behave abnormally, and the in-App model may even become a "protective umbrella" for an iApp's malicious behavior. Therefore, App markets have a strong need to test the in-App models of massive iApps.

3.1.1. Security Model.

An iApp may be developed by careless or malicious developers. Careless developers may use an in-App model that has not undergone comprehensive testing; malicious developers may even launch active attacks by poisoning the training data. We do not directly handle the scenario where the iApp is packed: if App markets want to test the in-App model of a packed iApp, they can first use previous works (Duan et al., 2018; Xue et al., 2020; Xue et al., 2021) to unpack the iApp and then use ASTM to prepare the IO processing code and the BP-enabled model.

3.1.2. Requirements.

Given the challenges discussed in Section 1, we summarize three requirements to perform comprehensive testing on in-App models.

First, the model before and after reconstruction should be equivalent. The model reconstruction mainly performs two things: porting the in-App model to a DL framework that supports training, and removing the obstacles that hinder gradient computation, such as undoing weight quantization and replacing inference-only operations. The reconstructed model and the in-App model should be exactly equivalent in computational procedure and trainable parameters, so that testing the reconstructed model is equivalent to testing the in-App model.

Second, the processing code reconstruction should be precise and runnable. We show a case (the DL model is taken from an iApp whose bundle ID is org.prudhvianddheeraj.lite.example.detection) in Figure 2 to demonstrate the effect of the processing code on the correctness of DL inference. Before being fed into the DL model, the user input needs to be resized in a preset way; if the resize operation is omitted or substituted by a random resize, the inference result is wrong. The reconstruction should be precise so that no processing operation is missed. At the same time, it should be practical so as to produce runnable results.

Figure 2. The inference results on the user data with and without correct pre-processing.

Third, the proposed tool should be automatic. To enable App markets to test in-App models at scale, the model and code reconstructions should be fully automatic, without human involvement. In our scenario, the reconstructions must also be done without the cooperation of iApp developers.

Figure 3. The workflow of the code reconstruction.

3.2. ASTM Design Overview

In this section, we introduce ASTM's workflow and briefly explain how ASTM meets the requirements discussed in Section 3.1.

We first use the method of the work (Xu et al., 2019) to determine whether an App contains in-App models. If an iApp is found, we then perform the proposed code reconstruction and model reconstruction to build the test object.

During the code reconstruction, ASTM takes a released iApp as input and outputs runnable IO processing code. It first finds the invoking statements of the DL inference framework through static program analysis. Note that we do not need to reconstruct the DL inference framework, because it will be replaced by a ready-made framework that supports BP.

Then, ASTM performs precise Android static slicing on the released iApp to extract the IO processing code. The slicing starts at the DL framework's invoking statements: it performs backward slicing to extract all statements that determine the values used by the invoking statements, and forward slicing to extract all statements that use the values defined by the invoking statements.

Next, based on the sliced IO processing code, ASTM generates Python code that can run on PCs for model testing. Generating Python code from the sliced APK is feasible because the sliced code is responsible for IO processing, which is not coupled with the Android platform or system services.

During the model reconstruction, ASTM takes an in-App model as input and outputs its BP-enabled version. ASTM first extracts the computation procedure of the in-App model and eliminates the factors hindering gradient calculation, e.g., by undoing the quantization and removing inference-specific operations.

Then, ASTM utilizes a set of carefully elaborated rules to rebuild the information stripped during the deployment-oriented conversion. The stripped information is mainly the attributes of the operators, such as the filter number and kernel size of a convolution operator. Finally, with the rebuilt computation procedure and recovered attributes, ASTM reconstructs the BP-enabled model with the widely-used DL framework Keras (https://keras.io/).

Combining the code reconstruction and the model reconstruction, ASTM can produce a test object, equivalent to the iApp's DL inference module, that can be assessed by various powerful testing techniques. The next section details ASTM's design, and Section 6 shows how to use ASTM to test the in-App models of commercial iApps at scale.

4. Precise Code Reconstruction

To reconstruct the IO processing code, ASTM performs precise Android static slicing on the iApp. It is able to extract all user input and inference output processing code from the iApp and generate the corresponding Python code for the following testing. The pre-processing code can be extracted by iteratively finding all statements that determine the parameters of the DL framework inference interface. The post-processing code can be extracted by finding all statements tainted by the inference output.

To achieve the above objectives, ASTM performs both backward and forward slicing starting from the DL framework invoking statements to find all IO processing codes. The backward slicing stops once semantically explicit user input is found, such as an unprocessed image or sound. The forward slicing stops once the inference output is parsed into a structure that can be used for model testing, such as category and confidence.

We demonstrate the code reconstruction workflow in Figure 3. The following sections detail the design of the code reconstruction.

4.1. Slicing Preparation

Figure 4. Control and data dependency among the statements considered by ASTM.

Taking an iApp as input, ASTM utilizes Soot to build the App's call graph (CG) and function-level control flow graphs (CFGs). ASTM then builds an App-level CFG with global variable dependencies for the given iApp (Step ❶ in Figure 3). In this step, ASTM adds edges that 1) represent dependencies between callers and callees according to the CG (e.g., Edge ➁ and Edge ➂ in Figure 4), and 2) represent dependencies between statements that read or write the same field of a class (e.g., Edge ➃ in Figure 4).

We denote the built graph as the slicing basis because it can represent the data and control dependencies between statements in an iApp. The data dependency and control dependency between two statements are denoted as $\leftarrow_d$ and $\leftarrow_c$, respectively.

As for data dependency, if use(stmt1) ∩ def(stmt2) ≠ ∅, then stmt1 $\leftarrow_d$ stmt2. The data dependencies we consider fall into two types. The first type is brought by the def-use relationship of local variables within a single function (Edge ➀ in Figure 4). The second type is brought by the def-use relationship of field variables between functions (Edge ➃ in Figure 4).

We consider three types of control dependency. The control dependency within a single function is brought by branching statements: if stmt1 can determine whether stmt2 is executed, then stmt2 $\leftarrow_c$ stmt1 (Edge ➄ in Figure 4). The control dependency between functions occurs when a function is called (Edge ➁ in Figure 4) or exits (Edge ➂ in Figure 4).

After building the slicing basis, ASTM prepares the slicing criteria for the following App slicing (Step ❷ in Figure 3). Due to different inference tasks and different engineering implementations, the entry and exit points of IO processing lack uniform characteristics across iApps, so it is difficult to locate them as slicing criteria. In contrast, the DL framework's invoking interfaces, which are used to load DL models and perform universal DL computation, have significant static characteristics (Zhang et al., 2022). ASTM utilizes these invoking interfaces to prepare slicing criteria for the forward and backward slicing, respectively. A backward slicing criterion consists of a statement that invokes the DL framework and the values used by that statement; the goal of backward slicing is to extract all statements that determine the used values. A forward slicing criterion consists of a statement that invokes the DL framework and the values defined by that statement; the goal of forward slicing is to extract all statements that use the defined values.

To find the invoking interfaces of the DL framework, we improve the idea used by existing works (Xu et al., 2019; Sun et al., 2021). If the iApp utilizes an open-source DL framework with well-documented interfaces, e.g., TFLite, we can directly locate the statements that invoke these interfaces in the slicing basis as the slicing criteria. We find that matching on the invoking interfaces' parameters and return values, together with the availability of the invoking methods, is robust enough to identify the invoking interfaces even when the iApp is protected with string-based obfuscation.
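To illustrate the idea, the following is a minimal sketch of such signature-based matching on a simplified method record; the Method record and the TFLite-like signature constant are our own illustrative assumptions, not ASTM's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Method:
    """Simplified view of a method extracted from the APK (illustrative)."""
    name: str            # may be obfuscated, e.g., "a"
    param_types: tuple   # fully-qualified parameter types
    return_type: str

# Hypothetical type signature of a TFLite-style inference entry point such as
# Interpreter.run(Object input, Object output); the method name may be
# obfuscated, but parameter and return types survive string-based obfuscation.
TFLITE_RUN_SIGNATURE = (("java.lang.Object", "java.lang.Object"), "void")

def is_inference_invocation(m: Method) -> bool:
    # Match on types only, never on names.
    return (m.param_types, m.return_type) == TFLITE_RUN_SIGNATURE

methods = [
    Method("a", ("java.lang.Object", "java.lang.Object"), "void"),  # obfuscated run()
    Method("b", ("java.lang.String",), "int"),
]
print([m.name for m in methods if is_inference_invocation(m)])  # -> ['a']
```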

4.2. Code Slicing

ASTM proposes an Android slicing technique to extract the user-data pre-processing and inference-output post-processing code (Step ❸ in Figure 3). The technique performs bidirectional slicing starting from the found slicing criteria.

We denote a slicing criterion as <stmt, V>. The stmt denotes the invoking statement of the DL framework, which is the slicing start point. V represents the variables used by stmt when performing backward slicing, i.e., V = use(stmt), and the variables defined by stmt when performing forward slicing, i.e., V = def(stmt). For example, int[] r = function(String p1, byte[] p2, long[] p3) is a DL framework invoking statement, denoted as stmt_criterion; its def values and use values are {r} and {p1, p2, p3}, respectively.

The slicing algorithm is shown in Algorithm 1. The algorithm takes the slicing basis as input and extracts the statements belonging to the processing code by analyzing the data and control dependencies. Recall that the slicing basis is the App-level CFG with global variable dependencies built in Section 4.1; each edge of the slicing basis represents a data or control dependency between two statements.

When performing backward slicing, if the statement under analysis has a data or control dependency with a sliced statement (line 10), the statement is added to the sliced results (line 11). The values defined by the newly-sliced statement are then removed from the slicing criterion, and the values it uses are added (line 12). Then the backward slicing continues (line 13).

When performing forward slicing, if the statement under analysis has no data or control dependency with the sliced statements (line 20), the values it defines are removed from the slicing criterion (lines 21–22) before the forward slicing continues (line 23). If the statement does depend on a sliced statement, it is added to the results and the values it defines are added to the slicing criterion (lines 25–26). In addition to continuing the forward slicing (line 27), we also perform backward slicing to determine the values used by the newly-added statement (lines 28–29).

The slicing ends when all the values can be determined (lines 2–3) or when there are no more statements to analyze (lines 6–7 and 16–17). pred_of(·) (line 5) and succ_of(·) (line 15) return the predecessor and successor of the given statement on the slicing basis.

Data: The App-level CFG with global variable dependency, G_b; the slicing criterion, <stmt_sc, V_sc>; the slicing direction, D.
Result: The slicing results, S_res.

 1 begin
 2   if V_sc == ∅ then
 3     return S_res
 4   if D == "backward" then
 5     stmt_n = G_b.pred_of(stmt_sc)
 6     if stmt_n == NULL then
 7       return S_res
 8     V_d = def(stmt_n)
 9     V_u = use(stmt_n)
10     if V_d ∩ V_sc ≠ ∅ then
11       S_res.add(stmt_n)
12       V_sc' = (V_sc − V_d) ∪ V_u
13     S_res.add(slice(G_b, <stmt_n, V_sc'>, "backward"))
14   if D == "forward" then
15     stmt_n = G_b.succ_of(stmt_sc)
16     if stmt_n == NULL then
17       return S_res
18     V_d = def(stmt_n)
19     V_u = use(stmt_n)
20     if V_u ∩ V_sc == ∅ then
21       if V_d ∩ V_sc ≠ ∅ then
22         V_sc' = V_sc − V_d
23       S_res.add(slice(G_b, <stmt_n, V_sc'>, "forward"))
24     else
25       S_res.add(stmt_n)
26       V_sc' = V_sc ∪ V_d
27       S_res.add(slice(G_b, <stmt_n, V_sc'>, "forward"))
28       V_sup = V_u − V_sc
29       S_res.add(slice(G_b, <stmt_n, V_sup>, "backward"))

Algorithm 1. Pseudo code of performing slicing, denoted as slice(·).

4.3. Code Generation

After slicing all statements, ASTM generates executable Python code for the following DL model assessment (Step ❹ in Figure 3). The code generation is feasible because the sliced code performs data processing in the DL inference scenario, which is not Android-platform-specific. The generation procedure consists of two steps, i.e., statement translation and statement ordering.

4.3.1. Statement Translation.

The code slicing works on an intermediate representation (IR) provided by Soot (ASTM utilizes Jimple, the widely-used intermediate language in Android App analysis). Most of the IR code, e.g., assignment statements and invoking statements, can be easily translated into Python code. The IR code needing elaborate processing is the conditional statement, i.e., the if statement, and the jump statement, i.e., the goto statement. This is because a loop structure is represented as the combination and nesting of conditional and jump statements. We must rebuild the loop structure, especially the loop condition and loop body, to ensure the correctness of data processing.

ASTM adopts the following rule to identify loop structures. We denote a sliced statement as stmt and the CFG of the function that stmt belongs to as G. The direct successor of stmt on G is stmt_succ. If, on G, stmt_succ dominates stmt (in a CFG, a statement stmt1 dominates a statement stmt2 if every path from the CFG's entry statement to stmt2 must go through stmt1), we say a loop structure is found, and stmt_succ is the entry point of the loop body. The loop body of the found loop structure is the intersection of all predecessor statements of stmt and all successor statements of stmt_succ. The loop condition of the found loop structure is the if statement in the loop body whose target statement (stmt' is the target statement of an if statement of the form "condition goto stmt'") lies beyond the loop body. We can then reconstruct all loop structures in the sliced code and translate them into Python.
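The dominator-based loop test can be sketched as follows; the toy CFG, the fixed-point dominator computation, and all names are illustrative assumptions rather than ASTM's actual code.

```python
# Dominators on a toy CFG given as {node: [successors]}; the fixed-point
# computation below is the standard iterative algorithm.
def dominators(cfg, entry):
    nodes = set(cfg)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    preds = {n: [p for p in cfg if n in cfg[p]] for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | (set.intersection(*(dom[p] for p in preds[n]))
                         if preds[n] else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def find_loops(cfg, entry):
    """An edge stmt -> succ signals a loop if succ dominates stmt."""
    dom = dominators(cfg, entry)
    return [(s, t) for s in cfg for t in cfg[s] if t in dom[s]]

# Toy CFG: 'cond' is the loop header; 'body' jumps back to it.
cfg = {"entry": ["cond"], "cond": ["body", "exit"], "body": ["cond"], "exit": []}
print(find_loops(cfg, "entry"))  # -> [('body', 'cond')]
```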

4.3.2. Statement Ordering

The order in which the sliced statements execute determines the correctness of the data processing. We first organize the sliced statements into functions as they are in the original iApp; the order of the statements within a function can be determined according to the CFG of that function. Next, we define all the organized functions and then determine how to call these functions in the correct order.

According to ASTM's design, each of these organized functions has explicit invoking relationships or implicit data dependencies with at least one of the other functions. Note that the data dependencies are brought about by reading or writing the same field. We group these functions according to whether there is a direct or indirect invoking relationship between them. For example, if f_A() calls f_B() and f_B() calls f_C(), then f_A(), f_B(), and f_C() are organized into one function group. We denote the function without any caller in each function group as the head function; in the above example, f_A() is the head function. Now, we only need to order the head functions, and the rest of the functions in each group are invoked automatically when its head function is called.

To order the head functions, we propose a "Write-before-Read" principle that fully considers the field dependencies among the functions. The principle consists of two rules.

R1: Function groups that do not read any field can be arranged in any order.

R2: For any field, the function group that writes the field should rank before the function groups that read it.

ASTM first records the field variables read and written by each function group, denoted as gv_f^R and gv_f^W respectively, where the subscript f indicates the function group. Then we arrange, in arbitrary order, the head functions whose function group's gv_f^R is empty (R1). Once a head function is ordered, ASTM updates gv_g^R of each of the remaining function groups as gv_g^R = gv_g^R − gv_f^W. ASTM repeatedly orders the function groups whose gv^R becomes empty (R2) until all function groups are arranged.
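A minimal sketch of this ordering procedure, assuming the per-group field read/write sets (gv_f^R, gv_f^W) were already recorded during slicing:

```python
def order_head_functions(groups):
    """groups: {head_name: (reads, writes)} with reads/writes as sets of fields."""
    reads = {f: set(r) for f, (r, _) in groups.items()}
    writes = {f: set(w) for f, (_, w) in groups.items()}
    ordered = []
    while len(ordered) < len(groups):
        # R1/R2: a group is ready once every field it reads has been written.
        ready = [f for f in groups if f not in ordered and not reads[f]]
        if not ready:
            raise ValueError("cyclic field dependency; cannot order")
        for f in ready:
            ordered.append(f)
            for g in groups:            # satisfied reads drop out: gv_g^R -= gv_f^W
                reads[g] -= writes[f]
    return ordered

# f_A writes field "buf"; f_B reads "buf" -> f_A must run before f_B.
print(order_head_functions({"f_A": (set(), {"buf"}), "f_B": ({"buf"}, set())}))
# -> ['f_A', 'f_B']
```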

5. BP-enabled Model Reconstruction

| Operator Type | Attribute | Needed Information | Computation Rule |
| --- | --- | --- | --- |
| Conv2D [DepthwiseConv2D] | filters | output_shape | output_shape[-1] |
| | kernel_size[0] | weight_shape | weight_shape[1] |
| | kernel_size[1] | weight_shape | weight_shape[2] |
| | strides[0] | input_shape, output_shape, kernel_size | round((input_shape[1] − kernel_size[1]) / (output_shape[1] − 1)) |
| | strides[1] | input_shape, output_shape, kernel_size | round((input_shape[2] − kernel_size[2]) / (output_shape[2] − 1)) |
| | padding | input_shape, output_shape, kernel_size, strides | "same" when ⌊(input_shape[i] − kernel_size[i−1]) / strides[i−1] + 1⌋ = input_shape[i] (i = 1, 2) [default: "valid"] |
| DepthwiseConv2D | depth multiplier | input_shape, output_shape | output_shape[-1] / input_shape[-1] |
| Conv2DTranspose | filters | output_shape | output_shape[-1] |
| | kernel_size[0] | weight_shape | weight_shape[1] |
| | kernel_size[1] | weight_shape | weight_shape[2] |
| | strides[0] | input_shape, output_shape, kernel_size | round((output_shape[1] − kernel_size[1]) / (input_shape[1] − 1)) |
| | strides[1] | input_shape, output_shape, kernel_size | round((output_shape[2] − kernel_size[2]) / (input_shape[2] − 1)) |
| | padding | input_shape, output_shape, kernel_size, strides | "same" when ⌊(input_shape[i] − 1) × strides[i−1] + kernel_size[i−1]⌋ = output_shape[i] (i = 1, 2) [default: "valid"] |
| MaxPooling / AveragePooling | pool_size | input_shape, output_shape | round(input_shape[1] / output_shape[1]) |
| | padding | input_shape, output_shape, pool_size | "same" when ⌊input_shape[i] / (pool_size + ε)⌋ < output_shape[i] (i = 1, 2) [default: "valid"] |
| UpSampling | size | input_shape, output_shape | ⌊output_shape[1] / input_shape[1]⌋ |
| Pad, MirrorPad | padding | input_shape, output_shape | padding[i][0] = ⌊(output_shape[i] − input_shape[i]) / 2⌋; padding[i][1] = output_shape[i] − input_shape[i] − padding[i][0] |
| Space2Batch | block_size | output_shape | ⌊√(output_shape[0])⌋ |
| | padding | input_shape, output_shape, block_size | padding[i][0] = ⌊(output_shape[i+1] × block_size − input_shape[i+1]) / 2⌋; padding[i][1] = output_shape[i+1] × block_size − input_shape[i+1] − padding[i][0] |

Table 1. Rules for computing the operators' attributes. The first column lists the operator types; the second, the operator attributes to rebuild; the third, the information taken from the corresponding operator in the original model; the last, the rule to compute the operator's attribute from that information.

The in-App DL model and its framework are primarily designed for efficient inference, and most on-device DL frameworks do not support gradient computation and backpropagation. However, these training-specific computations are the foundation of white-box DL model assessment. Given that all in-App models are converted or compiled from BP-enabled models, ASTM proposes a BP-enabled model reconstruction technique for better assessment performance.

The model reconstruction technique takes an inference-only DL model as input and produces its corresponding BP-enabled version. It can port the DL model from an inference-only DL framework (e.g., TFLite) to a DL framework that supports model training (e.g., TensorFlow). Recall that existing converters have poor support for converting an inference-only model to a BP-enabled one.

A DL model can be viewed as a computational graph that defines how to process the inference input. The computational graph is a directed graph representing the DL computation procedure and has three kinds of nodes, i.e., operator nodes, parameter nodes, and input nodes. Operator nodes represent the basic mathematical computations in the DL model, such as convolution, dense, padding, multiply, etc., and are characterized by operator type and operator attributes. Parameter nodes represent the trainable parameters of the neural network, e.g., weights and biases. Parameter nodes, input nodes, and operator nodes can feed their values (i.e., tensors) into other operator nodes, and operator nodes compute their outputs given the values of their inputs.

The model reconstruction consists of three key steps. ASTM first extracts the structure of the computational graph. The structure conveys the data dependencies between the nodes and the type of the operator nodes. Then ASTM computes the values of operator attributes and the values of the parameter nodes. Finally, ASTM generates the model code from the constructed computational graph.

Graph Structure Extraction. Before extracting the structure, we define a representation for the DL model's computational graph: a directed graph that can represent the data dependencies among nodes. The input node is characterized by its input shape and output shape. The operator node is characterized by input shapes, output shapes, operator type, and operator attributes. The parameter node is characterized by its output shape and values. ASTM's graph representation shares the same design as the computational graph of the widely-used Keras in terms of operators and parameters.
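A minimal sketch of such a representation; the field names are illustrative, not ASTM's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class Node:
    name: str
    kind: str                       # "input" | "operator" | "parameter"
    output_shape: tuple
    op_type: Optional[str] = None   # operator nodes only, e.g., "Conv2D"
    attrs: dict = field(default_factory=dict)  # operator attributes to rebuild
    value: Optional[np.ndarray] = None         # parameter nodes only
    inputs: List["Node"] = field(default_factory=list)

# A tiny graph: an input node feeding a Conv2D operator, with the kernel
# held by a parameter node.
x = Node("input", "input", output_shape=(1, 224, 224, 3))
w = Node("conv1/kernel", "parameter", output_shape=(32, 3, 3, 3),
         value=np.zeros((32, 3, 3, 3), dtype=np.float32))
conv = Node("conv1", "operator", output_shape=(1, 112, 112, 32),
            op_type="Conv2D", inputs=[x, w])
```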

First, ASTM parses the in-App DL model to obtain information about the model structure, operators, and parameters. Then, ASTM uses the parsed information to rebuild the structure of the computational graph in our representation. The rebuilt computational graph conveys the complete computation procedure, and the types of the operator nodes are determined.

Node Value Computation. After getting the graph structure, ASTM builds the values of operator attributes and the values of parameter nodes. Note that operator attributes represent the operator configuration. For example, the convolution’s attributes contain the kernel size, strides, padding strategy, and output channel. The parameter nodes’ values are constants used by operators (e.g., the weights of the convolution operator), and the values are determined during training.

The parameter values in different DL frameworks are represented as tensors with different axis orders. Our representation shares the same axis order as Keras, so we can compute the parameter values by reordering the axes of the in-App model's parameter values. Some of the parameter values, e.g., the weights of convolution operations, are quantized. ASTM undoes the quantization according to the quantization rules. For example, we use $v = s \times (q - z)$ to undo TFLite's quantization, where $v$ is the recovered weight, $q$ is the quantized value, $s$ denotes the scale coefficient, and $z$ denotes the zero point.
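A one-line sketch of this dequantization rule (the scale and zero point would be read from the model's tensor metadata; the values below are made up):

```python
import numpy as np

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # v = s * (q - z): undo TFLite-style affine quantization.
    return scale * (q.astype(np.float32) - zero_point)

q = np.array([0, 128, 255], dtype=np.uint8)
print(dequantize(q, scale=0.02, zero_point=128))  # -> [-2.56  0.    2.54]
```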

As for the attribute values, ASTM proposes a rule-based computation method that fully considers the operators' computation characteristics. We can obtain the attribute values through DL-framework-independent information. For example, when computing the attribute values of a convolution operator, the kernel size can be obtained through the output shape of the convolution operator's input parameter node.

Part of the rules used to compute the representative attribute values is summarized in Table 1. The needed information is the information used to compute the attribute values and can be collected from the corresponding operator in the in-App model. input_shape denotes the shape of the corresponding operator's input tensor; output_shape denotes the shape of its output tensor; weight_shape denotes the output_shape of the operator's input parameter node.
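As an illustration, the Conv2D rows of Table 1 can be turned into the following sketch. It assumes NHWC activation shapes and a (filters, kh, kw, channels) weight layout, and it replaces the table's padding condition with the equivalent check that the "valid" output formula reproduces the observed output shape.

```python
def conv2d_attrs(input_shape, output_shape, weight_shape):
    """Rebuild Conv2D attributes from shape information alone (cf. Table 1)."""
    filters = output_shape[-1]
    kernel_size = (weight_shape[1], weight_shape[2])
    strides = tuple(
        round((input_shape[i] - kernel_size[i - 1]) / (output_shape[i] - 1))
        for i in (1, 2)
    )
    # Padding check: with "valid" padding the output size is
    # floor((in - k) / s) + 1; if the observed output is larger, it was "same".
    valid_out = tuple(
        (input_shape[i] - kernel_size[i - 1]) // strides[i - 1] + 1 for i in (1, 2)
    )
    padding = "valid" if valid_out == (output_shape[1], output_shape[2]) else "same"
    return {"filters": filters, "kernel_size": kernel_size,
            "strides": strides, "padding": padding}

# 224x224x3 input, 3x3 kernel, stride 1, "same" padding -> 224x224x32 output.
print(conv2d_attrs((1, 224, 224, 3), (1, 224, 224, 32), (32, 3, 3, 3)))
# -> {'filters': 32, 'kernel_size': (3, 3), 'strides': (1, 1), 'padding': 'same'}
```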

Model Code Generation. ASTM can finally generate the model code from the constructed computational graph. The generated model code is the same kind of code used to develop a DL model and load trained model weights. The model code has three parts, i.e., operator initialization, flow construction, and weight loading. The operator initialization code utilizes the operator types and attributes to create operator instances. The flow construction code utilizes the graph structure and operator instances to construct the model's data flow. The weight loading code assigns the trained parameters to the operator instances.

Without loss of generality, we choose the widely-used Keras as the DL framework for reconstructing the BP-enabled model. Algorithm 2 shows how ASTM generates the Keras model code. The algorithm takes the previously constructed computational graph as input. First, ASTM adds the initialization code for every operator node (lines 1–4). Then, it performs a topological sort on the computational graph to determine the execution order of the operators (line 5). Next, it generates the forward-pass code according to the operator execution order (lines 6–7). ASTM adds a fragment of template code to construct the DL model instance (line 8). Finally, we add the code that loads the weights of the reconstructed model and assigns them to the corresponding operator nodes (lines 9–10).

Data: The computational graph, graph = [node_1, node_2, ..., node_n]
Result: code

 1 code = []  // initialization
 2 for node in graph do
 3   if node.nodeType == 0 then   // operator node
 4     code.append(genInitCode(node))
 5 sortedNodeList = topologySort(graph)
 6 for node in sortedNodeList do
 7   code.append(genForwardCode(node))
 8 code.append(modelBuildingCode)
 9 for node in graph do
10   code.append(genWeightInitCode(node))

Algorithm 2. Model Code Generation.
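The output of this procedure is ordinary Keras code. As a hand-written illustration of what the three generated parts might look like for a tiny two-operator model (the layer names, shapes, and placeholder weights are ours, not from a real iApp):

```python
import numpy as np
from tensorflow import keras

# 1) Operator initialization (from operator types and rebuilt attributes).
conv1 = keras.layers.Conv2D(filters=32, kernel_size=(3, 3),
                            strides=(1, 1), padding="same", name="conv1")
relu1 = keras.layers.ReLU(name="relu1")

# 2) Flow construction (from the computational-graph structure).
inputs = keras.Input(shape=(224, 224, 3), name="input")
outputs = relu1(conv1(inputs))
model = keras.Model(inputs, outputs)

# 3) Weight loading (dequantized, axis-reordered parameters).
kernel = np.zeros((3, 3, 3, 32), dtype=np.float32)  # stands in for recovered weights
bias = np.zeros((32,), dtype=np.float32)
model.get_layer("conv1").set_weights([kernel, bias])
```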

6. Evaluation

We implement ASTM: the code reconstruction consists of 6.3K lines of Java code, and the model reconstruction consists of 1.9K lines of Python code.

In this section, we first perform a large-scale study on all found iApps. We then perform code and model reconstruction on 100 in-App models using the top two frameworks, i.e., TFLite and TensorFlow. The selected models perform vision-related tasks, the most widely used kind of task in iApps (Xu et al., 2019). Next, we use ASTM to perform a robustness assessment on the reconstructed results to show the effectiveness of ASTM. Finally, we perform three representative physical adversarial attacks to demonstrate ASTM's security implications.

6.1. iApp Statistics

We downloaded 25k APKs from five App markets, i.e., APKPure (APKPure, [n.d.]), 360 App Store (360, [n.d.]), Baidu App Store (bai, [n.d.]), Xiaomi App Store (mi_, [n.d.]), and Anzhi Market (anz, [n.d.]), in June 2021, taking about 5k of the most popular Apps from each market. After removing duplicated Apps according to the APKs' MD5, about 15k Apps remain.

We find 3,064 iApps among the downloaded APKs, containing 3,845 in-App models. Some of these models are repeated across iApps because they are open source or come from the same intelligent SDK, e.g., Volcengine (from ByteDance) (Bytedance, [n.d.]) or SenseTime (SenseTime, [n.d.]). After removing duplicate models, we find 800 unique in-App models. We count these in-App models by DL framework in Table 2.

| Category | Framework/SDK | Count |
| --- | --- | --- |
| Open-source framework (372) | TensorFlow (Abadi et al., 2015) | 128 |
| | TFLite (tfl, 2021) | 123 |
| | NCNN (Tencent, [n.d.]a) | 42 |
| | Caffe (Jia et al., 2014) | 35 |
| | MNN (Jiang et al., 2020) | 32 |
| | TNN (Tencent, [n.d.]b) | 9 |
| | ONNX (onn, [n.d.]) | 3 |
| Other SDK (428) | Volcengine (ByteDance) (Bytedance, [n.d.]) | 167 |
| | SenseTime (SenseTime, [n.d.]) | 153 |
| | Kwai (Kuaishou, [n.d.]) | 22 |
| | MindSpore (Huawei) (Huawei, [n.d.]) | 13 |
| | Huya (HUYA, [n.d.]) | 7 |
| | Meishe (Meishe, [n.d.]) | 4 |
| | Unknown | 62 |

Table 2. Statistics on on-device DL inference frameworks. The last column counts the unique models using the corresponding framework.

We first divide these unique models into two categories according to the DL framework. One category uses open-source DL frameworks, e.g., TFLite; the other uses frameworks developed for private or licensed use by commercial companies. Among all the in-App models using open-source frameworks, those using the TFLite and TensorFlow frameworks account for 67.5% of the total.

Of all the models using private frameworks, those utilizing frameworks provided by Volcengine and SenseTime account for 74.8%. Although the number of these models is large, their functionalities are similar: about 80% relate to the feature detection of faces and bodies, e.g., face landmark detection, pose detection, and face detection. The reason for the large number of such models is that different iApps use different versions of the SDKs provided by these companies, and in-App models with the same functionality are updated along with the SDK.

We also count the model sizes across DL frameworks. The median and mean sizes of the in-App models are 396 KB and 1,270 KB, respectively, and about 90% of the models are smaller than 3 MB.

6.2. Code Reconstruction Evaluation

| Model Task | Count | Mean (KB) | Median (KB) |
| --- | --- | --- | --- |
| Style Transformation | 54 | 1059.9 | 530.8 |
| Classification | 11 | 19484.5 | 8719.2 |
| Object Detection | 11 | 8063.2 | 6737.2 |
| Super Resolution | 8 | 2309.4 | 1974 |
| Semantic Segmentation | 7 | 6796.1 | 2714.9 |
| OCR | 3 | 4435.0 | 2134.8 |
| Text Detection | 3 | 4905.3 | 2711.3 |
| Pose Estimation | 1 | 2304.6 | 2304.6 |
| Depth Estimation | 1 | 252.2 | 252.2 |
| Face Comparison | 1 | 89157.7 | 89157.7 |

Table 3. Statistics on the selected 100 unique DL models by task. The second column counts the number of models for each task. The third and fourth columns show the mean and median model size (KB).

We select 100 unique DL models built with the two widely-used DL frameworks, i.e., TensorFlow and TFLite, to evaluate the effectiveness of the reconstruction techniques. We report the evaluation results of the code reconstruction here; the effectiveness of the model reconstruction is evaluated through the robustness assessment in the next part.

6.2.1. Statistical Results

We first count the selected models by task in Table 3. The selected models perform 10 kinds of inference tasks. The largest number of models perform style transformation, which is mainly used for image editing and beautification.

We perform ASTM's code and model reconstruction on a desktop PC with an AMD Ryzen 9 5900X CPU and 128 GB of DDR4 memory. The mean and median reconstruction times are 97 seconds and 17 seconds; 60% of the reconstructions finish within 25 seconds, and 90% within 223 seconds.

We perform a study on the sliced processing code and find that: (1) IO processing mainly focuses on data type conversion (e.g., image to array), data size adjustment, data cropping, data normalization, etc.; 58% of the iApps resize the input image to less than 512×512 px. (2) Almost all iApps use a group of loop statements to process the image pixel by pixel. (3) The mean and median LoC of the sliced code are 1,971 and 274, and 60% of the reconstructed code is under 308 LoC.
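Putting these observations together, a typical reconstructed pre-processing routine might look like the following hedged sketch (the resize target and normalization constants are illustrative, not taken from any specific iApp):

```python
import numpy as np
from PIL import Image

def preprocess(path, size=(224, 224)):
    img = Image.open(path).convert("RGB").resize(size)   # preset resize
    pixels = np.asarray(img, dtype=np.float32)
    out = np.empty((1, size[1], size[0], 3), dtype=np.float32)
    for y in range(size[1]):                  # per-pixel loops, as in most iApps
        for x in range(size[0]):
            for c in range(3):
                out[0, y, x, c] = (pixels[y, x, c] - 127.5) / 127.5  # normalize
    return out
```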

6.2.2. Representative Cases

Here we perform three case studies to show the effectiveness of the code reconstruction.

Case 1. Bundle ID: uk.tensorzoom. This DL model carries out image super-resolution. The post-processing code is responsible for converting the iApp's inference result to an image (Figure 5). The inference result is a float array in which each float represents one channel of one pixel (line 2). The post-processing code uses an int (line 5) to represent a pixel with ARGB channels, where each channel is represented by 8 bits and its value ranges from 0 to 255. To prevent numerical overflow when converting the results, the post-processing code performs a min-max normalization (lines 12–19) on the inference results.

Figure 5. Part of the reconstructed code of Case 1.
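Since Figure 5 is an image in the original, here is a hedged Python rendering of the described post-processing (the function and variable names are ours):

```python
import numpy as np

def to_argb(pred):                     # pred: (H, W, 3) float array from the model
    lo, hi = pred.min(), pred.max()
    norm = (pred - lo) / (hi - lo + 1e-8)          # min-max normalization
    rgb = (norm * 255).astype(np.uint32)
    a = np.full(pred.shape[:2], 255, dtype=np.uint32)
    # Pack the 8-bit A, R, G, B channels into one int per pixel.
    return (a << 24) | (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]
```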

Case 2. Bundle ID: ru.photostrana.mobile. This DL model carries out face segmentation. The pre-processing code is responsible for converting the input image to a float array (Figure 6). This procedure involves separating the RGB channels of each pixel (lines 12–13), normalizing the RGB values (lines 14–15), and putting them into the array in a specific order (line 16). We find that most iApps store the input in height×width×channel order, while this iApp stores the input in channel×width×height order. If the preset channel order is not followed, the inference result will be wrong.

Figure 6. Part of the reconstructed code of Case 2.
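Figure 6 is likewise an image; a hedged sketch of the described pre-processing, which stores the input in channel×width×height order (names and constants are illustrative):

```python
import numpy as np

def preprocess_chw(img):               # img: (H, W, 3) uint8 array
    h, w, _ = img.shape
    out = np.empty((3, w, h), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            for c in range(3):         # separate the RGB channels of each pixel
                out[c, x, y] = img[y, x, c] / 255.0   # normalize to [0, 1]
    return out
```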

Case 3. Bundle ID: com.blink.academy.nomo. This DL model carries out 21-class semantic segmentation. We present the post-processing code in Figure 7. As shown in lines 13–14, the iApp only visualizes a certain class (the 15th class represents the person); the other segmentation results are discarded by the post-processing code.

Figure 7. Part of the reconstructed code of Case 3.

6.3. Robustness Assessment

Figure 8. The robustness testing results of 100 unique in-App models under PGD attacks, reported by task type. ε is the attack budget; the testing iteration number is the number of attack iterations; the model count is the number of models detected to have robustness issues.
Figure 9. The robustness issues detected by ASTM.

6.3.1. Testing Methods

We utilize the PGD (Projected Gradient Descent) attack (Madry et al., 2017) to test the in-App models' robustness. PGD is an iterative attack method that can search for a subtle perturbation to fool the DL model. Here, we formulate the PGD attack. We denote the ground truth label of the input $x$ as $y$, and the adversarial example at step $t+1$ as $x_{adv}^{(t+1)}$.

(1)  $x_{adv}^{(t+1)} = \mathrm{Proj}_{x+\mathcal{S}}\left(x_{adv}^{(t)} + \alpha\,\mathrm{sgn}\left(\nabla_x \mathcal{L}(\mathcal{C}(x_{adv}^{(t)}), y)\right)\right)$

where $\mathcal{L}$ is the loss function, $\mathcal{C}(\cdot)$ represents the model output on $x_{adv}^{(t)}$, and $\mathrm{sgn}(\cdot)$ is the sign function. $\mathrm{Proj}$ is a projection function that projects its input onto the hypersphere $x+\mathcal{S}$.

Assuming the original input is $x$, we can represent the final adversarial example as:

(2)  $x_{adv}^{*} = x + \varepsilon$

where $\varepsilon$ is the attack budget, i.e., the manipulation limit for the input. We use MSE (Mean Squared Error) as the loss function to generate the adversarial examples for the different tasks.
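To make Equation (1) concrete, here is a minimal sketch of the untargeted PGD loop with the MSE loss, written against TensorFlow; the function name and default hyperparameters are illustrative.

```python
import tensorflow as tf

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    x_adv = tf.identity(x)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = tf.reduce_mean(tf.square(model(x_adv) - y))  # MSE loss
        grad = tape.gradient(loss, x_adv)
        x_adv = x_adv + alpha * tf.sign(grad)               # ascent step
        x_adv = tf.clip_by_value(x_adv, x - eps, x + eps)   # Proj onto x + S
        x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)           # stay a valid image
    return x_adv
```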

We collect a dataset for each in-App model, each consisting of 20 images. Some datasets are subsets of open-source datasets, e.g., COCO (Lin et al., 2014), VOC (Everingham et al., 2010), and CelebA (Liu et al., 2015); the other datasets are prepared by ourselves through an image search engine. We release sample datasets on the anonymous website https://github.com/anonymous4896/public_data.

6.3.2. Metrics

We measure the in-App models' robustness through the following metrics for different kinds of tasks. The larger the metric value, the higher the attack success rate and the worse the robustness of the in-App model. We propose four metrics according to how failure cases are counted for the different kinds of tasks.

Type-1 Task. The first type of task performs classification, OCR, and face comparison. The metric for type-1 tasks is

$\frac{1}{N}\sum_{i=1}^{N} 1[y_i \neq y_i']$

It measures the fraction of failure cases. $N$ is the number of testing samples. $y_i$ denotes the inferred label of $x_i$, i.e., $y_i = f(x_i)$; $y_i'$ denotes the inferred label of $x_i + \varepsilon$, i.e., $y_i' = f(x_i + \varepsilon)$.

In our context, if the value of the metric is larger than 0.6, we say the model is detected with robustness issues; 0.6 means that at least 60% of the testing inputs are misclassified.

Type-2 Task. The second type of task performs object detection and text detection. For the bounding boxes $B = (b_1, b_2, ..., b_n)$ of all testing samples, we calculate

$\frac{1}{n}\sum_{i=1}^{n} 1[c(b_i) \neq c(b_i')]$

as the metric for type-2 tasks. $c(b)$ denotes the classification result of the bounding box $b$. $b$ denotes a bounding box detected on the original input, and $b'$ denotes the corresponding bounding box detected on the attacked input. Note that the location of $b_i$ may not be the same as that of $b_i'$.

In our context, if the value of the metric is larger than 0.6, we say the model is detected with robustness issues; 0.6 means that at least 60% of the bounding boxes are misclassified.

Type-3 Task. The third type of task performs semantic segmentation, depth estimation, and pose estimation. For an image of size M×N, the attack effect is measured with

$\frac{1}{MN}\sum_{i,j} 1[s_{ij} \neq s_{ij}']$

$p_{ij}$ and $p_{ij}'$ denote the pixel at the $i$-th row and $j$-th column of the original image and the perturbed image, respectively. $s_{ij}$ is the semantic label of pixel $p_{ij}$, and $s_{ij}'$ is the semantic label of the adversarial pixel $p_{ij}'$.

In our context, if the value of the metric is larger than 0.6, we say the model is detected with robustness issues; 0.6 means that at least 60% of the pixels are wrongly labeled.

Type-4 Task. The fourth type of task performs style transformation and super resolution. The structural similarity index (SSIM) measures the perceived quality of the attacked results compared to the original results. We use

$1 - \mathrm{SSIM}(f(x), f(x+\varepsilon))$

as the metric to measure the decrease of the SSIM value. $f(\cdot)$ denotes the task; $f(x)$ and $f(x+\varepsilon)$ are the outputs of a type-4 task on inputs $x$ and $x+\varepsilon$.

In our context, if the value of the metric is larger than 0.6, we say the model is detected with robustness issues; 0.6 means that the attack decreases the SSIM value by at least 0.6.
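Hedged sketches of the Type-1 and Type-4 metrics follow (the Type-2 and Type-3 metrics are computed the same way over boxes and pixels); skimage is assumed for the SSIM computation.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def type1_metric(y_clean, y_adv):
    """Fraction of samples whose inferred label flips under attack."""
    return float(np.mean(np.asarray(y_clean) != np.asarray(y_adv)))

def type4_metric(out_clean, out_adv):
    """Drop in SSIM between clean and attacked outputs (images in [0, 1])."""
    return 1.0 - ssim(out_clean, out_adv, channel_axis=-1, data_range=1.0)

# Three of five labels flip -> 0.6, right at the robustness-issue threshold.
print(type1_metric([1, 2, 3, 4, 5], [1, 0, 3, 0, 0]))  # -> 0.6
```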

6.3.3. Testing Results

For each type of task, we experiment with five values of $\varepsilon$ (in Equation 2) and two iteration numbers. Figure 8 shows the testing results under the different experimental settings. We count the models that have robustness issues by task.

When performing the testing through PGD configured with $\varepsilon$=8 and iteration number=10, robustness issues are detected in 56% of models (56 out of 100). As the value of $\varepsilon$ increases from 8 to 16, the number of models detected with robustness issues increases from 56 to 78. When we double the testing iterations (from 10 to 20), the number of models detected with robustness issues remains relatively stable, increasing by only about 14.2%. We show some representative attack results in Figure 9.
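For reference, below is a minimal sketch of PGD (Madry et al., 2017) in the spirit of this testing setup, assuming a PyTorch classification model on inputs scaled to [0, 1] (the cross-entropy loss and the step size `alpha` are our assumptions; $\varepsilon$=8 corresponds to 8/255 after scaling):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, iters=10):
    """Projected gradient descent: step along the sign of the gradient and
    project back into the L-infinity ball of radius eps around x."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project into eps-ball
        x_adv = torch.clamp(x_adv, 0.0, 1.0)           # keep a valid image
    return x_adv.detach()
```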

6.4. Detected Physical Attacks against iApps

Figure 10. The physical attack results of case 1.
Figure 11. The physical attack results of case 2.

We use ASTM to detect security issues in three iApps by performing physical adversarial attacks.

Case 1. Bundle ID: com.seefoodtechnologies.nothotdog.

This iApp recognizes whether there is a hot dog in the environment. We perform a physical attack on the in-App model: if a tablecloth with a special pattern is placed on the table, the iApp recognizes a knife on the table as a hot dog. When visually impaired people use the iApp to identify hot dogs, they face serious safety risks by mistakenly picking up a knife.

We use ASTM to reconstruct the BP-enabled DL model and processing code of the hot dog detection module. We then take a photo of a knife, segment the knife out, and generate 50 knife images of different sizes and rotation angles. Next, we search for a specific grid background using projected gradient descent so that, when we put the knife images on the grid background, the composited image is recognized as a hot dog. The loss function used to compute the gradient is the L1 norm between the hot dog’s label and the recognition result of the composited image. We perform 100 rounds of projected gradient descent.
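A minimal sketch of this background search, assuming a PyTorch model with a 224x224 input, `knives` as a list of pre-rendered (image, mask) tensor pairs, and `target` as the hot-dog output vector (these names, and the step size, are our assumptions):

```python
import torch

def search_tablecloth(model, knives, target, rounds=100, alpha=2/255):
    """PGD-style optimization of a background pattern so that every
    composited knife image is recognized as a hot dog (L1 loss)."""
    bg = torch.rand(1, 3, 224, 224)  # assumed model input size
    for _ in range(rounds):
        bg.requires_grad_(True)
        loss = sum((model(m * k + (1 - m) * bg) - target).abs().sum()
                   for k, m in knives)  # L1 distance to the hot-dog label
        grad = torch.autograd.grad(loss, bg)[0]
        bg = (bg.detach() - alpha * grad.sign()).clamp(0.0, 1.0)
    return bg
```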

The attack results are shown in Figure 10. We show the recognition results before and after putting the knife on the searched grid tablecloth. The searched grid tablecloth raises the confidence of viewing the knife as a hot dog from 0.006 to 0.97. As a reference, the confidence of a real hot dog is 0.99.

Case 2. Bundle ID: com.sogou.map.android.maps.

This iApp can be used for road navigation. Using ASTM, we find that the lane detection model in the iApp fails when a paper tape is placed at a special location on the road. The failure of lane detection can lead to serious safety problems, such as vehicles suddenly stopping or driving out of their lanes.

We first use ASTM to reconstruct the BP-enabled DL model and processing code of the lane detection module. Then we choose a road and take 20 photos of it with different camera poses. Next, we search for the length and position of a paper tape that can fail the lane detection model, using projected gradient descent. The loss function used to compute the gradient is the L1 norm between the detection results and the attack target. The attack target is to make the paper tape be recognized as a lane while making the actual lane undetectable. We perform 50 rounds of the search over the paper tape’s length and position.
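Because the tape’s length and position are geometric parameters rather than pixels, this search needs a differentiable way to place the tape into the photos. We do not prescribe a particular renderer here; the sketch below uses a soft sigmoid mask as one possible stand-in (the initial values, step size, tape width, and orientation are all our assumptions):

```python
import torch

def render_tape(photo, pos, length, width=0.02):
    """Differentiably paint a white horizontal tape onto a road photo
    using a soft (sigmoid) mask, keeping pos and length optimizable."""
    _, _, H, W = photo.shape
    ys = torch.linspace(0, 1, H).view(1, 1, H, 1)
    xs = torch.linspace(0, 1, W).view(1, 1, 1, W)
    soft = lambda t: torch.sigmoid(200 * t)  # sharp but differentiable edge
    mask = (soft(ys - (pos[1] - width)) * soft((pos[1] + width) - ys)
            * soft(xs - pos[0]) * soft((pos[0] + length) - xs))
    return (1 - mask) * photo + mask  # tape rendered as white

def search_tape(model, photos, target, rounds=50, alpha=0.01):
    """PGD-style search over the tape's position and length across the
    road photos, with an L1 loss toward the attack target."""
    pos = torch.tensor([0.4, 0.6])
    length = torch.tensor(0.2)
    for _ in range(rounds):
        pos.requires_grad_(True)
        length.requires_grad_(True)
        loss = sum((model(render_tape(p, pos, length)) - target).abs().sum()
                   for p in photos)
        g_pos, g_len = torch.autograd.grad(loss, [pos, length])
        pos = (pos.detach() - alpha * g_pos.sign()).clamp(0.0, 1.0)
        length = (length.detach() - alpha * g_len.sign()).clamp(0.05, 0.5)
    return pos, length
```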

We show the security issues detected by ASTM in Figure 11. We visualize the lane detection results on the image. The first row shows the original view and three attacks; the second row shows the corresponding attack results. The paper tape changes the lane detection results: these attacks cause a lane to be missed, or its type to be incorrectly identified, and bring serious security risks.

Case 3. Bundle ID: co.mensajerosurbanos.app.mensajero.

This iApp can identify receipts. Using ASTM, we find that if a receipt uses a special background, it can evade detection without affecting the user’s reading. The iApp’s competitor could cooperate with the company that prints the receipts to print them with this specific background, degrading the receipt detection performance and causing the iApp to lose users.

Figure 12. The physical attack results of case 3.

We use ASTM to reconstruct the BP-enabled DL model and processing code of the receipt recognition module. We then search for a background used to print the receipt that can fool the receipt recognition. To reduce the impact of the background on the readability of the receipt content, we form the background from crossed dashed lines. We then search for the number of dashed lines, their rotation angle, and their grayscale using projected gradient descent. The loss function used to compute the gradient is the L1 norm between the receipt’s label and the recognition results of the receipt with the searched background. We perform 50 rounds of projected gradient descent.

The attack results are shown in Figure 12. We show the receipt recognition results before and after adding the searched background. The searched background makes the confidence of the receipt recognition drop from 0.99 to 0.096. As a reference, the confidence of a receipt with a random background is 0.99.

7. Conclusion

This work proposes ASTM, which enables auto testing of iApps for App marketplaces through two novel reconstruction techniques. The experimental results show that ASTM can successfully reconstruct runnable IO processing code and BP-enabled DL models from commercial iApps. With ASTM’s help, we perform a large-scale robustness assessment of in-App models. ASTM also detects three representative real-world attacks against iApps. We believe ASTM can further be used to find new attack surfaces lying in the coupling between code and DL models.

References

  • 360 ([n.d.]) [n.d.]. 360 App Store. https://ext.se.360.cn/.
  • anz ([n.d.]) [n.d.]. Anzhi Market. http://www.anzhi.com/.
  • bai ([n.d.]) [n.d.]. Baidu App Store. https://shouji.baidu.com/.
  • onn ([n.d.]) [n.d.]. ONNX. https://onnx.ai/.
  • mi_([n.d.]) [n.d.]. Xiaomi App Store. https://app.mi.com/.
  • tfl (2021) 2021. TensorFlow Lite. https://www.tensorflow.org/lite
  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
  • Almeida et al. (2021) Mario Almeida, Stefanos Laskaridis, Abhinav Mehrotra, Lukasz Dudziak, Ilias Leontiadis, and Nicholas D Lane. 2021. Smart at what cost? Characterising Mobile Deep Neural Networks in the wild. In Proceedings of the 21st ACM Internet Measurement Conference.
  • APKPure ([n.d.]) APKPure. [n.d.]. Download APK free online downloader — APKPure.com. https://apkpure.com/.
  • Azim et al. (2019) Tanzirul Azim, Arash Alavi, Iulian Neamtiu, and Rajiv Gupta. 2019. Dynamic slicing for android. In International Conference on Software Engineering (ICSE).
  • Berthelier et al. (2021) Anthony Berthelier, Thierry Chateau, Stefan Duffner, Christophe Garcia, and Christophe Blanc. 2021. Deep model compression and architecture optimization for embedded systems: A survey. Journal of Signal Processing Systems (2021).
  • Bhardwaj et al. (2019) Ketan Bhardwaj, Matt Saunders, Nikita Juneja, and Ada Gavrilovska. 2019. Serving mobile apps: A slice at a time. In Proceedings of the Fourteenth EuroSys Conference (EuroSys).
  • Brendel et al. (2017) Wieland Brendel, Jonas Rauber, and Matthias Bethge. 2017. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248 (2017).
  • Bytedance ([n.d.]) Bytedance. [n.d.]. Volcengine. https://www.volcengine.com/.
  • Chen et al. (2018) Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. 2018. Ead: elastic-net attacks to deep neural networks via adversarial examples. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Dong et al. (2018) Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. 2018. Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition. 9185–9193.
  • Duan et al. (2018) Yue Duan, Mu Zhang, Abhishek Vasisht Bhaskar, Heng Yin, Xiaorui Pan, Tongxin Li, Xueqiang Wang, and XiaoFeng Wang. 2018. Things You May Not Know About Android (Un) Packers: A Systematic Study based on Whole-System Emulation.. In NDSS.
  • Everingham et al. (2010) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision (2010).
  • Eykholt et al. (2018) Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. 2018. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1625–1634.
  • Gao et al. (2020) Yansong Gao, Bao Gia Doan, Zhi Zhang, Siqi Ma, Jiliang Zhang, Anmin Fu, Surya Nepal, and Hyoungshick Kim. 2020. Backdoor attacks and countermeasures on deep learning: A comprehensive review. arXiv preprint arXiv:2007.10760 (2020).
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
  • Gou et al. (2021) Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. 2021. Knowledge distillation: A survey. International Journal of Computer Vision (2021).
  • Huang et al. (2020) Xiaowei Huang, Daniel Kroening, Wenjie Ruan, James Sharp, Youcheng Sun, Emese Thamo, Min Wu, and Xinping Yi. 2020. A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability. Computer Science Review (2020).
  • Huang et al. (2021) Yujin Huang, Han Hu, and Chunyang Chen. 2021. Robustness of on-device models: Adversarial attack to deep learning models on android apps. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE.
  • Huawei ([n.d.]) Huawei. [n.d.]. MindSpore. https://www.mindspore.cn/.
  • HUYA ([n.d.]) HUYA. [n.d.]. huya. https://www.huya.com/.
  • Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).
  • Jiang et al. (2020) Xiaotang Jiang, Huan Wang, Yiliu Chen, Ziqi Wu, Lichuan Wang, Bin Zou, Yafeng Yang, Zongyang Cui, Yu Cai, Tianhang Yu, Chengfei Lv, and Zhihua Wu. 2020. MNN: A Universal and Efficient Inference Engine. In MLSys.
  • Kamath et al. (2019) Uday Kamath, John Liu, and James Whitaker. 2019. Deep learning for NLP and speech recognition. Vol. 84. Springer.
  • Kuaishou ([n.d.]) Kuaishou. [n.d.]. Kwai, Fantastic Social Video Network. https://www.kwai.com/.
  • Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, Samy Bengio, et al. 2016. Adversarial examples in the physical world.
  • Lam et al. (2011) Patrick Lam, Eric Bodden, Ondřej Lhoták, and Laurie Hendren. 2011. The Soot framework for Java program analysis: a retrospective. In Cetus Users and Compiler Infrastructure Workshop (CETUS).
  • Li et al. (2021) Xuhong Li, Haoyi Xiong, Xingjian Li, Xuanyu Wu, Xiao Zhang, Ji Liu, Jiang Bian, and Dejing Dou. 2021. Interpretable deep learning: Interpretation, interpretability, trustworthiness, and beyond. arXiv preprint arXiv:2103.10689 (2021).
  • Liang et al. (2021) Tailin Liang, John Glossner, Lei Wang, Shaobo Shi, and Xiaotong Zhang. 2021. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing (2021).
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV).
  • Liu et al. (2020) Yu Liu, Cheng Chen, Ru Zhang, Tingting Qin, Xiang Ji, Haoxiang Lin, and Mao Yang. 2020. Enhancing the interoperability between deep learning frameworks by model conversion. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM.
  • Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV).
  • Madry et al. (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017).
  • Meishe ([n.d.]) Meishe. [n.d.]. Meicam. https://www.meishesdk.com/.
  • Menghani (2021) Gaurav Menghani. 2021. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. arXiv preprint arXiv:2106.08962 (2021).
  • Pauck and Wehrheim (2021) Felix Pauck and Heike Wehrheim. 2021. Jicer: Simplifying Cooperative Android App Analysis Tasks. In 2021 IEEE 21st International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE.
  • SenseTime ([n.d.]) SenseTime. [n.d.]. SenseTime. https://www.sensetime.com/.
  • Singh et al. (2017) Shashi Pal Singh, Ajai Kumar, Hemant Darbari, Lenali Singh, Anshika Rastogi, and Shikha Jain. 2017. Machine translation using deep learning: An overview. In 2017 international conference on computer, communications and electronics (comptelix). IEEE, 162–167.
  • Sridharan et al. (2007) Manu Sridharan, Stephen J Fink, and Rastislav Bodik. 2007. Thin slicing. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation. 112–122.
  • Sun et al. (2021) Zhichuang Sun, Ruimin Sun, Long Lu, and Alan Mislove. 2021. Mind your weight (s): A large-scale study on insufficient machine learning model protection in mobile apps. In 30th USENIX Security Symposium (USENIX Security).
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
  • Tencent ([n.d.]a) Tencent. [n.d.]a. ncnn. https://github.com/Tencent/ncnn.
  • Tencent ([n.d.]b) Tencent. [n.d.]b. TNN. https://github.com/Tencent/TNN.
  • Voulodimos et al. (2018) Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, and Eftychios Protopapadakis. 2018. Deep learning for computer vision: A brief review. Computational intelligence and neuroscience 2018 (2018).
  • Wang et al. (2019) Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. 2019. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE.
  • Xu et al. (2019) Mengwei Xu, Jiawei Liu, Yuanqiang Liu, Felix Xiaozhu Lin, Yunxin Liu, and Xuanzhe Liu. 2019. A first look at deep learning apps on smartphones. In Proceedings of the Web Conference.
  • Xue et al. (2020) Lei Xue, Hao Zhou, Xiapu Luo, Le Yu, Dinghao Wu, Yajin Zhou, and Xiaobo Ma. 2020. Packergrind: An adaptive unpacking system for android apps. IEEE Transactions on Software Engineering (TSE) (2020).
  • Xue et al. (2021) Lei Xue, Hao Zhou, Xiapu Luo, Yajin Zhou, Yang Shi, Guofei Gu, Fengwei Zhang, and Man Ho Au. 2021. Happer: Unpacking Android apps via a hardware-assisted approach. In 2021 IEEE Symposium on Security and Privacy (S&P).
  • Zhang et al. (2022) Qiyang Zhang, Xiang Li, Xiangying Che, Xiao Ma, Ao Zhou, Mengwei Xu, Shangguang Wang, Yun Ma, and Xuanzhe Liu. 2022. A Comprehensive Benchmark of Deep Learning Libraries on Mobile Devices. arXiv preprint arXiv:2202.06512 (2022).
  • Zhang and Zhu (2018) Quan-shi Zhang and Song-Chun Zhu. 2018. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering (2018).