Fine-Grained Assertion-Based Test Selection
Abstract
For large software applications, running the whole test suite after each code change is time- and resource-intensive. Regression test selection techniques aim to reduce test execution time by selecting only the tests that are affected by code changes. However, existing techniques select test entities at coarse granularity levels, such as the test class, which leads to imprecise selection and the execution of unaffected tests. We propose a novel approach that increases selection precision by analyzing test code at the statement level and treating test assertions as the unit of selection. We implement our fine-grained test selection approach in a tool called Selertion and evaluate it by comparing it against two state-of-the-art test selection techniques on 11 open-source subjects. Our results show that Selertion increases selection precision for all the subjects. Our test selection reduces, on average, 63% of the overall test time, making regression testing up to 23% faster than the other techniques. Our results also indicate that subjects with longer test execution time benefit more from our fine-grained selection technique.
Index Terms:
regression test selection, assertion slicing, test assertions
I Introduction
Regression testing has been a vital process for ensuring that no existing functionality is broken as software evolves. However, even for large companies that allocate enormous resources to testing [1, 2, 3], it can be costly to rerun the whole test suite after every code change. For example, Elbaum et al. [4] report that it costs seven weeks to execute the entire test suite for one of their industrial partner’s products. In addition to companies that run tests in continuous integration environments, individual developers often find themselves in a situation where they need to run regression tests multiple times a day before committing their changes to the codebase [5].
As a result, test suite optimization techniques that reduce test runtime, such as regression test selection, minimization, and prioritization, have become an integral part of software research and development [1, 6, 7]. Regression test selection (RTS) [8, 9] in particular aims to run only the tests that are affected by code changes. RTS techniques aim to be (1) safe, by selecting all tests affected by code changes, and (2) precise, by selecting as few unaffected tests as possible.
Current techniques [10, 11, 9, 12] analyze code at different levels of granularity. Coarse-grained RTS techniques that analyze code at the file level [9] incur less analysis overhead but select more tests, i.e., they are less precise than finer-grained RTS techniques that analyze code at the basic block or method level [10, 11]. However, existing RTS techniques typically focus on the granularity of the production code (code under test), while selecting test entities at a relatively coarse granularity such as the test-class level. This can result in safe but imprecise test selection and, consequently, longer test execution times. The root cause is that, in practice, only a portion of the test code within the selected test classes is impacted by code changes. For instance, when the single production method negate is changed in the Complex class of Apache Commons Math [13], coarse-grained RTS techniques [12, 9] select the whole test class ComplexTest with 138 test cases, although only seven of the 138 test cases are truly affected by the changed method. Furthermore, empirical studies [14] have shown that it is common practice for a single test case to contain multiple assertions; therefore, only specific assertions within a test case might be influenced by code changes, leaving others unaffected.
Our insight is that a finer-grained test code analysis can help to increase selection precision and accelerate test execution. As such, we focus on test methods and test assertions inside test methods, which are known to be correlated with a test suite’s fault detection ability [14]. We propose a novel test selection technique, called Selertion, which analyzes test entities at the fine-grained test assertion level. We treat each assertion inside a test method as the smallest unit of interest and slice the test method into executable fragments containing the targeted assertion and statements associated with it for selection. In comparison to coarse-grained RTS techniques, Selertion selects fewer tests more precisely, ultimately leading to a more efficient test execution process.
Our work offers the following significant contributions:
• The first regression test selection technique that operates at the finest granularity, treating test assertions within test methods as the fundamental units for selection.
• A method for statically slicing test methods into executable fragments for each test assertion.
• An implementation of our approach in a tool called Selertion, which supports Java projects with JUnit4 tests.
• An empirical evaluation of Selertion performed on 11 open-source projects and a comparison with two state-of-the-art RTS tools.
Our evaluation results reveal that Selertion selects 15.8% of tests, representing a precision improvement of 3–18% compared to existing state-of-the-art test selection techniques. Furthermore, Selertion significantly enhances efficiency by reducing, on average, 63% of the overall test execution time, resulting in a speed improvement of up to 23%.
II Motivating Example
To illustrate our insight, we use real test cases from the ComplexTest class of Apache Commons Math [13] as a motivating example. Figure 1 shows two test methods, testExp and testNegate, which both have multiple assertions (i.e., Lines 5, 6, 8 in testExp and Lines 15, 16 in testNegate). When the production method Complex.negate is changed, all 138 test cases in the ComplexTest class will be selected if the selection is at the test-class level. However, only seven (including the two shown in Figure 1) of the 138 test cases are truly affected by the changed production method. Moreover, when focusing on the assertions in each test case, the first two assertions in testExp at Lines 5 and 6 neither check Complex.negate’s behavior nor have a dependency on it; hence, it is not necessary to execute these two unaffected assertions. Meanwhile, if the two assertions in testNegate are assumed to fail because of the changes in Complex.negate, the execution will stop at the first failed assertion at Line 15, and the developer may overlook the second failed assertion at Line 16 when fixing the bug in Complex.negate. More time will then be consumed by executing the test suite again to reveal the second failure at Line 16.
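Since Figures 1 and 2 are not reproduced here, the following sketch approximates the two test methods; the concrete values, assertion forms, and line references in the comments are illustrative rather than the exact Commons Math code.

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;
import org.apache.commons.math3.complex.Complex;
import org.junit.Test;

public class ComplexTestSketch {

    @Test
    public void testExp() {
        Complex z = new Complex(3, 4);
        assertEquals(-13.12878, z.exp().getReal(), 1e-5);           // cf. Line 5: no dependency on negate
        assertEquals(-15.20078, z.exp().getImaginary(), 1e-5);      // cf. Line 6: no dependency on negate
        Complex iPi = Complex.I.multiply(new Complex(Math.PI, 0));  // cf. Line 7: defines iPi
        assertEquals(Complex.ONE.negate().getReal(),
                     iPi.exp().getReal(), 1e-10);                   // cf. Line 8: depends on negate and iPi
    }

    @Test
    public void testNegate() {
        Complex z = new Complex(3, 4);                               // cf. Line 14
        assertEquals(new Complex(-3, -4), z.negate());               // cf. Line 15: depends on negate
        assertTrue(Complex.NaN.negate().isNaN());                    // cf. Line 16: depends on negate
    }
}
```

In this sketch, only the last assertion of testExp and the two assertions of testNegate touch Complex.negate; the other two assertions are unaffected by a change to that method.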
Our fine-grained test selection approach divides a test method into smaller code fragments, called assertion slices, which serve as the smallest units for selection. Each assertion slice includes a test assertion and the previous statements associated with it. The assertion slices for a test method can be obtained by performing static intra-procedural backward slicing [15] with the variables of each assertion as the selected points of interest. Figure 2 presents the assertion slices for the two test methods testExp and testNegate of Figure 1 as an example. The grayed-out slices in Figure 2 are the ones that are selected when there are changes in the production method Complex.negate. Our fine-grained assertion-level selection technique selects only the assertions depending on the code changes and guarantees that all the selected assertions will be checked. Next, we present a more detailed description of our approach.
III Approach

Figure 3 depicts an overview of our approach, which is composed of four main components. Change computation detects code changes at the class and method level in the production code. Assertion slicing computes test assertion slices for fine-grained selection by analyzing test code. Code instrumentation selectively instruments and executes both production and test code to collect dependencies. Finally, test selection selects test entities based on the information collected from the previous components.
Using the terminology from related work [9, 12], the analysis (A) phase includes change computation, assertion slicing, and test selection; the collection (C) phase includes the code instrumentation and execution of instrumented subject; the execution (E) phase executes selected test entities.
III-A Change Computation
Input: P, P′: two program revisions
Output: C = C_c ∪ C_m, where C_c: class-level changes and C_m: method-level changes
Similar to the state-of-the-art RTS tools HyRTS [12] and Ekstazi [9], we detect code changes by comparing checksums. However, instead of tracing bytecode files, we compute checksums for source code files directly. We perform source code analysis because our fine-grained technique requires statement-level computation on the test code to collect dependencies, and statement-level analysis of bytecode would be indirect and could increase analysis complexity.
The procedure to compute code changes is summarized in Algorithm 1. We trace code changes at multiple levels (i.e., method and class level). Given two versions of the code, we first compare the file-level checksums to detect the lists of added, deleted, and changed files, denoted by AF, DF, and CF, respectively (Line 1). Then, for each file, we create class models to handle circumstances where one single file contains multiple classes (e.g., nested classes), although in most cases there is one class per file (Line 2). We also handle enum types; since enums can be treated as a special kind of class, we use class to represent both in Algorithm 1. Intuitively, newly added classes (AC) and deleted classes (DC) are class-level changes, which are collected in the class-level change set C_c (Line 3). Changed classes (CC) may include irrelevant changes that do not impact the code, such as changes in comments, Javadoc, or indentation. Such changes are excluded in the next step of the algorithm (Lines 4-12). For each changed class model, we separate the class body components into the following categories: class head (CH), methods (M), and other (OT). CH includes the class’s name and superclass or interface information. Methods (M) include all method declarations, including instance methods and constructors. Other (OT) includes the remaining body declarations, such as fields, except methods and classes. We store the smart checksums for each M, CH, and OT separately.
The reason to maintain a separate category OT is that, unlike bytecode computation, in which field changes are reflected in method-level changes [12], we need to consider changed fields explicitly in source code analysis. To reduce the analysis overhead, we handle field changes together and transfer them into a class-level change, because a field may be used by multiple methods. Therefore, for each changed class, if the OT or CH checksum varies, the corresponding class is marked as changed (Lines 5-6). If there are only method checksum changes without an OT or CH change, the added (AM), deleted (DM), and changed methods (CM) are stored as method-level changes in C_m (Lines 7-10). We also compute lookup changes [10, 12] for added and deleted methods based on the inheritance information (Line 9) for the sake of safety in our analysis. In the initial run, because no previous checksums are available, every class is treated as a new class with class-level changes and the whole test suite is executed once.
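To make the checksum comparison and the transfer of field-level changes into class-level changes concrete, the following sketch outlines the comparison step of Algorithm 1. The ClassModel and Changes types, the checksum fields, and the method names are illustrative assumptions, not Selertion’s actual implementation.

```java
import java.util.*;

// Hedged sketch of Algorithm 1's comparison step: class- and method-level change
// computation via source-code checksums.
final class ChangeComputation {

    static Changes compute(Map<String, ClassModel> oldClasses,
                           Map<String, ClassModel> newClasses) {
        Changes changes = new Changes();
        // Added or deleted classes are class-level changes (AC, DC).
        for (String name : newClasses.keySet()) {
            if (!oldClasses.containsKey(name)) changes.classLevel.add(name);
        }
        for (String name : oldClasses.keySet()) {
            if (!newClasses.containsKey(name)) changes.classLevel.add(name);
        }
        // Changed classes (CC): compare class head, "other" declarations, and methods.
        for (String name : newClasses.keySet()) {
            ClassModel oldC = oldClasses.get(name), newC = newClasses.get(name);
            if (oldC == null) continue;
            if (!oldC.classHeadChecksum.equals(newC.classHeadChecksum)
                    || !oldC.otherChecksum.equals(newC.otherChecksum)) {
                // CH or OT changed (e.g., fields): transfer to a class-level change.
                changes.classLevel.add(name);
            } else {
                // Only method bodies changed: record added/deleted/changed methods.
                for (Map.Entry<String, String> m : newC.methodChecksums.entrySet()) {
                    String oldSum = oldC.methodChecksums.get(m.getKey());
                    if (oldSum == null || !oldSum.equals(m.getValue())) {
                        changes.methodLevel.add(m.getKey());
                    }
                }
                for (String sig : oldC.methodChecksums.keySet()) {
                    if (!newC.methodChecksums.containsKey(sig)) changes.methodLevel.add(sig);
                }
            }
        }
        return changes;
    }

    static final class Changes {
        final Set<String> classLevel = new HashSet<>();   // fully qualified class names (C_c)
        final Set<String> methodLevel = new HashSet<>();  // fully qualified method signatures (C_m)
    }

    static final class ClassModel {
        String classHeadChecksum = "";                       // name, superclass, interfaces (CH)
        String otherChecksum = "";                           // fields and other non-method declarations (OT)
        Map<String, String> methodChecksums = new HashMap<>(); // signature -> smart checksum (M)
    }
}
```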
III-B Test Assertion Slicing
To perform fine-grained selection at the assertion level, we cannot merely select the test assertion statement, because there could be other test statements associated with the assertion that make it executable. For instance, in Figure 1, the test statement at Line 7 is associated with the assertion at Line 8, because the variable iPi is defined at Line 7 and used at Line 8. In this work, we call a code fragment that includes the targeted test assertion and its dependent statements an assertion slice.
Definition 1 (Assertion Slice)
Let S = ⟨s_1, …, s_n⟩ be the list of statements in the test method m. An assertion slice for an assertion statement s_a ∈ S encompasses s_a and its dependent statements s_d ∈ S, where d < a.
We use A to denote the total set of assertion slices for the test suite. A(m) denotes the set of slices for the test method m. A(s) denotes the set of slices that contain the test statement s, where A(s) ⊆ A(m) for the method m containing s. An assertion is also a particular type of test statement.
Using Figure 2 as an example, the assertion slices of the test method testExp are shown in the first three blocks. For the test statement s at Line 3, A(s) contains one assertion slice, shown in the first block. To obtain the assertion slices, we perform static intra-procedural backward slicing [15] for each test method based on the program dependence graph (PDG), using each assertion and its used variables as the slicing criterion [16, 17, 18]. A PDG represents a procedure (i.e., the body of a test method) as a graph whose nodes are statements or predicate expressions and whose edges are control or data dependences. Generally speaking, test code has a simpler structure than production code: most test cases do not include conditionals such as if-else or for-loops. To reduce the analysis overhead, we opted to slice only test cases with explicit assertions and without conditional statements. Therefore, we apply data dependency analysis to create the PDG by capturing the variables used and defined by each test statement, including the assertions. The global variables and methods annotated with @Before or @BeforeClass in test classes serve a purpose in points-to/alias analysis [17] and should always be included when executing the selected test entities; however, they do not need to be part of the data dependency analysis for individual test cases.
Since we target an object-oriented language (Java), we pay particular attention to object variable aliases in our analysis. For instance, when an object variable objA in the assertion statement is selected as the slicing criterion and its reference is passed to a method or constructor of another object variable objB, objA’s state may be changed by objB’s methods [17, 18]. If objA is assigned to a field of objB, then when objA changes, objB’s state can also change. To make sure the slices do not miss any statements as data dependencies, we make the conservative assumption that method calls change the states of the caller object and of the objects passed as arguments; that is, a method invocation statement is associated with the targeted assertion if a variable in the assertion is used by the invocation as the caller object or is passed to it as an argument. Because assertions are a particular type of statement, if the targeted assertion has such a data dependency on a prior assertion, that prior assertion will also be included in the slice for the targeted assertion.
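The following is a minimal sketch of this data-dependence-only backward slicing, assuming each test statement has already been summarized into def/use sets (with call receivers and arguments conservatively treated as defined, per the assumption above). The Stmt type and all names are illustrative, not Selertion’s actual code.

```java
import java.util.*;

// Hedged sketch of intra-procedural backward slicing over def/use sets.
final class AssertionSlicer {

    static final class Stmt {
        final int line;
        final Set<String> def;   // variables (re)defined, incl. receiver/argument objects
        final Set<String> use;   // variables read
        final boolean isAssertion;
        Stmt(int line, Set<String> def, Set<String> use, boolean isAssertion) {
            this.line = line; this.def = def; this.use = use; this.isAssertion = isAssertion;
        }
    }

    // Returns one slice (ordered list of statements) per assertion in the test method.
    static List<List<Stmt>> slice(List<Stmt> testBody) {
        List<List<Stmt>> slices = new ArrayList<>();
        for (int i = 0; i < testBody.size(); i++) {
            if (!testBody.get(i).isAssertion) continue;
            Set<String> needed = new HashSet<>(testBody.get(i).use);  // slicing criterion
            List<Stmt> slice = new ArrayList<>();
            slice.add(testBody.get(i));
            for (int j = i - 1; j >= 0; j--) {                        // walk backwards
                Stmt s = testBody.get(j);
                if (!Collections.disjoint(s.def, needed)) {
                    slice.add(0, s);                                  // keep original order
                    needed.addAll(s.use);                             // its uses become new criteria
                }
            }
            slices.add(slice);
        }
        return slices;
    }
}
```

Note that, to stay conservative, a defining statement does not remove a variable from the criterion set: a method call may modify an object without fully defining it, so earlier definitions are still collected.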
III-C Code Instrumentation
After obtaining the assertion slices, we collect test dependencies by capturing the method calls for each test statement. We instrument the production code to log the method signature for each production method invocation, and produce a separate set to keep track of method calls for each test statement inside test cases during test execution. For helper methods that are in test classes but are not test methods (i.e., without @Test annotation), we treat them the same way as production methods.
Performing code instrumentation to collect dependencies at run-time for each new program revision would be inefficient. To reduce the overhead, we instrument each production and test class at the initial run and store the instrumented subject as a separate copy (shown as the gray box in Figure 3). When a new revision is ready, we only need to re-instrument the changed files computed by Algorithm 1, synchronize them to the instrumented copy of the subject, and execute the affected tests on the instrumented subject after test selection. Therefore, the dependency collection process is executed separately on the instrumented subject and does not interfere with the execution of the original subject.
We use D to denote all the test dependencies. For each test statement s, D(s) stands for the list of method invocations of s. Each method call is represented by its fully qualified method signature; thus, it is easy to retrieve the class-level dependencies when needed.
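A minimal sketch of such a collector is shown below; the Tracer class and its method names are hypothetical stand-ins for our instrumentation hooks. Instrumented production methods would call logCall with their fully qualified signature, and instrumented test code would call beginStatement before each test statement so that the calls that follow are attributed to D(s).

```java
import java.util.*;

// Hedged sketch of the run-time dependency collector (names are hypothetical).
public final class Tracer {
    private static String currentStatement;                         // e.g. "ComplexTest#testNegate:15"
    private static final Map<String, Set<String>> deps = new HashMap<>();

    // Inserted before each test statement by the test-code instrumentation.
    public static void beginStatement(String testStatementId) {
        currentStatement = testStatementId;
        deps.putIfAbsent(testStatementId, new LinkedHashSet<>());
    }

    // Inserted at the entry of each production (and helper) method.
    public static void logCall(String methodSignature) {
        if (currentStatement != null) deps.get(currentStatement).add(methodSignature);
    }

    // D(s): the method invocations recorded for test statement s.
    public static Set<String> dependenciesOf(String testStatementId) {
        return deps.getOrDefault(testStatementId, Collections.emptySet());
    }
}
```

For example, an instrumented Complex.negate would start with a call such as Tracer.logCall("org.apache.commons.math3.complex.Complex#negate()"), so every test statement that transitively invokes it records that signature in its dependency set.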
In practice, some test features might pose safety or efficiency challenges when tracing dependencies for each test statement. For parameterized test classes (annotated with @RunWith(Parameterized.class)), the method calls invoked by each test statement may vary across inputs, and tracing dependencies per statement while accounting for different inputs would increase analysis complexity. We therefore collect dependencies for each parameterized test class by logging all the method calls for the different parameters together into one dependency file. When the dependency file contains any changed methods, all the test cases under the different parameters of the parameterized test class are selected for execution. For test classes that use inheritance, since each subclass may inherit test methods from its superclass that are not explicitly present in the subclass, collecting dependencies at the test statement level is unsafe because the method calls invoked by the inherited test methods would be missing for the subclass. Therefore, for test classes that use inheritance (both subclasses and superclasses), we trace dependencies at the test class level by logging method calls for the whole test class execution. Moreover, when a test class contains test methods calling other tests, it is hard to isolate each test method; in this case, we also collect dependencies at the test class level. We identify test classes that are parameterized, use inheritance, or contain test methods calling other tests during the static source code analysis performed for code instrumentation. Then, to trace dependencies at the test class level, instead of inserting code to log method calls for each test statement, we insert code only at the beginning and the end of each test class to keep track of the method calls.
In addition to these specific test features, some test classes contain test methods that expect an exception to be thrown (e.g., annotated with @Test(expected=SomeException.class)). For those test methods, it is not meaningful to trace method calls for each test statement, as the execution will be interrupted at a certain point and the statements after that point will not be executed at all. Therefore, we collect dependencies for each such test method instead. Moreover, as described in Section III-B, we only slice tests with explicit assertions and without conditional statements. For test cases that contain conditional statements such as if-else or for-loops and are therefore not sliced by the assertion slicing module, it is sufficient to collect dependencies for each test method instead of each test statement, since they are not selected at the fine-grained assertion level; this reduces overhead.
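The following illustrative (hypothetical) JUnit4 class shows two of the test features discussed above: a parameterized test, for which dependencies are logged once for the whole class, and an expected-exception test, for which dependencies are collected per test method.

```java
import static org.junit.Assert.assertEquals;
import java.util.Arrays;
import java.util.Collection;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

// Parameterized tests: the methods invoked by a statement can differ per input row,
// so dependencies are collected for the whole class into one dependency file.
@RunWith(Parameterized.class)
public class DivisionTest {
    @Parameters
    public static Collection<Object[]> data() {
        return Arrays.asList(new Object[][] { { 6, 2, 3 }, { 9, 3, 3 } });
    }

    private final int dividend, divisor, expected;

    public DivisionTest(int dividend, int divisor, int expected) {
        this.dividend = dividend; this.divisor = divisor; this.expected = expected;
    }

    @Test
    public void testDivide() {
        assertEquals(expected, dividend / divisor);
    }

    // Expected-exception tests: execution stops at the throwing statement, so
    // dependencies are collected per test method rather than per statement.
    @Test(expected = ArithmeticException.class)
    public void testDivideByZero() {
        int ignored = dividend / 0;  // integer division by zero throws at runtime
    }
}
```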
III-D Test Selection
Input: A: a set of assertion slices, D: test dependencies,
C: production and test code changes
Output: T: selected test entities /* each selected entity is an assertion slice, a test method, or a test class */
At this point, we have a comprehensive analysis of the test code through assertion slicing and code instrumentation. Based on the levels of dependency data collection and the availability of assertion slices, the test entities will be safely selected at different levels.
Test assertion level selects related assertion slices inside a test method; test cases that our analysis handles and turns into assertion slices will be selected at this finest granularity level.
Test method level selects specific test methods in a test class; test cases without any assertion slices or those expecting exceptions to be thrown are selected at test method level.
Test class level selects all test methods in a test class. Test classes with special features (as described in Section III-C), i.e., those that are parameterized, use inheritance, or contain test cases calling other tests, are selected at the test class level for the sake of safety and efficiency.
Algorithm 2 summarizes the procedure for our test selection. Given the test dependencies D obtained by running the instrumented subject, the assertion slices A, and the changes computed by Algorithm 1, Algorithm 2 computes the selected test entities for the current code revision. T_a, T_m, and T_c denote the selected assertion slices, test methods, and test classes, respectively. We categorize the code changes obtained from Algorithm 1 into two distinct sets: test code changes and production code changes. For each changed test entity, which could be a modified or newly added test method or test class, its assertion slices need to be updated to align with the current code revision (Lines 2-3); additionally, the changed test entity is also selected for execution (Line 25). For each production code change, Line 6 computes the test entities that depend on it based on the dependency files in D. Because there are some special cases in which test dependencies are not collected at the test statement level (see Section III-C), the returned test entities might contain test methods and test classes. For each affected test entity t, if t is a test statement (Line 9), we retrieve the test method m that contains t (Line 10) and check whether assertion slices are available for m (Line 11). If assertion slices exist for m, all the assertion slices that contain t are selected (Lines 11-14). If there are no assertion slices for m, the selection is performed at the test method level (Lines 16-17) to guarantee a safe selection. If the test entity t is not a test statement, t is selected as a test method or test class instead (Line 20). Last but not least, all the selected test entities in T_a, T_m, and T_c, together with the changed test entities, are used to rewrite their corresponding test classes to make sure only the selected assertions, methods, or classes will be executed (Line 25). After the execution is finished, the test classes are restored to their original status, while the rewritten version is saved as a reference.
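A simplified sketch of this selection step is shown below, assuming the dependencies, slices, and changes have been loaded into in-memory maps; the types, identifiers, and helper lookups are illustrative and do not mirror Selertion’s actual data structures.

```java
import java.util.*;

// Hedged sketch of Algorithm 2: mapping production-code changes to selected
// assertion slices, test methods, or test classes.
final class TestSelection {

    static Selection select(Set<String> changedProdEntities,          // changed methods/classes
                            Map<String, Set<String>> dependencies,    // D: test entity -> called methods
                            Map<String, List<String>> slicesByMethod) { // A: test method -> slice ids
        Selection sel = new Selection();
        for (Map.Entry<String, Set<String>> e : dependencies.entrySet()) {
            String testEntity = e.getKey();   // statement id, test method, or test class
            if (Collections.disjoint(e.getValue(), changedProdEntities)) continue; // unaffected
            if (isTestStatement(testEntity)) {
                String testMethod = enclosingMethod(testEntity);
                List<String> slices = slicesByMethod.get(testMethod);
                if (slices != null) {
                    // Select every slice that contains the affected statement (T_a).
                    for (String slice : slices) {
                        if (sliceContains(slice, testEntity)) sel.assertionSlices.add(slice);
                    }
                } else {
                    sel.testMethods.add(testMethod);   // no slices: fall back to method level (T_m)
                }
            } else {
                sel.coarseEntities.add(testEntity);     // test method or test class (T_m / T_c)
            }
        }
        return sel;
    }

    static final class Selection {
        final Set<String> assertionSlices = new LinkedHashSet<>();
        final Set<String> testMethods = new LinkedHashSet<>();
        final Set<String> coarseEntities = new LinkedHashSet<>();
    }

    // Placeholder lookups that Selertion derives from its static analysis.
    static boolean isTestStatement(String id) { return id.contains(":"); }
    static String enclosingMethod(String stmtId) { return stmtId.substring(0, stmtId.indexOf(':')); }
    static boolean sliceContains(String sliceId, String stmtId) { return sliceId.contains(stmtId); }
}
```

Changed or newly added test methods and classes would additionally be selected outright, as described above, before the rewritten test classes are executed.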
III-E Implementation
We implemented our approach as a tool called Selertion, which is written in Java. For our source code analysis, we build on top of Eclipse Java Development Tools (JDT) [19] to analyze and instrument source code. Our current implementation supports Java programs that use Maven and JUnit 4. In addition to the standard test assertions in JUnit, Selertion supports assertions written in popular frameworks such as AssertJ [20] and Google Truth [21]. At the initial run, Selertion takes the source code of the program, obtains assertion slices, instruments production and test code, and runs the instrumented program to obtain test dependencies. For any subsequent revision of the program, Selertion takes the new revision of the program as input and computes the changes by comparing checksums with the old revision. It instruments only the changed files and updates the changed assertion slices. It then selects tests at assertion level as the finest granularity where possible by analyzing the dependencies, assertion slices and change information. It then executes the selected tests. The code for our implementation is publicly available at [22].
IV Evaluation
To evaluate our approach on real-world projects, we target the following research questions:
• RQ1 (prevalence): How prevalent are tests that can be selected at fine-grained test assertion level in practice?
• RQ2 (precision): How does our technique compare with existing RTS techniques in terms of precision?
• RQ3 (efficiency): How efficient is our approach regarding execution time?
• RQ4 (performance): What is the performance of Selertion in terms of analysis overhead and collection time? Is there any performance difference when disabling the assertion-level selection?
IV-A Existing RTS Tools
Our search criteria for existing test selection techniques to compare with include tools that are available, runnable, and target Java/JUnit. We selected two state-of-the-art tools, namely HyRTS [12] and Ekstazi [9]. To the best of our knowledge, these are the only options that currently meet our criteria. Notably, they outperform older RTS techniques such as FaultTracer [11], which is slower and selects tests at the method level, as documented in prior research [9].
HyRTS conducts the analysis of test dependencies at either the method or class level and selects test entities at the test class level. Ekstazi analyzes production code to identify changes at the class level. Although the original paper [9] mentions the provision of both method- and class-level selection, the Ekstazi tool [23] currently ships with default settings that select tests at the test class level. Our technique also collects test dependencies at a mix of method and class level, similar to HyRTS; however, it can select test entities at multiple levels, from the coarse-grained class level down to the fine-grained assertion level, based on its analysis of the test suite. Therefore, theoretically speaking, Selertion should be able to select tests more precisely than both HyRTS and Ekstazi. We empirically assess this in our evaluation.
Subject | Head | Prod. LOC (K) | Test LOC (K) | Total time (s) | Time per test (s) | Test cases (#) | Fine-grained (#) | Fine-grained (%) | Assertions (#) | Assertions per test
---|---|---|---|---|---|---|---|---|---|---
asterisk-java [24] | 8fe3e4c | 51.4 | 5.0 | 35 | 0.14 | 248 | 187 | 75 | 677 | 2.7 |
commons-net [25] | 2b0f3383 | 20.0 | 7.6 | 86 | 0.33 | 49 | 29 | 59 | 116 | 2.4 |
commons-exec [26] | a374f35 | 1.8 | 1.8 | 99 | 0.95 | 100 | 57 | 57 | 210 | 2.1 |
tabula-java [27] | 5f43a93 | 4.6 | 2.3 | 110 | 0.54 | 138 | 94 | 68 | 311 | 2.3 |
OpenTripPlanner [28] | 8ee31d5 | 81.1 | 11.4 | 130 | 0.36 | 320 | 225 | 70 | 1,446 | 4.5 |
commons-math [13] | c4a093c | 88.4 | 91.3 | 160 | 0.03 | 4,003 | 1,308 | 33 | 11,229 | 2.8 |
stream-lib [29] | 7360389 | 4.8 | 3.8 | 165 | 1.24 | 135 | 38 | 28 | 425 | 3.2 |
tika-parsers [30] | fd1926d | 49.5 | 21.9 | 181 | 0.16 | 1196 | 221 | 18 | 5722 | 4.8 |
accumulo-core [31] | 5cec6a2 | 193.1 | 26.3 | 338 | 0.32 | 1046 | 433 | 41 | 3325 | 3.2 |
commons-pool [32] | 3ca09a7 | 5.5 | 8.9 | 354 | 1.21 | 262 | 28 | 11 | 842 | 3.2 |
LogicNG[33] | f94879fa | 23.6 | 12.9 | 546 | 0.67 | 831 | 552 | 66 | 3874 | 4.7 |
Total/Average | - | 523.9 | 193.0 | 2,204 | 0.54 | 8,328 | 3,172 | 48 | 28,177 | 3.3 |
IV-B Subject Systems
We include a total of 11 subject systems. These subjects are selected based on their relevance to prior research on test assertion analysis [34, 14] and regression test selection [12] and their popularity on GitHub. To ensure fair comparison among Selertion, HyRTS and Ekstazi, our selection criteria include subjects that use Maven with executable and passing JUnit4 test cases. We focus on subjects with a test runtime of at least 30 seconds. As reported by others [12, 9], for lower test runtimes it can take more time to analyze the code than to execute the entire test suite.
The selected subjects and their characteristics are shown in Table I, in ascending order of their testing time. All the subjects are single-module projects, except Tika [30] and Accumulo [31]. Because some modules of these multi-module projects have only a few test cases, for simplicity we choose the module with the most test cases to analyze (i.e., the tika-parsers module for Tika and the core module for Accumulo). We examine the test suites of each subject and blacklist test classes that do not meet the selection criteria (e.g., test classes that mix JUnit3 and JUnit4 APIs in OpenTripPlanner). As some of the subjects were recently migrated to JUnit5 [35], for consistency we start from a revision that uses JUnit4 as the Head, collect the 100 revisions before it, and remove revisions that fail to build or have no code changes. For the three subjects commons-exec, commons-net, and commons-pool, which lack a sufficient number of revisions with substantial source code changes, we introduce changes artificially: we use a source code mutation generator called Major [36] to create random mutants in various production methods, thereby introducing more source code changes. The selected subjects span different application domains and exhibit diverse code and test suite sizes. Table I presents the short hash under the Head column and the lines of code for both production and test code at the Head revision. We measure lines of code using Cloc [37]. The Time column details the average time required to execute all the tests using the build command (i.e., mvn test) over the revisions we analyzed, along with the average execution time per test case. The table also presents information about test cases and test assertions, which is further elaborated in the following subsection.
IV-C Procedure and Results
All our experiments are conducted on a macOS server with a 2.4 GHz Quad-Core Intel Xeon processor and 64 GB of memory, running 64-bit OpenJDK versions 1.8.0_212 and 11.
IV-C1 Prevalence (RQ1).
Given that we exclude parameterized tests, inherited tests, and tests dependent on other tests from our assertion-level selection (such tests are selected at the method or class level) for the safety of our analysis, it is important to know to what extent the test suite of each subject can be selected at the fine-grained assertion level. To measure the prevalence of fine-grained selection, we implemented a tool using JDT [19] that statically counts the total number of test cases that can be selected at fine granularity for every revision we executed. We present the average values for each subject in Table I. Additionally, because the smallest unit of interest for our study is the test assertion, we count the total number of assertions and calculate the average number per test. These numbers are included under the Assertions column in the table, both in total and per test. It is worth noting that we opt to count these numbers statically instead of relying on information obtained from executed tests (e.g., as provided by maven-surefire-plugin [38]) to avoid repetitive counting, as a single test case may be executed multiple times due to inheritance or different parameters.
The column Fine-grained in Table I shows the number and percentage of test cases that can be selected at fine granularity by Selertion in the subject systems. The results show that, on average, 48% of the total 8,328 test cases can be selected at the assertion level. The percentages range from 11% (commons-pool) to 75% (asterisk-java). Moreover, if a test case contains fewer than two assertions, the benefit of fine-grained selection is limited because assertion-level selection becomes equivalent to method-level selection (i.e., the whole test method is selected). Therefore, we count explicit assertion statements (e.g., assertTrue) inside each test method body. The Assertions column in Table I shows the number of assertions for each subject, which totals 28,177 across all subjects. The average number of assertions per test case ranges from 2.1 for commons-exec to 4.8 for tika-parsers, and the overall average across all subjects is 3.3 assertions per test case. This indicates that (1) most test cases contain multiple assertions in practice, (2) tests that are analyzable by our approach at the fine-grained assertion level are prevalent, and (3) our fine-grained selection approach is feasible to apply to real-world test suites.
IV-C2 Precision (RQ2).
To assess the efficacy of our approach in terms of precision, we use the widely used RTS metric Selected Test Ratio [39, 9, 40, 41, 10, 11, 12] to compare with the other two RTS tools. Selected Test Ratio is defined as the ratio of the number of selected test cases to the number of total test cases. As the fine-grained selection is at the test assertion level in this study, we define a new metric Selected Assertion Ratio as the ratio of the number of selected assertions to the number of total assertions. Lower selected test/assertion ratio indicates better selection precision, as it signifies the selection of fewer tests.
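In formula form, following these definitions:

```latex
\[
\textit{Selected Test Ratio} = \frac{\#\,\text{selected test cases}}{\#\,\text{total test cases}},
\qquad
\textit{Selected Assertion Ratio} = \frac{\#\,\text{selected assertions}}{\#\,\text{total assertions}}
\]
```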
Our precision evaluation results are presented in Table II and Figure 4. In Table II, the macro-columns Selected Assertion Ratio and Selected Test Ratio present the mean values over all the revisions we studied for Selertion, HyRTS, and Ekstazi, respectively, except for commons-exec, tabula-java, and tika-parsers, on which HyRTS failed to run due to StackOverflow errors. We therefore mark the failed subjects and the average values for HyRTS with an asterisk to signify that these averages exclude the failed subjects and are thus computed differently from those of Selertion and Ekstazi. As our results show, Selertion has lower selected test ratio values, which means it selects fewer test cases and assertions than the other two tools. Regarding the selected assertion ratio, on average, Selertion selects 15.10% of all test assertions across all subjects, which is 16.42 percentage points less than Ekstazi. When compared with HyRTS (excluding the marked subjects), Selertion selects 7.34 percentage points less than HyRTS’s average of 17.96%. For the selected test ratio, overall our tool selects 15.80% of all test cases (10.85% when excluding the marked subjects). This is 8.31 percentage points less than HyRTS (excluding the marked subjects) and 17.78 percentage points less than Ekstazi. Per subject, the precision increase in terms of selected assertion ratio ranges from 0.7% (OpenTripPlanner) to 22.3% (LogicNG) in comparison with HyRTS, and from 4.6% (OpenTripPlanner) to 42.9% (LogicNG) compared with Ekstazi. Meanwhile, our tool decreases the selected test ratio by 3.9% (OpenTripPlanner) to 22.5% (LogicNG) in comparison to HyRTS, and by 7.7% (accumulo-core) to 44.5% (LogicNG) compared with Ekstazi. It is worth mentioning that a smaller precision improvement on one subject than on another does not mean our technique performs worse on that subject, or vice versa. Rather, the selected ratios are related to the size of the code changes in the sample revisions we collect and analyze: larger code changes lead to higher selected ratios as well as larger ratio differences between tools (e.g., LogicNG). Moreover, to provide more insight into the selected assertion ratios for each subject beyond the mean values, Figure 4 includes boxplots depicting the distributions of selected assertion ratios across revisions; the mean values are marked as diamonds and the median values as lines. The results indicate that our approach selects tests more precisely than other, coarser-grained tools.
Subject | Asrt. Ratio Selertion (%) | Asrt. Ratio HyRTS (%) | Asrt. Ratio Ekstazi (%) | Test Ratio Selertion (%) | Test Ratio HyRTS (%) | Test Ratio Ekstazi (%)
---|---|---|---|---|---|---
asterisk-java | 13.9 | 22.4 | 29.9 | 12.9 | 20.6 | 26.7 |
commons-net | 4.0 | 6.7 | 13.1 | 0.3 | 5.4 | 5.9 |
commons-exec* | 27.3 | - | 53.6 | 28.4 | - | 53.9 |
tabula-java* | 32.8 | - | 39.1 | 39.6 | - | 48.4 |
OpenTripPlanner | 5.8 | 6.5 | 10.4 | 5.9 | 9.8 | 16.8 |
commons-math | 3.0 | 8.6 | 15.6 | 7.9 | 11.9 | 19.6 |
stream-lib | 6.9 | 14.7 | 17.0 | 11.0 | 19.3 | 22.0 |
tika-parsers* | 21.0 | - | 36.2 | 19.0 | - | 32.9 |
accumulo-core | 8.1 | 13.0 | 14.9 | 3.7 | 10.0 | 11.4 |
commons-pool | 24.7 | 30.9 | 55.4 | 29.3 | 38.0 | 71.5 |
LogicNG | 18.6 | 40.9 | 61.5 | 15.8 | 38.3 | 60.3 |
Ave. | 15.10 | 17.96* | 31.52 | 15.80 | 19.16* | 33.58 |

IV-C3 Efficiency (RQ3).
Because we choose to analyze source code instead of bytecode in our implementation, to assess whether Selertion effectively reduces testing time, we measure the offline End-to-End time (AE time), which includes the analysis (A) time to compute code changes, select affected test entities, and update assertion slices, as well as the execution (E) time for the selected test entities. We then normalize the end-to-end time into a ratio with respect to the overall time of executing the whole test suite (RetestAll). We perform the measurement for each revision and report the average. For a fair comparison, we also measure the offline end-to-end time and its ratio for HyRTS and Ekstazi, respectively. The results are shown in Table III. The column RetestAll shows the average time required to run the whole test suite for each subject as a reference. The other three macro-columns present the offline end-to-end time and its ratio to RetestAll for Selertion, HyRTS, and Ekstazi separately. Besides the average values in Table III, to further understand the results, we perform a Paired T-Test [42] with a significance level of 5% (α = 0.05) to determine whether there is a significant difference between each pair of groups (i.e., Selertion vs. HyRTS, Selertion vs. Ekstazi, and HyRTS vs. Ekstazi) over all the revisions of every subject. We choose the Paired T-Test because, as a parametric test, it has greater statistical power than non-parametric tests like the Wilcoxon Test [42] and can provide trustworthy results even for non-normal distributions. We highlight the best results (shortest running time) in yellow in each row across the three tools. If the end-to-end time of a tool has no statistical difference from the others, it has competitive or equivalent capability for reducing test time.
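In formula form, for each revision the reported ratio is computed as follows, where T_A is the analysis time, T_E the execution time of the selected test entities, and T_RetestAll the time to run the entire test suite:

```latex
\[
\textit{AE ratio} = \frac{T_{A} + T_{E}}{T_{\textit{RetestAll}}} \times 100\%
\]
```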
After examining the results of the offline end-to-end time, in Table III we mark a subject in light gray if Selertion performs competitively with or better than only one of the other tools; otherwise, Selertion outperforms both of the other tools for that subject. Across all 11 subjects, Selertion consistently reduces testing time effectively, with ratios to RetestAll ranging from 14.9% (commons-net) to 70.3% (OpenTripPlanner). This translates to an overall decrease in testing time of 29.7% to 85.1%. For asterisk-java, with a 35-second runtime, our tool is competitive with HyRTS but slower than Ekstazi. For commons-net, although HyRTS has the shortest running time, Selertion is only 1 second slower than HyRTS and equivalent to Ekstazi, with no statistical difference from either tool. For commons-math, Selertion outperforms Ekstazi but is 5.3 seconds slower than HyRTS.
Longer Test Runtime. For the subjects with runtimes longer than 86 seconds, Selertion performs significantly better than Ekstazi in terms of testing time. HyRTS fails to run on commons-exec, tabula-java, and tika-parsers; thus, we treat Selertion as superior on these three subjects. The exception is commons-math, which has a long test time of 160 seconds: the reason HyRTS works better than Selertion here is that the execution time per test for commons-math is very short, only 0.03s (shown in Table I). Therefore, although the number of executed assertions is reduced significantly, the reduction in execution time is marginal. Although HyRTS has a lower average testing time for commons-math (row 8), the result of the Paired T-Test indicates that there is no significant difference between Selertion and HyRTS. Selertion performs better than HyRTS for the remaining subjects. Generally speaking, we can conclude that Ekstazi works best for subjects with short test runtimes, such as asterisk-java with a 35-second runtime, while Selertion outperforms or is competitive with both HyRTS and Ekstazi for long-running subjects, i.e., those with test runtimes longer than 86 seconds in our experiments. The results show that our fine-grained RTS technique is more effective in handling time-intensive tests.
Considering all test suites in our subjects, short and long, the total end-to-end time on average for Selertion is 74 seconds, which is 10 seconds less than HyRTS and 46 seconds less than Ekstazi. The average gain compared with HyRTS could potentially be larger if HyRTS were executable on commons-exec, tabula-java, and tika-parsers. Relative to RetestAll, our technique reduces the overall test execution time by 63%, which is 5 percentage points more than HyRTS (again, ignoring commons-exec, tabula-java, and tika-parsers) and significantly more than Ekstazi. Our results show that Selertion is more precise and more efficient, especially for test suites with longer execution times, which benefit most from test selection in practice.
Subject | RetestAll (s) | Selertion (s) | Selertion (%) | HyRTS (s) | HyRTS (%) | Ekstazi (s) | Ekstazi (%)
---|---|---|---|---|---|---|---
asterisk-java | 35 | 19.4 | 58.7 | 18.1 | 53.6 | 11.7 | 35.0 |
commons-net | 86 | 13.0 | 14.9 | 11.9 | 13.7 | 12.9 | 14.9 |
commons-exec* | 99 | 48.6 | 49.1 | - | - | 70.2 | 70.9 |
tabula-java* | 110 | 76.1 | 69.3 | - | - | 81.0 | 73.9 |
OpenTripPlanner | 130 | 89.8 | 70.3 | 93.5 | 73.1 | 119.9 | 94.0 |
commons-math | 160 | 49.0 | 30.5 | 43.7 | 27.2 | 63.2 | 39.4 |
stream-lib | 165 | 42.4 | 25.7 | 52.1 | 31.4 | 53.9 | 32.5 |
tika-parsers* | 181 | 72.8 | 40.0 | - | - | 80.3 | 44.3 |
accumulo-core | 338 | 66.7 | 19.7 | 80.1 | 23.6 | 104.3 | 30.8 |
commons-pool | 354 | 160.7 | 45.4 | 176.7 | 49.9 | 308.0 | 87.0 |
LogicNG | 546 | 179.1 | 32.8 | 192.0 | 34.7 | 411.7 | 74.9 |
Ave. | 200.36 | 74.33 | - | 83.51* | - | 119.74 | - |
IV-C4 Performance (RQ4).
Subject | AE Asrt (s (%)) | AE Mthd (s) | AE diff (s) | A Asrt (s (%)) | A Mthd (s) | A diff (s) | C Asrt (s) | C Mthd (s) | Overhead Asrt (s) | Overhead Mthd (s) | Sel. Asrt (%) | Sel. Mthd (%) | Sel. diff (%)
---|---|---|---|---|---|---|---|---|---|---|---|---|---
asterisk-java | 19.36 (58.74) | 17.25 | 2.11 | 3.84 (11.72) | 3.71 | 0.14 | 17.51 | 17.40 | 34.26 | 33.30 | 13.87 | 14.24 | -0.37 |
commons-net | 12.96 (14.89) | 12.86 | 0.10 | 1.17 (1.34) | 1.16 | 0.01 | 12.31 | 12.22 | 89.86 | 95.21 | 4.03 | 4.95 | -0.92 |
commons-exec | 48.64 (49.13) | 48.76 | -0.12 | 2.56 (2.59) | 2.05 | 0.51 | 55.89 | 57.14 | 143.14 | 101.77 | 27.29 | 28.91 | -1.62 |
tabula-java | 76.07 (69.32) | 75.97 | 0.10 | 6.24 (5.69) | 6.15 | 0.09 | 76.71 | 77.14 | 122.02 | 122.05 | 32.80 | 32.96 | -0.15 |
OpenTripPlanner | 89.82 (70.33) | 89.92 | -0.10 | 13.60 (10.68) | 13.32 | 0.28 | 142.76 | 143.49 | 640.88 | 659.89 | 5.76 | 6.22 | -0.46 |
commons-math | 49.00 (30.55) | 48.11 | 0.89 | 6.55 (4.08) | 5.93 | 0.62 | 108.23 | 107.38 | 617.13 | 689.03 | 3.04 | 3.05 | -0.01 |
stream-lib | 42.43 (25.70) | 41.64 | 0.79 | 2.25 (1.36) | 2.18 | 0.07 | 108.38 | 107.80 | 367.89 | 363.19 | 6.90 | 6.96 | -0.06 |
tika-parsers | 72.80 (40.04) | 71.22 | 1.57 | 18.93 (10.48) | 18.14 | 0.80 | 108.55 | 99.88 | 887.99 | 926.05 | 21.00 | 21.02 | -0.02 |
accumulo-core | 66.70 (19.71) | 78.18 | -11.48 | 5.51 (1.63) | 4.91 | 0.60 | 59.49 | 96.87 | 585.35 | 580.23 | 8.13 | 8.22 | -0.09 |
commons-pool | 160.73 (45.40) | 168.80 | -8.07 | 2.36 (0.67) | 2.50 | -0.14 | 173.58 | 187.88 | 403.94 | 401.76 | 24.74 | 26.74 | -2.00 |
LogicNG | 179.06 (32.78) | 175.77 | 3.29 | 12.08 (2.20) | 9.05 | 3.03 | 744.54 | 767.02 | 641.25 | 632.86 | 18.60 | 18.86 | -0.26 |
Ave. | 74.32 | 75.32 | -0.99 | 6.83 | 6.28 | 0.55 | 146.18 | 152.20 | 412.16 | 418.67 | 15.11 | 15.65 | -0.54 |
To evaluate the performance of Selertion, we measure the offline end-to-end time (AE time), the analysis time (A time), the collection time (C time) for dependency collection, which runs in a separate process, as well as the initial overhead. In addition, we disable the finest-grained selection component of Selertion to assess the impact of selecting test entities at the assertion level; after disabling the component, the granularity of selection becomes the test method level. To keep the analysis safe, selection at the test method level still requires instrumenting the test code at the statement level to inspect whether any test method depends on other tests. As before, test classes that contain test methods calling other tests are selected at the test class level, along with those that use inheritance or parameterized tests.
For the AE time, we already presented the results for Selertion in Table III. To compare with method-level selection, we present the results in seconds and percentages in Table IV under End-to-end Time (AE), and report the difference in AE time between Selertion and the method-level variant. Our analysis reveals some variation in end-to-end time across subjects when comparing assertion-level to method-level selection. For subjects like commons-exec, OpenTripPlanner, accumulo-core, and commons-pool, assertion-level selection consistently requires less end-to-end time. For the remaining subjects, assertion-level selection tends to demand slightly more time, with differences ranging from 0.1 second (commons-net) to 3.29 seconds (LogicNG) on average. Notably, the overall average time difference is -0.99 seconds, indicating that assertion-level selection is, on average, faster. To gain further insight, we group the 11 subjects into two categories: the first consists of subjects with lower average times than method-level selection, and the second of subjects with higher times. We conduct a Paired T-Test for these two groups to compare the results between Selertion and its coarser-grained variant; the significance level is chosen to enhance the sensitivity of the test [42]. The outcome indicates that, for the first group, Selertion is significantly faster than method-level selection. For the second group, which comprises the subjects with higher end-to-end times, there is no statistical difference in offline end-to-end running time between Selertion and its coarser-grained variant. Therefore, we can conclude that Selertion is either significantly faster than or equivalent to its method-level variant.
The analysis time (A) is positively correlated with the size of the code changes, because large code changes require more time for change computation. For example, because the revisions of tika-parsers have relatively large code changes, as shown in Figure 4 where its boxes are higher than the others, tika-parsers has the longest analysis time of 18.93 seconds. However, for the long-running subjects, the analysis time takes only a small portion of the overall time: tika-parsers’ 18.93 seconds of average analysis time is only 10.48% of RetestAll. In contrast, asterisk-java has an analysis time of 3.84 seconds, much shorter than 18.93 seconds, yet this amounts to 11.72% of RetestAll. That might also explain why Selertion works better on long-running subjects. Comparing with the method-level variant, the columns under Analysis Time (A) in Table IV show that the analysis time for Selertion is higher than that of method-level selection for all subjects except commons-pool. Overall, it takes 0.55 seconds more on average than method-level selection in the analysis phase. Besides the mean values reported in Table IV, we also perform a Paired T-Test; the p-value is much smaller than the significance level, so we conclude that fine-grained assertion selection requires more analysis time than method-level selection, as expected.
In our implementation, the collection phase runs in a separate process, meaning developers can obtain the RTS results first and do not need to wait for the data collection to finish. It is still necessary, however, to measure the collection time to evaluate performance. The results are shown under Collection Time (C) in Table IV. Considering all 11 subjects as a whole, the average collection time is 146.18 seconds for Selertion and 152.20 seconds for method-level selection. Based on the Paired T-Test, the collection time for method-level selection has no statistical difference from Selertion. The initial overhead includes code instrumentation, backward assertion slicing, and dependency collection at the initial run; the results are presented under Initial Overhead in Table IV. For assertion-based selection, the time ranges from 34.26 seconds for asterisk-java to 887.99 seconds (14.8 minutes) for tika-parsers. Similarly, the Paired T-Test indicates no statistical difference between the two variants in terms of initial overhead. Because the initial overhead is a one-time cost, it is acceptable compared to the execution time saved by our tool over time.
Besides the time components, we also compare the selected assertion ratio between Selertion and the method-level variant. The results are shown in the fifth macro-column of Table IV. Overall, fine-grained test selection selects fewer assertions than method-level selection: on average, Selertion selects 0.54 percentage points fewer assertions per revision for each subject, which translates to 14 fewer assertions in absolute numbers. The p-value of the Paired T-Test for the selected assertion ratio likewise indicates that fine-grained selection is significantly better than method-level selection in this respect.
In comparison with the variant that disables fine-grained assertion selection, although Selertion requires more analysis time, its end-to-end time, collection time, and initial overhead are either lower than or equivalent to those of the coarse-grained variant. This indicates that fine-grained selection generally does not consume more time or cause a performance loss. Moreover, fine-grained assertion-level selection is either significantly faster than or equivalent to its method-level variant and selects fewer assertions with higher precision across all subjects. Therefore, we can conclude that fine-grained assertion-level selection surpasses method-level selection in terms of overall performance and precision.
V Discussion
In this section, we discuss our efforts to guarantee the safety of the fine-grained selection, and reflect on our findings, tool design decisions, limitations, and the threats to validity of our study.
Fine-Grained Selection Safety. In terms of production code analysis, we follow the design decisions of the prior works HyRTS [12] and Ekstazi [9], both of which argue that their test selection is safe. Our analysis granularity for production code is a combination of the method and class levels, because we transform changes outside method bodies into coarse-grained class-level changes. It has been shown by Zhang [12] that this kind of change transformation cannot introduce new safety issues. Thus, if there are any potential safety issues in our technique, they may derive from the fine-grained instrumentation and selection of test code. Our assertion slicing component (see Section III-B) determines whether a test can be sliced into assertion slices, and we have performed many experiments to make sure the assertion slices are executable without breaking the original test methods. To guarantee a safe selection, we perform specific analysis (see Section III-C) to conservatively identify tests that are infeasible for finer-grained selection, at both the test method and assertion levels, and we take a conservative selection approach (see Section III-D). For tests containing statements outside our analysis scope, we also take a conservative approach to be safe. For instance, test statements that call external APIs such as Thread.sleep are not safe to select at the assertion level, because the Thread.sleep statement will not be included in any assertion slice; we downgrade such tests to method-level selection.
Test Suite Effectiveness. We employed Major [36] to randomly generate mutants across different methods in the subjects and compared the effectiveness of the test suite reduced by Selertion with that of the original test suite in killing these mutants. In all subjects, the mutants killed by Selertion’s reduced test suite align with those killed by the original test suite. For example, for commons-math, which has the most test cases in our evaluation, Selertion achieves a mutation score of 47%, identical to that of the original test suite. These empirical findings affirm that our fine-grained selection approach maintains the same test effectiveness as the original test suite, and thus is safe.
Source code vs. Bytecode. Unlike the other tools, which identify changes using Java bytecode, we focus on changes at the source code level. There are some scenarios where this leads to completely different results. For example, if a program revision only upgrades the Java version without any real source code changes, our tool will not select any tests; however, because the bytecode changes completely, the other tools will select the whole test suite. Moreover, some modifications of source code do not affect the bytecode, such as changing a+=1 to a=a+1; here, Selertion selects the tests related to the source code changes, whereas the other tools will not select any tests.
Applications. Besides regression test selection, our technique can be useful for developers in understanding and maintaining test code. Developers often have to take over other developers’ tasks. Unit tests are a great resource to understand the existing code. With the help of our tool, developers can make some small changes in a method without changing its functionality to obtain detailed test entities, especially test assertions related to the method, and use the information to understand the method and its relation with the test. The assertion slicing module can also be used to detect test smells such as redundant statements in test cases. Currently, our implementation only targets unit tests to compare with the prior works. However, unit tests usually run fast; for example, the average execution time per assertion is only around 0.08 seconds for the subjects in our experiments. Our fine-grained selection technique can be more powerful on tests with longer running time, such as tests in the web and mobile app domains.
Limitations. Our implementation currently supports assertions from JUnit4 and assertion frameworks built on top of JUnit4 such as Google Truth [21] and AssertJ [20]. However, the tool provides flexibility to manually add a list of assertions (customized or from other frameworks) to be included in the analysis. We do not support fine-grained selection for multithreaded tests, because the strategy for dependency collection is designed to collect method dependencies for each test statement in a sequential fashion which might not work for multithreaded tests.
Threats to Validity. The main threat to internal validity is that our implementation may have bugs that impact the results. To improve confidence in our implementation, we performed numerous tests to make sure the selection results do not break the original test cases, and we also manually inspected the experimental results for several subjects. Using a limited number of subject systems, and mutants to simulate code changes for only three out of the 11 subjects in our evaluation, could threaten the external validity (generalizability) of our results. We tried to minimize this threat by selecting subjects from different domains, with various code sizes and a wide range of test execution times. Moreover, the selected subjects are well maintained and widely used in recent RTS [12] and fine-grained analysis [34] work. We opted to use mutants to simulate code changes for the three subjects that lack real code changes because our analysis of production code is coarse-grained, focusing on the location of changes rather than the specifics of the changes themselves. To enhance the representativeness of these simulated code changes, we introduced mutants randomly into various locations within the source code to closely mirror real-world situations.
VI Related Work
Regression test selection (RTS) has been a widely studied topic, with various techniques ranging from basic-block-level to module-level granularity reviewed in several surveys [43, 44, 45, 7, 8]. The basic-block-level RTS, introduced by Rothermel and Harrold for C programs and later extended to Java [46, 39], evolved into coarser-grained methods to improve efficiency. Notable among these are FaultTracer and Chianti at method level [11, 10], and Ekstazi at file level [9], with Ekstazi demonstrating superior performance in testing time and scalability. Further research explored coarser granularities, like file-level and module-level RTS [47, 48]. The trade-off between precision and efficiency led to hybrid approaches, such as Orso et al.’s two-phase RTS [41] and Zhang’s HyRTS, which combines method and class level analysis [12]. Alongside dynamic RTS, static RTS techniques have been explored, though they face challenges with safety and practicality [49, 40, 50].
Current RTS research primarily focuses on the granularity of production code analysis, typically selecting test entities at class or method level. Building on the concept of checked coverage [51] and its correlation with test suite effectiveness [14], our work innovates by selecting tests at the assertion level, treating each assertion and its slice (i.e., associated statements) as the smallest selection unit.
Research on breaking down test cases for assertion-related studies includes Chen et al.’s assertion-aware test reduction [52], Vahabzadeh et al.’s fine-grained test case minimization [34], Fang et al.’s assertion fingerprints for identifying similar test cases [53], and Xuan et al.’s work on enhancing dynamic analysis and test purification [54, 55]. However, none of these specifically target RTS at the assertion level, making our approach a novel contribution to the field.
VII Conclusions
We proposed the first fine-grained regression test selection technique that selects test entities at the assertion granularity level inside test methods. We used the proposed technique to implement a tool called Selertion, which effectively selects change-affected test entities to increase selection precision and reduce test runtime. We performed an empirical evaluation on 11 subject systems and compared Selertion with two state-of-the-art RTS tools. The results show that, by analyzing tests at the assertion level, our technique improves selection precision for all subjects, selecting fewer tests. Selertion also outperforms existing tools on subjects with longer test runtimes. On average, it reduces the overall test time by 63% compared to re-running all tests (RetestAll) across all subjects. For future work, we plan to apply fine-grained test analysis to other techniques such as test prioritization. We will also upgrade the implementation to support JUnit 5 and extend the tool to other test domains, including web and mobile tests, which are known to have much longer running tests.
References
- [1] E. Engström and P. Runeson, “A qualitative survey of regression testing practices,” in International Conference on Product Focused Software Process Improvement. Springer, 2010, pp. 3–16.
- [2] S. Elbaum, G. Rothermel, and J. Penix, “Techniques for improving regression testing in continuous integration development environments,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2014, pp. 235–245.
- [3] A. Memon, Z. Gao, B. Nguyen, S. Dhanda, E. Nickell, R. Siemborski, and J. Micco, “Taming google-scale continuous testing,” in 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP). IEEE, 2017, pp. 233–242.
- [4] S. Elbaum, A. G. Malishevsky, and G. Rothermel, “Prioritizing test cases for regression testing,” in Proceedings of the 2000 ACM SIGSOFT international symposium on Software testing and analysis. ACM, 2000, pp. 102–112.
- [5] M. Gligoric, S. Negara, O. Legunsen, and D. Marinov, “An empirical evaluation and comparison of manual and automated test selection,” in Proceedings of the 29th ACM/IEEE international conference on Automated software engineering. ACM, 2014, pp. 361–372.
- [6] A. Kiran, W. H. Butt, M. W. Anwar, F. Azam, and B. Maqbool, “A comprehensive investigation of modern test suite optimization trends, tools and techniques,” IEEE Access, vol. 7, pp. 89093–89117, 2019.
- [7] S. Yoo and M. Harman, “Regression testing minimization, selection and prioritization: a survey,” Software testing, verification and reliability, vol. 22, no. 2, pp. 67–120, 2012.
- [8] R. Kazmi, D. N. Jawawi, R. Mohamad, and I. Ghani, “Effective regression test case selection: A systematic literature review,” ACM Computing Surveys (CSUR), vol. 50, no. 2, pp. 1–32, 2017.
- [9] M. Gligoric, L. Eloussi, and D. Marinov, “Practical regression test selection with dynamic file dependencies,” in Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA ’15). ACM, 2015, pp. 211–222.
- [10] X. Ren, F. Shah, F. Tip, B. G. Ryder, and O. Chesley, “Chianti: a tool for change impact analysis of java programs,” in Proceedings of the 19th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications (OOPSLA04). ACM, 2004, pp. 432–448.
- [11] L. Zhang, M. Kim, and S. Khurshid, “Localizing failure-inducing program edits based on spectrum information,” in Proceedings of the 27th IEEE International Conference on Software Maintenance (ICSM ’11). IEEE Computer Society, 2011, pp. 23–32.
- [12] L. Zhang, “Hybrid regression test selection,” in Proceedings of the 40th International Conference on Software Engineering (ICSE ’18). IEEE, 2018, pp. 199–209.
- [13] Math, “The apache commons mathematics library,” https://github.com/apache/commons-math/tree/master-old, 2016, accessed: 2022-11-10.
- [14] Y. Zhang and A. Mesbah, “Assertions are strongly correlated with test suite effectiveness,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE’15). ACM, 2015, pp. 214–224.
- [15] M. Weiser, “Program slicing,” IEEE Transactions on software engineering, vol. SE-10, no. 4, pp. 352–357, 1984.
- [16] D. P. Mohapatra, R. Mall, and R. Kumar, “An overview of slicing techniques for object-oriented programs,” Informatica, vol. 30, no. 2, pp. 253–277, 2006.
- [17] C. Hammer and G. Snelting, “An improved slicer for java,” in Proceedings of the 5th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering. ACM, 2004, pp. 17–22.
- [18] Z. Chen and B. Xu, “Slicing object-oriented java programs,” ACM Sigplan Notices, vol. 36, no. 4, pp. 33–40, 2001.
- [19] JDT, “Eclipse java development tools,” https://www.eclipse.org/jdt/, 2022, accessed: 2022-11-10.
- [20] AssertJ, “fluent assertions java library,” https://assertj.github.io/doc/, 2022, accessed: 2022-11-10.
- [21] Google Truth, “Fluent assertions for java and android,” https://truth.dev/, 2022, accessed: 2022-11-10.
- [22] Selertion, “Fine-grained test selection at assertion-level,” https://anonymous.4open.science/r/Selertion, 2022.
- [23] M. Gligoric, L. Eloussi, and D. Marinov, “Ekstazi: Lightweight test selection,” http://www.ekstazi.org/maven.html, 2014, accessed: 2023-10-24.
- [24] Asterisk-Java, “The free java library for asterisk pbx integration,” https://github.com/asterisk-java/asterisk-java, 2020, accessed: 2022-11-10.
- [25] Net, “Apache commons net,” https://commons.apache.org/proper/commons-net/, 2022, accessed: 2023-08-01.
- [26] Apache Commons Exec, “Executing external processes from java,” https://commons.apache.org/proper/commons-exec/, 2014, accessed: 2023-08-01.
- [27] tabula java, “Extract tables from pdf files,” https://github.com/tabulapdf/tabula-java, 2022, accessed: 2023-01-30.
- [28] OpenTripPlanner, “An open source multi-modal trip planner,” www.opentripplanner.org, 2021, accessed: 2022-11-10.
- [29] Stream-lib, “Stream summarizer and cardinality estimator,” https://github.com/addthis/stream-lib/tree/v2.9.8, 2019, accessed: 2022-11-10.
- [30] Apache Tika, “A content analysis toolkit,” https://tika.apache.org/, 2020, accessed: 2022-11-10.
- [31] Apache Accumulo, “A sorted, distributed key/value store that provides robust, scalable data storage and retrieval,” https://accumulo.apache.org/, 2019, accessed: 2022-11-10.
- [32] Apache Commons Pool, “The apache commons object pooling library,” https://github.com/apache/commons-pool, 2019, accessed: 2022-11-10.
- [33] LogicNG, “The next generation logic library,” https://github.com/logic-ng/LogicNG, 2019, accessed: 2023-01-31.
- [34] A. Vahabzadeh, A. Stocco, and A. Mesbah, “Fine-grained test minimization,” in Proceedings of the 40th International Conference on Software Engineering (ICSE ’18). IEEE, 2018, pp. 210–221.
- [35] JUnit5, “The 5th major version of the programmer-friendly testing framework for java and the jvm,” https://junit.org/junit5/, 2022, accessed: 2022-11-10.
- [36] Major, “The Major mutation framework: easy and scalable mutation analysis for java,” https://mutation-testing.org/, 2022, accessed: 2023-08-01.
- [37] Cloc, “cloc counts blank lines, comment lines, and physical lines of source code in many programming languages,” https://github.com/AlDanial/cloc, 2022, accessed: 2022-11-10.
- [38] Surefire, “Maven surefire plugin,” https://maven.apache.org/surefire/maven-surefire-plugin/, 2022, accessed: 2022-11-10.
- [39] M. J. Harrold, J. A. Jones, T. Li, D. Liang, A. Orso, M. Pennings, S. Sinha, S. A. Spoon, and A. Gujarathi, “Regression test selection for java software,” ACM Sigplan Notices, vol. 36, no. 11, pp. 312–326, 2001.
- [40] O. Legunsen, F. Hariri, A. Shi, Y. Lu, L. Zhang, and D. Marinov, “An extensive study of static regression test selection in modern software evolution,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE’16). ACM, 2016, pp. 583–594.
- [41] A. Orso, N. Shi, and M. J. Harrold, “Scaling regression testing to large software systems,” ACM SIGSOFT Software Engineering Notes, vol. 29, no. 6, pp. 241–251, 2004.
- [42] L. A. Marascuilo and R. C. Serlin, Statistical methods for the social and behavioral sciences. WH Freeman/Times Books/Henry Holt & Co, 1988.
- [43] E. Engström, M. Skoglund, and P. Runeson, “Empirical evaluations of regression test selection techniques: a systematic review,” in Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement (ESEM ’08). ACM, 2008, pp. 22–31.
- [44] E. Engström, P. Runeson, and M. Skoglund, “A systematic review on regression test selection techniques,” Information and Software Technology, vol. 52, no. 1, pp. 14–30, 2010.
- [45] S. Biswas, R. Mall, M. Satpathy, and S. Sukumaran, “Regression test selection techniques: A survey,” Informatica, vol. 35, no. 3, pp. 289–321, 2011.
- [46] G. Rothermel and M. J. Harrold, “A safe, efficient regression test selection technique,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 6, no. 2, pp. 173–210, 1997.
- [47] M. Vasic, Z. Parvez, A. Milicevic, and M. Gligoric, “File-level vs. module-level regression test selection for .net,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE’17). ACM, 2017, pp. 848–853.
- [48] A. Celik, M. Vasic, A. Milicevic, and M. Gligoric, “Regression test selection across jvm boundaries,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE’17). ACM, 2017, pp. 809–820.
- [49] O. Legunsen, A. Shi, and D. Marinov, “Starts: Static regression test selection,” in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE ’17). IEEE Press, 2017, pp. 949–954.
- [50] A. Shi, M. Hadzi-Tanovic, L. Zhang, D. Marinov, and O. Legunsen, “Reflection-aware static regression test selection,” Proceedings of the ACM on Programming Languages, vol. 3, no. OOPSLA, pp. 1–29, 2019.
- [51] D. Schuler and A. Zeller, “Assessing oracle quality with checked coverage,” in Proceedings of the 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation (ICST ’11). IEEE Computer Society, 2011, pp. 90–99.
- [52] J. Chen, Y. Bai, D. Hao, L. Zhang, L. Zhang, and B. Xie, “How do assertions impact coverage-based test-suite reduction?” in Proceedings of the 10th IEEE International Conference on Software Testing, Verification and Validation (ICST ’17). IEEE, 2017, pp. 418–423.
- [53] Z. F. Fang and P. Lam, “Identifying test refactoring candidates with assertion fingerprints,” in Proceedings of the Principles and Practices of Programming on The Java Platform (PPPJ ’15). ACM, 2015, pp. 125–137.
- [54] J. Xuan, B. Cornu, M. Martinez, B. Baudry, L. Seinturier, and M. Monperrus, “B-refactoring: Automatic test code refactoring to improve dynamic analysis,” Information and Software Technology, vol. 76, pp. 65–80, 2016.
- [55] J. Xuan and M. Monperrus, “Test case purification for improving fault localization,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT/FSE’14). ACM, 2014, pp. 52–63.
- [56] Urban, “Urban airship java library,” https://github.com/urbanairship/java-library, 2020, accessed: 2022-11-10.
- [57] Lang, “Apache commons lang,” https://github.com/apache/commons-lang, 2018, accessed: 2022-11-10.
- [58] Apache Santuario, “Xml security for java,” https://github.com/apache/santuario-xml-security-java, 2019, accessed: 2022-11-10.
- [59] A. Vahabzadeh, A. M. Fard, and A. Mesbah, “An empirical study of bugs in test code,” in 2015 IEEE international conference on software maintenance and evolution (ICSME). IEEE, 2015, pp. 101–110.