
Testing Autonomous Systems with Believed Equivalence Refinement

Chih-Hong Cheng DENSO AUTOMOTIVE Deutschland GmbH
Eching, Germany
Email: [email protected]
   Rongjie Yan State Key Laboratory of Computer Science, ISCAS
Beijing, China
Email: [email protected]
Abstract

Continuous engineering of autonomous driving functions commonly requires deploying vehicles in road testing to obtain inputs that cause problematic decisions. Although the discovery leads to producing an improved system, it also challenges the foundation of testing using equivalence classes and the associated relative test coverage criterion. In this paper, we propose believed equivalence, where the establishment of an equivalence class is initially based on expert belief and is subject to a set of available test cases having a consistent valuation. Upon a newly encountered test case that breaks the consistency, one may need to refine the established categorization in order to split the originally believed equivalence into two. Finally, we focus on modules implemented using deep neural networks where every category partitions an input over the real domain. We present both analytical and lazy methods to suggest the refinement. The concept is demonstrated in analyzing multiple autonomous driving modules, indicating the potential of our proposed approach.

©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/AITEST52744.2021.00020

1 Introduction

Mainstream approaches for engineering autonomous vehicles follow a paradigm of continuous improvement. It is widely acknowledged in industry that unconsidered engineering issues such as edge cases may appear when validating autonomous vehicles on the road over sufficiently many miles. Upon a newly encountered edge case, the autonomy function will be evaluated against its impact, and a function modification may be performed. While discovering previously unconsidered issues leads to an improvement of the engineered system, it also challenges the assumptions previously established at design and test time.

In this paper, we formally address the impact of newly discovered issues on previously stated test-completeness claims based on equivalence classes and test coverage. Intuitively, equivalence classes are situations where the system demonstrates similar behavior, and test coverage is a relative completeness measure to argue that the testing efforts have covered the entire input space with sufficient diversity. While the concept of equivalence class may be straightforward in theory, to apply it in complex autonomous driving settings, we mathematically formulate a revision called believed equivalence class. The categorization used to build equivalence classes is initially based on expert judgement (e.g., all images having sunny weather and straight roads form a believed equivalence class for front-vehicle detection, as experts think that a front vehicle will be detected) and is subject to the available test cases demonstrating similar behavior. As soon as a newly discovered test case in the same class violates the behavior similarity (e.g., in sunny weather but with the straight road covered in snow, the perception fails to detect a white vehicle in front), there is a need to refine the believed equivalence class. Regarding the impact of refinement on test coverage, while the number of equivalence classes is known to be exponential (aka combinatorial explosion), under the restricted setting of achieving full coverage using $\gamma$-way combinatorial testing [16], the maximum number of additional test cases required to return to full coverage (due to refinement) is bounded by a polynomial of degree $\gamma-1$.

Finally, we look into the testing of functions implemented using deep neural networks (DNNs). When the equivalence class is based on partitioning the input domain over real numbers, guiding the construction of refined equivalence classes can utilize the known approach of adversarial perturbation [20, 15]. This avoids creating a believed equivalence class that only contains a single point. For deep neural networks, we prove that finding an optimal cut can be encoded using mixed-integer linear programming (MILP). Due to the scalability considerations of purely analytical methods, we also utilize the concept of $k$-nearest neighbors among the existing test data to guide the perturbed direction, in order to find feasible cuts that at least consistently separate points along all perturbed directions.

To demonstrate the validity of the approach, we have applied the technique on DNN-based modules for trajectory following and adaptive cruise control in autonomous driving. Our research prototype continuously refines the believed equivalence upon newly encountered test cases that lead to inconsistency. Altogether our proposal (equivalence classes based on currently available test cases) provides a rigorous framework to understand better the concept of continuous testing and its impact on the completeness of verification of autonomous systems using test coverage.

The rest of the paper is structured as follows. After Section 2 compares our results with existing work, Section 3 formulates the concepts of believed equivalence, coverage, and the refinement of believed equivalence. Section 4 discusses how believed equivalence can be applied when the module under test is a deep neural network. Finally, we summarize our evaluation in Section 5 and conclude in Section 6.

Figure 1: Illustrating believed equivalence

2 Related Work

The concept of equivalence class has been widely used in analyzing complex systems. For example, the region construction used in timed automata [3] establishes equivalence classes over infinitely many states regarding the ordering of clock variables reaching integer values. This finite number of regions serves as a basis for decidability results in verifying properties of timed automata. State-of-the-art guidelines for the safety engineering of autonomous driving systems such as ISO TR 4804 [2] (an early version, aka SaFAD, drafted by multiple automotive manufacturers and suppliers is publicly available at https://www.daimler.com/documents/innovation/other/safety-first-for-automated-driving.pdf) admit the need for continuous engineering; ISO TR 4804 also explicitly mentions equivalence class as a method to avoid excessive testing efforts. Nevertheless, the concept is not mathematically defined there and lacks formal guarantees compared to the classical equivalence class concept. Finally, there are proposals such as micro-ODD [11] in the safety engineering domain to fine-grain the allowed driving conditions. Still, these proposals lack a formal treatment in terms of equivalence classes. All the above results provide strong motivation for our work: our formulated believed equivalence class only demonstrates equivalent behavior on the currently available test cases and may be subject to refinement in a continuous engineering paradigm.

For testing DNN-based autonomous systems, researchers have proposed structural coverage metrics to achieve a certain level of “relative completeness” [17, 14, 19]. Still, recent results have indicated their limited applicability [13]. Utilizing scenario description languages [9, 1], one can naturally partition the input space for testing [5, 12], but there is no guarantee that all inputs within a partition demonstrate similar behavior. Therefore, one may only possess a weaker guarantee (equivalence with respect to currently available test cases), as reflected by our proposal.

Finally, for deep neural networks, our construction for refining the believed equivalence integrates the idea of local robustness (see [10, 18, 7] for recent surveys) as used in deep learning. Due to the scalability issues of exact analytical methods such as MILP, it is realistic to generate a believed equivalence class that contains a local robustness region within a reasonable time budget; one can then perform refinement upon newly encountered test cases. Our adoption of $k$-nearest neighbors serves this purpose.

3 Believed Equivalence and Refinement

3.1 Believed Equivalence

Let $f:\mathcal{D}_{in}\rightarrow\mathcal{D}_{out}$ be the function under analysis, where $\mathcal{D}_{in}$ and $\mathcal{D}_{out}$ are the input and output domains respectively. Let $evl:\mathcal{D}_{in}\times\mathcal{D}_{out}\rightarrow\mathcal{D}_{evl}$ be an evaluation function that assigns to an input-output pair a value in the evaluation domain $\mathcal{D}_{evl}$. Let $\mathcal{C}=\{C_{1},\ldots,C_{m}\}$ be the set of categories, where for $i\in\{1,\ldots,m\}$, each category $C_{i}=\{c_{i,1},\ldots,c_{i,|C_{i}|}\}$ is a set of elements of size $|C_{i}|$. In our examples, we may write $C_{i}.c_{i,j}$ to clearly indicate an element and its category. Each element $c_{i,j}:\mathcal{D}_{in}\rightarrow\{\textsf{false},\textsf{true}\}$ evaluates any input in the domain to true or false. We assume that for any input $\textsf{in}\in\mathcal{D}_{in}$ and for every $i$, there exists exactly one $j$ ($\exists!j\in\{1,\ldots,|C_{i}|\}$) such that $c_{i,j}(\textsf{in})=\textsf{true}$, i.e., for each category, every input is associated with exactly one element.

In the following, we define believed equivalence over the test set. Intuitively, for any two data points in the test set, so long as their corresponding element information is the same, demonstrating believed equivalence requires that the results of the evaluation function are also the same.

Definition 1

Given a set $\mathcal{D}_{test}\subset\mathcal{D}_{in}$ of test cases and categories $\mathcal{C}$, function $f$ demonstrates believed equivalence over $\mathcal{C}$ under the set $\mathcal{D}_{test}$ and evaluation function $evl$, denoted as $f\models^{evl}_{\mathcal{D}_{test}}\mathcal{C}$, if for any two test cases $\textsf{in}_{1},\textsf{in}_{2}\in\mathcal{D}_{test}$ the following condition holds:

$$(\forall i\in\{1,\ldots,m\},\,j\in\{1,\ldots,|C_{i}|\}:c_{i,j}(\textsf{in}_{1})=c_{i,j}(\textsf{in}_{2}))\rightarrow evl(\textsf{in}_{1},f(\textsf{in}_{1}))=evl(\textsf{in}_{2},f(\textsf{in}_{2}))$$
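Over a finite test set, the condition of Definition 1 can be checked mechanically. The following is a minimal Python sketch (not from the paper): categories are modeled as lists of boolean predicates, and `signature` returns, per category, the index of the unique element that holds on an input.

```python
from itertools import combinations

def signature(categories, x):
    # For each category C_i, the index j of the unique element c_{i,j} with c_{i,j}(x) = true
    return tuple(next(j for j, c in enumerate(cat) if c(x)) for cat in categories)

def believed_equivalence(f, evl, categories, tests):
    # Definition 1: any two test cases with identical category signatures
    # must receive identical evaluations.
    for x1, x2 in combinations(tests, 2):
        if signature(categories, x1) == signature(categories, x2):
            if evl(x1, f(x1)) != evl(x2, f(x2)):
                return False
    return True
```

For instance, with a single category splitting a one-dimensional input at $0$ and $evl$ simply returning the function output, a function that changes value inside the class $x\leq 0$ breaks believed equivalence, while one that only changes value at the class boundary preserves it.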

Consider the example illustrated in Figure 1. Taking $\mathcal{D}_{in}$ to be all possible pixel values of an image from the vehicle’s mounted camera, we build a highly simplified input categorization in the autonomous driving perception setup. Let $\mathcal{C}=\{\textsf{Time},\textsf{Front-Vehicle}\}$, where $\textsf{Time}=\{\textsf{day},\textsf{night}\}$ and $\textsf{Front-Vehicle}=\{\textsf{exist},\textsf{not-exist}\}$ indicates whether, in the associated ground truth, there exists a vehicle in front. Consider $f$ to be a DNN that performs front-vehicle detection. The first output of $f$, denoted $f_{1}$, generates $1$ to indicate “DETECTED” (a front vehicle is detected, shown in Figure 1 with dashed rectangles) and $0$ for “UNDETECTED”. Based on the input image, let $evl$ be the function that checks whether the detection result made by the DNN is consistent with the ground truth, i.e., $evl(\textsf{in},f(\textsf{in}))=\textsf{true}$ iff $(\textsf{Front-Vehicle.exist}(\textsf{in})=\textsf{true}\leftrightarrow f_{1}(\textsf{in})=1)$.

As illustrated in Figure 1, let $\mathcal{D}_{test}=\{\textsf{in}_{1},\ldots,\textsf{in}_{9}\}$. For $\textsf{in}_{1}$ and $\textsf{in}_{2}$, one can observe from Figure 1 that $evl(\textsf{in}_{1},f(\textsf{in}_{1}))=evl(\textsf{in}_{2},f(\textsf{in}_{2}))$. Then $f$ demonstrates believed equivalence over $\mathcal{C}$ under $\mathcal{D}_{test}$, provided that the following conditions also hold.

  • $evl(\textsf{in}_{3},f(\textsf{in}_{3}))=evl(\textsf{in}_{4},f(\textsf{in}_{4}))$,

  • $evl(\textsf{in}_{5},f(\textsf{in}_{5}))=evl(\textsf{in}_{6},f(\textsf{in}_{6}))=evl(\textsf{in}_{7},f(\textsf{in}_{7}))$,

  • $evl(\textsf{in}_{8},f(\textsf{in}_{8}))=evl(\textsf{in}_{9},f(\textsf{in}_{9}))$.

(Remark) It is possible to choose the definition of $evl$ such that demonstrating believed equivalence is independent of the computed value $f(\textsf{in})$, thereby creating a partitioning of the input space and its associated properties (i.e., labels). For example, consider the function $evl(\textsf{in},f(\textsf{in}))=(v_{1},\ldots,v_{m})$ where $v_{i}=j$ iff $c_{i,j}(\textsf{in})=\textsf{true}$. This evaluation function assigns to every input its associated element index for each category, thereby being independent of $f$.

3.2 Coverage based on Combinatorial Testing

Test coverage measures the degree to which a system is tested by the provided test set. In the following lemma, we associate our formulation of believed equivalence with the combinatorial explosion problem of achieving full test coverage. Precisely, the size of a test set that demonstrates believed equivalence while completely covering all combinations specified by the categories is exponential in the number of categories.

Lemma 1

Let $S=\min\{|C_{1}|,\ldots,|C_{m}|\}$. Provided that $f\models^{evl}_{\mathcal{D}_{test}}\mathcal{C}$, the minimum size of $\mathcal{D}_{test}$ to satisfy the condition below is larger than or equal to $S^{m}$.

$$\forall j_{1}\in\{1,\ldots,|C_{1}|\},\ldots,j_{m}\in\{1,\ldots,|C_{m}|\}:\exists\,\textsf{in}\in\mathcal{D}_{test}:\bigwedge^{m}_{i=1}c_{i,j_{i}}(\textsf{in})=\textsf{true} \quad (1)$$
Proof 1

Recall from Definition 1 that $f\models^{evl}_{\mathcal{D}_{test}}\mathcal{C}$ requires that for any two test cases $\textsf{in}_{1},\textsf{in}_{2}\in\mathcal{D}_{test}$ where $\forall i,j:c_{i,j}(\textsf{in}_{1})=c_{i,j}(\textsf{in}_{2})$, $evl(\textsf{in}_{1},f(\textsf{in}_{1}))$ equals $evl(\textsf{in}_{2},f(\textsf{in}_{2}))$. Therefore, for Equation 1, even if there exists only one element $\textsf{in}\in\mathcal{D}_{test}$ for each element combination $\bigwedge^{m}_{i=1}c_{i,j_{i}}(\textsf{in})=\textsf{true}$, Definition 1 is still satisfied. The number of all possible element combinations is $|C_{1}|\times\ldots\times|C_{m}|\geq S^{m}$. One needs to cover each element combination with one test case, and a test case satisfying one combination cannot satisfy another, due to our previously stated assumption that $\forall i\,\exists!j\in\{1,\ldots,|C_{i}|\}:c_{i,j}(\textsf{in})=\textsf{true}$. Therefore the lemma holds. ∎

As the number of categories can easily reach $50$, even when each category has only $2$ elements, achieving full coverage using the condition in Eq. 1 requires at least $2^{50}$ different test cases; this is unrealistic in most applications. Therefore, we consider asserting full coverage using $\gamma$-way combinatorial testing [16], which only requires the set of test cases to cover, for every pair ($\gamma=2$) or triple ($\gamma=3$) of categories, all possible combinations. The following lemma states that, provided $\gamma$ is a constant, the number of test cases needed to achieve full coverage can be bounded by a polynomial with $\gamma$ as its highest degree.

Lemma 2 (Coverage using $\gamma$-way combinatorial testing)

Let $S=\max\{|C_{1}|,\ldots,|C_{m}|\}$. Provided that $f\models^{evl}_{\mathcal{D}_{test}}\mathcal{C}$, the minimum size of $\mathcal{D}_{test}$ to satisfy $\gamma$-way combinatorial testing, characterized by the condition below, is bounded by ${m\choose\gamma}S^{\gamma}$.

$$\forall a_{1},a_{2},\ldots,a_{\gamma}\in\{1,\ldots,m\}:\forall j_{a_{1}}\in\{1,\ldots,|C_{a_{1}}|\},\ldots,j_{a_{\gamma}}\in\{1,\ldots,|C_{a_{\gamma}}|\}:\exists\,\textsf{in}\in\mathcal{D}_{test}:\bigwedge^{\gamma}_{i=1}c_{a_{i},j_{a_{i}}}(\textsf{in})=\textsf{true} \quad (2)$$
Proof 2

To satisfy Equation 2, the first universal quantifier creates ${m\choose\gamma}$ choices, where for each choice the second universal quantifier generates at most $S^{\gamma}$ combinations. Therefore, the total number of combinations induced by the two layers of universal quantification is at most ${m\choose\gamma}S^{\gamma}$, and a careful selection of test cases in $\mathcal{D}_{test}$, with each combination covered by exactly one test case, bounds the size of $\mathcal{D}_{test}$ by ${m\choose\gamma}S^{\gamma}$. Finally, similar to Lemma 1, covering each combination with only one test case suffices to satisfy $f\models^{evl}_{\mathcal{D}_{test}}\mathcal{C}$. Therefore the lemma holds. ∎
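The contrast between the exhaustive bound of Lemma 1 and the $\gamma$-way bound of Lemma 2 is easy to compute. The following sketch (not from the paper) evaluates both for the $50$-binary-category example above:

```python
from math import comb, prod

def exhaustive_bound(sizes):
    # Lemma 1: covering every element combination needs |C_1| * ... * |C_m| tests
    return prod(sizes)

def gamma_way_bound(sizes, gamma):
    # Lemma 2: gamma-way coverage is achievable with at most C(m, gamma) * S^gamma
    # test cases, where S = max{|C_1|, ..., |C_m|}
    return comb(len(sizes), gamma) * max(sizes) ** gamma

sizes = [2] * 50                 # 50 binary categories, as in the text
exhaustive_bound(sizes)          # 2**50, infeasible
gamma_way_bound(sizes, 2)        # 1225 * 4 = 4900 test cases suffice
```

The drop from $2^{50}$ to a few thousand test cases is what makes $\gamma$-way coverage practical when $\gamma$ is a small constant.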

Figure 2: Believed equivalence integrated in 2-way combinatorial testing; the color of each circle indicates the computed value of the evaluation function $evl$.

Figure 2 illustrates an example of coverage with 2-way combinatorial testing, where $\mathcal{C}=\{C_{1},C_{2},C_{3}\}$ and each $C_{i}$ ($i\in\{1,2,3\}$) has elements $c_{i,1}$ and $c_{i,2}$. Let $\mathcal{D}_{test}=\{\textsf{in}_{1},\ldots,\textsf{in}_{6}\}$ (i.e., excluding $\textsf{in}_{7}$ and $\textsf{in}_{8}$). Apart from demonstrating believed equivalence, $\mathcal{D}_{test}$ also satisfies the condition specified in Lemma 2. Arbitrarily picking $2$ categories among $\{C_{1},C_{2},C_{3}\}$, for instance $C_{1}$ and $C_{2}$, all combinations are covered by elements in $\mathcal{D}_{test}$ ($c_{1,1}c_{2,1}\mapsto\textsf{in}_{5},\textsf{in}_{6}$; $c_{1,1}c_{2,2}\mapsto\textsf{in}_{3}$; $c_{1,2}c_{2,1}\mapsto\textsf{in}_{4}$; $c_{1,2}c_{2,2}\mapsto\textsf{in}_{1},\textsf{in}_{2}$).
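Checking the condition of Eq. 2 over a concrete test set amounts to verifying that every $\gamma$-tuple of categories has all of its element combinations hit. A small sketch (not from the paper; categories again modeled as predicate lists):

```python
from itertools import combinations
from math import prod

def signature(categories, x):
    # Index of the unique true element per category (cf. Definition 1)
    return tuple(next(j for j, c in enumerate(cat) if c(x)) for cat in categories)

def gamma_way_covered(categories, tests, gamma):
    # Eq. 2: for every choice of gamma categories, every element combination
    # must be hit by at least one test case.
    sigs = [signature(categories, x) for x in tests]
    for idx in combinations(range(len(categories)), gamma):
        hit = {tuple(s[i] for i in idx) for s in sigs}
        if len(hit) < prod(len(categories[i]) for i in idx):
            return False
    return True

# Three binary categories over bit-vectors; four tests form a 2-way covering array
cats = [[lambda x, k=k: x[k] == 0, lambda x, k=k: x[k] == 1] for k in range(3)]
tests = [(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
gamma_way_covered(cats, tests, 2)   # True: every pair of categories fully covered
gamma_way_covered(cats, tests, 3)   # False: only 4 of the 8 triples are hit
```

The example also shows why $\gamma$-way coverage is cheap: four test cases cover all pairs of three binary categories, whereas exhaustive coverage would need eight.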

3.3 Encountering $\mathcal{D}_{in}\setminus\mathcal{D}_{test}$

As the definition of believed equivalence is based on the test set $\mathcal{D}_{test}$, upon encountering new test cases outside the test set, i.e., in $\mathcal{D}_{in}\setminus\mathcal{D}_{test}$, the function under analysis may not demonstrate consistent behavior. We first present a formal definition of consistency.

Definition 2

A test case $\textsf{in'}\in\mathcal{D}_{in}\setminus\mathcal{D}_{test}$ is consistent with the believed equivalence $f\models^{evl}_{\mathcal{D}_{test}}\mathcal{C}$ when the following condition holds for all test cases $\textsf{in}\in\mathcal{D}_{test}$:

$$(\forall i,j:c_{i,j}(\textsf{in})=c_{i,j}(\textsf{in'}))\rightarrow evl(\textsf{in},f(\textsf{in}))=evl(\textsf{in'},f(\textsf{in'}))$$

Otherwise, $\textsf{in'}\in\mathcal{D}_{in}\setminus\mathcal{D}_{test}$ is inconsistent with the believed equivalence $f\models^{evl}_{\mathcal{D}_{test}}\mathcal{C}$.
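Definition 2 is again directly executable over a finite test set. A minimal sketch (not from the paper), reusing the predicate-list modeling of categories:

```python
def signature(categories, x):
    # Index of the unique true element per category (cf. Definition 1)
    return tuple(next(j for j, c in enumerate(cat) if c(x)) for cat in categories)

def is_consistent(f, evl, categories, tests, new_in):
    # Definition 2: new_in is consistent iff every existing test case sharing
    # its category signature receives the same evaluation. If no test case
    # shares the signature, the condition holds vacuously.
    sig = signature(categories, new_in)
    v = evl(new_in, f(new_in))
    return all(evl(x, f(x)) == v
               for x in tests if signature(categories, x) == sig)
```

A new test case falling into an empty element combination (such as $\textsf{in}_{8}$ in Figure 2) is vacuously consistent, while one that disagrees with an existing test case in its class (such as $\textsf{in}_{7}$) is flagged as inconsistent.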

The following result shows that, given a newly encountered consistent test case, one can add it to the existing test set while still demonstrating believed equivalence.

Lemma 3

If a test case $\textsf{in'}\in\mathcal{D}_{in}\setminus\mathcal{D}_{test}$ is consistent with the believed equivalence $f\models^{evl}_{\mathcal{D}_{test}}\mathcal{C}$, then $f\models^{evl}_{\mathcal{D}_{test}\cup\{\textsf{in'}\}}\mathcal{C}$.

Proof 3

A direct implication of Definitions 1 and 2. ∎

Consider again the example in Figure 2. Provided that $\mathcal{D}_{test}=\{\textsf{in}_{1},\ldots,\textsf{in}_{6}\}$, one can observe that $\textsf{in}_{8}$ is consistent with $f\models^{evl}_{\mathcal{D}_{test}}\mathcal{C}$ (as for $c_{1,1}c_{2,2}c_{3,1}$, there exists no test case in $\mathcal{D}_{test}$), while $\textsf{in}_{7}$ is inconsistent (the evaluated result for $\textsf{in}_{7}$ is not the same as for $\textsf{in}_{1}$ and $\textsf{in}_{2}$).

When encountering a new test case $\textsf{in'}\in\mathcal{D}_{in}\setminus\mathcal{D}_{test}$ that generates inconsistency, there are two methods to resolve the inconsistency, depending on whether $f$ is implemented correctly.

  • (When $f$ is wrong: function modification) Function modification refers to changing $f$ to $f^{\prime}$ such that the evaluation over $\textsf{in'}$ becomes consistent; it is used when $f$ is engineered wrongly. For the example illustrated in Figure 2, a modification of $f$ shall ensure that $\textsf{in}_{1}$, $\textsf{in}_{2}$, and $\textsf{in}_{7}$ have the same evaluated value (shown using the same color).

  • (When $f$ is correct: refining the believed equivalence class) When $f$ is implemented correctly, the inconsistency implies that our believed equivalence, based on the currently defined categories $\mathcal{C}$, is too coarse. Therefore, it is necessary to refine $\mathcal{C}$ such that the created believed equivalence separates the newly introduced test case from the others. To ease understanding, consider again the example in Figure 1. When one encounters an image where the front vehicle is a white trailer truck, and under heavy snow, we envision that the front vehicle may go undetected. Therefore, the currently maintained categories shall be further modified to

    • differentiate images by snowy conditions (preferably via proposing a new category), and

    • differentiate images having a white front vehicle from those having a non-white front vehicle (preferably via refining an existing element in a category).

In the next section, we mathematically define the concept of refinement to cope with the above two types of changes in categories.

3.4 Believed Equivalence Refinement

Definition 3

Given $\mathcal{C}$ and $\mathcal{D}_{test}$, a refinement-by-cut $RC(\mathcal{C},i,j)$ generates a new set of categories by first removing $c_{i,j}$ from $C_{i}$, followed by adding two elements $c_{i,j,1}$ and $c_{i,j,2}$ to $C_{i}$, where the following conditions hold.

  • For any test case $\textsf{in}\in\mathcal{D}_{test}$ such that $c_{i,j}(\textsf{in})=\textsf{true}$, exclusively either $c_{i,j,1}(\textsf{in})=\textsf{true}$ or $c_{i,j,2}(\textsf{in})=\textsf{true}$.

  • For any test case $\textsf{in}\in\mathcal{D}_{test}$ such that $c_{i,j}(\textsf{in})=\textsf{false}$, $c_{i,j,1}(\textsf{in})=\textsf{false}$ and $c_{i,j,2}(\textsf{in})=\textsf{false}$.

To ease understanding, Figure 3 shows the visualization of two such cuts (note that the cuts using planes in Figure 3 are only for illustration purposes; the condition specified in Definition 3 allows cuts of arbitrary shapes). The cut on the left refines $c_{2,2}$ into $c_{2,2,1}$ and $c_{2,2,2}$, and the cut on the right refines $c_{1,2}$ into $c_{1,2,1}$ and $c_{1,2,2}$. One can observe that the left cut enables a refined believed equivalence, as it separates $\textsf{in}_{7}$ from $\textsf{in}_{1}$ and $\textsf{in}_{2}$. This brings us to the definition of effective refinement cuts.

Definition 4

Given $f$, $\mathcal{C}=\{C_{1},\ldots,C_{m}\}$, $\mathcal{D}_{test}$ where $f\models^{evl}_{\mathcal{D}_{test}}\mathcal{C}$, and a new test case $\textsf{in'}\in\mathcal{D}_{in}\setminus\mathcal{D}_{test}$, a refinement-by-cut $RC(\mathcal{C},i,j)$ is effective over $\textsf{in'}$ if $f\models^{evl}_{\mathcal{D}_{test}\cup\{\textsf{in'}\}}RC(\mathcal{C},i,j)$.

Figure 3: Effective (left) and ineffective (right) cuts

Apart from cuts, the second type of refinement is to add another category.

Definition 5

Given $\mathcal{C}=\{C_{1},\ldots,C_{m}\}$, $\mathcal{D}_{test}$, and a new test case $\textsf{in'}\in\mathcal{D}_{in}\setminus\mathcal{D}_{test}$, a refinement-by-expansion $RE(\mathcal{C})$ generates a new set of categories by adding $C_{m+1}$ to $\mathcal{C}$, where $C_{m+1}=\{c_{m+1,1},c_{m+1,2}\}$ with elements satisfying the following conditions.

  • For any test case $\textsf{in}\in\mathcal{D}_{test}$, $c_{m+1,1}(\textsf{in})=\textsf{true}$ and $c_{m+1,2}(\textsf{in})=\textsf{false}$.

  • $c_{m+1,1}(\textsf{in'})=\textsf{false}$ and $c_{m+1,2}(\textsf{in'})=\textsf{true}$.
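Definition 5 only constrains the new elements on $\mathcal{D}_{test}$ and $\textsf{in'}$; one simple instantiation (a sketch, not from the paper) makes the second element hold exactly on the new test case and the first hold everywhere else:

```python
def refinement_by_expansion(categories, new_in):
    # Definition 5 (one instantiation): add C_{m+1} = {c_{m+1,1}, c_{m+1,2}}
    # where c_{m+1,2} holds exactly on the new test case and c_{m+1,1} holds
    # on every other input -- in particular on every existing test case.
    key = tuple(new_in)
    c1 = lambda x: tuple(x) != key
    c2 = lambda x: tuple(x) == key
    return categories + [[c1, c2]]
```

Since the new category places $\textsf{in'}$ in an equivalence class of its own, the inconsistency with existing test cases disappears by construction, which is exactly the content of Lemma 4 below.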

In contrast to refinement cuts, whose effectiveness depends on how the cuts are defined, the following lemma states that refinement-by-expansion is effective by definition.

Lemma 4

Given $f$, $\mathcal{C}=\{C_{1},\ldots,C_{m}\}$, $\mathcal{D}_{test}$ where $f\models^{evl}_{\mathcal{D}_{test}}\mathcal{C}$, and a new test case $\textsf{in'}\in\mathcal{D}_{in}\setminus\mathcal{D}_{test}$, a refinement-by-expansion $RE(\mathcal{C})$ is effective over $\textsf{in'}$, i.e., $f\models^{evl}_{\mathcal{D}_{test}\cup\{\textsf{in'}\}}RE(\mathcal{C})$.

Proof 4
  1. By introducing the new category $C_{m+1}$ following the conditions specified in Definition 5, consider $\textsf{in}_{1},\textsf{in}_{2}\in\mathcal{D}_{test}$ where $\forall i\in\{1,\ldots,m\},j\in\{1,\ldots,|C_{i}|\}:c_{i,j}(\textsf{in}_{1})=c_{i,j}(\textsf{in}_{2})$. As $c_{m+1,1}(\textsf{in}_{1})=c_{m+1,1}(\textsf{in}_{2})=\textsf{true}$ and $c_{m+1,2}(\textsf{in}_{1})=c_{m+1,2}(\textsf{in}_{2})=\textsf{false}$, we can expand the first universal quantifier such that $\forall i\in\{1,\ldots,m+1\},j\in\{1,\ldots,|C_{i}|\}:c_{i,j}(\textsf{in}_{1})=c_{i,j}(\textsf{in}_{2})$ also holds. As $evl(\textsf{in}_{1},f(\textsf{in}_{1}))=evl(\textsf{in}_{2},f(\textsf{in}_{2}))$, we have $f\models^{evl}_{\mathcal{D}_{test}}RE(\mathcal{C})$.

  2. For any test case $\textsf{in}\in\mathcal{D}_{test}$, we have $c_{m+1,1}(\textsf{in})\neq c_{m+1,1}(\textsf{in'})$. Therefore, the inconsistency condition specified in Definition 2 does not hold, i.e., $\textsf{in'}$ is consistent with $f\models^{evl}_{\mathcal{D}_{test}}RE(\mathcal{C})$. By applying Lemma 3 we have $f\models^{evl}_{\mathcal{D}_{test}\cup\{\textsf{in'}\}}RE(\mathcal{C})$. ∎

(Coverage under Refinement) Although introducing refinement over categories can make $\textsf{in'}$ consistent with $\mathcal{D}_{test}$, it may also invalidate a previously stated $100\%$ coverage claim (using $\gamma$-way combinatorial testing). An example can be found on the left of Figure 3, where full coverage under 2-way combinatorial testing no longer holds due to the cut. By analyzing the category pair $C_{2}$ and $C_{3}$, one realizes that achieving $100\%$ coverage requires an additional test case $\hat{\textsf{in}}$ where $c_{2,2,1}(\hat{\textsf{in}})=\textsf{true}$ and $c_{3,2}(\hat{\textsf{in}})=\textsf{true}$.

For $\gamma$-way combinatorial testing, in the worst case, the newly introduced element ($c_{m+1,2}$ in the refinement-by-expansion case; $c_{i,j,1}$ or $c_{i,j,2}$ in the refinement-by-cut case) will only be covered by the newly encountered test case $\textsf{in'}\not\in\mathcal{D}_{test}$. As one element has been fixed, the number of additional test cases to be introduced to recover full coverage is conservatively bounded by ${m\choose\gamma-1}S^{\gamma}$, where $S=\max\{|C_{1}|,\ldots,|C_{m}|\}$. When $\gamma$ is a constant, this polynomial is one degree lower than the one stated in Lemma 2.
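The recovery bound can be computed the same way as the coverage bounds (a sketch, not from the paper):

```python
from math import comb

def recovery_bound(sizes, gamma):
    # Worst-case number of additional test cases to restore full gamma-way
    # coverage after a refinement, with one element fixed to the new one:
    # C(m, gamma-1) * S^gamma, where S = max{|C_1|, ..., |C_m|}
    return comb(len(sizes), gamma - 1) * max(sizes) ** gamma

recovery_bound([2] * 50, 2)   # 50 * 4 = 200 additional test cases
```

For the 50-binary-category example, restoring 2-way coverage after a refinement costs at most $200$ additional test cases, compared to $\binom{50}{2}\cdot 4=4900$ for establishing coverage from scratch.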

Finally, we omit the technical formulation, but as refinement operates on the set of categories, it is possible to define a hierarchy of coverage utilizing the refinement concept. As a simple example, consider the element vehicle being refined into $\textsf{vehicle}_{\textsf{car}}$, $\textsf{vehicle}_{\textsf{truck}}$, and $\textsf{vehicle}_{\textsf{others}}$ to reflect different types of vehicles and their different performance profiles. The coverage of vehicle can then be an aggregation over the computed coverage of the three refined elements.

4 Refinement-by-cut for Inputs over Reals

Figure 4: Defining a category $C_{1}=\{c_{1,1},c_{1,2},c_{1,3}\}$ using four values $\alpha_{1,0},\alpha_{1,1},\alpha_{1,2},\alpha_{1,3}$, acting on input variable $x_{1}$
Figure 5: Optimizing from an arbitrary effective cut (a) to a cut distant from all points (b), to a cut with local robustness guarantees (c), to heuristics checking value consistency along the direction to the $3$-nearest neighbors (d)

4.1 Refinement-by-cut as Optimization

Finally, we address the special case where the refinement can be computed analytically. Precisely, we consider the input domain $\mathcal{D}_{in}$ to be $\mathbb{R}^{m}$, i.e., the system takes $m$ inputs $x_{1},\ldots,x_{m}$ over the reals, where $m$ matches the size of the category set. We also consider every category $C_{i}=\{c_{i,1},\ldots,c_{i,i_{k}}\}$ to be characterized by $i_{k}+1$ ascending values $\alpha_{i,0},\alpha_{i,1},\ldots,\alpha_{i,i_{k}}$, such that $C_{i}$ performs a simple partitioning over the $i$-th input, i.e., for $\textsf{in}=(x_{1},\ldots,x_{m})$, $c_{i,j}(\textsf{in})=\textsf{true}$ iff $x_{i}\in(\alpha_{i,j-1},\alpha_{i,j}]$. Figure 4 illustrates the idea. This formulation fits well for analyzing systems such as deep neural networks, where each input is commonly normalized to a value within the interval $[-1,1]$.
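Under this interval-based formulation, mapping an input value to its element index is a binary search over the boundary values. A sketch (not from the paper) using Python's `bisect`:

```python
from bisect import bisect_left

def element_index(alphas, x):
    # Category C_i given by ascending boundaries alphas = [a_0, a_1, ..., a_k]:
    # c_{i,j}(in) is true iff x_i lies in the half-open interval (a_{j-1}, a_j].
    # bisect_left returns the first index j with alphas[j] >= x, which is
    # exactly the element index for x in (a_{j-1}, a_j].
    j = bisect_left(alphas, x)
    if not 1 <= j <= len(alphas) - 1:
        raise ValueError("input outside (a_0, a_k]")
    return j

alphas = [-1.0, 0.0, 0.5, 1.0]   # three elements c_{1,1}, c_{1,2}, c_{1,3}
element_index(alphas, 0.3)       # 2: 0.3 lies in (0.0, 0.5]
element_index(alphas, 1.0)       # 3: 1.0 lies in (0.5, 1.0]
```

Note the half-open convention: a value exactly on a boundary $\alpha_{i,j}$ belongs to element $j$, matching the intervals $(\alpha_{i,j-1},\alpha_{i,j}]$ in the text.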

Using Definition 3, a refinement-by-cut $RC(\mathcal{C},i,j)$ implies finding a value $\beta_{i,j}\in(\alpha_{i,j-1},\alpha_{i,j}]$ such that (1) $c_{i,j,1}=\textsf{true}$ iff $x_i\in(\alpha_{i,j-1},\beta_{i,j}]$ and (2) $c_{i,j,2}=\textsf{true}$ iff $x_i\in(\beta_{i,j},\alpha_{i,j}]$. Provided that $\textsf{in'}=(x'_1,\ldots,x'_m)$ is the newly encountered test case, we present one possible analytical approach for finding such a cut, via solving the following optimization problem:

$\textsf{maximize}\;\;\epsilon$ (3)

subject to:

$[\text{Effective}]\;\; f \models^{evl}_{\mathcal{D}_{test}\cup\{\textsf{in'}\}} RC(\mathcal{C},i,j)$ (4)

$[\text{Distant}]\;\; \forall \textsf{in}=(x_1,\ldots,x_m)\in\mathcal{D}_{test}\cup\{\textsf{in'}\}:\; c_{i,j}(\textsf{in})=\textsf{true} \rightarrow |\beta_{i,j}-x_i|\geq\eta$ (5)

$[\text{Robust}]\;\; \forall \textsf{in}=(x_1,\ldots,x_m)\in\mathbb{R}^m:\; (\forall i\in\{1,\ldots,m\}: |x_i-x'_i|\leq\epsilon) \rightarrow evl(\textsf{in},f(\textsf{in}))=evl(\textsf{in'},f(\textsf{in'}))$ (6)

In the above optimization problem, apart from the constraint in Equation 4, which follows the standard formulation of Definition 4 (illustrated in Figure 5a), we impose two additional optional constraints, Eq. 5 and Eq. 6.

  • First, we wish to regulate the size of the newly created element such that the new believed equivalence class generalizes better, by keeping a minimal distance $\eta$ between every existing data point and the boundary $\beta_{i,j}$ (illustrated in Figure 5b).

  • Secondly, we require that all inputs within the $\epsilon$-ball centered at in' demonstrate consistent behavior, thereby establishing robustness via behavioral equivalence (illustrated in Figure 5c).

The following lemma states the conditions under which the optimization problem can be encoded using mixed-integer linear programming; in the general case, nonlinear programming is required.

Lemma 5

Let $f$ be a ReLU network and $evl$ be a linear function over the input and output of $f$. Provided that $C_i$ performs a simple partitioning over the $i$-th input, and every input $x_i$ is bounded by $(\alpha_{i,0}, \alpha_{i,i_k}]$ (with $\alpha_{i,0} \neq -\infty$ and $\alpha_{i,i_k} \neq \infty$), the optimization problem stated in Equations 3, 4, 5, and 6 can be encoded using mixed-integer linear programming (MILP).

Proof 5

Finding the maximum $\epsilon$ for the constraint system of Equations 3, 4, and 5 can be done trivially by finding the separating plane (illustrated in Figure 5b). Therefore, the key is to independently compute $\epsilon$ over the constraint in Equation 6; the resulting $\epsilon$ is the minimum of the two computed values. If $f$ is a ReLU network with all its inputs bounded by closed intervals, it is well known that $f$ as well as the local robustness condition can be encoded using MILP [6, 8, 4] due to the piecewise-linear nature of ReLU nodes. As long as $evl()$ is a linear function, the encoding remains expressible in MILP. ∎
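For completeness, the commonly used big-M style encoding of a single ReLU node $y = \max(0, z)$, with pre-activation bounds $z \in [l, u]$ ($l \leq 0 \leq u$) and an auxiliary binary variable $a$, as used in the MILP encodings of [6, 8], can be written as:

```latex
y \geq z, \qquad y \geq 0, \qquad y \leq z - l\,(1-a), \qquad y \leq u\,a, \qquad a \in \{0,1\}
```

When $a=1$ the constraints force $y=z$ (the node is active); when $a=0$ they force $y=0$. Applying this encoding per node makes the whole network, and hence the robustness constraint in Equation 6, expressible in MILP.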

(Extensions) We omit the technical details here, but it is straightforward to extend the above constraint formulation to (1) integrate multiple cuts within the same element $c_{i,j}$ and (2) allow multiple cuts over multiple input variables.

4.2 Using kk-nearest Neighbors to Generate Robust Refinement Proposals

Finally, as the local robustness condition in Equation 6 is computationally hard to estimate for state-of-the-art neural networks, we consider an alternative proposal that checks robustness along the directions suggested by the $k$-nearest neighbors. The concept is illustrated in Figure 5d, where one needs to find a cut when encountering in'. The method proceeds by searching for its closest neighbors $\textsf{in}_1$, $\textsf{in}_2$ and $\textsf{in}_3$ (i.e., $k=3$) and building three vectors pointing towards these neighbors. It then perturbs in' along the directions suggested by these vectors using a fixed step size $\Delta$, until the evaluated value is no longer consistent (in Figure 5d, all rectangular points have an evaluated value consistent with in'). Finally, it decides the cut ($\beta_{i,j}$) by taking the minimum distance reflected by the perturbations along these three directions.
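A minimal sketch of this lazy proposal, under simplifying assumptions (points stored as plain Python lists, a Boolean-valued evl, and hypothetical helper names), might look as follows:

```python
import math

def propose_cut(new_pt, data, evl, i, k=3, delta=0.05, max_steps=100):
    """Lazy cut proposal: walk from new_pt toward each of its k nearest
    neighbors in fixed steps of size delta until evl flips; place the cut
    at the i-coordinate of the closest flip (None if no flip is found)."""
    neighbors = sorted(data, key=lambda p: math.dist(p, new_pt))[:k]
    target = evl(new_pt)
    flips = []  # (steps walked, i-coordinate of first inconsistent probe)
    for nb in neighbors:
        d = math.dist(nb, new_pt)
        if d == 0:
            continue  # neighbor coincides with new_pt; no direction
        unit = [(b - a) / d for a, b in zip(new_pt, nb)]
        for step in range(1, max_steps + 1):
            probe = [a + step * delta * u for a, u in zip(new_pt, unit)]
            if evl(probe) != target:
                flips.append((step, probe[i]))
                break
    if not flips:
        return None
    return min(flips)[1]  # closest flip wins

# Toy example in 2-D: the evaluation flips when x0 crosses 0.5
evl = lambda p: p[0] <= 0.5
data = [[0.8, 0.0], [0.9, 0.1], [1.0, 0.0]]
beta = propose_cut([0.2, 0.0], data, evl, i=0, delta=0.1)
```

Here the cut lands at the $i$-coordinate of the first inconsistent probe along the closest flipping direction; the caller would still check the distant constraint of Equation 5 before accepting the proposal.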

The generated cut may be further subject to refinement (i.e., newly encountered test cases may lead to further inconsistency), but it can be viewed as a lazy approach in which (1) we take only finitely many directions (in contrast to local robustness, which considers all directions) and (2) we take a fixed step size (in contrast to local robustness, which allows an arbitrary step size). By doing so, the granularity of refinement shrinks on-the-fly as new-yet-inconsistent test cases are encountered.

Algorithm 1 Building an initial believed equivalence class with lazy methods
1:  Input: (1) DNN model $f$ taking inputs $x_1,\ldots,x_m$, with each input variable $x_i$ bounded by $(l_{x_i}, u_{x_i}]$, (2) the collected data set $\mathcal{D}_{test}$, (3) evaluation function $evl()$, and (4) $\eta$ for the minimum required distance.
2:  Output: Categorization $\mathcal{C}$ that is guaranteed to maintain believed equivalence $f \models^{evl}_{\mathcal{D}_{test}} \mathcal{C}$, or a warning when it is impossible to find a cut with interval size larger than $\eta$
3:  𝒞:=initializeCategoryUsingVariableBound(x1,,xm)\mathcal{C}:=\textrm{initializeCategoryUsingVariableBound}(x_{1},\ldots,x_{m})
4:  let 𝒟exp:=\mathcal{D}_{exp}:=\emptyset
5:  for in𝒟test\textsf{in}\in\mathcal{D}_{test} do
6:     for in’𝒟exp\textsf{in'}\in\mathcal{D}_{exp} do
7:        if $(\forall i,j: c_{i,j}(\textsf{in}) = c_{i,j}(\textsf{in'})) \wedge (evl(\textsf{in}, f(\textsf{in})) \neq evl(\textsf{in'}, f(\textsf{in'})))$ then
8:           𝒟neighbor=getKNearestNeighbor(in,𝒟exp,f)\mathcal{D}_{neighbor}=\textrm{getKNearestNeighbor}(\textsf{in},\mathcal{D}_{exp},f)
9:           let replaced := false
10:           for i{1,,m}i\in\{1,\ldots,m\} do
11:              Ci:=cutByRefinement(in,Ci,𝒟neighbor,f)C_{i}^{\prime}:=\textrm{cutByRefinement}(\textsf{in},C_{i},\mathcal{D}_{neighbor},f)
12:              if (isRefinementIntervalLarger(Ci,Ci,η))\textrm{isRefinementIntervalLarger}(C_{i},C_{i}^{\prime},\eta)) then
13:                 Ci:=CiC_{i}:=C_{i}^{\prime}
14:                 replaced := true
15:                 break
16:              end if
17:           end for
18:           if replaced =false=\textsf{false} then
19:              return warning
20:           end if
21:        end if
22:     end for
23:     𝒟exp=𝒟exp{in}\mathcal{D}_{exp}=\mathcal{D}_{exp}\cup\{{\small\textsf{{in}}}\}
24:  end for
25:  return 𝒞{\mathcal{C}}

Integrating the above mindset, we present in Algorithm 1 lazy cut refinement for constructing believed equivalence from scratch. Line 3 builds an initial categorization for each variable using the corresponding bound. We use the set $\mathcal{D}_{exp}$ to record all data points that have been considered in building the believed equivalence; $\mathcal{D}_{exp}$ is initially empty (line 4). Every new data point in is compared against all data points in $\mathcal{D}_{exp}$; if there exists $\textsf{in'} \in \mathcal{D}_{exp}$ such that the believed equivalence is violated (line 7), the algorithm finds the nearest neighbors in $\mathcal{D}_{exp}$ (line 8) and picks one dimension on which to perform the cut by refinement (lines 10-11, mimicking the process in Figure 5d). Whenever the generated cut proposal $C_i'$ satisfies the distant constraint specified in Equation 5 (line 12), the original $C_i$ is replaced by $C_i'$. Otherwise (lines 18-20), no proposed cut satisfies the distant constraint, so the algorithm raises a warning (line 19). Finally, the data point in is inserted into $\mathcal{D}_{exp}$, recording that it has been considered. The presentation of the algorithm is simplified for ease of understanding; in an implementation one can employ multiple optimizations (e.g., memory hashing to avoid running the inner loop at line 6 entirely) to speed up the construction process.
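The violation check on line 7 of Algorithm 1 can be sketched as follows (hypothetical names; each category is assumed to be stored as its ascending cut values, as in Section 4):

```python
from bisect import bisect_left

def same_cell(categories, a, b):
    """True iff a and b fall into the same element c_{i,j} of every
    category, where categories[i] is the ascending cut list for x_i."""
    return all(bisect_left(cuts, a[i]) == bisect_left(cuts, b[i])
               for i, cuts in enumerate(categories))

def violates(categories, evl, a, b):
    """Believed equivalence is violated when two inputs share a cell
    in every category but are evaluated differently (line 7)."""
    return same_cell(categories, a, b) and evl(a) != evl(b)

# Two inputs: the first category has two elements, the second has one
cats = [[-1.0, 0.0, 1.0], [-1.0, 1.0]]
evl = lambda p: p[0] + p[1] > 0.5
violates(cats, evl, [0.2, 0.1], [0.3, 0.4])  # same cell, evl differs
```

A hit from this check is exactly what triggers the cut-by-refinement step on lines 10-11.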

5 Implementation and Evaluation

We have implemented the concept of believed equivalence and the associated refinement pipeline in Python, focusing on DNN-based modules. The tool searches for feasible cuts upon new test cases that lead to inconsistency. As the goal is to demonstrate full automation of the technique, we applied the description in Section 4 such that each category matches exactly one input variable; cuts can therefore be applied to the domain of any input variable. In our implementation, we maintain a fixed ordering over the categories/variables to be cut whenever refinement is triggered. For the evaluation on automated driving using public examples, we use the lane keeping assist (LKA) module and the adaptive cruise control (ACC) module made publicly available by Mathworks (details are available at https://www.mathworks.com/help/reinforcement-learning/ug/imitate-mpc-controller-for-lane-keeping-assist.html and https://ch.mathworks.com/help/mpc/ug/adaptive-cruise-control-using-model-predictive-controller.html, respectively) to demonstrate the process of refinement on their believed equivalences.

The LKA subsystem takes six inputs, including lateral velocity (LV), yaw angle rate (YAR), lateral deviation (LD), relative yaw angle (RYA), previous steering angle (PSA), and measured disturbance (MD), to decide the steering angle of the vehicle. We construct a DNN implementing the lane keeping assist. The DNN contains three fully connected hidden layers and adopts ReLU activation functions. The model is trained with the dataset provided by Mathworks, and the accuracy reaches 99.8%. The output of the DNN ranges from $-1.04$ to $1.04$, meaning that the steering angle lies between $-60$ and $60$ degrees. We divide the range of the output into five classes in our evaluation function, indicating that the output (steering angle) can be one of the following classes: strong left, left turn, straight, right turn, or strong right. Initially, each category has one element set to the range of the corresponding input. The categories are gradually refined using cuts when inconsistent test cases appear.
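An evaluation function of this kind can be sketched as a simple binning of the DNN output. The uniform class boundaries below are an illustrative assumption, not necessarily the ones used in the experiments:

```python
def evl_steering(y, lo=-1.04, hi=1.04):
    """Map the network output y in [lo, hi] to one of five steering classes."""
    labels = ["strong left", "left turn", "straight", "right turn", "strong right"]
    width = (hi - lo) / len(labels)
    # clamp so that y == hi falls into the last bin
    idx = min(int((y - lo) / width), len(labels) - 1)
    return labels[idx]

evl_steering(0.0)    # a value in the middle of the range maps to 'straight'
evl_steering(-1.04)  # the lower bound maps to 'strong left'
```

Two test cases in the same believed-equivalence cell are then inconsistent exactly when this function assigns them different labels.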

# test cases LV YAR LD RYA PSA MD
0 1 1 1 1 1 1
1000 28 14 13 5 2 1
5000 29 16 13 10 4 2
10000 31 17 13 11 6 2
20000 32 19 14 12 9 3
30000 35 20 15 13 10 3
40000 39 21 18 13 12 3
50000 40 21 18 13 14 3
60000 41 22 18 14 14 3
70000 41 22 18 15 15 3
80000 41 22 18 15 16 3
TABLE I: Refinement on believed equivalence on the LKA module

Table I summarizes the trend in the number of cuts applied in each category, subject to the number of test cases encountered. After the first 1000 test cases, LV has been refined into 28 intervals, while YAR has been refined into 14 intervals. The search for feasible cuts proceeds from the first category to the last. From the table, we observe that after the number of explored test cases exceeds 40000, the number of cuts increases slowly. The reason is that the test cases and their evaluated values are diverse in the beginning, so the categories are frequently updated; with an increasing number of explored test cases, however, most new test cases are consistent with existing ones.

Note that the number of cuts in every category is affected by the order in which cuts are applied. As the search always starts from the first category (LV), the number of cuts for LV is larger than those of the others. When we reverse the search sequence and start from the last category (i.e., starting the cut from MD), the numbers of cuts after the first 1000 test cases become 1, 3, 9, 12, 16, and 3, respectively. Although the number of cuts varies with the search sequence, the number of cuts for the last category (MD) is more stable, because the number of different valuations of MD in the dataset is small.

The ACC module takes five inputs, including driver-set velocity ($V_{set}$), time gap ($T_{gap}$), velocity of the ego car ($V_{ego}$), relative distance to the lead car ($D_{rel}$), and relative velocity to the lead car ($V_{rel}$), to decide the acceleration of the ego car. The first two inputs are assigned fixed valuations. By setting the range of initial velocities and positions for the lead and ego cars, we obtain a set of inputs and outputs from the ACC module using model predictive control. We then construct a DNN implementing the module. The structure of the DNN is similar to that of LKA. The model is trained with the obtained set, and the accuracy reaches 98.0%. The output of the DNN ranges from $-3$ to $2$, due to the physical limitations of the vehicle dynamics. We again divide the range of the output into five classes, indicating that the acceleration can be strong braking, slow braking, braking, slow accelerating, or strong accelerating. The categories of inputs are gradually refined when inconsistent test cases appear.

# test cases $V_{ego}$ $D_{rel}$ $V_{rel}$
0 1 1 1
1000 221 27 1
2000 230 34 1
3000 261 43 3
4000 262 53 3
5000 284 55 4
6000 285 57 4
7000 289 58 5
8000 290 62 5
9000 305 67 5
10000 307 70 5
TABLE II: Refinement on believed equivalence on the ACC module

Table II lists the trend in the number of cuts applied in the categories of the last three inputs (the first two are fixed), subject to the number of encountered test cases. The search for feasible cuts proceeds from the first to the last. From the table, we observe that the number of cuts for the velocity of the ego car increases rapidly while the number of explored test cases is below 1000, whereas the number of cuts for the last input increases slowly. The reason may be that the velocity of the ego car and the relative distance between the ego and lead cars are the main factors in deciding the acceleration of the ego car.

The refined categories show that the granularity of most cuts is less than one for the three inputs. For input $V_{rel}$, the cut granularity is larger when the velocity of the lead car exceeds that of the ego car. We conjecture that the model is more sensitive when the velocities of the lead and ego cars are similar.

6 Concluding Remarks

In this paper, we considered the practical problem of using equivalence classes in testing autonomous vehicles. Starting with a set of categorizations, we realized that any combination is only an equivalence class "candidate" and may be further subject to refinement. We formulated this idea using believed equivalence, established its connection with coverage criteria based on combinatorial testing, and discussed analytical solutions for finding a refinement via optimization. As guided by the newly discovered test cases, the refinement offers traceability on why such a believed equivalence is formed.

Currently, our initial evaluation is conducted using publicly available autonomous driving functions. We plan to integrate this technique in one of our research projects on automated valet parking to refine the process further and bring the concept into standardization bodies.

(Acknowledgement) This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 956123.

References

  • [1] Measurable Scenario Description Language (M-SDL). https://www.foretellix.com/open-language/, 2018.
  • [2] ISO/TR 4804:2020 Road vehicles — Safety and cybersecurity for automated driving systems — Design, verification and validation. https://www.iso.org/standard/80363.html, 2020.
  • [3] R. Alur. Timed automata. In International Conference on Computer Aided Verification (CAV), pages 8–22. Springer, 1999.
  • [4] R. Anderson, J. Huchette, W. Ma, C. Tjandraatmadja, and J. P. Vielma. Strong mixed-integer programming formulations for trained neural networks. Mathematical Programming, pages 1–37, 2020.
  • [5] C.-H. Cheng, C.-H. Huang, and H. Yasuoka. Quantitative projection coverage for testing ML-enabled autonomous systems. In International Symposium on Automated Technology for Verification and Analysis (ATVA), pages 126–142. Springer, 2018.
  • [6] C.-H. Cheng, G. Nührenberg, and H. Ruess. Maximum resilience of artificial neural networks. In International Symposium on Automated Technology for Verification and Analysis (ATVA), pages 251–268. Springer, 2017.
  • [7] T. Dreossi, S. Ghosh, A. Sangiovanni-Vincentelli, and S. A. Seshia. A formalization of robustness for deep neural networks. arXiv preprint arXiv:1903.10033, 2019.
  • [8] M. Fischetti and J. Jo. Deep neural networks and mixed integer linear optimization. Constraints, 23(3):296–309, 2018.
  • [9] D. J. Fremont, T. Dreossi, S. Ghosh, X. Yue, A. L. Sangiovanni-Vincentelli, and S. A. Seshia. Scenic: a language for scenario specification and scene generation. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 63–78, 2019.
  • [10] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu. Safety verification of deep neural networks. In International Conference on Computer Aided Verification (CAV), pages 3–29. Springer, 2017.
  • [11] P. Koopman, B. Osyk, and J. Weast. Autonomous vehicles meet the physical world: RSS, variability, uncertainty, and proving safety. In International Conference on Computer Safety, Reliability, and Security (SAFECOMP), pages 245–253. Springer, 2019.
  • [12] Y. Li, J. Tao, and F. Wotawa. Ontology-based test generation for automated and autonomous driving functions. Information and Software Technology, 117:106200, 2020.
  • [13] Z. Li, X. Ma, C. Xu, and C. Cao. Structural coverage criteria for neural networks could be misleading. In IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), pages 89–92. IEEE, 2019.
  • [14] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, et al. Deepgauge: Multi-granularity testing criteria for deep learning systems. In ACM/IEEE International Conference on Automated Software Engineering (ASE), pages 120–131. ACM, 2018.
  • [15] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In IEEE conference on computer vision and pattern recognition (CVPR), pages 1765–1773, 2017.
  • [16] C. Nie and H. Leung. A survey of combinatorial testing. ACM Computing Surveys (CSUR), 43(2):1–29, 2011.
  • [17] K. Pei, Y. Cao, J. Yang, and S. Jana. Deepxplore: Automated whitebox testing of deep learning systems. In ACM Symposium on Operating Systems Principles (SOSP), pages 1–18, 2017.
  • [18] S. A. Seshia, A. Desai, T. Dreossi, D. J. Fremont, S. Ghosh, E. Kim, S. Shivakumar, M. Vazquez-Chanlatte, and X. Yue. Formal specification for deep neural networks. In International Symposium on Automated Technology for Verification and Analysis (AVTA), pages 20–34. Springer, 2018.
  • [19] Y. Sun, M. Wu, W. Ruan, X. Huang, M. Kwiatkowska, and D. Kroening. Concolic testing for deep neural networks. In ACM/IEEE International Conference on Automated Software Engineering (ASE), pages 109–119. ACM, 2018.
  • [20] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.