
Intel Israel Development Center, Haifa, Israel
Email: [email protected]

Accelerating System-Level Debug Using Rule Learning and Subgroup Discovery Techniques

Zurab Khasidashvili
Abstract

We propose a root-causing procedure for accelerating system-level debug using rule-based techniques. We describe the procedure and how it provides high quality debug hints for reducing the debug effort. This includes heuristics for engineering features from logs of many tests, and the data analytics techniques for generating powerful debug hints. As a case study, we used these techniques for root-causing failures of the Power Management design feature Package-C8 and showed their effectiveness. Furthermore, we propose an approach for mining the root-causing experience and results for reuse, to accelerate future debug activities and reduce dependency on validation experts. We believe that these techniques are beneficial also for other validation activities at different levels of abstraction, for complex hardware, software and firmware systems, both pre-silicon and post-silicon.

Keywords:
System-level Debug · System-level Validation · Validation Knowledge Mining · Root Cause Analysis · Machine Learning · Rule Learning · Subgroup Discovery · Feature Range Analysis.

1 Introduction

An estimated 50-60% of validation effort is spent on debugging. Debugging at the system level represents a great challenge due to the enormous number of involved software, firmware and hardware components, domains, flows, protocols, design features, interfaces, etc. The current practice is to follow the traces of design failures as they cross the system domains and to allocate the right experts to assist in debugging the sub-flows associated with these domains, starting from the sub-flows where mismatches from the expected behavior are first observed. Another challenge is that there is no established way to learn from the root-causing of a failure and mine that learning for reuse in future root-causing activities, or to automatically root-cause related failure scenarios in the future.

For such distributed debug activities, the terms debugging and root-causing might mean different things to different validators and in different validation contexts. If a bug (or an unexpected behavior or outcome of a test) originates from module B, then for a validator responsible for module A, root-causing might mean understanding whether the bug is related to module A and, when there is sufficient evidence that this is not the case, contacting a validator of another module where the bug might potentially hide. For the validator of module B, root-causing might mean understanding and confirming that the bug originates from the hardware or software code of module B. Furthermore, in post-silicon validation, the bug or unexpected behavior can be caused by physical phenomena such as manufacturing defects or manufacturing process variations; in that case the validator of module B might require help from a physical-design or manufacturing expert to root-cause the unexpected behavior. Root-causing in such a case might mean tracing the failure down to a physical device that exposes a rare or a systematic weakness under some operating conditions (temperature, voltage, power, or other). The failure in that case might not be associated with a line of hardware, software, or firmware code, but instead with phenomena that are currently very difficult or impossible to control.

We propose to use Subgroup Discovery (SD) [17, 31, 30, 1] and Rule Learning (RL) [5, 9] techniques for automating the root-causing activity and generating valuable debug hints based on data collected from the logs of many tests. The aim of SD is to identify “interesting” subgroups of data that can be described as feature-range combinations; the aim of RL is to build classification models using rules with feature-range pairs (as in SD) as the antecedent and the predicted class as the consequent. SD and RL do not operate on trace logs directly, and an important part of our approach is how to engineer features from logs to capture information that might potentially be useful for root-causing. The generated debug hints give the system-level validator some level of independence to debug and narrow down the debug space, and in this way reduce the dependency on the experts. This also relieves the experts from the burden of massive debug support and helps them accelerate their own debugging. The suggested method also provides a platform for sharing the knowledge of previous debug activities. We demonstrate our method on an important and representative PkgC8 flow in the Power Management (PM) protocol implemented in Intel processors [13].

Specifically, we apply Feature Range Analysis, or Range Analysis (RA) for short [15, 16], on data engineered from trace logs, for root-causing failing executions. When the data labels are binary, the aim of RA is to isolate subgroups of the positive samples from the negative ones. In the context of this work, positive samples represent failing flow executions and negative samples represent passing ones. RA is closely related both to SD and to RL, and represents a way of unifying the two approaches: like SD, it aims at identifying feature-range pairs that isolate subgroups of the data with respect to quality functions such as Lift and WRAcc (see Section 5.1); in addition, it uses Feature Selection [10] algorithms aiming at selecting both features highly correlated with the class labels and features giving a good coverage of the variation in the data [7]. In RA these high quality feature-range pairs are represented as binary features; they can be used as antecedents of prediction rules as in RL, as well as engineered features alongside the original features for training classification models; such models can outperform models built from the original features only [16].

In the iterative procedure of root-causing tree construction proposed in Section 5, RA is used in each iteration to identify finer subgroups of failures such that all failures in a subgroup have the same root cause and different subgroups represent different root causes. In the validation domain, the RL and SD techniques have been used for improving test coverage and eliminating redundancy [4, 14], but to the best of our knowledge they have not been used for automating the root-causing of test failures.

The paper is organized as follows. In Section 2, we describe the basics of the Power Management flow. In Section 3, we define events, flows, and requirements from traces. In Section 4, we describe how we engineer features from trace logs and the dataset used for inferring root-causing hints. In Section 5, we describe our ML approach. Experimental results are presented in Section 6. In Section 7, we explain how our work is different from related work on trace based root-causing. We conclude by discussing future work in Section 8.

2 Power Management Flows

Power Management (PM) is a classic topic for system-level protocol verification. The global PM flows of a Client CPU involve interactions of various internal and external components controlling the power entities across the system, such as Voltage Regulators, Power Gates, Clocks, etc.

The Package C-states (PkgC) are a set of design features for putting the CPU into various levels of sleep states, PkgC8 being one of the deeper PkgC states. The user-visible information on PM in Intel processors is described in [13], Chapter 1.4. The processor supports multiple PkgC states: C0, C2, C3, C6-C10. State PkgC0 is the normal operation state of the processor – no power saving requirements are imposed. From the PkgC0 state the processor can only transition to state PkgC2 (which is not a power-saving state), from which it can then transition to any of the power-saving PkgC states C3, C6-C10: the higher the number of the power-saving state, the higher the power saving in that state.

The processor remains in the PkgC0 state if at least one of the IA Cores is in state C0 or if the platform has not yet granted the processor the permission to go into power-saving states. In general, a PkgC state request is determined by the lowest state of the processor IA Cores. A PkgC state is automatically resolved by the processor depending on the idle power states of the processor Cores and the status of other platform components.

Figure 1: PkgC8 entry and exit flows.

Figure 1 depicts the PkgC8 entry and exit flows in a nutshell; it provides the basic information needed to illustrate the method presented in this paper (familiarity with the details of the PM flow is not required). This description of the PkgC states does not apply to any specific generation of Intel products and may change in future generations; it should therefore be treated as an example. Once all Cores go into a sleep state, the PkgC entry flow is triggered, starting with PkgC2. If conditions allow going deeper, the processor initiates a Last Level Cache (LLC) flush (reaching PkgC2R). The following stages set the Core, Graphics and Ring voltages to 0v and turn off the main CPU clock (Clk). The PkgC8 wake-up flow sets the voltages back and ungates various components of the CPU internal domains.

3 Representation of Events, Traces and Flows

We assume a set of observable variables tracked along the execution of a test. Variables originating from hardware are often referred to as signals. In a validation environment, an event usually refers to a value change of an observable variable. In this work, an event is a pair (variable, value_range) for numeric observable variables or a pair (variable, level_subset) for categorical observable variables, interpreted as the conditions variable ∈ value_range and variable ∈ level_subset, respectively. In the case of singleton sets value_range = {v} and level_subset = {l}, we write these conditions as variable = v and variable = l. For example, a binary clock signal Clk is treated as an (unordered) categorical variable; two events are associated with it: CLK_OFF ≡ (Clk = 0) and CLK_ON ≡ (Clk = 1), which are fired when Clk changes value from 1 to 0 or from 0 to 1, respectively. These two events cannot occur at the same time but can occur along the same execution. As another example, one way to encode a message Msg with two possible status values {snd, rcv} is to treat it as a categorical variable with these two levels and associate with it two events MSG_SND ≡ (Msg = snd) and MSG_RCV ≡ (Msg = rcv) that are fired when the Msg status changes to sent and received, respectively.
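As a minimal illustration of this event encoding, the sketch below fires an event on a value change into the event's level subset. The class and helper names are our own illustrative assumptions, not taken from any particular validation tool.

```python
# A minimal sketch of the event encoding above; names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    variable: str
    levels: frozenset  # level subset; a numeric range would be an interval test

    def fires(self, prev_value, new_value) -> bool:
        # An event fires when the variable changes value and the new value
        # falls inside the event's level subset (or value range).
        return prev_value != new_value and new_value in self.levels

CLK_OFF = Event("Clk", frozenset({0}))  # CLK_OFF == (Clk = 0)
CLK_ON  = Event("Clk", frozenset({1}))  # CLK_ON  == (Clk = 1)

assert CLK_OFF.fires(1, 0) and not CLK_OFF.fires(0, 0)
```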

By considering value ranges and level subsets, we allow for an abstraction that ignores some of the changes in variable values. For ordered categorical variables, grouping levels that are adjacent with respect to the order into a single level is a meaningful way of abstraction. For numeric variables, there are many clever ways of discretization into intervals [18], both supervised and unsupervised. Because the right discretization for root-causing is difficult to estimate upfront, our suggestion is to use discretization techniques where the selected ranges can overlap and have different lengths (including length zero) [15, 16]. The root-causing algorithm then has a wider choice when selecting the most relevant event sets.

We assume that a trace (or trace log) is a collection of event occurrences, each associated with the following attributes:

  1. (Time Stamp) The cycle or time stamp of the event occurrence. Usually, for the purpose of validation or root-causing, the absolute values of time stamps are not relevant and are therefore abstracted away; what is important is that the time information captures the sequential vs. concurrent ordering of events along a trace. In other contexts, the precise timing information can be critical for detecting anomalous behavior (e.g., [29]).

  2. (Module ID) The module from which the event originates. This allows one to zoom into the execution of a particular module that contributes to the events of interest, if debugging so far suggests that the root cause might be found in that module.

  3. (Instance ID) An index uniquely determining the instance of the module from which the event occurrence originates, if multiple instances of that module can be executed, possibly concurrently, along an execution.

Finally, we assume that a flow is succinctly represented as a sequence of events that must occur along the trace of a passing execution (if that flow was triggered). Events that occur in the description of a flow are called flow-events. In different languages, flows are specified in different forms, such as regular expressions over events or automata. Work [8] introduces a very rich visual formalism to specify concurrent multi-agent flows, with well-defined formal semantics. In the Specman [12] environment that we are using, reference models can be used to specify flows. The flow formalisms and validation environments define and implement the ability to check traces against flows and conclude a pass or fail status of the flow execution. In this work we avoid committing to any particular formalism for flows or introducing a new one.

Many formalisms for analyzing traces of concurrent executions assume that the trace logs satisfy the above requirements (see e.g. [8, 26], Sections V and II, respectively). The logger scripts generating trace logs in the validation and test environments that we have experimented with support these requirements. This allows one to correctly identify the event occurrences associated with a flow instance with instance ID i along the trace, the trace fragment associated with that flow instance, and the corresponding pass or fail labels. This trace fragment starts with the occurrence of the first event in the flow with ID i, and ends with the occurrence of the last event in the flow with ID i in the case of a successful execution of the flow; otherwise it ends at the end of the trace, or earlier, as soon as criteria allowing to ignore the rest of the trace (if any) are met. Here, a successful execution of a flow instance is defined as one where occurrences of all events in the flow with ID i are matched within the trace in the right order; in that case the associated label is “pass”, and otherwise the label is “fail”.
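The sketch below illustrates, under the trace assumptions above, how a trace fragment can be extracted for one flow instance and labeled pass or fail. The record layout (time, module, instance, event) is an illustrative assumption; real logger formats differ, and for brevity the fragment is not truncated at the last matched flow-event.

```python
# A sketch of per-instance fragment extraction and labeling; the record
# layout (time, module, instance, event) is an assumption for illustration.
def fragment_and_label(trace, flow_events, inst):
    # Keep only the event occurrences of the given flow instance.
    fragment = [rec for rec in trace if rec[2] == inst]
    # The label is "pass" iff the flow's event sequence is matched, in
    # order, as a subsequence of the fragment's events.
    occurrences = iter(rec[3] for rec in fragment)
    matched = all(ev in occurrences for ev in flow_events)
    return fragment, ("pass" if matched else "fail")
```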

It is important to note that the events we use in root-causing are not only the flow-events of the flow whose instances we want to root-cause: most root-causing approaches try to identify anomalous sub-sequences of a flow, while we also want to explain what goes wrong before, in-between, and after the expected flow-events. Our approach to engineering features from traces, described in the next section, is performed per flow instance and in most cases focuses on event occurrences in the trace fragment associated with it. Note that the above definition of the trace fragment associated with a flow instance, while simple, convenient and sufficient from the validator’s point of view for the PM flow root-causing goals, can in general be very restrictive: the trace prefix leading to the trace fragment associated with the flow instance of interest might contain information critical for root-causing. Similarly, the assumption that events occurring after the associated trace fragment cannot impact or explain failing executions need not always be valid; for example, when execution depends on estimating probabilities of future events based on trends in the data observed so far, or when the effect of an anomalous execution can only be observed late in the execution. In general, taking into account event occurrences from the entire trace might be beneficial for root-causing, and this can be achieved by appropriately adapting the definition of the trace fragment associated with a flow instance and/or by using the full trace during feature engineering.

4 Feature Engineering

As a preprocessing step on the trace logs, for each test in the regression suite and each instance of the flow of interest in it, we extract the corresponding trace fragment (and treat it as an independent, full input trace) and associate a pass/fail label with that instance, as described in Section 3. Each such instance of the flow and the corresponding trace fragment define a row (a sample) in the dataset that we build as input to the ML algorithms.

For each event e, not necessarily a flow-event, we generate two features (a sketch of the occurrence, count, and histogram features follows the list below):

  • A binary 1/0 feature Occurs(e) indicating whether event e occurs along the execution of that flow instance.

  • An integer feature Count(e) counting the number of occurrences of event e in the flow instance.

Consider an observable variable v with which events E(v) = {e_vi | i ∈ I} are associated in a flow instance or entire trace, where e_vi = (v, r_i) and {r_i | i ∈ I} is a discretization of the full range of v into non-overlapping ranges (bins). Then the counts Count(e_vi) associated with the events e_vi ∈ E(v) define a histogram H(v) with bins {r_i}. The distribution D(v) approximated by this histogram gives a global view of the behavior of variable v along the trace, and its accuracy is only limited by the granularity of the bins r_i into which the full range of v was discretized. We call H(v) and D(v) the histogram and the distribution associated with v, respectively.

  • We then generate descriptive statistics for that histogram – min, max, mean, median, std (standard deviation), skewness, kurtosis, IQR (interquartile range), and percentiles – and add them as new features to the dataset.

  • We also generate the best distribution fit to the histogram from a list of potentially relevant continuous or discrete distributions, depending on the type of v (uniform, normal, lognormal, Weibull, exponential, gamma, etc.), and add the corresponding categorical feature to the dataset. The coefficients of the best fit for each tried distribution can optionally be added to the dataset as new features as well.
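The sketch below computes the occurrence, count, and histogram-statistics features per flow instance. All names are illustrative assumptions; for the histogram statistics we compute descriptive statistics directly on the observed values of the variable, which corresponds to the finest bin granularity discussed above.

```python
# A sketch of the Occurs/Count and histogram-statistics features; names
# are illustrative, not from the paper's tooling.
import numpy as np
from scipy import stats

def occurrence_and_count_features(event_log, all_events):
    """event_log: list of event names observed in one flow instance."""
    feats = {}
    for e in all_events:
        c = event_log.count(e)
        feats[f"Count({e})"] = c
        feats[f"Occurs({e})"] = int(c > 0)   # Occurs(e) = 0 iff Count(e) = 0
    return feats

def histogram_features(var, values):
    """Descriptive statistics of the distribution associated with `var`."""
    v = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    return {f"{var}.min": v.min(), f"{var}.max": v.max(),
            f"{var}.mean": v.mean(), f"{var}.median": med,
            f"{var}.std": v.std(ddof=1),
            f"{var}.skewness": stats.skew(v),
            f"{var}.kurtosis": stats.kurtosis(v),
            f"{var}.iqr": q3 - q1}
```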

In addition, to encode sequential information among the flow-events, for each pair of distinct flow-events (e1, e2) such that e1 comes before (or, optionally, comes immediately before) e2 in the flow, we can generate the following features (sketched after the list):

  • A binary feature INO(e1, e2), assigned 1 in a flow instance if and only if there is a pair of occurrences e1′ and e2′ of e1 and e2 such that e2′ occurs after e1′ in that flow instance.

  • A binary feature OOO(e1, e2), assigned 1 in a flow instance if and only if there is a pair of occurrences e1′ and e2′ of e1 and e2 such that e2′ occurs before e1′ in that flow instance.
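A minimal sketch of these two definitions, assuming the occurrence times of e1 and e2 within one flow instance are available as lists:

```python
# INO/OOO as defined above; t1 and t2 are the occurrence-time lists of
# flow-events e1 and e2 within one flow instance.
def INO(t1, t2):
    # 1 iff some occurrence of e2 happens after some occurrence of e1.
    return int(bool(t1) and bool(t2) and max(t2) > min(t1))

def OOO(t1, t2):
    # 1 iff some occurrence of e2 happens before some occurrence of e1.
    return int(bool(t1) and bool(t2) and min(t2) < max(t1))
```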

The event occurrence feature Occurs(e) can be inferred from the event count feature Count(e), since Count(e) = 0 iff Occurs(e) = 0, but the inclusion of occurrence features can still help ML algorithms improve root-causing accuracy. In fact, in our experience, occurrence features alone are in most cases enough for root-causing PkgC8 flow failures, and we will use them in illustrating examples. Recall that we associated the events CLK_OFF ≡ (Clk = 0) and CLK_ON ≡ (Clk = 1) with the clock signal Clk; now with each of them we associate a binary occurrence feature, which we again denote by CLK_OFF and CLK_ON, respectively.

The count features are more useful for performance analysis, pre- or post-silicon, and for performance monitoring during in-field operation, say for detecting anomalous activities such as security attacks that have specific performance profiles. There is some conceptual similarity between the count features and the profile features used in [11] in the context of predicting predefined bug types. Since events capture changes in variable values (with some abstraction) and not the duration of the asserted values, the count features capture well signal value switching activities in hardware (which is very important for power analysis), while profiling features are not limited to value switching statistics. On the other hand, the profiling features are limited to only a small fraction of the observable variables for which the count features are defined.

There is some redundancy between the in-order features INO(e1, e2) and out-of-order features OOO(e1, e2) when they are used in conjunction with the occurrence features. For example, INO(e1, e2) can be inferred from Occurs(e1) and Occurs(e2) when e1 or e2 do not occur in the flow instance, or when they occur exactly once in the wrong order, which is also captured by OOO(e1, e2).

In general, in-order and out-of-order features can be constructed from more than two flow-event occurrences, say characterizing important sub-flows or stages of the flow, and such features were used in our case studies. In addition, any known correct or wrong order between any sequence of events (not necessarily flow-events) can be added as a feature.

To encode timing information for a subset of events, for each event e, and for each pair of distinct flow-events (e1, e2) such that e1 comes before e2 in the flow, one can generate the following (see the sketch after this list):

  • The set of elapsed times Timing(e) associated with the occurrences of e along the full trace (that is, in all instances of the flow) or along an instance of the flow. This set of elapsed times, viewed as a distribution, is used to generate distribution features in the same way as for the histograms computed from the counts of events associated with a variable, described earlier. Elapsed times from occurrences of e until the end of the full trace or of the trace fragment associated with a flow instance can also be considered.

  • The set of elapsed times Timing(e1, e2) between the occurrences of e1 and e2 in a flow instance or in the entire trace. This set of elapsed times, viewed as a distribution, is used to generate distribution features in the same way as described above.
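The sketch below computes both kinds of elapsed-time sets; time stamps come from attribute 1 of Section 3, and positional matching of occurrences (justified by instance IDs) is an illustrative simplification.

```python
# A sketch of the timing features; positional matching of occurrences is
# an assumption for illustration.
import numpy as np

def timing_single(event_times, flow_start):
    # Elapsed times from the start of the flow instance to each occurrence
    # of e; later summarized with the same descriptive statistics as the
    # count histograms.
    return np.asarray(event_times) - flow_start

def timing_pair(times_e1, times_e2):
    # Elapsed times between occurrences of e1 and e2, matched positionally;
    # a negative value means e1 occurred after e2.
    return [t2 - t1 for t1, t2 in zip(times_e1, times_e2)]
```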

Specifically for Power Management flows, the expected or actual elapsed times spent in a sleep state, and the latencies associated with entering or exiting it, are among the most relevant timing information for root-causing. One can model these directly, or as elapsed times between the events notifying the start and end of the modeled time intervals: say, the latency of entering a sleep state can be modeled as the elapsed time between starting and completing the transition to that sleep state. We note that ordering information between pairs of event occurrences (e1, e2) can be derived from the elapsed times Timing(e1, e2): negative elapsed times indicate that e1 occurs after e2 (which can be in accordance with or against the expected order).

Work [2] considers event ordering features for encoding sequential information in another application domain (a social study), and in addition encodes timing information, such as the age of individuals at the time of occurrence of age-related events. In our use case, this corresponds to associating with each event the elapsed time from the beginning of the flow, extracted from the time stamps attached to event occurrences in the logs. In an experimental evaluation comparing the usefulness of timing, sequential and other information, the authors conclude that the sequential features are the most useful ones for the accuracy of binary classification models in their study. That work does not propose viewing timing information as histograms and analyzing the associated distribution statistics or other properties.

We note that when the third requirement on traces considered in Section 3 – tracking instance IDs for event occurrences – is not satisfied, sequential and timing features involving two events are more difficult to construct accurately. Say, when there are multiple sent and received messages, one requires a mechanism to correctly match a sent message with the corresponding received message; otherwise the event ordering and timing features between two event occurrences might not be constructed for the right pairs of event occurrences, and root-causing might become less accurate (though, as mentioned above, in many cases event occurrence information, and also timing information for single events, can compensate for missing event ordering information). Features derived through histograms and distributions for observable variables as described above will still be highly useful.

Useful features are not limited to those associated with variables in the system’s software and firmware and with logic signals in the hardware. For root-causing post-silicon test data, the test equipment and environmental effects can also be considered in the feature engineering process, in addition to design or manufacturing features associated with the integration of a die into packages and boards, and die usage conditions. Sensor data collecting temperature, frequency and voltage information is often very important for root-causing, and this is true for PM in particular.

5 The ML Approach

The aim of our method is to explain the differences between the passing tests and subgroups of failing tests in terms of the information available in the trace logs. The identified subgroups of failures must be such that all failures in a subgroup have the same root cause, and failures in different subgroups represent different root causes. Our method does not require a full understanding of all legal behaviors of the system. It closely follows the intuition behind the debugging steps that validators follow in root-causing regression failures. We describe how this method works on root-causing PM PkgC8 flow regression test failures.

5.1 Feature Range Analysis

Range Analysis (RA) [15, 16] is an algorithm resembling Rule Learning (RL) [5, 9] and Subgroup Discovery (SD) [17, 31, 1]. From the algorithmic perspective, the main distinguishing feature of RA is that it heavily employs Feature Selection (FS) [10] in two basic building blocks of the algorithm: the ranking and basis procedures. The third basic building block in RA, the procedure called quality, is the technique borrowed from RL and SD, where the selection of rules or subgroups is done solely based on one or more quality functions. As a consequence, RA can directly benefit from advances in SD, RL, and FS algorithms. These three procedures are explained below.

For a numeric feature F, RA defines a single range feature as a pair R ≡ (F, value_range), interpreted as a binary feature R(s) ≡ F(s) ∈ value_range, for any sample s. For a categorical feature F and level l, the corresponding single range feature is defined as R(s) ≡ (F(s) = l). A range feature is a single range feature or a conjunction of single range features, also referred to as range tuples or simply as ranges. A range R covers a sample s iff R(s) evaluates to true; in that case we also say that s is in R or is captured by R.

The main purpose of RA is to identify range features that explain positive samples, or subsets (subgroups) of positive samples with the same root cause. The RA algorithm generates range features that are most relevant for explaining the response, where ‘most relevant’ might mean (a) having a strong correlation or high mutual information with the response, based on one or more correlation measures; (b) explaining part of the variability in the response not explained by the strongest correlating features; or (c) maximizing one or more quality functions, which are usually designed to prefer ranges that separate subgroups of positive samples from negative samples; that is, a range is preferred if a majority, or perhaps a great majority, of positive samples are covered by the range and a great majority of negative samples fall outside the range. Important examples of quality functions include the ones defined below.

By treating a range R as a classifier that predicts the positive class for samples covered by the range and the negative class for samples outside the range, one can apply the well-known quality metrics for binary classification to characterize the quality of ranges. Below we list some well-known and often used quality metrics for binary classification (a computational sketch follows the list), where TP(R), FP(R), TN(R) and FN(R) denote the true positive, false positive, true negative and false negative counts, respectively, n(R) denotes the count of samples within range R, Pos denotes the positive sample count, and N denotes the total sample count. In this context one can interpret TP(R) as the count of positive samples covered by the range R, FP(R) as the count of negative samples covered by the range, TN(R) as the count of negative samples not covered by the range, FN(R) as the count of positive samples not covered by the range, and n(R) as the count of samples predicted by R as positive. The quality concepts defined for classification tasks can then be defined for ranges R as follows:

  • True Positive Rate (or sensitivity, recall, or hit rate): TPR(R) = TP(R) / Pos

  • Positive Predictive Value (or precision): PPV(R) = TP(R) / (TP(R) + FP(R)) = TP(R) / n(R)

  • Lift: Lift(R) = (TP(R) / n(R)) / (Pos / N) = PPV(R) / (Pos / N)

  • Normalized Positive Likelihood Ratio: NPLR(R) = TPR(R) / (TPR(R) + FPR(R)), where FPR(R) = FP(R) / (FP(R) + TN(R))

  • Weighted Relative Accuracy: WRAcc(R) = (n(R) / N) · (TP(R) / n(R) − Pos / N) = (n(R) / N) · (PPV(R) − Pos / N)

  • ROC Accuracy, associated with the Receiver Operating Characteristic (ROC): ROCAcc(R) = TPR(R) − FPR(R)

  • F1-score: F1Score(R) = 2 · PPV(R) · TPR(R) / (PPV(R) + TPR(R))

  • Accuracy: Acc(R) = (TP(R) + TN(R)) / N
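The sketch below computes these quality functions for a range treated as a classifier over a labeled dataset. It assumes the range covers at least one sample and both classes are present, so no denominator is zero; names are illustrative.

```python
# Quality functions for a binary range feature r (1 = sample covered)
# against binary labels y (1 = failing test, i.e., positive sample).
import numpy as np

def range_quality(r, y):
    r, y = np.asarray(r, dtype=bool), np.asarray(y, dtype=bool)
    TP, FP = int((r & y).sum()), int((r & ~y).sum())
    TN, FN = int((~r & ~y).sum()), int((~r & y).sum())
    N, Pos, n = len(y), TP + FN, TP + FP
    TPR, FPR, PPV = TP / Pos, FP / (FP + TN), TP / n
    return {"TPR": TPR,
            "PPV": PPV,
            "Lift": PPV / (Pos / N),
            "NPLR": TPR / (TPR + FPR),
            "WRAcc": (n / N) * (PPV - Pos / N),
            "ROCAcc": TPR - FPR,
            "F1": 2 * PPV * TPR / (PPV + TPR),
            "Acc": (TP + TN) / N}
```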

The RA algorithm works as follows (a simplified end-to-end sketch follows the list):

  1. The RA algorithm first ranks features highly correlated with the response; this can be done using an ensemble Feature Selection [10] procedure. We refer to it as the ranking procedure and denote the number of features it selects by nRanking (this number is controlled by the user). In addition, RA uses the mRMRe [6] algorithm to select a subset of features which both strongly correlate with the response and provide a good coverage of the entire variability in the response; this algorithm selects a subset of features according to the principle of Maximum Relevance and Minimum Redundancy (MRMR) [7]. We refer to this procedure as basis and denote the number of features it selects by nBasis (this number, too, is controlled by the user).

  2. For non-binary categorical features selected using the ranking and basis procedures described above (or, optionally, for all non-binary categorical features), a binary range feature is generated for each level through one-hot encoding. In a similar way, a fresh binary feature is generated for each selected numeric feature and each range constructed through a discretization procedure. These features are called single-range features. Unlike in RL and SD, in RA the candidate ranges can overlap, which helps to significantly improve the quality of the selected ranges. The RA algorithm then applies the ranking and basis procedures to select the nRanking and nBasis most relevant single ranges, respectively; in addition, RA selects single-range features that maximize one or more quality functions – nRanking single-range features per quality function used. We refer to the quality-function based selection of ranges as quality.

  3. For each pair (or, optionally, for a subset of pairs) of selected single-range features associated with different original features, RA generates range-pair features which have value 1 on each sample where both component single-range features have value 1, and value 0 on the remaining samples. RA then applies the ranking, basis and quality procedures to select the nRanking and nBasis most relevant range pairs, respectively.

  4. Similarly, from the selected single ranges and selected range pairs, the RA algorithm builds range triplets, and applies the ranking, basis and quality procedures to select the most relevant ones. Ranges of higher dimensions can be generated and selected in a similar way.
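The following is a deliberately simplified sketch of these four steps, assuming a numeric-coded pandas DataFrame X and a 0/1 response y. Plain correlation stands in for the ensemble ranking and mRMRe/basis procedures, WRAcc is the only quality function, and the bin scheme and names are illustrative; it is a sketch of the idea, not the EVA implementation.

```python
# A simplified RA-style pipeline: single ranges -> pairs, ranked by WRAcc.
import numpy as np
import pandas as pd
from itertools import combinations

def ra_sketch(X: pd.DataFrame, y: pd.Series, n_ranking=10, n_bins=4):
    # Step 1 (ranking): keep the features most correlated with the response.
    ranked = X.corrwith(y).abs().nlargest(n_ranking).index

    # Step 2: one-hot levels of (low-cardinality) categorical features and
    # overlapping quantile bins of numeric features as single-range features.
    singles = {}
    for f in ranked:
        if X[f].nunique() <= 10:                       # treat as categorical
            for lvl in X[f].unique():
                singles[f"{f}=={lvl}"] = (X[f] == lvl).astype(int)
        else:
            qs = X[f].quantile(np.linspace(0, 1, n_bins + 1)).values
            for lo, hi in zip(qs, qs[2:]):             # overlapping bins
                singles[f"{f}in[{lo:.3g},{hi:.3g}]"] = \
                    X[f].between(lo, hi).astype(int)

    def wracc(r):
        n, N, TP, Pos = int(r.sum()), len(y), int((r & y).sum()), int(y.sum())
        return 0.0 if n == 0 else (n / N) * (TP / n - Pos / N)

    # Steps 3-4: conjunctions of selected single ranges (range pairs; the
    # same construction iterates to triplets), kept by the quality function.
    # For brevity we do not filter out pairs built from one original feature.
    pairs = {f"({a}) AND ({b})": singles[a] & singles[b]
             for a, b in combinations(singles, 2)}
    candidates = {**singles, **pairs}
    return sorted(candidates, key=lambda k: wracc(candidates[k]), reverse=True)
```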

RA is implemented in Intel’s auto-ML tool EVA [15, 16]. It has also been used for root-causing and design space exploration in other areas of microprocessor design [23, 24].

5.2 Quality orders

For our root-causing application, the first priority is to find range tuples with high precision – preferably with the maximal precision 1, since such a range tuple provides a clear explanation for the subset of positive samples covered by it (this explanation might or might not be the ‘true’ explanation). The second priority is to find high-precision ranges with high sensitivity (that is, high coverage of positive samples), since this way one can identify root-causing hints valid for many failures, and these failures might be solvable with the same fix – as a subgroup of failures with the same root cause. In addition, range tuples with smaller dimension are generally preferable among ranges with the same precision and coverage, as they give a simpler explanation for the covered positive samples. Therefore, given a dataset with a binary label, that is, given N and Pos, it is convenient to observe that the precision PPV(R) and coverage n(R) uniquely determine TP(R), FP(R), TN(R) and FN(R); thus all the quality functions above are expressible through precision and coverage.

In this work, in addition to quality-function based selection, we suggest using quality orders for prioritizing ranges for selection. In particular, we define Coverage Weighted Precision (CWP), or Weighted Precision for short, as a lexicographic order on pairs (precision, coverage). That is, CWP(R1) > CWP(R2) if and only if PPV(R1) > PPV(R2), or PPV(R1) = PPV(R2) and n(R1) > n(R2). For binary responses, which are the focus of this work, CWP induces the same ordering as Positive Coverage Weighted Precision (PCWP), defined as a lexicographic order on pairs (precision, positive_coverage), where positive_coverage is the count of positive samples covered by the range feature. For numeric responses, there might be multiple suitable ways to define positive samples (see Section 6.2 in [16]), and thus multiple useful ways to define PCWP orderings. While WRAcc prefers the selection of ranges with high precision and high coverage, the ordering induced by WRAcc is not equivalent to weighted precision. For example, consider a dataset with N = 80 samples containing Pos = 20 positive ones, and consider two range tuples R1 and R2, covering TP(R1) = 10 positive and FP(R1) = 0 negative samples, and TP(R2) = 13 positive and FP(R2) = 7 negative samples, respectively. Then PPV(R1) = 1 is significantly higher than PPV(R2) = 0.65, so CWP(R1) > CWP(R2), while WRAcc(R1) = 0.09375 is smaller than WRAcc(R2) = 0.1.
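The CWP order amounts to comparing (precision, coverage) tuples lexicographically; the sketch below reproduces the worked example above (N = 80, Pos = 20) and confirms that CWP and WRAcc disagree on it.

```python
# Coverage Weighted Precision as a lexicographic key, with the example
# from the text (N = 80 samples, Pos = 20 positive).
def cwp_key(tp, fp):
    n = tp + fp
    return (tp / n, n)            # lexicographic (precision, coverage)

def wracc(tp, fp, pos=20, n_total=80):
    n = tp + fp
    return (n / n_total) * (tp / n - pos / n_total)

r1, r2 = (10, 0), (13, 7)         # (TP, FP) of the two range tuples
assert cwp_key(*r1) > cwp_key(*r2)            # CWP prefers R1: precision 1
assert round(wracc(*r1), 5) == 0.09375        # ...while WRAcc prefers R2,
assert round(wracc(*r2), 5) == 0.1            # since 0.09375 < 0.1
```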

Coverage Weighted Lift and Positive Coverage Weighted Lift are defined similarly, and they induce the same orderings as CWP and PCWP, respectively. Among two range tuples with the same WRAcc, the one with smaller coverage has higher precision; therefore Coverage Weighted WRAcc is defined as the lexicographic order on pairs (WRAcc, coverage), while Precision Weighted WRAcc is defined as the lexicographic order on pairs (WRAcc, 1/coverage), or equivalently, on pairs (WRAcc, precision). One can refine the remaining quality functions in the same way. The dimension of the range tuples can be added to these lexicographic orders as a third component in order to ensure that range tuples with smaller dimensions are selected first among tuples with the same precision and coverage.

5.3 Basic Root-Causing Step

We explain the main idea of how RA helps root-causing using the example of Figure 2(a), which displays a binary feature CLK_OFF, associated with the event Clk = 0, ranked highly by RA and plotted for the validator’s inspection. The dots in the plot, known as a violin plot, correspond to rows in the input dataset; thus each dot represents a passing or failing instance of the PkgC8 flow. Value 1 on the Y axis (the upper half of the plot) represents the failing tests and value 0 (the lower half) represents the passing ones. The plot shows that when CLK_OFF = 0, that is, when signal Clk was never set to 0 along the execution of the flow instance, all tests fail! This is valuable information that the validator can use as an atomic step of root-causing. The legend at the top of the plot specifies that there are 115 failing tests and 0 passing tests with CLK_OFF = 0, and 85 failing tests and 247 passing ones with CLK_OFF = 1.

Let us note that the range feature (CLK_OFF, 0) has precision 1, as defined in Subsection 5.1, since it covers only positive samples (which correspond to failing tests). To encourage the selection of range features with precision 1 in root-causing runs, in RA we use weighted precision along with WRAcc and Lift, which also encourage the selection of range features with precision 1.

5.4 Root-Causing Strategies

During different stages of product development, the nature of validation activities might vary significantly. When there are many failing tests, the root cause might differ between subgroups of the failed tests. In this case it is important to be able to discover these subgroups of failing tests so that the failures in each subgroup can be debugged together, to save validation effort. On the other hand, it is also important to be able to efficiently debug individual failures. In our approach, the strategies of identifying subgroups of failing tests with a similar root cause and of focusing on root-causing individual failures correspond to breadth-first and depth-first search strategies in the Root-Causing Tree (RCT) that we define next.

Figure 2: Iterative root-causing procedure. (a) Range feature ranked highly by RA. (b) Root-causing tree of failures of PkgC8 and wakeup-timer.

Our proposed procedure for root-causing regression failures is an iterative process and requires guidance from the validator – or inputs from an oracle, to make the formulation of the RCT procedure formal – in order to decide whether the ranked root-causing hints are sufficient to conclude the root cause, or whether further root-causing effort is required, including upgrading the initial dataset with more features according to what has been learned from the root-causing procedure so far. The generation of the ranked list of root-causing hints in each iteration of building the RCT (splitting a node in the current tree) is fully automated using the RA algorithm, but unlike the RL and SD algorithms, the input data to the RCT procedure can change from one iteration of the RA algorithm to the next. As a result, in the RCT procedure user guidance is explicit, while data preparation and inspection of results are outside the scope of RL and SD algorithms (thus RL and SD are deemed fully automated even if the results are not satisfactory and a rerun is required with a more appropriate dataset and/or different parameter values of the RL and SD algorithms).

The iterative root-causing procedure is illustrated in Figure 2(b). More specifically, Figure 2(b) presents an RCT of regression failures of PkgC8 with a random wakeup-timer, which issues an interrupt to trigger a wake-up out of the PkgC8 flow. Using the procedure described next (a schematic sketch of its main loop follows the procedure steps), after a few iterations we obtain, at the leaves of the tree, subgroups of failures characterized by the length of the wakeup-timer and by which actions were achieved. Initially, the regression shows 200 failing tests as one group of failing tests, and there are 247 passing tests.

  • (a) [Initialization] Initialize the root-causing tree RCT to a node n, mark it as open, set the input dataset D passed to the iterative root-causing procedure to the dataset engineered from traces as described in Section 4, and set the number of iterations i = 0. Go to step (c).

  • (b) [Iterations] In this step, the procedure receives a four-tuple (RCT, n, S, D_S), where RCT is the current tree, n is a node in it selected for splitting, S is the subgroup of failing tests associated with node n, and D_S is the dataset to be used for splitting the subgroup S, with the aim of arriving at smaller, higher-quality subgroups, each with its own root cause or root causes. RA is performed on D_S and the oracle chooses the best range feature to split the tree at node n. (Any range feature not identical to the response implies a split at n.) We add two sons to RCT: one associated with the subgroup of failing tests covered by the selected range feature, and the other with the subgroup containing the rest of the failing tests from S; both sons are marked as open. Increment i and go to step (c).

  • (c) [Decisions]

    1. If all leaves of RCT are marked as closed, exit. Otherwise go to (c).2.

    2. Choose a leaf node n not marked as closed – this is the leaf targeted for splitting in order to refine the associated subgroup S of failing tests. If the oracle decides there is no need to split S, that is, all failing tests in S can be associated with a single root cause or S is empty, then mark the leaf n as closed and go to (d). Otherwise go to (c).3.

    3. Define D_S as follows: if i > 0, subset from the input dataset D the rows that correspond to the failing tests in this subgroup and the rows corresponding to the passing tests, to obtain dataset D_S. Drop from D_S all features that were used for splitting at the parent nodes; in addition, drop the features that are identical to the dropped features, as well as features that are identical to the response (the class label), in D_S. Otherwise (case i = 0), set D_S = D. Go to (c).4.

    4. If there are no features in D_S and no new features can be added, mark the leaf node n as closed and go to step (c). Otherwise, if D_S contains all features relevant to S (subject to the oracle’s decision), or no additional features are available, go to step (b) with the four-tuple (RCT, n, S, D_S). Otherwise go to step (e) with the four-tuple (RCT, n, S, D_S).

  • (d) [Knowledge Mining] This step mines the knowledge learned along a closed branch of the root-causing tree; it is discussed further in Subsection 5.7. Then go to step (c).

  • (e) [Data Refinement] This step corresponds to the situation where the oracle wants to add more features that can potentially help to split the subgroup S further; it is discussed further in Subsection 5.6. Then go to step (b) (with an updated dataset D_S).
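The sketch below captures the shape of this loop. The oracle object abstracts the validator's decisions (steps (c)-(e)) and ra abstracts a call to the Range Analysis engine; all interfaces are illustrative assumptions, and bookkeeping such as the iteration counter and explicit open/closed marks is folded into the frontier.

```python
# A schematic sketch of RCT construction; `oracle` and `ra` are assumed
# interfaces standing in for the validator and the RA engine.
from collections import deque

def build_rct(D, oracle, ra):
    root = {"subgroup": oracle.failing_tests(D), "children": []}
    # A FIFO frontier gives the breadth-first (subgroup-oriented) strategy;
    # a LIFO frontier gives the depth-first (single-failure) strategy.
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        S = node["subgroup"]
        if not oracle.needs_split(S):        # single root cause, or S empty
            oracle.mine_rule(node)           # step (d): knowledge mining
            continue                         # the leaf is closed
        D_S = oracle.refine_data(D, S)       # steps (c).3 and (e)
        best = oracle.pick_range(ra(D_S))    # step (b): RA + oracle's choice
        mask = best.covers(S)                # boolean mask over S
        for part in (S[mask], S[~mask]):     # covered / rest of S
            child = {"subgroup": part, "children": []}
            node["children"].append(child)
            frontier.append(child)
    return root
```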

Let us note that in the root-causing tree of Figure 2(b), the event C8_ACTIONS is not a flow-event – it does not occur in the PkgC8 flow description. It helps to explain why some of the flow instances that successfully go through the CLK_OFF stage of the PkgC8 flow fail to complete the flow. Also, the events of PkgC8 that should occur between CLK_OFF and CORE_WAKE_TO_C0 do not occur in the root-causing tree. The root-causing tree therefore provides neat and succinct explanations and grouping of the failures, and gives deeper insights compared to just identifying where in the flow the flow instance starts to deviate from the expected execution. As discussed in Section 4, the information that event C8_ACTIONS should occur before the events notifying the exit stages of the PkgC8 flow can optionally be mined as out-of-order features. During the building of the tree in Figure 2(b), this information was not encoded as a feature in the input dataset D.

In each splitting iteration, the range feature selected for splitting can be multi-dimensional, such as a pair or triplet of feature-range pairs. That is, in general, splitting is performed based on a set of events, not necessarily on just one event. This is discussed in more detail in Subsection 5.5. Each iteration zooms into a lower level of detail of the flow execution and a smaller set of modules of the system.

We note that the features identical to the response in dataset D_S that are dropped from the analysis in step (c) are very interesting features for root-causing, as they fully separate the current subgroup S from the passing tests.

Our iterative root-causing procedure assumes that there are passing tests. This is a reasonable assumption, but it need not hold at the beginning of a design project. In such cases, unsupervised rule-based or clustering algorithms can be used.

Finally, let us note that at every iteration of the proposed root-causing procedure, we select only one range feature to perform a split. However, multiple range features can be selected, and splitting can be performed in parallel (in which case the root-causing tree will not be binary) or in any order (in which case the root-causing tree stays binary and in each iteration we add more than one generation of children to the tree). This reduces the number of required iterations.

5.5 Root-Causing Using Range Tuples

Consider the feature signal_y_4 in Figure 3(a), associated with the event signal_y = 4, and the feature timer_22159_22714 in Figure 3(b), associated with the event timer ∈ [22159, 22714]. While for a large majority of tests (67 out of 70, in this case) the value 0 of feature signal_y_4 implies test failure, there are a few tests (3 out of 70, in this case) that are passing. So the single range feature R1 = (signal_y_4, 0) cannot fully isolate the failing tests from the passing ones. As can be seen from Figure 3(b), the second single range feature R2 = (timer_22159_22714, 0) cannot isolate the failing tests from the passing tests either – along with 67 failing tests, there are 106 passing tests where the timer event does not occur in the time range [22159, 22714]. However, the range pair feature P = R1 ∧ R2 (the conjunction of the two binary features R1 and R2, as defined in Subsection 5.1) fully isolates the 67 failing tests from the passing tests, because the 3 negative samples covered by R1 and the 106 negative samples covered by R2 are disjoint (a numerical sketch of this example follows Figure 3). The range pair feature P is depicted in Figure 3(c), where the color encodes the ratio of positive and negative samples within the colored squares: red encodes positive samples and green encodes negative samples.

Figure 3: Usefulness of range tuple features in the iterative root-causing procedure. (a) The highly ranked single range feature signal_y_4 cannot isolate failing tests from passing ones. (b) The single range feature timer_22159_22714 alone cannot isolate failing tests from passing ones. (c) The range pair combining the single ranges of (a) and (b) isolates failing tests from passing ones.
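The sketch below reproduces the logic of this example with the counts from the text; the 0/1 columns are illustrative stand-ins for the real dataset, ordered so that the disjointness of the two false-positive sets is explicit.

```python
# Range-pair conjunction: neither single range is pure, but their false
# positives are disjoint, so the pair has precision 1 on 67 failing tests.
import numpy as np

fail = np.array([1] * 67 + [0] * (3 + 106), dtype=bool)   # failing tests
R1 = np.array([1] * 67 + [1] * 3 + [0] * 106, dtype=bool) # signal_y_4 = 0
R2 = np.array([1] * 67 + [0] * 3 + [1] * 106, dtype=bool) # timer range missed
P = R1 & R2                                                # the range pair

assert (R1 & ~fail).sum() == 3 and (R2 & ~fail).sum() == 106
assert P.sum() == 67 and (P & fail).sum() == 67            # precision 1
```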

Usage of range pairs and range tuples of higher dimensions can accelerate debugging, as they allow arriving at a root cause in fewer debugging iterations. On the other hand, the splittings of subgroups of failures associated with high-dimensional range tuples might not match the intuition of the validator (the oracle’s decision) controlling the iterations in the root-causing tree; in that case the time spent by the RA algorithm (or other RL or SD algorithms which can be used instead of RA at each splitting iteration) is wasted. One useful heuristic for choosing ranges for the next splitting iterations is to choose, by default and without intervention of the validator, the range features with precision 1 that originate from the flow-events, if such range features are selected by RA as candidates for splitting.

To aid the validator in choosing range features associated with well-understood events, we associate with every ranked candidate range feature a list of very highly correlated or identical features, with respect to D_S. Thus, instead of a highly ranked feature, the validator can choose another representative with a clear interpretation. These strongly correlated or identical features are computed by RA anyway, as a means to reduce the redundancy in the identified subgroups and to speed up the algorithm [15, 16]. In RA, the usage of the MRMR procedure helps not to miss interesting range features for splitting when there are many features in the analysis. Finding diverse, non-redundant and concise sets of subgroups is an important direction in SD research [20, 30], and is very relevant in the context of minimizing the validator’s involvement as the decision maker controlling the root-causing iterations.

5.6 Root-Causing in Hierarchical Order

For large systems with many modules, it might not be wise or feasible to generate upfront all the features that can be engineered from traces and include them in the initial dataset D used in the first root-causing iteration. The root-causing iterations can also be performed in a hierarchical manner, and this is actually the preferred option, which we adopted in our experiments.

In the hierarchical approach, during the initial iterations one can choose to consider only the features associated with some of the important modules and with observable signals on module interfaces, so as to include in the analysis the features associated with all the flow-events of the flow of interest. If the root-causing hints collected so far indicate that the bug could be in a particular module or modules, features originating from these modules can be added to the analysis (and, optionally, features from some other modules can be dropped). This flexibility of refining the input data during root-causing iterations helps to keep the datasets analyzed by the ML algorithms smaller and focused, and speeds up the root-causing iterations. In particular, when the iterative procedure arrives at step (e), one can consider adding features originating from relevant modules if they were not present, or engineering more features associated with additional variables from the relevant modules, when possible, and then continue with step (b), with the tuple (RCT, n, S, D_S) received at step (e) but with the updated D_S. As an example, in the root-causing tree of Figure 2(b), the rightmost branch was closed with the label Reached_last_stage_before_C8, because data from the respective module was not available at the time of the initial experiments. Interestingly, when the relevant data became available and was added to the analysis, further root-causing revealed that the corresponding regression test failures were not caused by a bug; instead, the self-checks in these tests were incorrect, and they were fixed as a result of the insights gained during the root-causing iterations.

We note that there might be other useful ways to revise the current dataset, not covered by our algorithm, and it is not our goal to be exhaustive in that respect. For example, as mentioned in Section 4, feature engineering depends on the definition of the trace fragment associated with a flow instance. Our choice in this work was to define the relevant trace fragment upfront, as part of the data preprocessing step prior to RCT construction, but the definition of the relevant trace fragment could instead be made part of the RCT procedure, since the identification of the most relevant fragments of the trace to engineer features from is also affected by the learning acquired in the root-causing iterations performed so far.

5.7 Mining Rules

In order to make root-causing easier for validators and to enable leveraging the knowledge accumulated during root-causing activities, we propose to mine validation knowledge as follows:

  • Maintain and improve a table associating well-understood events with their meaning. This table is used to display the meaning of the events and event combinations associated with the range features ranked highly by the RA algorithm, when the latter are presented to the validator to choose the best root-causing hints and decide on the next root-causing iteration.

  • Construct and improve a database of rules as follows: at step (d) of the root-causing procedure, before returning to step (c), the validator can assign a label to the respective subgroup of failures and add to the database a rule associated with the branch of the tree leading to that subgroup. The antecedent of this rule is the conjunction of the conditions associated with the ranges along this branch, and the consequent is the label of the subgroup S at the leaf of this branch. As an example, the rule associated with the branch ending at node Clk_Off_without_C8_Actions is written as:

    CLK_OFF = 1 ∧ C8_ACTIONS = 0 → Clk_Off_without_C8_Actions
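As a minimal illustration of such a rule database, the sketch below stores each rule as (antecedent, label) and matches a failing sample against the antecedents; the data layout and function names are illustrative assumptions.

```python
# A sketch of the mined-rule database: each entry maps the conjunction of
# range conditions along an RCT branch to the validator-assigned label.
RULES = [
    ({"CLK_OFF": 1, "C8_ACTIONS": 0}, "Clk_Off_without_C8_Actions"),
]

def classify_failure(sample, rules=RULES):
    # Return the label of the first rule whose antecedent the failing
    # sample satisfies; None triggers a fresh root-causing iteration.
    for antecedent, label in rules:
        if all(sample.get(f) == v for f, v in antecedent.items()):
            return label
    return None

assert classify_failure({"CLK_OFF": 1, "C8_ACTIONS": 0}) == \
       "Clk_Off_without_C8_Actions"
```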

The learned rules in the rule database can be used to predict the root cause of test failures in future regressions by building predictive models from rules, using algorithms developed under RL, or related algorithms such as [20] that combine RL and SD techniques for building predictive models from rules. An alternative approach is to train a model predicting the learned failure labels (the failure labels learned during RCT construction), once a rich set of failure labels and a dataset D engineered from passing tests and failing tests with these failure types are available, as is done in previous work [22, 11] for predefined failure labels. In addition to the features in D, the range features corresponding to the antecedents of the mined rules can be used as derived features when training the model, since this can improve the quality of the trained model, as was demonstrated in [16].

It is important to point out that the learned failure labels and rules added to the database are based on the intended behavior of the system, not on its current implementation. When the design is updated (say, due to bug fixes), the design intent usually stays the same; therefore, the failure labels and rules stay relevant as the design evolves, as long as the design intent is unchanged. In this respect the rules resemble reference models or other types of assertions, but with the opposite meaning: they capture cases of incorrect behavior rather than the intended behavior.

5.8 Backtracking and Termination of Root-Causing Procedure

In our root-causing procedure, we did not discuss backtracking during the RCT construction. When backtracking is not used, each subgroup of failing tests is split at most once, which can be done in a finite number of ways (in the binary RCT construction procedure, a subgroup is split into two subgroups or not split at all). Furthermore, steps (a), (b), (d) transition to step (c), which starts with a termination check, and step (e) transitions to (b), which then transitions to (c). Therefore, assuming backtracking is not used and assuming step (b) succeeds in selecting a range that is used for splitting, it is straightforward to conclude that the RCT construction will terminate (there will be no infinite loops).

In general, it can happen that step (b) does not generate valuable root-causing hints and the validator cannot refine the subgroup at hand further. The most likely reason is that the dataset used does not contain relevant features. In that case, it is useful to backtrack to the same subgroup S but upgrade the dataset D_S by adding potentially relevant features, by going through step (e). Otherwise, the problem might be due to incorrect splitting decisions taken earlier during RCT construction or to the poor quality of the top-ranked range features returned by RA, and in such cases too, backtracking to an already visited subgroup might be useful. Assuming the number of features available for root-causing is limited and only a limited number of backtracking steps to an already visited subgroup are allowed, the RCT construction will always terminate.
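As a sketch of this termination argument, the loop below abstracts the RCT construction with a bounded per-subgroup backtracking budget. The callbacks find_split and upgrade_dataset are stand-ins for steps (b)–(d) (RA ranking plus validator choice) and step (e), respectively; they are placeholders for illustration, not the actual procedure.

```python
# Hedged sketch of bounded-backtracking RCT construction. `find_split` and
# `upgrade_dataset` are hypothetical callbacks abstracting steps (b)-(e).
MAX_BACKTRACKS = 3  # hypothetical per-subgroup backtracking budget

def build_rct(root_subgroup, dataset, find_split, upgrade_dataset):
    worklist = [(root_subgroup, dataset, 0)]  # (subgroup, its dataset, backtracks used)
    leaves = []
    while worklist:  # terminates: splits are finite and backtracks are bounded
        subgroup, ds, backtracks = worklist.pop()
        split = find_split(subgroup, ds)  # None if no valuable hint was found
        if split is not None:
            worklist.extend((child, ds, 0) for child in split)
        elif backtracks < MAX_BACKTRACKS:
            # revisit the same subgroup with added features, as in step (e)
            worklist.append((subgroup, upgrade_dataset(subgroup, ds), backtracks + 1))
        else:
            leaves.append(subgroup)  # give up refining; label as a leaf
    return leaves
```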

6 Results

We explored the Range Analysis algorithm for root-causing Power Management protocol failures on several regressions, starting with a synthetic regression of PkgC8 and a random wakeup-timer (Figure 2(b)), and continuing with authentic random PM regressions on a FullChip emulation platform, on two recent Intel products. We found this method very effective for classifying the failures and providing valuable root-causing hints. It also significantly improves the productivity of the validators, as the most difficult and time-consuming root-causing activities are fully automated.

In Table 1 we report statistics of the datasets used in the experiments. The first column lists the eight datasets engineered from PM test regression logs and fed to RA for generating root-causing hints. The next two columns, fail and pass, report the counts of failing and passing tests in a regression run, respectively. The column features reports the number of features engineered from trace logs, and the column vars reports the number of observable variables (signals from hardware, variables in the software or firmware, parameter values collected from sensors, or other parameters) with which the features are associated. The column count reports the number of count features and the number of variables with which they are associated, separated by "/"; similarly, the columns occurs and timing report the number of occurrence and timing features and the variables with which they are associated, separated by "/", respectively. Finally, the column ordering reports the count of features associated with in-order and out-of-order sequences of two or more events; and the column other reports the count of other features relevant for root-causing (e.g., features associated with parameters that do not change along a test but might change from test to test, such as particular IDs, modes and conditions of operation). Note that we associate histograms with count and timing features, and their count is the same as the number of variables reported in columns count and timing; this table does not cover the counts of features associated with these histograms (such as the histogram statistics min, max, mean, …) used in root cause analysis.

data fail pass features vars count occurs timing ordering other
Data1 237 247 932 472 368/158 514/302 50/12 0 0
Data2 237 246 932 472 275/107 607/353 50/12 0 0
Data3 237 247 492 156 0/0 424/148 36/8 26 6
Data4 226 44 867 178 0/0 793/170 40/8 27 7
Data5 463 291 869 178 0/0 797/170 38/8 27 7
Data6 2373 142 908 206 0/0 835/198 38/8 27 8
Data7 2626 200 927 207 0/0 854/199 38/8 27 8
Data8 2626 200 944 207 0/0 870/199 39/8 27 8
Table 1: Benchmark dataset statistics
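To make the feature taxonomy behind Table 1 concrete, here is a hedged sketch of how occurrence, count and timing features (with histogram statistics) might be engineered from a parsed trace log. The event representation and the naming scheme are our illustration in the style of Table 5, not the production flow.

```python
import statistics

def engineer_features(trace):
    """Sketch: derive occurrence/count/timing features from one test's trace.

    trace: list of (timestamp, event_name) pairs parsed from the log.
    Returns a flat dict of feature name -> value, i.e., one dataset row.
    """
    times = {}
    for ts, event in trace:
        times.setdefault(event, []).append(ts)

    row = {}
    for event, ts_list in times.items():
        row[f"{event}__OCCURS"] = 1                # occurrence feature
        row[f"{event}__COUNT"] = len(ts_list)      # count feature
        # timing feature: gaps between consecutive occurrences, summarized
        gaps = [b - a for a, b in zip(ts_list, ts_list[1:])]
        if gaps:
            row[f"{event}__TIMING_hist_min"] = min(gaps)
            row[f"{event}__TIMING_hist_max"] = max(gaps)
            row[f"{event}__TIMING_hist_mean"] = statistics.mean(gaps)
    return row

# Example: two occurrences of CLK_OFF, one of C8_ENTRY.
print(engineer_features([(10, "CLK_OFF"), (25, "C8_ENTRY"), (40, "CLK_OFF")]))
```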

The hierarchical approach to root-causing allows keeping the datasets fed to RA small. While RA can handle input datasets with thousands of columns and millions of rows, it is important to keep the datasets small to speed up the iterations. As can be seen from Table 1, in our experiments the number of features ranges from a few hundred to about a thousand, and the number of rows ranges from a few hundred to a few thousand. The largest analyzed dataset, Data8, has 944 features and 2826 rows. On such small datasets, each iteration typically takes up to 2 minutes. The number of iterations ranges from just one to ten or so, with the depth of the RCT reaching up to four or five levels on average.

To the best of our knowledge, this work represents the first usage of SD and RL techniques for automating root cause analysis in the validation domain. Therefore, in the experimental evaluation below on PM datasets, our first goal is to demonstrate that the root-causing hints identified by our approach are of high quality according to widely accepted objective quality measures (Tables 2 and 3). In addition, we take a closer look at some of the SD heuristics that help diversify the generated root-causing hints and evaluate them on PM datasets (Tables 3 and 4). The methods used for feature engineering from traces are of critical importance for the entire approach to succeed, and therefore we also analyze which types of features engineered from trace logs are most useful in generating root-causing hints (Table 5). The core RA algorithm described in Section 5.1 is not a contribution of this work, and while we mostly use the RA implementation in Intel's EVA tool in our experiments, it is not our aim to demonstrate that it performs better than SD implementations on PM datasets (recall that RA leverages SD heuristics in one of its main building blocks). In Tables 3 and 4, we include results obtained using the pysubgroup package for SD [21], to give an idea of the relative performance, on PM datasets, of RA compared to a basic implementation of SD. (The pysubgroup package is not restricted to non-commercial usage and is easy to use; however, it does not support many advanced SD heuristics.) A more comprehensive experimental evaluation of RA against SD and RL, on Intel as well as public datasets, can be found in [15, 16].

6.1 Quality Metrics on the Largest PM Dataset

Table 2 reports the RA results in the original root-causing activity after the first RCT iteration on the dataset Data8, with 944 features and 2826 rows. Each row in the table corresponds to a selected range feature (the names of the features and the selected ranges are not shown), and the first column Dim shows the dimension of the range feature – range features up to triplets were generated. Two heuristics for generating a diverse set of ranked range features were used: (1) the number of ranges selected per feature was limited to 2 (this is a default value and was used in all experiments); and (2) among identical range features per dimension (range features that define the same subgroups but have different descriptions and might be associated with the same or different original features in the data), only a single representative was used in the further construction of ranges of higher dimensions. Therefore the top-ranked range features – the selected subgroup descriptions – are diverse, though the subgroups of samples that they define may overlap. The remaining columns in Table 2 report the confusion matrix statistics and quality metrics defined in Section 5.1. The selected ranges were sorted according to the weighted precision defined in Section 5.1. One can see that many ranges with precision 1 have been selected, including single ranges and range triplets. Actually, all the selected ranges were of high quality. The validator chose the highest-ranked range for splitting, in order to further root-cause the seven false negatives (the value 7 occurs in column FN in the first row of Table 2). In the second iteration, RA was run on the dataset obtained by dropping all positive samples except the seven false negatives. With the ranges reported by the second RA run, it was possible to split the seven test failures into subgroups of four, two and one failures, respectively, which resulted in full root-causing of all seven failures.

Dim FN TN TP FP TPR PPV Lift WRAcc F1Score Accuracy
1 7 200 2619 0 0.9973 1.0000 1.0762 0.5328 0.9986 0.9975
3 11 200 2615 0 0.9958 1.0000 1.0762 0.5327 0.9979 0.9961
3 34 200 2592 0 0.9871 1.0000 1.0762 0.5325 0.9935 0.9880
1 35 200 2591 0 0.9867 1.0000 1.0762 0.5324 0.9933 0.9876
1 40 200 2586 0 0.9848 1.0000 1.0762 0.5324 0.9923 0.9858
3 51 200 2575 0 0.9806 1.0000 1.0762 0.5322 0.9902 0.9820
3 58 200 2568 0 0.9779 1.0000 1.0762 0.5322 0.9888 0.9795
3 59 200 2567 0 0.9775 1.0000 1.0762 0.5321 0.9886 0.9791
3 62 200 2564 0 0.9764 1.0000 1.0762 0.5321 0.9881 0.9781
3 64 200 2562 0 0.9756 1.0000 1.0762 0.5321 0.9876 0.9774
3 65 200 2561 0 0.9752 1.0000 1.0762 0.5321 0.9874 0.9770
3 72 200 2554 0 0.9726 1.0000 1.0762 0.5320 0.9861 0.9745
1 73 200 2553 0 0.9722 1.0000 1.0762 0.5320 0.9859 0.9742
3 77 200 2549 0 0.9707 1.0000 1.0762 0.5319 0.9851 0.9728
3 78 200 2548 0 0.9703 1.0000 1.0762 0.5319 0.9849 0.9724
1 110 200 2516 0 0.9581 1.0000 1.0762 0.5315 0.9786 0.9611
1 114 200 2512 0 0.9566 1.0000 1.0762 0.5315 0.9778 0.9597
3 115 200 2511 0 0.9562 1.0000 1.0762 0.5314 0.9776 0.9593
1 120 200 2506 0 0.9543 1.0000 1.0762 0.5314 0.9766 0.9575
3 122 200 2504 0 0.9535 1.0000 1.0762 0.5314 0.9762 0.9568
1 124 200 2502 0 0.9528 1.0000 1.0762 0.5313 0.9758 0.9561
3 149 200 2477 0 0.9433 1.0000 1.0762 0.5310 0.9708 0.9473
3 155 200 2471 0 0.9410 1.0000 1.0762 0.5309 0.9696 0.9452
3 2520 200 106 0 0.0404 1.0000 1.0762 0.5013 0.0777 0.1083
2 478 198 2148 2 0.8180 0.9991 1.0752 0.5266 0.8995 0.8301
2 478 196 2148 4 0.8180 0.9981 1.0742 0.5262 0.8991 0.8294
2 478 196 2148 4 0.8180 0.9981 1.0742 0.5262 0.8991 0.8294
2 478 196 2148 4 0.8180 0.9981 1.0742 0.5262 0.8991 0.8294
2 114 193 2512 7 0.9566 0.9972 1.0732 0.5303 0.9765 0.9572
2 122 192 2504 8 0.9535 0.9968 1.0727 0.5300 0.9747 0.9540
2 114 191 2512 9 0.9566 0.9964 1.0723 0.5300 0.9761 0.9565
3 34 190 2592 10 0.9871 0.9962 1.0720 0.5308 0.9916 0.9844
2 148 190 2478 10 0.9436 0.9960 1.0718 0.5294 0.9691 0.9441
2 154 190 2472 10 0.9414 0.9960 1.0718 0.5293 0.9679 0.9420
1 114 189 2512 11 0.9566 0.9956 1.0715 0.5296 0.9757 0.9558
2 34 170 2592 30 0.9871 0.9886 1.0638 0.5275 0.9878 0.9774
2 63 158 2563 42 0.9760 0.9839 1.0588 0.5252 0.9799 0.9628
2 34 127 2592 73 0.9871 0.9726 1.0467 0.5205 0.9798 0.9621
2 72 128 2554 72 0.9726 0.9726 1.0467 0.5201 0.9726 0.9490
3 47 117 2579 83 0.9821 0.9688 1.0426 0.5186 0.9754 0.9540
1 32 106 2594 94 0.9878 0.9650 1.0385 0.5170 0.9763 0.9554
2 47 105 2579 95 0.9821 0.9645 1.0379 0.5167 0.9732 0.9498
2 54 99 2572 101 0.9794 0.9622 1.0355 0.5156 0.9707 0.9452
2 6 84 2620 116 0.9977 0.9576 1.0305 0.5137 0.9772 0.9568
1 6 64 2620 136 0.9977 0.9507 1.0231 0.5104 0.9736 0.9498
2 51 63 2575 137 0.9806 0.9495 1.0218 0.5097 0.9648 0.9335
1 41 53 2585 147 0.9844 0.9462 1.0183 0.5082 0.9649 0.9335
1 49 49 2577 151 0.9813 0.9446 1.0166 0.5074 0.9626 0.9292
1 2 21 2624 179 0.9992 0.9361 1.0074 0.5034 0.9666 0.9360
Table 2: Range Analysis results in the first iteration of RCT construction on Data8, with RA option values nRanking=15 and nBasis=12, and with quality functions WRAcc, Lift and a weighted average of TPR and FNR
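As a sanity check on the metrics in Table 2, the following sketch recomputes TPR, PPV, Lift, F1 score and Accuracy from a confusion-matrix row using the standard definitions (we omit WRAcc here, whose exact normalization follows Section 5.1). The first row of Table 2 is used as input.

```python
def quality_metrics(fn, tn, tp, fp):
    """Standard confusion-matrix metrics; Lift relates the precision of a
    range feature to the base rate of positives in the whole dataset."""
    pos, total = tp + fn, tp + fn + tn + fp
    tpr = tp / pos                 # recall over the failing tests
    ppv = tp / (tp + fp)           # precision of the range feature
    lift = ppv / (pos / total)     # precision vs. base failure rate
    f1 = 2 * ppv * tpr / (ppv + tpr)
    accuracy = (tp + tn) / total
    return tpr, ppv, lift, f1, accuracy

# First row of Table 2: FN=7, TN=200, TP=2619, FP=0 ->
# (0.9973, 1.0, 1.0762, 0.9987, 0.9975), matching the reported values
# up to rounding.
print([round(m, 4) for m in quality_metrics(7, 200, 2619, 0)])
```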

6.2 Coverage with Precision 1 on PM Datasets

In Table 3 we present positive-sample coverage information on all eight datasets in our benchmark suite. Table 3.a reports coverage information on positive samples (i.e., test failures) with single range features only, with RA. The first column lists the datasets; the columns positives and negatives report the counts of positive and negative samples in the dataset, respectively, and are included here for convenience. Columns ppv_min and ppv_max report the min and max values of the precision (PPV) of the selected single ranges. To achieve a high level of diversity among the selected range tuples, in addition to the heuristics mentioned above, a heuristic method from [20, 30] was used to generate a diverse set of single ranges; it penalizes the selection of range features that cover samples already covered by earlier selected range features. In particular, in the experiments reported in Table 3, the weight w that controls the degree of penalizing coverage of a sample already covered by i rules is computed as w = γ^i, where 0 ≤ γ ≤ 1 was set to 0.4, and the (weighted) coverage of a candidate rule is then computed based on these weights [20, 30]. Note that when γ = 1 the heuristic has no effect, while γ = 0 corresponds to ignoring the previously covered samples (or dropping them from the dataset). Columns cov_min and cov_max report the min and max coverage of positive samples by the selected single ranges (not all samples need to be covered by the same number of rules). The number of samples not covered by any of the selected single ranges is reported in column not_cov, and for convenience, column all_cov reports the number of all samples covered by single ranges. Each of the coverage columns cov_min, cov_max, not_cov and all_cov actually reports two numbers separated by "/": to the left of "/" we report the respective coverage information for all the selected single ranges, while to the right we report the coverage information for the selected single ranges with precision 1. Range features with precision 1 cleanly separate a subset of positive samples from the negative samples, thus they can be considered high-quality root-causing hints. From the data reported in columns cov_min and cov_max we see that a single positive sample can be covered by multiple selected single ranges with precision 1, which means that there are multiple root-causing hints for these samples and each one of them can potentially point to the real root cause.
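The following hedged sketch illustrates this γ-based weighted covering heuristic in the style of [20, 30]. The candidate scoring is simplified to weighted coverage of positive samples, whereas the actual RA implementation combines the weights with its quality functions.

```python
import numpy as np

def select_diverse_ranges(candidates, n_select, gamma=0.4):
    """Greedy diverse selection of range features via multiplicative weights.

    candidates: list of boolean numpy arrays, one per candidate range feature,
    marking which positive samples the candidate covers. A sample already
    covered by i selected ranges contributes weight gamma**i, so candidates
    covering fresh samples are preferred. gamma=1 disables the heuristic;
    gamma=0 ignores already-covered samples entirely.
    """
    counts = np.zeros(len(candidates[0]), dtype=int)  # per-sample cover counts
    selected, remaining = [], list(range(len(candidates)))
    for _ in range(min(n_select, len(candidates))):
        weights = gamma ** counts.astype(float)
        best = max(remaining, key=lambda j: weights[candidates[j]].sum())
        remaining.remove(best)
        selected.append(best)
        counts[candidates[best]] += 1
    return selected
```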

Table 3.b reports the same information as Table 3.a, except that it now covers the selected single ranges and range pairs, and Table 3.c reports the same information for all range features – single ranges, pairs and triplets. Table 3.d reports the same information obtained using pysubgroup SD, where the depth parameter was set to 3, to generate range tuples up to triplets, and the result_set_size parameter for each dataset was set to the number of triplets generated in Table 3.c for γ = 1 (a minimal pysubgroup invocation is sketched after Table 3). The latter choice is justified by the fact that pysubgroup SD does not support the γ-based heuristic, and with γ = 1 this heuristic has no effect (in SD and in RA). As can be seen from Table 3.c, for datasets 1–3, 7 and 8 only very few samples remain uncovered by the selected ranges with precision 1, and they require an additional iteration of the RCT procedure.

(3.a): Coverage with single ranges only
data positives negatives ppv_min ppv_max cov_min cov_max not_cov all_cov
Data1 237 247 0.6286 1.0000 2/1 12/7 0/3 237/234
Data2 237 246 0.5203 1.0000 3/3 13/10 0/37 237/200
Data3 237 247 0.4917 1.0000 9/1 17/8 0/5 237/232
Data4 226 44 0.8427 1.0000 7/1 17/1 0/209 226/17
Data5 463 291 0.6316 1.0000 5/1 13/1 0/331 463/132
Data6 2373 142 0.9519 0.9983 3/0 15/0 0/2373 2373/0
Data7 2626 200 0.9348 1.0000 1/1 16/3 0/73 2626/2553
Data8 2626 200 0.9361 1.0000 3/1 16/9 0/2 2626/2624
(3.b): Coverage with single ranges and pairs
data positives negatives ppv_min ppv_max cov_min cov_max not_cov all_cov
Data1 237 247 0.6286 1.0 4/1 22/13 0/1 237/236
Data2 237 246 0.5203 1.0 7/1 21/11 0/5 237/232
Data3 237 247 0.4917 1.0 18/1 30/8 0/5 237/232
Data4 226 44 0.8427 1.0 18/1 47/3 0/158 226/68
Data5 463 291 0.6316 1.0 7/1 22/8 0/163 463/300
Data6 2373 142 0.9519 1.0 4/2 35/14 0/1279 2373/1094
Data7 2626 200 0.9348 1.0 1/1 32/12 0/11 2626/2615
Data8 2626 200 0.9361 1.0 4/1 33/9 0/2 2626/2624
(3.c): Coverage with single ranges, pairs and triplets
data positives negatives ppv_min ppv_max cov_min cov_max not_cov all_cov
Data1 237 247 0.6286 1.0 5/1 37/25 0/1 237/236
Data2 237 246 0.5203 1.0 7/1 38/23 0/5 237/232
Data3 237 247 0.4917 1.0 24/2 54/20 0/5 237/232
Data4 226 44 0.8427 1.0 22/1 78/12 0/146 226/80
Data5 463 291 0.6316 1.0 9/1 32/13 0/150 463/313
Data6 2373 142 0.9519 1.0 4/3 65/25 0/1279 2373/1094
Data7 2626 200 0.9348 1.0 1/4 58/27 0/11 2626/2615
Data8 2626 200 0.9361 1.0 4/2 60/34 0/2 2626/2624
(3.d): Coverage with single ranges, pairs and triplets using pysubgroup
data positives negatives ppv_min ppv_max cov_min cov_max not_cov all_cov
Data1 237 247 0.9831 1.0000 8/4 77/4 2/5 235/232
Data2 237 246 1.0000 1.0000 57/57 57/57 37/37 200/200
Data3 237 247 0.9590 1.0000 42/1 45/1 3/5 234/232
Data4 226 44 0.9513 0.9887 5/0 95/0 1/226 225/0
Data5 463 291 0.9976 1.0000 11/1 61/51 43/43 420/420
Data6 2373 142 0.9983 0.9983 81/0 82/0 4/2373 2369/0
Data7 2626 200 1.0000 1.0000 66/66 66/66 73/73 2553/2553
Data8 2626 200 1.0000 1.0000 58/58 63/63 7/7 2619/2619
Table 3: Positive samples coverage with PPV = all / PPV = 1, with γ = 0.4, with the top 15 range tuples selected per dimension by the target procedure in RA; and with pysubgroup SD (where the value of γ has no effect)
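For reference, here is a minimal sketch of running pysubgroup with the depth and result_set_size parameters used above. The dataset file, the fail column and the quality-function choice are illustrative assumptions, and the exact class names follow the pysubgroup documentation and may differ between versions.

```python
import pandas as pd
import pysubgroup as ps

# Hypothetical dataset engineered from regression logs, with a boolean
# 'fail' column marking the positive samples (failing tests).
df = pd.read_csv("pm_regression_features.csv")

target = ps.BinaryTarget("fail", True)
searchspace = ps.create_selectors(df, ignore=["fail"])
task = ps.SubgroupDiscoveryTask(
    df, target, searchspace,
    result_set_size=50,   # number of subgroup descriptions to keep
    depth=3,              # range tuples up to triplets, as in Table 3.d
    qf=ps.WRAccQF(),      # weighted relative accuracy as the quality function
)
result = ps.BeamSearch().execute(task)
print(result.to_dataframe())  # subgroup descriptions with their quality scores
```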

In Table 4 we evaluate in more detail the effect of the selected range feature diversification heuristic from [20, 30] discussed above. In particular, we evaluate positive-sample coverage for the values γ = 1.0, 0.9, 0.8, …, 0.0, for RA runs with nRanking=15 and nRanking=25, in Tables 4.a and 4.b, respectively; and with pysubgroup SD, in the last column, called PSG. Unlike Table 3, we report the count of positive samples not covered by selected range tuples with precision 1, since the selected range tuples with high precision (of dimensions up to 3) cover all positive samples at least once for all values of γ, in all datasets, even if this diversification heuristic is not used, which corresponds to the case γ = 1. While for datasets 3, 7 and 8 the value of γ has no effect in terms of improving positive-sample coverage, for the other datasets the effect is significant, especially for dataset 6, where activating the heuristic reduces the number of not covered positive samples very significantly. The row named Total reports the sum of not-covered positive samples over all eight datasets, while the row named Wins reports for how many datasets the given column, that is, the given value of γ, was among the winning values of γ in minimizing the number of not covered positive samples. As can be observed, the value γ = 0.4 is the best performing on RA runs with nRanking=15 in terms of the number of wins and minimizing the number of not covered positive samples, and γ = 0.4 is also among the top winning choices for the RA runs with nRanking=25. Details of the RA runs with this winning value γ = 0.4 were given in Table 3.

(4.a): RA runs with nRanking=15 and nBasis=12
data 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 PSG
Data1 2 1 1 3 1 1 1 1 8 3 3 5
Data2 6 6 6 6 5 5 5 5 5 6 6 37
Data3 5 5 5 5 5 5 5 5 5 5 5 5
Data4 147 140 135 155 142 155 146 158 158 158 150 226
Data5 159 165 158 170 158 150 150 150 163 163 256 43
Data6 1437 1279 1279 1279 1279 1279 1279 1279 1279 1279 1279 2373
Data7 11 11 11 11 11 11 11 11 11 11 11 73
Data8 2 2 2 2 2 2 2 2 2 2 2 7
Total 1769 1609 1597 1631 1603 1608 1599 1611 1631 1627 1712 2769
Wins 3 5 6 4 6 7 7 7 5 4 4 1
(4.b): RA runs with nRanking=25 and nBasis=12
data 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 PSG
Data1 1 3 3 3 3 3 1 1 1 3 3 5
Data2 6 6 6 6 5 5 5 5 5 6 6 37
Data3 3 1 1 3 3 3 3 3 3 3 3 5
Data4 147 130 140 150 136 148 148 134 129 140 140 226
Data5 232 150 148 145 143 143 142 142 148 148 237 43
Data6 1337 1177 1177 1177 1177 1177 1177 1177 1177 1177 1178 2373
Data7 11 11 11 11 11 11 11 11 11 11 11 73
Data8 2 2 2 2 2 2 2 2 2 2 2 7
Total 1739 1480 1488 1497 1480 1492 1489 1475 1476 1490 1580 2769
Wins 3 4 4 3 4 4 6 6 6 3 2 1
Table 4: Positive samples coverage with PPV = 1 for different values of γ, with nRanking=15 and nRanking=25 in the top and bottom tables, respectively, and with nBasis=12, with the RA algorithm; and with pysubgroup SD, with the same count of selected range tuples as with RA and γ = 1

6.3 Importance of Types of Engineered Features on PM Datasets

Finally, we evaluate which types of features engineered from Power Management regression logs are selected most frequently. In Table 5 we report the types of features that occur among the top 6 ranked selected range tuples, for RA runs with nRanking=15, nBasis=12, and γ = 0.4. The table demonstrates that the count and timing features, for which we generate histogram statistics as additional features for root-causing, are selected very often, and the same is true for the histogram features generated from them. The occurrence features are also selected often, while the ordering features have the lowest representation among the top-ranked range features.

data features FN TN TP FP PPV
Data1 var367__COUNT_hist_min 104 247 133 0 1.0000
Data1 var297__COUNT__0_TO_13547 119 247 118 0 1.0000
Data1 var14__TIMING_hist_min 129 247 108 0 1.0000
Data1 var12__TIMING_hist_skewness 161 247 76 0 1.0000
Data1 var239__COUNT__0_TO_0 177 247 60 0 1.0000
Data1 var320__COUNT__0_TO_0 189 247 48 0 1.0000
Data2 var308__COUNT__1025_TO_1025 37 246 200 0 1.0000
Data2 var242__COUNT_hist_max 42 246 195 0 1.0000
Data2 var412__COUNT__128_TO_128 84 246 153 0 1.0000
Data2 var14__TIMING_hist_min 129 246 108 0 1.0000
Data2 var452__COUNT_hist_max 189 246 48 0 1.0000
Data2 var380__COUNT_hist_max 190 246 47 0 1.0000
Data3 var6__OCCURS__167_TO_167 37 247 200 0 1.0000
Data3 var9__OCCURS__0_TO_4092 63 247 174 0 1.0000
Data3 var10__TIMING__6703_TO_13513 113 247 124 0 1.0000
Data3 var6__OCCURS__7_TO_7 122 247 115 0 1.0000
Data3 var56__OCCURS__0_TO_0 124 247 113 0 1.0000
Data3 var13__TIMING_hist_min 129 247 108 0 1.0000
Data4 var14__TIMING_hist_kurtosis 209 44 17 0 1.0000
Data4 var45__OCCURS__116609_TO_358880 182 44 44 0 1.0000
Data4 var60__TIMING_hist_max 201 44 25 0 1.0000
Data4 var115__OCCURS__5_TO_5 184 44 42 0 1.0000
Data4 var46__OCCURS__1_TO_8 183 43 43 1 0.9773
Data4 var8__OCCURS__167_TO_175 183 43 43 1 0.9773
Data5 var14__TIMING_hist_kurtosis 331 291 132 0 1.0000
Data5 var11__OCCURS__0_TO_11750 339 291 124 0 1.0000
Data5 var103__OCCURS__28455_TO_62818 346 291 117 0 1.0000
Data5 var8__OCCURS__167_TO_175 373 291 90 0 1.0000
Data5 var14__TIMING_hist_max 384 291 79 0 1.0000
Data5 var16__OCCURS__6144_TO_6144 403 291 60 0 1.0000
Data6 var3 1606 142 767 0 1.0000
Data6 var157__OCCURS__318767104_TO_318767104 1608 142 765 0 1.0000
Data6 var10__OCCURS__0_TO_0 1616 142 757 0 1.0000
Data6 var188__OCCURS__423624706_TO_427819011 1636 142 737 0 1.0000
Data6 var0__OCCURS__4_TO_4 1747 142 626 0 1.0000
Data6 var84__OCCURS__0_TO_0 28 138 2345 4 0.9983
Data7 var3 73 200 2553 0 1.0000
Data7 var166__ORDERING 483 200 2143 0 1.0000
Data7 var5 948 200 1678 0 1.0000
Data7 var6__OCCURS__167_TO_167 355 200 2271 0 1.0000
Data7 var159__OCCURS__318767104_TO_318767104 369 200 2257 0 1.0000
Data7 var6__OCCURS__7_TO_7 374 200 2252 0 1.0000
Data8 var6__OCCURS__167_TO_167 7 200 2619 0 1.0000
Data8 var159__OCCURS__318767104_TO_318767104 35 200 2591 0 1.0000
Data8 var6__OCCURS__7_TO_7 40 200 2586 0 1.0000
Data8 var3 73 200 2553 0 1.0000
Data8 var86__OCCURS__0_TO_0 110 200 2516 0 1.0000
Data8 var12__OCCURS__0_TO_0 120 200 2506 0 1.0000
Table 5: Types of selected top-ranked range tuples for RA runs with nRanking=15, nBasis=12 and γ = 0.4

7 Related Work and Discussion

7.1 Subjective Measures of Interestingness of Subgroups

Work [19] discusses three use cases in other application domains – one medical, two marketing – with the aim of highlighting the advantages of the SD methodology and the wide range of algorithms and quality functions developed under SD for the task of identifying relevant subgroups of a population. While these application domains are quite remote from validation, a big part of the discussion and lessons from [19] applies to our work as well, and we repeat here some of the most relevant arguments (see Sections 3 and 8 of [19]).

One can distinguish between objective quality measures and subjective measures of interestingness [28]. Both the objective and subjective measures need to be considered in order to solve subgroup discovery tasks. Which of the quality criteria are most appropriate depends on the application. Obviously, for automated rule induction it is only the objective quality criteria that apply. However, for evaluating the quality of induced subgroup descriptions and their usefulness for decision support, the subjective criteria are more important, but also harder to evaluate. Below is a list of subjective measures of interestingness:

  1. Usefulness. Usefulness is an aspect of rule interestingness which relates a finding to the goals of the user [17].

  2. Actionability. A rule (pattern) is interesting if the user can do something with it to his or her advantage [27, 28].

  3. Operationality. Operationality is a special case of actionability. Operational knowledge enables performing an action which can operate on the target population and change the rule coverage. It is therefore the most valuable form of induced knowledge [19].

  4. Unexpectedness. A rule (pattern) is interesting if it is surprising to the user [28].

  5. Novelty. A finding is interesting if it deviates from prior knowledge of the user [17].

  6. Redundancy. Redundancy amounts to the similarity of a finding with respect to other findings; it measures to what degree a finding follows from another one [17], or to what degree multiple findings support the same claims.

The main reason why our root-causing approach is not fully automated is that we rely on the validator's feedback to decide when the splitting of failure subgroups is not required any further – for example, to decide whether all failures in each subgroup at the leaves of the RCT constructed so far have the same root cause, with actionable insights. Even if the subgroup descriptions – the range tuples – were generated automatically so that all subgroup descriptions together cover all the positive samples and only positive ones, and the samples covered by these subgroup descriptions are disjoint, it is not guaranteed that these subgroups are the right ones, and that the descriptions of these subgroups are the right ones. (There may be many such sets of subgroup descriptions in general, and there might be none, say in case the dataset has a sample that is both positive and negative; in the latter case, adding more features to the dataset might help to resolve such conflicts in the data and allow for a cleaner separation of subgroups of positive samples from negative ones using range tuples or rules.) Indeed, as can be seen from the experimental evaluation in Section 6, a failing test might have multiple explanations through rules with precision 1, and not all of them need to point to the right root cause. Formal specifications enabling the automation of such a decision (to keep the human out of the loop and eliminate the need for an oracle) do not exist, and as discussed in the introduction, such decisions can also depend on the context and goals of root-causing, and on how deep the root-causing should go. We expect that further automation can be added in this direction by mining expert validators' decisions through labeling the identified intermediate root-causing hints and the rules that can isolate the associated subgroups of failures, by extending the ideas proposed in Subsection 5.7. These rules can be reused for predicting and classifying failure root causes in future regressions. In particular, rules and models learned during pre-silicon validation can be reused for post-silicon validation and later for tracking anomalous behaviors during in-field operation.

7.2 Comparing SD with RL and Decision Trees for Root-Causing

Work [19] also compares the SD methodology with decision trees and classification rule learning. Here are the main relevant points:

  1. The usual goal of classification rule learning is to generate separate models, one for each class, inducing class characteristics in terms of properties (features) occurring in the descriptions of training examples. Therefore, classification rule learning results in characteristic descriptions, generated separately for each class by repeatedly applying the covering algorithm, meaning that the positive samples covered by already generated rules are dropped from the dataset when generating a new rule.

  2. In decision tree learning, on the other hand, the rules which can be formed from paths leading from the root node to class labels in the leaves represent discriminant descriptions, formed from properties that best discriminate between the classes. As rules formed from decision tree paths form discriminant descriptions, they are inappropriate for solving subgroup discovery tasks, which aim at describing subgroups by their characteristic properties.

  3. Subgroup discovery can be seen as classification rule learning by treating a subgroup description cond as a rule cond ⟶ class, where usually class is the positive class. However, the fact that in classification rule learning the rules have been generated by a covering algorithm hinders their usefulness for subgroup discovery. Only the first few rules induced by a covering algorithm may be of interest as subgroup descriptions with sufficient coverage. Subsequent rules are induced from smaller and strongly biased example subsets, excluding the positive examples covered by previously induced rules. This bias prevents the covering algorithm from inducing descriptions uncovering significant subgroup properties of the entire population.

In the context of root-causing from regression test trace logs, the above statements from [19] mean that both decision tree and rule-based algorithms are suitable for separating the two classes (fail vs. pass), which is the explicit aim of decision tree algorithms. But this is not enough, since we also want to separate between subgroups of failures, as not all failures have the same root cause; and here decision tree algorithms, with their discriminant descriptions, are less appropriate than subgroup discovery techniques, whose characteristic descriptions aim at finding subgroups of failures with the same explanation (the same root cause).

Our root-causing task is different from both the SD and RL tasks in that the dataset we work with can change during the root-causing procedure: we might add and/or remove some of the features in the initial input dataset, and might drop some of the positive samples in the initial dataset, in order to focus the root-causing procedure on one or more relevant modules and on a subgroup of positive samples. In addition, we want to learn subgroups of failures with the same root cause (a task relevant to SD) as well as rules that can serve as predictors of previously learned root causes in the future (a task relevant to RL). Therefore, both SD and RL are relevant for our root-causing procedure, and so is RA, which was designed to serve both of the above tasks. Both the objective and subjective measures of quality and interestingness discussed earlier remain relevant for this more general setting.

Work [3] uses a modified version of C4.5 decision trees to root-cause failures in an internet service system, and demonstrates their usefulness. However, due to the classification mindset in searching for root causes, which requires rules with high-quality discriminant descriptions, their tree splitting and branch selection heuristics prefer branches that maximize positive samples at the respective leaf nodes (see the ranking criterion in their Section 3.2), and operate under the assumption that there are only a few independent sources of error (see the noise filtering criterion in their Section 3.2). Therefore, unlike the SD algorithms, their search algorithm does not encourage splitting subgroups of failing samples further even if doing so would result in higher-quality subgroups according to relevant quality functions like WRAcc or Lift. As a consequence, the approach in [3] does not fully address the challenge of root-causing failures caused by a combination of multiple factors, which are usually the most difficult to debug. Indeed, for the root-causing example discussed in some detail in Section 6 (for dataset Data8, with the RA results in the first RCT iteration reported in Table 2), the final tree generated by the decision tree approach of [3] would not contain branches leading to the three final subgroups with full root causes identified by our root-causing procedure, since each of these three subgroups covers fewer positive samples than the subgroup of the seven failures obtained by a first split that maximizes the number of positive samples in one of the branches.

7.3 Comparing SD with Other Approaches to Root-Causing from Traces

Work [25] proposed to learn the behavior of the system from passing execution traces and to compare failing runs against the learned models to identify the accepted and the unaccepted event (sub)sequences. The latter – the suspicious sequences – are presented to the validator together with a representation of the behaviors that are acceptable instead of them, according to the learned models. While the identified illegal event sequences localize the bugs well within the respective modules and serve as valuable input to the validators, further root-causing might be required, since the learned models are abstract and do not capture all fine-grained legal behaviors (real systems are too complex to be fully learned as automata, and not every event from the traces will be part of the learned model). The root-causing techniques proposed in our work can be considered complementary to the above research goal: we already have a well-defined specification of the flow of interest, thus it is straightforward to determine where the failing trace stops conforming to it, and instead our aim is to find observable event combinations (not necessarily occurring in the flow specification) that jointly explain the root cause.

In [26], the authors address the root-causing problem in post-silicon SoC (System on Chip) validation. They apply clustering to features engineered from traces of message-passing protocols at the SoC level. (Their usage of the term message is similar to our usage of the term event, and messages might have additional attributes: "In SoC designs, a message can be viewed as an assignment of Boolean values to the interface signals of a hardware IP".) Their approach to feature engineering from traces is very different from ours. The idea that they follow is this: "Logical bugs in designs can be considered as triggering corner-case design behavior; which is infrequent and deviant from normal design behavior. In ML parlance, outlier detection is a technique to identify infrequent and deviant data points, called outliers whereas normal data points are called inliers". As a consequence, their task is to identify patterns of consecutive messages that cause the bugs, which they call anomalous message sequences. The infrequency criterion can be captured through the entropy associated with message sequences, and the deviancy is captured through a distance measure defined on the sequences. The techniques proposed in our work have been used as part of product development in the pre-silicon stage to root-cause regression suite failures, and at the early stages of product development there are often more failing tests than passing ones (this is the case for 5 of the 8 datasets used in our experiments), and there are usually many reasons for the failures. Thus the outlier detection approach adopted in [26] is more relevant at the late stages of product development and post-silicon, when bugs are relatively rare.

To summarize, once trace logs are processed and a dataset is engineered – for which we proposed multiple widely applicable feature engineering techniques – many ML and statistical approaches can be applied for the purpose of root-causing failures, and these techniques are often complementary. We believe that RL and SD based techniques are among the most suitable for root-causing regression test failures when there might be multiple causes of failure and each cause might require analyzing a complex interplay between multiple factors to explain it. In addition, semi-supervised or unsupervised ML techniques such as clustering, outlier detection, anomaly detection, novelty detection, change detection and more can be used.

8 Conclusions and Future Work

In this work we have proposed a new method for applying ML techniques to datasets engineered from a large set of test logs in order to generate powerful root-causing hints which can be mined and reused. The methodology proposed in this work has been applied successfully to debug the PkgC8 flow of Intel's Power Management protocol.

While our focus was on root-causing regression test failures, and in particular on Power Management flows, our approach is very widely applicable. Indeed, the types of features that we generate from traces apply also when there is no specification of the flow, or when the flow is specified but its instances cannot be tracked accurately in the trace and only pass and fail labels are known for each trace log. In that case, all events become non-flow events, and occurrence, count, ordering and timing features are defined for these events as well. Furthermore, the ability to encode sequential event occurrence information in trace logs through distributions, and to root-cause using properties of these distributions, makes our approach very general. As already mentioned, when pass and fail labels are not available for trace logs, the encoding of sequential event occurrence information into distributions still works, and unsupervised ML techniques can be used instead of Rule Learning and Subgroup Discovery techniques.

As far as Rule Learning and Subgroup Discovery are concerned, our main contribution is the extension of these research frameworks to one where the input dataset is not fixed and can be refined based on the learning acquired through applications of RL and SD algorithms: both features and samples can be added to the input dataset or removed from it. As a consequence, our framework supports much more realistic usages of RL and SD. In addition, the introduction of quality orderings allows more freedom in specifying criteria for rule selection, and we believe they will be useful for other applications of RL and SD as well.

Learning sequential rules is somewhat similar to learning assertions from traces in the form of event sequences or automata, or in a richer formalism that can directly capture concurrency information, but now the intention is to learn patterns of incorrect scenarios rather than correct ones (where incorrect scenarios are not necessarily rare scenarios). In our iterative root-causing and learned-bug mining approach, the rules can be mined by a validation expert as sequential rules or in other formats. Further automation of mining sequential rules that capture bug patterns is an important direction for future work.

9 Acknowledgments

We would like to thank Yossef Lampe, Eli Singerman, Yael Abarbanel, Orly Cohen, Vladislav Keel and Dana Musmar for their contributions to this work: discussions, implementation of respective parts of the root-causing tool, initial experiments, and actual usage in Intel products.

References

  • [1] Martin Atzmueller. Subgroup discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(1):35–49, 2015.
  • [2] Francesco C Billari, Johannes Fürnkranz, and Alexia Prskawetz. Timing, sequencing, and quantum of life course events: A machine learning approach. European Journal of Population/Revue Européenne de Démographie, 22:37–65, 2006.
  • [3] Mike Chen, Alice X Zheng, Jim Lloyd, Michael I Jordan, and Eric Brewer. Failure diagnosis using decision trees. In International Conference on Autonomic Computing, 2004. Proceedings., pages 36–43. IEEE, 2004.
  • [4] Wen Chen, Li-Chung Wang, Jay Bhadra, and Magdy Abadir. Simulation knowledge extraction and reuse in constrained random processor verification. In Proceedings of the 50th Annual Design Automation Conference, pages 1–6, 2013.
  • [5] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261–283, 1989.
  • [6] Nicolas De Jay, Simon Papillon-Cavanagh, Catharina Olsen, Nehme El-Hachem, Gianluca Bontempi, and Benjamin Haibe-Kains. mRMRe: an R package for parallelized mRMR ensemble feature selection. Bioinformatics, 29(18):2365–2368, 2013.
  • [7] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of bioinformatics and computational biology, 3(02):185–205, 2005.
  • [8] Ranan Fraer, Doron Keren, Zurab Khasidashvili, Alexander Novakovsky, Avi Puder, Eli Singerman, Eran Talmor, Moshe Y Vardi, and Jin Yang. From visual to logical formalisms for SoC validation. In 2014 Twelfth ACM/IEEE Conference on Formal Methods and Models for Codesign (MEMOCODE), pages 165–174. IEEE, 2014.
  • [9] Johannes Fürnkranz, Dragan Gamberger, and Nada Lavrač. Foundations of rule learning. Springer Science & Business Media, 2012.
  • [10] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182, 2003.
  • [11] Thomas Hirsch and Birgit Hofer. Root cause prediction based on bug reports. In 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pages 171–176. IEEE, 2020.
  • [12] Yoav Hollander, Matthew Morley, and Amos Noy. The e language: A fresh separation of concerns. In Proceedings Technology of Object-Oriented Languages and Systems. TOOLS 38, pages 41–50. IEEE, 2001.
  • [13] Intel Corporation. 11th Generation Intel Core TM Processor Desktop, Datasheet Volume 1. Intel Corporation, Santa Clara, California, United States, 2022.
  • [14] Yoav Katz, Michal Rimon, Avi Ziv, and Gai Shaked. Learning microarchitectural behaviors to improve stimuli generation quality. In Proceedings of the 48th Design Automation Conference, pages 848–853, 2011.
  • [15] Zurab Khasidashvili and Adam J Norman. Range analysis and applications to root causing. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 298–307. IEEE, 2019.
  • [16] Zurab Khasidashvili and Adam J Norman. Feature range analysis. International Journal of Data Science and Analytics, 11(3):195–219, 2021.
  • [17] Willi Klosgen. A multipattern and multistrategy discovery assistant. Advances in knowledge discovery and data mining, pages 249–271, 1996.
  • [18] Sotiris Kotsiantis and Dimitris Kanellopoulos. Discretization techniques: A recent survey. GESTS International Transactions on Computer Science and Engineering, 32(1):47–58, 2006.
  • [19] Nada Lavrač, Bojan Cestnik, Dragan Gamberger, and Peter Flach. Decision support through subgroup discovery: three case studies and the lessons learned. Machine Learning, 57:115–143, 2004.
  • [20] Nada Lavrac, Branko Kavsek, Peter Flach, and Ljupco Todorovski. Subgroup discovery with CN2-SD. J. Mach. Learn. Res., 5(2):153–188, 2004.
  • [21] Florian Lemmerich and Martin Becker. pysubgroup: Easy-to-use subgroup discovery in Python. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 658–662, 2018.
  • [22] Biruk Mammo, Milind Furia, Valeria Bertacco, Scott Mahlke, and Daya S Khudia. BugMD: Automatic mismatch diagnosis for bug triaging. In 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–7. IEEE, 2016.
  • [23] A Manukovsky, Y Shlepnev, Z Khasidashvili, and E Zalianski. Machine learning applications for COM based simulation of 112 Gb systems. Signal Integrity Journal, 2020.
  • [24] Alex Manukovsky, Yuriy Shlepnev, and Zurab Khasidashvili. Machine learning based design space exploration and applications to signal integrity analysis of 112 Gb SerDes systems. In 2021 IEEE 71st Electronic Components and Technology Conference (ECTC), pages 1234–1245. IEEE, 2021.
  • [25] Leonardo Mariani and Fabrizio Pastore. Automated identification of failure causes in system logs. In 2008 19th International Symposium on Software Reliability Engineering (ISSRE), pages 117–126. IEEE, 2008.
  • [26] Debjit Pal and Shobha Vasudevan. Feature engineering for scalable application-level post-silicon debugging. arXiv preprint arXiv:2102.04554, 2021.
  • [27] Gregory Piatetsky-Shapiro and Christopher J Matheus. The interestingness of deviations. In Proceedings of the AAAI-94 workshop on Knowledge Discovery in Databases, volume 1, pages 25–36, 1994.
  • [28] Abraham Silberschatz and Alexander Tuzhilin. On subjective measures of interestingness in knowledge discovery. In KDD, volume 95, pages 275–281, 1995.
  • [29] John Stearley. Towards informatic analysis of syslogs. In 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No. 04EX935), pages 309–318. IEEE, 2004.
  • [30] Matthijs Van Leeuwen and Arno Knobbe. Diverse subgroup set discovery. Data Mining and Knowledge Discovery, 25:208–242, 2012.
  • [31] Stefan Wrobel. An algorithm for multi-relational discovery of subgroups. In Principles of Data Mining and Knowledge Discovery: First European Symposium, PKDD’97 Trondheim, Norway, June 24–27, 1997 Proceedings 1, pages 78–87. Springer, 1997.