Fault Injection based Failure Analysis of three CentOS-like Operating Systems

Hao Xu [email protected] Tongji UniversityChina , Yuxi Hu [email protected] Alibaba Inc.China , Bolong Tan [email protected] Alibaba Inc.China , Xiaohai Shi [email protected] Alibaba Inc.China , Zhangjun Lu [email protected] Tongji UniversityChina , Wei Zhang [email protected] Tongji UniversityChina and Jianhui Jiang [email protected] Tongji UniversityChina

Abstract.

The reliability of operating systems (OSs) has always been a focus of attention in both academia and industry. This paper presents a novel methodology for failure analysis of Linux-like OSs based on fault injection. Initially, we systematically define Linux-like fault modes by adopting the method of fault mode generation based on functional module division of Linux-like OSs. Subsequently, we construct a Linux fault mode library and develop a fault injection tool based on the fault mode library (FIFML). Finally, we conduct fault injection experiments on three commercial Linux distributions, i.e. CentOS, Anolis OS and openEuler. To reasonably divide the influence level and reduce the impact of performance fluctuations, we introduce three performance metrics including performance threshold, performance standard deviation, and the worst performance. Additionally, we employ failure rate, performance degradation rate, and performance level after fault injection to quantitatively describe the influence of fault injection on OS performance. Utilizing these metrics, we measure the performance disparity of three OSs. The experimental results show that Anolis OS outperforms CentOS and openEuler in virtual file systems, network interfaces, and process management systems. These findings underscore the significance of our methodology in assessing OS reliability. By comprehensively examining various fault modes and their effects on performance, our methodology contributes to a better understanding of OS failure behavior and provides insights for future system optimization.

CentOS, Anolis OS, openEuler, failure analysis, fault injection

^†^†conference: The 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering; 11 - 17 November, 2023; San Francisco, USA^†^†ccs: General and reference Reliability^†^†ccs: Software and its engineering Software testing and debugging

1. Introduction

A failure in any component of a software system can result in system failure, which negatively impacts the user experience. The operating system (OS) should ideally maintain good quality of experience even in the presence of faults, such as isolating a failed component without compromising system reliability and responsiveness. Reliability is an essential attribute of product or system quality, and failure analysis plays a crucial role in identifying the causes of system reliability problems. It helps identify failure modes, failure causes, and failure mechanisms, and propose measures to prevent future failures, thus improving the system’s reliability.

Fault injection techniques are crucial for identifying and analyzing system failures in a controlled environment. By intentionally causing failures, developers can study the chain of events leading up to system failures, identify root causes, and improve the system to prevent future failures. However, developers are often hesitant to implement fault injection techniques due to the growing complexity and cost of modern systems, particularly when it comes to deciding which faults or errors to inject. In this context, failure analysis based on fault injection has emerged as an effective means to analyze system reliability, providing a valuable tool for identifying and addressing system failures in software systems.

Various researchers have studied system reliability by injecting faults into different components of the system. For example, Amarnath (b1, ) and Winter (b2, ) have injected faults into CPU registers and drivers, while Yoshimura et al. (b3, ) have injected faults into application processes to investigate the propagation of errors to the kernel. Arlat (b4, ) et al. have analyzed the representativeness of OS fault injection and compared the similarities and differences between real faults and injected faults using Linux as an example. Since the 1990s, academia has published numerous achievements in the development and application of fault injection tools, including fault injector in FTAPE (b18, ), fast fault injection FSFI (b21, ), and OS fault injection tools such as SockPFI (b19, ) and TFI (b20, ). However, the majority of these works suffer from issues related to the application or coverage of their analysis methods, such as coarse granularity, inaccuracy, and low coverage. Additionally, some fault injection tools for OSs have been developed by industry, including FailViz (b26, ) and DICE (b27, ), but these tools have limitations such as poor expandability, inaccurate injection, and poor authenticity.

This paper aims to present a methodology for failure analysis of Linux-like OSs based on fault injection. Specifically, we provide OS developers with an effective method for defining, executing, and analyzing faults, as well as a fault injection tool. In this paper, we focus on CentOS 8, Alibaba’s Anolis OS, and Huawei’s openEuler, and provide examples at each step to help readers better understand the process or adopt these steps in practical scenarios. Our contributions include: 1) We performed fault injection on internal functions at the kernel level and constructed a fault mode library for Linux. We also provided a systematic, lightweight guidelines for defining fault models through OS’s function modules, which is applicable to large software systems. 2) We developed a fault injection tool based on the fault mode library (FIFML) that enables the implementation of fault models through reusable fault injection plugins and an extensible architecture. 3) We conducted comparative experiments on CentOS 8, Anolis OS and openEuler distributions to analyze their failures and identify deficiencies in terms of reliability. We also compared the performance of these three open-source OSs.

The remainder of this paper is organized as follows. In Section 2, we introduce the methodology for failure analysis of Linux-like OSs based on fault injection, which includes key concepts and our proposed fault injection method based on internal function. Section 3 presents the failure analysis method based on fault injection, the fault mode generation method based on functional module division and the Linux fault mode library we have built. In Section 4, we detail the design of the fault injection tool FIFML and its verification for validity, completeness, and authenticity. The Section 5 describes the design and analysis of fault injection experiments on three OSs, as well as a performance analysis experiment. Finally, in Section 6, we conclude this paper.

2. Methodology for Failure Analysis of Linux-like OSs based on Fault Injection

2.1. Background

The goal of our failure analysis is to evaluate the impact of failures on Linux-like OSs. Evaluating system credibility and verifying the correctness of fault-tolerant mechanism design and implementation are crucial steps in the developing dependable computer systems. Artificially manufacturing faults in a verified system, by replacing natural physical faults with artificial faults, can effectively shorten the verification cycle. Fault injection is a process that involves using artificial methods to generate faults based on a selected fault type. The generated faults are then applied to the target system to accelerate its error process. The fault response information is collected and analyzed in a timely manner, and the final results are provided to technical personnel for research purposes. Therefore, our approach is to inject faults in the internal functions of the kernel and analyze the responsiveness and availability of the whole system in the presence of the injected faults. A key part of this approach is the generation of failure modes, since failures can generate from any hardware and software component of the OS. Since the Linux-like OSs is a large, complex software system, this method emphasizes software faults injection, which is the main cause of reliability problems in such systems.

In software operation, a state that fails to provide the expected function is called a software failure, and failure analysis is an essential method for improving software reliability. The cause that has the potential to cause failure is defined as a fault, which can also be described as the adjudged or hypothesized cause of an incorrect state of the system (b5, ). Software faults can originate from open source OS development or from proprietary customizations introduced by developers. The approach performs a series of fault injection experiments, where each experiment simulates a fault in an internal function of the kernel.

The Linux fault mode library is constructed by analyzing (including theoretical and experimental analysis), abstracting and summarizing the performance of possible failures in the Linux systems. Fault injection for Linux-like OSs is based on the Linux fault mode library, we select a single fault mode or combines multiple fault modes, generate a fault injection scheme, and forms a fault instance according to the conditions of the target system and its operating environment for implementation. Fault injection in Linux can accurately simulate kernel faults at the function level. Linux fault injection implemented by simulation include kernel function fault simulation, delay simulation, buffer data error simulation, system downtime restart simulation, kernel denial of service simulation, CPU usage simulation, etc.

Refer to caption — Figure 1. General Structure of Linux

2.2. Fault Injection based on Internal Functions

The current main measures of fault injection are orthogonal defect classification (ODC) based fault injection (b6, ) and interface-based fault injection (b28, ). The faults injected by ODC based fault injection are usually dormant, which leads to inefficiency because most of them are difficult to trigger. In contrast, interface-based fault injection does not have these problems. Its principle is to simulate the impact of faults in components by injecting exceptions or invalid values into their interfaces. Interface-based fault injection is a practical and popular approach that does not require any changes to the component’s code, and since each fault injection generates errors, the efficiency of experimentation is ensured. However, our fault injection method based on internal functions has all the advantages of interface-based fault injection. Moreover, due to the further analysis of the internal functions, this fault injection method is more accurate and complete, and has wider coverage.

The principle of fault injection based on internal function is to inject faults into the layer below the target level, and the impact is observed at the target level to extract the fault mode. The Linux system consists of four parts, namely hardware, Linux kernel, Linux services, and user applications, as shown in Figure1. Among them, the Linux kernel is mainly used for abstraction and access scheduling of hardware resources. Fault injection for the Linux should be performed at the system call layer if it is based on the interface and at the kernel function layer if it is based on internal functions. When a Linux kernel code defect is activated, it will cause a Linux runtime fault, and the fault will propagate to the system call layer, causing an error when interacting with the upper-layer applications of Linux, and may even cause some programs to fail. The Linux kernel defect-fault-failure propagation mechanism is shown in Figure2.

To provide a more specific example, the system call corresponding to “ $Set$ $memory$ $permission$ ” is mprotect(). The function description of mprotect() is as follows:

SYSCALL\_DEFINE3(mprotect,\ unsigned\ long,start,

\ size\_t,\ len,\ unsigned\ long,\ prot)

mprotect() calls the function do_mprotect_pkey(), which is the target of fault injection. The function description of do_mprotect_p
key() (The pkey value is fixed to -1 when mprotect() is executed) is as follows:

static\ int\ do\_mprotect\_pkey(unsigned\ long\ start,

size\_t\ len,unsigned\ long\ prot,int\ pkey)

The analysis of do_mprotect_pkey() is as follows:

(1)

If the first parameter start is wrong, the fault mode of “invalid address parameter error when setting memory permission” may occur. When memory access permission is set, start is an invalid pointer, or is not an integer multiple of the page size. It is not aligned with the memory page. The function returns -EINVAL for “Invalid argument”.
(2)

If the second parameter len is wrong, the fault mode of “memory overflow error when setting memory permissions” may occur. When memory access permission is set, the specified address space [start, start+len-1] exceeds the actual process address space ( this is also related to the start parameter). The function returns -ENOMEM for “Out of memory”.
(3)

If the third parameter prot is wrong, the fault mode of “permission conflict error when setting memory permissions” may occur. After the mmap() is used to map a read-only file to memory, if you try to give this memory PROT_WRITE permission (write permission), the service will be denied due to a permission conflict error. The function returns -EACCES for “Permission denied”.

3. Failure Analysis based on Fault Injection and Fault Mode Generation based on Functional Module Division

3.1. Failure Analysis Process and Methods

When analyzing the failure of a Linux system, Linux can be divided into various abstraction levels, such as kernel functions, system calls, and user applications. The Linux kernel mainly abstracts and schedules hardware resources. The high complexity of Linux makes it challenging to generate fault modes as different functional modules behave differently after a fault occurs. Faults, errors, or failures in the runtime of Linux are typically caused by code defects being activated, hardware errors, even Linux application errors. Performing failure analysis directly on Linux code is time-consuming and laborious.

To address these challenges, we established a failure analysis process for Linux-like OSs based on fault injection as shown in Figure 3. We have previously used ODC and software-implemented fault injection (b6, ) to perform failure analysis on the Linux kernel code. The experimental results indicate that 59.66 $\%$ of kernel code are related to the Linux system call chain, which covers nearly 60 $\%$ of Linux kernel code failures. Therefore, we adopt the fault mode generation based on functional module division to generate Linux fault modes and construct a Linux fault mode library to support Linux failure analysis based on fault injection.

Step 1: Divide functional modules.

In this step, we first divide the target software (Linux-like OS) into functional modules, i.e. file system, interrupt management, IO management, memory management, and process management. After obtaining the large functional modules, we identify the key functions of each module (such as setting memory permissions, sending signals to processes, opening files, managing semaphore sets, sending messages, etc.). This process needs experience to ensure the quality of the subsequent fault mode extraction. It is not necessary to fully understand the internal details of the module, but it is necessary to master the interface and parameter information.

Step 2: Define fault modes derived from functional modules.

Failure scenarios are used in Linux failure analysis to help find the direct causes of these failures. The types of failure scenarios we consider are defined according to the relevant literature (b14-17) and our previous work on modeling the effects of software failures (b8-11), including:

(1)

Kernel function failure: The return value of a system call is wrong, and it is unable to provide correct service.
(2)

Delay: The long delay makes it unable to provide services within the user’s waiting time.
(3)

Buffer data error: The buffer data is wrong, and it cannot provide the correct service.
(4)

System downtime and restart: The system breaks down or restarts, and it cannot provide services.
(5)

Kernel Denial of Service: The system kernel refuses to provide services to users.
(6)

System CPU usage increases: The increase in CPU occupancy affects the services provided to users.

By combining past experiences with fault injection and analyzing functional issues in Linux, we can identify fault modes in a Linux-like OS. Although this process may not uncover the root cause of the failure, it is a comprehensive, convenient, and effective way to identify issues. The list of issues includes the following:

(1)

Does the function return an abnormal result due to invalid parameters or values outside the specified range of the called kernel function? If yes, then one or more failure modes corresponding to “kernel function failure” are generated.
(2)

Does the function return an abnormal result due to insufficient privileges or the wrong address of the called kernel function? If yes, then one or more failure modes corresponding to “kernel function failure” are generated.
(3)

Is the function delayed or unresponsive during execution? If yes, then one or more failure modes corresponding to “delay failure” are generated.
(4)

Does the function produce an error result of unknown type or return incorrect results due to external data errors? If yes, then one or more failure modes corresponding to “buffer data error” are generated.
(5)

Does the function cause data synchronization errors or abnormal data due to unexpected system shutdown and restart? If yes, then one or more failure modes corresponding to “system shutdown and restart” are generated.
(6)

Does the function generate an exception due to the process being killed? If yes, then one or more failure modes corresponding to “kernel denial of service” are generated.
(7)

Does the function affect the service due to the system overload? If yes, then one or more failure modes corresponding to “system CPU usage failure” are generated.

We obtain the corresponding fault modes by analyzing related functions. It is necessary to consider the system calls and related parameters used by the functions. The fault modes are generated based on characteristics of system call and kernal functions, and added to the fault mode library. This process is repeated until no any new fault mode is generated, then the construction of the fault mode library is finished.

Step 3: Perform fault injection and failure analysis.

The fault injection tool injects faults into the OS. After a fault is injected, it is necessary to track the propagation process of errors generated by the injected fault, monitor the behavior of the OS, and analyze the failure caused by the injected fault. The fault injection tool must accurately inject or simulate the fault modes, and the OS source code must not be changed during the injection process.

3.2. Examples of Linux Fault Mode Generation

The analysis and collection process of the Linux fault modes served by the “ $Manage$ $Semaphore$ $Set$ ” operation is given below. According to the list, the function may occur “kernel function failure”. The system call that corresponding to this operation when Linux is running is semop(). The function description of semop() is as follows:

SYSCALL\_DEFINE3(semop,\ int,\ semid,

struct\ sembuf\ \_\_user\ *,\ tsops,\ unsigned,\ nsops)

The function semop() completes its task by calling do_semtimedo
p(). The function description of do_semtimedop() is as follows:

static\ long\ do\_semtimedop(int\ semid,

struct\ sembuf\ \_\_user\ *tsops,unsigned\ nsops,

const\ struct\ timespec64\ *timeout)

To analyze the Linux failure mode caused by the wrong parameter of do_semtimedop(), consider the following:

(1)

If the first parameter semid is incorrect, the following fault modes may occur: “Semaphore does not exist error encountered while managing semaphore set”: When operating on a semaphore set, the semaphore does not exist. The function returns -EIDRM for “Identifier removed”.
(2)

If the second parameter tsops is incorrect, the following fault modes may occur: “Illegal address error encountered when managing semaphore set”: When operating on a semaphore set, the address pointed to by the parameter sops or timeout is inaccessible. The function returns -EFAULT for “Bad address”.
(3)

If the third parameter nsops is incorrect, the following fault modes may occur: “Number out of range error when managing semaphore set”: When operating on a semaphore set, the number in the specified semaphore set is out of range. The function returns -E2BIG for “Argument list too long”.

3.3. Linux Fault Mode Library

We constructed a Linux fault mode library with 2870 fault modes by analyzing Linux source code and its 152 system calls. Each fault mode in the library has the following attributes: simulation_method
_id, fault_mode_id, fault_mode_name, simulation_method_type and attach_data. The simulation_method_id serves as a unique identifier for each fault mode, while the fault_mode_id denotes the type of each fault. The fault_mode_name provides a specific description of the fault mode, and the simulation_method_type indicates the type of simulation method used for this fault mode. Additionally, the attach_data field contains various additional parameters that are required for injecting this fault mode.

4. Fault Injection Tool based on the Linux Fault Mode Library

FIFML is a fault injection tool that enables the analysis of Linux failures without modifying the OS source code, thereby eliminating the need to rebuild the code base. The structure of FIFML is illustrated in Figure 4, it comprises the control module, the fault mode library management module, the fault injection scheme generation module, the fault injection module, the log module, and the Linux fault mode library.

4.1. Fault Injection Scheme Generation Module

The fault injection scheme generation module is responsible for receiving fault injection control information and generating a fault injection scheme based on acquired fault simulation method data. This module has two parts: the schema querier, which queries specific fault mode data from the fault mode library management module based on the control command, and the scheme generator, which generates the fault injection scheme using the fault mode data.

The fault injection scheme generation module executes the control command issued by the control module, generates a fault injection scheme based on the fault mode library and the control information, and sends it to the fault injection module. The fault injection scheme specifies the fault set to be simulated, the simulation method for each fault, the start time and duration of each fault, and the scope of influence of each fault. The generated fault injection scheme includes fault mode information, fault simulation method information, fault occurrence time, fault duration, target process PID, target file name, and other information.

4.2. Fault Injection Module

The fault injection module implements the fault injection scheme and performs the specified fault simulation operation. This module is composed of three parts: the license checker, which verifies the validity of the tool license; the parameter checker, which ensures the legality of the received control commands; and the injector, which performs fault injection according to the fault injection scheme and manages the injected faults. Upon receiving a control command from the control module, the fault injection module interacts with the Linux kernel to execute the fault injection.

4.3. Verification

To validate the fault injection method, the fault mode library, and the injection tool, we conducted a series of experiments. We began by verifying the fault mode library’s validity using tools to ensure that each item was effective and met the expected effect. In Section 3.A, the functional division of Step 1 ensures the theoretical completeness of the failure modes, and these fault modes were reviewed by our industry partners. Then, we conducted authenticity verification. We worked closely with industry partners to generate fault modes that completely covered a series of problems they actually encountered. Additionally, we analyzed a series of real failure scenarios submitted by the Anolis open-source community, they can be successfully simulated and reproduced.

We collected 786 bug reports submitted to the Anolis open-source community for the Anolis OS 8 series OS between December 2021 and April 2023. After screening the reports, we removed items that were not clearly described, caused by users’ improper operations, or marked as errata. We then identified a set of 144 verifiable bugs, of which 74 could be simulated using our method. The remaining 70 bugs could not be simulated through fault injection, and were categorized as follows:

(1) Missing packages and dependencies during installation: There were 43 such errors, whose root cause lay in the installation source or other parts, and could not be simulated through fault injection. Although these errors are common in actual use, they can be quickly fixed by updating, so do not require failure analysis.

(2) User application failure: There were 12 bugs in this category, such as garbled characters displayed in user applications or vulnerabilities in the Apache HTTP server, which could be resolved by using a newer version of the application. Although it is difficult to simulate faults in user applications using the fault injection based on internal functions targeting system calls, simulation targeting application program interfaces can be effective.

(3) Hardware failure: This type of error included 5 bugs, such as the inability to read the network card or an outdated BIOS version, whose root cause was hardware-related. Although fault mode simulations can help to observe and analyze the failure performance of upper layers, it is important to note that the fundamental problem lies in the hardware and should not be ignored.

(4) Compatibility errors: There were 10 bugs caused by adaptation or kernel compatibility, which could not be well simulated through fault modes .

Here is an example of fault simulation: Bug 1953 describes a Linux issue where pipe_resize_ring() lacks a lock. The function of pipe_resize_ring() is to adjust the size of the pipe ring. During the adjustment process, the content of the old pointer is copied to the new pointer and the space pointed to by the old pointer is released. Without the lock, another function post_one_notification() could cause a use-after-free error and result in an oops when inserting into the buffer. This scenario clearly exhibits the characteristics of kernel function failure, and fault injection into pipe_resize_ring() and post_one_notification() can directly simulate the failure. Furthermore, the fixed pipe_resize_ring() utilizes spin_lock_irq() and spin_unlock_irq() to control the lock. Fault injection into these functions can produce additional failure scenarios.

5. Experiments and Analysis

We conducted two types of experiments on three Linux-like OSs, i.e. CentOS-stream-8 (kernel 4.18.0-383.el8.x86_64), AnolisOS-8.4-GA (kernel 4.18.0-305.an8.x86_64) and openEuler-20.03-LTS-SP3 (kernel 4.19.90-2112.8.0.0131.oe1.x86_64). The first type of experiment was the failure analysis based on fault injection, while the second was the performance analysis of the three OSs, both with and without fault injection. For each type of experiment, we ran Phoronix Test Suite as workload to warm up the system for 30 seconds. Phoronix includes a wide range of benchmarks covering different types of workloads, such as CPU, memory, disk, and graphics. The benchmarks are designed to stress-test the system and provide a comprehensive evaluation of its performance. We selected Phoronix as workload for fault injection because it provides a challenging and realistic set of workloads that can reveal potential failures in the system. Then, we performed fault injection using FIFML. Finally, we rebooted the system to its initial state before each experiment. We collected log files at the end of each experiment to determine whether the fault was successfully injected and whether it affected the OS performance. This experimental design enabled us to evaluate the robustness of the three OSs against faults and compare their performance with and without fault injection.

5.1. The Experiment of Failure Analysis based on Fault Injection

5.1.1. Experiment Description

The results of the failure analysis experiment are classified into five levels: (1) Crash: The system crashes after a fault is injected; (2) No Response: The system does not respond after a fault is injected; (3) Affect: Some processes are affected and cannot run normally after a fault is injected; (4) Light: After a fault is injected, the system can run but is affected, and some processes cannot provide services correctly; (5) Normal: The system runs normally after a fault is injected. For each experiment, a single fault is injected each time, and kernel functions from each of the three OSs are injected three times. Figure 5 shows the aggregated results of fault injection for the three OSs. Each bar in the figure represents the corresponding fault distribution.

For each OS, we performed 1250, 229, 621, 173, and 597 fault injection experiments on the modules of file system (fs), interrupt management (int), IO management (io), memory management (mem), and process management (pro), respectively. The number of fault injection experiments depends on the number of fault modes in the fault mode library. The proportion of file system and IO management modules that are seriously affected by injected faults (Crash or No Response) is relatively low, while the proportion of interrupt management module and process management module that are seriously affected by injected faults is relatively high. Comparing results show that Anolis OS performs well in file system and IO management, while openEuler performs better in memory management and process management. The three OSs performed similarly in interrupt management, but Anolis OS crashes less frequently.

5.1.2. Experimental Analysis and Comparison

In our experiments, we observed one type of failure related to the read file operation. We simulated IO errors, invalid parameter errors, and operation blocking by returning error codes such as EIO, EINVAL, and EWOULDBLOCK from the system call sys_read(). Figure 6 shows an example of a system call returning an error code that leads to a runtime exception. CentOS and openEuler will crash directly without providing any meaningful information to the user, leading to a bad user experience. Anolis OS, on the other hand, will prompt the user that the library file cannot be read and provide a message of “Input/output error.”

Another type of function failure occurs during the operation of obtaining the file status according to the file descriptor. This failure is generated by returning EFAULT, ENOTDIR, and ELOOP from sys_newfstat(). CentOS and openEuler cannot execute file operations and prompt the user that the library file cannot be loaded and gives Error 40 information. Some functions of Anolis OS, such as linking files produced during the compilation process, will prompt Error, but the overall performance of the system is stable. Anolis OS is more robust and can provide some help for subsequent failure recovery.

5.1.3. Case Study

In addition, some of the differences between various kernel versions of the components are highly related to the fault modes. Specifically, in the case of the sys_munmap(), sys_getdents64() and the remaining 8 system calls, a total of 39 fault modes have been identified that can cause certain kernel versions of the systemd components to fail, leading to core dumps and system crashes. Additionally, failures related to sys_fadvise64() have similar behavior. To illustrate an example of failure analysis related to sys_munmap(), we provided an analysis of the fault caused by the linux-mem-26-7-1 fault mode injected in our experiment. Upon injection, the system encounters an invalid parameter error while canceling the mapping of files or devices to memory, leading to a segmentation fault and system restart in Anolis. In contrast, the other two OSs behave normally. Further investigation using system log comparison reveals that the kernel was encountering issues while handling page faults triggered by sys_munmap(). Specifically, in the experiment using the 4.18.0-305.an8 Anolis kernel, the page fault handling was found to be similar to that of the Linux 4.18.0-305.el8 kernel. Consequently, this similarity resulted in significant contention for the mmap_sem lock. The situation became precarious when a fault was injected into the memory-mapping class, causing a core dump. Due to this injected fault, the process (often systemd) failed to release the mmap_sem lock, leading to improper handling of the page fault. As a result, a “stack guard page was hit” error occurred, ultimately leading to a kernel panic due to stack overflow. This problem has been resolved in subsequent versions of the kernel.

The differences observed during the experiment and the corresponding analysis results were confirmed by our industry partners. We presented our findings in a report that was shared with the developers for their consideration. This feedback allowed the developers to gain valuable insights into the performance of their OS under various fault conditions, and helped to identify objects for improvement.

5.2. Performance Analysis Experiment

5.2.1. The Selection of the Analysis Object

To assess the impact of faults on Linux kernel performance, virtual file systems, interprocess communication systems, memory management systems, process management systems, and network interface modules of CentOS 8, Anolis OS, and openEuler are used as evaluation objects. Combined with the feature that FIFML can perform accurate fault injection for Linux kernel functions, the performance of their internal common functions before and after fault injection was measured, and the performance metric was represented by the delay in the completion of the function call.

In order to screen out the kernel functions and system calls that are frequently used in each object, we use strace (b29, ) and ftrace (b30, ) tools to build the call stack of Linux that runs continuously for a period of time under the working environment. The functions to be tested which are screened out according to the function calls in the stack involved 59 functions from five system modules.

5.2.2. Performance Evaluation Metrics

The performance analysis experiment obtains the time required for the execution of each function before fault injection and the time required for the execution of each function after fault injection ( $TAF$ ), and the influence level of fault on the execution performance of the function was divided according to the difference in the time required for each operation before and after fault injection. In order to reasonably divide the influence level and reduce the impact of performance fluctuations, we defined a performance threshold ( $PT$ ) for the execution of each function, which is used to estimate the maximum performance fluctuation of each function before fault injection. In addition, we define the performance standard deviation ( $PSD$ ) of each function before fault injection. The worst performance ( $WP$ ) of each operation before fault injection is obtained from the time required for each function to execute before fault injection. According to the 3 Sigma criterion in statistics, if the performance float follows the normal distribution, 99.73 $\%$ of the data is within the range of 3 standard deviations of the mean. Therefore, in order to further improve the confidence level, under the condition of known standard deviations and the worst performance, the maximum allowed performance float is the sum of 3 times $PSD$ and $WP$ , i.e.

(1)

PT=3\times PSD+WP

According to the relationship between $TAF$ and $PT$ , three levels of influence are defined

(1) No Influence: $TAF<PT$

(2) Mild Influence: $PT\leq TAF<5\times PT$

(3) Serious Influence: $TAF\geq 5\times PT$

“No Impact” indicates that there is little difference in operation performance before and after fault injection. “Mild Impact” indicates that operational performance deteriorates before and after fault injection, but the performance difference is considered tolerable by the user and has little impact on the performance of user processes and system services. “Serious Impact” indicates that operation performance deteriorates significantly before and after fault injection, and the performance of user processes and system services are affected.

We define the failure rate ( $FR$ ), performance degradation rate ( $PDR$ ), and performance level after fault injection ( $PL_{f}$ ) to quantitatively describe the impact of fault injection on performance of CentOS 8, Anolis OS, and openEuler .

$PL_{f}$ depends on fault impact degree ( $FID$ ). $FID$ is related to the number of faults that cause a serious impact ( $NSF$ ), the number of faults that cause a mild impact ( $NMF$ ), the number of faults that cause no responding ( $NNR$ ), and the number of faults that cause system crash ( $NOC$ ). But they have different impacts, In order to distinguish this difference, $NSF$ , $NMF$ , $NNR$ and $NOC$ are assigned different weights respectively.

(2)

FID=0.4\times NSF+1.0\times NMF+2.0\times NNR+3.0\times NOC

The larger the $FID$ , the more faults that affect performance, the higher the fault impact, the smaller the $PL_{f}$ . And also, the larger the $FID$ , the smaller the effect on $PL_{f}$ . At the same time, $PL_{f}$ is also related to the total number of faults ( $N$ ). So $PL_{f}$ can be expressed as

(3)

PL_{f}=e^{\frac{-FID}{N}}\times 100\%

$FR$ means the probability that the system is in failure state and loses the ability to provide services. It depends on $N$ , $NNR$ and $NOC$ . The smaller the $FR$ , the stronger the system’s ability to maintain services after fault injection. So $FR$ can be expressed as

(4)

FR=(NNR+NOC)/N\times 100\%

$PDR$ is the probability that a fault affects system performance and results in performance degradation (including loss of service capability). It depends on $N$ , $NMF$ , $NSF$ , $NNR$ , and $NOC$ . A smaller $PDR$ indicates a smaller impact on system performance. So $PDR$ can be expressed as

(5)

PDR=(NMF+NSF+NNR+NOC)/N\times 100\%

5.2.3. Experimental Analysis and Comparison

On the whole, Anolis OS performs better than CentOS and openEuler in virtual file systems, network interfaces, and process management systems. openEuler provides better inter-process communication performance than CentOS and Anolis OS, but provides more NSFS for virtual file systems and network interfaces than CentOS and Anolis OS.

Table 1 gives the number of faults that cause various impacts ( $NMF$ / $NSF$ / $NNR$ / $NOC$ ). The results show that, compared with CentOS and openEuler, Anolis OS has lower $NMF$ , $NSF$ and $NOC$ , but also has higher $NNR$ , which indicates that more faults will cause Anolis OS to enter the non-responsive state and lose the ability to provide services. OpenEuler has the lowest $NOC$ and produces the fewest system crashes, but its larger $NMF$ and $NSF$ indicate that more failures may affect openEuler performance to some extent.

Table 1. Number of Faults That Cause Various Impacts (

NMF

NSF

NNR

NOC

)

System name	$NMF$	$NSF$	$NNR$	$NOC$
CentOS 8	179	81	70	328
Anolis OS	156	64	139	224
openEuler	243	223	80	303

Based on the experimental data and the performance metrics defined above, we obtain $PL_{f}$ , $FR$ and $PDR$ for CentOS 8, Anolis OS and openEuler. Table 2 gives $PL_{f}$ , $FR$ and $PDR$ of the three OSs. Anolis OS obtains the best $PL_{f}$ , $FR$ and $PDR$ scores because of lower $NMF$ , $NSF$ and $NOC$ . CentOS 8 lags behind Anolis OS in $PDR$ and $PL_{f}$ . openEuler performs not so good on $PDR$ and $PL_{f}$ due to higher $NMF$ , $NSF$ and $NOC$ , but better on $FR$ than CentOS 8.

Table 2. Statistics of

FR

PDR

and

PL_{F}

System name	$FR$	$PDR$	$PL_{F}$
CentOS 8	13.87 $\%$	22.93 $\%$	67.49 $\%$
Anolis OS	12.65 $\%$	20.42 $\%$	70.47 $\%$
openEuler	13.34 $\%$	29.58 $\%$	65.47 $\%$

5.2.4. Case Study

Fault linux-mem-percpu-38-7-1 can simulate the fault of Linux system by causing __alloc_percpu() error while allocating dynamic percpu area. In the performance analysis experiment, this fault greatly increases the execution delay of kmem_cache
_create() for CentOS and AnolisOS, but does not affect the execution delay of kmem_cache_create() for openEuler.

Therefore, we used ftrace to construct the call trees of three OSs, and analyzed the call chains involving kmem_cache_create() by contrast. CentOS and Anolis OS exhibit the same control flow. After executing find_mergeable(), __kmem_cache_alias() internally returns and continues to execute subsequent segments including __alloc_percpu(). Fault is triggered when __alloc_percpu() is executed, leading to system performance degradation. However, openEuler has a different control flow. __kmem_cache_alias() internally executes sysfs_slab_alias after executing find_mergeable(). And __alloc_percpu() is not included in subsequent program segments. Therefore, system performance is not affected.

After further analysis, it is found that the main reason for the above control flow difference is the inconsistency in the implementation of find_mergeable() in the three OSs. In find_mergeable(), CentOS and AnolisOS return NULL while openEuler returns a kmem_cache structure on execution of the list_for_each_entry_reve
rse segment. Inconsistencies in the returned results from find_merge
able() directly impact system performance discrepancies.

6. Conclusion

This paper presents a novel methodology for failure analysis of Linux-like OSs based on fault injection. We systematically define Linux-like fault modes by adopting the method of fault mode generation based on functional module division of Linux-like OSs. We construct a Linux fault mode library and develop a fault injection tool based on the fault mode library (FIFML). Then, we conduct fault injection experiments on three commercial Linux distributions, i.e. CentOS, Anolis OS and openEuler. To reasonably divide the influence level and reduce the impact of performance fluctuations, we introduce several performance metrics to measure the performance disparity of three OSs. The experimental results show that Anolis OS outperforms CentOS and openEuler in virtual file systems, network interfaces, and process management systems. These findings underscore the significance of our methodology in assessing OS reliability. By comprehensively examining various fault modes and their effects on performance, our methodology contributes to a better understanding of OS failure behavior and provides insights for future system optimization.

However, the completeness of the fault mode library is crucial for the effectiveness of FIFML in Linux failure analysis. Due to the high complexity of system failure behavior, the failure analysis time is long, and it is challenging to ensure the coverage of the fault mode library.

As for future work, our first priority is to further improve the completeness of the Linux fault mode library. To reduce the manual effort required to construct a Linux fault mode library, we will combine practical experience with automated tools. Furthermore, we will employ fault injection based reliability testing to estimate OS reliability. We will explore the use of machine learning techniques to further enhance the accuracy and efficiency of our fault injection method. In addition, we will also extend the fault injection method to other levels (mainly user mode).

References

(1) Rakshith Amarnath, Shashank Nagesh Bhat, Peter Munk, and Eike Thaden. 2018. A Fault Injection Approach to Evaluate Soft-Error Dependability of System Calls. In Proceedings of the IEEE International Symposium on Software Reliability Engineering Workshops, Memphis, TN, USA, October 15-18, 2018, Sudipto Ghosh, Roberto Natella, Bojan Cukic, Robin S. Poston, and Nuno Laranjeiropp (Eds.). IEEE Computer Society, 71–76. https://doi.org/10.1109/ISSREW.2018.00-28
(2) Stefan Winter, Oliver Schwahn, Roberto Natella, Neeraj Suri, and Domenico Cotroneo. 2016. No PAIN, no gain? The utility of parallel fault injections. In Proceedings of Software Engineering 2016, Fachtagung des GI-Fachbereichs Softwaretechnik, February 23-26, 2016, Wien, Österreich, Jens Knoop, and Uwe Zdun (Eds.). GI, 45-46. https://doi.org/10.18420/se2016-04
(3) Takeshi Yoshimura, Hiroshi Yamada, and Kenji Kono. 2013. Using Fault Injection to Analyze the Scope of Error Propagation in Linux. IPSJ Online Transactions 6 (2013), 55-64. https://doi.org/10.2197/ipsjtrans.6.55
(4) Jean Arlat, Jean-Charles Fabre, Manuel Rodríguez, and Frédéric Salles. 2002. Dependability of COTS Microkernel-Based Systems. IEEE Trans. Computers 51, 2 (2002), 138-163. https://doi.org/10.1109/12.980005
(5) Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl E. Landwehr. 2004. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Trans. Dependable Secur. Comput. 1, 1 (2004), 11-33. https://doi.org/10.1109/TDSC.2004.2
(6) Ram Chillarege and Shriram Biyani. 1994. Identifying risk using ODC based growth models. In Proceedings of the 5th International Symposium on Software Reliability Engineering, Monterey, CA, USA, November 6-9, 1994. IEEE Computer Society, 282-288. https://doi.org/10.1109/ISSRE.1994.341388
(7) Domenico Cotroneo, Antonio Ken Iannillo, Roberto Natella, and Stefano Rosiello. 2021. Dependability Assessment of the Android OS Through Fault Injection. IEEE Transactions on Reliability 70, 1 (2021), 346-361. https://doi.org/10.1109/TR.2019.2954384
(8) Xuliang Chen, Jianhui Jiang, Wei Zhang, and Xuzei Xia. 2020. Fault Diagnosis for Open Source Software Based on Dynamic Tracking. In Proceedings of the 7th International Conference on Dependable Systems and Their Applications, Xi’an, China, November 28-29, 2020. IEEE, 263-268. https://doi.org/10.1109/DSA51864.2020.00047
(9) Xuze Xia, Wei Zhang, and Jianhui Jiang. 2019. Ensemble Methods for Anomaly Detection Based on System Log. In Proceedings of the 24th IEEE Pacific Rim International Symposium on Dependable Computing, Kyoto, Japan, December 1-3, 2019. IEEE, 93-94. https://doi.org/10.1109/PRDC47002.2019.00034
(10) Ang Jin, Jianhui Jiang, Jiawei Hu, and Jungang Lou. 2008. A PIN-Based Dynamic Software Fault Injection System. In Proceedings of the 9th International Conference for Young Computer Scientists, Zhang Jia Jie, Hunan, China, November 18-21, 2008. IEEE Computer Society, 2160-2167. https://doi.org/10.1109/ICYCS.2008.329
(11) Ang Jin and Jianhui Jiang. 2009. Fault Injection Scheme for Embedded Systems at Machine Code Level and Verification. In Proceedings of the 15th IEEE Pacific Rim International Symposium on Dependable Computing, Shanghai, China, Nov. 2009. IEEE, 55-62. https://doi.org/10.1109/PRDC.2009.68
(12) Min Xie. 1997. Handbook of Software Reliability Engineering, by Michael R. Lyu (Editor), McGraw-Hill and IEEE Computer Society, 1996 (Book Review). Softw. Test. Verification Reliab. 7, 1 (1997), 59-60. https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-1689(199703)7:1$%$3C59::AID-STVR126$%$3E3.0.CO;2-5
(13) Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. 2018. Operating System Concepts, 10th Edition. Wiley, 2018. ISBN: 978-1-118-06333-0
(14) Domenico Cotroneo, Anna Lanzaro, and Roberto Natella. 2018. Faultprog: Testing the Accuracy of Binary-Level Software Fault Injection. IEEE Trans. Dependable Secur. Comput. 15, 1 (2018), 40-53. https://doi.org/10.1109/TDSC.2016.2522968
(15) Roberto Natella, Domenico Cotroneo, and Henrique Madeira. 2016. Assessing Dependability with Software Fault Injection: A Survey. ACM Comput. Surv. 48, 3 (2016), Article 44, 55 pages. https://doi.org/10.1145/2841425
(16) Marcello Cinque, Domenico Cotroneo, Raffaele Della Corte, and Antonio Pecchia. 2014. Assessing Direct Monitoring Techniques to Analyze Failures of Critical Industrial Systems. In Proceedings of the 25th IEEE International Symposium on Software Reliability Engineering, Naples, Italy, November 3-6, 2014. IEEE Computer Society, 212-222. https://doi.org/10.1109/ISSRE.2014.30
(17) João Durães and Henrique Madeira. 2006. Emulation of Software Faults: A Field Data Study and a Practical Approach. In IEEE Transactions on Software Engineering, vol. 32, no. 11, November 2006, pages 849-867. https://doi.org/10.1109/TSE.2006.113
(18) Timothy K. Tsai and Ravishankar K. Iyer. 1995. Measuring Fault Tolerance with the FTAPE fault injection tool. In Quantitative Evaluation of Computing and Communication Systems, Berlin, Heidelberg, 1995, Heinz Beilner and Falko Bause (Eds.). Springer Berlin Heidelberg 26-40. https://doi.org/10.1007/BFb0024305
(19) Scott Dawson, Farnam Jahanian, and Todd Mitton. 1995. A Software Fault Injection Tool on Real-Time Mach. In Proceedings of the 16th IEEE Real-Time Systems Symposium, Palazzo dei Congressi, Via Matteotti, 1, Pisa, Italy, December 4-7, 1995, IEEE Computer Society, 130-140. https://doi.org/10.1109/REAL.1995.495203
(20) Cuong Pham, Long Wang, Byung-Chul Tak, Salman Baset, Chunqiang Tang, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer. 2017. Failure Diagnosis for Distributed Systems Using Targeted Fault Injection. In IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 2, 503-516. https://doi.org/10.1109/TPDS.2016.2575829
(21) Oliver Schwahn, Nicolas Coppik, Stefan Winter, and Neeraj Suri. 2018. FastFI: Accelerating Software Fault Injections. In 23rd IEEE Pacific Rim International Symposium on Dependable Computing, Taipei, Taiwan, December 4-7, 2018. IEEE, 193-202. https://doi.org/10.1109/PRDC.2018.00035
(22) Mark Sullivan and Ram Chillarege. 1991. Software Defects and their Impact on System Availability: A Study of Field Failures in Operating Systems. In Proceedings of the 1991 International Symposium on Fault-Tolerant Computing, Montreal, Canada, 1991. IEEE, 2-9. https://doi.org/10.1109/FTCS.1991.146625
(23) Wei-lun Kao, Ravishankar K. Iyer, and Dong Tang. 1993. FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior under Faults. In IEEE Trans. Software Eng. 19, 11 (November 1993), 1105-1118. https://doi.org/10.1109/32.256857
(24) Anshuman Thakur, Ravishankar K. Iyer, Luke T. Young, and Inhwan Lee. 1995. Analysis of failures in the Tandem NonStop-UX Operating System. In Sixth International Symposium on Software Reliability Engineering, Toulouse, France, October 24-27, 1995. IEEE, 40-50. https://doi.org/10.1109/ISSRE.1995.497642
(25) Inhwan Lee and Ravishankar K. Iyer. 1995. Software Dependability in the Tandem GUARDIAN System. IEEE Trans. Software Eng. 21, 5 (May 1995), 455-467. https://doi.org/10.1109/32.387474
(26) Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella, and Nematollah Bidokhti. 2019. FailViz: A Tool for Visualizing Fault Injection Experiments in Distributed Systems. In Proceedings of the 15th European Dependable Computing Conference, Naples, Italy, September 17-20, 2019. IEEE, 145-148. https://doi.org/10.1109/EDCC.2019.00036
(27) Craig Sheridan, Darren Whigham, and Matej Artac. 2017. DICE Fault Injection Tool. CoRR abs/1707.06420 (2017). arXiv:1707.06420. http://arxiv.org/abs/1707.06420
(28) Roberto Natella, Stefan Winter, Domenico Cotroneo, and Neeraj Suri. 2020. Analyzing the Effects of Bugs on Software Interfaces. IEEE Trans. Software Eng. 46, 3 (2020), 280-301. https://doi.org/10.1109/TSE.2018.2850755
(29) strace, [Online]. Available: https://strace.io/. [Accessed: Aug. 2, 2023].
(30) ftrace, [Online]. Available: https://www.kernel.org/doc/html/latest/trace/ftrace.html. [Accessed: March 13, 2023].
(31) Tahar Jarboui, Jean Arlat, Yves Crouzet, and Karama Kanoun. 2002. Experimental Analysis of the Errors Induced into Linux by Three Fault Injection Techniques. In Proceedings of the 2002 International Conference on Dependable Systems and Networks, Bethesda, MD, USA, June 23-26, 2002. IEEE, 331-336. https://doi.org/10.1109/DSN.2002.1028917
(32) Jörgen Christmansson and P. Santhanam. 1996. Error injection aimed at fault removal in fault tolerance mechanisms-criteria for error selection using field data on software faults. In Proceedings of the Seventh International Symposium on Software Reliability Engineering, White Plains, NY, USA, October 30-November 2, 1996. IEEE, 175-184. https://doi.org/10.1109/ISSRE.1996.558785
(33) Frédéric Salles, Manuel Rodríguez, Jean-Charles Fabre, and Jean Arlat. 1999. MetaKernels and Fault Containment Wrappers. In FTCS-29: Digest of Papers, The Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Madison, Wisconsin, USA, June 15-18, 1999. IEEE, 22-29. https://doi.org/10.1109/FTCS.1999.781030
(34) Phil Koopman and John DeVale. 2000. The Exception Handling Effectiveness of POSIX Operating Systems. IEEE Trans. Softw. Eng. 26, 9 (September 2000). IEEE, 837-848. https://doi.org/10.1109/32.877845
(35) João Durães and Henrique Madeira. 2004. Generic on Software Fault Faultloads Baseds for Dependability Benchmarking. In Proceedings of the International Conference on Dependable Systems and Networks, Florence, Italy, June 28 - July 1, 2004. IEEE, 285-294. https://doi.org/10.1109/DSN.2004.1311898
(36) Ali Kalakech, Karama Kanoun, Yves Crouzet, and Jean Arlat. 2004. Benchmarking The Dependability of Windows NT4, 2000 and XP. In Proceedings of the International Conference on Dependable Systems and Networks, Florence, Italy, June 28 - July 1, 2004. IEEE, 681-686. https://doi.org/10.1109/DSN.2004.1311938
(37) Arnaud Albinet, Jean Arlat, and Jean-Charles Fabre. 2004. Characterization of the Impact of Faulty Drivers on the Robustness of the Linux Kernel. In Proceedings of the International Conference on Dependable Systems and Networks, Florence, Italy, June 28 - July 1, 2004. IEEE, 867-876. https://doi.org/10.1109/DSN.2004.1311957
(38) Andreas Johansson and Neeraj Suri. 2005. Error Propagation Profiling of Operating Systems. In Proceedings of the International Conference on Dependable Systems and Networks, Yokohama, Japan, June 28 - July 1, 2005. IEEE, 86-95. https://doi.org/10.1109/DSN.2005.45
(39) Andreas Johansson, Neeraj Suri, and Brendan Murphy. 2007. On the Impact of Injection Triggers for OS Robustness Evaluation. In Proceedings of the 18th IEEE International Symposium on Software Reliability Engineering, Trollhättan, Sweden, November 5-9, 2007. IEEE, 127-126. https://doi.org/10.1109/ISSRE.2007.23