Discovering causal factors from Chinese construction accident cases

Zi-jian Ni [email protected] Wei Liu Faculty of Economics and Management, Dalian University of Technology
2 Ling Gong Rd., Dalian 116024, Liaoning, P. R. China

Abstract

In China, construction has killed more people than any other industry since 2012. The factors that lead to an accident interact in complex ways, and real accident data is the key to revealing the mechanism among them. Data from questionnaires and interviews, however, has inherent defects: many behaviors that affect safety are illegal, so the truthfulness of some respondents' answers is doubtful. In China, most accident cases come from accident investigation reports, whose core is finding the cause of the accident and affirming liability. In this paper, a series of NLP technologies is used to extract and organize the causal factors of construction accidents from Chinese incident case texts. Finally, three kinds of neglected causal factors are discovered through data analysis.

keywords:

Role mismatch, Natural Language Processing (NLP), Accident cases, Accident causes


1 Introduction

China is the largest construction market in the world; its construction output was about 24.8 trillion Yuan in 2019. Safety in the construction industry remains challenging today [1]. The death toll reached 1152 in 2003 and then fell for 11 consecutive years. Despite the holistic improvement of the country's occupational health and safety management system, however, accidents in the construction industry have killed more people than those in coal mines since 2012. In 2019, there were 904 on-the-job construction deaths, the most among all types of industrial accidents. Many studies hold that construction is one of the most dangerous industries because accidents on project sites have complicated and multiple causes [2, 3].

In [4], accident causation theories were divided into four generations: accident proneness theories, domino theories, injury epidemiology models, and system theories. In the last generation, occupational safety is affected by factors at different levels that interact in complex ways. Further, two kinds of elements are analyzed in construction accident system models. One is the risk factor, which influences safety performance. The other is the causal factor, which, as the name implies, results in the accident.

Generally, system models of risk factors are built with empirical methods. The research begins with statements or hypotheses. After data are collected through questionnaires and interviews, an appropriate statistical test determines whether each hypothesis is supported. Many specific aspects of construction safety have been discussed with this methodology. Thirteen main risk factors from 55 papers are summarized in a useful review [5]. In construction accident analysis, however, this kind of empirical research has an essential weakness. Many behaviors that affect safety are illegal, so the truthfulness of some respondents' answers in questionnaires and interviews is doubtful.

Moreover, being unsafe is not the same as having an accident. Revealing the causality of accidents is essential to distinguish the factors that require action from those that do not [6]. Research shows that the causes of accidents vary substantially between industries [7]. Most causal models of construction accidents [2, 8, 9, 10] originated from systematic and holistic thinking about accidents, but not all of them have been validated with sufficient real accident data. In [10], for example, only a small sample of fatal accidents (26 of 211 accident cases in total) was used to understand the underlying causes. As another example, causal factors were divided into proximal and distal ones in [2], but because of the limitations of the available accident data, only the proximal factors were validated [2]. The ConAC (Construction Accident Causality) framework [8] has been verified [6, 11] and applied [12] several times. At the same time, analyzing the data is costly: to extract data from 84000 words, the study in [11] engaged four analysts. Even if more skilled professionals can be hired, keeping the criteria for extracting information consistent remains problematic. As a result, real data is the key to analyzing construction accidents.

Accident data is hard to collect not only in construction but also in other industries, because it is impossible to conduct reproducible accident experiments as in other disciplines. Past accident analysis and learning (PAL) has always been one of the two pillars on which the edifice of occupational safety research rests [13]. For PAL, accident cases are one of the most important sources [14]. In China, most of the cases come from accident investigation reports [15]. Finding the cause of the accident and affirming liability are the core of incident investigation reports [16]. In other words, the causal factors of every accident, including illegal acts, can be found in these documents. NLP (Natural Language Processing) can help people analyze unstructured text more efficiently. In this paper, causal factors of construction accidents are extracted from Chinese free text with Automatic Keyphrase Extraction (AKE) [17]. AKE includes a series of NLP technologies and is discussed in section 3. Furthermore, we believe that our extraction framework can be applied not only to construction incidents but also to Chinese accident case texts from other industries.

Evaluating the necessity and sufficiency of causal factors in a data set requires feeding all valid accident data from a short period into various algorithms to obtain the correlations. Because our Chinese cases are typical incidents collected over an extended period (more than 25 years), a holistic causal model cannot be proposed in this paper. But because more accurate information is extracted and summarized, some neglected causal factors are revealed. Meanwhile, empirical studies may also be inspired by these real accident data. Finally, the organized data is shared online for further studies (https://github.com/liuwei-965/Digital-management-of-Chinese-accident-cases).

The rest of this paper is structured as follows. The data source and the structure of the case text are introduced in section 2. A framework for extracting causal factors from texts is proposed in section 3. In section 4, role mismatch and the other two neglected factors are discussed.

2 Accident causes in Chinese accident cases

Case title
    Project profile… (optional)
        Client
        Contractor
    Details of the accident and emergency response
    Causes of the accident
        Direct cause
        Indirect cause
    Accident severity
    Liabilities
    Accident prevention and improvements… (optional)

Figure 1: The structure of Chinese construction accident cases.

In our study, all 267 typical construction accident cases come from esafety.cn, the information platform of the Ministry of Emergency Management of China. The text structure of a Chinese accident case is shown in Fig. 1. Some projects are small, and none of their stakeholders are corporations, yet the loss is severe. For such cases, the chapters about the project profile and about accident prevention and improvements are sometimes omitted. The accident causes, however, are the core of the document.

Moreover, there are cause-and-effect relationships between the two kinds of causes in the cases. Direct causes have two main factors: unsafe behaviors of people and hazardous states of matters, where matters include equipment, material, and surroundings. The indirect cause can lead to direct causes and thus increase the risk of projects, which is similar to the distal causes in [2]. Details of the indirect cause are discussed in section 4.1.

Figure 2: Cross-track model: cause-and-effect relationships between direct causes and indirect causes.

As discussed above, most of the Chinese cases come from accident investigation reports. The legal basis of investigation reports is the Regulation for the Investigation of Casualty Accidents of China (RICAC) [18]. Professor Sui, one of the counselors of RICAC, proposed an accident model called the cross-track model [19]. This model illustrates the relation between direct and indirect causes. In Fig. 2, unsafe behaviors and hazardous states of matters are understood as consequences of management failures. Moreover, the accident is not an inevitable outcome, but as the project goes on, the expected loss increases until an accident happens.

3 Extract causal factors for each accident

Although the two specific sub-sections of the case texts are rich in causal factors, not every word is about the cause. Fig. 3 shows an example from a real case text, in which the underlined parts describe the causes of the accident. Each description is made up of a sequence of words rather than a single word; such a sequence is called a phrase [20]. Moreover, Fig. 3 shows that only some phrases are valuable for analyzing causal factors. In this paper, these phrases are called keyphrases. Finally, it usually takes more than one keyphrase to express the full meaning of an accident cause. Such a set of keyphrases is called a fact.

Each case text contains more than one fact about the accident. In this section, a framework based on a series of NLP techniques is proposed to extract these facts. Because natural language is complex and ambiguous, there are many ways of expressing the same meaning [21], so it is almost impossible to find every fact in free text. Our study therefore rests on one assumption: people and organizations keep repeating the same mistakes. As a result, if our framework can extract the frequent causes automatically, the manual workload for the rest will be greatly reduced.

3.1 Framework for extraction

Automatic Keyphrase Extraction (AKE) is a natural language processing (NLP) task whose approaches may be divided into two kinds [17]: supervised and unsupervised. Although current supervised AKE approaches deliver promising results, both labeling the data and manually sorting facts are time-consuming. Unsupervised AKE, which needs no training data, is a recent trend aimed at discovering the underlying structure of a document [22]. The graph-based model is a typical unsupervised AKE method [23, 24], in which the whole text is converted into a network with words as nodes. Each node receives a weight, based on different criteria, that measures its importance. The nodes are then ranked by weight, and the top-ranked nodes are selected as keyphrases. However, a graph-based model cannot guarantee that a phrase representing the text theme is a top-ranking term if it does not occur frequently in the text. In the text of Fig. 3, some phrases occur much more often than any of the keyphrases. For example, Songyuan appears 3 times, Property Management appears 4 times, and Property Management Co., LTD. appears 3 times.

Although the weighted network’s topology can not be used as an extraction basis in our data, the following two features are still valuable.

1. The core of a causal factor is usually a verb phrase in Chinese.
2. If people repeat the same mistake, which is the assumption discussed in section 3, a causal factor in one case will also appear in others.

Based on the above features of the case text, the whole workflow is depicted in Fig. 4. The core parts of this process are candidate identification (step 2) and feature engineering (step 3). In step 2, candidate phrase sets are identified through dependency syntax analysis (DSA) and heuristic rules, which extract the core meaning of every sentence. In step 3, candidate phrase sets with similar semantics are brought together by semantic clustering to form keyphrase sets (facts).

3.2 Case text pre-processing

In the pre-processing stage, the text data is converted into a machine-readable format to reduce its complexity. In Chinese, a part of a sentence that provides additional information for the sentence is called a sense group [25]. The sense groups of a sentence are separated by commas, semicolons, and full stops. We believe that a sense group can retain the whole meaning of a fact. Therefore, after noisy symbols are removed, sentences are segmented at these three kinds of punctuation. In our study, the resulting segments are called candidate clauses.
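The segmentation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_candidate_clauses`, the delimiter set, and the choice of "noisy symbols" (whitespace and brackets) are hypothetical, since the exact character sets are not given in the text.

```python
import re
from typing import List

# Assumption: Chinese and ASCII commas/semicolons plus the Chinese full stop
# stand for the "three kinds of punctuation"; the exact set is not specified.
CLAUSE_DELIMITERS = re.compile(r"[，；。,;]")
# Assumption: whitespace and brackets are treated as noisy symbols here,
# purely for illustration.
NOISE_PATTERN = re.compile(r"[\s()（）\[\]【】]")


def split_candidate_clauses(text: str) -> List[str]:
    """Remove noisy symbols, then cut the text at commas, semicolons, full stops."""
    cleaned = NOISE_PATTERN.sub("", text)
    return [part for part in CLAUSE_DELIMITERS.split(cleaned) if part]


if __name__ == "__main__":
    sample = "未设置警示标志，安全检查不到位；现场管理混乱。"
    for clause in split_candidate_clauses(sample):
        print(clause)
```

Empty segments (for example after the final full stop) are dropped, so every returned item is a non-empty candidate clause.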

3.3 Identification of candidate phrases sets

In this stage, candidate clauses will be transformed into candidate phrase sets.

To detect all candidate phrase sets, previous studies used three main methods: N-gram based [26, 27], Part-Of-Speech (POS) sequence based [28], and a combination of both [29]. All of these methods are forms of lexical analysis. According to the characteristics of our data, a novel method based on syntactic analysis is proposed in this paper. Dependency parsing is a vital grammar analysis tool [30]. Dependency grammar directly describes binary grammatical relations between words, rather than the constituents and structure of individual phrases.

Fig. 5 shows two examples of sentence dependency syntax analysis (DSA). A Chinese sentence is cut into words (or phrases). In Fig. 5, each of them is shown in the top part of a word box, and the bottom part of the box gives its sequence number and part of speech. Above the word boxes, each directed edge runs from a headword to its dependent, and the labels all come from a fixed library of syntax relations [31]. The dependency structure must contain a root node, which is the head of the others. Note that if the sequence number of the headword is less than that of the dependent, the arc is called a reverse syntax relation.

With heuristic rules, candidate phrase sets are extracted from the DSA results of the candidate clauses. Generally, the root of a sentence is a verb phrase, which is the core of a causal factor. The extraction therefore starts from the root phrase. The following two rules then help to find the remaining candidate phrases, if they exist.

1. Starting from the root, its nearest dependent is extracted.
2. The headword and the dependent of the reverse syntax relation nearest to the root are extracted, provided the relation is one of the following: direct object (dobj), object of preposition (pobj), or adjectival complement (acomp).

In sub-figure b of Fig. 5, the last phrase is the root of the whole sentence, so the sentence contains no reverse syntax relation. Following rule 1, the nearest dependent of the root, safety inspection, is extracted. By these rules, the candidate phrase set of this clause is ‘safety inspection not appropriate implementation’.
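The two heuristic rules can be illustrated on an already parsed clause. The sketch below is hypothetical: the `Phrase` record, the function `extract_candidate_set`, and the reading of "nearest" as the smallest difference in sequence number are assumptions made for illustration, and the dependency parse itself is taken as given rather than produced by any particular parser.

```python
from dataclasses import dataclass
from typing import List, Optional

# Relations named in rule 2.
REVERSE_RELATIONS = {"dobj", "pobj", "acomp"}


@dataclass
class Phrase:
    index: int             # sequence number within the clause
    text: str
    head: Optional[int]    # sequence number of the headword; None for the root
    relation: str          # dependency label, e.g. "nsubj", "dobj", "root"


def extract_candidate_set(phrases: List[Phrase]) -> List[str]:
    """Apply rule 1 and rule 2, returning the kept phrases in sentence order."""
    root = next(p for p in phrases if p.head is None)
    keep = {root.index}

    # Rule 1: the dependent of the root that is nearest to it
    # (assumption: distance is measured in sequence numbers).
    dependents = [p for p in phrases if p.head == root.index]
    if dependents:
        nearest = min(dependents, key=lambda p: abs(p.index - root.index))
        keep.add(nearest.index)

    # Rule 2: among reverse arcs (headword before dependent) labelled
    # dobj/pobj/acomp, take the one nearest to the root and keep both ends.
    reverse_arcs = [
        p for p in phrases
        if p.head is not None
        and p.head < p.index
        and p.relation in REVERSE_RELATIONS
    ]
    if reverse_arcs:
        arc = min(reverse_arcs, key=lambda p: abs(p.head - root.index))
        keep.update({arc.head, arc.index})

    return [p.text for p in phrases if p.index in keep]


if __name__ == "__main__":
    clause = [
        Phrase(0, "property management company", 2, "nsubj"),
        Phrase(1, "safety inspection", 2, "nsubj"),
        Phrase(2, "not appropriately implemented", None, "root"),
    ]
    print(extract_candidate_set(clause))
```

In the toy clause the root is the last phrase, so no reverse arc exists and only rule 1 contributes, mirroring the situation described for sub-figure b.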

Moreover, an important observation in Fig. 5 is that the results differ although the candidate clause and the rules are the same. The reason is the different way of segmenting the sentence. In sub-figure a of Fig. 5, the clause is divided into words with classical methods, and the rules cannot find the complete fact. In sub-figure b, the same sentence is cut into phrases rather than words. In Chinese, a phrase is a single word or a group of words that acts as a single unit in the grammar of a sentence. In the example above, a group of words is combined into Songy Tians Property Manag Co., LTD, which is a noun phrase, and the phrase in the last box is an adverbial phrase.

A few kinds of noun phrases, such as organizations and locations, can be found with an NLP technology called named-entity recognition (NER) [32]. Other kinds of phrases, including some noun phrases, need a different method. Phrase extraction is essentially the task of identifying combinations of words that show some idiosyncrasy in a given corpus [33]. In this paper, this idiosyncrasy is evaluated by a mixed index [34]. The equation is as follows:

$$Score(b)=PMI(w_{i},w_{j})+\min(H_{L}C(b),H_{R}C(b)) \tag{1}$$

In phrase extraction, two sequential words in the text are called a bigram. Let $w_{i},w_{j}$ be a bigram in the corpus, denoted by $b$. The score of bigram $b$ in Eq. 1 is composed of two parts, which together evaluate whether $b$ can be a phrase. Specifically, $PMI(w_{i},w_{j})$ is the inner connection index and $min(H_{L}C(b),H_{R}C(b))$ is the outer independence index.

Pointwise mutual information (PMI) is one of the standard connection measures in the phrase extraction, which was introduced into NLP by Church and Hanks [35].

$$PMI(w_{i},w_{j})=\log\left(\frac{P(w_{i},w_{j})}{P(w_{i})\times P(w_{j})}\right) \tag{2}$$

$P(w_{i},w_{j})$ is the probability of the bigram $w_{i},w_{j}$, which can be obtained by maximum likelihood estimation (MLE): $P(w_{i},w_{j})=C(w_{i},w_{j})/N$, where $C(w_{i},w_{j})$ is the number of occurrences of the bigram and $N$ is the number of words in the corpus. In the same way, $P(w_{i})$ and $P(w_{j})$ can be estimated.
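As a small illustration of Eq. 2 with the MLE estimates described above, the following sketch computes PMI for every bigram of a tokenized corpus. The function name `pmi_table` is hypothetical, and the natural logarithm is an assumption because the base of the logarithm is not stated.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def pmi_table(tokens: List[str]) -> Dict[Tuple[str, str], float]:
    """PMI of each adjacent word pair, with probabilities estimated as count / N."""
    n = len(tokens)                             # N: number of words in the corpus
    unigram = Counter(tokens)                   # C(w)
    bigram = Counter(zip(tokens, tokens[1:]))   # C(w_i, w_j)
    table = {}
    for (wi, wj), count in bigram.items():
        p_ij = count / n
        p_i = unigram[wi] / n
        p_j = unigram[wj] / n
        # Assumption: natural logarithm; the base only rescales the ranking.
        table[(wi, wj)] = math.log(p_ij / (p_i * p_j))
    return table


if __name__ == "__main__":
    corpus = ["a", "b", "a", "b", "c"]
    for pair, value in pmi_table(corpus).items():
        print(pair, round(value, 4))
```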

PMI, as an inner connection index, cannot be used to evaluate whether the bigram is a complete phrase. For example, $PMI(Songyuan,Tianshan)$ (Songyuan is the name of a city; Tianshan is a mountain) may have a high value, but the whole noun phrase is ‘Songyuan Tianshan Property Manag Co., LTD’. The outer index is therefore needed to check whether a bigram is independent of its contextual words.

If the contextual words of a bigram keep changing, we believe that it may well be a complete semantic unit (phrase) [36]. Information entropy measures the chaos and unpredictability of a random variable. Let $LC(b)=\{w_{1},...,w_{n}\}$ be the set of left context words of the bigram. The left entropy of the bigram can then be defined as:

$$H_{L}C(b)=-\sum_{w_{i}\in LC(b)}P(w_{i})\log_{2}P(w_{i}) \tag{3}$$

By MLE, $P(w_{i})=C(w_{i})/N$, where $C(w_{i})$ is the number of times word $w_{i}$ appears immediately to the left of $b$, and $N$ is the total number of occurrences of all adjacent words to the left of $b$. The right entropy of $b$ is obtained in the same way.
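The left and right context entropies can be sketched as follows, using the standard Shannon form with a leading minus sign so that more varied contexts give a larger value. The function name `context_entropy` and its `side` argument are hypothetical names introduced for this illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def context_entropy(tokens: List[str], bigram: Tuple[str, str], side: str = "left") -> float:
    """Entropy (base 2) of the words adjacent to `bigram` on the given side."""
    wi, wj = bigram
    contexts = Counter()
    for k in range(len(tokens) - 1):
        if tokens[k] == wi and tokens[k + 1] == wj:
            if side == "left" and k > 0:
                contexts[tokens[k - 1]] += 1
            elif side == "right" and k + 2 < len(tokens):
                contexts[tokens[k + 2]] += 1
    total = sum(contexts.values())     # N: all context occurrences on that side
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in contexts.values())


if __name__ == "__main__":
    corpus = ["p", "a", "b", "q", "r", "a", "b", "q"]
    print(context_entropy(corpus, ("a", "b"), "left"))
    print(context_entropy(corpus, ("a", "b"), "right"))
```

In the toy corpus the bigram is preceded by two different words but always followed by the same one, so the right entropy is the smaller of the two and would be the value entering the minimum in Eq. 1.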

Finally, the bigrams are ranked by these scores, and the top-ranked ones are returned as phrases. Note that phrase extraction can be repeated until as many whole semantic units as possible are returned.

3.4 Feature engineering

In this step, accident facts are identified. In AKE, the characteristics that distinguish keyphrases from other members of the candidate set are called features. TF-IDF (term frequency - inverse document frequency) is the most popular feature [37, 38]. TF-IDF selects candidate phrase sets that are frequent in a given document but infrequent in the whole corpus. As shown in Fig. 3, however, facts cannot be identified this way because they occur too rarely within a document. Under the assumption that people always repeat the same mistakes, a novel feature is used to pick keyphrases in our study.

Repeating the same mistakes means that facts with similar semantics appear in many different candidate phrase sets. As a result, clusters based on semantic similarity can distinguish keyphrase sets from the others. Edit distance evaluates the semantic similarity [39] between two candidate phrase sets by counting the minimum number of operations required to transform one string into the other. In our work, the operations are the insertion, removal, or substitution of a character in the string. This kind of distance is called the Levenshtein distance [40], which is defined as follows.

$$sem(a,b)=lev(a,b)=\begin{cases}|a| & \text{if } |b|=0\\ |b| & \text{if } |a|=0\\ lev(tail(a),tail(b)) & \text{if } a[0]=b[0]\\ 1+\min\begin{cases}lev(tail(a),b)\\ lev(a,tail(b))\\ lev(tail(a),tail(b))\end{cases} & \text{otherwise}\end{cases} \tag{4}$$

$lev(a,b)$ is the Levenshtein distance between the two strings $a$ and $b$, and $|a|,|b|$ are their lengths. The tail of string $a$ ( $tail(a)$ ) is the string of all but the first character of $a$, and $a[n]$ is the $n$th character of string $a$, counting from character 0. For two strings $a,b$ ( $|a|>0,|b|>0$ ), if they’re exactly the same, $lev(a,b)=1$ . Further, the larger the difference between $a$ and $b$, the higher the Levenshtein distance. As a result, $lev(a,b)$ can be used to evaluate semantic similarity. Let $T_{D}$ be the set of candidate phrases. The Levenshtein distance is computed for each pair of phrases in $T_{D}$ . The result is a similarity matrix of size $|T_{D}|\times|T_{D}|$ , denoted by $SC$ .
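The recurrence in Eq. 4 can be evaluated with a standard dynamic-programming table, and the pairwise matrix follows directly. This is a minimal sketch; the names `lev` and `distance_matrix` are chosen for illustration. Under this plain recurrence two identical strings have distance 0; the fractional distances quoted in the text suggest some additional scaling, which is not specified and therefore not reproduced here.

```python
from typing import List


def lev(a: str, b: str) -> int:
    """Levenshtein distance: insertions, removals, and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,            # remove ca
                            curr[j - 1] + 1,        # insert cb
                            prev[j - 1] + cost))    # keep or substitute
        prev = curr
    return prev[-1]


def distance_matrix(phrases: List[str]) -> List[List[int]]:
    """Pairwise distances between all candidate phrase strings."""
    return [[lev(p, q) for q in phrases] for p in phrases]


if __name__ == "__main__":
    sets = ["未设置警示标志", "未设警示标志", "安全检查不到位"]
    for row in distance_matrix(sets):
        print(row)
```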

Then, $SC$ is clustered. Many algorithms can cluster efficiently, but not all of them can work on a distance matrix. DBSCAN [41] is a robust algorithm that does not need the number of clusters to be specified. DBSCAN requires two parameters. One is the radius of the neighborhood of a point, denoted by $\varepsilon$ . The other is the minimum number of points ( $minPts$ ) required to form a dense region. A point is a core point if at least $minPts$ points (including the point itself) are within distance $\varepsilon$ of it. Starting from a core point, DBSCAN clusters all points (core or non-core) that are reachable from it.
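To make the roles of $\varepsilon$ and $minPts$ concrete, here is a compact pure-Python sketch of DBSCAN that works directly on a precomputed distance matrix. It is an illustrative textbook version, not the implementation used in the study; the function name `dbscan` and the label convention (-1 for noise) are assumptions.

```python
from typing import List, Optional

NOISE = -1


def dbscan(dist: List[List[float]], eps: float, min_pts: int) -> List[int]:
    """Cluster points given pairwise distances; returns one label per point."""
    n = len(dist)
    labels: List[Optional[int]] = [None] * n   # None means "not visited yet"
    cluster = -1

    def neighbors(i: int) -> List[int]:
        # The point itself is included, as in the core-point definition.
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE                  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster            # border point reached from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:      # j is a core point: keep expanding
                queue.extend(reachable)
    return [lab if lab is not None else NOISE for lab in labels]


if __name__ == "__main__":
    points = [0, 1, 2, 10, 11, 12, 50]
    matrix = [[abs(p - q) for q in points] for p in points]
    print(dbscan(matrix, eps=1.5, min_pts=2))
```

The toy call uses a small `min_pts` only because the example has seven points; the study itself reports $minPts=5$.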

Parameters influence the result of any algorithm, and choosing them is key to every mining task. For DBSCAN, the user needs to specify $\varepsilon$ and $minPts$ .

1. $minPts$ is the desired minimum cluster size. Because people always repeat the same mistakes, $minPts$ can be set a little higher. Generally, higher values are better for data sets with noise sets and yield more significant clusters. Here, a noise set is a phrase set whose content has nothing to do with the cause of the accident. In the clustering process of our study, $minPts$ is always 5.
2. It is hard to estimate $\varepsilon$ because there are many ways to express the same meaning in free text. But it is much easier to find a minimum value of $\varepsilon$ than a maximum value. If two candidate phrase sets are the same, which is very common in $SC$ , the Levenshtein distance between them is 1, so the lower bound of $\varepsilon$ is 1. If $\varepsilon$ is chosen much too small, a large part of the data will not be clustered. An example is shown in Fig. 6: both candidate phrase sets are about warning signs being ignored, yet the Levenshtein distance between them, 1.9, is not small. That is, if $\varepsilon<1.9$ , DBSCAN will quite possibly treat them as noise sets. If $\varepsilon$ is too high, clusters merge, and most nodes end up in the same cluster.

In our work, a succinct multi-density clustering is applied to our candidate phrase sets. The algorithm is as follows:

1. For the candidate phrase sets $SC$ , $\varepsilon$ is determined by comparison.
2. With this $\varepsilon$ , clusters are mined from $SC$ .
3. If any two phrases in one cluster satisfy $lev(a,b)=Max(|a|,|b|)$ , the algorithm stops. All clusters mined by the algorithm are the result.
4. If not, the candidate phrase sets that belong to any cluster are deleted from $SC$ to form a new $SC$ , and the algorithm returns to step 1.

The whole process is depicted in Fig. 7.

The subgraph named Round1 in Fig. 8 depicts the relationship between $\varepsilon$ and the number of clusters. The whole $SC$ is clustered with different values of $\varepsilon$ from 1.1 to 1.5. The number of clusters peaks at $\varepsilon=1.32$ , which is chosen as the radius of the neighborhood in round 1. The same pattern in the number of clusters appears in the remaining data in each later round until the stopping rule is satisfied. Note that the stopping rule is $lev(a,b)=Max(|a|,|b|)$ , which means that strings $a$ and $b$ do not share a single character. In our data set, round 6 is the last clustering and $\varepsilon_{6}=3$ . The radii from $\varepsilon_{1}$ to $\varepsilon_{5}$ are depicted in Fig. 8.
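The round-by-round procedure can be sketched as a loop around any density-based clusterer. Everything named here is hypothetical: `multi_density_clustering`, the `cluster_fn` callback (any function mapping a distance matrix and a radius to labels, with -1 for noise), and `eps_grids` (one list of candidate radii per round). "Determined by comparison" is read as picking the radius that yields the most clusters, following the description of round 1; that reading is an assumption.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence

NOISE = -1


def count_clusters(labels: Sequence[int]) -> int:
    return len({lab for lab in labels if lab != NOISE})


def multi_density_clustering(
    phrases: Sequence[str],
    distance: Callable[[str, str], float],
    cluster_fn: Callable[[List[List[float]], float], List[int]],
    eps_grids: Sequence[Sequence[float]],
) -> List[List[str]]:
    """Repeat clustering on the leftover phrases, one radius grid per round."""
    remaining = list(phrases)
    result: List[List[str]] = []
    for eps_grid in eps_grids:
        if not remaining:
            break
        dist = [[distance(a, b) for b in remaining] for a in remaining]
        # Step 1: choose the radius by comparison (most clusters wins).
        eps = max(eps_grid, key=lambda e: count_clusters(cluster_fn(dist, e)))
        # Step 2: mine clusters with the chosen radius.
        labels = cluster_fn(dist, eps)
        groups: Dict[int, List[str]] = {}
        for phrase, lab in zip(remaining, labels):
            if lab != NOISE:
                groups.setdefault(lab, []).append(phrase)
        result.extend(groups.values())
        # Step 3: stop once a cluster holds a pair with distance max(|a|, |b|).
        if any(
            distance(a, b) == max(len(a), len(b))
            for group in groups.values()
            for a, b in combinations(group, 2)
        ):
            break
        # Step 4: only unclustered phrases go on to the next round.
        remaining = [p for p, lab in zip(remaining, labels) if lab == NOISE]
    return result
```

Passing the clusterer and the distance as arguments keeps the loop independent of any particular DBSCAN implementation or distance scaling.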

3.5 Summary for extracting causal factors

The 267 accident case texts, covering accidents that happened from 1998 to 2018, are input into our extraction framework. From these texts, 5598 candidate clauses are formed. Through DSA and heuristic rule extraction, they yield 5598 candidate phrase sets ready for clustering analysis. After six rounds of multi-density DBSCAN, the final result is 355 clusters, and 664 phrase sets are not contained in any cluster. Of these 664 sets, only 3 are not noise sets.

Note that only 40 of the 355 clusters are noise sets. After duplicates are removed, 1669 phrase sets about accident causation remain as the keyphrase sets. These key sets are then retrieved in each case text to measure the recall. More specifically, if a clause in the text includes a whole keyphrase set, the causal factor is identified. The recall of our framework is 87%.

4 New causal factors discovery

As discussed above, risk factors in construction are far more numerous than causal factors. An excellent review [5] investigated 55 previous papers and summarized 95 sub-factors into 13 main factors. In contrast, ConAC, which is a causal model, considers only four main factors and 19 sub-factors [11]. To reveal new causal factors, we therefore try to classify the 1669 facts into the 95 sub-factors until some facts cannot be placed. If some of these neglected facts share common characteristics, we can say that a novel causal factor is discovered.

4.1 Role mismatch

The first fact that caught our attention is ‘fake many times to defraud franchise’. Not only can this fact not be classified into any of the 13 main factors, but it also made us wonder what had happened in that accident. We went back to read the case text and found that it was a complicated accident (http://www.safehoo.com/item/157796.aspx, last accessed 2021.01.15). In brief, to save money, a big project was first disguised as a small one by lying to the government. Then the client took over the jobs of the contractor, the supervisor, and the engineering designer. Because of the improper plan, insufficient strength of the columns led to the collapse of the concrete formwork. Seven people died, and over ten were injured in this accident. It is impossible for respondents to a questionnaire or interview to admit such a severe crime.

Stakeholders are the organizations that are actively involved in the project’s work or have something to either gain or lose because of the project [42]. In China, there are five kinds of directly involved organizations, more than in other industries: the government, the client, the project supervisor, the contractor, and others (land survey, design, equipment leasing, etc.). Accidents that result from a stakeholder failing to fulfill its responsibility have drawn attention in previous studies [43, 44]. But few have noted that a stakeholder can also cause accidents by doing something beyond its scope of duties. In this paper, this is called role mismatch. One example is the client in the previous paragraph.

From the 267 case texts, six kinds of role mismatch are summarized and listed in Tab. 1. Except for supervisors, the other five kinds of stakeholders are all included. The second column of Tab. 1 gives the number of occurrences of each sub-factor in the 267 cases. If two factors frequently appear in the same accident, there may be a strong correlation between them. Before the relations between role mismatch and other causal factors are discussed, a classification of causal factors is proposed first.

Table 1: Relations between role mismatch and other causal factors

Role mismatch	Occurrence number	Other factors in the same accident
Client: making construction plan	1	6-2(1)	2-7(1)	2-12(1)	2-3(1)
Government: appoint sub-contractor	1	2-12(1)	2-7(1)	3-1(1)	2-3(1)
Contractor: construction without competency	41	2-7(33)	2-12(12)	3-4(5)	4-1(16)
		2-13(26)	1-3(11)	5-2(3)	2-11(6)
		2-3(23)	3-1(11)	4-3(2)	3-5(1)
		1-1(23)	2-4(11)	6-1(2)	2-5(15)
		5-1(22)	2-8(9)	2-9(2)	6-2(6)
		1-2(18)	2-1(9)	4-4(2)	2-10(13)
		3-2(16)	2-6(8)	4-2(1)	2-2(5)
Contractor: illegal transfer	5	2-13(5)	1-1(3)	1-2(2)	3-2(3)
		2-7(5)	2-10(3)	3-1(2)	4-3(1)
		2-3(5)	2-5(2)	2-6(1)	4-1(2)
		5-1(4)	2-11(2)	1-3(1)	2-12(2)
		2-8(3)	5-2(2)	2-2(1)	2-4(1)
		2-1(3)
Worker: labour without competency	57	1-2(48)	2-6(22)	4-3(7)	2-12(27)
		2-3(43)	1-3(19)	3-4(6)	2-1(12)
		2-7(40)	2-5(19)	2-11(5)	4-2(2)
		5-1(37)	2-10(18)	6-2(5)	2-4(25)
		4-1(35)	2-13(17)	6-1(5)	3-1(12)
		1-1(31)	3-2(16)	5-2(4)	4-4(2)
		2-8(30)	2-2(14)	2-9(3)
Designer: without competency	4	2-7(4)	1-2(3)	6-1(1)	4-4(1)
		5-1(4)	1-1(3)	2-2(1)	2-10(1)
		6-2(4)	3-2(3)	2-12(1)	3-4(1)
		2-13(4)	2-4(2)	2-1(1)	3-5(1)
		2-3(4)	3-1(2)	4-1(1)

A characteristic of the case data is that each accident fact has a stakeholder who is held accountable. As a result, the six main factors in our classification correspond to six kinds of stakeholders in the construction industry. Each stakeholder’s responsibilities for construction safety are defined in two Chinese laws [45, 46], the Construction Law and the Regulations on construction engineering quality management, respectively. The sub-factors are therefore all taken from these two laws. Compared with open interpretation [12], the definitions of these factors in the law are stricter. The main factors and their sub-factors are listed in Tab. 3 of Appendix A. Note that the number in brackets is the code of the sub-factor, and these codes correspond to the numbers in the last column of Tab. 1.

If two factors frequently appear in the same accident, there may be a strong correlation between them. Moreover, causal diagrams [47] of construction accidents can be deduced from these correlations. The factors that appear in the same accident as a role mismatch are listed in the last column of Tab. 1, with the number of co-occurrences in brackets after each factor code. Based on causal diagrams [47], the mechanism of role mismatch will be discussed in our future work. Here we offer only some preliminary observations. Except for the government appointing a sub-contractor, reducing costs and saving time may be the common purpose of the other five sub-factors.
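Counting co-occurrences of the kind reported in Tab. 1 is straightforward once each case is represented as a set of factor codes. The sketch below is illustrative only: the function name `cooccurrence_with`, the label "RM" for a role mismatch sub-factor, and the toy cases are all hypothetical and do not come from the data set.

```python
from collections import Counter
from typing import Iterable, Set


def cooccurrence_with(target: str, cases: Iterable[Set[str]]) -> Counter:
    """Count how often each other factor appears in the same case as `target`."""
    counts: Counter = Counter()
    for factors in cases:
        if target in factors:
            counts.update(f for f in factors if f != target)
    return counts


if __name__ == "__main__":
    # Hypothetical toy cases; "RM" stands for one role mismatch sub-factor.
    toy_cases = [
        {"RM", "2-7", "2-3"},
        {"RM", "2-7"},
        {"2-3", "1-1"},
    ]
    for code, n in cooccurrence_with("RM", toy_cases).most_common():
        print(f"{code}({n})")
```

The printed `code(count)` form matches the notation used in the last column of Tab. 1.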

4.2 More than one neglected factor

Table 2: The other two neglected causal factors

Main factor	Sub-factors	Accident case title
Engineering contract management	Supplier: Failure to fully perform the contract	2003-9-20 Lift cage falling
	No engineer contract	1996-3-14 The earth collapsed
	No labor contract	2003-5-15 The car crane collided with the high voltage line
	Inappropriate contract management	2002-3-15 Crane boom overturned
		2003-9-12 Pipe network trench collapse
		2002-11-6 Falling
Response to the accident	Delayed response	2001-6-20 The outer cornice collapsed
		2003-7-24 The building collapsed
	No contingency plan	2002-5-12 Explosion
		2003-11-20 Construction collapse
	Contingency plan has not been implemented	2014-9-1 Poisoning in a sewage pumping station project
	Inappropriate rescue	2003-3-29 Poisoning in a sewerage project
		2012-12-23 Carbon monoxide poisoning

The case text data reveals two other main factors that are relatively little studied in construction accidents. One is engineering contract management, and the other is the response to the accident. We believe that they have been neglected because of data problems as well: it is hard to collect enough samples because very few people have participated in rescue or contract management.

With the case texts, other scholars may be inspired by these two factors and their sub-factors. In Tab. 2, the sub-factors are listed in the second column. Note that all of these sub-factors are summarized from real accident cases, whose dates and titles are given in the last column. These case texts are included in our shared data on GitHub.

5 Conclusion

Accident data is valuable. Once the whole process of past accidents is revealed, future losses can be reduced. Very few people have ever had an accident, so data about accidents is hard to get. Typical accident cases should be studied carefully because most of these texts were paid for with lives. To go beyond the limitations of manual analysis, this paper proposes a framework, based on a series of NLP technologies, for organizing data about accident causes. Some neglected causal factors are discovered. Role mismatch will be discussed further in our future studies.

We believe that our framework can also be applied to Chinese case texts from other industries, and that it may inspire research on texts in other languages. Moreover, the social and economic climate can also affect the occupational incident system [11, 43]. As a result, other developing countries may also benefit from our study.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 71501022, 71901047, 71874020 and 71774021).

Appendix A

Table 3: Causal factors categorized by stakeholders

Stakeholder	Causal factors (ID)
1. Worker and work group	$\bullet$ Unsafe operation (1-1)
	$\bullet$ Lack of competency (1-2)
	$\bullet$ Tacit knowledge: ability, experience, knowledge, safety awareness (1-3)
2. Contractor	$\bullet$ Responsibilities of the contractor are not fulfilled (2-1)
	$\bullet$ Construction plan (2-2)
	$\bullet$ Safety, quality supervision and control (2-3)
	$\bullet$ Rules and regulation (2-4)
	$\bullet$ Safety culture and climate (2-5)
	$\bullet$ Safeguard procedures, equipment and signs (2-6)
	$\bullet$ Inappropriate construction operation (2-7)
	$\bullet$ Training and education (2-8)
	$\bullet$ Site condition (2-9)
	$\bullet$ Command (2-10)
	$\bullet$ Verification of competency (2-11)
	$\bullet$ Response to the accident (2-12)
	$\bullet$ Competency of itself (2-13)
3. Client	$\bullet$ Safety management (3-1)
	$\bullet$ Illegal construction (3-2)
	$\bullet$ Supervising contractors (3-3)
	$\bullet$ Project acceptance (3-4)
	$\bullet$ Archives management (3-5)
4. Supervisor	$\bullet$ Supervising contractors (4-1)
	$\bullet$ Communication with client (4-2)
	$\bullet$ Competency of itself (4-3)
	$\bullet$ Tacit knowledge: ability, experience, knowledge, safety awareness (4-4)
5. Government	$\bullet$ Guide and supervise (5-1)
	$\bullet$ Inappropriate punishment (punishment is too light or laws are not strictly enforced) (5-2)
	$\bullet$ Organization, mechanism, system (5-3)
6. Others	$\bullet$ Supplier: Material and equipment quality (6-1)
	$\bullet$ Designer: Survey and design (6-2)
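
The factor IDs in Table 3 encode the stakeholder in their first number and the factor in their second, which suggests a simple lookup structure. The following is a minimal sketch of that idea, not the authors' implementation: the names STAKEHOLDERS, CAUSAL_FACTORS, stakeholder_of and describe are hypothetical, and only a few rows of the table are reproduced for brevity.

```python
# Hypothetical names; only a subset of Table 3 is reproduced here.
STAKEHOLDERS = {
    1: "Worker and work group",
    2: "Contractor",
    3: "Client",
    4: "Supervisor",
    5: "Government",
    6: "Others",
}

CAUSAL_FACTORS = {
    "1-1": "Unsafe operation",
    "2-12": "Response to the accident",
    "3-3": "Supervising contractors",
    "4-1": "Supervising contractors",
    "6-2": "Designer: Survey and design",
}


def stakeholder_of(factor_id: str) -> str:
    """Return the stakeholder encoded in the first part of a factor ID."""
    stakeholder_number = int(factor_id.split("-")[0])
    return STAKEHOLDERS[stakeholder_number]


def describe(factor_id: str) -> str:
    """Combine the stakeholder and the factor name into one readable label."""
    return f"{stakeholder_of(factor_id)}: {CAUSAL_FACTORS[factor_id]}"
```

Keying on the ID rather than on the factor name matters because some names recur under different stakeholders: "Supervising contractors" appears as both 3-3 (Client) and 4-1 (Supervisor), and only the ID tells them apart.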

References

[1] C. Tam, S. Zeng, Z. Deng, Identifying elements of poor construction safety management in China, Safety Science 42 (7) (2004) 569–586.
[2] A. Suraji, A. R. Duff, S. J. Peckitt, Development of causal model of construction accident causation, Journal of Construction Engineering and Management 127 (4) (2001) 337–344.
[3] P. H. Mohseni, A. A. Farshad, R. Mirkazemi, R. J. Orak, Assessment of the living and workplace health and safety conditions of site-resident construction workers in Tehran, Iran, International Journal of Occupational Safety and Ergonomics 21 (4) (2015) 568–573.
[4] V. V. Khanzode, J. Maiti, P. K. Ray, Occupational injury and accident research: A comprehensive review, Safety Science 50 (5) (2012) 1355–1367.
[5] A. Mohammadi, M. Tavakolan, Y. Khosravi, Factors influencing safety performance on construction projects: A review, Safety Science 109 (2018) 382–397.
[6] A. Gibb, H. Lingard, M. Behm, T. Cooke, Construction accident causality: learning from different countries and differing consequences, Construction Management and Economics 32 (5) (2014) 446–459.
[7] A. M. Williamson, A.-M. Feyer, D. R. Cairns, Industry differences in accident causation, Safety Science 24 (1) (1996) 1–12.
[8] S. Hide, S. Atkinson, T. C. Pavitt, R. Haslam, A. G. Gibb, D. E. Gyi, Causal factors in construction accidents, Health and Safety Executive, 2003.
[9] P. Mitropoulos, T. S. Abdelhamid, G. A. Howell, Systems model of construction accident causation, Journal of Construction Engineering and Management 131 (7) (2005) 816–825.
[10] A. Hale, D. Walker, N. Walters, H. Bolt, Developing the understanding of underlying causes of construction fatal accidents, Safety Science 50 (10) (2012) 2020–2027.
[11] S. Winge, E. Albrechtsen, B. A. Mostue, Causal factors and connections in construction accidents, Safety Science 112 (2019) 130–141.
[12] M. Behm, A. Schneller, Application of the Loughborough construction accident causation model: a framework for organizational learning, Construction Management and Economics 31 (6) (2013) 580–595.
[13] B. Abdolhamidzadeh, T. Abbasi, D. Rashtchian, S. A. Abbasi, Domino effect in process-industry accidents–an inventory of past events and identification of some patterns, Journal of Loss Prevention in the Process Industries 24 (5) (2011) 575–593.
[14] S. Tauseef, T. Abbasi, S. A. Abbasi, Development of a new chemical process-industry accident database to assist in past accident analysis, Journal of Loss Prevention in the Process Industries 24 (4) (2011) 426–431.
[15] Ministry of Housing and Urban-Rural Development, Case analysis of construction safety accidents, China Architecture Press, 2019.
[16] State Council of the People’s Republic of China, Regulations on the reporting, investigation and handling of production safety accidents (2007).
URL http://www.gov.cn/zwgk/2007-04/19/content_588577.htm
[17] Z. A. Merrouni, B. Frikh, B. Ouhbi, Automatic keyphrase extraction: An overview of the state of the art, in: 2016 4th IEEE international colloquium on information science and technology (CiSt), IEEE, 2016, pp. 306–313.
[18] S. A. of China, Regulation for the investigation and analysis of accidents involving casualties of enterprise employees.
[19] S. ChengPeng, Casualty accident analysis and prevention principle, Industrial Safety and Environmental Protection 05 (1982) 1–8.
[20] Z. A. Merrouni, B. Frikh, B. Ouhbi, Automatic keyphrase extraction: a survey and trends, Journal of Intelligent Information Systems (2019) 1–34.
[21] J. Piskorski, R. Yangarber, Information extraction: Past, present and future, in: Multi-source, multilingual information extraction and summarization, Springer, 2013, pp. 23–49.
[22] H. H. Alrehamy, C. Walker, Semcluster: unsupervised automatic keyphrase extraction using affinity propagation, in: UK Workshop on Computational Intelligence, Springer, 2017, pp. 222–235.
[23] T. Washio, H. Motoda, State of the art of graph-based data mining, ACM SIGKDD Explorations Newsletter 5 (1) (2003) 59–68.
[24] S. S. Sonawane, P. A. Kulkarni, Graph based representation and analysis of text document: A survey of techniques, International Journal of Computer Applications 96 (19).
[25] D. X. Zhou C., Difficulties and counter measures for machine understanding of Chinese: A viewpoint of the sense-group dynamics, Modern Foreign Languages (Quarterly) 23 (2).
[26] C. Huang, Y. Tian, Z. Zhou, C. X. Ling, T. Huang, Keyphrase extraction using semantic networks structure analysis, in: Sixth International Conference on Data Mining (ICDM’06), IEEE, 2006, pp. 275–284.
[27] Z. Liu, P. Li, Y. Zheng, M. Sun, Clustering to find exemplar terms for keyphrase extraction, in: Proceedings of the 2009 conference on empirical methods in natural language processing, 2009, pp. 257–266.
[28] K. Barker, N. Cornacchia, Using noun phrase heads to extract document keyphrases, in: Conference of the Canadian Society for Computational Studies of Intelligence, Springer, 2000, pp. 40–52.
[29] M. Grineva, M. Grinev, D. Lizorkin, Extracting key terms from noisy and multitheme documents, in: Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 661–670.
[30] H. Calvo, O. J. Gambino, A. Gelbukh, K. Inui, Dependency syntax analysis using grammar induction and a lexical categories precedence system, in: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2011, pp. 109–120.
[31] J. Nivre, M.-C. De Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, et al., Universal dependencies v1: A multilingual treebank collection, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 1659–1666.
[32] D. Nadeau, S. Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30 (1) (2007) 3–26.
[33] G. Bouma, Normalized (pointwise) mutual information in collocation extraction, Proceedings of GSCL (2009) 31–40.
[34] Hankcs, Introduction to natural language processing, People’s Posts and Telecommunications Press, 2019.
[35] K. Church, P. Hanks, Word association norms, mutual information, and lexicography, Computational Linguistics 16 (1) (1990) 22–29.
[36] C.-W. Lee, Y.-L. Wu, L.-C. Yu, Combining mutual information and entropy for unknown word extraction from multilingual code-switching sentences, Journal of Information Science & Engineering 35 (3).
[37] T. D. Nguyen, M.-Y. Kan, Keyphrase extraction in scientific publications, in: International conference on Asian digital libraries, Springer, 2007, pp. 317–326.
[38] F. Liu, D. Pennell, F. Liu, Y. Liu, Unsupervised approaches for automatic keyword extraction using meeting transcripts, in: Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics, 2009, pp. 620–628.
[39] M. Baroni, J. Matiasek, H. Trost, Unsupervised discovery of morphologically related words based on orthographic and semantic similarity, arXiv preprint cs/0205006.
[40] G. Navarro, A guided tour to approximate string matching, ACM Computing Surveys 33 (1) (2001) 31–88. doi:10.1145/375360.375365.
[41] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, Vol. 96, 1996, pp. 226–231.
[42] R. Newcombe, From client to project stakeholders: a stakeholder mapping approach, Construction Management and Economics 21 (8) (2003) 841–848.
[43] Q. Chen, R. Jin, Multilevel safety culture and climate survey for assessing new safety program, Journal of Construction Engineering and Management 139 (7) (2013) 805–817.
[44] A. Pinto, I. L. Nunes, R. A. Ribeiro, Occupational risk assessment in construction industry–overview and reflection, Safety Science 49 (5) (2011) 616–624.
[45] State Council of the People’s Republic of China, Regulations on construction engineering quality management (1997).
URL http://www.gov.cn/flfg/2005-08/06/content_20998.htm
[46] Standing Committee of the National People’s Congress, Construction law of the People’s Republic of China (2019).
URL http://www.npc.gov.cn/npc/c30834/201905/0b21ae7bd82343dead2c5cdb2b65ea4f.shtml
[47] J. Pearl, D. Mackenzie, The book of why: the new science of cause and effect, Basic Books, 2018.