MatScIE: An automated tool for the generation of databases of methods and parameters used in the computational materials science literature
Abstract
The number of published articles in the field of materials science is growing rapidly every year. This comparatively unstructured data source, which contains a large amount of information, has limited re-usability: the information needed to carry out further calculations using the data in it must be extracted manually. It is very important to obtain valid and contextually correct information from published data, as it can be useful not only to generate inputs for further calculations, but also to incorporate them into a querying framework. Retaining this context as a priority, we have developed an automated tool, MatScIE (Material Science Information Extractor), that can extract relevant information from material science literature and create a structured database that is much easier to use for material simulations. Specifically, we extract the material details, methods, code, parameters, and structure from various research articles. Finally, we created a web application where users can upload published articles, view/download the information obtained from this tool, and create their own databases for their personal use.
keywords: Sequence Labeling, Information Extraction, Material Scientific Articles
1 Introduction
Currently, the majority of material science findings and information are stored in an unstructured format across numerous published articles. A typical article contains information on the material(s) studied, the method used, the computational software(s) used in the study, the simulation parameters and, finally, the outcome of the study. If we consider the scenario where we want to query the methods and parameters discussed in published material science papers, there is no easy and robust method to effectively filter and refine this information automatically. Manually going through the papers and finding out the methods used is an inefficient and tedious task. A potential solution can be to build a system that can automatically extract mentions of a method from any published article.
To solve these problems, we introduce Material Science Information Extractor (MatScIE), which is capable of extracting information about the material, code, parameter, method and structure from the published research article, along with providing a summary of the main research findings. Significant developments have been observed in the field of information extraction using machine learning and deep learning techniques. A very specific and widely used task is Named Entity Recognition (NER) that classifies (extracts) named entities in the text as per pre-decided classes or categories. It accepts as an input a sequence of tokens and identifies the spans in the input sequence that belong to one of the pre-decided categories. In our use-case, we attempt to extract the spans of text in a material science research article belonging to one of the five categories: material, code, parameter, method, structure. This enables us to adapt the NER framework for our task.
Popular NLP methods for the NER task are based on recent advances in deep-learning [1]. An important requirement in training a deep learning model is the availability of appropriate (and large) annotated data. We created a modest annotated dataset using 214 material science articles by labeling the text spans into the five categories.
In the first part, we trained standard sequential models on the supervised data to predict labels. We utilized word embeddings pre-trained on the material science domain. In the second part, we injected noise into the training dataset to increase the robustness of the model. Since injecting noise into a textual dataset is non-trivial, we developed a novel procedure using a Relabeling and a Mimicking model. The Relabeling model has high recall, and we use it to inject noise into the dataset. We varied the amount of injected noise and obtained results for different noise levels. In order to derive short summary-level information from a published article, we trained a sentence classification model on an annotated dataset of 90 articles. Additionally, we developed a web application that generates the summary-level information and the token spans corresponding to each category.
The remainder of the paper is structured as follows. In Section 2, we describe past research on information extraction from scientific articles. We describe our annotated dataset in Section 3. Then we describe our proposed approach and evaluation metrics in Sections 4 and 5, respectively. We show the experimental results and sample outputs in Sections 6 and 7, respectively. Some further analysis of the output is presented in Section 8. We present the online interface in Section 9. Section 10 concludes the paper and gives directions for future work.
2 Related Work
Information extraction from scientific articles has been extensively explored recently. With the help of information extraction methods, we intend to extract potential information out of the large body of scientific articles. Extracted information can provide an overview of the key insights from scientific articles. Some of the major computational approaches in this regard include rule-based approaches, machine learning approaches like the Naive Bayes classifier [2] and support vector machines [3], and deep learning approaches. Most of the studies involving deep learning are performed on publicly available datasets like BC5CDR [4] and SCIERC [5]. One interesting work in this field is that of Luan et al. [5], who extracted entities, relations between entities and co-reference clusters in scientific articles using a multi-task setup. They applied their model to the SCIERC dataset, which contains abstracts of 500 scientific articles from different domains. Furthermore, they used their predictions to generate a scientific knowledge graph that can further be used for analysis of the scientific literature. Another popular work in this field is that of Beltagy et al. [6]. This work uses a BERT model pretrained on a corpus of scientific articles to improve performance on downstream scientific NLP tasks. The corpus used for pretraining the model mostly consisted of biomedical documents. Since our work is primarily based on material science domain articles, we generated an annotated dataset consisting of only material science articles and then trained deep learning models on it.
There has been quite some work on extracting chemical entities. Some of the information related to chemical entities is already available in static databases. These databases map chemical information to relevant documents, with details like patents and literature related to the entered text. Below is a list of popular chemical databases:
- ChemSpider [9] - a free database containing chemical structures, with the ability to search by chemical names and chemical structures. It helps to find important data like literature references, physical properties, chemical suppliers, etc.
- SciFinder [10] - used to access information from selected Chemical Abstracts Service (CAS) databases. It offers searches by author name, related topic, etc.
But all such databases are static and require constant updates from researchers.
One interesting work in this field is the OSCAR4 recogniser [11]. It builds an n-gram model and then uses it in a Bayesian classifier to classify whether a token is “chemical” or “non-chemical”. The n-gram model is built with the help of a list of chemical tokens obtained from a fixed dictionary and manually annotated documents. It then builds a Maximum Entropy Markov Model [12] by representing each token with a set of features, one of which is the probability, given by the previously built n-gram model, that the token belongs to the chemical domain. Since the model is prepared from a fixed dictionary, more often than not it fails to capture materials represented in complex notations. Additionally, this model extracts only the chemical components from a text, whereas we are interested in a much more detailed extraction.
ChemSpot [13] is another chemical extraction tool that uses a Conditional Random Field (CRF) [14] to identify IUPAC and IUPAC-like names in scientific text. This model also fails to fulfill our purpose, for similar reasons.
Recently, research studies have been performed on material science domain articles using deep learning models to classify tokens into material science categories. One such work is that of Weston et al. [15]. It uses a BiLSTM with a CRF to classify each token of the text. The dataset used in that paper contains 800 article abstracts annotated with the following labels: material, symmetry/phase label, sample descriptor, property, application, synthesis method and characterization method. However, the entity labels used in that paper differ from the labels we are interested in, since we are dealing with papers that report density functional theory / first-principles-based calculations.
Several researchers have focused on traditional machine learning models that exploit textual features to perform tasks like understanding material science language, extracting information or developing a knowledge graph. One such work was done by Tshitoyan et al. [16], who developed an unsupervised word embedding model to process text for identifying complex materials science knowledge such as the underlying structure of the periodic table and structure–property relationships in materials (we have used this embedding technique as a baseline in our experiments). Another work was done by Butler et al. [17]. They explored the evolution of research workflows and how different machine learning algorithms (like genetic algorithms and naïve rule-based approaches, the latter of which we have used as a baseline) can be used to design and synthesize different aspects of chemical and material science research. Hakimi et al. [18] focused on biomaterial text mining for retrieving relevant documents. Kostoff et al. [19] described techniques for data retrieval and discussed the desirability of conflating search terms. Another direction of work was to develop an automatic approach. Kim et al. [20] built a platform using a variety of machine learning and natural language processing techniques to automatically retrieve articles and then extract the materials synthesis conditions found in the text. Correa-Baena et al. [21] presented how machine learning and natural language processing can help solve the long timelines and low success probabilities of material science research and can accelerate the pace of novel materials development. Some researchers also focused on exploring features relevant to the material science domain. One such work was done by Goldsmith et al. [22], who showed how machine learning can be useful for aiding heterogeneous catalyst understanding, design and discovery. Dragone et al. [23] proposed a system that can evaluate chemical reactivity and detect new reactions, rather than a predefined set of targets.
Some work has also been done on materials synthesis procedures. The semi-supervised machine-learning classification model by Huo et al. [24] used a Dirichlet allocation model to cluster keywords into topics corresponding to specific experimental materials synthesis steps, such as “grinding” and “heating”, “dissolving” and “centrifuging”, etc. Guided by a modest amount of annotation, a random forest classifier can then associate these steps with different categories of materials synthesis, such as solid-state or hydrothermal. Mysore et al. [25] explored semantic features of text. Young et al. [26] experimented with data mining techniques in the case of pulsed laser deposition of complex oxides. Mysore et al. [27] built a graph-based automatic extraction model. Kononova et al. [28] automatically extracted synthesis entries from scientific publications to gather a dataset of “codified recipes” for solid-state synthesis. Himanen et al. [29] worked on data-driven approaches for material science articles.
The real complexity of our task lies in extracting only the entities used in the study, rather than those that are merely mentioned. For example, consider a paper with the title “Temperature independent band structure of WTe2 as observed from ARPES”. This paper may mention many different materials, but only WTe2 is being studied and is the only material we are interested in. We observe that an article usually works on a much smaller subset of materials than the ones actually occurring in the paper, while the remaining ones are just mentioned as previous work. In a normal NER task, however, we would be interested in extracting all the different materials mentioned in the paper. The same goes for all other classes.
3 Dataset
In this section, we discuss the collection and annotation process of the dataset. We first discuss the dataset creation for the information extraction task, followed by the dataset for sentence classification task.
3.1 Information Extraction Dataset
We started creating our dataset using a set of 70 hand-picked material science articles. To expand this, we crawled material science articles published between 2010 and 2019 from arXiv under the cond-mat.mtrl-sci category. We applied filters to consider only the articles that contain at least one of the following keywords in the abstract: “ab initio simulation”, “density functional study”, “density functional theory” and “first principles”.
Next, we considered only those articles that mention at least one code listed on https://psi-k.net/software. Thus, we ended up with a repository of 10,900 material science articles in PDF format.
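A minimal sketch of this abstract-level keyword filter is shown below; the helper name and the example abstract are illustrative, not the actual crawler code.

```python
# Sketch of the abstract keyword filter used to shortlist crawled articles.
# The helper name and example text are illustrative, not the actual crawler.
KEYWORDS = [
    "ab initio simulation",
    "density functional study",
    "density functional theory",
    "first principles",
]

def keep_article(abstract: str) -> bool:
    """Return True if the abstract mentions at least one target keyword."""
    text = abstract.lower()
    return any(kw in text for kw in KEYWORDS)

example = ("We report first principles calculations of the electronic "
           "structure of a layered material.")
print(keep_article(example))  # True
```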
3.2 PDF annotation and parsing
In addition to the initial 70 documents, we took some articles randomly from the arXiv dataset to create a dataset of 214 articles. These PDF documents were annotated using the Google PDF Editor by experts in the material science domain. We selected the following entity types for labelling: a) material, b) method, c) code, d) parameter, e) structure. Selected words or series of words were highlighted and annotated with the corresponding label using this annotation tool. If there are multiple occurrences of the same word or series of words, then only one occurrence is labelled; the remaining occurrences are automatically annotated when we process the data.
Since the annotations are in the PDF format, we need to extract the contents along with the annotation to be able to apply NLP approaches. To achieve this we analysed several open source applications as listed below.
- pdfanno (https://github.com/paperai/pdfanno)
- brat (https://brat.nlplab.org)
- INCEpTION (https://github.com/inception-project/inception)
- popplerqt4 (https://github.com/frescobaldi/python-poppler-qt4)
- OCR++ [30] (https://github.com/ocrplusplus/ocrplusplus)
- Science Parse (https://github.com/allenai/science-parse)
To this end, we finalised two applications, popplerqt4 and Science Parse, that are both usable and appropriate to create the dataset, as described below.
- popplerqt4 - a Python library that can be used to extract annotations (highlights, comments) from a PDF file and format them as markdown text. The output of the script contains each highlighted text span and its corresponding comment. However, it does not provide the sentence containing the annotation.
- Science Parse - parses scientific papers (in PDF form) and returns a structured output. It extracts the Title, Authors, Abstract, Sections, Bibliography, Mentions, etc. in JSON format. It does not have the ability to extract highlighted portions and comments.

We combine the outputs of Science Parse and popplerqt4 to generate a well-formed output containing the contents of the scientific article along with the annotations. This is done by an annotation script that first lists the annotations of the PDF article using popplerqt4 and then uses regular expressions to match the annotations with the text obtained from the Science Parse application, as shown in Figure 1.
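A minimal sketch of this merging step is shown below, assuming the Science Parse JSON exposes per-section heading and text fields and that the popplerqt4 highlights are available as (text, label) pairs; all function and field names are illustrative.

```python
import json
import re

def merge_annotations(scienceparse_json_path, highlights):
    """Align popplerqt4 highlights with the Science Parse section text.

    `highlights` is a list of (highlighted_text, label) pairs.  The JSON
    field names below mirror typical Science Parse output, but should be
    treated as assumptions rather than a fixed schema.
    """
    with open(scienceparse_json_path) as f:
        doc = json.load(f)

    matches = []
    for section in doc.get("sections", []):
        body = section.get("text", "")
        for span_text, label in highlights:
            # Allow arbitrary whitespace inside the highlight, since PDF
            # extraction often reflows line breaks within a span.
            pattern = r"\s+".join(re.escape(tok) for tok in span_text.split())
            for m in re.finditer(pattern, body):
                matches.append({
                    "section": section.get("heading", ""),
                    "start": m.start(),
                    "end": m.end(),
                    "text": m.group(0),
                    "label": label,
                })
    return matches
```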
3.3 Text Preprocessing
The above steps provide the annotated PDF document in raw text format. However, to make the data ready for the extraction task, the first step is to split the raw text into sentences. An ad hoc splitting of sentences on the period character (.) might not be the best choice, since this token can occur multiple times within the same sentence (e.g., in abbreviations or decimal numbers). Hence we used spaCy, a sentence tokenizer that uses a machine learning algorithm to learn sentence boundaries.
Each sentence consists of tokens. As mentioned earlier, a chemical formula can be represented in multiple forms. To normalize chemical formulae, we replaced all digits with a uniform digit (0). We extracted the tokens from a sentence by splitting it on whitespace.
Generally, the output obtained from Science Parse contains spurious characters that are either misrepresented or added erroneously. We cleaned the text by either removing them or replacing them with the appropriate letters.
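A minimal sketch of this preprocessing pipeline (sentence splitting with spaCy, digit normalization and whitespace tokenization) is shown below; the character-cleaning rule is a crude stand-in for the actual cleaning step, and the spaCy model name is an assumption.

```python
import re
import spacy

# Assumes the small English spaCy pipeline has been installed with
# `python -m spacy download en_core_web_sm`.
nlp = spacy.load("en_core_web_sm")

def preprocess(raw_text: str):
    """Split raw article text into sentences of normalized tokens."""
    # Crude stand-in for the cleaning step: drop non-printable characters.
    text = re.sub(r"[^\x20-\x7E\n]", " ", raw_text)
    sentences = []
    for sent in nlp(text).sents:
        tokens = sent.text.split()                         # whitespace split
        tokens = [re.sub(r"\d", "0", t) for t in tokens]   # digit -> 0
        if tokens:
            sentences.append(tokens)
    return sentences

print(preprocess("We used a 500 eV cutoff. The k-mesh was 12x12x12."))
```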
3.4 Selecting sentences
After tokenization of the text sentences, we have two choices: a) consider only the sentences that contain at least one labelled token, or b) consider all the sentences. We prepared datasets based on both choices. Later, we will see that the dataset prepared with the second choice provides better results. Since we are interested in finding spans of sentences belonging to a specific category, the data needs to be converted to a tagging format. We used the inside-outside-beginning (IOB) format, which is a very common tagging format used in named entity recognition (NER) tasks. After all the above-mentioned operations, we ended up with 49,610 sentences in total, where the tokens are annotated with 5 classes (material, method, code, parameter, structure). Any token that is not associated with one of these classes is treated as belonging to a special class “O”.
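The conversion to IOB tags can be sketched as follows; the helper name and the (start, end, label) span representation are illustrative.

```python
def to_iob(tokens, spans):
    """Convert token-level label spans to IOB tags.

    `spans` is a list of (start, end, label) token ranges, e.g. (3, 8, "code")
    marks tokens 3..7.  Any untouched token keeps the default "O" tag.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return list(zip(tokens, tags))

tokens = "We used the Vienna Ab initio Simulation Package".split()
print(to_iob(tokens, [(3, 8, "code")]))
```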
3.5 Label Distribution and division of the dataset
Figure 2 shows the number of spans corresponding to each class in the annotated data. Multiple occurrences of a span in the same document are counted once. From the figure it is clear that the labels follow a skewed distribution, with the maximum number of spans observed for the method and parameter classes and the minimum number of spans observed for the structure class.
We performed a 5-fold cross-validation test and report the average and standard deviation of the precision, recall and F1-score of the model across these 5 folds. Note that we split the data by documents and not by sentences, to reflect the real-world setting where we would be provided with a new document at test time. Following standard practice, for each fold, the model parameters are learnt on the training set and the hyper-parameters are tuned on the validation set, without touching the test set. The final tuned model is used on the test set to obtain the final numbers.
3.6 Sentence classification Dataset
Since the previous annotations help in extracting the material, methods, code, parameter and structure from a scientific document, decent summary-level information about the article can be derived by including sentences that contain information related to the results. We therefore chose the abstracts of 90 articles from the 214 articles above and annotated each sentence as positive or negative depending on whether it reports a result (data available at https://github.com/TeamMatSciE/MatSciE).

4 Proposed Approach
Our proposed approach consists of an information extraction and a sentence classification model, which we describe in this section. For information extraction, we utilize a sequence labeling approach, commonly used to identify the (named) entities of interest. Given a sequence of tokens (words), this model makes a judgement for each token in the sequence, as to whether it is part of any of the underlying entities of interest.
For example, the model should output that the token “Na2SO4” belongs to the class material and the sequence “Vienna Ab initio Simulation Package” belongs to the class code.

In the NLP literature, there are primarily three types of information that are utilized for the sequence labeling task: a) word representation, b) the context in which the word has occurred in the sentence, and c) shape of the token.
Word representations can be obtained by training a word-embedding model. We used the word2vec model with the skip-gram architecture to represent each word as a vector, as discussed by Mikolov et al. [31]. The skip-gram architecture exploits the distributional hypothesis, i.e., ‘similar words tend to appear in similar contexts’, by using the context words to learn the representation of a given word. It is usually trained using a large corpus containing text from a similar domain. Since our dataset is much smaller, we used pretrained word2vec vectors, trained on a corpus of 650K material science published papers, from the previous work of Kim et al. [32].
The local context in which a word occurs in a sentence proves useful in cases where the word vector corresponding to a given token is not available. This is very common because of new materials and other novel tokens in the literature. Consider the following sentence from our dataset: “We performed first principles calculations on ___ doped with Mn, focusing on different aspects.” From the context words, it is quite evident that the word that appears in place of the blank belongs to the class material. To utilize the context information, we treat each sentence as a sequence of words that are represented using embeddings. Since this is a sequence labelling task, we use recurrent neural networks (RNNs) to learn an embedding of the context that the sentence represents [33].

RNNs are known to suffer when the sequence is long, since they cannot utilize context words that are far from the target word. Because of this, we used the Long Short Term Memory (LSTM) [34], a variant of the RNN. A forward LSTM can capture the sequence of tokens appearing before the current token, while a backward LSTM captures the sequence of tokens appearing after the current token. During training we used a Bidirectional LSTM (Bi-LSTM) [1], which concatenates the embedding obtained by the forward LSTM and the embedding obtained by the backward LSTM to achieve this purpose. For each element in the input sequence, each LSTM layer computes the following:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (1)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (2)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (3)$$
where $i_t$ represents the input gate, $f_t$ represents the forget gate, $o_t$ represents the output gate, $\sigma$ represents the sigmoid function, $W_g$ and $U_g$ represent the weights of the respective gate $g$, $h_{t-1}$ represents the output of the previous block at time-step $t-1$, $x_t$ represents the input representation at the current time-step, and $b_g$ represents the bias of the respective gate. The equations to obtain the final output are
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (4)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad (5)$$
$$h_t = o_t \odot \tanh(c_t) \qquad (6)$$
where $c_t$ represents the cell state at time-step $t$ and $\tilde{c}_t$ represents the candidate cell state at time-step $t$. The output $h_t$ obtained at each time-step is passed through a fully connected linear layer to perform the classification among the six classes. We applied a softmax function to the logits obtained from the linear network to obtain the probabilities for each class.
Chemical formulae are often represented by an inorganic notation, for example - sodium sulphate is represented as Na2SO4. We can observe that this token has a fixed pattern where the first letter is uppercase, followed by lowercase characters and numbers. There are tokens with distinct shapes available in our dataset. Also, we notice that the prefix and suffix of a token play a major role in deciding the class to which it belongs. To detect the shape of the tokens, we treat each token as a sequence of characters and then apply Bi-LSTM. In the end, we obtain the character level embedding of the token.
After we obtain the embedding, we need to build a tag encoder that can rightly understand the class belonging to the token. For this purpose, we use a Conditional Random Field (CRF) classifier [35].
Peters et al. [36] showed that ELMO embedding of a token can give better performance as compared to Bi-LSTM embedding in general. In ELMO, each token is assigned a representation that is a function of the entire input sentence. ELMO embeddings are deep, in the sense that they capture the output of all the internal layers of LSTM. We used pretrained ELMO embedding for material science scientific articles. The input to the Bi-LSTM model is thus a concatenation of pretrained word2vec embedding, character embedding, and pretrained ELMO embedding (see Figure 3).
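A simplified PyTorch sketch of the tagger is shown below. The embedding dimensions and tag count are illustrative, the pretrained word2vec and ELMO vectors are assumed to be supplied as tensors, and the CRF decoder is replaced here by a plain linear emission layer for brevity.

```python
import torch
import torch.nn as nn

class TokenTagger(nn.Module):
    """Sketch: word2vec + char-BiLSTM + ELMO features feed a sentence BiLSTM.
    A CRF layer would normally decode the per-token emissions; a linear
    layer stands in for it here."""

    def __init__(self, word_dim=200, char_vocab=128, char_dim=30,
                 char_hidden=25, elmo_dim=1024, hidden=120, n_tags=11):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)
        in_dim = word_dim + 2 * char_hidden + elmo_dim
        self.lstm = nn.LSTM(in_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.emissions = nn.Linear(2 * hidden, n_tags)

    def char_features(self, char_ids):
        # char_ids: (n_tokens, max_chars) -> one shape-aware vector per token
        out, _ = self.char_lstm(self.char_emb(char_ids))
        half = out.size(2) // 2
        return torch.cat([out[:, -1, :half],    # last step of forward LSTM
                          out[:, 0, half:]],    # first step of backward LSTM
                         dim=-1)

    def forward(self, word_vecs, elmo_vecs, char_ids):
        # word_vecs: (1, seq, word_dim); elmo_vecs: (1, seq, elmo_dim)
        chars = self.char_features(char_ids).unsqueeze(0)
        x = torch.cat([word_vecs, elmo_vecs, chars], dim=-1)
        out, _ = self.lstm(x)
        return self.emissions(out)   # (1, seq, n_tags) tag scores per token
```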
The basic workflow of the model is shown in Figure 4.
We tried quite a few variations of the basic approach during our experiments, which we describe below.
4.1 Varying the context length: using sentence-level context vs. an entire section
The length of the input sequence to the model largely determines the context available to correctly classify the tokens. As mentioned earlier, we aim to extract only the “entities of interest” and not all the mentions in an article. In general, a single sentence is the input to the model for performing NER. This is computationally favorable, but doing so results in a loss of context, which makes it less suitable for our task. Instead of a sentence, we also experiment with an entire section as the input to the model. For example, if ABSTRACT is a section, then all the text under this section becomes one input to the model. As we will see in the experimental results, using the larger context from the entire section yields much better performance than restricting the context to the sentence level.
4.2 Relabeling and Mimicking Model
In general, for training an NER model, only the sentences which contain at least one positive annotation (i.e., a label other than ‘O’) are used. Discarding all other sentences from the training dataset may result in a loss of data which could otherwise have been helpful. Also, this particular setting cannot be used for the purpose of section-wise training. To explore this, we perform experiments with two different settings: one with only the sentences containing at least one annotation (the positive-only dataset), and the other with all the sentences (the full dataset). We then use our validation set to understand the performance of these settings.

Results show that the model trained on the positive-only dataset gives better results for the material and structure classes, whereas the model trained on the full dataset gives better results for the method, code and parameter classes. These results can be explained by the frequency of the classes. As shown in the label distribution in Figure 2, the number of spans of the material and structure classes is much smaller than that of the other classes. The density of these labels reduces further with the addition of non-positive sentences to the dataset. We observed that the first model had better recall scores and the second model had better precision scores.
Our objective is to train a model that performs well for all the desired classes. For this, we introduce the Relabeling and Mimicking model (see Figure 5). First, we train a model on the dataset containing only positively annotated lines; we call this the Relabeling model. Next, we add all the non-positive sentences, giving the full dataset, and use the Relabeling model to make predictions on it for the material and structure classes only. Keeping all the positive ground-truth labels of all the classes intact, we add some percentage of the newly predicted positive labels to this dataset, as a way of adding meaningful noise; this gives the noise-augmented dataset. Finally, we train a model on the noise-augmented dataset; we call this the Mimicking model. Increasing the amount of noise may increase the recall but decrease the precision, and a different amount of noise may be required for each class, which has to be determined experimentally. As mentioned earlier, a model trained on the full dataset has good precision but low recall. As we will see in the experiments, the Mimicking model achieves good precision as well as good recall.
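A minimal sketch of the noise-injection step, using the per-class noise fractions reported later in Table 5, is shown below; the data structures and function names are illustrative.

```python
import random

def inject_noise(all_sentences, relabel_predictions, noise_frac):
    """Build the Mimicking model's training labels.

    `all_sentences` maps sentence id -> gold IOB tag list (positive and
    non-positive lines); `relabel_predictions` maps sentence id -> extra
    (start, end, label) spans predicted by the Relabeling model that are
    not in the gold labels; `noise_frac` maps a class to the fraction of
    its extra predictions to keep.  All names are illustrative.
    """
    noisy = {sid: list(tags) for sid, tags in all_sentences.items()}
    for cls, frac in noise_frac.items():
        extra = [(sid, span) for sid, spans in relabel_predictions.items()
                 for span in spans if span[2] == cls]
        random.shuffle(extra)
        for sid, (start, end, label) in extra[:int(frac * len(extra))]:
            tags = noisy[sid]
            if all(t == "O" for t in tags[start:end]):  # keep gold labels intact
                tags[start] = f"B-{label}"
                for i in range(start + 1, end):
                    tags[i] = f"I-{label}"
    return noisy

# Per-class noise fractions from Table 5.
noise_frac = {"material": 0.50, "structure": 0.50,
              "method": 0.25, "parameter": 0.0, "code": 0.33}
```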
4.3 Post processing
One very basic technique to improve the results of any NER task is to understand the patterns of occurrence of tokens across the classes by analyzing the dataset. We identified two such patterns for our dataset: i) entities belonging to the class material will generally also be present in the title of the paper, and ii) entities belonging to the classes method and parameter are usually close to each other. To reduce the number of spurious entities obtained from the model, we apply the above-mentioned post-filters. We achieved this by removing the predicted materials that are obtained in any section other than the TITLE. We also removed those methods that do not have any token classified as parameter within a span of two sentences. Experimentally we observed that by applying these filters, the precision of these classes increased without a considerable decrease in recall. It is to be noted that for some articles, the PDF parsing application is not able to obtain the TITLE, which makes it difficult to apply the first filter. We handled those cases by considering the ABSTRACT and the first two sections of the article, i.e., only those material entities obtained in the ABSTRACT and the first two sections are kept and the rest are filtered out.
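The two post-filters can be sketched as follows; each prediction is represented here as a small dictionary whose field names are illustrative, not the actual implementation.

```python
def postprocess(predictions, allowed_material_sections=("TITLE",)):
    """Apply the two post-filters to model predictions.

    Each prediction is a dict like {"text", "label", "section",
    "sentence_idx"}.  When the parser fails to recover the TITLE, pass the
    ABSTRACT and the first two section headings instead.
    """
    # Filter 1: drop materials predicted outside the allowed sections.
    kept = [p for p in predictions
            if p["label"] != "material"
            or p["section"] in allowed_material_sections]

    # Filter 2: drop methods with no predicted parameter within two sentences.
    param_sents = {p["sentence_idx"] for p in kept if p["label"] == "parameter"}
    kept = [p for p in kept
            if p["label"] != "method"
            or any(abs(p["sentence_idx"] - s) <= 2 for s in param_sents)]
    return kept
```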
4.4 Sentence Classification
Recent state-of-the-art results in sentence classification, over multiple public datasets, are obtained using the BERT model. It generates an embedding for each input instance, which we then use as input to a feed-forward neural network to categorize the sentence into the positive or negative class. The training set used for this model is obtained by annotating the abstracts of 90 articles. We fine-tuned the model using both the bert-base-uncased and scibert-scivocab-uncased pretrained embeddings; as reported in Section 6, the bert-base-uncased model gave better F1 scores.
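A minimal fine-tuning sketch using the HuggingFace transformers library, with a toy two-sentence batch, is shown below; it is not the actual training script, and the model name can be swapped for allenai/scibert_scivocab_uncased.

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # or "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# Toy batch: 1 = sentence reports a result, 0 = it does not.
sentences = ["We find an indirect band gap of 1.2 eV.",
             "Density functional theory is widely used in materials science."]
labels = torch.tensor([1, 0])

batch = tokenizer(sentences, padding=True, truncation=True,
                  return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                        # a few passes over the toy batch
    out = model(**batch, labels=labels)   # out.loss is the cross-entropy loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```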
5 Evaluation metrics
We view the NER task as classification of each token into one of the 5 classes that we are interested in. If a token does not belong to any such class, then we treat it as belonging to a special class denoted by “O”. As with other classification tasks, we use precision, recall, and F-measure to measure the performance of our model. For each document we created two sets of tokens for each category: one set contains the actual annotations or ground truth (call it set $A$) and the other contains the predicted tokens (call it set $B$). We then compared the tokens between the sets and used them to compute the metrics.
We define the precision (P) of a particular class (c) by the expression
$$P_c = \frac{|A_c \cap B_c|}{|B_c|} \qquad (7)$$
where $A_c$ represents the set of tokens actually labelled with class $c$ and $B_c$ represents the set of tokens predicted as class $c$. Precision calculates the fraction of the obtained results that are relevant. We define recall (R) by the expression
$$R_c = \frac{|A_c \cap B_c|}{|A_c|} \qquad (8)$$
Recall calculates the fraction of the total relevant results that are correctly classified by the system. In general, we want the system to output both high precision and high recall. In order to measure the accuracy of the system we use the F1-score, which is defined as the harmonic mean of precision and recall, i.e.,
$$F1_c = \frac{2 \cdot P_c \cdot R_c}{P_c + R_c} \qquad (9)$$
| | Naïve rule based method | | | Unsupervised Word embedding based method | | |
|---|---|---|---|---|---|---|
| | P, σ | R, σ | F1-Score, σ | P, σ | R, σ | F1-Score, σ |
| material | 40.38, 16.3 | 20.35, 10.43 | 26.25, 11.58 | 64.17, 6.32 | 59.98, 11.04 | 61.72, 7.84 |
| structure | 21.39, 9.84 | 29.79, 13.43 | 24.78, 11.38 | 38.62, 6.4 | 50.52, 17.62 | 42.97, 9.42 |
| method | 63.2, 3.62 | 69.03, 7.91 | 65.74, 3.38 | 55.44, 4.72 | 85.95, 1.52 | 67.27, 3.11 |
| parameter | 7.78, 0.34 | 98.36, 1.32 | 14.43, 0.59 | 18.67, 7.01 | 88.3, 6.27 | 30.16, 9.13 |
| code | 84.74, 2.66 | 60.6, 7.62 | 70.42, 5.08 | 84.75, 7.12 | 72.97, 12.76 | 77.43, 5.46 |
| macro score | 43.3, 4.1 | 55.63, 5.85 | 41.76, 7.55 | 52.33, 4.6 | 71.54, 5.9 | 55.84, 4.89 |
| | BERT question answering model | | | BERT domain adaptation model | | |
|---|---|---|---|---|---|---|
| | P, σ | R, σ | F1-Score, σ | P, σ | R, σ | F1-Score, σ |
| material | 71.82, 3.51 | 73.56, 4.23 | 72.68, 5.72 | 76.16, 4.03 | 58.48, 5.02 | 66.29, 4.02 |
| structure | 46.51, 3.93 | 35.21, 5.34 | 40.08, 4.31 | 45.58, 7.14 | 40.29, 9.81 | 42.29, 7.33 |
| method | 60.62, 2.56 | 71.23, 3.45 | 65.50, 3.76 | 81.11, 2.27 | 63.97, 3.55 | 71.47, 2.26 |
| parameter | 62.51, 2.87 | 65.72, 6.31 | 64.07, 5.38 | 81.04, 3.81 | 73.02, 5.22 | 76.74, 3.7 |
| code | 82.11, 3.13 | 71.89, 2.35 | 76.66, 3.29 | 82.74, 4.34 | 80.63, 3.7 | 81.55, 2.02 |
| macro score | 64.71, 4.52 | 63.52, 5.68 | 63.79, 5.14 | 73.33, 2.3 | 63.28, 2.54 | 67.65, 2.48 |
A model with a higher F1-score is considered better than a model with a lower F1-score. In our task, we are concerned with spans of tokens. In order to calculate how well the model is performing, we followed a combination of two different approaches: a) exact match and b) partial match [37]. The revised expressions to calculate precision and recall are given by
$$P = \frac{|\{\, b \in B : \exists\, a \in A \ \text{such that}\ \mathrm{match}(a, b) \,\}|}{|B|} \qquad (10)$$
$$R = \frac{|\{\, a \in A : \exists\, b \in B \ \text{such that}\ \mathrm{match}(a, b) \,\}|}{|A|} \qquad (11)$$
For exact matching, $\mathrm{match}(a, b)$ is true if $a$ and $b$ represent exactly the same span, whereas for partial matching, $\mathrm{match}(a, b)$ is true if there is some overlap between the spans of $a$ and $b$. As before, the F1-score is the harmonic mean of precision and recall:
$$F1 = \frac{2 \cdot P \cdot R}{P + R} \qquad (12)$$
In our experiments, we calculated the average precision (P), recall (R) and F1-score along with their respective standard deviations (σ), where the F1-score values were calculated as per Equation 12. We utilized string matching as the accuracy metric. In other words, for each article, we compared the list of annotated tokens for each class with the list of predicted tokens for that class. This is significantly different from the usual NER setting, where we are interested in finding the class-level metrics for each sentence.
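A sketch of this span-level evaluation for a single class in a single document is shown below; the function name and span representation are illustrative.

```python
def span_prf(gold_spans, pred_spans, partial=False):
    """Precision/recall/F1 over document-level span sets for one class.

    Spans are (start, end) token offsets.  With partial=True an overlapping
    span counts as a match; otherwise spans must match exactly.
    """
    def match(a, b):
        return (a[0] < b[1] and b[0] < a[1]) if partial else a == b

    tp_pred = sum(any(match(g, p) for g in gold_spans) for p in pred_spans)
    tp_gold = sum(any(match(g, p) for p in pred_spans) for g in gold_spans)
    precision = tp_pred / len(pred_spans) if pred_spans else 0.0
    recall = tp_gold / len(gold_spans) if gold_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(span_prf({(3, 5), (10, 12)}, {(3, 5), (11, 13)}, partial=True))
```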
6 Experimental Results
The task of extracting the entities from material science articles shares some similarities with the Named Entity Recognition (NER) task and also with the question-answering task [38]. Thus, we consider state-of-the-art models for these tasks as baselines. For NER, we followed the work of Beltagy et al. [6] and used SciBERT, which is pretrained on scientific publications. Since our task is specific to the material science domain, we also used unsupervised domain adaptation techniques to obtain a BERT model specific to the material science domain, as discussed by Han and Eisenstein [39]. For the question-answering task, we used the BERT multi-answer question answering model shown by Devlin et al. [40]. Along with the above two models, we also used a baseline that predicts based on the frequency of occurrence in the training set: if a span in the test dataset is present in the training dataset with a class annotation, it is given that particular label. We call this the Naïve rule-based approach. Another baseline is the unsupervised word embedding based method (Word2Vec) developed by Tshitoyan et al. [16]. The results for the Naïve rule-based approach and the Unsupervised Word Embedding based method (UWE) [16] are listed in Table 1. The results for the BERT question answering model and the BERT domain adaptation model are listed in Table 2. We performed a 5-fold cross-validation test for each model to report the average precision (P), recall (R) and F1-score (F1) along with their standard deviations (σ) across the five folds.
| | Per-line setup with only positively annotated lines | | | All lines per section | | |
|---|---|---|---|---|---|---|
| | P, σ | R, σ | F1-Score, σ | P, σ | R, σ | F1-Score, σ |
| material | 78.31, 8.92 | 85.4, 5.47 | 81.38, 5.28 | 80.54, 6.1 | 78.43, 12.81 | 78.92, 7.38 |
| structure | 35.45, 4.23 | 61.02, 20.03 | 43.58, 7.19 | 45.16, 16.98 | 63.5, 15.65 | 52.78, 10.3 |
| method | 59.68, 9.54 | 89.34, 4.03 | 71.07, 6.77 | 65.5, 7.91 | 81.91, 4.21 | 72.46, 4.9 |
| parameter | 52.39, 6.0 | 85.05, 1.68 | 64.64, 4.31 | 62.85, 11.27 | 81.34, 6.94 | 70.11, 6.25 |
| code | 69.16, 10.55 | 85.74, 5.05 | 75.98, 5.27 | 73.97, 14.69 | 82.88, 7.47 | 77.16, 6.21 |
| macro score | 58.99, 4.02 | 81.31, 6.0 | 67.43, 1.52 | 65.61, 9.34 | 77.61, 7.48 | 70.29, 3.97 |
| | Per-section setup with only positively annotated lines (Relabeling model) | | | Mimicking model | | |
|---|---|---|---|---|---|---|
| | P, σ | R, σ | F1-Score, σ | P, σ | R, σ | F1-Score, σ |
| material | 80.23, 3.91 | 89.03, 6.29 | 84.05, 3.09 | 80.7, 5.72 | 87.27, 3.71 | 83.66, 1.9 |
| structure | 38.39, 8.6 | 75.32, 14.04 | 50.81, 10.53 | 46.14, 16.11 | 67.67, 12.22 | 53.75, 11.28 |
| method | 49.6, 5.21 | 92.57, 2.28 | 64.5, 4.82 | 66.24, 5.62 | 88.01, 5.02 | 75.44, 3.93 |
| parameter | 39.49, 7.98 | 88.31, 2.39 | 54.1, 7.65 | 69.51, 9.53 | 82.26, 2.76 | 74.95, 4.53 |
| code | 68.31, 5.77 | 90.04, 0.86 | 77.57, 3.51 | 74.19, 9.17 | 84.05, 4.23 | 78.4, 3.46 |
| macro score | 55.2, 3.51 | 87.06, 3.05 | 66.55, 3.48 | 67.35, 8.6 | 81.85, 2.92 | 73.13, 4.66 |
Next, we compare the results of the per-line training setup (with only positively annotated lines) and the all-lines-per-section training setup. In Section 4.1 we motivated the section-wise training setup. In Table 3, we compare the results of these two training setups. The model trained using the section-wise training setup outperforms the model trained using the sentence-wise training setup.
In Section 4.2 we mentioned that using only positively annotated lines for training can lead to information loss. Including all lines per section (positive and negative) in the training dataset leads to comparatively better scores, as seen in Table 3. Since we have already established that section-wise training is better, we adopted this approach for both models; only their training datasets differ. Results in Table 3 show that the model trained on only positively annotated lines has good recall scores but low precision scores, while the model trained on all lines has good precision scores but lower recall scores. Also, for all classes except material, the latter model performs better than the former in terms of F1-score.
We have already explained the Relabeling and Mimicking model in detail in Section 4.2. The model trained on only positively annotated lines (section-wise) is used as the Relabeling model due to its high recall scores; the Relabeling model is thus the same as the per-section setup with only positively annotated lines. In Table 4 we compare the results of the Relabeling model and the Mimicking model. The Mimicking model achieves a better macro F1-score as well as the best F1-score for all the classes. We tuned several hyper-parameters of the model to obtain the optimum results; the best F1-score was obtained with the SGD optimizer, a dropout of 0.5, and tuned values of 0.001, 200, 120 and 8 for the remaining hyper-parameters.
We use the Relabeling model to add noise to the training dataset of the Mimicking model. Keeping the ground-truth labels intact, we add some percentage of the extra labels predicted by the Relabeling model. The amount of noise added per class is shown in Table 5. For example, if 50% noise is added to the material class, it means half of the extra predicted labels of this class are added into the training data randomly. The amount has to be balanced because increasing the noise results in an increase in recall but a decrease in precision. The optimum noise level has to be determined experimentally; we used the validation set results to fine-tune the noise level for each individual class (as shown in Table 5).
Class | % Noise |
---|---|
material | 50 |
structure | 50 |
method | 25 |
parameter | 0 |
code | 33 |
We show a summary of the F1-scores of all models, along with their 95% confidence intervals (error bars shown in black), in Figure 6. The Mimicking model achieves the best F1-score among all approaches: Naïve rule based (NB), Unsupervised Word Embedding (UWE), BERT Question Answering (BERT QA), BERT domain adaptation (BERT DA), Per-line, Per-section (all lines), Relabeling and Mimicking.
We observe that all the competing models fail to identify micro details of the structures properly. Thus, the precision, recall and F1-scores for structure are lower than for the other classes. This is because the systems correctly identify macro details of structures like “tetragonal”, “cubic”, etc., but are not able to match micro details of structures like “cubic perovskite Pm3̄m”.
| Class | Entity | F1-Score |
|---|---|---|
| Methods | GGA (Generalised Gradient Approximation) | 96.67 |
| | DFT (Density Functional Theory) | 94.56 |
| | PAW (Projected Augmented Wave) | 93.31 |
| | PBE (Perdew, Burke and Ernzerhof) | 94.54 |
| | FP-LAPW (full potential linearized augmented plane wave) | 91.10 |
| Parameters | eV | 97.56 |
| | K-point | 84.06 |
| | Basis Set | 76.16 |
| | gmax | 87.43 |
| | lmax | 89.81 |
Detailed Analysis
To perform a detailed analysis of the results, we take our best model, the Mimicking model, and analyze the individual accuracy scores for the top five most frequent methods and parameters. The F1-score for each of these entities is shown in Table 6. We see that our approach is able to extract several entities (methods like GGA, DFT, PAW, PBE and FP-LAPW, and parameters like eV, K-point, Basis Set, gmax and lmax) with very good F1-scores.

Summary Extraction
For summary extraction, results obtained after applying the BERT model are shown in Table 7. We used the generic BERT (bert-base-uncased) as well as SCI-BERT (scibert-scivocab-uncased), and found that the generic BERT gives much better results in terms of F1-score.
Model | P | R | F1 |
---|---|---|---|
BERT with scibert-scivocab-uncased pretrained model | 90.91 | 71.43 | 80.00 |
BERT with bert-base-uncased pretrained model | 82.61 | 95.00 | 88.37 |
7 Sample Output
In this section we present some of the actual outputs obtained from the best model for the information extraction task.
Tables 8 and 9 show the output obtained by applying the best model to the material science papers titled “Single pair of Weyl fermions in the half-metallic semimetal EuCd2As2” and “Electron-phonon superconductivity in CaBi2 and the role of spin-orbit interaction”, respectively.
| Class | Ground truth | Model output |
|---|---|---|
| material | • EuCd2As2 | • half-metallic semimetal EuCd2As2 |
| structure | • space group 164(P3mm) • hexagonal | • space group 164(P3mm) |
| method | • PBE • Generalized Gradient Approximation | • Perdew-Burke-Ernzerhof (PBE) • projector-augmented wave [37] method • Generalized Gradient Approximation |
| parameter | • 318 eV • U = 5.0 eV • 0.01 eV/Å | • 318 eV • U = 5.0 eV • 0.01 eV/Å |
| code | • VASP | • VASP |
| Class | Ground truth | Model output |
|---|---|---|
| material | • CaBi2 | • CaBi2 |
| structure | • space group Cmcm, No. 63 • cubic fcc | • space group Cmcm, No. 63 • cubic |
| method | • density functional theory (DFT) • Perdew-Burke-Ernzerhof • Rappe-Rabe-Kaxiras-Joannopoulos ultrasoft pseudopotentials | • density functional theory (DFT) • pseudopotential method • Perdew-Burke-Ernzerhof • Rappe-Rabe-Kaxiras-Joannopoulos ultrasoft pseudopotentials |
| parameter | • grid of 123 k points • grid of 43 q points | • 123 k points |
| code | • QUANTUM ESPRESSO • BOLTZTRAP | • QUANTUM ESPRESSO • BOLTZTRAP |
8 Post-facto analysis
The proposed model can be used to observe various statistics over time, for instance, which methods and codes are used the most over the years. To answer this question, we ran our best model on the 10K material science articles from arXiv (see Section 3.1). Figures 7 and 8 show the distribution of the code and method classes, respectively, over the articles published in the last ten years. We can see from the figures that “VASP” is the most popular code of the decade, followed by “QUANTUM ESPRESSO”. Both codes follow a common trend: their use has grown over the years, whereas the use of post-processing packages, such as “Phonopy” and “Boltztrap”, has fluctuated over time. For the method tag, we see that “GGA” appears the most, followed by “DFT” and “PAW”. It is indeed true that most researchers in the field of computational materials science use VASP as their primary code, and the most popular approach is PAW-PBE. We found that “eV” is the most popular parameter, followed by “K-point”. This utility would therefore be useful to track the choices of researchers in this field over time, globally.
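Such statistics amount to a simple aggregation over the per-article predictions; a sketch with illustrative data is shown below.

```python
from collections import Counter, defaultdict

def code_usage_by_year(articles):
    """Count predicted code entities per publication year.

    `articles` is a list of dicts {"year": int, "code": set of predicted
    code entities}; the field names are illustrative.
    """
    per_year = defaultdict(Counter)
    for art in articles:
        for code in art["code"]:
            per_year[art["year"]][code] += 1
    return per_year

articles = [{"year": 2016, "code": {"VASP"}},
            {"year": 2016, "code": {"VASP", "Phonopy"}},
            {"year": 2019, "code": {"QUANTUM ESPRESSO"}}]
for year, counts in sorted(code_usage_by_year(articles).items()):
    print(year, counts.most_common(3))
```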


9 Online interface
To be able to make the proposed methods accessible and usable by the community, we have created the online interface:
server: http://34.69.41.173:3333/.
With the help of this website, one can upload any published paper in PDF format and obtain the predictions of our best model. It supports batch upload of multiple PDF documents at the same time. This website will allow material science researchers to automatically extract entities of interest from uploaded papers. We expect this will enhance the pace of future material science research. The server side of the application is developed using Python Django (https://djangoproject.com) and the model is trained using the PyTorch package [41].
10 Conclusions
This paper explored the task of extracting relevant entities from material science publications. We annotated 214 material science articles with five classes - material, code, method, structure and parameter - and used the annotated data to train a deep neural network (see Section 4). We applied our trained model to obtain entities from a set of 10K material science articles published on arXiv over the last ten years. Additionally, we trained a sentence classification model to extract sentences appearing in the abstract that are related to the results. This information helped us to generate short summary-level information about a published article. Using our model, we extracted statistics, as shown in Section 8, giving an insight into the usage of methods and codes over a span of ten years. We created an online interface that lets us obtain the predicted entities from an uploaded published article (see Section 9). Using our tool, one can index material science documents according to the material or parameters used for specific methods.
Most of the sentences in an article will not contain any entity that we are interested in. Passing these sentences to the system is of no use, since the model will not output any entity from them. One possibility is to automatically identify the relevant sections of the text and pass as input only the sentences that are contained in those sections. In order to achieve this, we first need to classify each section of the document as relevant or not. As future work, we can build a classifier that classifies sentences or sections of text as relevant or not, and then feed only the relevant input to the model for prediction. Currently, we only extract text information from the PDF documents. A further improvement can be obtained if we also extract figures and tables from published documents. Material science documents contain useful information about materials, and the outputs obtained from those materials using different methods, in the form of tables. We expect many entities to appear in tables, hence extracting that information will enable us to collect more data. An additional feature would be to correctly identify the name of the parameter for every span that is predicted as the class parameter. From the dataset, it is clear that the parameter name occurs in the same sentence as the predicted token. We may use POS tags or dependency trees to accurately output the parameter name as well.
11 Data Availability
The code and data are available at https://github.com/TeamMatSciE/MatSciE.
References
- Huang et al. [2015] Z. Huang, W. Xu, K. Yu, Bidirectional lstm-crf models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).
- Xu [2018] S. Xu, Bayesian naïve bayes classifiers to text classification, Journal of Information Science 44 (2018) 48–59.
- Alsaleem et al. [2011] S. Alsaleem, et al., Automated arabic text categorization using svm and nb., Int. Arab. J. e Technol. 2 (2011) 124–128.
- Li et al. [2016] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C.-H. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C. Wiegers, Z. Lu, Biocreative v cdr task corpus: a resource for chemical disease relation extraction, Database 2016 (2016).
- Luan et al. [2018] Y. Luan, L. He, M. Ostendorf, H. Hajishirzi, Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction, arXiv preprint arXiv:1808.09602 (2018).
- Beltagy et al. [2019] I. Beltagy, A. Cohan, K. Lo, Scibert: Pretrained contextualized embeddings for scientific text, arXiv preprint arXiv:1903.10676 (2019).
- Wang et al. [2009] Y. Wang, J. Xiao, T. O. Suzek, J. Zhang, J. Wang, S. H. Bryant, Pubchem: a public information system for analyzing bioactivities of small molecules, Nucleic acids research 37 (2009) W623–W633.
- Geer et al. [2010] L. Y. Geer, A. Marchler-Bauer, R. C. Geer, L. Han, J. He, S. He, C. Liu, W. Shi, S. H. Bryant, The ncbi biosystems database, Nucleic acids research 38 (2010) D492–D496.
- Pence and Williams [2010] H. E. Pence, A. Williams, Chemspider: an online chemical information resource, 2010.
- Ridley [2009] D. D. Ridley, Information Retrieval: SciFinder, John Wiley & Sons, 2009.
- Jessop et al. [2011] D. M. Jessop, S. E. Adams, E. L. Willighagen, L. Hawizy, P. Murray-Rust, Oscar4: a flexible architecture for chemical text-mining, Journal of cheminformatics 3 (2011) 41.
- McCallum et al. [2000] A. McCallum, D. Freitag, F. C. Pereira, Maximum entropy markov models for information extraction and segmentation., in: Icml, volume 17, 2000, pp. 591–598.
- Rocktäschel et al. [2012] T. Rocktäschel, M. Weidlich, U. Leser, Chemspot: a hybrid system for chemical named entity recognition, Bioinformatics 28 (2012) 1633–1640.
- Lafferty et al. [2001] J. Lafferty, A. McCallum, F. C. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001).
- Weston et al. [2019] L. Weston, V. Tshitoyan, J. Dagdelen, O. Kononova, A. Trewartha, K. A. Persson, G. Ceder, A. Jain, Named entity recognition and normalization applied to large-scale information extraction from the materials science literature, Journal of chemical information and modeling 59 (2019) 3692–3702.
- Tshitoyan et al. [2019] V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K. A. Persson, G. Ceder, A. Jain, Unsupervised word embeddings capture latent knowledge from materials science literature, Nature 571 (2019) 95–98.
- Butler et al. [2018] K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, A. Walsh, Machine learning for molecular and materials science, Nature 559 (2018) 547–555.
- Hakimi et al. [2020] O. Hakimi, M. Krallinger, M.-P. Ginebra, Time to kick-start text mining for biomaterials, Nature Reviews Materials 5 (2020) 553–556.
- Kostoff [2005] R. N. Kostoff, Method for data and text mining and literature-based discovery, 2005. US Patent 6,886,010.
- Kim et al. [2017] E. Kim, K. Huang, A. Saunders, A. McCallum, G. Ceder, E. Olivetti, Materials synthesis insights from scientific literature via text extraction and machine learning, Chemistry of Materials 29 (2017) 9436–9444.
- Correa-Baena et al. [2018] J.-P. Correa-Baena, K. Hippalgaonkar, J. van Duren, S. Jaffer, V. R. Chandrasekhar, V. Stevanovic, C. Wadia, S. Guha, T. Buonassisi, Accelerating materials development via automation, machine learning, and high-performance computing, Joule 2 (2018) 1410–1420.
- Goldsmith et al. [2018] B. R. Goldsmith, J. Esterhuizen, J.-X. Liu, C. J. Bartel, C. Sutton, Machine learning for heterogeneous catalyst design and discovery (2018).
- Dragone et al. [2017] V. Dragone, V. Sans, A. B. Henson, J. M. Granda, L. Cronin, An autonomous organic reaction search engine for chemical reactivity, Nature communications 8 (2017) 1–8.
- Huo et al. [2019] H. Huo, Z. Rong, O. Kononova, W. Sun, T. Botari, T. He, V. Tshitoyan, G. Ceder, Semi-supervised machine-learning classification of materials synthesis procedures, npj Computational Materials 5 (2019) 1–7.
- Mysore et al. [2019] S. Mysore, Z. Jensen, E. Kim, K. Huang, H.-S. Chang, E. Strubell, J. Flanigan, A. McCallum, E. Olivetti, The materials science procedural text corpus: Annotating materials synthesis procedures with shallow semantic structures, arXiv preprint arXiv:1905.06939 (2019).
- Young et al. [2018] S. R. Young, A. Maksov, M. Ziatdinov, Y. Cao, M. Burch, J. Balachandran, L. Li, S. Somnath, R. M. Patton, S. V. Kalinin, et al., Data mining for better material synthesis: The case of pulsed laser deposition of complex oxides, Journal of Applied Physics 123 (2018) 115303.
- Mysore et al. [2017] S. Mysore, E. Kim, E. Strubell, A. Liu, H.-S. Chang, S. Kompella, K. Huang, A. McCallum, E. Olivetti, Automatically extracting action graphs from materials science synthesis procedures, arXiv preprint arXiv:1711.06872 (2017).
- Kononova et al. [2019] O. Kononova, H. Huo, T. He, Z. Rong, T. Botari, W. Sun, V. Tshitoyan, G. Ceder, Text-mined dataset of inorganic materials synthesis recipes, Scientific data 6 (2019) 1–11.
- Himanen et al. [2019] L. Himanen, A. Geurts, A. S. Foster, P. Rinke, Data-driven materials science: Status, challenges, and perspectives, Advanced Science 6 (2019) 1900808.
- Singh et al. [2016] M. Singh, B. Barua, P. Palod, M. Garg, S. Satapathy, S. Bushi, K. Ayush, K. S. Rohith, T. Gamidi, P. Goyal, et al., Ocr++: a robust framework for information extraction from scholarly articles, arXiv preprint arXiv:1609.06423 (2016).
- Mikolov et al. [2013] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
- Kim et al. [2017] E. Kim, K. Huang, A. Tomala, S. Matthews, E. Strubell, A. Saunders, A. McCallum, E. Olivetti, Machine-learned and codified synthesis parameters of oxide materials, Scientific data 4 (2017) 170127.
- Mesnil et al. [2013] G. Mesnil, X. He, L. Deng, Y. Bengio, Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding., in: Interspeech, 2013, pp. 3771–3775.
- Hochreiter and Schmidhuber [1997] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780.
- Abramson [2015] M. Abramson, Sequence classification with neural conditional random fields, in: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), IEEE, 2015, pp. 799–804.
- Peters et al. [2018] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365 (2018).
- Breck et al. [2007] E. Breck, Y. Choi, C. Cardie, Identifying expressions of opinion in context., in: IJCAI, volume 7, 2007, pp. 2683–2688.
- Zhang et al. [2017] J. Zhang, X. Zhu, Q. Chen, L. Dai, S. Wei, H. Jiang, Exploring question understanding and adaptation in neural-network-based question answering, arXiv preprint arXiv:1703.04617 (2017).
- Han and Eisenstein [2019] X. Han, J. Eisenstein, Unsupervised domain adaptation of contextualized embeddings for sequence labeling, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 4229–4239.
- Devlin et al. [2018] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
- Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.