An Annotated Corpus of Webtables
for Information Extraction Tasks
Abstract
Information Extraction is a well-researched area of Natural Language Processing, with applications in web search and question answering, concerned with identifying entities and the relationships between them as expressed in a given context, usually a sentence or a paragraph of running text. Given the importance of the task, several datasets and benchmarks have been curated over the years. However, focusing on running text alone leaves out tables, which are common in many structured documents and in which pairs of entities also co-occur in context (e.g., the same row of a table). While there are recent papers on relation extraction from tables in the literature, their experimental evaluations have been on ad-hoc datasets for the lack of a standard benchmark. This paper helps close that gap. We introduce an annotation framework and a dataset of 217,834 tables from Wikipedia, annotated with 28 relations using both distant supervision against a reference knowledge graph and carefully designed queries. Binary classifiers are then applied to the resulting annotations to remove false positives, resulting in an average annotation accuracy of 94%. The resulting dataset is the first of its kind to be made publicly available.
1 Introduction
We endeavored to annotate hundreds of thousands of tables with relationships holding either between a pair of columns or between the subject of the article and a table column. We used two different methods for this. The first was inspired by distant supervision, a widely used technique for automatically annotating data, discussed in Section 2, which required a set of tables and a knowledge graph Mintz et al. (2009). The second method consisted of a set of carefully crafted queries to pick out relational tables, also requiring a set of tables.
To do this we needed an unannotated set of tables as well as a knowledge graph and a set of relations. We used a dump of Wikipedia tables from March 2019 for a few reasons, the most important being the trustworthiness of Wikipedia articles. Wikipedia editors follow a set of guidelines when writing articles and new content is fact-checked regularly (https://en.wikipedia.org/wiki/Wikipedia:Editing_policy). This results in a more consistent and factual dataset than, for example, the tables on the web used in WebTables Cafarella et al. (2008). Wikipedia also contains information about a wide variety of topics and entity types, ensuring a diverse dataset.
Our dataset is available at https://doi.org/10.7939/DVN/SHL1SL and can be cited as
E. Macdonald and D. Barbosa. An annotated corpus of webtables for information extraction tasks, 2019. URL https://doi.org/10.7939/DVN/SHL1SL.
2 Related Work
Researchers have developed a number of datasets and benchmarks for relation extraction from running text Riedel et al. (2010); Hendrickx et al. (2010); Mesquita et al. (2019). Traditionally, these contain a set of sentences, each with a corresponding relation label Riedel et al. (2010). Each example might also carry additional information such as annotated or linked entities Riedel et al. (2010); Mesquita et al. (2019). In addition to the New York Times Riedel et al. (2010) and SemEval-2010 Task 8 Hendrickx et al. (2010) datasets, others include the TAC (Text Analysis Conference) relation extraction Zhang et al. (2017), ACE Doddington et al. (2004), and KnowledgeNet Mesquita et al. (2019) datasets, all of which contain thousands of instances of sentences with labeled relations.
In order to compare the results of different methods on the same dataset as accurately as possible, the same evaluation metrics must be used. For relation extraction this frequently means accuracy, or the F1 score that aggregates precision and recall Jurafsky and Martin (2009). Some literature presents a precision-recall curve, created by varying a decision threshold to collect precision values at different recall levels Zelenko et al. (2003), and some papers also report other metrics common in Information Retrieval, such as Mean Reciprocal Rank (MRR).
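To make these metrics concrete, the following is a minimal sketch (in Python, with illustrative function names not taken from any released code) of how precision, recall, F1 and MRR are typically computed for relation extraction, treating instances labeled with no relation as negatives.

```python
def precision_recall_f1(gold, predicted, no_relation="NO_RELATION"):
    """Micro-averaged precision, recall and F1 over predicted relation labels."""
    true_positives = sum(1 for g, p in zip(gold, predicted)
                         if p != no_relation and p == g)
    predicted_positives = sum(1 for p in predicted if p != no_relation)
    gold_positives = sum(1 for g in gold if g != no_relation)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / gold_positives if gold_positives else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def mean_reciprocal_rank(gold, rankings):
    """MRR when each prediction is a list of candidate relations ranked by score."""
    reciprocal_ranks = [1.0 / (ranking.index(g) + 1) if g in ranking else 0.0
                        for g, ranking in zip(gold, rankings)]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```

A precision-recall curve is obtained by sweeping the decision threshold and recording the precision achieved at each recall level.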
2.1 Relation Extraction from Tables
In recent years, some researchers have attempted to make the tables on the web more useful for search systems Cafarella et al. (2008); Limaye et al. (2010); Ritze et al. (2015); Venetis et al. (2011). Currently, most search engines only index the text in a web page and ignore the information provided by tables Cafarella et al. (2008). This means users cannot search the huge amount of data described in tables, data that would be tedious and repetitive to discuss explicitly in plain text Cafarella et al. (2008). Table understanding is a task parallel to information extraction where the input is a table rather than text.
The system created by Muñoz et al. Muñoz et al. (2014) suggests relationships between table columns and between columns and the article subject when pairs of entities in the same row are already related in an existing knowledge graph. Suggestions are then filtered using a classifier by analyzing features of the article, table, columns, entities, cells and resulting triples. The authors evaluate their work on a dump of Wikipedia's tables by having three judges manually annotate 750 of the roughly 37 million suggested triples. Using only triples for which there was unanimous agreement amongst the judges, five classifiers were trained, evaluated and compared, with the best achieving close to 80% F1 (81% precision and 77% recall) and producing almost eight million new triples.
Following Muñoz et al., a group of researchers at Roma Tre University in Italy and the University of Alberta in Canada have published three methods for relation extraction on tables Sekhavat et al. (2014); Cannaviccio et al. (2018a, b). All of these methods use language models, an existing knowledge graph and an additional text corpus (Clueweb) to determine the relationships in tables. Each entity pair in a table is scored against a model for every relation and the highest-scoring relation is selected. The highest F1 value reported in one paper is 71% Cannaviccio et al. (2018a) while the most recent work reports on a different metric (MRR) Cannaviccio et al. (2018b).
Despite the strong similarities between the work of these two groups, including the fact both used Wikipedia tables and Freebase relations, their results are hard to compare directly as they did not use the same benchmark. For example, they did not use the same set of Freebase relations, nor did they use the same Wikipedia dump. Our goal is to help ameliorate this situation by offering a common benchmark.
2.2 Distant Supervision
Obtaining sufficient training or testing data is a longstanding challenge in Machine Learning. One accepted method to overcome that challenge in the context of relation extraction from text is distant supervision, introduced by Mintz et al. Mintz et al. (2009). The idea behind distant supervision is to leverage an existing KG to annotate sentences mentioning pairs of entities which are known to belong to a relation in the KG, and to treat those sentences as positive training data for machine learning methods for relation extraction, possibly with some filtering steps to remove obvious noise. In order to generate a large dataset of tables which could be used both for evaluating and for training relation extraction methods, we also resorted to distant supervision, as discussed below.
3 Method
We chose to build our dataset using well known resources that are familiar to researchers in the area and representative of the task at hand. Without loss of generality, we chose Wikipedia to obtain the tables and Freebase as the reference KG from which to obtain relations. Wikipedia is a natural choice, as the corpus is not only easy to obtain and archived periodically, but also built within a relatively strict editorial process, leading to fairly high-quality tables and text surrounding those tables (should there be methods that exploit such text).
To choose the relations to label these tables with, we considered DBpedia, which is derived from Wikipedia, and Freebase, which, albeit no longer updated, has been used extensively in previous research on table understanding as well as relation extraction from text Mintz et al. (2009); Cannaviccio et al. (2018b); Paulheim (2016). For example, Mintz et al. Mintz et al. (2009) used the 23 largest relations in Freebase at the time, while Cannaviccio et al. (2018b) ran their experiments on 9 relations involving entities of type person, some of which overlap with the 23 from Mintz et al. Considering that both Wikipedia and Freebase cover the film domain extensively, we also selected relations from that domain, resulting in the 28 relations used to annotate tables, listed in Table 1.
Comparing Table 1 with the list of the 23 largest Freebase relations used by Mintz et al., one glaring difference between the use of text and tables in Wikipedia becomes obvious. For example, the relation for which we could find the most tables in Wikipedia corresponds to the team an athlete plays for, because Wikipedia has many articles written by sports experts and aficionados containing such information. In contrast, that relation is not among those used in the distant supervision benchmark. Similarly, that benchmark does not include the next three relations in our benchmark in terms of the number of tables we could find: actor-film, political_party-politician, and actor-character.
Table 1: Number of annotated tables and estimated annotation accuracy per relation, for distant supervision (before and after the filtering classifiers were applied) and for ad-hoc querying. A dash indicates that no value is available (e.g., no classifier was applied to that relation).

| Relation | Dist. Sup. Tables | Acc. (before) | Acc. (after) | Querying Tables | Acc. |
|---|---|---|---|---|---|
| sports_team-player | 21,440 | 0.83 | 0.93 | 22,102 | 0.98 |
| actor-film | 22,580 | 0.90 | – | 28,177 | 0.98 |
| political_party-politician | 8,642 | 0.79 | 0.89 | 18,014 | 0.99 |
| actor-character | 1,724 | 0.49 | 0.84 | 21,462 | 0.97 |
| location-contains | 11,059 | 0.92 | – | 6,121 | 1.00 |
| football_position-player | 2,121 | 1.00 | – | 13,076 | 0.87 |
| musician-album | 8,049 | 0.92 | 0.96 | 8,560 | 0.97 |
| person-nationality | 8,865 | 0.58 | 0.90 | 7,002 | 1.00 |
| director-film | 7,019 | 0.71 | 0.81 | 4,504 | 0.97 |
| award-nominee | 6 | 0.56 | 1.00 | 5,725 | 1.00 |
| person-graduate | 83 | 0.54 | 1.00 | 5,408 | 1.00 |
| film-language | 2,256 | 0.89 | – | 2,952 | 1.00 |
| author-works_written | 588 | 0.63 | 0.87 | 1,712 | 0.89 |
| producer-film | 668 | 0.18 | 0.46 | 964 | 0.95 |
| writer-film | 1,036 | 0.17 | 0.36 | 199 | 0.80 |
| film-music | 932 | 0.53 | 0.75 | 399 | 0.97 |
| person-profession | 420 | 0.60 | 0.73 | 716 | 0.97 |
| person-parents | 122 | 0.23 | 0.65 | 731 | 0.97 |
| film-country | 474 | 0.60 | 0.82 | 380 | 1.00 |
| musician-origin | 112 | 0.49 | 0.75 | 632 | 1.00 |
| film-production_company | 446 | 0.59 | 0.66 | 275 | 0.98 |
| person-spouse | 305 | 0.14 | 0.60 | 90 | 1.00 |
| film-genre | 8 | 0.09 | 1.00 | 203 | 0.94 |
| person-place_of_birth | 0 | – | – | 180 | 1.00 |
| company-industry | 59 | 0.50 | 1.00 | 12 | 1.00 |
| person-place_of_death | 133 | 0.05 | 0.25 | 4 | 1.00 |
| person-religion | 14 | 0.70 | 1.00 | 42 | 1.00 |
| book-genre | 22 | 0.79 | 1.00 | 9 | 0.89 |
| No relation | – | – | – | 12,491 | – |
We restricted our benchmark to relations whose subject and object are named entities that are disambiguated (via wikilinks in the table cells). We made this choice based on two factors. First, several methods in the literature do not perform entity disambiguation themselves and could therefore not be evaluated on tables whose entities are not already disambiguated. Second, despite tremendous recent progress, named entity disambiguation remains a challenge, and relying on automatic methods for that step would introduce errors in our dataset. We will consider relaxing this restriction in future releases of the benchmark.
Table 2: Performance of the binary filtering classifiers on the column-pair and article subject-column-pair tasks, using small and large feature vectors.

| Classifier | Col Pairs (Small) | Col Pairs (Large) | Subj-Col Pairs (Small) | Subj-Col Pairs (Large) |
|---|---|---|---|---|
| kNN () | 0.69 | 0.70 | 0.72 | 0.69 |
| GNB | 0.67 | 0.61 | 0.66 | 0.57 |
| kNN () | 0.61 | 0.60 | 0.62 | 0.51 |
| NC | 0.54 | 0.55 | 0.68 | 0.67 |
3.1 Collecting Tables
We used two different methods to obtain annotated tables from Wikipedia. First, we attempted to adapt distant supervision to our task. Upon realizing some crucial shortcomings, we resorted to manually crafted queries over the table corpus to further improve our dataset.
Distant Supervision.
We first used a form of distant supervision to annotate the Wikipedia tables. This involved gathering a list of entity pairs related in Freebase through one of our 28 relations. We also included entities related through analogous relations in DBpedia, another public knowledge base similar to Freebase, to increase the number of pairs.
For each pair of entities (e1, e2) collected for a relation r, we searched for Wikipedia tables with e1 and e2 appearing in cells of the same row but in different columns (referred to as column pairs). We make the assumption that relation r holds between the two columns containing e1 and e2, and we can thus infer the relation for all entity pairs in those columns. We also searched for tables with either e1 or e2 as the subject of the Wikipedia article and the remaining entity appearing in any cell of a table in that article (referred to as article subject-column pairs). We then assume the article subject is related through r to the corresponding column.
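A minimal sketch of this annotation step is shown below, assuming each table is represented as a simple Python dictionary with a disambiguated article subject and columns of wikilinked cells, and that related_pairs maps an ordered entity pair to the set of knowledge-graph relations holding between the two entities; all names are illustrative and do not reflect the released code.

```python
from collections import defaultdict

def annotate_table(table, related_pairs):
    """Distant supervision over a single table.

    table: {"subject": entity, "columns": [[entity or None per row], ...]}
    related_pairs: {(e1, e2): set of relation names from the knowledge graph}
    Returns relations inferred for column pairs and for subject-column pairs."""
    columns = table["columns"]
    n_rows = max((len(c) for c in columns), default=0)
    column_pairs = defaultdict(set)     # (col_i, col_j) -> relations
    subject_columns = defaultdict(set)  # col_j -> relations

    # Column pairs: two related entities in the same row but different columns.
    for i, col_i in enumerate(columns):
        for j, col_j in enumerate(columns):
            if i == j:
                continue
            for row in range(n_rows):
                e1 = col_i[row] if row < len(col_i) else None
                e2 = col_j[row] if row < len(col_j) else None
                if e1 and e2:
                    column_pairs[(i, j)] |= related_pairs.get((e1, e2), set())

    # Article subject-column pairs: the subject related to any cell of a column.
    subject = table["subject"]
    for j, col_j in enumerate(columns):
        for entity in col_j:
            if entity:
                subject_columns[j] |= related_pairs.get((subject, entity), set())

    return column_pairs, subject_columns
```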

These assumptions are responsible for almost all the erroneously annotated tables in this dataset. The root of the problem is illustrated in Figure 1. Two entities can be, and often are, related to one another through multiple relations, so when we annotate a table like the one in our example, we will assign both the director and writer relations between columns 1 and 2 and between columns 1 and 5, and two of those four annotations are incorrect.
In an attempt to mitigate these errors, we performed a small set of experiments using three different binary classifiers to mark each annotation as correct or incorrect, focusing on the relations annotated in the previous step with the lowest accuracy. To create training and testing data for these classifiers, we hand-annotated 200 tables for each relation, using 100 for training and the other 100 for testing. We also use these annotations as a measure of the accuracy of the distant supervision method. We experimented with k-nearest-neighbors (kNN), nearest centroid (NC) and Gaussian naive Bayes (GNB), building the feature vector for each annotation from a combination of the word2vec embeddings of the terms in the article title, table headers, section title and section text. We also compared different vector lengths (generated using more or fewer words from those texts) and present the results in Table 2, which compares the classifiers with both vector lengths on the column pair and article subject-column pair tasks. The results show that one of the kNN configurations consistently out-performs the other classifiers.
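As an illustration of this filtering step, the sketch below trains the three classifiers on averaged word2vec features using scikit-learn; the helper functions, field names and the value of k are placeholders rather than the exact configuration we used.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

def embed(text, word_vectors, dim=300):
    """Average the word2vec vectors of the tokens in `text` (zeros if none match)."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def features(annotation, word_vectors):
    """Concatenate embeddings of the article title, table headers,
    section title and section text for one candidate annotation."""
    parts = [annotation["article_title"],
             " ".join(annotation["headers"]),
             annotation["section_title"],
             annotation["section_text"]]
    return np.concatenate([embed(p, word_vectors) for p in parts])

def filter_annotations(train, test, word_vectors, k=5):
    """Predict whether each candidate annotation in `test` is correct."""
    X_train = np.stack([features(a, word_vectors) for a in train])
    y_train = np.array([a["correct"] for a in train])
    X_test = np.stack([features(a, word_vectors) for a in test])
    classifiers = {"kNN": KNeighborsClassifier(n_neighbors=k),
                   "GNB": GaussianNB(),
                   "NC": NearestCentroid()}
    return {name: clf.fit(X_train, y_train).predict(X_test)
            for name, clf in classifiers.items()}
```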
Ad-hoc queries.
When curating the gold standard dataset for training the above classifiers, we identified some commonalities in how data for certain relations is represented in the tables. Using this information, we composed a set of over 150 queries over column headers and section titles to retrieve tables for each relation from the dataset. There are two notable benefits to this method compared to distant supervision, and one major downside. The benefits are that it does not rely on information already present in an existing knowledge graph and, when 100 tables per relation were checked manually, it proved far more accurate than distant supervision. The primary downside is that we annotated far fewer tables in this way.
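The following sketch gives the flavour of these queries; the two patterns shown are invented examples (the released queries are more numerous and more specific), and they simply test a table's section title and column headers against relation-specific regular expressions.

```python
import re

# Hypothetical examples in the spirit of the ad-hoc queries: a relation is
# assigned when both the section-title and the header pattern match the table.
QUERY_PATTERNS = {
    "actor-film": (re.compile(r"filmography", re.I),
                   re.compile(r"\b(film|title)\b", re.I)),
    "musician-album": (re.compile(r"discography", re.I),
                       re.compile(r"\balbum\b", re.I)),
}

def match_relations(table):
    """Return the relations whose query patterns match this table's context."""
    header_text = " ".join(table["headers"])
    return [relation
            for relation, (section_pattern, header_pattern) in QUERY_PATTERNS.items()
            if section_pattern.search(table["section_title"])
            and header_pattern.search(header_text)]
```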
In Table 1 we present an estimate of the accuracy of the tables collected with each method; for distant supervision we report the accuracy both before and after the classifiers were applied. We also annotated 12,491 tables with no relation present by querying for random tables in the dataset, selecting random columns, and annotating those that were not already annotated with a relation.
4 Use Case and Benchmark Difficulty
We briefly report in Table 3 the results of a first study using the benchmark described above (citation omitted due to double-blind requirements). We note that the benchmark was created prior to conducting this study.
The baseline chooses the relation in the KG that covers the most pairs of entities in the respective table, achieving very low results. Our method uses a neural network that takes into consideration not only the entities in the columns but also contextual information from the article in which the table appears, such as the table caption, the table headers, the title of the section in which the table appears, and the first paragraph of that section.
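For illustration only, the sketch below shows one way the contextual signals listed above could be assembled alongside the column entities; the actual architecture belongs to the anonymized study, and the field names here are assumptions.

```python
def build_model_input(table):
    """Gather the column entities and the contextual signals used by the model.
    Field names are illustrative; the real input format may differ."""
    return {
        "entities": [cell for column in table["columns"] for cell in column if cell],
        "caption": table.get("caption", ""),
        "headers": table["headers"],
        "section_title": table["section_title"],
        "first_paragraph": table["section_text"].split("\n")[0],
    }
```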
Table 3: Results of the baseline and of the anonymized method on the benchmark.

| Method | Accuracy | F1 |
|---|---|---|
| Baseline | 0.15 | 0.27 |
| anonymous | 0.88 | 0.95 |
Table 4: Ablation study of the contextual information used by the network.

| Network | Accuracy |
|---|---|
| Full | 0.92 |
| Full - table captions | 0.91 |
| Full - table headers | 0.76 |
| Full - section paragraphs | 0.76 |
| Full - section titles | 0.72 |
Table 4 shows an ablation study indicating that, among all sources of contextual information, section titles contribute the most to finding the correct relation, followed by the section paragraphs and the table headers. We argue that these numbers, together with the low performance of the baseline, support the claim that our benchmark is sufficiently challenging to contribute to further development in the field.
For comparison, Muñoz et al. Muñoz et al. (2014) and Cannaviccio et al. Cannaviccio et al. (2018b) report F1 scores of 0.71 and 0.74, respectively, for relation extraction from Wikipedia tables. This comparison is imperfect, however, as their methods used different relations and different Wikipedia tables from one another and from us.
5 Dataset and Code Release
We are releasing the code used to create the dataset, the 200 manually annotated tables for each of the relations, the results of all classifiers, and the ad-hoc queries used to improve our table corpus for other researchers.
6 Conclusion
We draw two conclusions from this work on annotating tabular data. The first is that a versatile method such as distant supervision is very effective in discovering possible annotations for diverse types of data, but it is limited by the information already in the knowledge graph and rests on an assumption that inevitably introduces errors into the dataset. Second, we conclude that by using a combination of different methods, each with its own compromises, we could build a diverse annotated dataset that is able to surface brand new relation instances to add to a knowledge graph.
This dataset can be very helpful in training relation extraction systems to detect relations in tables, in further filtering the dataset itself to add new, accurate relations to a knowledge graph, or in training question-answering systems that can return tables containing relevant information. Future research could focus on the cleaning step applied to the tables collected with distant supervision. Refining this step with more sophisticated classifiers or more informative feature vectors has the potential to improve the confidence in these annotations. Another future improvement to the benchmark would be expanding the set of relations to include those involving entities and literals (e.g., dates or other numerical values), in a way that results on the different kinds of relations could be reported separately.
References
- Cafarella et al. [2008] Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. Webtables: Exploring the power of tables on the web. Proc. VLDB Endow., 1(1):538–549, August 2008. doi: 10.14778/1453856.1453916.
- Cannaviccio et al. [2018a] Matteo Cannaviccio, Lorenzo Ariemma, Denilson Barbosa, and Paolo Merialdo. Leveraging wikipedia table schemas for knowledge graph augmentation. In Proceedings of the 21st International Workshop on the Web and Databases, pages 5:1–5:6. ACM, 2018a. doi: 10.1145/3201463.3201468.
- Cannaviccio et al. [2018b] Matteo Cannaviccio, Denilson Barbosa, and Paolo Merialdo. Towards annotating relational data on the web with language models. In Proceedings of the 2018 World Wide Web Conference, pages 1307–1316. International World Wide Web Conferences Steering Committee, 2018b. doi: 10.1145/3178876.3186029.
- Doddington et al. [2004] George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie M. Strassel, and Ralph M. Weischedel. The automatic content extraction (ace) program - tasks, data, and evaluation. In LREC. European Language Resources Association, 2004.
- Hendrickx et al. [2010] Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó. Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38. Association for Computational Linguistics, 2010.
- Jurafsky and Martin [2009] Daniel Jurafsky and James H. Martin. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., 2009. ISBN 0131873210.
- Limaye et al. [2010] Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow., 3(1-2):1338–1347, September 2010. doi: 10.14778/1920841.1921005.
- Mesquita et al. [2019] Filipe Mesquita, Matteo Cannaviccio, Jordan Schmidek, Paramita Mirza, and Denilson Barbosa. Knowledgenet: A benchmark dataset for knowledge base population. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 749–758, 2019.
- Mintz et al. [2009] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1003–1011, 2009.
- Muñoz et al. [2014] Emir Muñoz, Aidan Hogan, and Alessandra Mileo. Using linked data to mine rdf from wikipedia’s tables. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 533–542. ACM, 2014. doi: 10.1145/2556195.2556266.
- Paulheim [2016] Heiko Paulheim. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8:489–508, December 2016. doi: 10.3233/SW-160218.
- Riedel et al. [2010] Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions without labeled text. In Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases: Part III, pages 148–163. Springer-Verlag, 2010.
- Ritze et al. [2015] Dominique Ritze, Oliver Lehmberg, and Christian Bizer. Matching html tables to dbpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, pages 10:1–10:6. ACM, 2015. doi: 10.1145/2797115.2797118.
- Sekhavat et al. [2014] Yoones A. Sekhavat, Francesco Di Paolo, Denilson Barbosa, and Paolo Merialdo. Knowledge base augmentation using tabular data. In Proceedings of the Workshop on Linked Data on the Web co-located with the 23rd International World Wide Web Conference (WWW 2014), Seoul, Korea, April 8, 2014, volume 1184 of CEUR Workshop Proceedings. CEUR-WS.org, 2014. URL http://ceur-ws.org/Vol-1184/ldow2014_paper_02.pdf.
- Venetis et al. [2011] Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu. Recovering semantics of tables on the web. Proc. VLDB Endow., 4(9):528–538, June 2011. doi: 10.14778/2002938.2002939.
- Zelenko et al. [2003] Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. Kernel methods for relation extraction. J. Mach. Learn. Res., 3:1083–1106, March 2003.
- Zhang et al. [2017] Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45. Association for Computational Linguistics, 2017. doi: 10.18653/v1/D17-1004.