
footnotetext: * These authors contributed equally to this work.

The Grind for Good Data: Understanding ML Practitioners’ Struggles and Aspirations in Making Good Data

Inha Cha (inhacha@upstage.ai), Juhyun Oh (juhyun@upstage.ai), Cheul Young Park (cheul@upstage.ai), Jiyoon Han (jiyoonhan@upstage.ai), and Hwalsuk Lee (hwalsuk.lee@upstage.ai) — Upstage AI Research, Yongin, Republic of Korea
Abstract.

We thought data to be simply given, but reality tells otherwise; it is costly, situation-dependent, and muddled with dilemmas, constantly requiring human intervention. In the same vein, the ML community's focus on quality data is increasing, as good data is vital for successful ML systems. Nonetheless, few works have investigated dataset builders and the specifics of what they do, and struggle with, to make good data. In this study, through semi-structured interviews with 19 ML experts, we present what humans actually do and consider in each step of the data construction pipeline. We further organize their struggles under three themes: 1) trade-offs from real-world constraints; 2) harmonizing assorted data workers for consistency; 3) the necessity of human intuition and tacit knowledge for processing data. Finally, we discuss why such struggles are inevitable for good data and what practitioners aspire to, toward providing systematic support for data work.

data construction, machine learning practitioners, semi-structured in-depth interviews
ccs: Human-centered computing → Empirical studies in HCI

1. Introduction

The value of data has long been known to the Machine Learning (ML) community, only to be rediscovered recently. From lauding the effectiveness of large data (Halevy et al., 2009), to recognizing its limits (Boyd and Crawford, 2012; Sun et al., 2017), and now to a new paradigm of ML engineering with data at the center, data research in ML is at the cusp of change (dca, [n. d.]). As the saying goes, “Garbage in, garbage out”: good data is a binding condition for a successful ML system, but despite its popularity, it is hard to find. Recent works have eagerly revealed many shortcomings of existing datasets and the algorithms derived from them; to list just a few, they are riddled with biases (Mehrabi et al., 2019), under- or misrepresent the real world (Buolamwini and Gebru, 2018; Zhao et al., 2018), are outright offensive (Crawford and Paglen, 2019; Prabhu and Birhane, 2020), and exploit spurious correlations (Geirhos et al., 2020), all the while being used with little account of by whom and how the data was created (Denton et al., 2020; Geiger et al., 2020). But this outpour of work revealing what has been swept under the rug is a blessing in disguise, as at last the ML community is given a chance to redeem itself. The community is striving to this end; existing benchmarks are being fixed (Yang et al., 2019; Bowman and Dahl, 2021), more rigorous documentation is emphasized (Gebru et al., 2018; Mitchell et al., 2018; Bender and Friedman, 2018), and new value-centric datasets are introduced (Initiative, [n. d.]; Galvez et al., 2021). More broadly, the movement points toward greater accountability and transparency in data creation and use (Hutchinson et al., 2021), and a sociocultural system supporting data excellence for ML (Paullada et al., 2021; Sambasivan et al., 2021; Aroyo et al., 2022).

In this thread, a growing body of work is investigating data through the lens of the humans involved in data work (Pine et al., 2022). Data, from its creation to its utilization in ML or any other application, is naturally human-driven. Even more so in this age of deep learning fueled by big data, data is a feat of collective people with variegated skills (Huff and Tingley, 2015; Zhang et al., 2020). Nonetheless, this heterogeneity of the humans involved makes organizing data development (creation, preparation, and evaluation altogether) into a structured discipline a convoluted problem. Karmaker (“Santu”) points out that the need for much human interaction and the subjective nature of data hinder systematic studies of data construction steps (e.g., data schema development, data annotation). Amershi et al. ([n. d.]) also recognize that discovering, managing, and versioning the data needed for ML applications is much more complex and difficult than in other types of software engineering. In this regard, Paullada et al. (2021) call for looking into the human labor, arbitrary judgments and biases, and unforeseeable circumstances in dataset creation to attain data excellence in ML.

An expanding body of HCI literature focuses on the human labor that goes into producing, gathering, maintaining, curating, analyzing, deciphering, and disseminating data (Muller et al., 2019; Sambasivan, 2022; Thakkar et al., 2022). Most recently, the CHI 2022 workshop (Pine et al., 2022) highlighted the need to detail the practices and processes of humans in data work. In line with this trend, we aimed to discover the specific tasks and challenges of humans in data work, with a particular interest in data construction for ML.

Towards this goal, our study empirically examines what ML practitioners do and consider when creating data. We conducted in-depth interviews focused on the data construction pipeline with 19 ML experts, ranging from a data manager to a senior AI research engineer. We asked our participants about the actual steps they take to build datasets, the difficulties they encounter in the process, and the kind of assistance they deem necessary for constructing good data. Based on our findings, we document a human-oriented data construction pipeline for end-to-end ML systems composed of 6 steps with 14 tasks. We also report what humans actually consider to make good data in each step of the process. Overall, we find that individual strategies for constructing datasets are never identical. Nevertheless, our participants share important commonalities in making good data: iteration is the status quo, manual evaluation is inevitable, annotators dictate data quality, and tacit knowledge is the key to improving data and, in turn, the model.

We also identify three challenges of data construction that repeatedly occur at different stages of the data-building process. First, recurring dilemmas result from real-world constraints, such as trade-offs among cost, scalability, consistency, and data validity. Second, harmonizing the diverse data workers involved in data construction so that outcomes remain consistent and of high quality is another challenge. Finally, data construction requires human intuition and tacit knowledge, empirically gained through experience, to improve data and thereby improve a model.

Acknowledging the iterative and repeatable characteristics of data work, we discuss future research directions to reduce the trial and error to make good data:

  1. (1)

    There is no silver bullet for how to refine data to make a better ML system, but systematic support can help make good data: for example, a model-independent metric for evaluating data quality, alignment between qualitative and quantitative measures of model performance, and explainable model results that allow data debugging.

  2. (2)

    The quality of annotation and the annotators are the most important factors for data quality; thus, we advise providing annotator support: for example, developing an interface that enables efficient annotation, promoting ways to enhance annotators’ understanding of their work, facilitating annotation task management, and emphasizing the importance of constant and quick communication between annotators and project managers.

Our paper makes three main contributions:

  1. (1)

    We document detailed steps and tasks of the data construction pipeline for developing an ML system drawn from semi-structured interviews with 19 ML experts.

  2. (2)

    We illuminate the specific human-centric nature that makes structuring data work challenging, laying a stepping stone towards systematizing data work.

  3. (3)

    Finally, we suggest future directions to address these challenges for more productive dataset construction.

2. Related Work

2.1. Data Excellence in ML

Recently, model-focused ML research has moved into a new phase emphasizing data (Bowman and Dahl, 2021; Sun et al., 2017; Gebru et al., 2018; Mazumder et al., 2022). Data significantly affects an ML system’s performance, fairness, robustness, safety, and scalability (Halevy et al., 2009; Sambasivan et al., 2021; Mehrabi et al., 2019; Zhao et al., 2019; Wilkinson et al., 2020; Dixon et al., 2018; Prabhu and Birhane, 2020; Yang et al., 2019; Buolamwini and Gebru, 2018). However, many of the most popular benchmark datasets suffer from problematic labels (Crawford and Paglen, 2019) and pervasive label errors (Northcutt et al., 2021). Many efforts have been made to release benchmark datasets that relieve the issue of spurious correlations in the data (Hu et al., 2020) or that are safe from social bias (Yang et al., 2019), and to develop methods for obtaining fine-grained data annotations that better align with the motivating task of the benchmark (Tsipras et al., [n. d.]). Recent findings further emphasize the need for well-made datasets. For instance, small models trained on small, high-quality data often did better than large models trained on bad or unsuitable datasets (Northcutt et al., 2021).

Data-centric AI (DCAI) (Ng, 2021; Ng et al., 2021; DeepLearningAI, 2021), one recent trend in ML, calls for a transition of focus from models to data to improve ML systems. DCAI is the discipline of “systematically engineering the data used to build an AI system.” It investigates what makes data “good” and how to systematize best practices for constructing datasets so that both novices and experts can continue producing quality data (Data-Centric AI Resource Hub, [n. d.]; Eyuboglu et al., 2022). In that regard, researchers have investigated ways to complement and refine existing data, built new datasets that overcome previously stated challenges, and suggested novel frameworks that enable more rigorous dataset development (Initiative, [n. d.]; Mazumder et al., 2022). Many researchers have also strived to provide guidelines and establish data-focused benchmarks. Gebru et al. (Gebru et al., 2018) introduced Datasheets for Datasets, an extensive checklist of questions that dataset creators should try to answer. Covering the overall process of data creation, distribution, and maintenance, the datasheet aims to help data creators provide comprehensive documentation in favor of transparency and accountability while allowing data consumers to make informed choices when using the data. In addition, DataPerf (Mazumder et al., 2022) is a growing suite of benchmarks related to building, maintaining, and evaluating datasets, intended to make such practices easier, cheaper, and more repeatable. These efforts combine to turn data development, which has often happened in an unprincipled manner with post hoc rationales, into a more disciplined engineering practice. Researchers have provided a fuller landscape of data practices to facilitate a better understanding of the current state. However, there is still a lack of focus on the humans who imbue meaning into data to make it useful; creating, maintaining, and using data always takes more than one person and many different kinds of work.

2.2. Data and HCI

From an interview-based study with 16 ML practitioners, Xin et al. (Xin et al., 2021) found that participants spent over half of their entire time in the ML pipeline on data preparation, with around 80% of the tasks in that stage performed manually. Numerous ML papers (Roh et al., 2021, 2018; Karmaker, “Santu”) have focused on data engineering in ML workflows and primarily introduced the technical skills required for specific steps in data work (e.g., data collection, cleansing, and validation). Little is known about how humans intervene, what they do, and what considerations they make in data work across the ML pipeline.

Humans play a significant role in determining whether and how a dataset might be useful for a specific application, conducting thorough error analysis, and handling the numerous unstandardized difficulties of creating useful datasets. In that regard, understanding the human labor, human judgments and biases, and volatile circumstances involved in producing datasets for ML systems has emerged as a big issue (Paullada et al., 2021).

Much work (Zhang et al., 2020; Cartwright et al., 2019; Anik and Bunt, 2021; Wang et al., 2022; Thakkar et al., 2022) in the field of HCI is bridging the gap between investigating technical skills for building datasets and understanding humans’ roles and influences in making ML data. Studies have also covered the organization and infrastructure surrounding annotation. Wang et al. (Wang et al., 2022) suggest a view of data annotation as organized employment. They found that systematic work practices and well-organized annotation processes help annotation businesses and requesters more than they help the workers.

Moreover, existing studies have focused on the annotators who actually build the data, investigating their work practices and urging the need to acknowledge their career goals and aspirations. Rivera and Lee (Rivera and Lee, 2021) examined the career goals and challenges of crowd workers on Amazon Mechanical Turk (AMT). Participants on AMT desired to pursue a profession other than crowd labor and signed up for AMT as a starting point, but many struggled to take further steps owing to a lack of career advice and constrained time and financial resources. In addition, Zhang et al. (Zhang et al., 2022) noted that developing annotation tools is time-consuming, costly, and requires software expertise. To alleviate such burdens, they suggested a conceptual framework for the easy development of annotation tools that considers multiple usage cases. Based on the framework, they proposed OneLabeler, designed with reusability and flexibility in mind. To sum up, previous literature has mainly focused on illustrating particular stakeholders’ needs (Cartwright et al., 2019; Rivera and Lee, 2021; Wang et al., 2022) or suggesting systems to support humans in specific steps.

Recently, the CHI 2022 workshop (Pine et al., 2022), “Investigating Data Work Across Domains: New Perspectives on the Work of Creating Data,” called for attention to the details of data work to realize the goal of making data useful and meaningful. The organizers highlighted the need to understand how humans create, collect, manage, curate, analyze, interpret, and communicate data. In this vein, we focus on the roles humans play, the difficulties they face, and the efforts they make to create good data in ML workflows.

3. Method

This paper seeks to understand what ML experts go through as they build and manage datasets to develop ML models. We conducted one-on-one semi-structured interviews with ML practitioners in academia and industry working on various ML tasks, with a task of building a dataset construction pipeline as supplementary material alongside the interview. Participants were asked about their practices, their challenges, and the support they think is necessary to construct datasets for ML system development.

Table 1. Participant demographics, from the left: (1) Participant #, (2) Job/role, (3) Years of experience in programming + ML, (4) Domain, (5) Workshop task.
No. Job/Role Yrs. of Exp. Domain Workshop Task
P1 AI Research Engineer 6 NLP Opinion mining from product reviews
P2 AI Research Engineer 5 CV, Multimodal Physical activity recognition
P3 ML Engineer 7 CV, OCR Checkbox recognition from documents
P4 Data Manager 2 CV, OCR Medical documents parsing
P5 AI Research Engineer 8 CV Detecting defects in manufactured products
P6 AI Research Engineer 4 NLP, Speech Automatic speech recognition
P7 AI Research Engineer 11 OCR Medical documents parsing
P8 AI Research Engineer 7.5 CV, Bioinformatics Detecting defects in manufactured products
P9 AI Research Engineer 4 CV, Medical Detecting tables in documents
P10 Software Engineer 13 NLP Neural Machine Translation
P11 Data Scientist 7 CV Fashion image search and recommendation
P12 AI Research Engineer 6 CV Defect detection for mobile phones
P13 Ph.D. Student 9 MIR Automatic DJ/mixing
P14 Software Engineer 3 OCR, CV Credit card OCR
P15 Ph.D. Student 4 Speech, MIR Singing Voice Synthesis
P16 Ph.D. Student 8 NLP Neural Machine Translation
P17 Ph.D. Student 5 CV Object detection for classifying mart goods
P18 AI Research Engineer 3 NLP Hate speech detection
P19 AI Research Engineer 2 NLP Sentiment analysis from movie reviews

3.1. Recruitment & Participants

Participants were recruited from the authors’ personal networks, developer communities, and school communities via snowball and purposive sampling. As a result, 19 participants affiliated with 13 different organizations were recruited. In particular, we recruited engineers and researchers who have experience with the following:

  1. (1)

    Constructing datasets for supervised learning and the labeling work associated with it.

  2. (2)

    Developing ML models for real-world applications.

  3. (3)

    Using AutoML tools in practice (or having an understanding of AutoML).

Overall, our participants had a minimum of 2 and a maximum of 13 years of experience with ML and programming. All participants had experience developing ML models for real-world applications or conducting research projects. Their profiles spanned multiple occupations, ML domains/applications, and affiliations. Table 1 summarizes participant demographics, including their years of experience in programming and ML, current roles, and the ML domains they have expertise in. The last column shows the specific task each participant chose for the workshop task. The details of participants’ affiliations are omitted from the table for anonymity. Participants were compensated for each interview with about 50,000 KRW (approx. $40).

3.2. Study Procedure

Our participants are located in, or have worked on projects based in, South Korea or the USA. We conducted all interviews in Korean, the first language of both participants and interviewers. All interviews were conducted online via Zoom, considering the COVID-19 situation.

The interview began with researchers briefly introducing the research goal and asking participants about their backgrounds and experiences in constructing a dataset for developing an ML application. Participants were then asked to expand upon the experience they had just described and were tasked with building a data construction pipeline. As a prompt for the task, participants were asked the following questions:

Figure 1. A screenshot of a FigJam worksheet; top: the barebone worksheet, bottom: an example of a filled-out worksheet. Participants were instructed to enter the steps of data construction at a higher level in the upper blue boxes and specific tasks and considerations for each step in the gray boxes below. As they proceeded, participants used post-its to add any additional details.
  • Assuming constructing a dataset to develop an ML model to deploy for real-world application, what pipeline do you go through? What are the steps in the pipeline? What tasks are performed at each step? What specific challenges and considerations need to be addressed in each step?

To assist participants in elaborating upon their experiences, we provided a worksheet, as in Figure 1, and asked them to fill in the barebone pipeline with the steps of data construction, along with detailed challenges and considerations for constructing good data in each step.

Participants were asked to think aloud, fully sharing their experiences and opinions throughout the task. Participants explained the process of constructing a dataset in a specific domain/task of their choice (see the “Workshop Task” column in Table 1) to deploy an ML model trained with the data to the general public. Additionally, participants were instructed to choose a target task that involves constructing unstructured data for supervised learning, since human involvement is more pronounced for unstructured data.

Upon completing a data pipeline, the participants were then asked another set of questions intended to investigate human engagement in data construction, as it is a task that inherently involves human-model interaction:

  • What roles do humans play in data construction? What parts must humans intervene in and use their tacit knowledge? How do they interact with data and models to make good data in the process? What parts could humans benefit from interacting with models, i.e., Would gaining any information from the model help construct better data?

Finally, we also asked the participants what support they would provide, and how, to help non-expert users build datasets by following the data construction pipeline.

  • If non-experts were to create datasets following the pipeline, what help would they need to minimize trial and error? Which steps in data construction could be automated to help non-experts easily construct datasets? Which steps would be impossible to automate? Are there parts that mandate human intervention? Why?

Note that participants were given the specific context of helping non-experts create a dataset, as we sought to provide them a window to tap into, from a self-distanced perspective, the challenges they consider the most critical yet systematically solvable when constructing data.

3.3. Analysis & Coding

All interview sessions were video recorded with participants’ consent, resulting in over 30 hours of recording. From participants’ data construction pipelines and interview transcriptions, about 90 action items for constructing a dataset were extracted, with overlaps. Three members of our research team reorganized the participants’ responses and clustered similar action items under 6 major steps, derived from previous works (Karmaker, “Santu”; Roh et al., 2018; Whang and Lee, 2020) that lay out generalized steps for constructing a dataset for an ML system, resulting in a generalized data construction pipeline with 14 unique action items. We also organized the considerations and challenges in each step of data construction. This analysis was performed iteratively.

Further, we analyzed participants’ interview transcriptions in depth to deduce insights for developing good data. This thematic analysis was done using Dovetail (Dovetail, 2022), an online UX research/markup tool. Similar to the affinity diagramming method, each researcher individually highlighted noteworthy quotes from participants and assigned descriptive codes (such as “difficulties in data construction” or “requires domain knowledge”) to encode what information a quote contains or what insight it provides. After the individual analysis, the researchers came together for a collaborative deductive coding process, and 284 codes in total were extracted. Our analysis team categorized these 284 initial codes into 39 sub-level themes and further grouped them into three major themes.

4. Findings

This section presents our findings from interviews with 19 ML experts and the thematic analysis of interview responses. Taking a step beyond painting a broad landscape of data construction work, we scrutinize people’s realistic considerations and dilemmas in making good data for ML. This section includes two parts: 1) what humans do and consider in each step of the data construction pipeline, and 2) the recurring struggles and endeavors in constructing good data.

Figure 2. Data Construction Pipeline

4.1. What do humans do and consider in the dataset construction pipeline?

As described in Section 3.3, we reorganized our 19 interview participants’ data construction pipelines into one generalized dataset construction pipeline, shown in Figure 2. As Table 2 shows, the pipeline includes six steps, each composed of one to four sub-steps. In the following, we elaborate upon the pipeline, describing what humans do and consider when they make data.

4.1.1. Step 1 – Data Design

  (1)

    Identifying service requirements – [service values and operational contexts]

    The characteristics of the desired ML services/products and their goals were identified in the first step of data design. Participants who have built ML models for services stressed the importance of understanding how models will be used in the service. Knowing what values an ML-driven service intends to deliver [P1, P7, P12], what demands prospective service users may have, and in what environment a model would operate [P11] was essential for planning ahead what the data should look like. P17, drawing on his experience of developing a product classification model to be deployed in an autonomous robot inside a supermarket, said:

    “Understanding the store environment was essential for determining how a model should operate as all stores have different arrangements of products and layouts, and even the amount of products displayed on shelves is different, while all of that may influence a model’s working.” (P17)

  (2)

    Defining the task – [target domains, task difficulty and formulation, pilot study, literature survey]

    A step following identifying the service requirements was defining what problem a model solves. According to P14, it was necessary to assess whether a conventional rule-driven approach is sufficient or a data-driven approach is required to develop an algorithm for a service, as the latter requires data collection that can be costly and time-consuming. If the solution required data, then estimating the minimum/optimum amount of required data came next [P4, P7], and determining the problem formulation [P8] and the model architecture and training methods [P3, P7] were subsequent steps. Referring to how others previously approached similar problems could also be effective in determining the amount of data required and setting a model’s objective.

Table 2. Definition and Tasks in Each Step of Data Construction Pipeline.
Step Sub-step Definition & Tasks
1. Data Design (1) Identifying Service Requirements - Identify characteristics of the desired AI service(product) and understand its goals
(2) Defining Task
Define the model’s objectives
- Determine model I/O
- Determine methodology
- Survey previous works
2. Raw Data Collection and Exploration (3) Surveying Raw Data Sources
Determine raw data sources
- List-up and explore raw data sources
(4) Collecting Raw Data
Data collection
- Crawl or purchase existing data
- Create new data
(5) Exploring Collected Data Exploratory data analysis & visual inspection
3. Data Construction (6) Define Data Scheme
- Determine model evaluation metric (e.g., accuracy, F-score)
- Define annotation scheme
- Annotation format
- Annotation method
- Label hierarchy, etc.
(7) Planning Annotation
- Define annotation guideline
- Select annotation tool
- Recruit annotators or hire data collection agency
- Evaluate annotators’ knowledge, skills, and experiences
(8) Annotating Data
- Educate annotators
- Auto-labeling
- Annotate raw data
- Manage annotators
- Q&A for annotators
(9) Validating Annotated Data
- Cross-validation among annotators
- Qualitative evaluation of annotations
4. Data Preparation (10) Preprocessing & Splitting Data
- Split data to train/test/valid sets
- Preprocess data to match the format of model input
5. Model Training and Evaluation (11) Training and Evaluating Model
- Train baseline model(s)
- Select 2-3 different models and train/evaluate with the same data
- Estimate model performance with small data
- Qualitative & quantitative evaluation of model
- Visualization
(12) Strategizing Next Step
- Identify samples model didn’t perform well
- Determine the source of problem
- Model? Data? Annotation?
(13) Revising Data
- Change model OR add more data OR change annotation guide
- Version new data
6. Deployment (14) Deploying Model
- Deploy for service
- Continued maintenance

4.1.2. Step 2 – Raw Data Collection and Exploration

  (3)

    Surveying raw data sources – [source, availability, representativeness, specifications, legal and ethical issues]

    Surveying raw data sources, selecting appropriate sources, and deciding on methods for data collection were the main tasks of this step. It was necessary to estimate the amount of data feasible for collection [P4, P5, P9, P10, P14] and to select the types of data to be included or excluded [P7]. P5 mentioned that it is vital to consider the detailed characteristics of the data, such as image resolution and audio sampling rates, and what other data are in use in the target domain. Potential legal (e.g., copyright) and ethical issues with the data must also be checked [P4, P13, P16, P19]. In particular, P19 took extra care not to introduce human bias during data collection, as it is easy to blindly collect more data without careful consideration of data sources when more data often results in better performance. P19 also noted:

    “If I were to crawl data only from the website with the most community traffic in Korea (DCInside), because it has the most data, and train a language model, that model would likely learn explicit insults and even sexually discriminative comments. (…) That means there is a discrepancy between how the real-world data looks and what people want. Possibly that is the nature of the Internet, and that is why I was interested in cleaning corpuses crawled from the web. As more data usually results in a better model.”

    P4 also emphasized that the data must represent the entire domain, especially when a sampled subset of data is used.

  (4)

    Collecting Raw Data – [quantity, cost, quality, privacy, speed, validation]

    Data were collected from existing sources or created entirely from scratch. The cost of data collection, in terms of both time and money, must be taken into consideration [P4, P13, P14]. Collected data were then checked for duplicates and errors, such as missing values or broken links [P2, P10, P13, P16], and either rule-based or model-based methods were used for cleaning errors. It was also necessary to decide how to deal with privacy-sensitive data (if any), for example, by deleting it or applying masking [P4]. Additionally, P2 and P13 noted that it is important not to overload websites when using web crawlers to collect data.

  (5)

    Exploring collected data – [class distribution, noise, manual inspection]

    Collected data must be explored and, if necessary, validated by eye. Many of our participants suggested it is crucial to check for noise in the data and its class distribution. Removing dirty data can provide greater benefits than collecting more data, and often a small portion of contaminated data detrimentally harms the overall quality; thus, what constitutes noisy data has to be defined carefully. P5 explained that when detecting defects in electronic products, the various types of defects and the backgrounds that are not defects were identified from images. In that process, all data were inspected manually if the dataset was sufficiently small; otherwise, only a subset was. The participant also noted that having “defective” data was as important as having “normal” data, as balanced data is essential for a well-performing model.
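The kind of lightweight exploration described above — checking class balance, duplicates, and sampling a subset for manual inspection — can be scripted. The following is a minimal sketch in Python/pandas; the file name labels.csv and the filepath/label columns are hypothetical, used purely for illustration rather than taken from any participant's workflow.

```python
import pandas as pd

# Hypothetical annotation manifest: one row per sample with a file path and a label.
df = pd.read_csv("labels.csv")  # columns: filepath, label (illustrative names)

# Class distribution: imbalance often matters more than raw dataset size.
print(df["label"].value_counts(normalize=True))

# Exact duplicates and missing values are cheap to catch before annotation or training.
print("duplicate file paths:", df.duplicated(subset=["filepath"]).sum())
print("missing labels:", df["label"].isna().sum())

# Manual inspection: look at everything if the data is small, otherwise a random subset.
sample = df if len(df) <= 500 else df.sample(n=500, random_state=0)
sample.to_csv("to_inspect.csv", index=False)  # hand this subset to a human reviewer
```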

4.1.3. Step 3 – Data Construction

  (6)

    Defining data scheme - [evaluation metric, annotation method, test set, alignment with model]

    The annotation schema, including data formats, label categories, and annotation methods, was defined in this step along with the evaluation metric (e.g., accuracy or F1-score). Surveying the schemas and metrics used in other similar datasets came first [P1, P14]. The scalability of the data had to be taken into account when selecting an appropriate annotation method.

    “For example in OCR, it is possible to use either center points or bounding boxes when annotating texts, and while bounding boxes are more costly than center points, they are more scalable as center points can be easily computed from the points of a bounding box.” (P3)

    P1 emphasized that label categories should be constructed so that labels are mutually exclusive and collectively exhaustive. Evaluation metrics must be considered along with labels and annotations, i.e., whether a metric matches a model’s performance as experienced by the service users [P12, P14]. P5 shared an experience where the model’s performance metric and the service goal had to be closely aligned.

    “We had to take into account what our users expected. When a defect detection model makes an error, it can be either a false positive (okay but classified as defect) or a false negative (defect but classified as okay). Because FN is more critical than FP when it comes to detecting defects, we put more focus on minimizing the FN and tested our models towards that goal.” (P5)
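P5’s concern about false negatives versus false positives can be made concrete with standard metrics. Below is a minimal, illustrative sketch using scikit-learn with made-up labels (1 = defect, 0 = okay); it is not the participant’s actual evaluation code.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical defect-detection labels and predictions (illustrative data only).
y_true = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]  # 1 = defect, 0 = okay
y_pred = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false negatives (missed defects): {fn}")
print(f"false positives (false alarms):   {fp}")

# If missed defects (FN) are costlier than false alarms (FP), recall on the defect
# class is the number to track and report alongside precision.
print("recall   :", recall_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
```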

    Some found that formulating a test set before building other parts of a dataset was also critical for constructing good data. P4 and P10 noted that a test set should closely represent the real world, while P8 and P15 also noted that a test set must include samples representative of a model’s training.

  (7)

    Planning annotation - [annotation cost/speed/quality, hiring annotators, consistency of an annotation guide, annotation tool, pilot study]

    How raw data should be annotated was determined in this step. Hire annotators and let them work in-house, outsource tasks to a crowdsourcing platform such as Amazon Mechanical Turk, or annotate the data themselves? This decision required weighing time, money, and the expected quality of data. Once the annotation method was determined, an appropriate tool was surveyed and chosen, or implemented from scratch if no existing tool fit the purpose.

    The following steps were writing an annotation guideline and determining qualifying criteria for hiring annotators. Our participants found careful planning mandatory, as annotation tasks are “expensive,” with constraints in both time and money. The number of annotators was chosen based on the time available for the task [P4, P9, P14, P16, P18], and the available time in turn was determined by the allocated budget [P4, P14, P16].

    Consistency was of utmost importance for annotation guidelines, to minimize any chance of annotators producing disparate annotations when given the same data [P1, P3, P4, P6]. Educating annotators so that they thoroughly understand the guideline was another point of consideration [P4, P5, P16, P19].

    Finding a suitable annotation tool for the task was necessary, as data requirements are heterogeneous across tasks. P9 and P14 reported implementing an annotation tool from scratch to meet the requirements of their project, as no existing tool satisfied their needs. Commonly, our participants reported that an annotation tool’s supportive features directly influence the quality of the resulting data, such as the functionality to analyze annotation logs, manage the performance of annotators, or even simple shortcuts that help increase productivity.

    Naturally, the importance of who builds data was highlighted as well. Participants noted that the qualification criteria for annotators have to be set according to the specifics of the task and data [P5, P8, P9, P14, P15], and that not only the backgrounds of annotators but also those of the task managers writing guidelines and managing the workers must be considered to mitigate potential biases in data [P18].

    P18 mentioned that getting one’s hands dirty in the annotation tasks also helps one understand the potential challenges for annotators and what might be done to support them. P19 similarly noted that constructing a small pilot dataset before getting into the main task helps foresee potential data construction problems, which can be reflected in the annotation guideline and the overall plan ahead.

  (8)

    Annotating data - [annotator well-being & conditions, edge cases, annotation guide update]

    Aside from the annotation task itself, managing and educating the annotators, handling Q&A for annotations, and automating data annotation with the assistance of a model are included in this step.

    Following the education with annotation guidelines in the previous step, workers were trained in performing the annotation [P4, P11, P19]. P4 also illustrated for annotators how the data they annotate would be used in models and for what services, which helped workers become more motivated and productive. P19 found that showing various examples of the images workers would encounter during annotation was far more effective at increasing the consistency of annotations than explaining the guideline only in words. Almost all participants agreed that annotators themselves largely decide the quality of data, and some also mentioned that managing the annotation workload so as not to fatigue workers is important for both the efficiency and consistency of annotations [P4, P7, P11, P14, P18, P19].

    P4, P7, and P9 shared that workers often submit previously unreported edge cases during annotation, and merging new cases into the guideline should be done with care so that they do not conflict with completed annotations.

    Further, if available, a trained model can be used for automated annotation; having human annotators revise automated annotations can be significantly faster than annotating data manually from scratch.

  (9)

    Validating annotated data - [inter-annotator agreement, manual validation]

    The quality of annotated data is evaluated in this step. Participants cross-validated annotations across multiple workers or assessed annotation results qualitatively (i.e., manually). P19 measured inter-annotator agreement across multiple workers, while P11 had a well-performing annotator (or the engineers themselves) manually inspect annotators’ work.
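Inter-annotator agreement, as P19 describes, is commonly summarized with a chance-corrected statistic; our participants did not specify which one they used, so the sketch below illustrates pairwise Cohen’s kappa on made-up labels (for more than two annotators, Fleiss’ kappa or Krippendorff’s alpha would be typical choices).

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same ten items (illustrative data).
annotator_a = ["pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu", "pos", "neg"]
annotator_b = ["pos", "neg", "pos", "pos", "neu", "pos", "neg", "neg", "pos", "neg"]

# Cohen's kappa corrects raw agreement for chance; values near 1 indicate strong agreement.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Raw percent agreement, for comparison with the chance-corrected score.
agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
print(f"raw agreement: {agreement:.2f}")
```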

4.1.4. Step 4 – Data Preparation

  (10)

    Preprocessing and splitting data – [memory constraints, split strategy, test set quality]

    A complete dataset is assembled from the annotated and validated data in this step. The dataset is divided into train, validation, and test splits and preprocessed to match the model’s input format. P8 emphasized stratifying splits, while P12 and P15 noted that a test split should include diversified samples. Random sampling can be sufficient to obtain a balanced distribution across splits if the data quantity is large [P12, P15], but the split distributions tend not to follow the actual data distribution, especially in the early stage of data construction when the dataset is small. When the data has an imbalanced class distribution, extra care should be taken that each split includes at least one instance of every class [P5]. Additionally, the resulting data should be saved on a shared drive so that it is easily accessible [P9], with versioning and memory constraints also under consideration [P3].
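Stratified splitting is one straightforward way to honor the concerns above, keeping the class ratio in every split so that even rare classes appear in train, validation, and test sets. The sketch below uses scikit-learn on made-up file names and labels; it is illustrative, not the participants’ actual code.

```python
from sklearn.model_selection import train_test_split

# Hypothetical samples with an imbalanced label distribution (10% "defect").
samples = [f"img_{i:03d}.png" for i in range(100)]
labels = ["defect" if i % 10 == 0 else "okay" for i in range(100)]

# Stratification preserves the class ratio in every split, so the rare "defect"
# class shows up in train, validation, and test alike.
x_trainval, x_test, y_trainval, y_test = train_test_split(
    samples, labels, test_size=0.2, stratify=labels, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(
    x_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

for name, y in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, {c: y.count(c) for c in sorted(set(y))})
```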

4.1.5. Step 5 – Model Training and Data Refinement

  (11)

    Training and evaluating model – [wrong label, service requirements]

    A model is trained and evaluated in this step. Participants reported training and evaluating multiple models in parallel on the same data [P8, P17, P18, P19] and selecting a subset of samples for a quick test run before training a model on the full dataset [P3, P5].

    For model evaluation, the test scores were put first [P18], along with other measures such as inference speed and confidence scores [P1, P11, P14, P15]. Qualitative evaluation was also performed; for example, P5, P11, and P16 searched for cases where the model’s prediction and the ground-truth label disagree, as in some cases the annotated label was wrong but the prediction was correct. It was also brought up that evaluating the model in the service context is crucial, prioritizing the service requirements over evaluation metrics [P5].
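The disagreement search that P5, P11, and P16 describe can be partially automated by flagging confident disagreements between predictions and ground truth as candidates for label errors. The following is a minimal, illustrative sketch with made-up model outputs; the 0.9 confidence threshold is an arbitrary assumption.

```python
import numpy as np

# Hypothetical softmax outputs and ground-truth labels (illustrative data only).
probs = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.95, 0.05],
                  [0.60, 0.40]])
labels = np.array([0, 1, 1, 0])  # the third sample may be mislabeled

preds = probs.argmax(axis=1)
confidence = probs.max(axis=1)

# Confident disagreements are often wrong ground-truth labels rather than model
# errors, so they are worth sending back for manual re-inspection.
suspect = np.where((preds != labels) & (confidence > 0.9))[0]
print("indices to re-check manually:", suspect.tolist())
```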

  (12)

    Strategizing next step – [quantity, edge cases, data quantity, model refinement, data refinement/debugging]

    A decision is made to deploy a model or to go through another round of training. This step requires understanding the characteristics of the samples on which a model performs poorly and deciding whether the model or the data should be fixed in response. If data is the cause, it is necessary to determine whether increasing the data quantity is sufficient or whether new kinds of samples are needed. P13 and P14 suggested additional data collection with a revised annotation guideline or preprocessing strategy, unless refining the model is enough to fix the problem.

  (13)

    Revising data - [annotation guideline, difficult samples, synthesized data]

    Based on the decisions from the previous step, additional data is collected, the model is refined, or the annotation guideline is updated. If the types of samples the model performs poorly on are known, new data of a similar type can be collected [P1, P2, P4, P6, P7, P8, P9, P10]. Along with another round of training and evaluation, the annotation guideline is updated and the label categories are redefined if necessary [P1, P2, P3, P4, P5, P7, P15]. If possible, trained models can be utilized to generate synthetic data [P10]. Careful versioning is required as the data goes through changes during these iterations.

4.1.6. Step 6 – Deployment

  (14)

    Deploying model - [unseen data, re-collect]

    At last, a model is deployed for service. From this point onward, continued maintenance is necessary to keep the model robust against incoming real-world data that is possibly outside the model’s learned distribution. Planning for additional data construction and model training while inspecting incoming service data may be necessary [P16, P18].

All participants mentioned that data construction is an iterative process. They noted that iteration does not happen only after evaluating a model but at any step of the pipeline; data construction is not a linear, unidirectional operation but a bidirectional, cyclic process.

4.2. What challenges exist towards making good data? How to deal with them?

In this section, we focus on the overarching themes that repeat across each step in the pipeline. Such themes are best described as the inevitable struggles toward making good data. We present the difficulties the participants face when making data for ML and their strategies to deal with these struggles.

4.2.1. Trade-offs stemming from real-world constraints

P10, P14, and P16 reported that data construction is costly. Therefore, making good data by following ML and data management disciplines is difficult in practice because of realistic constraints such as time and cost.

The participants considered the trade-off between short-term and long-term costs in designing data. Specifically, P1, P14, and P19 argued for the importance of scalable data design that can respond to different models and domains. They mentioned that if they planned annotations with multiple options so the data could be reused across diverse models, they would pay more in the short term. Conversely, if they wanted to reduce short-term costs, they could make data quickly, but the data would not be scalable.

“If we want to use our data more than just once, continually adapting its format for various purposes, we will need to make sure the currently defined format is scalable. It won’t be easy, but surely it will help cut cost if we were to go large in quantity” (P19)

The participants shared how they dealt with such dilemmas with their own strategies. P10, P11, and P18 emphasized quick evaluation through constructing a pilot dataset for long-term cost reduction. Specifically, P10 started building data only after determining, through quick evaluation using public datasets, whether performance would improve if data were added. P18 conducted pilot annotations to identify in advance the problems or edge cases that may appear during data construction; he said that this pilot annotation minimizes trial and error when building data. P11 conducted data debugging after examining the output of a model trained with a small data sample.

The participants also mentioned trade-offs when managing annotators and preparing annotation guidelines. P6 and P17 reported a trade-off between the speed and the validity (Aroyo et al., 2022) of data. Sometimes the annotation schema that accurately represents the phenomena to capture might be very complex. Such schemas may impose too much cognitive load on annotators, slowing down the speed of data annotation. P17 shared his experience of reducing the number of label types to speed up data production at the expense of data validity. In addition, P4 considered the balance between validity and consistency when writing annotation guidelines. Giving specific directions on annotating ambiguous cases may help increase consistency among annotators; however, unifying the label for all subjective cases may negatively affect the validity of the data.

The participants established strategies to deal with such dilemmas in managing annotators and guidelines. After the first round of dataset construction, P16 set strategies to identify how much additional data to construct by looking for saturation points. P15 and P17 said they revised the data (e.g., annotation schema, annotation guideline) based on model performance. Furthermore, P19 stressed the importance of the dataset being easily reusable. To do so, data managers and modelers wrote data documentation and specifications covering the problem definition, data format, and baseline model performance.

The participants collectively mentioned various types of trade-offs in data work. Such dilemmas resulted from cost, scalability, data validity, and consistency. Nevertheless, participants adopted strategies to pursue data excellence, that is, to spend cost efficiently without abandoning data quality.

4.2.2. Harmonizing assorted data workers for consistency

Our participants collectively state that reconciling the differences that stem from the diverse people who participate in data work is difficult. Everyone has a different point of view on the same data or holds only part of the knowledge required to make good data. P10 and P12 noted that the results vary depending on one’s knowledge of the target service and model. It was difficult for P10 to determine whether the data met the service requirements due to a lack of domain knowledge. P17 noted that modelers need domain experts’ help during modeling to reduce the gap between service and model requirements.

“So we were making a translation service, a multilingual one. And we didn’t know all the languages, but we still made test sets and what not, all of that wasn’t so easy.” (P10)

Participants said that the required level of knowledge differed depending on the role they played in the process of constructing data. For example, P16 said that a person who determines the annotation schema or performs annotation does not need to understand the specifics of ML; however, it would be helpful for them to have basic knowledge of ML (e.g., that ML systems work based on distributions, what the model’s input/output is, and the current model’s pros and cons). Likewise, P3 mentioned that it is very important to educate a person who makes data on how the model is evaluated and how it works. P11 pointed out that knowing the service domain (e.g., whether there is a clear difference between “casual” and “minimal”) helps determine in advance, without trial and error, whether the model can accommodate a given annotation structure.

During annotation, the quality and quantity of work may differ significantly depending on the annotators’ knowledge level, well-being, and motivation. P6 emphasized that in some tasks where more data guarantees higher performance (e.g., speech-to-text models), “it’s all about the caliber of annotators”: how good they are at transcription and how fast they type has more influence on system performance than any other characteristic (e.g., data distribution or data sources).

The participants endeavored not to introduce bias into the model and data driven by annotators’ differing knowledge levels and biases. In particular, these characteristics affect the annotation guide and task management, which are the most critical factors in maintaining data quality. In defining the task and planning annotations, P6 said several people should write the guide to prevent author bias. P1, P3, and P12 use examples to prevent arbitrary interpretation of the guide and to show clear intent. P4 and P7 said that just giving guidelines is insufficient to make good data; therefore, they used various methods of worker training, such as showing animations or videos of sample work.

“Overall, the workers follow a similar flow, but their details can be different. Just with a guide, you can’t convey what you think should exactly happen in their mind as they make data. But when you educate the workers, you first give them data and show the sequence of how your attention flows as you work through the data, be it a video recording, then the workers will follow your flow. This boosts the consistency of data, and the types of edge cases the workers report become similar as well.” (P7)

Participants specifically attempted to reduce these differences through direct communication during the annotation step. P3, P6, P7, and P15 suggested that the modeler or data manager communicate with annotators constantly and promptly to reduce gaps in understanding of the guide. In addition, P12 said that it is necessary to understand the appropriate amount of work for each annotator rather than assigning as much work as possible. These attempts aimed to keep annotation quality consistent. After annotation, P12 reported that he checks the correlation of annotations among different workers to see whether there is noise stemming from certain annotators. This process facilitates data cleansing.
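P12’s check for annotator-specific noise can be approximated by comparing each annotator’s labels against a consensus such as the majority vote. The sketch below uses made-up binary labels and is only one simple way to do this; per-annotator agreement statistics such as Cohen’s kappa are a common alternative.

```python
import numpy as np

# Hypothetical binary labels: rows = items, columns = annotators (illustrative data).
labels = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])

# Majority vote per item, then each annotator's agreement with that consensus.
majority = (labels.mean(axis=1) >= 0.5).astype(int)
agreement = (labels == majority[:, None]).mean(axis=0)

for i, score in enumerate(agreement):
    print(f"annotator {i}: agreement with majority vote = {score:.2f}")
# An unusually low score flags an annotator whose work may need review or re-education.
```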

From judging a dataset’s suitability for the task to managing consistency among annotators, the participants pointed out the inevitable challenges they face due to human involvement in data construction. Their strategies for making a consistent, high-quality dataset included communicating with domain experts to align service requirements with ML tasks, investing effort in educating annotators, and checking inter-annotator agreement.

4.2.3. The necessity of human intuition and tacit knowledge gained through repetitive experience

Participants said that following textbook knowledge of technical skills and the disciplines for making good data is not enough. P10, P16, and P17 emphasized the importance of knowledge gained from the experience of looking through large amounts of data. Specifically, P14 and P17 mentioned that frequently looking into a model’s outputs is a must to find the parts where the model gets confused and to revise the label categories to resolve that confusion.

“It all comes down to experience when we try to sort out where a model gets confused, and figuring out those pain points in the first place. Whoever has gone through a lot has gut feelings about how things might go down when training a model, and what to do to get a model working straight.” (P14)

Participants also said that many steps in data construction require thorough manual inspection. Human intervention is mainly involved in the steps related to evaluating model performance. P17 claimed that quantitative metrics (e.g., accuracy) do not tell much about what exactly the system is good at, so qualitative evaluation of the output is a must. P1 also emphasized manually going through model predictions and model calibration, since a high confidence score from the model does not guarantee a correct answer. Constructing test sets requires human intuition as well. P14 often split the train, validation, and test sets manually, image by image, so that the test set could represent the cases to be evaluated with high fidelity. P1, P6, and P10 stressed that humans must inspect test datasets thoroughly.

“Honestly, I just pick a day, and sink some time into it. I have once gone through images one by one, before splitting train, validation, and test sets.” (P14)

The fundamental reason for relying on human intuition and tacit knowledge in data construction lies in the lack of interpretability of deep-learning models (i.e., black-box models) and the messy nature of data. P17 described the trickiness of interpreting the effect of data improvement on system performance, even for experts.

“I’ve been going straight in, but knowing the causality between improvements in data and model’s performance would have helped training (…) I have had glimpses about when putting in some data would work out, but only if that could be more concrete.” (P17)

P6, P11, P16, and P17 claimed the necessity of interpreting the model output into something meaningful to humans. They mentioned that there is an extra interpretation step upon examining the model output (e.g., label distribution, heatmap) to decide the next step for system improvement. For example, P6 and P11 mentioned Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2019), which visualizes which features of the data the model refers to when making predictions. After looking at the weight heatmap and the model output, they can decide which data influences the model’s predictions.
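For reference, Grad-CAM of the kind P6 and P11 mention can be sketched in a few lines of PyTorch: capture the activations of a late convolutional layer, weight them by their average gradients with respect to the predicted class, and upsample the result into a heatmap. The snippet below is a minimal illustration on a randomly initialized torchvision ResNet-18 and a random input (assuming PyTorch and a recent torchvision), not the participants’ actual tooling.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()  # random weights, purely illustrative
feats = {}

def hook(module, inputs, output):
    output.retain_grad()        # keep the gradient on this non-leaf activation
    feats["act"] = output

model.layer4[-1].register_forward_hook(hook)  # last conv block of the network

x = torch.randn(1, 3, 224, 224)               # stand-in for a real image
logits = model(x)
class_idx = logits.argmax(dim=1).item()
logits[0, class_idx].backward()               # gradients w.r.t. the predicted class

act, grad = feats["act"], feats["act"].grad   # both (1, C, H, W)
weights = grad.mean(dim=(2, 3), keepdim=True) # channel importance = mean gradient
cam = F.relu((weights * act).sum(dim=1)).detach()
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                    mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]
```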

Participants pointed out that data work requires human input due to the labyrinthine nature of data. P10 said that quantifying or predicting the behavior of ML systems is difficult, unlike in software engineering, because of the many variables stemming from data. Moreover, P3, P6, and P10 pinpointed that it is impossible to be prepared for all the edge cases and variance of data in advance, as is the nature of real-world data. P1 and P13 advised not to trust the ground truth 100%, since labels are human artifacts after all.

“What constitutes checkbox areas can differ and the check marks can differ too. You need an exorbitant amount of data to be able to cover all possible types of checkboxes.” (P3)

As such, current data work is indubitably a human endeavor. Building good data requires human builders with tacit knowledge only acquirable from empirical experience, along with their manual, tedious efforts. Humans being embedded so deeply in data makes systematizing data work a seemingly intractable problem without any efficient solution. Even so, data work is already developing into a disciplined engineering exercise, aiming for greater accountability and transparency of ML systems that originate from data, together with more efficient and iterable creation of data. What might be the next concrete steps toward that goal?

5. Discussion

Our findings show that iteratively improving the ML system through multiple rounds of data collection and refinement is the status quo. The participants shared diverse strategies to make good data for AI systems, all formulated through unique individual experiences rather than adapted from a pre-established best practice. Taking one step further from documenting and organizing the practices we observed, we discuss future research directions to help reduce the trial and error in the data construction pipeline, based on the experiences and desires of the ML practitioners.

5.1. There is no “skeleton key” to refining data.

How exactly should data be refined to make a better ML system? At the end of the interview, P6 gave one last remark: “How can we make good data? I have been pondering, to no avail. Really, foreseeing what data gives a good model is what we all dream of.” Our observations firmly echo the statement that “no established metrics for defining high-quality data exist yet” (Aroyo et al., 2022), highlighting the necessity for collective efforts on data quality assessment.

Nevertheless, our participants hinted at some hopeful yet concrete next steps toward systematizing data improvement. First, a model-independent metric for evaluating data quality needs to be developed. Assessing the fidelity of data (i.e., the degree to which the dataset represents reality) is an important topic to be explored (Batini et al., 2009; Aroyo et al., 2022). In that regard, P10 said that one common mistake in ML is judging the whole picture just by looking at sampled subsets of the data. Many of our participants found it crucial to check the class distribution of the data, and it would help to have a go-to method for characterizing the statistical landscape of the data. Another way to ensure model-independent data quality would be to align qualitative and quantitative metrics. Most participants went through annotated data with their own eyes and evaluated it qualitatively; however, this kind of heuristic, qualitative evaluation makes it difficult to determine the causal link between data improvement and system performance improvement. It is high time for the data management community to develop interpretable measures to assess data quality.

Second, practitioners should be able to use model results directly for debugging data. Many participants improved data based on model confidence scores or on analyses of model output. For example, P6 and P11 used Gradient-weighted Class Activation Mapping (Selvaraju et al., 2019), which visualizes the data–model interaction. However, model confidence scores are still not very reliable, and it is difficult to understand model behavior, especially when model outputs amount to nothing more than a handful of numbers, so more research on model explainability is needed (Gunning, 2017; Barredo Arrieta et al., 2020). Studies that provide interpretable results explaining why models behave a certain way, rather than just producing numbers, would help practitioners decide what specific data to collect and add.
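
To make the confidence-based debugging loop concrete, the following sketch flags samples where a model confidently disagrees with the annotated label as candidates for human review. This is only an illustration of the general heuristic participants described; the threshold, array layout, and function name are our own assumptions.

```python
# A hedged sketch of surfacing candidate label errors from model outputs.
# Thresholds and array shapes are illustrative assumptions, not a prescribed method.
import numpy as np

def flag_for_review(probs: np.ndarray, labels: np.ndarray, threshold: float = 0.9):
    """Return indices where the model confidently disagrees with the given label.

    probs  : (n_samples, n_classes) predicted class probabilities
    labels : (n_samples,) integer ground-truth labels as annotated
    """
    predicted = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    # Confident disagreement is a cheap heuristic for possible annotation errors;
    # flagged samples still need a human reviewer's judgment.
    suspicious = (predicted != labels) & (confidence >= threshold)
    return np.nonzero(suspicious)[0]

# Example with three samples: only the confidently misclassified one is flagged.
probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.02, 0.98]])
labels = np.array([1, 0, 1])
print(flag_for_review(probs, labels))  # -> [0]
```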

5.2. Annotation is at the core of data quality.

Unsurprisingly, most of our participants found annotation quality to be the most important aspect of the data. In the same vein, all of our participants reported that annotators themselves largely determine the data quality. Our participants strove to minimize inconsistency in annotations by writing well-designed annotation guidelines and devising effective methods for educating annotators. All of this points toward focusing on the stakeholders involved in the data annotation process. Investigating how the performance of annotators might be enhanced would go a long way toward improving the quality of the data.
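
One concrete way to monitor such inconsistency is to measure inter-annotator agreement on a shared subset of items. The sketch below uses Cohen's kappa via scikit-learn with illustrative labels; it is one plausible check under our own assumptions, not a practice our participants prescribed.

```python
# A minimal sketch of quantifying annotation consistency with Cohen's kappa.
# The label values below are illustrative.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["checked", "unchecked", "checked", "checked", "unchecked"]
annotator_b = ["checked", "unchecked", "unchecked", "checked", "unchecked"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # low agreement suggests the guideline needs revision
```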

For the annotators, a well-designed interface enabling efficient annotation should be developed. Zhang et al. (2022) is one work that emphasizes designing annotation tools customizable for the task at hand; likewise, our participants emphasized supportive functionalities of an annotation tool that help annotators reduce errors systematically. Moreover, an annotator’s understanding of the ultimate goal of the ML systems they contribute to, along with an overview of how models operate, can positively influence annotation quality. For example, P4 described to annotators how the data they annotate would be utilized in models and for what services, increasing worker motivation and productivity in turn. Another approach to aligning annotators’ motivation with the project goal is to provide appropriate education or guidance to workers. Teaching basic ML knowledge may have a positive impact on data quality, as previously investigated (Zhu et al., 2014; Batini et al., 2009).

For the managers of the annotation process, it is necessary to devise ways to support the annotators’ task management with both annotation speed and well-being in mind. For example, some participants suggested that annotation tools should support monitoring annotation pace and tracking outliers. Moreover, constant and quick communication is essential to maintain the consistency of data annotation and to keep annotators and managers/modelers on the same page, even in times of high volatility in annotation guidelines. The current tendency of the ML community to abstract away the human workers and disregard their contributions (Sambasivan and Veeraraghavan, 2022; Irani and Silberman, 2013; Gray and Suri, 2019) is toxic for both the annotation workers and the annotation requestors.
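
As a rough illustration of the pace monitoring participants wished annotation tools supported, the sketch below computes per-annotator throughput from a hypothetical annotation log and flags statistical outliers. The log schema, column names, and threshold are assumptions of ours; flags are meant to prompt a conversation with the annotator, not an automatic judgment.

```python
# A hedged sketch of per-annotator throughput monitoring with simple outlier flagging.
# The log schema (columns "annotator" and "item_id") is a hypothetical example.
import pandas as pd

def pace_report(log: pd.DataFrame, z_threshold: float = 2.0) -> pd.DataFrame:
    """Count items per annotator and flag unusually fast or slow workers."""
    per_annotator = log.groupby("annotator").size().rename("items")
    mean, std = per_annotator.mean(), per_annotator.std()
    report = per_annotator.to_frame()
    report["zscore"] = (per_annotator - mean) / (std if std else 1.0)
    # Outliers are prompts for follow-up, not penalties.
    report["outlier"] = report["zscore"].abs() > z_threshold
    return report

# Toy log: annotator C has processed far fewer items than A and B.
log = pd.DataFrame({
    "annotator": ["A"] * 120 + ["B"] * 95 + ["C"] * 30,
    "item_id": range(245),
})
print(pace_report(log))
```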

6. Conclusion & Future Works

There is no established best practice for data work in the ML community. Therefore, we must first find out and record how people make data and what efforts they make to make it “good,” and build best practices on top of that. We conducted semi-structured interviews with 19 ML practitioners who had experience constructing datasets for supervised learning to understand data work practices and challenges. Based on the interviews, we present a data-focused pipeline for end-to-end ML systems, including the specifics of what humans do and consider in data construction. Our findings empirically show that data work entails a myriad of concerns involving a constant interplay of data, models, and humans. We identified recurrent human challenges at different stages of the data-building process: 1) trade-offs from real-world constraints; 2) harmonizing assorted data workers for consistency; 3) the necessity of human intuition and tacit knowledge for processing data. Based on the experiences and aspirations of the ML practitioners, we explored future research directions to decrease trial and error in the data-building pipeline. While the scope of our work was limited to making datasets for supervised learning, we acknowledge that the degree or types of human intervention in the data construction process may differ in other settings, such as semi-supervised learning or weak supervision.

References

  • dca ([n. d.]) [n. d.]. Data-centric AI Resource Hub. https://datacentricai.org/. Accessed: 2022-9-12.
  • Amershi et al. ([n. d.]) Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, and Harald Gall. [n. d.]. Software Engineering for Machine Learning: A Case Study. ([n. d.]).
  • Anik and Bunt (2021) Ariful Islam Anik and Andrea Bunt. 2021. Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote Transparency. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 75, 13 pages. https://doi.org/10.1145/3411764.3445736
  • Aroyo et al. (2022) Lora Aroyo, Matthew Lease, Praveen Paritosh, and Mike Schaekermann. 2022. Data excellence for AI: why should you care? Interactions 29, 2 (Feb. 2022), 66–69.
  • Barredo Arrieta et al. (2020) Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58 (June 2020), 82–115.
  • Batini et al. (2009) Carlo Batini, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. 2009. Methodologies for Data Quality Assessment and Improvement. Comput. Surveys 41, 3 (July 2009).
  • Bender and Friedman (2018) Emily M Bender and Batya Friedman. 2018. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics 6 (2018), 587–604.
  • Bowman and Dahl (2021) Samuel R Bowman and George E Dahl. 2021. What Will it Take to Fix Benchmarking in Natural Language Understanding? (April 2021). arXiv:2104.02145 [cs.CL]
  • Boyd and Crawford (2012) Danah Boyd and Kate Crawford. 2012. CRITICAL QUESTIONS FOR BIG DATA. Inf. Commun. Soc. 15, 5 (June 2012), 662–679.
  • Buolamwini and Gebru (2018) Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research, Vol. 81), Sorelle A Friedler and Christo Wilson (Eds.). PMLR, 77–91.
  • Cartwright et al. (2019) Mark Cartwright, Graham Dove, Ana Elisa Méndez Méndez, Juan P. Bello, and Oded Nov. 2019. Crowdsourcing Multi-Label Audio Annotation Tasks with Citizen Scientists. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–11. https://doi.org/10.1145/3290605.3300522
  • Crawford and Paglen (2019) Kate Crawford and Trevor Paglen. 2019. Excavating AI: The politics of images in machine learning training sets. AI and Society (2019).
  • DeepLearningAI (2021) DeepLearningAI. 2021. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI.
  • Denton et al. (2020) Emily Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, Hilary Nicole, and Morgan Klaus Scheuerman. 2020. Bringing the People Back In: Contesting Benchmark Machine Learning Datasets. (July 2020). arXiv:2007.07399 [cs.CY]
  • Dixon et al. (2018) Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (New Orleans, LA, USA) (AIES ’18). Association for Computing Machinery, New York, NY, USA, 67–73.
  • Dovetail (2022) Dovetail. 2022. Customer knowledge platform - Dovetail. https://dovetailapp.com/. Accessed: 2022-9-16.
  • Eyuboglu et al. (2022) Sabri Eyuboglu, Bojan Karlaš, Christopher Ré, Ce Zhang, and James Zou. 2022. dcbench: a benchmark for data-centric AI systems. In Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning (Philadelphia, Pennsylvania) (DEEM ’22, Article 9). Association for Computing Machinery, New York, NY, USA, 1–4.
  • Galvez et al. (2021) Daniel Galvez, Greg Diamos, Juan Ciro, Juan Felipe Cerón, Keith Achorn, Anjali Gopi, David Kanter, Maximilian Lam, Mark Mazumder, and Vijay Janapa Reddi. 2021. The People’s Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage. (Nov. 2021). arXiv:2111.09344 [cs.LG]
  • Gebru et al. (2018) Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé, III, and Kate Crawford. 2018. Datasheets for Datasets. (March 2018). arXiv:1803.09010 [cs.DB]
  • Geiger et al. (2020) R Stuart Geiger, Kevin Yu, Yanlai Yang, Mindy Dai, Jie Qiu, Rebekah Tang, and Jenny Huang. 2020. Garbage in, garbage out? do machine learning application papers in social computing report where human-labeled training data comes from?. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcelona, Spain) (FAT* ’20). Association for Computing Machinery, New York, NY, USA, 325–336.
  • Geirhos et al. (2020) Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence 2, 11 (Nov. 2020), 665–673.
  • Gray and Suri (2019) Mary L Gray and Siddharth Suri. 2019. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Houghton Mifflin Harcourt.
  • Gunning (2017) David Gunning. 2017. Explainable artificial intelligence (xai). Defense advanced research projects agency (DARPA), nd Web 2, 2 (2017), 1.
  • Halevy et al. (2009) Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The unreasonable effectiveness of data. IEEE intelligent systems 24, 2 (2009), 8–12.
  • Hu et al. (2020) Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kuebler, and Lawrence S Moss. 2020. OCNLI: Original Chinese Natural Language Inference. (Oct. 2020). arXiv:2010.05444 [cs.CL]
  • Huff and Tingley (2015) Connor Huff and Dustin Tingley. 2015. “Who are these people?” Evaluating the demographic characteristics and political preferences of MTurk survey respondents. Research & Politics 2, 3 (July 2015), 2053168015604648.
  • Hutchinson et al. (2021) Ben Hutchinson, Andrew Smart, Alex Hanna, Emily Denton, Christina Greer, Oddur Kjartansson, Parker Barnes, and Margaret Mitchell. 2021. Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Virtual Event, Canada) (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 560–575.
  • Initiative ([n. d.]) BigScience Initiative. [n. d.]. Building a TB Scale Multilingual Dataset for Language Modeling. https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling. Accessed: 2022-9-12.
  • Irani and Silberman (2013) Lilly C Irani and M Six Silberman. 2013. Turkopticon: interrupting worker invisibility in amazon mechanical turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Paris, France) (CHI ’13). Association for Computing Machinery, New York, NY, USA, 611–620.
  • Karmaker et al. (2021) Shubhra Kanti Karmaker (“Santu”), Md Mahadi Hassan, Micah J Smith, Lei Xu, Chengxiang Zhai, and Kalyan Veeramachaneni. 2021. AutoML to Date and Beyond: Challenges and Opportunities. ACM Comput. Surv. 54, 8 (Oct. 2021), 1–36.
  • Mazumder et al. (2022) Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Douwe Kiela, David Jurado, David Kanter, Rafael Mosquera, Juan Ciro, Lora Aroyo, Bilge Acun, Sabri Eyuboglu, Amirata Ghorbani, Emmett Goodman, Tariq Kane, Christine R Kirkpatrick, Tzu-Sheng Kuo, Jonas Mueller, Tristan Thrush, Joaquin Vanschoren, Margaret Warren, Adina Williams, Serena Yeung, Newsha Ardalani, Praveen Paritosh, Ce Zhang, James Zou, Carole-Jean Wu, Cody Coleman, Andrew Ng, Peter Mattson, and Vijay Janapa Reddi. 2022. DataPerf: Benchmarks for Data-Centric AI Development. (July 2022). arXiv:2207.10062 [cs.LG]
  • Mehrabi et al. (2019) Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2019. A Survey on Bias and Fairness in Machine Learning. (Aug. 2019). arXiv:1908.09635 [cs.LG]
  • Mitchell et al. (2018) Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2018. Model Cards for Model Reporting. (Oct. 2018). arXiv:1810.03993 [cs.LG]
  • Muller et al. (2019) Michael Muller, Ingrid Lange, Dakuo Wang, David Piorkowski, Jason Tsay, Q. Vera Liao, Casey Dugan, and Thomas Erickson. 2019. How Data Science Workers Work with Data: Discovery, Capture, Curation, Design, Creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3290605.3300356
  • Ng (2021) Andrew Ng. 2021. A.I. needs to get past the idea of big data. https://fortune.com/2021/07/30/ai-adoption-big-data-andrew-ng-consumer-internet/. Accessed: 2022-9-16.
  • Ng et al. (2021) Andrew Ng, Dillon Laird, and Lynn He. 2021. Data-Centric AI Competition. https-deeplearning-ai.github.io/data-centric-comp. Accessed: 2022-9-16.
  • Northcutt et al. (2021) Curtis G Northcutt, Anish Athalye, and Jonas Mueller. 2021. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. (March 2021). arXiv:2103.14749 [stat.ML]
  • Paullada et al. (2021) Amandalynne Paullada, Inioluwa Deborah Raji, Emily M Bender, Emily Denton, and Alex Hanna. 2021. Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns (N Y) 2, 11 (Nov. 2021), 100336.
  • Pine et al. (2022) Kathleen Pine, Claus Bossen, Naja Holten Møller, Milagros Miceli, Alex Jiahong Lu, Yunan Chen, Leah Horgan, Zhaoyuan Su, Gina Neff, and Melissa Mazmanian. 2022. Investigating Data Work Across Domains: New Perspectives on the Work of Creating Data. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI EA ’22). Association for Computing Machinery, New York, NY, USA, Article 87, 6 pages. https://doi.org/10.1145/3491101.3503724
  • Prabhu and Birhane (2020) Vinay Uday Prabhu and Abeba Birhane. 2020. Large image datasets: A pyrrhic win for computer vision? (June 2020). arXiv:2006.16923 [cs.CY]
  • Rivera and Lee (2021) Veronica A Rivera and David T Lee. 2021. I Want to, but First I Need to: Understanding Crowdworkers’ Career Goals, Challenges, and Tensions. Proc. ACM Hum.-Comput. Interact. 5, CSCW1 (April 2021), 1–22.
  • Roh et al. (2018) Yuji Roh, Geon Heo, and Steven Euijong Whang. 2018. A Survey on Data Collection for Machine Learning: a Big Data – AI Integration Perspective. (Nov. 2018). arXiv:1811.03402 [cs.LG]
  • Roh et al. (2021) Yuji Roh, Kangwook Lee, Steven Whang, and Changho Suh. 2021. Sample Selection for Fair and Robust Training. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 815–827. https://proceedings.neurips.cc/paper/2021/file/07563a3fe3bbe7e3ba84431ad9d055af-Paper.pdf
  • Sambasivan (2022) Nithya Sambasivan. 2022. All Equation, No Human: The Myopia of AI Models. Interactions 29, 2 (feb 2022), 78–80. https://doi.org/10.1145/3516515
  • Sambasivan et al. (2021) Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo. 2021. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21, Article 39). Association for Computing Machinery, New York, NY, USA, 1–15.
  • Sambasivan and Veeraraghavan (2022) Nithya Sambasivan and Rajesh Veeraraghavan. 2022. The Deskilling of Domain Expertise in AI Development. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 587, 14 pages. https://doi.org/10.1145/3491102.3517578
  • Selvaraju et al. (2019) Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2019. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. International Journal of Computer Vision 128, 2 (oct 2019), 336–359. https://doi.org/10.1007/s11263-019-01228-7
  • Sun et al. (2017) Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. 2017. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision. 843–852.
  • Thakkar et al. (2022) Divy Thakkar, Azra Ismail, Pratyush Kumar, Alex Hanna, Nithya Sambasivan, and Neha Kumar. 2022. When is Machine Learning Data Good?: Valuing in Public Health Datafication. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 322, 16 pages. https://doi.org/10.1145/3491102.3501868
  • Tsipras et al. (2020) Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, and others. 2020. From ImageNet to Image Classification: Contextualizing Progress on Benchmarks. In Proceedings of the 37th International Conference on Machine Learning (ICML). PMLR.
  • Wang et al. (2022) Ding Wang, Shantanu Prabhat, and Nithya Sambasivan. 2022. Whose AI Dream? In Search of the Aspiration in Data Annotation.. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 582, 16 pages. https://doi.org/10.1145/3491102.3502121
  • Whang and Lee (2020) Steven Euijong Whang and Jae-Gil Lee. 2020. Data collection and quality challenges for deep learning. Proceedings VLDB Endowment 13, 12 (Aug. 2020), 3429–3432.
  • Wilkinson et al. (2020) Jack Wilkinson, Kellyn F Arnold, Eleanor J Murray, Maarten van Smeden, Kareem Carr, Rachel Sippy, Marc de Kamps, Andrew Beam, Stefan Konigorski, Christoph Lippert, Mark S Gilthorpe, and Peter W G Tennant. 2020. Time to reality check the promises of machine learning-powered precision medicine. Lancet Digit Health 2, 12 (Dec. 2020), e677–e680.
  • Xin et al. (2021) Doris Xin, Eva Yiwei Wu, Doris Jung-Lin Lee, Niloufar Salehi, and Aditya Parameswaran. 2021. Whither AutoML? Understanding the Role of Automation in Machine Learning Workflows. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 83, 16 pages. https://doi.org/10.1145/3411764.3445306
  • Yang et al. (2019) Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky. 2019. Towards Fairer Datasets: Filtering and Balancing the Distribution of the People Subtree in the ImageNet Hierarchy. (Dec. 2019). arXiv:1912.07726 [cs.CV]
  • Zhang et al. (2020) Amy X. Zhang, Michael Muller, and Dakuo Wang. 2020. How Do Data Science Workers Collaborate? Roles, Workflows, and Tools. Proc. ACM Hum.-Comput. Interact. 4, CSCW1, Article 22 (may 2020), 23 pages. https://doi.org/10.1145/3392826
  • Zhang et al. (2022) Yu Zhang, Yun Wang, Haidong Zhang, Bin Zhu, Siming Chen, and Dongmei Zhang. 2022. OneLabeler: A Flexible System for Building Data Labeling Tools. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22, Article 93). Association for Computing Machinery, New York, NY, USA, 1–22.
  • Zhao et al. (2019) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender Bias in Contextualized Word Embeddings. (April 2019). arXiv:1904.03310 [cs.CL]
  • Zhao et al. (2018) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics, New Orleans, Louisiana, 15–20. https://doi.org/10.18653/v1/N18-2003
  • Zhu et al. (2014) Haiyi Zhu, Steven P Dow, Robert E Kraut, and Aniket Kittur. 2014. Reviewing versus doing: learning and performance in crowd assessment. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing (Baltimore, Maryland, USA) (CSCW ’14). Association for Computing Machinery, New York, NY, USA, 1445–1455.