\system: Evaluating Interactive Human-LM Co-writing Systems
Abstract.
A surge of advances in language models (LMs) has led to significant interest in using LMs to build co-writing systems, in which humans and LMs interactively contribute to a shared writing artifact. However, there is a lack of studies assessing co-writing systems in interactive settings. We propose a human-centered evaluation framework, \system, for interactive co-writing systems. \system showcases an integrative view of interaction evaluation, where each evaluation aspect consists of categorized practical metrics. Furthermore, we present a use case to demonstrate how to evaluate and compare co-writing systems using \system.

1. Introduction
Language models (LMs) have advanced significantly, showcasing unprecedented capabilities in solving a wide spectrum of generation and language understanding tasks (Rae et al., 2021; Chowdhery et al., 2022; Ouyang et al., 2022). This has spurred great academic and public interest in using LMs to build writing assistants, in which humans collaborate with LMs to paraphrase sentences (e.g., QuillBot), autocomplete sentences (Chen et al., 2019), write stories (Akoury et al., 2020), etc. Despite the interactive nature of the co-writing process, co-writing systems are at present primarily tested in non-interactive settings (Yuan et al., 2022; Dang et al., 2022). Specifically, current studies commonly evaluate only the final, co-written article (Mirowski et al., 2022), or collect pre- and post-study assessments of how humans perceive LMs (Yuan et al., 2022; Singh et al., 2022), etc. Consequently, these evaluations fail to capture the delta (or dynamic shift) of human-LM interactions. Consider, for instance, a scientific paper writing task involving two types of participants: freshmen and university professors. While it may not be surprising that the professor-LM team achieves significantly better writing quality, this metric does not reflect the impact of the LM on its users: freshmen might have benefited significantly from the model in terms of scientific paper structure, grammar correction, etc., whereas professors might have achieved similar writing performance even without the LM. In other words, the quality of the final article cannot genuinely reflect the co-writing system’s influence on users. Instead, we should assess users’ dynamic interaction improvements to indicate system capability, such as the relative quality change between two iterated articles from the same user.
Despite its importance, to the best of our knowledge, few studies examine how to assess the interaction shift in the iterative co-writing process, or how to depict an integrative view for evaluating interactive co-writing systems. To close this gap, we propose \system: a human-centered evaluation framework for human-LM interactive co-writing systems. We identify the key components and interaction aspects for evaluating co-writing systems. We further collect comprehensive types of evaluation metrics under each aspect, including novel measurements designed for dynamic interaction assessment as well as metrics for the other two conventional yet important aspects (i.e., human-LM interaction and writing artifact evaluation). Through a case study, we show that \system can be effectively used as a thinking tool for comprehensively evaluating and comparing co-writing systems.
2. \system Framework
We aim to propose \system as a guideline that can assist researchers in evaluating and analyzing interactive co-writing systems more comprehensively and fairly.
To this end, we focus on identifying key axes for co-writing systems. To begin with, grounded in the Co-Creative Framework for Interaction Design (COFI) model (Rezwana and Maher, 2022), we identify three key components in human-LM co-writing systems: the human and the LM, collaborating as partners, and the shared writing artifact (e.g., essay, story, paper) they jointly produce (Rezwana and Maher, 2022). Among these three components, humans are the primary decision-makers when interacting with LMs and writing artifacts. To reflect this importance, we design \system to be a human-centered evaluation framework, meaning that the ultimate objective of these metrics is to help humans achieve their various writing needs (e.g., better user experience, higher-quality writing artifacts).
As shown in Figure 1, \system further recognizes three evaluation aspects centered on these components and the interactions between them: (1) evaluating human-LM interaction, (2) evaluating the dynamic interaction trace, and (3) evaluating the writing artifact.
Existing evaluation approaches may capture too few aspects to reflect system capability holistically (Yuan et al., 2022; Mirowski et al., 2022; Dang et al., 2022) or focus heavily on one-time metrics that neglect dynamic interaction changes (Singh et al., 2022; Lee et al., 2022). \system therefore seeks to contribute in two ways. First, \system presents a more integrative view of interaction evaluation supported by practical metrics. Second, \system explicitly extracts a set of metrics to assess dynamic change across iterations.
3. \system to Practical Metrics
How can \system guide practical evaluation? We use \system to analyze existing evaluations and reflect on which axes these metrics emphasize. Concretely, we use an inductive approach to collect the practical metrics adopted in state-of-the-art co-writing systems (e.g., Wordcraft (Yuan et al., 2022), Integrative Leaps (Singh et al., 2022), Beyond Text Generation (Dang et al., 2022), Dramatron (Mirowski et al., 2022), etc.). Then we categorize them into subgroups and fit them into the \system framework. Figure 1(b) depicts the categorized evaluation metrics for each \system aspect. We next define the aspects and metric subgroups in \system and briefly explain the underlying motivation. Please see Table 1 in Appendix A.1 for more elaborate metrics and details. (Note that we do not aim to build exhaustive lists of evaluation metrics; instead, we focus on introducing the motivation and process of creating these evaluation aspects and subgroups, which can be generalized to broader use.)
Evaluating Human-LM interaction. These metrics measure interactions between the co-writers (i.e., human & LM).
Suppose humans perceive the LM as a co-author in co-writing systems. Their evaluations then primarily derive from two dimensions: i) how does the human feel about collaborating with the LM? (i.e., “Perception of LM as Partner”). This subgroup consists of both general perception metrics (e.g., enjoyment, preference) and ethical metrics (e.g., stereotypicality).
Meanwhile, ii) how credible is the LM’s feedback? (i.e., “Measure LM Feedbacks”). This subgroup involves common metrics shared across tasks (e.g., usability) as well as task-oriented metrics, such as imagination for storytelling or structure for scientific writing.
Evaluating dynamic interaction trace. These metrics focus on evaluating the dynamic change of interactions along the iterative writing process.
We identify three dimensions for evaluation. First, as humans iteratively update the artifact, the “Iterative Interaction Change” subgroup compares metrics across multiple iterations, such as measuring humans’ understanding of the LM before they start writing and again when the article is nearly finished.
Second, the “Span Interactive Flow” subgroup assesses metrics that require observation across multiple artifact versions (e.g., consistency, learning curve).
Third, response time affects user experience in interactive systems; we thus include “Efficiency” metrics (e.g., latency, incorporation rate) to assess the process.
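For illustration, the “Efficiency” subgroup can typically be computed directly from interaction logs. The sketch below is a minimal example assuming a hypothetical log format with timestamped request, response, and accept events; it is not tied to any particular system.

```python
from dataclasses import dataclass

@dataclass
class LogEvent:
    kind: str         # "request", "response", or "accept" (hypothetical event types)
    timestamp: float  # seconds since the session started

def efficiency_metrics(events: list[LogEvent]) -> dict:
    """Compute simple efficiency metrics from a hypothetical interaction log."""
    requests = [e for e in events if e.kind == "request"]
    responses = [e for e in events if e.kind == "response"]
    accepts = [e for e in events if e.kind == "accept"]
    # Latency: elapsed time from each human request to the paired AI response
    # (assumes responses are logged in the same order as requests, one per request).
    latencies = [r.timestamp - q.timestamp for q, r in zip(requests, responses)]
    return {
        "request_count": len(requests),
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else 0.0,
        # Incorporation rate: fraction of AI suggestions the writer accepted.
        "incorporation_rate": len(accepts) / len(responses) if responses else 0.0,
    }
```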
Evaluating writing artifact. These metrics gauge the content of the final writing artifact that the human and LM jointly produce.
We broadly divide these metrics into two dimensions: “Measure Writing Artifacts”, where we compare the written artifacts with ground-truth articles (e.g., using Jaccard similarity) or, when no ground truth exists, recruit external experts for evaluation, and “Perception from Writer”, where writers provide subjective feedback on their outputs (e.g., satisfaction or ownership).
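As a concrete illustration of the ground-truth comparison metrics in Table 1 (lemma-based Jaccard similarity and word edit distance), the sketch below approximates both with plain whitespace tokenization; a real lemmatizer could replace the simple `lower().split()` step.

```python
def jaccard_similarity(article: str, reference: str) -> float:
    """Token-set Jaccard similarity (a lemmatizer could replace lower().split())."""
    a, b = set(article.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def word_edit_distance(article: str, reference: str) -> int:
    """Levenshtein distance over word tokens, via dynamic programming."""
    a, b = article.split(), reference.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            # Minimum of deletion, insertion, and substitution/match costs.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = curr
    return prev[-1]
```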
4. Case Study: \system for Co-writing
We consider \system as a framework for researchers to fairly evaluate and compare co-writing systems. \system can be useful for researchers to: 1) identify key interactive evaluation aspects among human, LM, and writing artifact interactions during the co-writing process, 2) select appropriate metrics to assess and compare the co-writing systems, and 3) comprehensively analyze and describe the human-LM interactive performance of the co-writing systems. Next, we present a concrete use case of re-evaluating the Beyond Text Generation (BTG) (Dang et al., 2022) system to showcase how to use \system for evaluation.
Suppose the researchers have built the BTG system and need to evaluate its performance in an interactive setting. \system can help them assess the system comprehensively and compare it to existing baselines. For example, they can use the \system framework to unpack their hypothesis into fine-grained evaluation requirements. A mapping may look like: “the BTG system can help humans write better articles (i.e., aspect 3: evaluate writing artifact) by enabling better human-LM interactions (i.e., aspect 1: evaluate human-LM interaction) in efficient ways (i.e., aspect 2: evaluate dynamic interaction trace).” With these aspects in mind, they can then dive into each aspect to select appropriate metrics to support this statement. For instance, they can assess “writers’ perceptions of the LM” with the enjoyment and preference metrics, and “how writers think about the LM’s feedback” with the usability, effectiveness, and coherence metrics, etc. Also, they can assess dynamic interaction efficiency by analyzing objective logging data (e.g., incorporation rate, latency). The final article can also be rated with a set of measures (e.g., external expert review, quality, satisfaction, ownership, etc.). Following the evaluation methods presented in Appendix A.2 (which captures the most common methods used for state-of-the-art system evaluations), they can map the metrics to actual user study designs. Note that the researchers should apply all measurements to both the proposed BTG system and the baselines, ideally with the same group of users, so that they can compare the performance of the co-writing systems in a fair manner.
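As a sketch of what this mapping could look like in practice (the metric names follow Table 1; the plan structure itself is hypothetical, not prescribed by \system), researchers might record their evaluation plan as a simple data structure before designing the study:

```python
# Hypothetical evaluation plan for the BTG re-evaluation, organized by \system aspect.
# Metric names follow Table 1; study details and variable names are assumptions.
evaluation_plan = {
    "aspect1_human_lm_interaction": {
        "Perception of LM as Partner": ["enjoyment", "preference"],
        "Measure LM Feedbacks": ["usability", "effective", "coherent"],
    },
    "aspect2_dynamic_interaction_trace": {
        "Efficiency": ["incorporation rate", "latency"],
    },
    "aspect3_writing_artifact": {
        "Measure Writing Artifacts": ["external expert evaluation", "quality"],
        "Perception from Writer": ["satisfied", "ownership"],
    },
}

# Apply the same plan to the proposed system and the baselines, ideally with the
# same participants, so the comparison stays fair.
systems_under_test = ["BTG", "baseline"]
```

Each selected metric then maps onto one of the evaluation methods summarized in Appendix A.2 (e.g., Likert survey items for the perception metrics, interaction data logs for the efficiency metrics).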
5. Conclusion
This work presents \system: a human-centered framework to evaluate human-LM interactive co-writing systems. It provides a thinking tool for researchers to design comprehensive interaction evaluations and analyses. We further feature a case study introducing how to use \system step by step for fair evaluations.
6. Acknowledgement
We especially thank Dr. Ting-Hao ’Kenneth’ Huang for his insightful feedback, valuable support, and for co-organizing the In2Writing Workshops!
References
- Akoury et al. (2020) Nader Akoury, Shufan Wang, Josh Whiting, Stephen Hood, Nanyun Peng, and Mohit Iyyer. 2020. STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6470–6484. https://doi.org/10.18653/v1/2020.emnlp-main.525
- Chen et al. (2019) Mia Xu Chen, Benjamin N Lee, Gagan Bansal, Yuan Cao, Shuyuan Zhang, Justin Lu, Jackie Tsay, Yinan Wang, Andrew M Dai, Zhifeng Chen, et al. 2019. Gmail smart compose: Real-time assisted writing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2287–2295.
- Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
- Dang et al. (2022) Hai Dang, Karim Benharrak, Florian Lehmann, and Daniel Buschek. 2022. Beyond Text Generation: Supporting Writers with Continuous Automatic Text Summaries. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–13.
- Lee et al. (2022) Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, et al. 2022. Evaluating Human-Language Model Interaction. arXiv preprint arXiv:2212.09746 (2022).
- Miles and Huberman (1994) Matthew B Miles and A Michael Huberman. 1994. Qualitative data analysis: An expanded sourcebook. sage.
- Mirowski et al. (2022) Piotr Mirowski, Kory W Mathewson, Jaylen Pittman, and Richard Evans. 2022. Co-writing screenplays and theatre scripts with language models: An evaluation by industry professionals. arXiv preprint arXiv:2209.14958 (2022).
- Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155 (2022).
- Rae et al. (2021) Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446 (2021).
- Rezwana and Maher (2022) Jeba Rezwana and Mary Lou Maher. 2022. Designing Creative AI Partners with COFI: A Framework for Modeling Interaction in Human-AI Co-Creative Systems. ACM Transactions on Computer-Human Interaction (2022).
- Singh et al. (2022) Nikhil Singh, Guillermo Bernal, Daria Savchenko, and Elena L Glassman. 2022. Where to hide a stolen elephant: Leaps in creative writing with multimodal machine intelligence. ACM Transactions on Computer-Human Interaction (2022).
- Yuan et al. (2022) Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. 2022. Wordcraft: story writing with large language models. In 27th International Conference on Intelligent User Interfaces. 841–852.
Appendix A
A.1. \system Metric Details
We list the practical evaluation metrics of \system with details, including Interaction Aspects, Subgroups, Metrics, Measure Questions, and References, in Table 1.
A.2. Evaluation Methods in User Studies for Co-writing Systems
We summarize a list of evaluation methods that are commonly used in studying co-writing systems (Singh et al., 2022; Mirowski et al., 2022; Yuan et al., 2022) as a reference for future work. This summary aims to provide inspiration and benchmarks for future co-writing system work when designing and comparing user study evaluations.
(a) Coding users’ think-aloud transcripts. Researchers can encourage participants to articulate their thinking while interacting with the systems, such as why they decided to use a particular prompt to query the LM rather than others. After finishing the study, researchers can convert the video/audio recordings into transcripts and code the transcripts for further data analysis. Common qualitative data analysis approaches include thematic analysis, content coding, and topic modeling (Miles and Huberman, 1994).
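If the topic-modeling route is taken, a lightweight sketch (using scikit-learn on a few made-up transcript segments) might look like the following; the segments and the number of topics are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical transcript segments produced by the transcription step above.
segments = [
    "I asked the AI for a shorter opening sentence",
    "the suggestion felt repetitive so I rewrote it myself",
    "I kept prompting until the structure matched my outline",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(segments)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Inspect the top words per topic as candidate themes for manual coding.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_words}")
```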
(b) Coding researchers’ observation notes. During the user studies, the researchers can also record their observations for later analysis. For instance, they can pre-design a set of topics (e.g., user’s emotion change, etc.) that are important to answer their research questions, and pay close attention to these topics during the studies.
(c) Coding (semi-)structured interview transcripts. Prior studies also commonly leverage (semi-)structured interviews to elicit users’ experiences and feedback on using the systems. Researchers can design effective interview questions that elicit participants’ answers to better support the research arguments.
(d) Questionnaires or surveys. Previous studies also frequently design surveys, which include questions such as five-point Likert-scale ratings, single-choice or multiple-choice questions, etc. These surveys can provide more accurate measures of users’ assessments.
(e) Interaction data logs. Data logs collected during interaction can provide more objective analyses of user behaviors. Typical interaction data logs for co-writing systems involve artifact submission count, prompt request frequency, latency, etc.
(f) Assessment of the written artifacts. These metrics aim to directly evaluate the quality of the written artifacts, commonly by using automatic metrics that compare against ground truth (e.g., similarity), computing artifact properties (e.g., word count, document length), or having external experts assess the outputs.
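As one example of the property-based route, word count and a rough repetition score (an illustrative stand-in, not the open-source scorer cited in Table 1) can be computed as follows:

```python
def artifact_properties(article: str) -> dict:
    """Simple, ground-truth-free properties of a written artifact."""
    tokens = article.lower().split()
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    # Surplus occurrences of trigrams that duplicate an earlier trigram.
    repeated = len(trigrams) - len(set(trigrams))
    return {
        "word_count": len(tokens),
        # Fraction of trigram positions that repeat earlier material.
        "repetition_score": repeated / len(trigrams) if trigrams else 0.0,
    }
```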
Table 1. Practical evaluation metrics of \system, organized by interaction aspect and subgroup.

| Interaction Aspects | Subgroups | Metrics | Measure Questions | References |
| --- | --- | --- | --- | --- |
| ➊ Evaluating Human-LM Interaction | Perception of LM as Partner [General perceptions] | enjoyment | I enjoyed writing the story. | (Yuan et al., 2022; Mirowski et al., 2022) |
| | | effort | I put lots of effort into getting AI suggestions. | (Singh et al., 2022) |
| | | preference | I prefer the suggestions from the AI agent. | (Mirowski et al., 2022) |
| | | communicative | I can communicate well with the AI agent. | (Singh et al., 2022) |
| | | cognitive load | Interacting with the AI agent requires much cognitive load. | (Singh et al., 2022) |
| | | collaborative | I feel the AI agent is collaborative to work with. | (Yuan et al., 2022; Singh et al., 2022; Mirowski et al., 2022) |
| | | ease | The AI agent is easy to learn and work with. | (Yuan et al., 2022; Mirowski et al., 2022) |
| | Perception of LM as Partner [Ethical Metrics] | understanding | I can understand the AI agent. | (Singh et al., 2022) |
| | | literal | The AI agent’s suggestions are literal. | (Mirowski et al., 2022) |
| | | stereotypical | The AI agent’s suggestions are stereotypical. | (Mirowski et al., 2022) |
| | | non-intentional | The AI agent does not have intentions behind its suggestions. | (Singh et al., 2022) |
| | Measure LM Feedbacks [Common Metrics] | coherent | The AI generations are coherent with the prompts. | (Singh et al., 2022; Mirowski et al., 2022) |
| | | variety | The AI agent’s suggestions are varied. | (Singh et al., 2022; Mirowski et al., 2022) |
| | | helpful | I found the AI agent helpful. | (Yuan et al., 2022; Singh et al., 2022; Mirowski et al., 2022) |
| | | effective | The AI agent is effective at suggesting ideas. | (Yuan et al., 2022) |
| | | repetition | The AI agent generates repetitive suggestions. | (Singh et al., 2022) |
| | | usability | The AI agent is useful for my writing. | (Singh et al., 2022) |
| | | combinatorial | I feel the AI agent combines a broad set of information. | (Singh et al., 2022) |
| | Measure LM Feedbacks [Task-oriented Metrics] | uniqueness | The AI’s suggestions are unique. | (Yuan et al., 2022; Singh et al., 2022; Mirowski et al., 2022) |
| | | reality | The AI’s suggestions align with common sense. | (Singh et al., 2022; Mirowski et al., 2022) |
| | | novelty | The AI agent often generates unexpected suggestions. | (Singh et al., 2022) |
| | | freedom | I feel the AI agent can express itself freely. | (Singh et al., 2022) |
| | | structural | The AI suggestions are structural. | (Mirowski et al., 2022) |
| | | imagination | I feel the AI agent has a lot of imagination. | (Singh et al., 2022) |
| | | unexpected | The AI suggestions are often unexpected to me. | (Singh et al., 2022; Mirowski et al., 2022) |
| ➋ Evaluating Dynamic Interaction Trace | Iterative Interaction Change [diff from prior knowledge] | different from initial expectation | The interaction differs from my initial expectation. | (Singh et al., 2022) |
| | | adjust to prior expectation | I adjusted my expectations relative to prior ones. | (Singh et al., 2022) |
| | Iterative Interaction Change [diff from prev. interact. states] | dynamics of suggestion integration | How does the integration of suggestions change dynamically? | (Singh et al., 2022) |
| | Span Interactive Flow | learning curve | I can learn to use this system quickly. | (Singh et al., 2022) |
| | | consistency | The AI generates consistent suggestions throughout the interaction. | (Singh et al., 2022; Mirowski et al., 2022) |
| | | flow and ordering | The flow and ordering of co-writing are smooth. | (Singh et al., 2022; Mirowski et al., 2022) |
| | Efficiency | latency | The elapsed time from human request to AI response. | (Dang et al., 2022) |
| | | incorporation rate | The rate of incorporating AI suggestions. | (Dang et al., 2022) |
| | | request count | The count of human requests. | (Dang et al., 2022) |
| | | time considering suggestions | The average time for the human to consider AI suggestions. | (Dang et al., 2022) |
| | | #accepted suggestions | The count of accepted AI suggestions. | (Dang et al., 2022) |
| | | time to complete | The elapsed time for the human to complete the task. | (Dang et al., 2022) |
| ➌ Evaluating Writing Artifact | Measure Writing Artifacts [With Ground Truth] | word edit distance | The word edit distance between the prior and post articles. | (Mirowski et al., 2022) |
| | | lemma-based Jaccard similarity | The similarity between the ground-truth and outcome articles. | (Mirowski et al., 2022) |
| | | document length difference | The document length difference between the prior and post articles. | (Mirowski et al., 2022) |
| | Measure Writing Artifacts [No Ground Truth] | outcome creativity | The article I wrote with AI is creative. | (Singh et al., 2022; Mirowski et al., 2022) |
| | | quality | The outcome article is high-quality. | (Lee et al., 2022) |
| | | external expert evaluation | External experts assess the writing artifacts. | (Lee et al., 2022) |
| | | word count | The total word count of the outcome article. | (Dang et al., 2022) |
| | | open-source repetition scorer | Computing a repetition score using existing tools. | (Mirowski et al., 2022) |
| | Perception from Writer [Satisfaction] | writing goal | The outcome article reaches my writing goal. | (Dang et al., 2022; Mirowski et al., 2022) |
| | | pride | I’m proud of the final article. | (Dang et al., 2022; Mirowski et al., 2022) |
| | | satisfied | I feel satisfied with the final article. | (Mirowski et al., 2022) |
| | Perception from Writer [Ownership] | ownership | I feel ownership over the final article. | (Yuan et al., 2022; Singh et al., 2022; Mirowski et al., 2022) |
| | | authorial discretion | I can decide what/how to put the AI suggestions into the article. | (Singh et al., 2022) |