
Defining maximum acceptable latency of AI-enhanced CAI tools

Claudio Fantinuoli1,2, Maddalena Montecchio1
1 Johannes Gutenberg University Mainz
2 KUDO

1 Abstract

Recent years have seen an increasing number of studies around the design of computer-assisted interpreting tools with integrated automatic speech processing and their use by trainees and professional interpreters. This paper discusses the role of system latency in such tools and presents the results of an experiment designed to investigate the maximum system latency that is cognitively acceptable for interpreters working in the simultaneous modality. The results show that interpreters can cope with a system latency of 3 seconds without any major impact on the rendition of the original text, both in terms of accuracy and fluency. This value is above the typical latency of available AI-based CAI tools and paves the way for experimenting with larger context-based language models and higher latencies.

2 Introduction

Automatic speech recognition (ASR) has been regarded as a technology “with considerable potential for changing the way interpreting is practiced” [23]. In particular, ASR has been proposed as a means to overcome the shortcomings in current implementations of computer-assisted interpreting (CAI) tools, such as the inherent difficulty of manually looking up terms while interpreting. By integrating AI-based features in the interpreter workstation, such as the real-time automatic suggestion of numbers and other problem triggers [13], and by integrating advancements in extractive and predictive algorithms based on machine learning [28], the cognitive load required by the use of CAI tools may be reduced, offering new opportunities for interpreters to improve their performance. Initial empirical research seems to point in this direction [8, 14].

AI-enhanced CAI tools are generally based on the concatenation of several modules that may comprise a speech-to-text transcription engine, a parsing module to identify the units of interest, for example numbers, terminology or proper names, and a visualisation component for the human-machine interaction. Other implementations may adopt an end-to-end approach, making the cascading architecture obsolete. Such tools can be deployed on the edge, running completely on the user’s device, or on the web, hosted on servers and accessed through a web browser. Generally speaking, CAI tools present suggestions with a certain amount of delay with respect to the original speech. This delay is due to the architectural design of the tool and its components, for example the ASR latency or the computation time needed to run inference with a language model, but also to the intrinsic nature of the task, which may require a certain amount of linguistic context in order to make an informed decision on what to show to the final user. In a time-sensitive activity such as simultaneous interpreting, system latency may require interpreters to adopt new interpreting strategies to successfully integrate the suggestions into their rendition. If latency, however, exceeds a certain threshold, interpreters may not be able to integrate the suggestions, or may do so at the expense of fluency, cohesion, or even accuracy. In this case, the use of such a high-latency tool would become detrimental to the user-machine interaction, to the interpreter’s cognitive load, and finally to the quality of the rendition.
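To make the latency budget of such a cascaded architecture concrete, the following minimal sketch models the end-to-end delay as the sum of per-module processing delays plus the linguistic context window the parsing module waits for before committing to a suggestion. Module names and timings are illustrative assumptions, not measurements of any existing tool.

```python
from dataclasses import dataclass

@dataclass
class Module:
    """One stage of a hypothetical cascaded CAI pipeline."""
    name: str
    processing_delay: float  # seconds of processing time per chunk

def end_to_end_latency(modules: list[Module], context_window: float) -> float:
    """Delay between a word being uttered and its suggestion appearing
    on screen: per-module processing time plus the amount of linguistic
    context the parser waits for before committing to a suggestion."""
    return sum(m.processing_delay for m in modules) + context_window

# Illustrative (invented) timings for a three-module cascade.
pipeline = [
    Module("asr", 0.8),            # speech-to-text transcription
    Module("parser", 0.3),         # detect numbers, terms, proper names
    Module("visualisation", 0.1),  # render the suggestion in the UI
]
# Assume the parser waits for ~1 s of right context to disambiguate.
print(f"{end_to_end_latency(pipeline, context_window=1.0):.1f} s")  # 2.2 s
```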

Not much is known about this maximum acceptable latency threshold, i.e. the maximum ear-voice span (EVS) that interpreters can cope with in order to successfully integrate external suggestions without impacting the overall interpreting performance. Not only may this value vary among subjects, but it may also differ from the one reported in experimental analyses of interpretation performed without the use of a CAI tool.

Our hypothesis is that displaying suggestions within the average EVS reported in the literature should not negatively impact the interpreter’s rendition, even when this EVS is ‘forced’ upon the interpreter by the latency with which the tool generates such suggestions. To our knowledge, this is the first study that tries to empirically answer this question. On the one hand, knowledge about this threshold is crucial to understand whether current AI-based CAI implementations can already meet interpreters’ expectations or whether major efforts should be placed in further reducing system latency. On the other hand, an acceptable higher latency could allow the integration of more time-demanding NLP features, such as the automatic prediction of difficult parts of the speech [e.g. 28].

The remainder of this paper is organized as follows. Section 3 introduces the related work in the area of CAI tools and the empirical experiments conducted on them so far. Section 4 describes the data and the methodology adopted in this experiment. Section 5 introduces the evaluation framework. Section 6 presents the results. Finally, section 7 presents the conclusions and the outlook.

3 Related work

Recent years have seen an increasing number of studies around the design of AI-based computer-assisted interpreting tools and their use by trainees and professional interpreters.

Computer-assisted interpreting tools are digital devices designed to support the interpreter in different steps of their work, from preparation to the very act of interpreting. They have been proposed by several researchers over the past 20 years or so [29, 24, 27], but it is only recently that interest has surged within the community.

Thanks to new advances in artificial intelligence, ASR has reached a quality level that makes it suitable for integration into supportive technologies. By automating and extending the query system of CAI tools, this integration may solve the shortcomings of traditional tools [e.g. 12, 11, 18] and extend the features available in an interpreter’s workstation, for example automatically suggesting translations of specific terms as well as transcribing numbers and proper names in real time.

Over the years, a handful of empirical studies have been carried out to test the feasibility of the human-machine interaction in the simultaneous modality. They have focused in particular on the effectiveness of ASR support during the interpretation of numbers [9, 8, 14], one of the problem triggers of simultaneous interpreting identified in the literature [4, 15, 17, 26]. In order to measure the impact on the quality of the rendition, these studies have used either mock-up systems with a very short latency [9, 6] or real-life tools with a reported latency of under 2 seconds [8, 14].

From the interpreter’s perspective, system latency impacts the ear-voice span (EVS), since it forces the interpreter to wait a certain amount of time before being able to integrate the suggestions into their rendition. The EVS is the amount of time that separates the words uttered by the speaker in the source speech from their equivalent rendition in the target speech uttered by the interpreter [23]. Research in this field has a “long-standing tradition” [7] and EVS is considered an essential variable that can potentially impact interpreters’ performance [2, 16, 17, 20], such that it continues to be an object of analysis and assessment.

Previous research has found that the variable limits of interpreters’ EVS are generally attributable to two main factors: input-related factors, which might also enable interpreters to consciously regulate their lag, and personal factors, e.g. those related to short-term memory, which influence its maximum capacity [26, 1, 19]. Interpreters can, at least to a certain degree, regulate their EVS depending on the features of the speech segment they are translating, increasing or decreasing it consciously as a strategy [1, 10].

Measuring EVS is neither immediate nor easy, due to the fact that grammar, word order and syntactic structure might differ between the two languages [26]. The average EVS reported in the literature is between 2 and 3 seconds [2, 19, 20], with a peak of 10 seconds [22]; if measured in words, means ranging from 5 to 10 words have been reported [25].
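As an illustration of how a time-based EVS can be computed once source and target units have been aligned, consider the following minimal sketch; the alignment itself, which the cross-linguistic differences mentioned above make non-trivial, is assumed as given, and all timestamps are invented.

```python
def ear_voice_span(source_onsets: list[float], target_onsets: list[float]) -> list[float]:
    """EVS per aligned unit: the time between the onset of a source word
    and the onset of its rendition in the target speech (both measured in
    seconds from the start of the recording). The source-target alignment
    is assumed as given."""
    return [t - s for s, t in zip(source_onsets, target_onsets)]

# Invented onset timestamps for three aligned units.
evs = ear_voice_span([10.0, 14.2, 20.5], [12.4, 16.9, 23.8])
print([round(v, 1) for v in evs])          # [2.4, 2.7, 3.3]
print(round(sum(evs) / len(evs), 2), "s")  # mean EVS: 2.8 s
```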

4 Data and methodology

4.1 Dataset

For this experiment we used numbers as the unit of interest to be suggested by the system, as this is the problem trigger that has been most extensively studied so far in empirical experiments on CAI tools. We therefore chose and edited a speech particularly dense with numbers. The speech had a duration of 7 minutes and 15 seconds and did not pose any particular difficulties in terms of terminology.

The speech was delivered in English by a non-native speaker, and participants were asked to translate it simultaneously into their native language (Italian). The average speech pace was 105.5 words per minute, which corresponds to an ideal rate for the simultaneous interpretation of improvised speeches and is close to the ideal read-aloud rate of 100 words per minute. The speech contained 25 stimuli (numbers); 10 of them were accompanied by a referent.

The speech was prerecorded, and a video simulating an ASR system was edited ad hoc by the authors of the experiment in order to retain full control over the variable ‘latency’. In the video, the transcription of the numbers (without the relevant referent) was shown on screen after a variable delay with respect to the moment the participants received the acoustic signal in the headset. The acoustic numerical input was removed from the recorded speech and replaced with a neutral acoustic signal (/beep/). This choice was made in order to force the participants to use the visualization of the numbers on screen, thus relying exclusively on the ASR simulation. Every visual input (number) was displayed with a preset latency, which gradually increased during the course of the speech. Numbers were shown in isolation, with no embedded transcription of the speech. The text was divided into five sections, and in every section the visualization of the stimulus occurred with a different latency: in the first section, the numbers appeared on screen 1 second after the acoustic signal, in the second section after 2 seconds, and so forth. During their performance, participants received the visual numerical input on a large screen situated inside the classroom or on the monitors inside the booths.
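The latency manipulation described above can be summarised as a simple display schedule: each stimulus appears on screen at the onset of its acoustic signal plus the latency assigned to the section it falls in. The following sketch reconstructs such a schedule under this description; the stimulus onsets are invented for illustration.

```python
# Display schedule for the mock ASR video: five sections, each with a
# preset latency of 1 s ... 5 s between the beep and the on-screen number.
SECTION_LATENCY = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0}

def display_time(beep_onset: float, section: int) -> float:
    """Moment (seconds from speech start) a number appears on screen:
    the onset of the acoustic signal plus the section's preset latency."""
    return beep_onset + SECTION_LATENCY[section]

# Invented stimulus onsets: (section, beep onset in seconds).
stimuli = [(1, 12.0), (2, 95.3), (3, 180.7), (5, 390.2)]
for section, onset in stimuli:
    shown = display_time(onset, section)
    print(f"section {section}: beep at {onset:5.1f} s -> shown at {shown:5.1f} s")
```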

4.2 Participants

A total of eight participants took part in this study. All participants were students: Italian native speakers with German and English in their language combination, enrolled in the final year of a Master in Conference Interpreting program, with at least 1 year of experience in simultaneous interpretation. The participants had no prior experience of interpreting with CAI tools or ASR-enhanced CAI tools. This condition is in line with the experimental setups adopted in similar studies [e.g. 8, 14].

Some weeks before the experiment, the participants were provided with three short videos containing a simulation of the ASR system, in order to be minimally exposed to the experimental format. These videos were edited along the lines of the video used for the actual experiment, but all numerical visual inputs were displayed with the exact same latency, unlike the variable latency used in the experiment itself. The goal was to help the participants get accustomed to the experiment format, i.e., the way numbers were transcribed, their font size, their color and the way the order of magnitude was shown [14].

5 Evaluation framework and procedure

The analysis of the collected data has been performed on two different levels, namely on a stimuli-based and on a segment-based level.

The aim of the stimuli-based evaluation is to assess the accuracy level achieved by the interpreters in rendering the units of interest in the target language, depending on the variable latency of the mockup system. Within this level of assessment, the accurate rendition of the units of interest is evaluated considering the following components: the numerical information comprising its order of magnitude, and, if present, the relevant referent. A number of parameters were collected for the evaluation: presence of the number in the rendition, accuracy of the number, presence of the referent, accuracy of the referent, pronunciation disfluency. Numbers and referents that were approximated, generalized or omitted were considered to be errors. Moreover, lexical, syntactic, phonological or articulation mistakes were also classified as errors.
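As an illustration, the per-stimulus annotation scheme described above could be encoded as follows; the field and function names are our own shorthand for the parameters listed, not part of any published coding scheme.

```python
from dataclasses import dataclass

@dataclass
class StimulusScore:
    """Per-stimulus annotation following the parameters listed above."""
    latency_s: int           # latency of the section (1-5 seconds)
    number_present: bool     # the number was rendered at all
    number_accurate: bool    # value and order of magnitude are correct
    referent_present: bool   # the referent was rendered (if the source had one)
    referent_accurate: bool  # the referent is semantically correct
    disfluent: bool          # a pronunciation disfluency was observed

def number_accuracy(scores: list[StimulusScore], latency_s: int) -> float:
    """Share of accurately rendered numbers at a given latency
    (assumes at least one stimulus exists for that latency)."""
    at_latency = [s for s in scores if s.latency_s == latency_s]
    return sum(s.number_accurate for s in at_latency) / len(at_latency)
```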

The segment-based evaluation assesses the quality of the entire segments in which the numbers are embedded. This evaluation focuses on two different parameters. The first is the accuracy of the target segment, measured in terms of the following linguistic aspects: faithfulness, grammatical correctness, completeness, logical cohesion, consistency, plausibility. The second parameter is the listener’s perception in terms of delivery flow. This evaluation level focuses on the following paralinguistic aspects: perception of the interpreters’ voice and rhythm, eloquence, presentation, prosody and communicative effectiveness. The segment-based evaluation has been performed using a Likert scale (1 to 5) by three different evaluators.
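For the segment-based level, the ratings of the three evaluators can be aggregated per latency condition, for example as a simple mean; the following sketch illustrates this with invented ratings.

```python
from statistics import mean

# ratings[latency_s] -> Likert scores (1-5), one per evaluator per
# segment; the values below are invented for illustration.
ratings = {
    1: [4, 5, 4, 4, 5, 4],
    2: [5, 4, 5, 4, 5, 5],
    3: [4, 4, 3, 4, 4, 4],
}

for latency_s, scores in sorted(ratings.items()):
    print(f"{latency_s} s latency: mean segment rating {mean(scores):.2f}")
```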

There are three major limitations in this study that could be addressed in future research. Firstly, the interpretation proficiency of the participants selected for this experiment is biased towards the lower end. It is reasonable to assume that a sample randomized in terms of proficiency, for example one also including professional interpreters, might increase the maximum acceptable latency measured in the experiment. Secondly, the experiment focused on a single language pair. Because of intrinsic variations among languages, the EVS and the strategies adopted by interpreters on the basis of the language combination may influence the threshold under scrutiny. Finally, variations in speech complexity, for example in terms of syntactic structures, delivery speed, etc., have not been taken into consideration in our study. However, they may have a considerable impact on interpreters’ cognitive load and consequently on their ability to successfully integrate suggestions, and may require a dynamic adaptation of latency according to this variable.

6 Results

6.1 Stimuli-based evaluation

As introduced in the previous section, the stimuli-based evaluation focuses on the stimuli (numbers) and their referents and not on the whole sentence in which they are embedded.

No number and no referent was omitted at any latency by any of the participants. The accuracy, however, varies according to the latency. Figure 1 presents the accuracy of the rendition of the numbers displayed by the ASR simulation system. The best results were achieved with the first three latencies: 97.14% at 1 and 2 seconds and 98.85% at 3 seconds. The better score at 3 seconds may be explained by the interpreters’ acclimatisation to the dynamics of the experiment, which enabled them to adapt their interpreting approach once a learning effect had set in [21]. The worst results were observed when the latency increased above 3 seconds: 93.14% at 4 seconds and 94.86% at 5 seconds.

Figure 1: Accuracy of number rendition

In Figure 2 the accuracy of the referents of the displayed numbers is presented. The highest score was reached with a latency of 3 seconds (100%). The accuracy decreases with the two higher latencies: 94.28% at 4 seconds and 85.71% at 5 seconds.

Figure 2: Accuracy of referent rendition

The number of pronunciation disfluencies (Figure 3) reached a peak with a latency of 4 seconds (25.71% of pronounced numbers presented a disfluency). With lower latencies the number of disfluencies was low: 2.85% of numbers were disfluent at 1 second and 5.71% at 2 and 3 seconds.

Figure 3: Disfluency in number rendition

6.2 Segment-based evaluation

The segment-based evaluation aims at extending the assessment of the quality of the rendition from the unit of interest to the entire segment in which this unit appears. Segment-level accuracy was higher with latencies of 1 and 2 seconds; in particular, the best score was reached at 2 seconds. As in the stimuli-based evaluation, this leads us to conclude that the participants, although able to successfully integrate the accurate interpretation of the displayed numbers within their own EVS, needed a settling period in order to gain familiarity with the speech and the experimental format.

Figure 4: Accuracy at segment level

A minimal decline in segment-level accuracy was observed from a latency of 3 seconds onward, while at a latency of 5 seconds the accuracy degraded quite considerably. It can be assumed that with this high latency, participants experienced difficulties in allocating the correct amount of cognitive resources, since they were required to focus on several tasks at once: on the screen where the number was displayed with a specific delay, on retaining the information they had just heard in working memory, and on processing the new segments of the speech that the speaker continued to utter [e.g. 17].

The fluency of the delivery was rated high in the latency range from 1 to 3 seconds: paralinguistic aspects, e.g. the voice and the fluency of the rendition, were rated as more agreeable in the first three sections. At 4 seconds, the delivery flow deteriorated significantly, and it reached its lowest score in the 5-second latency section. It is plausible to assume that the increased cognitive load triggered by a longer EVS also had significant repercussions on the manner in which the interpreters delivered the target speech, as it adversely affected paralinguistic aspects such as the rhythm, the intonation, the voice and the overall impression [e.g. 5, 3].

Figure 5: Fluency at segment level

7 Conclusions and outlook

In this paper we presented the results of an empirical experiment aimed at measuring the maximum acceptable latency of an automatic suggestion feature for simultaneous interpretation. The results seem to suggest that interpreters are able to integrate suggestions by extending their ear-voice span ad hoc to 2 seconds without compromising the quality of their rendition, and to 3 seconds without any major disruption. A further extension of the system latency seems to induce a considerable reduction in the precision with which such suggestions are used and the emergence of information losses in the overall rendition. This is in line with our original hypothesis that the system latency should not exceed the average interpreter’s EVS. Within the latency threshold outlined in the experiment, next-generation AI-enhanced CAI tools could accommodate more complex and context-based NLP features without a significant risk of impairing the usability of the tool.

References

  • [1] Linda Anderson, Sylvie Lambert and Barbara Moser-Mercer “Simultaneous Interpretation: Contextual and translation aspects” In Bridging the Gap: Empirical Research in Simultaneous Interpretation Benjamins Translation Library, 1994, pp. 101–120
  • [2] Henri Barik “Simultaneous Interpretation: Qualitative and Linguistic Data” In Language and Speech 18.3, 1975, pp. 272–297
  • [3] Garcia Becerra “Do first impressions matter? The effect of first impressions on the assessment of the quality of simultaneous interpreting” In Across Languages and Cultures 17.1, 2016, pp. 77–98
  • [4] Sabine Braun and Andrea Clarici “Inaccuracy for Numerals in Simultaneous Interpretation: Neurolinguistic and Neuropsychological Perspectives” In The Interpreters’ Newsletter 7, 1996, pp. 85–102
  • [5] Hildegund Bühler “Linguistic (semantic) and extra-linguistic (pragmatic) criteria for the evaluation of conference interpretation and interpreters” In Multilingua, 1986, pp. 231–235
  • [6] Sara Canali “Technologie und Zahlen beim Simultandolmetschen: utilizzo del riconoscimento vocale come supporto durante l’interpretazione simultanea dei numeri”, 2019
  • [7] Bart Defrancq “Corpus-based research into the presumed effects of short EVS” In Interpreting. International Journal of Research and Practice in Interpreting 17.1, 2015, pp. 26–45
  • [8] Bart Defrancq and Claudio Fantinuoli “Automatic speech recognition in the booth: Assessment of system performance, interpreters’ performances and interactions in the context of numbers” In Target. International Journal of Translation Studies, 2020 DOI: 10.1075/target.19166.def
  • [9] Bart Desmet, Mieke Vandierendonck and Bart Defrancq “Simultaneous interpretation of numbers and the impact of technological support” In Interpreting and technology Berlin: Language Science Press, 2018, pp. 13–27
  • [10] Valentina Donato “Strategies adopted by student interpreters in SI: a comparison between the English-Italian and the German-Italian language-pairs” In Interpreter’s Newsletter 12, 2003
  • [11] Claudio Fantinuoli “Computer-assisted preparation in conference interpreting” In Translation & Interpreting 9.2, 2017, pp. 24–37
  • [12] Claudio Fantinuoli “InterpretBank. Redefining computer-assisted interpreting tools” In Proceedings of the Translating and the Computer 38 Conference London: Editions Tradulex, 2016, pp. 42–52
  • [13] Claudio Fantinuoli “Speech Recognition in the Interpreter Workstation” In Proceedings of the Translating and the Computer 39 London: London, 2017
  • [14] Claudio Fantinuoli and Elisabetta Pisani “Measuring the impact of automatic speech recognition on interpreter’s performances in simultaneous interpreting” In Empirical studies of translation and interpreting: the post-structuralist approach Routledge, 2021
  • [15] Francesca Frittella ““70.6 Billion World Citizens”: Investigating the difficulty of interpreting numbers” In Translation & Interpreting 11.1, 2019, pp. 79–99
  • [16] David Gerver “Effects of Grammaticalness, Presentation Rate, and Message Length on Auditory Short-Term Memory” In Quarterly Journal of Experimental Psychology 21.3, 1969, pp. 203–208
  • [17] Daniel Gile “Basic Concepts and Models for Interpreter and Translator Training: Revised edition” Amsterdam: John Benjamins Publishing Company, 2009
  • [18] Silvia Hansen-Schirra “Nutzbarkeit von Sprachtechnologien für die Translation” In trans-kom 5.2, 2012, pp. 211–226
  • [19] Marianne Lederer “Simultaneous Interpretation—Units of Meaning and other Features” In Language Interpretation and Communication Springer, 1978, pp. 323–332
  • [20] Taehyung Lee “Ear Voice Span in English into Korean Simultaneous Interpretation” In Meta 47.4, 2004, pp. 596–606
  • [21] Christopher D. Mellinger and Thomas A. Hanson “Interpreter traits and the relationship with technology and visibility” In Translation and Interpreting Studies 13.3, 2018, pp. 366–392 DOI: 10.1075/tis.00021.mel
  • [22] P. Oléron and H. Nanpon “Research into simultaneous translation” In The interpreting studies reader, 2001, pp. 69–76
  • [23] Franz Pöchhacker “Introducing Interpreting Studies” Routledge, 2016
  • [24] Anja Rütten “Terminology Management Tools for Conference Interpreters – Current Tools and How They Address the Specific Needs of Interpreters” In Proceedings of the 39th Conference Translating and the Computer Geneva: Tradulex, 2017, pp. 98–103
  • [25] Nancy Schweda Nicholson “Linguistic and Extralinguistic Aspects of Simultaneous Interpretation” In Applied Linguistics 8, 1987, pp. 194–205
  • [26] Robin Setton and Andrew Dawrant “Conference interpreting: a complete course”, Benjamins translation library (BTL) volume 120 Amsterdam & Philadelphia: John Benjamins Publishing Company, 2016
  • [27] Christoph Stoll “Jenseits simultanfähiger Terminologiesysteme” Trier: Wvt Wissenschaftlicher Verlag, 2009
  • [28] Nikolai Vogler, Craig Stewart and Graham Neubig “Lost in Interpretation: Predicting Untranslated Terminology in Simultaneous Interpretation” arXiv: 1904.00930 In arXiv:1904.00930 [cs], 2019 URL: http://arxiv.org/abs/1904.00930
  • [29] Martin Will “Zur Eignung simultanfähiger Terminologiesysteme für das Konferenzdolmetschen” In trans-kom 8.1, 2015, pp. 179–201