Voicify Your UI: Towards Android App Control with Voice Commands
Abstract.
Nowadays, voice assistants help users complete tasks on the smartphone with voice commands, replacing traditional touchscreen interactions when such interactions are inhibited. However, the usability of those tools remains moderate due to problems in understanding the rich language variations in human commands, along with efficiency and comprehensibility issues. Therefore, we introduce Voicify, an Android virtual assistant that allows users to interact with on-screen elements in mobile apps through voice commands. Using a novel deep learning command parser, Voicify interprets human verbal input and matches it with UI elements. In addition, the tool can directly open a specific feature of an installed application by fetching application code information to explore the set of in-app components. Our command parser achieved 90% accuracy on the human command dataset. Furthermore, the direct feature invocation module achieves better feature coverage than Google Assistant. The user study demonstrates the usefulness of Voicify in real-world scenarios.
1. Introduction

Smart mobile devices have revolutionised the user experience, and touchscreens are becoming more and more popular. In normal usage scenarios, interacting with a touchscreen is handy and intuitive; however, it is a tedious process for users whose hands are occupied (e.g., cooking, holding a baby or typing) and for those with temporary hand injuries or permanent motor impairment (Zhong et al., 2014). Under those circumstances, attempting to perform physical interactions on the phone such as tapping or scrolling is extremely inconvenient and imprecise. Researchers have also pointed out other problems associated with physical mobile interaction. Despite increasing screen sizes (Kim et al., 2011), interaction is hindered by the limited coverage of thumb movement (Xiong and Muraki, 2016), requiring effort from users to reach far-apart UI elements. On smaller screens, misclicks regularly happen due to clustered elements and complex usage scenarios (Hu et al., 2018). Therefore, users envisage an alternative way to interact with smart devices that helps them overcome these issues.
Advanced technologies have fueled the development of voice assistants, which provide an innovative solution to the described problems of physical touch, working as a hands-free input method under many circumstances. This technology changes how individuals interact with mobile devices to achieve daily tasks. The most well-known voice assistants include Alexa, Siri and Cortana, which are integrated into various smart devices across multiple platforms (Hoy, 2018). In Android, Google Assistant (Assistant, 2022) and Google Voice Access (Hume, 2020) are the most frequently used voice assistants. The key distinction between these two assistants is that Voice Access offers full control over on-screen interactions but cannot reach an in-app feature directly, whereas Google Assistant lets users directly open a specific feature from another application. Apart from these Google systems, researchers have proposed other systems that leverage voice interaction with Android devices (Zhong et al., 2014; Bhalerao et al., 2017; Liu et al., 2015).
However, there is room for further research as some key functionalities are still missing or can be extended. Much prior research targeted only a fixed set of tasks, such as messaging, calling or changing the device’s settings using voice commands (Liu et al., 2015; Kulhalli et al., 2017; Khan et al., 2018); these systems do not offer full control over one’s device and hence cannot replace touch gestures such as tapping or swiping. Voice Access addressed this issue by providing a system that allows users to fully control their Android devices using voice commands. However, the system has a limited number of features and users have reported reliability issues with it (Access, 2022). Our experiment also shows that Voice Access has a steep learning curve due to the complexity of its command syntax. On the other hand, the ability to directly open a given feature from other applications has been overlooked by smartphone voice assistants. Many useful features of an application are located deep in its navigation hierarchy (Foster and Foxcroft, 2011), requiring several steps to reach from the home screen. Therefore, allowing users to jump straight to the requested application screen is highly valuable in a real-world context. Google Assistant has a fixed list of application features that can be invoked directly (namely App Actions (Assistant, 2022)), which were registered by app developers. Because the registration process is tedious and requires extra effort, many apps omit the required declarations, and hence their functionalities cannot be invoked by Google Assistant (Arsan et al., 2021). Another issue with the aforementioned systems is that they are not open-sourced, which prevents other developers from further researching and integrating with them.
This paper introduces Voicify (see Fig. 1), an application with novel approaches to enhance the usability of voice control on Android devices. The system works as a background service that interprets the user’s voice commands and executes them. We propose a semantic parser that bridges the gap between the natural language of human commands and interactive tasks on mobile devices. Voicify allows users to perform daily physical interactions with mobile devices, such as scrolling, tapping or inputting text, entirely through voice commands. We achieve these functionalities by mapping the structured actions from the semantic parser to the collected on-screen UI elements. In addition, Voicify can directly invoke a specific feature of an installed application without any extra integration work from application developers. We achieve this by fetching available components from applications that are installed locally on the user’s device. Furthermore, users can give several commands within one utterance, and the application will execute each command sequentially.
We validated the technical contributions of Voicify by evaluating the command parser and the direct invocation module. The command parser achieved over 90% accuracy on the testing data extracted from the AndroidHowTo dataset (Li et al., 2020a), which consists of short-form human commands for performing specific tasks on Android devices. We then analysed the ability to directly open a feature from other applications, using Google Assistant as a baseline. The results showed that Voicify achieves 76.9% feature coverage on a dataset of 117 in-app features extracted from 30 well-known applications, compared to Google Assistant’s 47%. Lastly, we conducted a user study to evaluate the system’s usability in real-world use cases. Participants were asked to complete given tasks using voice commands and provide feedback on Voicify compared to Voice Access as the baseline. The experimental results showed that Voicify is well-integrated and easy to use, helping users achieve better performance on the given tasks.
To summarize, the contributions of this paper include:
•
A highly usable voice control and navigation system, Voicify, that can recognise user vocal commands, associate them with app features (for direct feature invocations) or on-screen elements (for basic interactions) based on the pre-extracted and run-time application data, and execute them accordingly on Android devices.
•
A deep learning-based human command parser that can correctly map a user’s command into executable actions in the context of mobile apps.
•
Our experiments and user studies show that Voicify is easy to learn and use, and requires a minimal cognitive load to achieve better performance than the existing product when using real-world apps.
•
Voicify is open-sourced (https://github.com/vuminhduc796/Voicify) so that anyone can use and continue to improve the system.
2. Background & Related works
2.1. Natural Language Understanding in Voice Assistants
Natural Language Understanding (NLU) systems aim to interpret and process the user’s speech input. In recent years, research has been conducted on natural language interpretation and on mapping human instructions to user interface actions. The Seq2Act model was introduced to extract actions (such as open, click and navigate) and object targets from user instructions and associate them with mobile UI elements to support executable action sequences (Li et al., 2020a). The technology is not limited to smartphones but also extends to other smart systems (Krishna and Nagendram, 2012; Bai, 2022) and home appliances (Park and Kim, 2018).
In Android, JustSpeak was launched with advanced command-chaining recognition and extended Google Automatic Speech Recognition (ASR) with its utterance parsing technique (Zhong et al., 2014). The predefined set of commands from JustSpeak was later extended by Smart Voice Assistant to handle calls and SMS using voice commands (Bhalerao et al., 2017). In recent years, many technologies have been proposed to make NLP components more approachable, improving their applicability in different Android systems. Arsan et al. used Dialogflow as a conversational agent, providing a platform that can extract the intent from user utterances (Arsan et al., 2021). In addition, by utilizing the pre-built Almond language model, DoThisHere provides a voice-controlled system to get and set UI contents in Android (Yang et al., 2020). However, the inflexible nature of the pre-built language models from Almond and Dialogflow limits their capability to be extended to fully support our use cases. Therefore, Voicify introduces an extensible parser using a deep learning approach to cater for the flexibility of human commands. We propose the model as a reusable solution to improve communication between humans and technological devices.
2.2. Android UI Semantics & On-screen Interaction
A core challenge of developing a system-level assistive tool is processing and interacting with on-screen UI semantics at run-time. Android introduced the Accessibility Service API in Android 1.6, allowing developers to create accessibility tools that assist users with different types of impairments (Developers, 2022b). This API gives Android developers access to additional accessibility metadata and the displayed content of the UI window, and supports on-screen interactions with the device. The Accessibility Service API has brought forth opportunities that propel the development of assistive and automation tools for Android (Bhalerao et al., 2017; Xie et al., 2021; Salehnamadi et al., 2021). Those tools collect data from users’ devices for analysis or perform sequential actions by matching the input with extracted on-screen textual data.
One of the earliest applications to leverage this technology was JustSpeak (Zhong et al., 2014). The solution supports on-screen interactions via voice commands, such as tapping on-screen elements, based on the collected accessibility metadata. Similar work was conducted in Weber et al.’s “VoiceNavigator” application in 2016, which was the centre of their study on improving the visibility and learnability of mobile voice user interfaces (Corbett and Weber, 2016). Since then, Google has released Voice Access (Hume, 2020), which aims to assist people with temporary injury or motor impairment in basic navigation and control of the current screen. While Voice Access is the most popular voice control tool on Android, its low usability negatively affects the user experience, according to the user reviews on its Play Store listing (voi, 2022). Using a wide range of advanced features from the Accessibility API, we propose Voicify to improve upon existing solutions and enhance the intuitiveness of Android assistive tools.
2.3. Android Components & Direct Feature Invocation
In Android, a transition between activities (representing UI screens) happens when the user navigates to another screen. An activity can invoke another activity by sending an Intent messaging object to the Android OS, which then retrieves the corresponding intent filter and eventually navigates to the destination activity. Although sending intent objects is powerful, as it can request features from any installed application (Alhanahnah et al., 2020), intent-based invocation has not been commonly applied for direct component communication. One recent work in this area applied intent-based communication for component invocation to create application shortcuts (Arsan et al., 2021). While their approach requires static analysis of a given dataset to generate a pre-defined database of shortcuts, Voicify retrieves the data from on-device applications, which avoids the limitation of dataset coverage. On the other hand, Google provides Google Assistant as a part of the Android OS, which can directly open a feature of an application via intent invocation. However, the platform has limited feature coverage since it is not automatically compatible with Android applications. External developers are required to manually declare the mapping between their existing features, expressed as intents, and the corresponding voice commands that trigger them, and they must frequently update these declarations. Voicify reads such data directly from the application files, which requires no extra effort from developers to comply with the platform.
Recently, deep links have been introduced in Android, allowing developers to assign a uniform resource identifier (URI) to a specific feature within an app (Developers, 2022c). Several researchers have worked on improving the usability and performance of the Android native deep link system to enhance the ability to directly open a feature within an app. A record-and-replay implementation was proposed in the uLink framework to improve the default deep linking system, providing a lightweight, universal solution with reduced developer effort (Azim et al., 2016). Similarly, DroidLink proposed a solution that uses a model to analyze the transition of activities within a given app to build shortcuts to different UI elements (Ma et al., 2016). While these approaches attempted to replicate the behaviour of Android deep links, their authors noted reliability issues and limitations in performing some actions. Therefore, Voicify utilizes the deep links that developers have already created for their applications to directly open their features.
3. Motivation
3.1. Effects of Learnability and Ease of Operation on User Experience for Voice Control Systems
In real-world usage, voice control systems are mostly used in situations where users are busy with other activities (Zhong et al., 2014; Kim et al., 2020). When users are focused on a primary activity, voice control is the common choice for interacting with mobile devices to perform a secondary activity; users therefore cannot fully concentrate on their voice-controlled tasks. In addition, researchers have pointed out that learnability and discoverability can strongly affect the user experience with a voice control system (Furqan et al., 2017). Therefore, it is essential to aim for ease of operation and minimal mental demand to improve the usability of Voicify.
3.2. Usefulness of Providing In-App Feature Shortcuts
Due to the small screen size, mobile apps often spread their functionality across multiple screens. It is therefore time-consuming and complicated for users to reach the final destination screen to perform a task, given that the entire process is done via voice commands. Providing shortcuts to specific screens inside applications is thus useful. In Android, this idea has been implemented by Google Assistant to directly open a feature of a given application. However, Google Assistant has low feature coverage due to its implementation (Arsan et al., 2021), and it lacks a method that supports users in fully completing their intended task via voice commands to cater for hands-free usage. Hence, we propose Voicify to open specific features inside applications while avoiding Google Assistant’s shortcomings.
3.3. Robustness to the Rich Linguistic Variations in the Human Command Utterances
One major issue that causes inaccuracy in voice systems is the rich linguistic variation in human utterances. Users might describe one action or target using multiple word choices. For example, different synonyms of “tap”, such as “click” or “press”, are spoken by users to perform a tap on the screen. In addition, an application name such as “Gmail” might be mentioned as “Gmail” or “Google mail”. The complexity of multiple variations in word choice within a single command causes NLP systems to misunderstand the user’s intention in their command utterances (Wijeratne et al., 2019), therefore requiring a better approach.
3.4. Design Goals
Based on the motivation in the previous section, we propose the following design goals to improve the usability of existing tools:
•
Comprehensibility: Reduce the learning curve of the system and require minimal cognitive load from users to operate the tool.
•
Efficiency: Provide direct access to in-app features without navigating through multiple screens using voice commands, to reduce the time to complete a certain activity.
•
Robustness: Understand the user’s intention in human speech utterances with rich linguistic variations to avoid misinterpretation of user commands.
4. The Voicify System

Based on the design implications, we propose Voicify to enhance voice interaction between users and Android devices. Our system provides a background service that continuously listens to the user’s commands and executes the corresponding actions. We address the problem of misinterpreting user commands by proposing a deep learning command parser that caters for the rich linguistic variation in human speech. Voicify allows users to directly interact with UI elements using on-screen labels and numbering tooltips, improving the comprehensibility of the system. In addition, by processing on-device application files, the system improves voice control efficiency by directly opening the feature that matches the user’s request. Our approach overcomes the limited feature coverage of Google Assistant and minimizes the developer effort needed to comply with the Google platform. The overall system design is shown in Fig. 2 and consists of three modules: Data Collection (Section 4.1), Command Parser (Section 4.2) and Dialogue Manager (Section 4.3).
4.1. Data Collection
Voicify efficiently explores and processes on-device data in the Android environment. To provide contextual data for executing actions, we collect both (i) static data about application components for direct invocation and (ii) dynamic data from the device’s screen to identify on-screen UI elements and perform interactions.

4.1.1. Application Components Exploration
To provide direct invocation of different app components, Voicify fetches the system files from locally installed applications that specify the list of components each app provides. Thus, all installed applications are compatible with Voicify without any additional integration from app developers, resulting in higher application feature coverage. In addition, local retrieval prevents application version mismatches when the locally installed version is not up-to-date with the version released by the developers, a reported issue that caused Google Assistant to malfunction (Xaif et al., 2022).
Our system uses explicit intents for the direct invocation process, requiring an activity name or a deep link to determine the target component. These data can be found inside the Manifest.xml file within the application package. Therefore, Voicify locates and retrieves the Manifest.xml file from the on-device application directories, as shown in Fig. 3. The data inside application directories are encoded in Android-specific binary XML format, allocated into chunks with little-endian byte order by default (Application, 2022). Therefore, we introduce a binary decoder to convert the manifest file of each installed application into a UTF-8 character representation.
Voicify retrieves the intent filter for each exported component, which acts as an access point for opening that component. Using the attributes of the retrieved intent filters, we construct corresponding intents that can directly invoke them. Fig. 3 illustrates the representation of deep links (blue boxes) and exported components (red boxes) in the manifest file of the YouTube application. Since a deep link is a URI that uniquely identifies a destination, an intent carrying a deep link can directly invoke an app component. The system builds it by concatenating the scheme, host and path prefix from the <data> tag. To open application components that are not integrated with deep links, we use the package name and the activity name to identify the destination component. Voicify utilizes the declared action names, such as SEARCH and VIEW, to improve the granularity of feature exploration. All extracted intents are used for the direct invocation process in Section 4.3.
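To make this construction concrete, the sketch below shows how the two kinds of access points extracted from the manifest can be turned into intents: a deep link assembled from the scheme, host and path prefix of a <data> tag, or an explicit package/activity pair for exported components without deep links. The class and the flag choices are illustrative, not the exact implementation in Voicify.

```java
import android.content.Context;
import android.content.Intent;
import android.net.Uri;

public final class DirectInvocation {

    // Build an intent from a developer-declared deep link, concatenated from the
    // scheme, host and path prefix found in an intent filter's <data> tag.
    public static Intent fromDeepLink(String scheme, String host, String pathPrefix) {
        Uri uri = Uri.parse(scheme + "://" + host + pathPrefix);
        Intent intent = new Intent(Intent.ACTION_VIEW, uri);
        intent.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK); // needed when starting from a background service
        return intent;
    }

    // Build an explicit intent for an exported activity when no deep link is declared.
    public static Intent fromComponent(String packageName, String activityName) {
        Intent intent = new Intent();
        intent.setClassName(packageName, activityName);
        intent.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK);
        return intent;
    }

    public static void open(Context context, Intent intent) {
        context.startActivity(intent);
    }
}
```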
4.1.2. UI Semantic Retrieval
In Voicify, UI semantic retrieval is an essential module that allows the system to identify the UI elements for interactions. Similar to Kalysch et al. (Kalysch et al., 2018), we utilized the Android Accessibility Service API to gain a system-level semantic understanding of UI elements. The Accessibility API provides information about on-screen elements via AccessibilityNodeInfo objects (Developers, 2022a), which encapsulate the data of UI components.
Voicify retrieves interactive UI elements and classifies whether they are clickable, scrollable or editable. We use the textual data extracted from TextView elements to label interactive nodes, which are then matched with the target from the user’s command. For non-TextView interactive elements, Voicify gains an understanding of them by investigating the effect of UI element grouping (Zhang et al., 2021). Specifically, we obtain an element’s label by searching for an adjacent TextView element that shares the same parent node, inspired by the label searching algorithm in Yang’s previous work (Yang et al., 2020). We also extend the search range to retrieve textual information from child components located inside the interactive element to improve the algorithm’s coverage. Voicify uses spatial information about UI elements to identify adjacent and overlapping elements, which provides contextual information to improve the semantic analysis of the UI.
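A minimal sketch of this label search might look like the following, assuming the node tree exposed by AccessibilityNodeInfo; the exact heuristics and ordering in Voicify may differ.

```java
import android.view.accessibility.AccessibilityNodeInfo;

public final class LabelResolver {

    // Resolve a human-readable label for an interactive node.
    public static String resolveLabel(AccessibilityNodeInfo node) {
        // 1. The node's own text or content description, if any.
        CharSequence own = firstNonEmpty(node.getText(), node.getContentDescription());
        if (own != null) return own.toString();

        // 2. Textual children nested inside the interactive element.
        String fromChildren = searchDescendants(node);
        if (fromChildren != null) return fromChildren;

        // 3. An adjacent sibling (e.g., a TextView sharing the same parent).
        AccessibilityNodeInfo parent = node.getParent();
        if (parent != null) {
            for (int i = 0; i < parent.getChildCount(); i++) {
                AccessibilityNodeInfo sibling = parent.getChild(i);
                if (sibling == null || sibling.equals(node)) continue;
                CharSequence text = firstNonEmpty(sibling.getText(), sibling.getContentDescription());
                if (text != null) return text.toString();
            }
        }
        return null; // unlabeled: fall back to a numbering tooltip
    }

    private static String searchDescendants(AccessibilityNodeInfo node) {
        for (int i = 0; i < node.getChildCount(); i++) {
            AccessibilityNodeInfo child = node.getChild(i);
            if (child == null) continue;
            CharSequence text = firstNonEmpty(child.getText(), child.getContentDescription());
            if (text != null) return text.toString();
            String nested = searchDescendants(child);
            if (nested != null) return nested;
        }
        return null;
    }

    private static CharSequence firstNonEmpty(CharSequence a, CharSequence b) {
        if (a != null && a.length() > 0) return a;
        if (b != null && b.length() > 0) return b;
        return null;
    }
}
```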

On several occasions, an interactive element (e.g., an icon) is not attached to any visible label on the screen, so the user is unable to specify the UI element by mentioning its label. Following the implementation of Voice Access (Hume, 2020), we provide a tooltip labelling system, where a distinct number is dynamically assigned as a temporary label for each unlabeled element. Using the Layout Inflater, we render a numbering tooltip next to the unlabeled UI element by tracking its absolute coordinates on the screen. In this way, users can interact with unlabeled elements on the screen using the given number. For example, in Fig. 4, the top-right icon is assigned the number 1 since it has no visible label, and the user can tap it by saying “tap number 1”. Voicify also caters for users with special visual requirements by allowing runtime customization of the tooltips’ appearance, including size, colour and opacity.
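The tooltip can be rendered as an overlay window anchored at the element’s on-screen bounds. The sketch below is a simplified illustration using an accessibility overlay window; Voicify’s actual layout inflation and styling (size, colour, opacity) are omitted.

```java
import android.accessibilityservice.AccessibilityService;
import android.content.Context;
import android.graphics.PixelFormat;
import android.graphics.Rect;
import android.view.Gravity;
import android.view.WindowManager;
import android.view.accessibility.AccessibilityNodeInfo;
import android.widget.TextView;

public final class TooltipOverlay {

    // Draw a numbered tooltip next to an unlabeled element, anchored at the element's
    // on-screen bounds. Runs inside an AccessibilityService, which is allowed to add
    // TYPE_ACCESSIBILITY_OVERLAY windows.
    public static void showNumberTooltip(AccessibilityService service,
                                         AccessibilityNodeInfo node, int number) {
        Rect bounds = new Rect();
        node.getBoundsInScreen(bounds); // absolute coordinates of the element

        TextView tooltip = new TextView(service);
        tooltip.setText(String.valueOf(number));

        WindowManager.LayoutParams params = new WindowManager.LayoutParams(
                WindowManager.LayoutParams.WRAP_CONTENT,
                WindowManager.LayoutParams.WRAP_CONTENT,
                WindowManager.LayoutParams.TYPE_ACCESSIBILITY_OVERLAY,
                WindowManager.LayoutParams.FLAG_NOT_FOCUSABLE
                        | WindowManager.LayoutParams.FLAG_NOT_TOUCHABLE,
                PixelFormat.TRANSLUCENT);
        params.gravity = Gravity.TOP | Gravity.START;
        params.x = bounds.right; // place the number just to the right of the icon
        params.y = bounds.top;

        WindowManager wm = (WindowManager) service.getSystemService(Context.WINDOW_SERVICE);
        wm.addView(tooltip, params);
    }
}
```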
4.2. Command Parser
To understand user commands, most current voice assistant systems use Natural Language Understanding (NLU) components to convert command utterances into a formal meaning representation (MR) (Louvan and Magnini, 2020; Li et al., 2020b). Similarly, we propose a semantic parser as our NLU component to convert utterances into formal meaning representations, which represent structured actions in our system. The generated MRs are then delivered to the Dialogue Manager (Section 4.3) so that it can understand and fulfil the user’s requirements. In this section, we explain i) how we define the formal meaning representations, ii) the structure of our semantic parser and iii) how we generate training data for learning the semantic parser.

4.2.1. Meaning Representation
Currently, the NLU systems in other voice assistants adopt frame-based meaning representations composed of intents and slots (Gupta et al., 2018; Louvan and Magnini, 2020). However, such representations are known to have difficulty dealing with complex logic such as conjunction and negation in natural language (Cohen, 2020). To avoid the restrictions of frame-based representations, we propose a novel MR language, VoicifyLang. VoicifyLang defines several actions in the user commands, such as tapping, scrolling or entering text into UI elements. VoicifyLang also categorizes the action targets, such as apps, components within apps, buttons on the screen, input text and scrolling directions, into different primitive types.
We designed the language to be flexible enough to deal with more complex logical operations in human language. The language is therefore highly extensible and can support additional features in Voicify. The detailed syntax of our meaning representation is described using the Abstract Syntax Description Language (Wang et al., 1997).
4.2.2. Semantic Parsing
Each MR in VoicifyLang comprises a sequence of tokens. We propose a novel parser, VoicifyParser, to convert each user command into a sequence of MR tokens. The architecture of VoicifyParser is mainly inherited from BERT-LSTM (Xu et al., 2020a), a semantic parser that uses an attention-based sequence-to-sequence neural network (Bahdanau et al., 2015) as its backbone. VoicifyParser adds a copying module and a schema encoding module on top of the sequence-to-sequence backbone to solve our task-specific problems, as illustrated in Fig. 5B. Specifically, the encoder is a pre-trained language model, BERT (Liu et al., 2019; Kenton and Toutanova, 2019), and the decoder is a Long Short-Term Memory network (Hochreiter and Schmidhuber, 1997). The vanilla sequence-to-sequence model only generates tokens stored in a fixed vocabulary. Specifically, the vocabulary for VoicifyLang contains the tokens of actions, targets and some special tokens such as delimiters and placeholders. The actions are pre-defined by VoicifyLang, while the targets, including apps, components and buttons, are extracted using the Data Collection module defined in Section 4.1. However, if the user command includes text to be entered, that text in the generated MR should be copied verbatim from the command utterance, according to the VoicifyLang definition. Thus, VoicifyParser employs a copying mechanism (Gu et al., 2016) that allows the parser to copy text from the user utterance. Another problem is that the buttons on each app page are usually dynamically generated, which means many buttons are out of vocabulary. To solve this problem, VoicifyParser adopts a schema encoding mechanism (Wang et al., 2020). When the input command intends to tap a button, Voicify collects all button names on the screen at runtime and sends them to VoicifyParser. The schema encoding enables VoicifyParser to generate the tokens of a button from the extracted on-screen buttons instead of only from the vocabulary.
4.2.3. Data Synthesis
The training data consists of pairs of user utterances and their aligned VoicifyLang MRs, optionally with a list of button names used to simulate the scenario in which the user wants to tap a button on a screen. Since manual annotation of user utterances is cost-intensive and time-consuming, we adopted a widely used semi-automatic data collection method for semantic parsing, namely the Overnight approach (Wang et al., 2015), as illustrated in Fig. 5A. First, we manually wrote a set of synchronous context-free grammar (SCFG) (Chiang, 2007) rules. Expanding the SCFG rules generates pairs of semantically equivalent canonical utterances and MRs. In the generated dataset, each MR has only one aligned canonical utterance. However, in real-world scenarios, natural language is rich in linguistic variations: users may vary word choices and morphologies for the same actions and targets in the MR, as well as the syntactic structures of utterances with the same MR. Therefore, we paraphrased each canonical utterance into multiple utterances with the same semantic meaning, such that each MR has multiple aligned utterances. The studies in (Wang et al., 2015; Xu et al., 2020b; Shiri et al., 2022) validate that such paraphrases can significantly improve the performance of a semantic parser. We applied automatic paraphrasing methods to reduce the paraphrasing cost, using the best-performing paraphrase method in (Shiri et al., 2022), the commercial online paraphrase service Quillbot (https://quillbot.com/). For each user command whose intention is to tap a button, we randomly sampled a list of buttons from a pre-defined set of buttons as the candidates for tapping. The button names are collected from the Rico dataset (Deka et al., 2017), which contains button names extracted from the UI screens of around 10k Android apps. Thus, the parser learns to generate tokens of out-of-vocabulary buttons. A parser trained on the dataset generated by Overnight is proven to be robust to the lexical and morphological diversity of natural language (Wang et al., 2015).
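To illustrate the flavour of the synchronous expansion, the sketch below pairs canonical-utterance templates with MR templates and expands both over a shared slot. The rules, slot values and MR delimiters here are hypothetical; the real SCFG in Voicify is larger and maintained separately from the Android client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class OvernightSketch {

    // A synchronous rule pairs a canonical-utterance template with an MR template;
    // both share the same placeholder, so expanding them together keeps the pair aligned.
    record Rule(String utteranceTemplate, String mrTemplate, String placeholder) {}

    record Example(String canonicalUtterance, String mr) {}

    public static List<Example> expand(List<Rule> rules, List<String> slotValues) {
        List<Example> examples = new ArrayList<>();
        for (Rule rule : rules) {
            for (String value : slotValues) {
                examples.add(new Example(
                        rule.utteranceTemplate().replace(rule.placeholder(), value),
                        rule.mrTemplate().replace(rule.placeholder(), value)));
            }
        }
        return examples;
    }

    public static void main(String[] args) {
        // Hypothetical rules; the MR format follows the (ACTION, 'target') style shown in the paper.
        List<Rule> rules = Arrays.asList(
                new Rule("tap $BUTTON", "( PRESS , ' $BUTTON ' )", "$BUTTON"),
                new Rule("enter $TEXT", "( ENTER , ' $TEXT ' )", "$TEXT"));
        // Button names would be sampled from the Rico dataset; these are placeholders.
        List<String> slotValues = Arrays.asList("settings", "sign in");

        for (Example e : expand(rules, slotValues)) {
            // Each canonical utterance is then paraphrased automatically (e.g., via Quillbot)
            // to add linguistic variation before training the parser.
            System.out.println(e.canonicalUtterance() + "  ->  " + e.mr());
        }
    }
}
```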
4.3. Dialogue Manager
The Dialogue Manager extracts the action-target pair from the received MR and performs relevance ranking to identify the suitable target from the collected data. Finally, Voicify’s executor performs the action on the user’s device using Android-supported features.
4.3.1. Application Feature & UI Element Matching
The matching between application components and the user-requested component consists of two steps. First, the system searches for the application name from the meaning representation in the installed-application namespace. After identifying the matched application, Voicify searches for the component using the feature name and sends it to the executor. We search for the application name first to shrink the component search space and improve efficiency.
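A simplified version of this two-step matching could look like the following; the data structures and the substring-based lookup are placeholders for Voicify’s actual relevance ranking.

```java
import android.content.Intent;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public final class FeatureMatcher {

    // Hypothetical container for one component extracted in Section 4.1.1.
    public record ComponentEntry(String featureName, Intent intent) {}

    // installedApps maps a spoken app name (e.g. "youtube") to its package name;
    // appFeatures maps a package name to the component intents extracted from its manifest.
    private final Map<String, String> installedApps;
    private final Map<String, List<ComponentEntry>> appFeatures;

    public FeatureMatcher(Map<String, String> installedApps,
                          Map<String, List<ComponentEntry>> appFeatures) {
        this.installedApps = installedApps;
        this.appFeatures = appFeatures;
    }

    // Step 1: resolve the app; Step 2: search only that app's components.
    public Optional<Intent> match(String appName, String featureName) {
        String packageName = installedApps.get(appName.toLowerCase(Locale.ROOT));
        if (packageName == null) return Optional.empty();

        return appFeatures.getOrDefault(packageName, List.of()).stream()
                .filter(c -> c.featureName().toLowerCase(Locale.ROOT)
                        .contains(featureName.toLowerCase(Locale.ROOT)))
                .findFirst()
                .map(ComponentEntry::intent);
    }
}
```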
Since UI elements change dynamically as users interact with the phone, Voicify must continuously capture and analyse new UI elements. Voicify subscribes to typeWindowContentChanged accessibility events (Developers, 2022b) to be notified when the user’s screen changes. The tool maintains separate dictionaries of labelled and unlabelled on-screen elements, which are updated using the data in the AccessibilityNodeInfo objects delivered by the Android OS, as described in Section 4.1. Upon receiving a meaning representation, the matching module uses the attached UI element to retrieve the corresponding node object and delivers it to the executor to perform the action.
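A skeleton of this subscription inside an accessibility service is sketched below (the service configuration must also declare the typeWindowContentChanged event type). The dictionary structures are simplified and reuse the LabelResolver sketch from Section 4.1.2; the actual bookkeeping in Voicify is more involved.

```java
import android.accessibilityservice.AccessibilityService;
import android.view.accessibility.AccessibilityEvent;
import android.view.accessibility.AccessibilityNodeInfo;
import java.util.ArrayList;
import java.util.List;

public class VoicifyAccessibilityService extends AccessibilityService {

    // Dictionaries of on-screen elements, rebuilt whenever the window content changes.
    private final List<AccessibilityNodeInfo> labelledElements = new ArrayList<>();
    private final List<AccessibilityNodeInfo> unlabelledElements = new ArrayList<>();

    @Override
    public void onAccessibilityEvent(AccessibilityEvent event) {
        if (event.getEventType() == AccessibilityEvent.TYPE_WINDOW_CONTENT_CHANGED) {
            AccessibilityNodeInfo root = getRootInActiveWindow();
            if (root == null) return;
            labelledElements.clear();
            unlabelledElements.clear();
            collectInteractive(root);
        }
    }

    // Walk the node tree and split interactive elements by whether a label was found.
    private void collectInteractive(AccessibilityNodeInfo node) {
        if (node.isClickable() || node.isScrollable() || node.isEditable()) {
            String label = LabelResolver.resolveLabel(node); // from the earlier sketch
            if (label != null) labelledElements.add(node);
            else unlabelledElements.add(node); // will receive a numbering tooltip
        }
        for (int i = 0; i < node.getChildCount(); i++) {
            AccessibilityNodeInfo child = node.getChild(i);
            if (child != null) collectInteractive(child);
        }
    }

    @Override
    public void onInterrupt() { }
}
```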
4.3.2. Executor
The Android Accessibility API provides a wide range of system control methods that allow accessibility service developers to have full control over the user’s device. Voicify uses the performAction() method (Developers, 2022a) to perform actions such as tapping, scrolling or entering text into UI elements. To directly open in-app components, the executor sends the constructed intent messaging object. When none of the collected intents matches the user’s request, we open the application’s entry screen as a fallback.
We implemented a queue of action-target pairs to manage the sequential order of input actions. The executor automatically pops the next action from the queue once the preceding action has been successfully invoked, which enables command chaining and improves usage efficiency. We implemented an automated validation mechanism that verifies whether an action has been executed by monitoring changes in the UI node metadata. Lastly, Voicify sends audio feedback to users as confirmation that the system is ready for the next command.
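The executor logic can be sketched as follows; ParsedAction and the action names are hypothetical stand-ins for the structured actions produced by the parser, and the queue advances only after the validation step described above.

```java
import android.os.Bundle;
import android.view.accessibility.AccessibilityNodeInfo;
import java.util.ArrayDeque;
import java.util.Deque;

public final class ActionExecutor {

    // Hypothetical container for one action-target pair produced by the parser.
    public record ParsedAction(String action, AccessibilityNodeInfo target, String text) {}

    private final Deque<ParsedAction> queue = new ArrayDeque<>();

    public void enqueue(ParsedAction action) {
        queue.add(action);
    }

    // Called once the preceding action has been validated against the UI node metadata.
    public void executeNext() {
        ParsedAction next = queue.poll();
        if (next == null) return; // nothing left: give audio feedback and wait for the next command

        switch (next.action()) {
            case "PRESS" -> next.target().performAction(AccessibilityNodeInfo.ACTION_CLICK);
            case "SWIPE" -> next.target().performAction(AccessibilityNodeInfo.ACTION_SCROLL_FORWARD);
            case "ENTER" -> {
                Bundle args = new Bundle();
                args.putCharSequence(
                        AccessibilityNodeInfo.ACTION_ARGUMENT_SET_TEXT_CHARSEQUENCE, next.text());
                next.target().performAction(AccessibilityNodeInfo.ACTION_SET_TEXT, args);
            }
            default -> { /* unknown action: ignore or fall back to opening the app's entry screen */ }
        }
    }
}
```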
4.4. Implementation
The Android client (Section 4.1 and Section 4.3) is implemented in Java using Android Studio. We utilized the native Google speech recognizer (Developers, 2022d) to transcribe the user’s speech into text and the Google Text-to-Speech engine (developers, 2022) to provide audio feedback to users. We used the Android Accessibility API (Developers, 2022b) to implement a background service that retrieves on-screen UI elements and performs actions on the screen. We implemented Python scripts to perform data collection and generation for training the parser. The back-end server that hosts the semantic parser for analysing the user’s command is developed using the Flask framework (Grinberg, 2018) (Section 4.2). The communication between the back-end server and the Android client is handled via HTTP requests.
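For completeness, the client-server exchange might look roughly like the sketch below; the endpoint path, port and JSON field names are assumptions for illustration rather than the actual Voicify API.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.json.JSONArray;
import org.json.JSONObject;

public final class ParserClient {

    // Illustrative endpoint; 10.0.2.2 reaches the development machine from an Android emulator.
    private static final String PARSE_URL = "http://10.0.2.2:5000/parse";

    // Send the transcribed command plus the current on-screen button names,
    // and return the meaning representation produced by the server-side parser.
    public static String parse(String command, List<String> onScreenButtons) throws Exception {
        JSONObject payload = new JSONObject()
                .put("utterance", command)
                .put("buttons", new JSONArray(onScreenButtons));

        HttpURLConnection conn = (HttpURLConnection) new URL(PARSE_URL).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.toString().getBytes(StandardCharsets.UTF_8));
        }

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) response.append(line);
            return response.toString();
        } finally {
            conn.disconnect();
        }
    }
}
```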
5. Technical Evaluation
In this section, we validate the performance of two novel components of the system, namely i) the command parser and ii) the application feature retrieval module. We first measure the accuracy of generating meaning representations for natural human commands. We then perform a quantitative analysis on a set of application features in comparison with Google Assistant as the baseline.
5.1. Command Parser Evaluation
5.1.1. Experiment Setup & Metric
We evaluate the parser with respect to its ability to convert user commands into the correct meaning representations. The parser is first trained on a synthetically generated dataset curated with the Overnight approach. The parser then converts the commands in a manually collected test set into MRs, and we compare the parser-generated MRs with the ground truth MRs to calculate the parser’s performance on three metrics.
For the test set, we first crawled user commands from the AndroidHowTo dataset (Li et al., 2020a), which contains step-by-step commands for achieving different tasks in Android. We then filtered out irrelevant data entries, such as commands for interacting with other operating systems, to obtain 101 clear commands for the test set. Each test case is a single command to perform a step, such as “Tap Clear browsing data” or “Go to the Profiles tab”. The parser requires a list of current on-screen elements for each tapping interaction. Therefore, we extracted the list of clickable elements for each tapping command from the Rico dataset (Deka et al., 2017), which contains the metadata to identify clickable elements on each screen of 10k Android apps. Finally, the ground truth MR of each test case was manually annotated by one student and validated by another to ensure annotation correctness. Such a test set includes rich linguistic variations: based on our manual calculation, each MR has an average of 1.2 aligned natural language commands, and the actions OPEN, PRESS, SWIPE and ENTER are described by 4, 6, 2 and 2 different phrases, respectively. Evaluating our parser on this test set validates its robustness to rich linguistic variations.
To evaluate the performance of the parsers, we adopt three evaluation metrics: Exact Match Accuracy (EM Accuracy) (Dong and Lapata, 2018; Li et al., 2021), Target F1 and Action F1. Exact Match Accuracy is one of the most commonly used metrics in semantic parsing, computing the percentage of commands that are correctly mapped to their ground truth meaning representations. To further understand how well the parsers predict the correct actions and targets, we also compute the F1 scores of the targets and actions. Target F1 and Action F1 report the average F1 score over all targets and actions, respectively, weighted by their corresponding frequencies in the ground truths.
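For reference, the frequency-weighted averaging follows the standard definition below (notation ours, not taken from the paper):

\[
\mathrm{F1}_{\text{weighted}} = \sum_{c} \frac{n_c}{N}\,\mathrm{F1}_c, \qquad \mathrm{F1}_c = \frac{2\,P_c R_c}{P_c + R_c},
\]

where c ranges over the distinct targets (or actions), n_c is the frequency of c in the ground truth MRs, N = \sum_c n_c, and P_c and R_c are the parser's precision and recall for c.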
We compared against two baselines, a vanilla Seq2Seq model (Bahdanau et al., 2015) and BERT-LSTM (Xu et al., 2020b). Vanilla Seq2Seq consists of an LSTM encoder and an LSTM decoder. BERT-LSTM replaces the LSTM encoder of Seq2Seq with a pre-trained language model, BERT (Kenton and Toutanova, 2019), and adds a copying module to the LSTM decoder. As our model is specifically designed to address out-of-vocabulary on-screen buttons, we also evaluated the performance of BERT-LSTM and VoicifyParser when none of the buttons are stored in the pre-defined vocabulary list. In this setting, the parser can only access button names from the set of on-screen buttons extracted at runtime.
5.1.2. Evaluation Result
As shown in Table 1, our parser, VoicifyParser, achieves an EM accuracy above 90% on the test set with rich linguistic variations. VoicifyParser also consistently outperforms the other two baselines in generating complete MRs as well as the individual actions and targets in the MRs. Compared with the vanilla Seq2Seq, the accuracy of VoicifyParser is nearly 40 percentage points higher. Due to the lack of a copying mechanism, Seq2Seq cannot correctly parse most text-input commands. For example, “enter ‘4-digit PIN code’” is incorrectly parsed into ( ENTER , ‘ 4-digit PIN ’ ). In our experiment, it fails on all such commands, which account for 5% of the test set. Vanilla Seq2Seq employs no pre-trained language model, so it is not robust to the rich linguistic variations in the user commands. As a result, it performs significantly worse than BERT-LSTM and VoicifyParser in all three aspects. BERT-LSTM and VoicifyParser have comparable performance in terms of Action F1. However, unlike our model, BERT-LSTM lacks a schema encoding mechanism, so it cannot utilize runtime on-screen information. Even with all the buttons stored in the pre-defined vocabulary list, the Target F1 of BERT-LSTM is still 7 points lower than that of VoicifyParser, indicating that on-screen information greatly boosts the parser’s ability to predict targets when the parser employs schema encoding. With all the button elements removed from the vocabulary, the performance of BERT-LSTM drops significantly: BERT-LSTM (OOV) performs even worse than the Seq2Seq parser on all three metrics. Surprisingly, removing the button targets also degrades the action prediction performance of BERT-LSTM (OOV). For our VoicifyParser (OOV), the influence of removing button names is negligible, showing that schema encoding improves the robustness of our model to out-of-vocabulary buttons in the test commands.
Table 1. Performance (%) of the parsers on the test set.

Parsers | EM Accuracy | Target F1 | Action F1
---|---|---|---
Seq2Seq | 52.48 | 51.33 | 83.17
BERT-LSTM | 80.20 | 85.71 | 96.55
BERT-LSTM (OOV) | 29.70 | 35.41 | 60.27
VoicifyParser | 91.09 | 93.05 | 97.03
VoicifyParser (OOV) | 89.11 | 90.76 | 98.02
5.2. Application Feature Coverage

5.2.1. Experimental Setup & Metric
In this experiment, we validate Voicify’s ability to directly open the features of installed applications. We consider an in-app feature to be a specific screen that serves a particular functionality to users. Given a set of features from various installed apps, we measure application feature coverage by manually opening each feature via voice command and recording the success rate. Since Voice Access does not support opening features of other apps using voice commands, we used Google Assistant (Assistant, 2022) as the baseline. Using the result of a formative study that investigated common user tasks on smartphones (Arsan et al., 2021), we identified the 10 most common application categories to be evaluated in this experiment. For each category, we randomly picked 3 applications from the Google Play Store’s top suggestions. All chosen applications are popular, ranging from 1 million to more than 1 billion downloads. For each application, we identified the set of provided features that the application developer stated in the Play Store listing description. The number of features per app varies from 2 to 5 depending on the app’s complexity. For example, applications in the Clock & Alarm category are simple, only allowing users to add a new alarm and view the list of added alarms, while applications in the Post a Picture category, such as Instagram, offer many more features. In the end, we collected 117 test cases from 30 apps, each of which is an important feature stated by the developer when publishing the app on the Play Store. A full list of the apps and features can be found in the GitHub repository (https://github.com/vuminhduc796/Voicify).
For each test case, we first record the ground truth as the screen that provides the requested functionality to users. We use the success rate as the primary metric: a test case is marked as successful if the tool can directly open the desired screen, and as failed otherwise. We compare the success rates of Voicify and Google Assistant to validate the performance of the direct invocation module.
5.2.2. Results
Fig. 6 compares the feature coverage by category between Voicify and Google Assistant. Overall, Voicify successfully located and invoked 90 application features out of 117 test cases (76.9%), compared to 55 features (47.0%) for Google Assistant. Both tools achieved high performance on popular applications such as Yahoo Mail, SoundCloud and Twitter, with over 80% success rates. However, the feature coverage of Google Assistant dropped dramatically for less well-known apps because those applications are missing the declarations their developers must provide to integrate with Google Assistant. Since our approach does not require any developer effort to integrate an application with Voicify, we achieved solid performance across a wider range of apps.
We performed an error analysis of the failed test cases to understand the main problems that affect Voicify’s feature coverage. First, as mentioned before, some important features are encapsulated inside the application and do not allow direct access from other applications. Hence, both Voicify and the baseline perform poorly on money transfer and banking apps, as those applications may have enhanced protection that blocks direct access to certain features to prevent malicious attacks. Second, some application screens require data passed from the previous screen. This data is often stored in a bundle attached to the intent used to open the screen. For example, when opening the video player in YouTube, the attached data must specify which video to play using a video ID. Voicify is unable to identify and populate the required data, hence it fails to directly open those features. Third, some applications implement a centralized activity as the entry point for all external invocations through intents. This implementation limits the list of features that can be captured by Voicify, impacting its feature exploration.
6. User Study
To demonstrate the usefulness of our tool in practice, we conducted a user study to evaluate the Voicify system as a whole in real-world scenarios. As a baseline, we compare Voicify with Voice Access, since it is currently the best voice control tool available on Android (Yamada, 2020). The goals of the study are to i) benchmark user performance with Voicify compared to the baseline, ii) compare user feedback on the cognitive load and usability of Voicify against the baseline and iii) collect qualitative feedback to suggest future improvements to Voicify. We record the time taken to finish each task using Voicify and Voice Access. In addition, we conducted a post-experiment interview with each user to collect both quantitative and qualitative feedback.
Table 2. Experimental tasks and the number of steps required to complete each using voice commands.

No | Task | Application | #Steps
---|---|---|---
Task 1 | Check for a saved cooking recipe. | World Cuisines | 6
Task 2 | Check for cooking steak instructions and set a timer. | Steak Timer | 8
Task 3 | Convert the mass of ingredients from teaspoon to tablespoon. | Unit Converter Ultimate | 10
Task 4 | Add a grocery shopping note. | Fast Notepad | 12
6.1. Tasks
Based on the example scenario of having both hands busy when cooking, as described in Section 3, we created 4 relevant tasks that users would perform on mobile devices while cooking. The tasks covered the most common on-screen interactions, including tapping, swiping and entering text. We sorted the tasks by the number of steps required to achieve them via voice commands, so the difficulty level increases from task 1 to task 4. The tasks are listed in Table 2. To introduce each task to participants, we provided written step-by-step tutorials on how to achieve it. We also recorded a walk-through video for each app to demonstrate the tasks using physical touches and the expected outcome.
6.2. Participants
We recruited 8 participants (6 males and 2 females) who speak English at a proficient level. All participants are familiar with technological devices, as they all use a smartphone regularly. Although participants had prior exposure to virtual assistants such as Siri or Google Assistant, none of them were familiar with using assistive tools to control their smartphones with voice commands; in particular, none had used either of the experimental tools. We selected this set of participants because the study also measures the learnability of the experimental tools. Each participant was awarded an A$50 gift card for their participation.
6.3. Procedure
We invited each participant to an individual face-to-face session for the user evaluation. We set up a single experimental device, a Google Pixel 5 (Android 11), for all experiments, since some participants did not have an Android device. In addition, some applications were not installed on participants’ devices or required a registered account to proceed. Lastly, different devices might have different response rates, which could affect the correctness of the results.
First, we gave participants a basic understanding of Android OS and introduced Google Voice Access and Voicify. We recorded demo videos on how to achieve tasks with Voicify. Since Voice Access includes a built-in tutorial for new users, we provided that material to the participants. After that, we trained participants to achieve some basic tasks using both applications and allowed them to practise with each tool for 5 minutes. We also introduced the applications that would be used for the evaluation. Since most participants were unfamiliar with the tasks and the experimental applications, we guided them through the steps of each task and let them complete it with physical interactions to memorise it. In the end, each participant confirmed having a comparable level of understanding of Android OS, both voice control applications and the experimental tasks.
After the training, we asked each participant to complete 4 different tasks with no intervention from the experimenters. Participants used Voicify to complete 2 tasks and the baseline tool to complete the other 2 tasks. They were not aware of which tool was developed by us. The order of the tasks and tools was rotated for each participant in a counter-balanced manner (DePuy and Berger, 2014) to avoid potential biases. For example, P1 first completed tasks 1 and 4 using Voicify and then completed tasks 2 and 3 using Voice Access, while P5 completed tasks 1 and 2 using Voice Access before completing tasks 3 and 4 using Voicify. For each given step, we applied a cut-off time of 60 seconds if the participant could not figure out how to complete the step using voice commands.
We recorded the time taken to complete each task, including the cut-off time, to perform the quantitative analysis. We collected 32 data entries, since each of the 8 participants finished 4 tasks. At the end, we evaluated the usability of Voicify compared to Google Voice Access using the System Usability Scale (SUS) (Brooke, 1996) form with a 5-point Likert scale. In addition, we assessed the cognitive load of using each tool with the NASA-TLX (Hart, 2006) form on a 7-point Likert scale. Lastly, we collected qualitative feedback on what participants liked most about Voicify and what might improve the system.
6.4. Result

6.4.1. Overall User Performance
We performed a quantitative analysis of the time taken (in seconds) to complete each task, as shown in Fig. 7(A). In general, as the number of steps increases, participants required more time to complete the task. The average time taken to complete the tasks with Voicify was 93.2 seconds, compared to 140 seconds using Voice Access, a 33.4% efficiency gain. We observed a significant disparity in recorded time due to (i) the use of direct feature invocation in task 1 with Voicify and (ii) the complexity of inputting text and tapping unlabelled icons with Voice Access in task 4. Specifically, participants directly opened the list of saved recipes in task 1 using Voicify, skipping 3 steps as described in Fig. 7(B) and resulting in a shorter time. For task 4, participants were required to input text into several text boxes, which caused issues in selecting the text box and typing the text. In addition, some steps in task 4 required users to tap an unlabelled icon (e.g., the star icon to add a note to the favourites list); users therefore made several unsuccessful attempts to guess the corresponding label of the icon. Using the grid-based tapping in Voice Access proved inefficient, as it required extra commands from users to show/hide the grid and change its granularity. Voicify solved these problems by attaching a numbering tooltip next to the unlabelled icon, allowing users to promptly perform interactions. By observing the experiment, we recognized issues in transcribing human speech into text. Voicify did better at post-processing the raw user command, using predefined heuristics to resolve common transcription errors. However, we noticed that Voicify requires slightly more time to process each command because of the latency of the API communication with the back-end server. This latency caused a short period of unresponsiveness after recording the user command, adversely affecting performance and the user experience. The problem could be mitigated in the future by deploying a mobile version of the model locally on user devices.

6.4.2. Cognitive Load & Usability Ratings
Fig. 8(A) summarizes the participants’ feedback on their level of cognitive load for each system using the NASA-TLX form. With a lower level of effort, participants achieved a significant improvement in performance using Voicify (t = -2.61, p = 0.035) as a result of using direct invocation. In addition, the command parser was precise in mapping faulty commands to correct actions, helping users easily interact with the system. Participants experienced less frustration using Voicify and confirmed that Voicify required less mental demand compared to Voice Access. The result demonstrates a significant improvement by Voicify, fulfilling the design goal of reducing the cognitive load required to operate the tool.
We compared the quantitative feedback of participants, comprising 10 design and usability questions on a 5-point Likert scale, and applied a paired t-test to the results, as shown in Fig. 8(B). The results validated that we improved the usability of the voice control system: the average SUS score for Voicify is 72.813, while Voice Access received 54.688. Voicify was rated as better integrated than Voice Access (t = 2.65, p = 0.033). Due to the intuitiveness of the tooltip labelling system and direct invocation, Voicify received significantly better ratings across multiple indicators. The system was confirmed to be less cumbersome to use (t = -2.45, p = 0.041) and to require less learning (t = -2.58, p = 0.036) compared to the baseline. Participants also confirmed an improvement in ease of operation (t = 1.99, p = 0.041), since tooltip selection was more convenient than Voice Access’s grid-based selection for tapping unlabelled icons.
6.4.3. Qualitative Feedback
In this section, we collate the qualitative feedback from participants after experimenting with Voicify and Voice Access. Overall, the participants were satisfied with the tool and provided suggestions for further improvements.
Innovative method of interacting with mobile devices. Participants who had never used a voice control system to perform a sequence of daily-life actions were eager to explore it further. P5 expressed that “I really enjoy using the voice to control the phone, it is a new concept to me”. Moreover, tech-savvy participants were impressed by the capability of voice assistant technology, as evident in P6’s comment that Voicify “can do most of the things that I need” and “understand what I want to say”. In normal circumstances, participants still prefer traditional tapping interaction over voice commands despite their positive experience with Voicify. We conclude that even though voice command technology is excitingly novel and capable, the barrier to its wide adoption is the intricate process of adapting human behaviour.
Usage of direct invocation. Voicify’s novel direct invocation feature was exercised in the experiment and suggested as an option to carry out a sequence of activities, which garnered positive feedback from P1 and P7. Participants mentioned that the direct invocation feature is a great advance for voice control systems, as it allows users to “quickly access” the desired screen in a particular application, alleviating the complexity of the task by reducing the number of steps and the wait time. Therefore, future improvements will include expanding the set of in-app components that can be opened directly from the main screen using Voicify.
Numbering tooltips as an effective solution. Voicify’s tooltip labelling system received overall positive feedback. P2 mentioned that the best thing about Voicify was “the ability to identify icons in an app in numbers whereas in Voice Access one may need to use grid selections”, and P4 appreciated the bespoke ability to tap on “labelling icons for which you may not know the name/label using numbers”. P3, P5 and P7 agreed that numeric labelling helped them quickly interact with different icons on the screen without having to know the icons’ names, which is convenient and stress-free. In contrast, participants noted the demerits of Voice Access’s grid-based selection. P1 and P3 described the grid-based selection as “difficult” and “imprecise”. The results show the inefficacy of grid-based selection in Voice Access and user preference for Voicify’s tooltip selection.
UI design improvement & system feedback. Participants acknowledged the great UI design and responsiveness of Voice Access while giving suggestions for Voicify to deliver a better user experience. P3 mentioned Voice Access’s “impressive capability to interpret the speech on the go” and P7 expressed great interest in the real-time “closed-caption” that Voice Access generated. For Voicify, P2 and P6 mentioned that the toggle button to start the tool and the system status indicator blocked certain UI components. To tackle this issue, possible solutions include moving overlaying components onto the notification bar or providing the ability to show or hide Voicify’s UI components at runtime using a predefined voice command. P7 suggested displaying a “closed-caption as we talk to give a sense of immediate feedback to the user”, which would improve the responsiveness of Voicify.
7. Implications and future works
Voicify’s capability and a seamless user experience are among our priorities for further development, which will propel mainstream adoption of voice interaction across multiple ubiquitous computing devices. In this section, we discuss the implications and propose further improvements to the current system.
7.1. Interacting with Unlabelled Icons and Images
Our study suggests the applicability of tooltips for improving the efficiency and accuracy of voice interaction on multiple devices, such as smart cars and smart TVs. Tooltips helped overcome shortcomings of extant voice-based selection methods, especially when it is critical to interact with low mental demand. From the experimental results in Section 6.4, tooltip labelling significantly improved user performance, as it was more precise than grid-based selection and caused less visual occlusion. Using Voicify’s tooltip system, users were able to directly mention the number without needing to search for the matching tile in the grid selection system. As a result, tooltips helped reduce the cognitive effort required to operate the system, as measured using the NASA-TLX form. During the experiment, when grid-based selection was applied to screens with clustered elements, multiple elements fell within one tile due to the low granularity of the grid, causing unwanted interactions. Participants expressed frustration when misclicking an element, as it caused extra navigation steps to finish the task. In addition, the tooltip system helped improve the learnability of the system, as fewer steps and a simpler command syntax are needed to perform the same interaction.
7.2. Mapping User Commands
In this work, we proposed an advanced deep-learning parser that interprets human commands to produce structured actions. The parser is designed to understand nuance in human language and suggest the closest match for a user query. The results in Section 5 show that VoicifyParser outperformed other advanced parsers thanks to its ability to handle synonymous words and out-of-vocabulary labels. The ability to interpret and record new out-of-vocabulary words allows the vocabulary to grow at runtime, in turn allowing the parser to work seamlessly in real-world scenarios. The user study further demonstrated the effectiveness of our parser, which improved user performance and reduced the cognitive load required to operate the tool, as shown in Section 6.4.
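For illustration only, the kind of structured action the dialogue manager consumes could be modelled as follows; the type and field names are hypothetical and do not reproduce VoicifyParser's actual MR grammar.

```kotlin
// Hypothetical modelling of a parsed structured action; the real MR grammar
// used by VoicifyParser may differ.
sealed class VoiceAction {
    data class Press(val targetLabel: String) : VoiceAction()
    data class Enter(val text: String, val fieldLabel: String) : VoiceAction()
    data class Open(val appName: String, val feature: String? = null) : VoiceAction()
    data class Scroll(val direction: String) : VoiceAction()
}

// e.g. the command "press on the send button" might be parsed into:
val example: VoiceAction = VoiceAction.Press(targetLabel = "send button")
```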
Although VoicifyParser is currently fine-tuned to interpret the fixed set of functionalities offered by Voicify, the parser is designed for high extensibility and usability. By providing additional training data for the transfer learning process (Ezen-Can, 2020), the model can be extended to cater for other meaning extraction problems with minimal effort, such as controlling household appliances and smart vehicles. We hope to make the latest NLP technology more accessible and applicable for further research in the human-computer interaction domain.
7.3. Improvement for Feature Shortcuts
The findings highlight the importance of providing shortcuts for users to achieve certain tasks, as performance is a top priority in determining user experience. In-app shortcuts helped users complete Task 1 in Section 6 in only one-third of the time taken without direct invocation, and received very positive feedback from participants. Not only do users who are unable to perform physical touches on the screen benefit from this feature, but users who can physically control the smartphone with their hands may also use it to accelerate their tasks.
From the experiment, we observed that explorability and learnability are key factors affecting the usability of feature shortcuts. When using Voicify to achieve Task 1, which contains an in-app shortcut to view all saved recipes, one participant did not use the shortcut because she did not know whether the shortcut was available or the feature name attached to it. We acknowledge these explorability and learnability issues and propose a better recommendation system, with additional GUIs, to introduce available shortcuts to users.
On the other hand, we observed that most unexplored app features were encapsulated within fragments. The feature retrieval module in Voicify is currently based on the content of the manifest file, in which all activities must be declared. This requirement does not apply to fragments, and hence features located within fragments remain undiscovered. According to Li et al., more than 50% of the most popular apps use Fragments as a basic building block (Li et al., 2017). Therefore, full retrieval and invocation of fragments is critical to improving Voicify's exploration of in-app components. Further studies on reconstructing fragments and fragment navigation within an activity are proposed to improve the granularity of feature exploration and provide forward compatibility with newer app versions.
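A minimal sketch of manifest-based retrieval, under the assumption that the module relies on the activity declarations exposed by PackageManager (not Voicify's exact code); fragments never appear in this list, which is why fragment-hosted features stay hidden.

```kotlin
import android.content.Context
import android.content.pm.PackageManager

// List the activities an installed app declares in its manifest. Fragments are
// not declared here, so features hosted inside fragments cannot be discovered
// this way.
fun listDeclaredActivities(context: Context, packageName: String): List<String> {
    val info = context.packageManager.getPackageInfo(
        packageName,
        PackageManager.GET_ACTIVITIES
    )
    return info.activities?.map { it.name } ?: emptyList()
}
```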
7.4. Limitations & Future Work
While invoking developer-defined deep links significantly improved the coverage of our direct invocation module, the module depends on developers' implementations of deep links. Therefore, broken deep links provided by developers may also reduce the coverage of Voicify's direct invocation module. By default, we show users the app's launcher page if a deep link is broken. In addition, as developers may modify deep links through app updates, the availability of deep links can change over time. Voicify mitigates this problem by automatically updating the list of available deep links for installed apps: deprecated deep links are removed from the database and newly introduced deep links are added. We also provide a GUI component within the Voicify application that displays the available deep links for each application, to inform users about their availability. Lastly, as our tool depends on the integrated deep links, apps with fewer deep links will achieve lower feature coverage.
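A minimal sketch of the fallback behaviour described above, assuming the deep link URI comes from Voicify's database (illustrative, not the tool's exact implementation):

```kotlin
import android.content.Context
import android.content.Intent
import android.net.Uri

// Try to open a deep link in the target app; if no activity can handle it
// (e.g. the link is broken or was removed in an update), fall back to the
// app's launcher page.
fun openDeepLinkOrLauncher(context: Context, packageName: String, deepLink: String) {
    val intent = Intent(Intent.ACTION_VIEW, Uri.parse(deepLink)).apply {
        setPackage(packageName)                  // resolve only within the target app
        addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)  // needed when starting from a service context
    }
    if (intent.resolveActivity(context.packageManager) != null) {
        context.startActivity(intent)
    } else {
        context.packageManager.getLaunchIntentForPackage(packageName)
            ?.let { context.startActivity(it) }
    }
}
```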
Another limitation to consider is that our selection of participants for the user evaluation may have introduced validity threats. We chose participants who were not familiar with controlling their smartphones through voice commands in order to minimise bias and assess the ease of learning our system. We therefore acknowledge that we did not evaluate the tools with experienced users, who might offer other perspectives on each tool's performance. However, we mitigated this issue by designing the experiment to suit users with little experience. Firstly, we trained participants on both tools before conducting any experiments, which improved their familiarity with voice commands. Secondly, the experimental tasks did not require an in-depth understanding of voice assistants, and participants could fully complete the tasks using the commands they were trained with. Thirdly, while each user's level of experience may affect the total time taken to finish a task, it does not affect the comparison between the tools, because each participant used both tools.
We propose several directions in which Voicify can be improved. Firstly, the current system requires users to maintain visual contact with the touchscreen, which is burdensome when the mobile device is out of sight. We therefore propose future work to integrate with screen readers such as TalkBack in Android OS to cater for non-visual usage. Secondly, while Voicify presents an exciting opportunity to cater for motor impairments, we have not yet investigated its impact on users with physical constraints. With further empirical research on voice assistive platforms for impaired users, incremental improvements could be built on top of Voicify to benefit users with different types of disabilities.
8. Conclusion
In this paper, we present Voicify, a system that enhances the use of voice commands on Android devices. We proposed a novel parser that generates structured actions (also known as MRs) from human commands to interact with the smartphone. The dialogue manager uses these actions to perform matching against the collected data and executes them on the device. Voicify demonstrates a novel approach to utilising on-device data, including application code bases and on-screen UI semantics; as a result, the computation workload is lightweight and can be executed locally. In our experiments, the natural language parser achieves outstanding accuracy on the human command dataset compared with multiple baselines. Our experiments also show that Voicify has an effective direct invocation module with high coverage of application features, without requiring extra developer effort to comply. We further conducted a user evaluation that indicates the high usability of the system in real-world tasks, fulfilling all design implications. Lastly, since Voicify is an open-source project, it lays the groundwork for future work on interpreting user verbal input, contributing to further improvements in human-computer interaction.
References
- voi (2022) 2022. Voice Access – Apps on Google Play. https://play.google.com/store/apps/details?id=com.google.android.apps.accessibility.voiceaccess&hl=en_AU&gl=US
- Access (2022) Voice Access. 2022. Troubleshoot Voice Access. https://support.google.com/accessibility/android/answer/6377053?hl=en#:~:text=If%20you%20have%20trouble%20starting,Access%20from%20the%20lock%20screen.
- Alhanahnah et al. (2020) Mohannad Alhanahnah, Qiben Yan, Hamid Bagheri, Hao Zhou, Yutaka Tsutano, Witawas Srisa-An, and Xiapu Luo. 2020. Dina: Detecting hidden android inter-app communication in dynamic loaded code. IEEE Transactions on Information Forensics and Security 15 (2020), 2782–2797.
- Application (2022) Just An Application. 2022. Android Binary XML. https://justanapplication.wordpress.com/category/android/android-binary-xml/
- Arsan et al. (2021) Deniz Arsan, Ali Zaidi, Aravind Sagar, and Ranjitha Kumar. 2021. App-Based Task Shortcuts for Virtual Assistants. In The 34th Annual ACM Symposium on User Interface Software and Technology. 1089–1099.
- Assistant (2022) Google Assistant. 2022. Assistant. https://assistant.google.com/intl/en_au/platforms/phones/
- Azim et al. (2016) Tanzirul Azim, Oriana Riva, and Suman Nath. 2016. uLink. Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (2016).
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.
- Bai (2022) Lijun Bai. 2022. Research on voice control technology for smart home system. In Proceedings of the Asia Conference on Electrical, Power and Computer Engineering. 1–7.
- Bhalerao et al. (2017) Aditi Bhalerao, Samira Bhilare, Anagha Bondade, and Monal Shingade. 2017. Smart Voice Assistant: a universal voice control solution for non-visual access to the Android operating system. Int. Res. J. Eng. Technol 4, 2 (2017).
- Brooke (1996) John Brooke. 1996. SUS: A “quick and dirty” usability scale. Usability evaluation in industry 189, 3 (1996).
- Chiang (2007) David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics 33, 2 (2007), 201–228.
- Cohen (2020) Philip R Cohen. 2020. Back to the future for dialogue research. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 13514–13519.
- Corbett and Weber (2016) Eric Corbett and Astrid Weber. 2016. What can I say? Proceedings of the 18th International Conference on Human-Computer Interaction with Mobile Devices and Services (2016).
- Deka et al. (2017) Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology. 845–854.
- DePuy and Berger (2014) Venita DePuy and Vance W Berger. 2014. Counterbalancing. Wiley StatsRef: Statistics Reference Online (2014).
- Developers (2022a) Android Developers. 2022a. AccessibilityNodeInfo. https://developer.android.com/reference/android/view/accessibility/AccessibilityNodeInfo.AccessibilityAction
- Developers (2022b) Android Developers. 2022b. AccessibilityService. https://developer.android.com/guide/topics/ui/accessibility/service
- Developers (2022c) Android Developers. 2022c. Deep Linking. https://developer.android.com/training/app-links/deep-linking
- Developers (2022d) Android Developers. 2022d. SpeechRecognizer. https://developer.android.com/reference/android/speech/SpeechRecognizer
- developers (2022) Android developers. 2022. Text To Speech. https://developer.android.com/reference/android/speech/tts/TextToSpeech
- Dong and Lapata (2018) Li Dong and Mirella Lapata. 2018. Coarse-to-Fine Decoding for Neural Semantic Parsing. In 56th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 731–742.
- Ezen-Can (2020) Aysu Ezen-Can. 2020. A Comparison of LSTM and BERT for Small Corpus. arXiv preprint arXiv:2009.05451 (2020).
- Foster and Foxcroft (2011) Greg Foster and Terence Foxcroft. 2011. Barrel menu: a new mobile phone menu for feature rich devices. In Proceedings of the South African Institute of Computer Scientists and Information Technologists Conference on Knowledge, Innovation and Leadership in a Diverse, Multidisciplinary Environment. 97–105.
- Furqan et al. (2017) Anushay Furqan, Chelsea Myers, and Jichen Zhu. 2017. Learnability through Adaptive Discovery Tools in Voice User Interfaces. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI EA ’17). Association for Computing Machinery, New York, NY, USA, 1617–1623. https://doi.org/10.1145/3027063.3053166
- Grinberg (2018) Miguel Grinberg. 2018. Flask web development: developing web applications with Python. O’Reilly Media, Inc.
- Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1631–1640.
- Gupta et al. (2018) Sonal Gupta, Rushin Shah, Mrinal Mohit, Anuj Kumar, and Mike Lewis. 2018. Semantic Parsing for Task Oriented Dialog using Hierarchical Representations. In EMNLP.
- Hart (2006) Sandra G Hart. 2006. NASA-task load index (NASA-TLX); 20 years later. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 50. Sage Publications, Los Angeles, CA, 904–908.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
- Hoy (2018) Matthew B Hoy. 2018. Alexa, Siri, Cortana, and more: an introduction to voice assistants. Medical reference services quarterly 37, 1 (2018), 81–88.
- Hu et al. (2018) Jinlong Hu, Junjie Liang, Yuezhen Kuang, and Vasant Honavar. 2018. A user similarity-based Top-N recommendation approach for mobile in-application advertising. Expert Systems with Applications 111 (2018), 51–60.
- Hume (2020) Tom Hume. 2020. Use Voice Access to control your Android device with your voice. https://blog.google/outreach-initiatives/accessibility/voice-access-updates/.
- Kalysch et al. (2018) Anatoli Kalysch, Davide Bove, and Tilo Müller. 2018. How Android’s UI Security is Undermined by Accessibility. In Proceedings of the 2nd Reversing and Offensive-oriented Trends Symposium. 1–10.
- Kenton and Toutanova (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186.
- Khan et al. (2018) Akif Khan, Shah Khusro, and Iftikhar Alam. 2018. Blindsense: An accessibility-inclusive universal user interface for blind people. Engineering, Technology & Applied Science Research 8, 2 (2018), 2775–2784.
- Kim et al. (2020) Auk Kim, Jung-Mi Park, and Uichin Lee. 2020. Interruptibility for In-Vehicle Multitasking: Influence of Voice Task Demands and Adaptive Behaviors. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4, 1, Article 14 (mar 2020), 22 pages. https://doi.org/10.1145/3381009
- Kim et al. (2011) Ki Joon Kim, S Shyam Sundar, and Eunil Park. 2011. The effects of screen-size and communication modality on psychology of mobile device users. In CHI’11 Extended Abstracts on Human Factors in Computing Systems. 1207–1212.
- Krishna and Nagendram (2012) Y Bala Krishna and S Nagendram. 2012. Zigbee based voice control system for smart home. International Journal on Computer Technology and Applications 3, 1 (2012), 163–168.
- Kulhalli et al. (2017) Kshama V Kulhalli, Kotrappa Sirbi, and Abhijit J Patankar. 2017. Personal assistant with voice recognition intelligence. International Journal of Engineering Research and Technology 10, 1 (2017), 416–419.
- Li et al. (2020a) Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020a. Mapping Natural Language Instructions to Mobile UI Action Sequences. In Annual Conference of the Association for Computational Linguistics (ACL 2020). https://www.aclweb.org/anthology/2020.acl-main.729.pdf
- Li et al. (2017) Yongfeng Li, Jinbing Ouyang, Bing Mao, Kai Ma, and Shanqing Guo. 2017. Data flow analysis on Android platform with fragment lifecycle modeling and callbacks. EAI Endorsed Transactions on Security and Safety 4, 11 (2017), e2.
- Li et al. (2020b) Zhuang Li, Lizhen Qu, and Gholamreza Haffari. 2020b. Context Dependent Semantic Parsing: A Survey. In Proceedings of the 28th International Conference on Computational Linguistics. 2509–2521.
- Li et al. (2021) Zhuang Li, Lizhen Qu, Shuo Huang, and Gholamreza Haffari. 2021. Few-Shot Semantic Parsing for New Predicates. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 1281–1291.
- Liu et al. (2015) Kuei-Chun Liu, Ching-Hung Wu, Shau-Yin Tseng, and Yin-Te Tsai. 2015. Voice helper: A mobile assistive system for visually impaired persons. In 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing. IEEE, 1400–1405.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
- Louvan and Magnini (2020) Samuel Louvan and Bernardo Magnini. 2020. Recent Neural Methods on Slot Filling and Intent Classification for Task-Oriented Dialogue Systems: A Survey. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain (Online), 480–496. https://aclanthology.org/2020.coling-main.42
- Ma et al. (2016) Yun Ma, Xuanzhe Liu, Ruogu Du, Ziniu Hu, Yi Liu, Meihua Yu, and Gang Huang. 2016. DroidLink: Automated generation of deep links for Android apps. arXiv preprint arXiv:1605.06928 (2016).
- Park and Kim (2018) Geonwoo Park and Harksoo Kim. 2018. Low-cost implementation of a named entity recognition system for voice-activated human-appliance interfaces in a smart home. Sustainability 10, 2 (2018), 488.
- Salehnamadi et al. (2021) Navid Salehnamadi, Abdulaziz Alshayban, Jun-Wei Lin, Iftekhar Ahmed, Stacy Branham, and Sam Malek. 2021. Latte: Use-case and assistive-service driven automated accessibility testing framework for android. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–11.
- Shiri et al. (2022) Fatemeh Shiri, Terry Yue Zhuo, Zhuang Li, Shirui Pan, Weiqing Wang, Reza Haffari, Yuan-Fang Li, and Van Nguyen. 2022. Paraphrasing Techniques for Maritime QA system. In 2022 25th International Conference on Information Fusion (FUSION). IEEE, 1–8.
- Wang et al. (2020) Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7567–7578.
- Wang et al. (1997) Daniel C Wang, Andrew W Appel, Jeffrey L Korn, and Christopher S Serra. 1997. The Zephyr abstract syntax description language. In DSL, Vol. 97. 17–17.
- Wang et al. (2015) Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 1332–1342.
- Wijeratne et al. (2019) Yudhanjaya Wijeratne, Nisansa de Silva, and Yashothara Shanmugarajah. 2019. Natural language processing for government: Problems and potential. International Development Research Centre (Canada) 1 (2019).
- Xaif et al. (2022) Xaif, Swarup Bam, Satya Ram, Ajinkya, Vijay patel, and Kerwin. 2022. Android 12: Google assistant not working [8 fixes]. https://devsjournal.com/android-12-google-assistant-not-working-8-fixes.html
- Xie et al. (2021) Nan Xie, Yiran Ni, Xiaoxiao Liu, Yu Cao, and Weimin Chen. 2021. Implementation of Simulation Control Automation Tool Based on Android Accessibility Service. Journal of Physics: Conference Series 1881, 3 (2021), 032071.
- Xiong and Muraki (2016) Jinghong Xiong and Satoshi Muraki. 2016. Effects of age, thumb length and screen size on thumb movement coverage on smartphone touchscreens. International Journal of Industrial Ergonomics 53 (2016), 140–148.
- Xu et al. (2020a) Silei Xu, Giovanni Campagna, Jian Li, and Monica S Lam. 2020a. Schema2qa: High-quality and low-cost q&a agents for the structured web. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1685–1694.
- Xu et al. (2020b) Silei Xu, Sina Semnani, Giovanni Campagna, and Monica Lam. 2020b. AutoQA: From Databases To QA Semantic Parsers With Only Synthetic Training Data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 422–434.
- Yamada (2020) Kannon Yamada. 2020. How to control your Android device entirely with your voice. https://www.makeuseof.com/tag/control-android-device-entirely-voice/
- Yang et al. (2020) Jackie Yang, Monica S Lam, and James A Landay. 2020. Dothishere: multimodal interaction to improve cross-application tasks on mobile devices. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology. 35–44.
- Zhang et al. (2021) Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, et al. 2021. Screen recognition: Creating accessibility metadata for mobile applications from pixels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–15.
- Zhong et al. (2014) Yu Zhong, T. V. Raman, Casey Burkhardt, Fadi Biadsy, and Jeffrey P. Bigham. 2014. JustSpeak. In Proceedings of the 11th Web for All Conference (W4A ’14) (2014).