The body image of social robots
Abstract
The rapid development of social robots has challenged robotics and the cognitive sciences to understand how humans perceive the appearance of robots. In this study, robot-associated words spontaneously generated by humans were analyzed to semantically reveal the body images of 30 robots developed over the past decades. The analyses took advantage of word affect scales and embedding vectors, and provided converging evidence for links between human perception and body image. It was found that the valence and dominance of the body image reflected humans’ attitudes towards the general concept of robots; that the user bases and usages of the robots were among the primary factors influencing humans’ impressions of individual robots; and that there was a relationship between the robots’ affects and their semantic distances to the word “person”. According to these results, building body images of robots is an effective paradigm to investigate which features people appreciate and what influences people’s feelings towards robots.
I INTRODUCTION
The present study aims to understand how humans perceive robots and how they form semantic representations of the robots from purely visual characteristics. The term “body image” will be used to refer to the semantic understanding a person derives from viewing (images of) robots.
Neuroscientific research has shown that visual input to the brain is transformed into semantic representations through a default path [1]. Visual traits such as strokes, shapes, and colors are processed in the occipital lobe and then transmitted to the frontal and temporal lobes through two pathways for semantic interpretation of the object [2]. Importantly, significant traits such as faces and word forms are processed in dedicated regions close to occipital regions [3, 4], before other semantic information is processed in the temporal lobe [5, 6, 7]. Recently, a third pathway has been proposed that is specialized in processing social cues [8]. Therefore, the body image of a robot encompasses not only the visual appearance of the robot but also its semantic meaning and social significance.
Studies in semantics have provided models of how concepts are organized in the brain’s semantic system. In psycholinguistics, a commonly used model is the word embedding model [9]. Using a single-layer neural network, this model estimates the likelihood of words appearing next to each other in large corpora of naturally occurring text, typically collected from the internet. Research has shown that the model accurately reflects real-world semantic information [10]. Therefore, incorporating a person’s understanding of a robot’s functions into such a semantic framework could potentially improve the model of a robot’s body image. Instead of focusing on the visual appearance of robots, which is often difficult to quantify, we thus propose to deduce a robot’s body image by analyzing the words that are spontaneously associated with the robot.
With the goal of understanding humans’ mental representations of social robots, we propose to examine the characteristic features of robots from a semantic perspective. We will use a free word-association task to collect basic impressions of various robots and model these data within a semantic space using word vectors. We will also quantify the affect dimensions of the words using the “valence”, “arousal”, and “dominance” measurements from an affect lexicon [11]. Finally, we will determine how the word vectors that characterize a given robot relate to the word “person” within the semantic space, to assess how closely participants associate the robot with the human environment. Taken together, the study demonstrates the importance of representation in studying human-robot interaction.
TABLE I: The robot attitude questionnaire.

| INSTRUCTION | Please select the option that best describes your attitude at the moment: |
|---|---|
| STATEMENTS | I want a robot to assist me … |
| | (1) … at home |
| | (2) … at school |
| | (3) … in dangerous locations |
| | (4) … in factories |
| | (5) … in hospitals |
| | (6) … in hotels |
| | (7) … in museums |
| | (8) … in offices |
| | (9) … in police stations |
| | (10) … in public transportation |
| | (11) … in shopping centers |
| | (12) … in sports facilities |
| RESPONSES | Strongly disagree |
| | Disagree |
| | Neither agree nor disagree |
| | Agree |
| | Strongly agree |
TABLE II: The 30 most frequent words in the free-association responses.

| Rank | Word | Frequency |
|---|---|---|
1 | toy | 42 |
2 | cute | 39 |
3 | friendly | 33 |
4 | small | 30 |
5 | robot | 29 |
6 | helpful | 28 |
7 | scary | 25 |
8 | future | 23 |
9 | dog | 22 |
10 | creepy | 17 |
11 | fun | 16 |
12 | human | 16 |
13 | technology | 16 |
14 | child | 14 |
15 | cool | 14 |
16 | weird | 14 |
17 | animal | 12 |
18 | simple | 12 |
19 | happy | 11 |
20 | smart | 11 |
21 | strong | 11 |
22 | artificial intelligence | 10 |
23 | assistant | 10 |
24 | automatic | 10 |
25 | color | 10 |
26 | helper | 10 |
27 | modern | 10 |
28 | useful | 10 |
29 | wheels | 10 |
30 | automated | 9 |
II METHODS
II-A The robots and human participants
We selected 30 robots from those available on the market or as prototypes over the past decades. The robots were picked to cover a wide range of features, e.g. whether a robot has a face, legs, or arms, or does not resemble common animal or human shapes at all. The robots included in the study are: Buddy, Eilik, Musio, Pico, Sima, Aibo, Ameca, Aquanaut, Asimo, Astro, Atlas, Emiew, Hexa, Hitchbot, iCub, Jibo, Lovot, Minicheetah, Moxie, Nao, Neubie, Nicobo, Optimus, Parky, Pepper, Pyxel, Sawyer, Spotmini, Talos, Vector.
It is worth mentioning that “robot” is an ambiguous concept, and it is sometimes hard to decide which machines should be included. A typical case is drones, e.g. quadrotor aircraft. Although some people regard them as robots (e.g. drones are listed as robots at https://robots.ieee.org/robots/), others consider them simply aircraft. More importantly, the purpose of the present study is to investigate the robot body image as a basis for human-robot interaction, which is not a typical purpose of drones. They were therefore excluded from the present study.
The experiment was conducted on the Prolific platform (https://app.prolific.co) and built with jsPsych [12]. Before publishing the task, we set pre-screening filters to include only native English speakers between 18 and 30 years old, maintaining homogeneity within an age span whose members are very likely to have grown up with the internet and to be more open to new technologies.
Before the task, participants were informed that they could quit at any moment and would be compensated for the time spent on the task. The task took 14 minutes per participant on average, and compensation was approximately €7.15/hour. Thirty participants were recruited into the experiment.
II-B The robot attitude scale
The experiment consisted of two sessions: one measured attitudes towards robots, and the other built body images of the robots. To measure the attitudes, we designed a questionnaire; to build the body images, we used a free association task [13, 14]. Participants first completed the questionnaire and then the free association task.
The robot attitude questionnaire is a Likert scale presenting statements about whether the subject would like to be assisted by robots in different scenarios, as shown in Tab. I. The questionnaire was adapted from a scale used in a previous study examining differences across cultures [15]. Participants chose the option that best described their attitude on a 5-point Likert scale, from “Strongly disagree” to “Strongly agree”. The responses were assigned values from 0 to 4, representing a spectrum from negative to positive attitude, and the overall attitude of a participant was averaged over all statements. This questionnaire came before the free association task; hence “robot” here referred not to any specific robot model but to the general concept of robots.
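As a minimal sketch of this scoring step (the value mapping follows the text above; the function and variable names are illustrative, not from the original materials):

```python
# Map the 5 Likert options to 0-4 and average them into one attitude score.
LIKERT_VALUES = {
    "Strongly disagree": 0,
    "Disagree": 1,
    "Neither agree nor disagree": 2,
    "Agree": 3,
    "Strongly agree": 4,
}

def attitude_score(responses: list[str]) -> float:
    """Average the numeric values of one participant's 12 Likert responses."""
    return sum(LIKERT_VALUES[r] for r in responses) / len(responses)

# A participant who agrees with every statement scores 3.0.
print(attitude_score(["Agree"] * 12))
```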
II-C Free association
After the questionnaire, we asked the participants to type six words that came to mind for each of 10 images presented serially. Each image was presented at the top of the screen, followed by six text boxes. Participants typed the words sequentially from the 1st to the 6th text box and clicked the “Finished” button to proceed to the next image. They could press the “Enter” key to jump to the next text box after typing a word. For each image, all six words were mandatory and no text box could be skipped. The auto-fill functionality of our jsPsych program was disabled.
To avoid the confound of within-participant correlation, i.e. a participant responding similarly across all the robots observed, we used a randomized robot selection procedure, described in another study [16]. With this procedure, every participant saw a different subset of 10 robots.
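A hypothetical sketch of this selection step is given below; the exact balanced randomization follows [16], so the plain random sampling shown here is a simplification:

```python
# Each participant sees a random subset of 10 of the 30 robots,
# so no single participant rates every robot.
import random

ROBOTS = ["Buddy", "Eilik", "Musio", "Pico", "Sima", "Aibo", "Ameca", "Aquanaut",
          "Asimo", "Astro", "Atlas", "Emiew", "Hexa", "Hitchbot", "iCub", "Jibo",
          "Lovot", "Minicheetah", "Moxie", "Nao", "Neubie", "Nicobo", "Optimus",
          "Parky", "Pepper", "Pyxel", "Sawyer", "Spotmini", "Talos", "Vector"]

def robots_for_participant(seed: int, k: int = 10) -> list[str]:
    """Draw a reproducible random subset of k robots for one participant."""
    rng = random.Random(seed)
    return rng.sample(ROBOTS, k)

print(robots_for_participant(seed=1))
```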
The 30 most frequent words are listed in Tab. II.
II-D Word vectors and affect measurements
To represent words as vectors, we used the 300-dimensional fastText word vectors pre-trained on Wikipedia and news data (https://fasttext.cc/docs/en/english-vectors.html), which build on the Word2vec approach. Since a robot is associated with multiple words, the word vectors were averaged per dimension to form a single 300-dimensional vector representing the robot, as illustrated in Fig. 2.
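The averaging step could look like the following sketch, assuming the vectors are loaded with gensim; the file name, example words, and helper names are illustrative:

```python
# Average per-word fastText vectors into a single 300-d robot vector.
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained fastText wiki-news vectors (text format, 300 dimensions).
vectors = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

def robot_vector(words: list[str]) -> np.ndarray:
    """Average the vectors of all association words known to the model."""
    known = [w for w in words if w in vectors]
    return np.mean([vectors[w] for w in known], axis=0)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative association words for two robots.
aibo = robot_vector(["toy", "dog", "cute", "pet", "robot", "small"])
nao = robot_vector(["toy", "child", "friendly", "human", "robot", "cute"])
print(cosine_similarity(aibo, nao))
```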
To quantify the affect dimensions of words, we used the valence, arousal, and dominance (VAD) measurements from Mohammad’s (2018) affect lexicon, which rates 20,000 English words using best-worst scaling [16, 11]. After collection, the responses were pre-processed: plural words were converted to singular forms, and capitalized words were lower-cased. For compound words and phrases, we extracted the more representative word where possible, e.g. “canfly” was replaced with “fly” and “car-like” with “car”; when representativeness was hard to decide, e.g. for “artificial intelligence”, the whole compound was discarded. Some adjectives were not in the lexicon but their synonyms were, e.g. “gangly” was absent while “awkward” was present; in such cases we substituted the synonym according to the Merriam-Webster thesaurus (https://www.merriam-webster.com/thesaurus/). After preprocessing, the VAD lexicon covered 92.33% of the responses. In the lexicon, the affect values range from 0 to 1, with larger values representing higher affect ratings.
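A simplified sketch of this preprocessing and lookup is shown below; the lexicon’s exact file layout, the crude singularization rule, and all names here are assumptions:

```python
# Load the VAD lexicon (assumed tab-separated: word, valence, arousal, dominance)
# and normalize responses before looking them up.
import csv

def load_vad(path: str) -> dict[str, tuple[float, float, float]]:
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        return {w: (float(v), float(a), float(d)) for w, v, a, d in reader}

SYNONYMS = {"gangly": "awkward"}  # manual thesaurus substitutions

def normalize(word: str) -> str:
    """Lower-case, substitute synonyms, and crudely singularize a response."""
    word = word.lower().strip()
    word = SYNONYMS.get(word, word)
    if word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

vad = load_vad("NRC-VAD-Lexicon.txt")
responses = ["Toys", "cute", "gangly"]
scored = [(w, vad[n]) for w in responses if (n := normalize(w)) in vad]
```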
III RESULTS
III-A Word affect predicts subjective attitude
At the participant level, we investigated whether the general attitude towards robots could be predicted by the affect measures of the association words. Three linear mixed-effect (LME) models were built, each with attitude scores as the dependent variable and one affect measure as the only fixed-effect independent variable. Random intercepts for robots were included as the maximal random-effect structure.
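The paper does not state which software was used; as one possible realization, the following sketch fits such a model with statsmodels and compares it against an intercept-only null model via a likelihood ratio test. The data-frame layout is an assumption:

```python
# LME with attitude as dependent variable, valence as the fixed effect,
# and random intercepts per robot; LRT against a null model.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def lrt_valence(data: pd.DataFrame) -> float:
    """Likelihood ratio test for valence, assuming columns
    `attitude`, `valence`, and `robot` (one row per participant x robot)."""
    full = smf.mixedlm("attitude ~ valence", data, groups=data["robot"]).fit(reml=False)
    null = smf.mixedlm("attitude ~ 1", data, groups=data["robot"]).fit(reml=False)
    lr = 2 * (full.llf - null.llf)
    return stats.chi2.sf(lr, df=1)  # one extra parameter in the full model
```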
The results showed that valence was a strong predictor of attitude (likelihood ratio test, LRT ); dominance was a strong predictor of attitude (LRT ); whereas arousal did not predict attitude (LRT ).
The relationships between the attitudes and affects are shown in Fig. 1. The affect values were averaged across words within each participant, i.e. each dot represents one participant. Overall, the results validate that the semantic associations of a robot’s body image reflect the attitude of the observer.
III-B The graph of body images
People classify robots by various criteria, such as a robot’s shape, functions, or usage. However, the relative positions between these criteria, or which criterion subjects prefer most, have scarcely been determined.
Here, we built a graph of robots based on their association word vectors. The vectors within a robot were averaged to represent the robot, and each robot was connected to its three most similar robots based on pair-wise cosine similarity of the vector representations. These connections made up a graph of the 30 robots, as shown in Fig. 2 and Fig. 3D. In the body graph, we identified some robot types that are commonly referred to as humanoids, robot assistants, robot pets, etc. However, some robots were not clearly classified as members of a category. For example, Aibo, often referred to as a “robot dog”, is more closely connected to Eilik than to Spot; and Neubie, shaped like a trolley or a locomotive, is more likely to be connected with humanoids than with the similar-shaped Parky and other wheel-based robots, e.g. Pico and Astro.
Even so, the most distinguishable categories are still identifiable by finding cliques, i.e. fully connected components of the graph. We highlighted the 4-cliques in the graph to illustrate the robot categories, namely (A) robot pets, (B) humanoids, and (C) robot kids (Fig. 3). Nao fits neatly into the categories of child-oriented robots, which include robot pets and robot kids: people who saw Nao would immediately come up with the user base and usages it is supposed to fit. In contrast, Neubie, with its indefinite body image, is harder to connect to a specific user base or usage. Overall, the categories of robots identified by their semantic relationships indicate that subjects tend to build body images based on the expected user base and usage of a robot, rather than on its shape.
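A sketch of the graph construction and clique detection follows, assuming the averaged robot vectors from Sec. II-D; random placeholder vectors are used here so the snippet runs standalone:

```python
# Connect each robot to its 3 most similar robots (cosine similarity),
# then enumerate the 4-cliques with networkx.
import networkx as nx
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_graph(robot_vecs: dict[str, np.ndarray], k: int = 3) -> nx.Graph:
    """Undirected graph linking each robot to its k nearest neighbors."""
    g = nx.Graph()
    g.add_nodes_from(robot_vecs)
    for a in robot_vecs:
        sims = {b: cosine(robot_vecs[a], robot_vecs[b])
                for b in robot_vecs if b != a}
        for b in sorted(sims, key=sims.get, reverse=True)[:k]:
            g.add_edge(a, b)
    return g

# Random placeholder vectors stand in for the real averaged word vectors.
rng = np.random.default_rng(0)
robot_vecs = {name: rng.normal(size=300)
              for name in ["Aibo", "Nao", "Pepper", "Spotmini", "Eilik"]}
g = build_graph(robot_vecs)
four_cliques = [c for c in nx.enumerate_all_cliques(g) if len(c) == 4]
```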
III-C Human distance predicts affects of body image
The distance between a body image and a human-related word might be useful in revealing how semantically close the robot is to human life. Here we obtained the averaged vectors of the robots and computed their distance to the word “person” as $d = 1 - \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$, where $\mathbf{u}$ and $\mathbf{v}$ are the vectors abstracting a robot and the target word “person”, respectively.
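Assuming the cosine distance implied by the cosine similarity used elsewhere in the paper, the human distance can be computed as in this minimal sketch:

```python
# Cosine distance between a robot's averaged vector and the "person" vector.
import numpy as np

def human_distance(robot_vec: np.ndarray, person_vec: np.ndarray) -> float:
    cos = robot_vec @ person_vec / (np.linalg.norm(robot_vec) * np.linalg.norm(person_vec))
    return 1.0 - float(cos)
```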
We chose “person” as the target word representing real human life in order to maximize the number of lexicon words near each robot’s distance to the target word; these words were used to control for affect bias towards the target word. An example of such a bias is valence, which is positively correlated with a word’s distance to “human”: the affect levels of words tend to show a systematic relationship with their distance to the target word. We therefore subtracted a baseline from each robot’s affect values to standardize its affect level. The baseline is the mean affect level of lexicon words at a distance similar to the robot’s. Precisely, in the semantic space, we calculated the average difference between the robots’ distances to the target word and used it as a window centered on each robot’s distance; lexicon words falling within the window were taken as the affect baseline words. The baseline words centered on “human” or “people” were much fewer than those centered on “person”; hence “person” yielded the most robust baseline.
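A hedged sketch of this baseline correction; the window half-width and all names are assumptions based on the description above:

```python
# Subtract, from a robot's affect value, the mean affect of lexicon words
# whose distance to "person" falls in a window around the robot's distance.
import numpy as np

def corrected_affect(robot_affect: float, robot_dist: float,
                     lex_dists: np.ndarray, lex_affects: np.ndarray,
                     h: float) -> float:
    """h: window half-width, derived from the average spacing between
    the robots' distances to the target word."""
    mask = np.abs(lex_dists - robot_dist) <= h
    baseline = lex_affects[mask].mean()
    return robot_affect - baseline
```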
Here, we investigated the perceived affects by examining the relationship between human distance and the affects of body image. Fig. 4 shows the results for valence, arousal, and dominance respectively. We find that valence and dominance are both functions of human distance: valence grows with human distance at first and then plateaus, with a slight drop, at a level higher than the baseline; dominance grows monotonically with human distance. In contrast, no simple relationship between human distance and arousal was identified.
Interestingly, we found that most of the humanoids have a larger human distance than the other robots. For example, the most distant robots are Atlas, Optimus, Pepper, Musio, Hitchbot, Talos, iCub, and Asimo, while the nearest robots are Aquanaut, Jibo, Spot, Aibo, Pico, Pyxel, and Buddy. These results indicate that being more human-like does not necessarily lead a robot to fit better into human life.
IV DISCUSSION
Psycholinguistics provides rich knowledge about words. Here, we introduce a paradigm that enables the utilization of this knowledge to analyze human-robot interaction.
Specifically, we analyzed words associated with the visual images of robots to explore humans’ perception of them, and demonstrated that word affect reflects people’s attitudes towards robots. Secondly, based on the associated words, robots can be grouped according to their intended user base and appearance. The third result highlighted the importance of a robot’s body image and its implications for human-robot interaction. The study concludes by emphasizing the significance of body images in helping researchers and designers understand the various dimensions of robots and the impact they have on human responses.
Our paradigm consists of two procedures. In the first procedure, robots are associated with a number of words, allowing us to infer perceived emotions, i.e. the degree of valence (“positive/pleasure” or “negative/displeasure”) and the degree of dominance (“powerful/strong” or “powerless/weak”). In Fig. 1, we demonstrate how word-related valence and dominance can be used to predict human attitudes towards robots as measured by Likert scales. Free association has also been used by another recent study on social robots [17]. In that study, the authors investigated participants’ word associations with the written word “robot” and utilized factor analysis to categorize the mental representation of robots as social entities. Our study shares the same theoretical foundation with this previous research, but with a more specific focus on individual robots rather than a general robot concept or its social presence.
In the second procedure, the words are converted into vectors, abstracting a robot with a 300-dimensional vector embedded in a robust semantic space. In Fig. 3, we demonstrate that although the grouping of the different robots is based on the words that were associated with them, members of a group show similarities in their physical appearance and derived features, e.g. humanoids are grouped around the blue clique on the left side, next to the quadrupeds, while child-oriented robots manifest themselves as the green and light blue cliques. By increasing the number of robots (the present study only included 30), this pattern might become even clearer. These results thus support the notion that the body image of a robot encompasses interconnected visual, semantic, and potentially social features, all inferred from the robot’s morphology.
The paradigm provides an information-rich description of the robot as perceived by a human subject, which is expected to become increasingly detailed as the lexical scales and semantic space are probed further: for example, how the robots relate to each other, and even to other concepts within the semantic space (e.g. the concept of “person”). In Fig. 4, we demonstrate how valence and dominance relate to the proximity of the robot to the concept of “person”.
Finally, the finding that the valence and dominance of a body image predict attitudes might also provide insights for commercial robot development. For example, in the analysis shown in the right panel of Fig. 4, humanoids are perceived as more dominant on the one hand, and as further away from “person” on the other. This suggests that the likelihood of purchasing a robot like Jibo or Buddy (both low in dominance and close to “person”) might be higher than for humanoids.
References
- [1] A. Doerig, T. C. Kietzmann, E. Allen, Y. Wu, T. Naselaris, K. Kay, and I. Charest, “Semantic scene descriptions as an objective of human vision,” 2022. [Online]. Available: https://arxiv.org/abs/2209.11737
- [2] K. Kar and J. J. DiCarlo, “Fast recurrent processing via ventrolateral prefrontal cortex is needed by the primate ventral stream for robust core visual object recognition,” Neuron, vol. 109, no. 1, pp. 164–176, 2021.
- [3] T. Hannagan, A. Agrawal, L. Cohen, and S. Dehaene, “Emergence of a compositional neural code for written words: Recycling of a convolutional neural network for reading,” Proceedings of the National Academy of Sciences, vol. 118, no. 46, 2021.
- [4] D. Y. Tsao, W. A. Freiwald, T. A. Knutsen, J. B. Mandeville, and R. B. Tootell, “Faces and objects in macaque cerebral cortex,” Nature neuroscience, vol. 6, no. 9, pp. 989–995, 2003.
- [5] K. Patterson, P. J. Nestor, and T. T. Rogers, “Where do you know what you know? The representation of semantic knowledge in the human brain,” Nature Reviews Neuroscience, vol. 8, no. 12, pp. 976–987, Dec. 2007. [Online]. Available: https://www.nature.com/articles/nrn2277
- [6] S. Bracci and H. O. de Beeck, “Dissociations and associations between shape and category representations in the two visual pathways,” Journal of Neuroscience, vol. 36, no. 2, pp. 432–444, 2016.
- [7] J.-Q. Tong, J. R. Binder, C. J. Humphries, S. Mazurchuk, L. L. Conant, and L. Fernandino, “A distributed network for multimodal experiential representation of concepts,” Journal of Neuroscience, 2022.
- [8] D. Pitcher and L. G. Ungerleider, “Evidence for a third visual pathway specialized for social perception,” Trends in Cognitive Sciences, vol. 25, no. 2, pp. 100–110, 2021.
- [9] S. Bhatia, “Associative judgment and vector space semantics.” Psychological review, vol. 124, no. 1, p. 1, 2017.
- [10] G. Grand, I. A. Blank, F. Pereira, and E. Fedorenko, “Semantic projection recovers rich human knowledge of multiple object features from word embeddings,” Nature Human Behaviour, pp. 1–13, 2022.
- [11] S. Mohammad, “Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 english words,” in Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long papers), 2018, pp. 174–184.
- [12] J. R. De Leeuw, “jspsych: A javascript library for creating behavioral experiments in a web browser,” Behavior research methods, vol. 47, no. 1, pp. 1–12, 2015.
- [13] D. L. Nelson, C. L. McEvoy, and S. Dennis, “What is free association and what does it measure?” Memory & cognition, vol. 28, no. 6, pp. 887–899, 2000.
- [14] S. De Deyne, D. J. Navarro, and G. Storms, “Better explanations of lexical and semantic cognition using networks derived from continued rather than single-word associations,” Behavior Research Methods, vol. 45, no. 2, pp. 480–498, Jun. 2013. [Online]. Available: https://link.springer.com/article/10.3758/s13428-012-0260-7
- [15] H. R. Lee and S. Sabanović, “Culturally variable preferences for robot design and use in south korea, turkey, and the united states,” in Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction, 2014, pp. 17–24.
- [16] G. Hollis, “Scoring best-worst data in unbalanced many-item designs, with applications to crowdsourcing semantic judgments,” Behavior research methods, vol. 50, no. 2, pp. 711–729, 2018.
- [17] S. Brondi, M. Pivetti, S. Di Battista, and M. Sarrica, “What do we expect from robots? social representations, attitudes and evaluations of robots in daily life,” Technology in Society, vol. 66, p. 101663, 2021.