Osaka Institute of Technology
1-79-1 Kitayama, Hirakata, Osaka 573-0196, Japan
email: [email protected]
Selfie Taking with Facial Expression Recognition Using Omni-directional Camera
Abstract
Recent studies have shown that visually impaired people desire to take selfies in the same way as sighted people do, to record photos and share them with others. Although support applications using sound and vibration have been developed to help visually impaired people take selfies with smartphone cameras, it is still difficult to capture everyone within the angle of view and to confirm that they all have good expressions in the photo. To mitigate these issues, we propose a method for taking selfies with multiple people using an omni-directional camera. Specifically, a user takes a few seconds of video with an omni-directional camera, and face detection is performed on all frames. The proposed method then eliminates false face detections and interpolates undetected ones considering the consistency across all frames. After performing facial expression recognition on all the frames, the proposed method finally extracts the frame in which the participants are happiest, and generates from the omni-directional frame a perspective projection image in which all the participants are within the angle of view. In experiments, we use several scenes with different numbers of participants to demonstrate the effectiveness of the proposed method.
Keywords:
selfie · omni-directional camera · face detection · facial expression recognition · perspective projection image
1 Introduction
According to a survey reported in 2011 [11], about 71% of visually impaired people had recently taken photos. This suggests that many visually impaired people desire to take photos and share them with others in the same way as sighted people do. Moreover, a survey published in 2021 reported that about 90% of visually impaired people use iPhone's VoiceOver, which has a function that tells how many people are within the field of view of the camera, allowing visually impaired users to take photos of people. However, because this function counts a face even when only about 70% of it is within the frame, some participants may end up partly outside the frame when taking a selfie with multiple people. In addition, it is difficult for visually impaired users to judge whether the photographed persons have good facial expressions.
In this study, focusing on situations in which visually impaired people take selfies, we propose a method that enables them to take selfies with multiple people using an omni-directional camera. Specifically, a user first takes a few seconds of video using an omni-directional camera. Next, the proposed method performs face detection on all frames, and then eliminates false face detections and interpolates undetected ones considering the consistency across all frames. After performing facial expression recognition on all the frames, the proposed method finally extracts the frame in which the participants are happiest, and generates from the omni-directional frame a perspective projection image in which all the participants are within the angle of view.
2 Related Work
This section introduces photography systems that support visually impaired people and methods using omni-directional images related to this study. Several methods have been proposed to assist visually impaired people in taking photographs. Jayant et al. proposed EasySnap [11, 19], which enables users to adjust the composition, zoom level, and brightness of the subject according to voice instructions. They also proposed PortraitFramer [11], which informs users of the positions of faces by vibration when the smartphone screen is touched, allowing them to move the camera to achieve the desired composition. Vázquez et al. [18] proposed a method that automatically adjusts the camera roll, and Balata et al. [5] provide feedback to users on the composition of their photos using center and golden-ratio compositions. These methods enable users to confirm the composition before taking photos. In addition, services and systems have been proposed to support visually impaired people in their daily lives [6, 1, 2, 4, 3, 15, 14, 13, 8]. Although these help visually impaired people obtain information through photographs, none of them are intended for taking selfies.
As mentioned above, although many support systems using the camera built into smartphones have been proposed, it is still difficult for visually impaired people to grasp the positions of the objects to be photographed and to adjust the camera so that they are included within its angle of view. To use these systems, users need a certain amount of training. Gonzalez Penuela et al. [9] also reported that it is difficult for visually impaired people to reliably follow voice instructions such as "turn your smartphone 45° to the right", and that some users find vibratory feedback confusing and uncomfortable.
To solve these problems, methods using an omni-directional camera, which can capture an object without pointing the camera at it, have been proposed. According to Jokela et al.'s survey [12] on the usability of omni-directional cameras, taking selfies and group photos is an important use of such cameras. However, they reported that omni-directional images containing a lot of information can sometimes be uninteresting. To address this issue, Iwamura et al. [10] proposed VisPhoto, which selects an arbitrary object from the detected objects and outputs a perspective projection image. Although this method succeeds in capturing the object within the angle of view, it does not aim at taking selfies and does not take human facial expressions into consideration. As a study of selfies using an omni-directional camera, Obata et al. [16] proposed a system for sending omni-directional video messages to family members who are separated from each other during a meal in order to communicate with them. As one of the functions of this system, the face image of the person of interest in the omni-directional video is combined with the frontal view of the omni-directional video by a face-tracking function to share both the face and the view the user is seeing. However, this research focuses on communication through video messages and does not aim to take selfies with good facial expressions.
Unlike the previous studies, this study focuses on taking selfies and aims to output a perspective projection image in which all the faces are in the angle of view and have good facial expressions.
3 Proposed Method
3.1 Outline
In the proposed method, a user first takes a few seconds of omni-directional video. Second, the method performs face detection for all persons in all frames of the omni-directional video. Third, the method eliminates false detections and interpolates undetected faces. Fourth, the method performs facial expression recognition on all the frames and extracts the frame with the highest happiness value over all the participants. Finally, the method transforms the omni-directional image into a perspective projection image. In this step, it calculates the image center and the angle of view so that all the faces are within the frame. In the following, we describe each step in detail.
3.2 Omni-directional Video Capture
In the proposed method, a user takes a selfie with an omni-directional camera according to the following three conditions.
1. A user extends his/her arm and holds the camera at about the same height as his/her shoulder.
2. Persons to be photographed are within about 1.5 meters from the camera.
3. Persons to be photographed gather in one direction from the camera and do not spread out to the sides.
This study also assumes that the downward direction of the omni-directional image is the direction of gravity in the real world, using the direction of gravity obtained by the acceleration sensor inside the omni-directional camera.
3.3 Frame Extraction with Face Detection and Expression Recognition
3.3.1 Face Detection on All Frames
The proposed method detects faces in all frames of an omni-directional video. Here, we use MTCNN (Multi-task Cascaded Convolutional Neural Networks) [20]. Figure 1(a) shows an example of successful face detection. However, objects other than faces, such as keyboards and clocks, can also be falsely detected, as shown in Figs. 1(b)(c)(d). In addition, MTCNN often fails to detect faces, as shown in Fig. 2. This may be caused by various factors such as brightness, hair, and hats.
To reduce false detections, assuming that the persons to be photographed are within approximately 0.3 to 1.5 meters from the camera as mentioned above, we set a minimum length for one side of the bounding box. This removes false detections with small bounding boxes such as those shown in Figs. 1(b)(c). However, since this alone does not completely prevent false detections, further measures are necessary.
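The paper does not provide an implementation; the following is a minimal sketch of this detection step, assuming the open-source `mtcnn` Python package and OpenCV for video decoding, with the 50-pixel minimum side length listed in Table 2. The function and variable names are illustrative.

```python
import cv2                # OpenCV, used here for video decoding
from mtcnn import MTCNN   # pip install mtcnn

MIN_SIDE = 50             # minimum bounding-box side length in pixels (Table 2)

detector = MTCNN()

def detect_faces(frame_bgr):
    """Detect faces in one equirectangular frame and drop very small boxes."""
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    boxes = []
    for det in detector.detect_faces(frame_rgb):
        x, y, w, h = det["box"]
        if min(w, h) >= MIN_SIDE:      # reject small, likely false detections
            boxes.append((x, y, w, h))
    return boxes

def detect_all_frames(video_path):
    """Run face detection on every frame of an omni-directional video."""
    cap = cv2.VideoCapture(video_path)
    all_boxes = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        all_boxes.append(detect_faces(frame))
    cap.release()
    return all_boxes
```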



3.3.2 Eliminating Falsely Detected Faces and Interpolating Undetected Faces
To further reduce false detections and to interpolate undetected faces, we first calculate the center coordinates of the bounding boxes from the upper-left coordinates, widths, and heights of all the faces detected in all the frames. Second, we cluster all the center coordinates of the bounding boxes using the mean shift method [7]. Third, if the ratio of the number of elements in a class to the total number of frames is below a certain threshold, the bounding boxes in the class are deleted as false detections. In addition, we set a maximum length for one side of the bounding box, and if the ratio of elements in a class that have large regions, as shown in Fig. 1(d), exceeds a certain threshold, we eliminate the class. Finally, if a remaining class has no element in a certain frame, a bounding box is generated in that frame using the mean coordinates and the mean width and height of the bounding boxes in the class. This allows undetected faces to be interpolated. The face detection results with false detections eliminated and undetected faces interpolated are used for the subsequent facial expression recognition.
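A minimal sketch of this step is shown below, assuming the per-frame box lists produced by the detection sketch above, scikit-learn's MeanShift for clustering, and the 25% threshold and 250-pixel maximum side length from Table 2. The bandwidth value and the exact form of the large-region test are assumptions, and this simple version ignores the horizontal wrap-around of the equirectangular image.

```python
import numpy as np
from sklearn.cluster import MeanShift

RATIO_TH = 0.25   # a class must cover at least 25% of the frames (Table 2)
MAX_SIDE = 250    # maximum bounding-box side length in pixels (Table 2)

def clean_detections(all_boxes, bandwidth=100.0):
    """Eliminate false detections and interpolate undetected faces.

    all_boxes: one list of (x, y, w, h) boxes per frame, as returned by
    detect_all_frames(). Returns a list of the same length with one box
    per remaining class in every frame.
    """
    n_frames = len(all_boxes)
    records = [(i, b) for i, boxes in enumerate(all_boxes) for b in boxes]
    centers = np.array([[b[0] + b[2] / 2.0, b[1] + b[3] / 2.0] for _, b in records])

    labels = MeanShift(bandwidth=bandwidth).fit_predict(centers)

    cleaned = [[] for _ in range(n_frames)]
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        boxes = np.array([records[i][1] for i in idx], dtype=float)
        frames = [records[i][0] for i in idx]
        # Drop classes that appear in too few frames (false detections).
        if len(idx) / n_frames < RATIO_TH:
            continue
        # Drop classes dominated by overly large regions (ratio test is an assumption).
        if np.mean(np.maximum(boxes[:, 2], boxes[:, 3]) > MAX_SIDE) > RATIO_TH:
            continue
        mean_box = boxes.mean(axis=0)   # mean position and size of the class
        for f in range(n_frames):
            if f in frames:
                cleaned[f].append(tuple(boxes[frames.index(f)]))
            else:
                cleaned[f].append(tuple(mean_box))   # interpolate the undetected face
    return cleaned
```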
3.3.3 Facial Expression Recognition and Extraction of a Frame
The proposed method applies FER (Face Emotion Recognizer) [17], which recognizes facial expressions in seven categories: angry, disgust, fear, happy, sad, surprise, and neutral. Based on the recognition results, the frame with the highest overall happiness value $H$ is extracted from the omni-directional video using Eq. (1):

$$H = \mu - \alpha\,\sigma^2 \qquad (1)$$

where $\mu$ and $\sigma^2$ represent the mean and variance of the happiness expression values of all the detected faces in a frame, and $\alpha$ is a coefficient. The overall happiness value $H$ defined in Eq. (1) is calculated for each frame, and the frame with the highest $H$ is extracted. In this study, we define the overall happiness as the mean minus the variance, based on the idea that it would be ideal if everyone's facial expressions were as uniform as possible.
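As a concrete illustration, the frame selection of Eq. (1) can be sketched with the `fer` package [17] as follows, using $\alpha = 1$ from Table 2. Cropping each interpolated bounding box before recognition and treating a face that FER cannot analyse as having zero happiness are assumptions of this sketch.

```python
import numpy as np
from fer import FER   # pip install fer

emotion_detector = FER()
ALPHA = 1.0           # coefficient alpha in Eq. (1) (Table 2)

def happiness_score(frame, boxes):
    """H = mean - alpha * variance of the 'happy' scores of all faces in a frame."""
    happy = []
    for (x, y, w, h) in boxes:
        x, y, w, h = int(x), int(y), int(w), int(h)
        face = frame[max(y, 0):y + h, max(x, 0):x + w]
        result = emotion_detector.detect_emotions(face)
        # Assumption: a face that FER cannot analyse contributes zero happiness.
        happy.append(result[0]["emotions"]["happy"] if result else 0.0)
    if not happy:
        return float("-inf")   # no faces in this frame
    happy = np.array(happy)
    return happy.mean() - ALPHA * happy.var()

def select_best_frame(frames, cleaned_boxes):
    """Index of the frame with the highest overall happiness value H."""
    scores = [happiness_score(f, b) for f, b in zip(frames, cleaned_boxes)]
    return int(np.argmax(scores))
```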

3.4 Transformation to Perspective Projection Image
The proposed method transforms the omni-directional frame with the highest $H$ into a perspective projection image in which all the participants are within the angle of view. Figure 3 shows the relationships for the transformation. The transformation from an omni-directional image to a perspective projection image is obtained by considering the following three relationships:
1. A coordinate in an omni-directional image and a 3D position on the sphere
2. A coordinate in a perspective projection image and a 3D position on the plane
3. The 3D positions on the sphere and the plane through which the same ray passes
In addition, according to the positions of the photographed faces, the proposed method determines the center of the perspective projection image and the angle of view. We describe each relationship in detail below.
3.4.1 Relationship between Coordinate in Omni-directional Image and 3D Position of Sphere
A coordinate $(x_e, y_e)$ in an omni-directional image with equirectangular projection corresponds to the angles $\theta$ and $\phi$, which are the angles around and from the Z-axis connecting the top and bottom of the sphere, respectively; the corresponding 3D position on the unit sphere is $\mathbf{p}_s = (\sin\phi\cos\theta,\ \sin\phi\sin\theta,\ \cos\phi)^\top$. Given that $H_e$ and $W_e$ denote the vertical and horizontal lengths of the omni-directional image, respectively, the coordinate in the omni-directional image can be expressed using the angles $\theta$ and $\phi$ as follows:

$$x_e = \frac{W_e}{2\pi}\,\theta, \qquad y_e = \frac{H_e}{\pi}\,\phi. \qquad (2)$$
3.4.2 Relationship between Coordinate in Perspective Projection Image and 3D Position of Plane
Figure 4 shows the 3D position of the plane, corresponding to a perspective projection image, tangent to the sphere. The plane consists of the center position $\mathbf{c}$ and two orthogonal unit vectors $\mathbf{u}$ and $\mathbf{v}$, which respectively become the center of the perspective projection image and its horizontal and vertical axes. In this study, $\mathbf{c}$, $\mathbf{u}$, and $\mathbf{v}$ are defined so that the upward direction of the image becomes that of the real world, as follows:

$$\mathbf{c} = \begin{pmatrix} \sin\phi_c\cos\theta_c \\ \sin\phi_c\sin\theta_c \\ \cos\phi_c \end{pmatrix}, \quad
\mathbf{u} = \begin{pmatrix} -\sin\theta_c \\ \cos\theta_c \\ 0 \end{pmatrix}, \quad
\mathbf{v} = \begin{pmatrix} -\cos\phi_c\cos\theta_c \\ -\cos\phi_c\sin\theta_c \\ \sin\phi_c \end{pmatrix} \qquad (3)$$

where $\theta_c$ and $\phi_c$ are the angles around and from the Z-axis connecting the top and bottom of the sphere, respectively.


Figure 5 shows the relationship between a coordinate in a perspective projection image and the corresponding 3D position on the plane. The left and right figures in Fig. 5 represent a perspective projection image with height $H_p$ and width $W_p$, and a 3D plane with height $H_3$ and width $W_3$, respectively. In this case, the 3D position $\mathbf{p}$ on the plane is represented using the parameters $(s, t)$ as follows:

$$\mathbf{p}(s, t) = \mathbf{c} + s\,\mathbf{u} + t\,\mathbf{v}. \qquad (4)$$

From Fig. 5 and Eq. (4), the relationship between the coordinate $(x_p, y_p)$ in the perspective projection image and $(s, t)$ is expressed by a translation and scaling matrix as follows:

$$\begin{pmatrix} s \\ t \\ 1 \end{pmatrix} =
\begin{pmatrix} W_3/W_p & 0 & -W_3/2 \\ 0 & -H_3/H_p & H_3/2 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x_p \\ y_p \\ 1 \end{pmatrix}. \qquad (5)$$

3.4.3 Relationship between 3D Positions of Sphere and Plane through which the Same Ray Passes
Figure 6 shows the relationship between the points on the sphere and on the plane through which the same ray passes. Since the line passing through a point $\mathbf{p}_s$ on the sphere and the center of the sphere intersects the plane, the relationship can be expressed with a scalar parameter $k$ as follows:

$$\mathbf{p}(s, t) = k\,\mathbf{p}_s. \qquad (6)$$

From Eqs. (2), (5), and (6), each coordinate in the perspective projection image can be mapped to a coordinate in the omni-directional image to generate the perspective projection image. Here, it is necessary to determine the center $\mathbf{c}$ of the plane and the lengths $W_3$ and $H_3$ of its edges, which correspond to the center and the angle of view of the perspective projection image, so that all the photographed participants are within the image.
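Putting Eqs. (2)-(6) together, the remapping from an equirectangular frame to a perspective projection image can be sketched as follows. The symbol names follow the reconstruction above; the output resolution, the sign conventions, and the use of OpenCV's remap are assumptions of this sketch rather than details specified in the paper.

```python
import cv2
import numpy as np

def plane_basis(theta_c, phi_c):
    """Center c and unit axes u, v of the tangent plane (Eq. (3))."""
    c = np.array([np.sin(phi_c) * np.cos(theta_c),
                  np.sin(phi_c) * np.sin(theta_c),
                  np.cos(phi_c)])
    u = np.array([-np.sin(theta_c), np.cos(theta_c), 0.0])
    v = np.array([-np.cos(phi_c) * np.cos(theta_c),
                  -np.cos(phi_c) * np.sin(theta_c),
                  np.sin(phi_c)])
    return c, u, v

def to_perspective(equi_img, theta_c, phi_c, W3, H3, Wp=1280, Hp=960):
    """Render a Wp x Hp perspective image from an equirectangular image."""
    He, We = equi_img.shape[:2]
    c, u, v = plane_basis(theta_c, phi_c)

    # Eq. (5): output pixel (xp, yp) -> plane parameters (s, t).
    xp, yp = np.meshgrid(np.arange(Wp), np.arange(Hp))
    s = (xp + 0.5) * W3 / Wp - W3 / 2.0
    t = H3 / 2.0 - (yp + 0.5) * H3 / Hp

    # Eq. (4): 3D point on the plane; normalising it onto the unit sphere
    # inverts Eq. (6), because the plane and sphere points share the same ray.
    p = c[None, None, :] + s[..., None] * u[None, None, :] + t[..., None] * v[None, None, :]
    p = p / np.linalg.norm(p, axis=2, keepdims=True)

    # Inverse of Eq. (2): sphere point -> angles -> equirectangular pixel.
    theta = np.arctan2(p[..., 1], p[..., 0]) % (2.0 * np.pi)
    phi = np.arccos(np.clip(p[..., 2], -1.0, 1.0))
    map_x = (We * theta / (2.0 * np.pi)).astype(np.float32)
    map_y = (He * phi / np.pi).astype(np.float32)

    return cv2.remap(equi_img, map_x, map_y, cv2.INTER_LINEAR)
```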
3.4.4 Determination of the Left-most, Right-most, Top-most, and Bottom-most Faces
This study deals with omni-directional images in which the left and right edges of the images are connected. For this reason, the left and right positions of the faces in the omni-directional image do not necessarily correspond to their actual positions. Therefore, the left-most and right-most faces that appear in the perspective projection image are determined.
For each face, the proposed method calculates the distance to the face farthest from it in the positive direction of the x-axis of the omni-directional image, allowing wrap-around at the image edge. This is done for all faces, and the pair with the shortest such distance is extracted as the left-most and right-most faces. Figure 7 illustrates an example in which A, B, and C represent three faces. First, if we consider A to be the left-most face and look for the face farthest from A in the positive direction of the x-axis, we find C and calculate the distance between A and C. The same process is performed when B and C are considered the left-most face. Next, the method finds the shortest of these distances. In the case of Fig. 7, B and A are the left-most and right-most faces, respectively. As for the top-most and bottom-most faces, we simply find the faces with the smallest and largest y-coordinates.
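A minimal sketch of this search is shown below, assuming that each face is represented by the x-coordinate of its bounding-box center in an equirectangular image of width We, and that distances along the positive x direction wrap around at the image edge. The example coordinates are made up.

```python
def leftmost_rightmost(face_xs, We):
    """Indices of the left-most and right-most faces on a cyclic x-axis.

    face_xs: x-coordinates of the bounding-box centers.
    We:      width of the equirectangular image (x wraps around at We).
    """
    def forward_dist(a, b):
        # Distance from a to b measured only in the positive x direction.
        return (b - a) % We

    best = None
    for i, xi in enumerate(face_xs):
        # Farthest face from xi in the positive x direction ...
        j = max(range(len(face_xs)), key=lambda k: forward_dist(xi, face_xs[k]))
        d = forward_dist(xi, face_xs[j])
        # ... keep the candidate pair whose forward span is the shortest.
        if best is None or d < best[0]:
            best = (d, i, j)
    _, left, right = best
    return left, right

# Three faces straddling the image seam (made-up coordinates):
# the face at x=3500 is left-most and the face at x=100 is right-most.
print(leftmost_rightmost([100, 3700, 3500], We=3840))   # -> (2, 0)
```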

3.4.5 Calculation of Center on Perspective Projection Image
The proposed method calculates the 3D coordinate $\mathbf{c}$ on the plane corresponding to the center of the perspective projection image. Specifically, we obtain the mean of the x-coordinates of the bounding-box centers of the left-most and right-most faces, and the mean of the y-coordinates of the bounding-box centers of the top-most and bottom-most faces. Note that the mean of the x-coordinates is calculated considering only the positive direction of the x-axis, i.e., taking the wrap-around at the image edge into account. Using Eq. (2), we transform these x- and y-coordinates into the angles $\theta_c$ and $\phi_c$ to obtain the 3D position $\mathbf{c}$.
3.4.6 Calculation of Angle of View of Perspective Projection Image
Since the vertical and horizontal lengths $H_3$ and $W_3$ of the 3D plane shown in Fig. 5 determine the angle of view of the perspective projection image, they are calculated from the coordinates of the faces. Specifically, the parameters $s$ and $t$ are obtained for the coordinates of the corners of all bounding boxes using Eqs. (2) and (6), and their maximum absolute values $s_{\max}$ and $t_{\max}$ are obtained. Since $\mathbf{u}$ and $\mathbf{v}$ are unit vectors, doubling $s_{\max}$ and $t_{\max}$ yields an angle of view that includes all the faces. However, as shown in Fig. 1(a), the bounding box does not include the hair area. Therefore, by adding an offset $o$, $W_3$ and $H_3$ are determined as follows:

$$W_3 = 2\,(s_{\max} + o), \qquad H_3 = 2\,(t_{\max} + o). \qquad (7)$$
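The determination of the view parameters in Sections 3.4.5 and 3.4.6 can be sketched as follows, reusing `plane_basis` from the remapping sketch above and the offset $o = 0.3$ from Table 2. The wrap-aware mean of the x-coordinates and the per-corner projection follow the description above; the function and variable names are assumptions.

```python
import numpy as np

OFFSET = 0.3   # offset o in Eq. (7) (Table 2)

def pixel_to_angles(x, y, We, He):
    """Inverse of Eq. (2): equirectangular pixel -> angles (theta, phi)."""
    return 2.0 * np.pi * x / We, np.pi * y / He

def view_parameters(boxes, left, right, top, bottom, We, He):
    """Plane center angles (theta_c, phi_c) and plane size (W3, H3) via Eq. (7).

    boxes: list of (x, y, w, h); left/right/top/bottom are indices into it.
    plane_basis() is the function from the remapping sketch above (Eq. (3)).
    """
    cx = [x + w / 2.0 for x, y, w, h in boxes]
    cy = [y + h / 2.0 for x, y, w, h in boxes]

    # Center x: midpoint from the left-most to the right-most face measured
    # only in the positive x direction (wrap-around aware), Sec. 3.4.5.
    span = (cx[right] - cx[left]) % We
    center_x = (cx[left] + span / 2.0) % We
    center_y = (cy[top] + cy[bottom]) / 2.0
    theta_c, phi_c = pixel_to_angles(center_x, center_y, We, He)

    c, u, v = plane_basis(theta_c, phi_c)

    # Project every bounding-box corner onto the plane and take the maximum
    # |s| and |t| components (Sec. 3.4.6).
    s_max = t_max = 0.0
    for x, y, w, h in boxes:
        for px, py in [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]:
            th, ph = pixel_to_angles(px, py, We, He)
            p_s = np.array([np.sin(ph) * np.cos(th),
                            np.sin(ph) * np.sin(th),
                            np.cos(ph)])
            p = p_s / np.dot(p_s, c)              # Eq. (6): scale the ray onto the plane
            s_max = max(s_max, abs(np.dot(p - c, u)))
            t_max = max(t_max, abs(np.dot(p - c, v)))

    W3 = 2.0 * (s_max + OFFSET)                    # Eq. (7)
    H3 = 2.0 * (t_max + OFFSET)
    return theta_c, phi_c, W3, H3
```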
4 Experiments and Discussions
To demonstrate the effectiveness of the proposed method, we performed comparison experiments with a baseline method. The baseline method uses the default settings of MTCNN, i.e., the minimum length of one side of the bounding box for face detection is 20 pixels and no maximum length is set. Instead of eliminating false detections and interpolating undetected faces, the baseline method simply keeps the frames in which the number of detected faces equals the mode over all frames, and extracts from these frames the one with the highest $H$.
For the experiments, we used five types of omni-directional videos with different conditions, as shown in Table 1, and performed three types of validation. Experiment 1 is for validation of the happiness values, Experiments 2 and 3 are for validation of false detection elimination, and Experiments 4 and 5 are for validation of interpolation of undetected faces. These experiments were conducted with sighted people, not visually impaired people, and we empirically set the parameters of the proposed method as shown in Table 2.
Table 3 shows the happiness values $H$ of the perspective projection images generated by the two methods. Table 4 shows the detection results of the proposed method, and Table 5 shows those of the baseline method. In the following, we discuss the three types of validation in turn.
Table 1. Conditions of the omni-directional videos used in the experiments.

Expt. | No. of participants | No. of unrelated persons
---|---|---
1 | 6 | 5
2 | 3 | 0
3 | 2 | 1
4 | 5 | 0
5 | 7 | 4
Table 2. Parameters of the proposed method.

Parameter | Value
---|---
Minimum length of bounding box | 50 pixels
Maximum length of bounding box | 250 pixels
Threshold for eliminating false detections | 25 %
Coefficient $\alpha$ for happiness value | 1
Offset $o$ for perspective projection image | 0.3
Table 3. Happiness values $H$ of the frames extracted by the proposed and baseline methods.

Expt. | Proposed | Baseline
---|---|---
1 | 0.458 | 0.400
2 | 0.895 | 0.648
3 | 0.309 | 0.155
4 | 0.483 | 0.612
5 | 0.402 | 0.491
Table 4. Detection results of the proposed method: (a) total number of frames, (b) number of remaining classes (equal to the number of participants), (c) number of classes obtained by mean shift clustering, (d) number of frames in which false detections were eliminated, (e) number of frames in which undetected faces were interpolated.

Expt. | (a) | (b) | (c) | (d) | (e)
---|---|---|---|---|---
1 | 51 | 6 | 7 | 1 | 32
2 | 143 | 3 | 7 | 32 | 10
3 | 152 | 2 | 2 | 0 | 0
4 | 41 | 5 | 5 | 0 | 32
5 | 35 | 7 | 8 | 1 | 30
Table 5. Detection results of the baseline method: (a) total number of frames, (b) mode of the number of faces detected per frame, (c) number of frames whose detection count equals the mode, (d) ratio of (c) to (a), (e) number of frames containing false detections, (f) number of frames containing undetected faces, (g) number of frames in which all the participants were detected, (h) ratio of (g) to (a).

Expt. | (a) | (b) | (c) | (d) | (e) | (f) | (g) | (h)
---|---|---|---|---|---|---|---|---
1 | 51 | 6 | 19 | 0.372 | 31 | 38 | 5 | 0.098
2 | 143 | 4 | 52 | 0.363 | 101 | 2 | 41 | 0.286
3 | 152 | 3 | 108 | 0.710 | 147 | 2 | 5 | 0.032
4 | 41 | 4 | 25 | 0.609 | 1 | 39 | 2 | 0.048
5 | 35 | 6 | 13 | 0.371 | 34 | 32 | 0 | 0.000
4.1 Validation on Happiness Values
In Experiment 1, using a scene in which six people are the photographed participants and four people are unrelated, we validate whether the proposed method can extract a frame with a higher $H$ than the baseline method. Figures 8(a) and (b) show the omni-directional frames with the highest $H$ extracted by the proposed and baseline methods, respectively. Figures 8(d) and (e) are the perspective projection images generated from Figs. 8(a) and (b); the $H$ values of the images by the proposed and baseline methods are 0.458 and 0.400, respectively, indicating that the proposed method is 0.058 higher. This is because the proposed method was able to compute $H$ for all the frames by setting the lower and upper limits of the bounding-box size, eliminating false detections, and interpolating undetected participants. On the other hand, for the baseline method, only 37.2% of all frames had the mode number of detections, as shown in Table 5. For example, as shown in Figs. 8(c) and (f), in the frame extracted by the proposed method, an unrelated person in the distance was also detected by the baseline method, and this frame was therefore not extracted because the number of detections was seven.
4.2 Validation on False Detection Elimination
In Experiments 2 and 3, we validate whether the elimination of false detections by the proposed method is effective. First, in Experiment 2, although the number of photographed participants is three, the mode obtained by the baseline method is four, as shown in Table 5. As a result, the baseline method extracted the frame with the highest $H$ (Fig. 9(b)), which contains the falsely detected car shown on the right side of Fig. 9(c), and generated the perspective projection image with an unnecessarily wide angle of view shown in Fig. 9(e). Moreover, $H$ is calculated as if the car were a face; such a happiness value is not appropriate for selecting a frame.
In contrast, the results of the proposed method shown in Fig. 9(a), the left figure in Fig. 9(c), and Fig. 9(d) show that the false detection of the car is eliminated and the perspective projection image contains only the participants. Specifically, in the procedure of the proposed method, the mean shift clustering classified the bounding boxes into seven classes, and the threshold on the detection ratio eliminated four classes of false detections, which occurred in 32 frames, as shown in Table 4. As a result, three classes remained, which equals the number of participants.
Next, we describe Experiment 3, in which the number of photographed participants is two and the number of unrelated persons is one. The baseline method extracted the frame with the highest $H$ (Fig. 10(b)), in which an unrelated person about four meters away from the camera was detected, as shown on the right side of Fig. 10(c). In this case, the angle of view of the perspective projection image that includes the three people becomes considerably larger, and the texture near the edges of the image is significantly stretched, as shown in Fig. 10(e), which impairs visibility.
On the other hand, the results of the proposed method shown in Figs. 10(a) and 10(d) and the left of Fig. 10(c) show that the false detection was eliminated and that only the target participants appear in the perspective projection image. Since the elimination of this false detection is due to the minimum bounding-box size, we confirmed that this setting is effective. From these results, we can say that the proposed elimination of false detections by mean shift clustering is more effective than the baseline method using the mode for generating perspective projection images in which only the photographed participants appear.
4.3 Validation on Interpolation of Undetected Faces
In Experiments 4 and 5, we validate whether the interpolation of undetected faces is effective. First, let us start with Experiment 4, in which the number of participants is five. The baseline method extracted the frame with the highest $H$ (Fig. 11(b)) and output Fig. 11(e), which excludes the undetected persons shown on the right side of Fig. 11(c). In this scene, the detection accuracy of MTCNN for some participants was poor, and the mode throughout all the frames is four, as shown in Table 5. Moreover, all five participants were detected in only two frames, indicating that the baseline method cannot appropriately extract the frame with the highest $H$ for all the participants.
In contrast, the results of the proposed method shown in Figs. 11(a) and (d) and the left of Fig. 11(c) show that the proposed method interpolated the undetected faces. As shown in Table 4, although the proposed method still failed to detect faces in 32 frames, since each of the five participants was detected in more than 25% of all the frames, the mean position and size of the bounding boxes of each person could be used to interpolate the undetected faces.
Next, we describe Experiment 5, in which the number of photographed participants is seven and the number of unrelated persons is four. The baseline method extracted the frame with the highest $H$ (Fig. 12(b)), which has one false detection and two undetected faces, as shown in the three images on the right side of Fig. 12(d). In this case, the mode is six, as shown in Table 5, which indicates that the persons detected in each frame differ depending on false detections and undetections. In addition, since the falsely detected person was located far away from the target participants, the calculated $W_3$ and $H_3$, i.e., the angle of view, became large, resulting in a failure to generate an appropriately visible image.
In contrast, the results of the proposed method shown in Figs. 12(a) and (c) and the left three images in Fig. 12(d) show that the false detection was eliminated and the undetected persons were interpolated. As a result, only the participants appear in the perspective projection image. As shown in Table 4, even though some participants were undetected in 30 frames, the mean shift clustering determined a number of classes appropriate for the number of participants, and all undetected faces were interpolated. From these results, we confirmed that the proposed interpolation of undetected faces is effective because it allows $H$ to be calculated for all the photographed participants in all the frames.
5 Conclusion
In this study, we proposed a method that enables visually impaired people to take selfies with multiple people using an omni-directional camera. The proposed method succeeded in eliminating false detections and interpolating undetected faces, extracting the frame with the happiest expressions of all the photographed participants from all the frames, and generating a perspective projection image in which all the participants are present. In the experiments, we compared the proposed method with a baseline method to demonstrate its effectiveness. Future work includes the determination of optimal parameters for further improving the elimination of false detections and the interpolation of undetected faces.
References
- [1] Be my eyes. https://www.bemyeyes.com/, (Accessed on 24/11/2023)
- [2] Envision ai. https://apps.apple.com/jp/app/envision-ai/id1268632314, (Accessed on 24/11/2023)
- [3] Orcam myeye 2. https://www.orcam.com/en/myeye2/, (Accessed on 24/11/2023)
- [4] Taptapsee. https://taptapseeapp.com/, (Accessed on 24/11/2023)
- [5] Balata, J., Mikovec, Z., Neoproud, L.: Blindcamera: Central and golden-ratio composition for blind photographers. In: Proceedings of the Multimedia, Interaction, Design and Innovation. pp. 1–8 (2015)
- [6] Bigham, J.P., Jayant, C., Ji, H., Little, G., Miller, A., Miller, R.C., Miller, R., Tatarowicz, A., White, B., White, S., et al.: Vizwiz: nearly real-time answers to visual questions. In: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. pp. 333–342 (2010)
- [7] Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603–619 (2002)
- [8] Cutter, M.P., Manduchi, R.: Towards mobile ocr: How to take a good picture of a document without sight. In: Proceedings of the 2015 ACM Symposium on Document Engineering. pp. 75–84 (2015)
- [9] Gonzalez Penuela, R.E., Vermette, P., Yan, Z., Zhang, C., Vertanen, K., Azenkot, S.: Understanding how people with visual impairments take selfies: Experiences and challenges. In: Proceedings of the 24th International ACM SIGACCESS Conference on Computers and Accessibility. pp. 1–4 (2022)
- [10] Iwamura, M., Hirabayashi, N., Cheng, Z., Minatani, K., Kise, K.: Visphoto: photography for people with visual impairment as post-production of omni-directional camera image. In: Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems. pp. 1–9 (2020)
- [11] Jayant, C., Ji, H., White, S., Bigham, J.P.: Supporting blind photography. In: Proceedings of the 13th International ACM SIGACCESS Conference on Computers and Accessibility. pp. 203–210 (2011)
- [12] Jokela, T., Ojala, J., Väänänen, K.: How people use 360-degree cameras. In: Proceedings of the 18th International Conference on Mobile and Ubiquitous Multimedia. pp. 1–10 (2019)
- [13] Kacorri, H., Kitani, K.M., Bigham, J.P., Asakawa, C.: People with visual impairment training personal object recognizers: Feasibility and challenges. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. pp. 5839–5849 (2017)
- [14] Kutiyanawala, A., Kulyukin, V., Nicholson, J.: Teleassistance in accessible shopping for the blind. In: Proceedings on the International Conference on Internet Computing. p. 1 (2011)
- [15] Meijer, P.: The voice - new frontiers in sensory substitution. https://www.seeingwithsound.com/, (Accessed on 24/11/2023)
- [16] Obata, K., Nakamura, Y., Chen, L., Augeri, J.: Asynchronous co-eating through video message exchange: Support for making video messages. In: Proceedings of Cross-Cultural Design. Applications in Health, Learning, Communication, and Creativity. pp. 338–348 (2020)
- [17] Shenk, J., CG, A., Arriaga, O., Owlwasrowk: justinshenk/fer: Zenodo (Sep 2021), https://doi.org/10.5281/zenodo.5362356, (Accessed on 18/7/2023)
- [18] Vázquez, M., Steinfeld, A.: An assisted photography framework to help visually impaired users properly aim a camera. ACM Transactions on Computer-Human Interaction 21(5), 1–29 (2014)
- [19] White, S., Ji, H., Bigham, J.P.: Easysnap: real-time audio feedback for blind photography. In: Adjunct Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. pp. 409–410 (2010)
- [20] Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23(10), 1499–1503 (2016)