
Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

Zheng-Yan Sheng [email protected] University of Science and Technology of China Yang Ai [email protected] University of Science and Technology of China Yan-Nian Chen [email protected] University of Science and Technology of China  and  Zhen-Hua Ling [email protected] University of Science and Technology of China
(2023)
Abstract.

This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any source speaker to a newly coming target speaker, solely relying on a single face image of the target speaker. To address this task, we propose a face-voice memory-based zero-shot FaceVC method. This method leverages a memory-based face-voice alignment module, in which slots act as the bridge to align these two modalities, allowing for the capture of voice characteristics from face images. A mixed supervision strategy is also introduced to mitigate the long-standing issue of the inconsistency between training and inference phases for voice conversion tasks. To obtain speaker-independent content-related representations, we transfer the knowledge from a pretrained zero-shot voice conversion model to our zero-shot FaceVC model. Considering the differences between FaceVC and traditional voice conversion tasks, systematic subjective and objective metrics are designed to thoroughly evaluate the homogeneity, diversity and consistency of voice characteristics controlled by face images. Through extensive experiments, we demonstrate the superiority of our proposed method on the zero-shot FaceVC task. Samples are presented on our demo website. Source code and audio samples are available at https://levent9.github.io/ZeroshotFaceVC-demo/.

voice conversion, zero-shot, face-voice alignment
journalyear: 2023; copyright: acmlicensed; booktitle: Proceedings of the 31st ACM International Conference on Multimedia (MM '23), October 29-November 3, 2023, Ottawa, ON, Canada; price: 15.00; doi: 10.1145/3581783.3613825; isbn: 979-8-4007-0108-5/23/10; ccs: Information systems, Multimedia content creation

1. Introduction

Voice conversion (VC) aims to convert the voice characteristics of a source speaker to a target speaker while keeping the linguistic content unchanged (Mohammadi and Kain, 2017; Sisman et al., 2020). It has potential applications in various fields, such as communication aids for the speech-impaired (Veaux et al., 2013), speaker de-identification (Srivastava et al., 2020), and dubbing (Gan et al., 2022). Zero-shot VC (Qian et al., 2019; Yuan et al., 2021) is a specific VC task, which allows for the conversion of voices from any source speakers to a newly coming (i.e., unseen in the training data) target speaker using only one reference utterance from the target speaker. Zero-shot VC has gained much research attention in recent years considering its flexibility and its reduced dependency on the amount of training data from target speakers.

In addition to voice, face image is another modality that also carries information about individual identities. Some speaker characteristics, such as age and gender, can be inferred from both voices and face images. Previous studies (Mavica and Barenholtz, 2013; Kamachi et al., 2003; Smith et al., 2016) have presented evidence that people can match an unfamiliar voice to a static image of the corresponding face with greater-than-chance accuracy.

Therefore, this paper presents a novel VC task named zero-shot voice conversion based on face images (i.e., zero-shot FaceVC). Instead of using reference utterances, the task leverages a single face image from the unseen target speaker to convert the utterances from any source speakers. The objective of this task is to explore to what extent facial properties can serve as indicators of voice characteristics. If it is feasible to infer suitable voice characteristics from the face image of an unseen speaker, zero-shot FaceVC holds immense potential in various applications. For instance, an editable virtual face can be used to produce a personalized voice for virtual anchors, and automatic movie dubbing can generate voices more consistent with characters' appearance.

To the best of our knowledge, there are only two existing studies (Kameoka et al., 2019b; Lu et al., 2021) on voice conversion based on face images (i.e., FaceVC). Different from the zero-shot scenario considered in this paper, both of them explored the task of many-to-many FaceVC, where both source and target speakers were seen in the training set. In one study (Kameoka et al., 2019b), the speaker embeddings utilized in the original VC model were replaced with face embeddings, which were estimated by reconstructing face images. Another study (Lu et al., 2021) employed a three-stage training strategy, including face-voice reparameterization and facial-to-audio transformation, to achieve better performance.

The most critical challenge in FaceVC is face-voice alignment, i.e., to derive corresponding voice representations given face representations. In previous studies, face representations were estimated by either relying on the supervision of mel-spectrum reconstruction (Kameoka et al., 2019b) or minimizing the mean square error (MSE) between speaker embeddings and face embeddings (Lu et al., 2021). The former strategy (Kameoka et al., 2019b) cannot be extended to zero-shot FaceVC since the recordings of target speakers are unavailable. The latter (Lu et al., 2021) adopted the simple MSE loss, assuming that the distribution of voice representations given face representations was unimodal. Both approaches fail to describe the complex mapping between the voice and face spaces, and cannot fulfill the requirements of the zero-shot FaceVC task.

Therefore, this paper proposes a face-voice memory-based zero-shot FaceVC (FVMVC) method. In this method, a memory-based face-voice alignment (MFVA) module is developed that utilizes trainable slots to quantize the common characteristics between face and voice spaces. At the training stage, the slot values in MFVA are optimized by not only minimizing the reconstruction loss of speaker embeddings, but also reducing the Kullback-Leibler divergence between the slot weight distributions in both spaces. At the inference stage, given a face image of an unseen target speaker, a recalled face embedding is calculated using the slot weights estimated from the reference image and the slot values in the voice space.

In addition, zero-shot VC usually adopts an auto-encoder framework (Qian et al., 2019; Lian et al., 2022; Yuan et al., 2021; Wang et al., 2021), which suffers from the inconsistency between the training and inference phases. More specifically, the speaker representations and content representations are from the same speaker at the training stage, while they are from different speakers at the conversion stage. To mitigate this problem for zero-shot FaceVC, we propose a mixed supervision strategy, introducing a simple yet effective inter-speaker supervision in addition to the intra-speaker supervision in traditional auto-encoder frameworks. The inter-speaker supervision is achieved by creating pseudo-parallel training data using the speaker embeddings extracted from the recordings of another speaker in the training set. Besides, in order to obtain speaker-independent content representations, we initially pretrain a zero-shot VC model and transfer the knowledge from zero-shot VC to zero-shot FaceVC.

We note that it is generally impossible to recover the exact voice of a target speaker from face images alone. Instead, we focus on three properties of the speech generated by zero-shot FaceVC. The first one is the homogeneity among the voice characteristics of the speech converted using different face images of the same target speaker. The second is the diversity of the voice characteristics converted using the face images of different target speakers. And the third is the consistency between the voice characteristics of the converted utterances and their corresponding face images in some important aspects, e.g., gender. Therefore, a series of subjective and objective metrics are designed in this paper to evaluate these properties.

In summary, our main contributions are as follows. First, we propose a new task named zero-shot voice conversion based on face images (zero-shot FaceVC). Second, we propose a face-voice memory-based zero-shot FaceVC (FVMVC) method for this task, which contains a memory-based face-voice alignment module, a mixed supervision strategy and zero-shot VC pretraining. Third, we design a series of metrics to evaluate the proposed task and conduct extensive experiments to demonstrate the effectiveness of our proposed method.

2. Related Work

2.1. Voice Conversion

VC is a task that automatically converts the speech of a source speaker so that it sounds as if spoken by a target speaker while preserving the linguistic content (Mohammadi and Kain, 2017; Sisman et al., 2020). VC methods can be divided into parallel and non-parallel settings. Since parallel data are not always available, many non-parallel VC techniques have been proposed, including the methods based on variational auto-encoder (VAE) (Kameoka et al., 2019a; Saito et al., 2018), generative adversarial network (GAN) (Kaneko and Kameoka, 2018; Kameoka et al., 2018; Wang et al., 2020; Lee et al., 2021), recognition-synthesis (Chen et al., 2022; Mohammadi and Kim, 2019; Saito et al., 2018) and disentanglement (Zhang et al., 2019).

As a special case of non-parallel VC, zero-shot VC has attracted widespread attention in recent years. Zero-shot VC methods usually follow auto-encoder frameworks, where the encoder extracts content and speaker representations from speech respectively, and the decoder reconstructs speech by combining the above representations. Hence, speech representation disentanglement is crucial for this task (Yang et al., 2022a; Wang et al., 2021). Recently, several zero-shot VC methods (Yuan et al., 2021; Wang et al., 2021; Yang et al., 2022a) based on information theory have emerged, with the aim of disentangling the content-related and speaker identity-related information. IDE-VC (Yuan et al., 2021) employed mutual information (MI) with speaker labels as supervision for disentanglement. VQMIVC (Wang et al., 2021) combined vector quantization with contrastive predictive coding (VQCPC) (van Niekerk et al., 2020; Baevski et al., 2019) and MI for fully unsupervised training.

Figure 1. The overall flowchart of our proposed FVMVC, where Rec. Loss denotes the reconstruction loss. During the training phase, two pairs of utterances and the corresponding face images from speaker A and speaker B are used for training simultaneously. Speaker A is used for intra-speaker training and serves as the source speaker for inter-speaker training, while speaker B serves as the target speaker for inter-speaker training.

2.2. Learning Voice-Face Association

In recent years, learning the voice-face association has aroused the interest of researchers. As face and voice are inherently correlated, various cross-modal generation tasks involving both face and voice have been proposed, in addition to voice conversion supported by face images. Examples of such tasks include generating the talking face video from the audio (Park et al., 2022; Zhou et al., 2019; Chen et al., 2019; Song et al., 2018; Zhou et al., 2021), synthesizing speech from the talking face images (Prajwal et al., 2020; Kim et al., 2021; Wang et al., 2022b), reconstructing the face image from the corresponding voice (Oh et al., 2019; Wu et al., 2022a; Choi et al., 2020; Wen et al., 2019; Bai et al., 2022) and synthesizing the speaker’s voice with a face image during text-to-speech (FaceTTS) (Goto et al., 2020; Yang et al., 2023; Plüster et al., 2021; Wang et al., 2022a; Wu et al., 2022b; Yang et al., 2022b).

The most relevant task to voice conversion based on face images is FaceTTS, as they both utilize face images to extract speaker identities for controlling voice characteristics. As far as we know, Face2Speech (Goto et al., 2020) was the first work to address FaceTTS, which pretrained a face encoder with the supervised generalized end-to-end (GE2E) loss (Wan et al., 2018) and then replaced the speaker encoder with the face encoder in a multi-speaker TTS model. Following the Face2Speech framework, more elaborate model structures and training strategies (Wang et al., 2022a; Wu et al., 2022b; Plüster et al., 2021) have been proposed to promote the quality of synthetic speech. Recently, 3D face shapes and refined face attributes have also been utilized to generate speech (Yang et al., 2023, 2022b), which provided a referable approach to voice editing. However, the voice-face alignment in FaceTTS has yet to be explored. This paper elaborately designs a memory-based module for the alignment between these two modalities for zero-shot FaceVC, which can also be inserted into the FaceTTS framework for voice control.

3. METHOD

As shown in Figure 1, our proposed FVMVC follows the standard auto-encoder paradigm, consisting of a content encoder, a speaker encoder, a face encoder, a pitch extractor, a decoder, and a memory-based face-voice alignment (MFVA) module.

During the inference phase, our proposed FVMVC utilizes three inputs: a face image of the target speaker, and the waveform and mel-spectrograms of the source-speaker utterance to be converted. By processing the face image through the face encoder and the MFVA module sequentially, the voice characteristics representation based on the face image (i.e., the recalled face embedding) is obtained. Similar to zero-shot VC, the mel-spectrograms of the source speaker provide a speaker-independent content representation, while the waveform is used for extracting normalized fundamental frequencies. All the representations mentioned above are ultimately sent into the decoder, which generates the mel-spectrograms of the converted utterance. These mel-spectrograms are then converted to a waveform through the vocoder. During the training phase, we incorporate speaker embeddings, which are extracted from mel-spectrograms via the speaker encoder, to supervise the training of the MFVA module.
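For concreteness, the inference data flow described above can be summarized by the following minimal sketch; all module names and call signatures are illustrative assumptions rather than the authors' released code.

```python
def convert_with_face(face_image, src_mel, src_wave, modules):
    """A minimal sketch of zero-shot FaceVC inference (assumed interfaces)."""
    face_enc, mfva, content_enc, pitch_ext, decoder, vocoder = modules

    # Face branch: face image -> face embedding -> recalled face embedding.
    h = face_enc(face_image)                 # face embedding
    h_recalled = mfva.recall_from_face(h)    # recalled face embedding in the voice space

    # Source-speech branch: speaker-independent content and normalized F0.
    c = content_enc(src_mel)                 # content representation
    f0 = pitch_ext(src_wave)                 # z-normalized fundamental frequency

    # Decode to mel-spectrograms, then synthesize the waveform with the vocoder.
    mel_converted = decoder(c, h_recalled, f0)
    return vocoder(mel_converted)
```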

3.1. Memory-based Face-Voice Alignment

The alignment between face and voice is intended to retrieve the corresponding speaker embedding when only a face image is available. The retrieved speaker embedding from a face image is referred to as the recalled face embedding. We introduce the MFVA module to improve the modeling of voice-face alignment, thereby promoting the performance of zero-shot FaceVC.

Figure 2. The architecture of the MFVA module, where Rec. Loss and KL Loss represent the reconstruction loss and the Kullback-Leibler divergence loss, respectively.

As shown in Figure 2, during the training phase, the MFVA module takes a pair of face embedding $\bm{h}\in\mathbb{R}^{D}$ and speaker embedding $\bm{s}\in\mathbb{R}^{D}$ as input, and generates a recalled face embedding $\hat{\bm{h}}\in\mathbb{R}^{D}$ for voice control, where the face embedding $\bm{h}$ and the speaker embedding $\bm{s}$ are extracted by the face encoder and the speaker encoder respectively, and $D$ represents the dimension of the projected face or speaker embedding. MFVA is composed of a voice-value memory $\bm{M}_{voice}=[\bm{m}_{v}^{1},\bm{m}_{v}^{2},\cdots,\bm{m}_{v}^{N}]^{\intercal}\in\mathbb{R}^{N\times D}$ and a face-key memory $\bm{M}_{face}=[\bm{m}_{f}^{1},\bm{m}_{f}^{2},\cdots,\bm{m}_{f}^{N}]^{\intercal}\in\mathbb{R}^{N\times D}$, where $N$ denotes the number of slots and $D$ is the dimension of each slot, which equals the dimension of the projected speaker or face embedding. The training of the MFVA module has two objectives: (1) storing sufficient voice characteristics information in the voice-value memory, and (2) minimizing the distance between the distributions of the two modalities.

The sufficiency of the voice characteristics information. The voice-value memory $\bm{M}_{voice}$ is made up of a bank of trainable slots $\{\bm{m}_{v}^{i}\}_{i=1}^{N}$, where $\bm{m}_{v}^{i}\in\mathbb{R}^{D}$ is the $i$-th slot. The voice-value memory is designed to exclusively capture voice-related information and is expected to be able to represent any voice. Specifically, when we take a speaker embedding $\bm{s}$ as the query, the attention weight between the query and each slot is computed with cosine similarity followed by softmax normalization as follows,

(1)  w_{v}^{i}=\mathrm{softmax}\left(\frac{\bm{s}^{\intercal}\bm{m}_{v}^{i}}{\|\bm{s}\|_{2}\,\|\bm{m}_{v}^{i}\|_{2}}\right),

where $w_{v}^{i}$ represents the degree of relevance between the $i$-th slot $\bm{m}_{v}^{i}$ and the speaker embedding $\bm{s}$. Then we can obtain the attention weight vector $\bm{w}_{voice}=[w_{v}^{1},w_{v}^{2},\cdots,w_{v}^{N}]\in\mathbb{R}^{N}$ by computing the attention weights with all slots. In the end, we reconstruct the speaker embedding from all slots,

(2)  \hat{\bm{s}}=\bm{M}_{voice}^{\intercal}\bm{w}_{voice},

and minimize the MSE between the input speaker embedding $\bm{s}$ and the recalled speaker embedding $\hat{\bm{s}}$,

(3)  \mathcal{L}_{store}=\|\bm{s}-\hat{\bm{s}}\|_{2}^{2}.

In this way, the slots in the voice-value memory can be used as basis vectors for building the voice characteristics space, and various combinations of the slots can represent arbitrary voices.

The alignment between the voice and face spaces. We utilize the slots as a streamlined bridge to map the face embedding onto the voice space. In detail, given the face embedding $\bm{h}$, we generate the recalled face embedding $\hat{\bm{h}}$ in a similar way to the recalled speaker embedding, where the attention weights are calculated with the face-key memory $\bm{M}_{face}$ and applied to the slots in the voice-value memory $\bm{M}_{voice}$ as follows,

(4)  w_{f}^{i}=\mathrm{softmax}\left(\frac{\bm{h}^{\intercal}\bm{m}_{f}^{i}}{\|\bm{h}\|_{2}\,\|\bm{m}_{f}^{i}\|_{2}}\right),
(5)  \bm{w}_{face}=[w_{f}^{1},w_{f}^{2},\cdots,w_{f}^{N}],
(6)  \hat{\bm{h}}=\bm{M}_{voice}^{\intercal}\bm{w}_{face},

where $\bm{m}_{f}^{i}\in\mathbb{R}^{D}$ is the $i$-th trainable slot in the face-key memory $\bm{M}_{face}$. Consequently, we combine the slots that are solely related to the voice to generate the recalled face embedding $\hat{\bm{h}}$. In addition, face images often contain background noise and information irrelevant to the voice, such as the shooting angle and image background. The voice-value memory $\bm{M}_{voice}$ imposes an information bottleneck that removes these non-essential details. In the end, we align the slot-weight distributions of the two modalities using the Kullback-Leibler divergence,

(7)  \mathcal{L}_{align}=D_{KL}(\bm{w}_{voice}\,\|\,\bm{w}_{face}).

In this way, during the inference phase, given only the face embedding as input, we can generate a reasonable recalled face embedding that provides the voice characteristics information for zero-shot FaceVC.
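To make the above formulation concrete, a minimal PyTorch sketch of the MFVA module (Equations 1-7) is given below. The hyper-parameters follow Section 4.2 ($N=96$, $D=256$), while the initialization, batching and interface details are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MFVA(nn.Module):
    """Sketch of the memory-based face-voice alignment module (Eqs. 1-7)."""

    def __init__(self, num_slots: int = 96, dim: int = 256):
        super().__init__()
        self.M_voice = nn.Parameter(torch.randn(num_slots, dim))  # voice-value memory
        self.M_face = nn.Parameter(torch.randn(num_slots, dim))   # face-key memory

    @staticmethod
    def _slot_weights(query, memory):
        # Cosine similarity between the query and every slot, then softmax (Eqs. 1 and 4).
        sim = F.cosine_similarity(query.unsqueeze(1), memory.unsqueeze(0), dim=-1)
        return F.softmax(sim, dim=-1)                             # (batch, num_slots)

    def forward(self, s, h):
        w_voice = self._slot_weights(s, self.M_voice)             # query with speaker embedding
        w_face = self._slot_weights(h, self.M_face)               # query with face embedding

        s_hat = w_voice @ self.M_voice                            # recalled speaker embedding (Eq. 2)
        h_hat = w_face @ self.M_voice                             # recalled face embedding (Eq. 6)

        loss_store = F.mse_loss(s_hat, s)                         # Eq. 3
        # KL(w_voice || w_face), Eq. 7; kl_div expects log-probabilities as its first argument.
        loss_align = F.kl_div(w_face.clamp_min(1e-8).log(), w_voice, reduction="batchmean")
        return h_hat, loss_store, loss_align

    def recall_from_face(self, h):
        # Inference: only the face embedding is available (no speaker embedding).
        return self._slot_weights(h, self.M_face) @ self.M_voice
```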

3.2. Structures of Other Modules

Content Encoder uses vector quantization (VQ) (van Niekerk et al., 2020) with contrastive predictive coding (CPC) (Baevski et al., 2019) to extract content embedding from mel-spectrograms, where VQ can be seen as an information bottleneck to remove inconsequential content information and CPC is used to explore the local structure of speech.
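The VQ bottleneck at the heart of this encoder can be sketched as a generic nearest-neighbour quantizer with a straight-through estimator; the codebook size, dimensionality and loss weighting below are assumptions for illustration, not the exact VQ-CPC configuration cited above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VQBottleneck(nn.Module):
    """Generic nearest-neighbour vector-quantization bottleneck (sketch)."""

    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.beta = beta

    def forward(self, z):
        # z: (batch, frames, dim) continuous content features from the encoder.
        dist = torch.cdist(z, self.codebook.unsqueeze(0).expand(z.size(0), -1, -1))
        codes = dist.argmin(dim=-1)            # (batch, frames) discrete content indices
        q = self.codebook[codes]               # quantized content embedding

        # Codebook and commitment losses, then straight-through gradient copy.
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()
        return q, codes, vq_loss
```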

Face Encoder is used to extract the face embedding from the face image. We first apply MTCNN (Zhang et al., 2016) for face detection on the original face image. Then we enhance the pretrained FaceNet (Schroff et al., 2015) with a self-attention module to extract the facial feature as the face embedding. During the zero-shot FaceVC training phase, only the self-attention module is jointly trained with the other modules.

Speaker Encoder takes in mel-spectrograms to generate the fixed-length speaker embedding. The speaker encoder consists of two parts: Resemblyzer (https://github.com/resemble-ai/Resemblyzer) and a self-attention module. Resemblyzer is pretrained on the speaker verification task with the GE2E loss (Wan et al., 2018). To better integrate the pretrained speaker verification network into the zero-shot VC network, we enhance it with a self-attention module, which is jointly trained with the other modules during the zero-shot VC training phase.

Pitch Extractor extracts fundamental frequencies from input waveforms based on the period detection of the vocal fold vibration (Morise et al., 2009) and performs z-normalization for each utterance.

Decoder maps the content representation, speaker-identity representation, and pitch representation into mel-spectrograms. The decoder mainly consists of convolutional blocks and two long short-term memory (LSTM) layers.

3.3. Mixed Supervision Strategy

The mixed supervision strategy contains intra-speaker supervision and inter-speaker supervision. The left part of Figure 1 shows the intra-speaker supervision following the traditional auto-encoder paradigm. During the training phase, we first encode the mel-spectrograms $\bm{X}_{A}$ of an input utterance from speaker $A$ into a speaker-independent content embedding $\bm{c}_{A}$, normalized fundamental frequencies $\bm{f}_{A}$ and a fixed-length speaker embedding $\bm{s}_{A}$. Meanwhile, we utilize the corresponding face image $\bm{Z}_{A}$ to obtain the face embedding $\bm{h}_{A}$. Then, we feed the face embedding $\bm{h}_{A}$ and the speaker embedding $\bm{s}_{A}$ into the MFVA module to obtain the recalled face embedding $\hat{\bm{h}}_{A}$. In the end, the decoder $D$ maps the above representations to the reconstructed mel-spectrograms $\hat{\bm{X}}_{A}=D(\bm{c}_{A},\hat{\bm{h}}_{A},\bm{f}_{A})$. The decoder is jointly trained with the MFVA module and the self-attention module in the face encoder by minimizing the following reconstruction loss,

(8)  \mathcal{L}_{Intra}=\|\hat{\bm{X}}_{A}-\bm{X}_{A}\|_{2}^{2}+\|\hat{\bm{X}}_{A}-\bm{X}_{A}\|_{1}.

However, a training-inference inconsistency arises when only intra-speaker supervision is adopted, because the content embedding and the recalled face embedding come from the same speaker during training but from different speakers during inference. We therefore introduce a simple but effective inter-speaker supervision strategy for the case where a parallel corpus is unavailable. In this strategy, the voice converted using the target speaker embedding serves as a pseudo-parallel corpus for the face embedding during supervised training. Specifically, as shown in the middle part of Figure 1, in addition to speaker $A$, a pair of a face image and mel-spectrograms from an extra speaker $B$ is taken as input, and the speaker embedding $\bm{s}_{B}$ and the recalled face embedding $\hat{\bm{h}}_{B}$ are obtained in the same way as in the inference phase. Then speaker $A$ is treated as the source speaker and speaker $B$ as the target speaker, and the voice is converted as follows,

(9)  \bm{X}_{speech}=D(\bm{c}_{A},\bm{s}_{B},\bm{f}_{A}),
(10)  \bm{X}_{face}=D(\bm{c}_{A},\hat{\bm{h}}_{B},\bm{f}_{A}),

where $\bm{X}_{speech}$ refers to the converted mel-spectrograms generated with the speaker embedding $\bm{s}_{B}$, and $\bm{X}_{face}$ denotes the converted mel-spectrograms obtained using the recalled face embedding $\hat{\bm{h}}_{B}$. Then, we optimize the MFVA module by minimizing the reconstruction loss,

(11)  \mathcal{L}_{Inter}=\|\bm{X}_{speech}-\bm{X}_{face}\|_{2}^{2}+\|\bm{X}_{speech}-\bm{X}_{face}\|_{1}.

In summary, the final loss function during the training phase of zero-shot FaceVC is as follows,

(12)  \mathcal{L}=\lambda_{1}\mathcal{L}_{store}+\lambda_{2}\mathcal{L}_{align}+\lambda_{3}\mathcal{L}_{Inter}+\mathcal{L}_{Intra},

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are constant weights that control the importance of each term, and $\mathcal{L}_{store}$ and $\mathcal{L}_{align}$ are defined in Section 3.1.
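Putting Equations 8-12 together, one training step with mixed supervision can be sketched as follows, reusing the MFVA sketch above. The module interfaces, the batch layout, and the decision to stop gradients through the pseudo-parallel target are our assumptions, not details stated in the paper.

```python
import torch.nn.functional as F


def mixed_supervision_step(batch_A, batch_B, modules, lambdas=(1.0, 10.0, 0.2)):
    """Sketch of one FVMVC training step with intra- and inter-speaker supervision."""
    content_enc, speaker_enc, face_enc, pitch_ext, mfva, decoder = modules
    lam1, lam2, lam3 = lambdas

    # Speaker A: content, pitch, speaker and (recalled) face embeddings.
    c_A = content_enc(batch_A["mel"])
    f_A = pitch_ext(batch_A["wave"])
    s_A = speaker_enc(batch_A["mel"])
    h_A = face_enc(batch_A["face"])
    h_hat_A, loss_store, loss_align = mfva(s_A, h_A)

    # Intra-speaker supervision: reconstruct speaker A's mel-spectrograms (Eq. 8).
    X_hat_A = decoder(c_A, h_hat_A, f_A)
    loss_intra = F.mse_loss(X_hat_A, batch_A["mel"]) + F.l1_loss(X_hat_A, batch_A["mel"])

    # Inter-speaker supervision: speech converted with speaker B's speaker embedding
    # acts as a pseudo-parallel target for the face-driven conversion (Eqs. 9-11).
    s_B = speaker_enc(batch_B["mel"])
    h_hat_B, _, _ = mfva(s_B, face_enc(batch_B["face"]))
    X_speech = decoder(c_A, s_B, f_A).detach()   # stop-gradient on the pseudo target: an assumption
    X_face = decoder(c_A, h_hat_B, f_A)
    loss_inter = F.mse_loss(X_face, X_speech) + F.l1_loss(X_face, X_speech)

    # Total loss (Eq. 12) with the weights reported in Section 4.2.
    return lam1 * loss_store + lam2 * loss_align + lam3 * loss_inter + loss_intra
```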

3.4. Pretraining Strategy

It is widely recognized that the content encoder may encode speaker identity-related information. Only when the speaker and content representations are disentangled can the voice characteristics of an utterance be converted by changing the speaker-identity representation (Qian et al., 2019; Yuan et al., 2021). Hence, speech representation disentanglement is a critical factor that significantly impacts the performance of zero-shot FaceVC. Taking this into consideration, we first pretrain a zero-shot VC model and then transfer its content encoder, speaker encoder, and decoder to zero-shot FaceVC for better performance. Specifically, during the training phase of zero-shot FaceVC, the pretrained content encoder and speaker encoder are fixed, while the pretrained decoder is further optimized with the other modules.

In order to achieve speech representation disentanglement, mutual information (MI) is introduced to evaluate the dependency between different representations. We minimize the MI between content embeddings, speaker embeddings and fundamental frequencies using the variational contrastive log-ratio upper bound (vCLUB) (Cheng et al., 2020) during the training phase of zero-shot VC. In addition to the MI loss, the reconstruction loss, InfoNCE loss (Oord et al., 2018) and VQ loss (van Niekerk et al., 2020) are used to optimize the zero-shot VC model. For more details on these loss functions, please refer to the VQMIVC paper (Wang et al., 2021).
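As a rough illustration of how such an MI penalty can be estimated, the sketch below follows the CLUB-style upper bound of Cheng et al. (2020) with a Gaussian variational network q(y|x); the network sizes and training details are assumptions, and the exact vCLUB estimator used in the paper may differ.

```python
import torch
import torch.nn as nn


class CLUBEstimator(nn.Module):
    """Sketch of a CLUB-style mutual-information upper bound between two embeddings."""

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))

    def log_likelihood(self, x, y):
        # Train the variational network q(y|x) by maximizing this Gaussian log-likelihood.
        mu, logvar = self.mu(x), self.logvar(x)
        return (-(y - mu) ** 2 / logvar.exp() - logvar).sum(dim=-1).mean()

    def mi_upper_bound(self, x, y):
        # CLUB: E_{p(x,y)}[log q(y|x)] - E_{p(x)p(y)}[log q(y|x)]; minimized by the main model.
        mu, logvar = self.mu(x), self.logvar(x)
        positive = (-(y - mu) ** 2 / logvar.exp()).sum(dim=-1)                 # paired samples
        negative = (-(y.unsqueeze(0) - mu.unsqueeze(1)) ** 2
                    / logvar.exp().unsqueeze(1)).sum(dim=-1)                   # all cross pairs
        return (positive - negative.mean(dim=1)).mean()
```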

4. Experiments

4.1. Datasets

Zero-shot FaceVC tasks place high demands on datasets, requiring not only a large volume of background-noise-free speech from various speakers but also clear face images of the corresponding individuals. Based on the above considerations, we conducted the experiments on the LRS3-TED (Afouras et al., 2018) dataset to evaluate the zero-shot FaceVC task. This dataset includes 5,594 TED and TEDx talks in English, totaling over 400 hours of video content. The cropped face tracks in the video are provided at a resolution of 224×224 with a frame rate of 25 frames per second. The audio tracks are also available in a single-channel 16-bit 16 kHz format. In this dataset, the duration of speech from different speakers follows a long-tail distribution, which means that the majority of speakers have only a small number of utterances. To address the issue of uneven video distribution among speakers, we opted to use the top 200 speakers with the highest number of videos as our training set and validation set. During inference, we randomly selected a total of 12 newly coming speakers not in the training and validation sets: 8 target speakers (4 female and 4 male) and 4 source speakers (2 female and 2 male) for evaluation.

Table 1. Objective and subjective evaluation results of the comparison systems. SHR and SHO measure homogeneity, SDR and SDO measure diversity, GA measures consistency, and MOS-FVC and MOS-SN measure quality. The definitions of all metrics can be found in Section 4.4.

Method                             | SHR ↑  | SHO ↑  | SDR ↓  | SDO ↓  | GA ↑   | MOS-FVC ↑ | MOS-SN ↑
Ground Truth                       | 0.8245 | 1.0000 | 0.5524 | 0.5524 | 1.0000 | 3.7042    | 4.2183
SpeechVC                           | 0.7267 | 0.8229 | 0.5890 | 0.6408 | 0.9895 | 3.5917    | 3.6022
Auto-FaceVC (Lu et al., 2021)      | 0.7186 | 0.8132 | 0.6351 | 0.7042 | 0.9239 | 3.4289    | 3.5969
attentionCVAE (Yang et al., 2022b) | 0.7153 | 0.8081 | 0.6874 | 0.7789 | 0.9166 | 3.4292    | 3.5982
FVMVC                              | 0.7313 | 0.8692 | 0.6188 | 0.6781 | 0.9791 | 3.5417    | 3.5993

4.2. Implementation Details

To extract acoustic features, we first extracted the audio from the video clips with the FFmpeg tool (Tomar, 2006). Then 80-dim mel-spectrograms and normalized fundamental frequencies were calculated with a 25 ms Hanning window, a 10 ms frame shift and a 400-point short-time Fourier transform (STFT). We extracted a 512-dimensional face embedding from each video frame using MTCNN (Zhang et al., 2016) followed by FaceNet (Schroff et al., 2015). The face embeddings were projected to dimension $D=256$, the same as the dimension of the speaker embeddings and the slots. The number of slots in MFVA was $N=96$. We applied the pretrained Parallel WaveGAN (PWG) vocoder (Yamamoto et al., 2020) to convert the mel-spectrograms to waveforms.
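A minimal sketch of this feature extraction, assuming librosa for the mel-spectrograms and a generic per-utterance z-normalization for F0, is shown below; the exact scaling and normalization choices used by the authors are not specified and are therefore assumptions.

```python
import librosa
import numpy as np


def extract_mel(wav_path: str, sr: int = 16000) -> np.ndarray:
    """80-dim log-mel features: 25 ms Hanning window, 10 ms shift, 400-point STFT at 16 kHz."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=400, hop_length=160, win_length=400,
        window="hann", n_mels=80)
    return np.log(np.maximum(mel, 1e-10)).T     # (frames, 80)


def normalize_f0(f0: np.ndarray) -> np.ndarray:
    """Z-normalize voiced frames (f0 > 0) per utterance, as done by the pitch extractor."""
    f0 = f0.copy()
    voiced = f0 > 0
    if voiced.any():
        f0[voiced] = (f0[voiced] - f0[voiced].mean()) / (f0[voiced].std() + 1e-8)
    return f0
```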

For the pretraining strategy, the zero-shot VC model was trained for 1000 epochs using a batch size of 256. The mini-batch Adam optimizer was initialized with a learning rate of 1e-6 and was warmed up to 1e-3 after 2000 iterations. The learning rate was then decayed by a factor of 0.5 at epochs 300, 400, and 500. At the zero-shot FaceVC training stage, the model was updated for 2000 epochs using a batch size of 256. Similar to pretraining the zero-shot VC model, the learning rate of the Adam optimizer was initialized as 1e-6 and warmed up to 2.5e-4 after 3000 iterations. The learning rate was then decayed by a factor of 0.5 at epochs 800, 1200, and 1600. The constant weights $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ in Equation 12 were set to 1, 10, and 0.2, respectively.
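The learning-rate schedule for the zero-shot FaceVC stage can be sketched as follows; a linear warm-up shape is assumed, since only the start and peak values and the decay epochs are reported.

```python
def facevc_learning_rate(step: int, epoch: int) -> float:
    """Warm up from 1e-6 to 2.5e-4 over 3000 iterations, then halve at epochs 800/1200/1600."""
    base, peak = 1e-6, 2.5e-4
    if step < 3000:
        return base + (peak - base) * step / 3000   # assumed linear warm-up
    decay_steps = sum(epoch >= m for m in (800, 1200, 1600))
    return peak * (0.5 ** decay_steps)
```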

4.3. Comparison systems

As we are the first to attempt the zero-shot FaceVC task and there are no existing comparable methods, we compared our proposed method with the following systems to evaluate its performance:

(1) Ground Truth: This method transferred the natural mel-spectrograms of target speakers to waveforms using the pretrained PWG vocoder. Since there are no parallel utterances between source and target speakers, the ground truth results cannot be compared directly with the converted results, and are just used to indicate the upper bound of the various metrics.

(2) SpeechVC: This is our pretrained zero-shot VC model using natural reference utterances of target speakers for inference.

(3) Auto-FaceVC (Lu et al., 2021): This method originally adopted AutoVC (Qian et al., 2019) as the backbone. To better adapt the model to the zero-shot FaceVC task, we replaced its AutoVC backbone with VQMIVC while preserving its original training strategy.

(4) attentionCVAE (Yang et al., 2022b): This method used face attributes to control the voice characteristics in the multi-speaker text-to-speech task. To adapt it to our zero-shot FaceVC task, we replaced the face encoder and the MFVA module in our proposed model with its face-attribute-based voice control module.

4.4. Metrics

We developed several objective metrics to evaluate the homogeneity, diversity and consistency of the converted voice. The underlying motivation for measuring homogeneity is that the zero-shot FaceVC system should generate homogeneous voice characteristics with different face images from the same target speaker, regardless of the image's shooting angle and background. We applied the well-known open-source speaker verification toolkit Resemblyzer (https://github.com/resemble-ai/Resemblyzer) to extract the speaker embeddings from converted utterances of the same speaker and calculated the cosine similarity between them. The greater the cosine similarity, the higher the homogeneity between different utterances. Based on the descriptions above, we employed two methods to match the utterances and calculate the cosine similarity between them. (1) We employed a randomized approach to match utterances converted by different face images of the same target speaker. To reduce the effect of chance, we shuffled the utterances 500 times and calculated the average cosine similarity between all pairs, which we refer to as the speaker homogeneity score by random matching (SHR). (2) In addition to the random matching, we also conducted one-to-one matching of all utterances converted from the same utterance to the same target speaker by different face images. We then averaged the cosine similarity between these pairs to obtain the speaker homogeneity score by one-to-one matching (SHO).
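A sketch of the SHR/SHO computation using Resemblyzer is given below; the pairing and shuffling procedure is simplified relative to the description above and should be read as an assumption.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav


def speaker_homogeneity(wav_paths, one_to_one_pairs=None, n_shuffles=500, seed=0):
    """Cosine similarity between speaker embeddings of utterances converted for one target."""
    encoder = VoiceEncoder()
    embs = np.stack([encoder.embed_utterance(preprocess_wav(p)) for p in wav_paths])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)

    # SHO: average similarity over explicitly matched pairs (same source utterance).
    if one_to_one_pairs is not None:
        return float(np.mean([embs[i] @ embs[j] for i, j in one_to_one_pairs]))

    # SHR: average similarity over repeated random pairings to reduce the effect of chance.
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_shuffles):
        perm = rng.permutation(len(embs))
        scores.extend(float(embs[i] @ embs[j]) for i, j in zip(perm[::2], perm[1::2]))
    return float(np.mean(scores))
```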

Apart from homogeneity, it is crucial for the voice characteristics converted from different target speakers to be diverse rather than uniform and indistinguishable. Similar to homogeneity, we obtained the speaker embeddings of utterances converted from the same source speaker using different target speakers' face images with Resemblyzer and calculated the cosine similarity between them. We hypothesize that a lower similarity indicates higher diversity in the voice characteristics between target speakers. To measure speaker diversity, we also matched the utterances in two ways. (1) We randomly matched the converted utterances from the same source speaker to different target speakers and averaged the cosine similarity between them over 100 shuffles, which we refer to as the speaker diversity score by random matching (SDR). (2) We conducted a one-to-one matching of utterances converted from the same utterance to different target speakers and averaged the cosine similarity between them to obtain the speaker diversity score by one-to-one matching (SDO).

When assessing the consistency between the voice characteristics and the corresponding face images, the gender attribute is the primary factor to consider. Hence, we used the open-source speech segmentation toolkit inaSpeechSegmenter (https://github.com/ina-foss/inaSpeechSegmenter) to calculate the gender accuracy (GA) for each converted utterance. Specifically, a speech segmenter (Doukhan et al., 2018) was first used to discard segments that did not contain any speech. Next, the remaining speech segments were classified as either male or female using convolutional neural networks. The gender of a converted utterance was considered consistent with the target speaker only if all the speech segments in the utterance were classified as the target speaker's gender.
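The GA metric can be sketched with inaSpeechSegmenter as below; the 'male'/'female' label names follow that toolkit, and the per-utterance aggregation rule mirrors the description above, but the exact scripting used by the authors is an assumption.

```python
from inaSpeechSegmenter import Segmenter


def gender_accuracy(wav_paths, target_genders):
    """Fraction of utterances whose speech segments are all classified as the target gender."""
    seg = Segmenter()
    correct = 0
    for path, gender in zip(wav_paths, target_genders):
        # Each segment is a (label, start, end) tuple; non-speech labels such as
        # 'noEnergy', 'music' or 'noise' are discarded before checking the gender.
        speech_labels = [lab for lab, _, _ in seg(path) if lab in ("male", "female")]
        if speech_labels and all(lab == gender for lab in speech_labels):
            correct += 1
    return correct / len(wav_paths)
```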

For subjective evaluation, we adopted two mean opinion scores in terms of face-voice consistency degree (MOS-FVC) and speech naturalness (MOS-SN). MOS-FVC was used to evaluate whether the face image and the voice characteristics were consistent with each other; e.g., a middle-aged man's face paired with a little girl's voice would be considered inconsistent. MOS-SN was used to quantitatively measure the naturalness of the converted voice. The listeners were asked to score each converted utterance on a scale from 1 (completely unnatural or completely inconsistent) to 5 (completely natural or completely consistent) for the two metrics.

Figure 3. Target face images and their corresponding slot weights calculated by the MFVA module at the inference stage. The third row shows the mel-spectrograms of the utterances converted by FVMVC.
Figure 4. The t-SNE visualization of the speaker embeddings extracted from 576 utterances converted by different systems. Each point corresponds to a single utterance, with its colour indicating the identity of the target speaker; • denotes male target speakers and × denotes female target speakers.

4.5. Evaluation Results

We chose 6 utterances from each of the 4 source speakers and randomly selected one face frame from each of 3 videos of each of the 8 target speakers for inference. Then we matched them pairwise and converted a total of 576 utterances for objective evaluation. The two subjective metrics were evaluated on the Amazon Mechanical Turk platform (https://www.mturk.com/). 20 converted utterances were randomly selected from each system and a total of 20 listeners participated in the test. All objective and subjective evaluation results are reported in Table 1.

We can observe that the proposed FVMVC significantly outperformed the Auto-FaceVC and attentionCVAE systems on all objective metrics ($p<0.05$ in paired t-tests). Compared with Auto-FaceVC, the slots in MFVA quantize the voice characteristics space, which makes the voice control via face images more homogeneous. Additionally, our proposed FVMVC incorporates the MFVA module to alleviate the problem of over-smoothing, resulting in a more diverse range of voice characteristics. In attentionCVAE, facial attributes can only provide limited information such as gender, age, and ethnicity. As a result, while this method ensures relatively consistent voice characteristics and accurate gender, it also tends to generate very similar voices for different target speakers, resulting in a loss of diversity. In addition, following the method described in attentionCVAE (Yang et al., 2022b), we found that the facial attributes may vary across different face images of the same speaker, which can lead to heterogeneity among the voice characteristics of the same target speaker. With regard to GA, our proposed FVMVC demonstrates a significant improvement, increasing from 90.85% and 91.66% in the two aforementioned methods to 97.91%. The MOS-FVC has a strong correlation with the GA. In cases where the voice and the face image display a clear gender mismatch, the consistency score between them tends to be significantly low. Since the three methods employed the same backbone, their performance in terms of MOS-SN is quite comparable.

We can observe that our proposed FVMVC performs better than SpeechVC with respect to SHR and SHO. This could be attributed to the presence of stable identity information in face images, whereas natural reference speech includes several speaker-independent factors, such as prosody and emotion, that can affect the homogeneity between the converted utterances of the same target speaker. In terms of SDR, SDO and MOS-FVC, our proposed method is less effective than SpeechVC, which is caused by the limited amount of voice characteristics information contained in the face image compared with that contained in the natural reference speech. Additionally, for GA and MOS-SN, the performance of our proposed FVMVC and SpeechVC is essentially similar.

Figure 5. Voice characteristics interpolation by mixing the slot weights of a female speaker $A$ and a male speaker $B$. From left to right, the slot weights of the male speaker increase sequentially. Specifically, $0.6A+0.4B$ means that we combine the slot weights of speaker $A$ and speaker $B$ with a ratio of 6:4 to obtain a new recalled face embedding. The third row shows the mel-spectrograms of the utterances converted with the corresponding slot weights.

4.6. Visual Analysis

We utilized Resemblyzer to extract speaker embeddings from the utterances generated by the Ground Truth and those converted by three other systems, i.e., SpeechVC, Auto-FaceVC, and FVMVC. We present their t-SNE (Chan et al., 2019) visualization in Figure 4. For the Auto-FaceVC system, some embedding clusters contained both male and female target speakers, as shown in the red box of Figure 4(c). As a result, a single converted utterance could contain both male and female voices in different segments. On the other hand, our proposed FVMVC model produced embeddings with a clear boundary between the two genders, which further demonstrates the effectiveness of our method on the GA and MOS-FVC metrics. In addition, the embeddings of different target speakers overlapped considerably for the Auto-FaceVC system, whereas our proposed FVMVC model, similar to SpeechVC, exhibited much less embedding overlap across target speakers, which further confirms the better speaker diversity achieved by our method.

4.7. Case Study

We selected 6 face images of 2 target speakers, taken from different perspectives, for voice conversion, as shown in Figure 3. The first three columns belong to the first target speaker, and the last three columns belong to the second target speaker. The slot weights and corresponding mel-spectrograms of the utterances converted based on the face images are visualized. We find that the distributions of slot weights remain consistent across different face images of the same speaker, regardless of the angle or expression displayed in the images. This finding suggests that the speaker's facial features are of decisive importance in the process of aligning face and voice, and are minimally affected by external factors such as camera position, background, and other sources of noise. As we can see from the third row in Figure 3, with the aid of stable recalled face embeddings, the mel-spectrograms converted by different face images exhibit a high level of uniformity.

Additionally, we attempted to achieve voice characteristics interpolation by manipulating the slot weights in the MFVA module. We chose a male and a female target speaker for creating new voices by interpolation, as depicted in Figure 5. Specifically, we blended the slot weights of the two face images with different mixing weights to obtain new recalled face embeddings. From left to right, as the slot weights of the male speaker $B$ increase, the voice characteristics gradually shift from female to male, and the fundamental frequency gradually decreases. This further validates the effectiveness of the MFVA module for face-based voice control.
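Reusing the MFVA sketch from Section 3.1, this interpolation can be expressed as a convex combination of the two speakers' slot weights; the function below is illustrative only.

```python
def interpolate_voices(mfva, face_emb_a, face_emb_b, alpha: float = 0.6):
    """Blend slot weights of two faces (e.g. 0.6*A + 0.4*B) into one recalled face embedding."""
    w_a = mfva._slot_weights(face_emb_a, mfva.M_face)   # slot weights of speaker A
    w_b = mfva._slot_weights(face_emb_b, mfva.M_face)   # slot weights of speaker B
    w_mix = alpha * w_a + (1.0 - alpha) * w_b           # convex combination of the weights
    return w_mix @ mfva.M_voice                         # new recalled face embedding
```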

4.8. Ablation Study

Table 2. Objective evaluation results of the ablation studies. The definitions of all metrics can be found in Section 4.4.

Method            | SHR ↑  | SHO ↑  | SDR ↓  | SDO ↓  | GA ↑
FVMVC             | 0.7313 | 0.8692 | 0.6188 | 0.6781 | 0.9791
w/o Inter-speaker | 0.7301 | 0.8629 | 0.6262 | 0.6908 | 0.9444
w/o MFVA          | 0.7124 | 0.8257 | 0.6321 | 0.7111 | 0.9167
w/o Pretraining   | 0.7013 | 0.8113 | 0.6436 | 0.7140 | 0.8925

In this section, we conducted ablation experiments on our proposed FVMVC to explore the effectiveness of each module. As shown in Table 2, we conducted experiments by removing the inter-speaker supervision, the MFVA module and the pretraining strategy from the proposed FVMVC, respectively. The results show that when inter-speaker supervision was removed, the model's performance degraded in terms of SDR, SDO and GA. This suggests that alleviating the inconsistency between the training and inference phases helps fit the recalled face embedding to the decoder, resulting in more diverse voice generation. After removing the MFVA module, the output of the face encoder was directly fed into the decoder without any constraints imposed by speaker embeddings. We find that the model's performance decreased in all aspects, highlighting the crucial role of the alignment between face and voice. We also attempted to train the model from scratch without the pretraining strategy. Our results show that pretraining on the zero-shot VC task has a significant positive impact on the proposed FVMVC model.

5. Conclusion

In this paper, we propose the FVMVC model to tackle a novel task, zero-shot FaceVC. The slots in the MFVA module act as a link between face and voice, promoting the performance of voice control based on face images of unseen speakers. In addition, we have implemented a mixed supervision strategy to alleviate the long-standing issue of inconsistency between training and inference in VC tasks. As a result, based on the face images of newly coming speakers, the proposed FVMVC is able to generate more consistent and diverse voices.

As a future direction, we aim to explore unified pretraining for face-voice alignment, with a specific emphasis on voice control within text-to-speech, voice conversion, and singing synthesis tasks. Additionally, we plan to compile a comprehensive, large-scale multi-speaker video dataset, ensuring its cleanliness and incorporating speaker details such as age, gender, race, and physical appearance descriptions.

References

  • Afouras et al. (2018) Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018. LRS3-TED: A large-Scale dataset for visual speech recognition. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP). 66–71.
  • Baevski et al. (2019) Alexei Baevski, Steffen Schneider, and Michael Auli. 2019. Vq-wav2vec: self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453 (2019).
  • Bai et al. (2022) Yeqi Bai, Tao Ma, Lipo Wang, and Zhenjie Zhang. 2022. Speech Fusion to Face: Bridging the Gap Between Human’s Vocal Characteristics and Facial Imaging. In Proceedings of the ACM International Conference on Multimedia (MM). 2042–2050.
  • Chan et al. (2019) David M Chan, Roshan Rao, Forrest Huang, and John F Canny. 2019. GPU accelerated t-distributed stochastic neighbor embedding. J. Parallel and Distrib. Comput. 131 (2019), 1–13.
  • Chen et al. (2019) Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). 7832–7841.
  • Chen et al. (2022) Yan-Nian Chen, Li-Juan Liu, Ya-Jun Hu, Yuan Jiang, and Zhen-Hua Ling. 2022. Improving recognition-synthesis based any-to-one voice conversion with cyclic training. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7007–7011.
  • Cheng et al. (2020) Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, and Lawrence Carin. 2020. Club: A contrastive log-ratio upper bound of mutual information. In Proceedings of the International conference on machine learning (ICML). PMLR, 1779–1788.
  • Choi et al. (2020) Hyeong-Seok Choi, Changdae Park, and Kyogu Lee. 2020. From inference to generation: End-to-end fully self-supervised generation of human face from speech. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Doukhan et al. (2018) David Doukhan, Jean Carrive, Félicien Vallet, Anthony Larcher, and Sylvain Meignier. 2018. An open-source speaker gender detection framework for monitoring gender equality. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5214–5218.
  • Gan et al. (2022) Wendong Gan, Bolong Wen, Ying Yan, Haitao Chen, Zhichao Wang, Hongqiang Du, Lei Xie, Kaixuan Guo, and Hai Li. 2022. IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion. arXiv preprint arXiv:2201.00269 (2022).
  • Goto et al. (2020) Shunsuke Goto, Kotaro Onishi, Yuki Saito, Kentaro Tachibana, and Koichiro Mori. 2020. Face2Speech: Towards multi-speaker text-to-speech synthesis using an embedding vector predicted from a face image.. In Proceedings of the International Speech Communication Association (INTERSPEECH). 1321–1325.
  • Kamachi et al. (2003) Miyuki Kamachi, Harold Hill, Karen Lander, and Eric Vatikiotis-Bateson. 2003. Putting the face to the voice’: Matching identity across modality. Current Biology 13, 19 (2003), 1709–1714.
  • Kameoka et al. (2018) Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo. 2018. Stargan-vc: Non-parallel many-to-many voice conversion using star generative adversarial networks. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT). 266–273.
  • Kameoka et al. (2019a) Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo. 2019a. ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 9 (2019), 1432–1443.
  • Kameoka et al. (2019b) Hirokazu Kameoka, Kou Tanaka, Aaron Valero Puche, Yasunori Ohishi, and Takuhiro Kaneko. 2019b. Crossmodal voice conversion. arXiv preprint arXiv:1904.04540 (2019).
  • Kaneko and Kameoka (2018) Takuhiro Kaneko and Hirokazu Kameoka. 2018. Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks. In Proceedings of the European Signal Processing Conference (EUSIPCO). 2100–2104.
  • Kim et al. (2021) Minsu Kim, Joanna Hong, and Yong Man Ro. 2021. Lip to speech synthesis with visual context attentional GAN. Proceedings of the Neural Information Processing Systems (NeurIPS) 34 (2021), 2758–2770.
  • Lee et al. (2021) Sang-Hoon Lee, Ji-Hoon Kim, Hyunseung Chung, and Seong-Whan Lee. 2021. Voicemixer: Adversarial voice style mixup. Proceedings of the Neural Information Processing Systems (NeurIPS) 34, 294–308.
  • Lian et al. (2022) Jiachen Lian, Chunlei Zhang, and Dong Yu. 2022. Robust disentangled variational speech representation learning for zero-shot voice conversion. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6572–6576.
  • Lu et al. (2021) Hsiao-Han Lu, Shao-En Weng, Ya-Fan Yen, Hong-Han Shuai, and Wen-Huang Cheng. 2021. Face-based voice conversion: Learning the voice behind a face. In Proceedings of the ACM International Conference on Multimedia (MM). 496–505.
  • Mavica and Barenholtz (2013) Lauren W Mavica and Elan Barenholtz. 2013. Matching voice and face identity from static images. Journal of Experimental Psychology: Human Perception and Performance 39, 2 (2013), 307.
  • Mohammadi and Kain (2017) Seyed Hamidreza Mohammadi and Alexander Kain. 2017. An overview of voice conversion systems. Speech Communication 88 (2017), 65–82.
  • Mohammadi and Kim (2019) Seyed Hamidreza Mohammadi and Taehwan Kim. 2019. One-shot voice conversion with disentangled representations by leveraging phonetic posteriorgrams.. In Proceedings of the International Speech Communication Association (INTERSPEECH). 704–708.
  • Morise et al. (2009) Masanori Morise, Hideki Kawahara, and Haruhiro Katayose. 2009. Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech. In Proceedings of the Audio for Games. Audio Engineering Society.
  • Oh et al. (2019) Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T Freeman, Michael Rubinstein, and Wojciech Matusik. 2019. Speech2face: Learning the face behind a voice. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). 7539–7548.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
  • Park et al. (2022) Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, and Yong Man Ro. 2022. Synctalkface: Talking face generation with precise lip-syncing via audio-lip memory. In Proceedings of the AAAI conference on artificial intelligence (AAAI), Vol. 36. 2062–2070.
  • Plüster et al. (2021) Björn Plüster, Cornelius Weber, Leyuan Qu, and Stefan Wermter. 2021. Hearing Faces: Target speaker text-to-speech synthesis from a face. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 757–764.
  • Prajwal et al. (2020) KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. 2020. Learning individual speaking styles for accurate lip to speech synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 13796–13805.
  • Qian et al. (2019) Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson. 2019. Autovc: Zero-shot voice style transfer with only autoencoder loss. In Proceedings of the International Conference on Machine Learning (ICML). PMLR, 5210–5219.
  • Saito et al. (2018) Yuki Saito, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke Takamichi. 2018. Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5274–5278.
  • Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 815–823.
  • Sisman et al. (2020) Berrak Sisman, Junichi Yamagishi, Simon King, and Haizhou Li. 2020. An overview of voice conversion and its challenges: From statistical modeling to deep learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2020), 132–157.
  • Smith et al. (2016) Harriet MJ Smith, Andrew K Dunn, Thom Baguley, and Paula C Stacey. 2016. Concordant cues in faces and voices: Testing the backup signal hypothesis. Evolutionary Psychology 14, 1 (2016), 1474704916630317.
  • Song et al. (2018) Yang Song, Jingwen Zhu, Dawei Li, Xiaolong Wang, and Hairong Qi. 2018. Talking face generation by conditional recurrent adversarial network. arXiv preprint arXiv:1804.04786 (2018).
  • Srivastava et al. (2020) Brij Mohan Lal Srivastava, Nathalie Vauquier, Md Sahidullah, Aurélien Bellet, Marc Tommasi, and Emmanuel Vincent. 2020. Evaluating voice conversion-based privacy protection against informed attackers. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2802–2806.
  • Tomar (2006) Suramya Tomar. 2006. Converting video formats with FFmpeg. Linux journal 2006, 146 (2006), 10.
  • van Niekerk et al. (2020) Benjamin van Niekerk, Leanne Nortje, and Herman Kamper. 2020. Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge. In Proceedings of the International Speech Communication Association (INTERSPEECH). 4836–4840.
  • Veaux et al. (2013) Christophe Veaux, Junichi Yamagishi, and Simon King. 2013. Towards personalised synthesised voices for individuals with vocal disabilities: Voice banking and reconstruction. In Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies. 107–111.
  • Wan et al. (2018) Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. 2018. Generalized end-to-end loss for speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4879–4883.
  • Wang et al. (2021) Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, and Helen Meng. 2021. VQMIVC: vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion. In Proceedings of the International Speech Communication Association (INTERSPEECH). 1344–1348.
  • Wang et al. (2022b) Disong Wang, Shan Yang, Dan Su, Xunying Liu, Dong Yu, and Helen Meng. 2022b. VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7252–7256.
  • Wang et al. (2022a) Jianrong Wang, Zixuan Wang, Xiaosheng Hu, Xuewei Li, Qiang Fang, and Li Liu. 2022a. Residual-guided personalized speech synthesis based on face image. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4743–4747.
  • Wang et al. (2020) Ruobai Wang, Yu Ding, Lincheng Li, and Changjie Fan. 2020. One-shot voice conversion using star-GAN. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7729–7733.
  • Wen et al. (2019) Yandong Wen, Bhiksha Raj, and Rita Singh. 2019. Face reconstruction from voice using generative adversarial networks. Proceedings of the neural information processing systems (NeurIPS), 5265–5274.
  • Wu et al. (2022a) Cho-Ying Wu, Chin-Cheng Hsu, and Ulrich Neumann. 2022a. Cross-modal perceptionist: Can face geometry be gleaned from voices?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10452–10461.
  • Wu et al. (2022b) Xing Wu, Sihui Ji, Jianjia Wang, and Yike Guo. 2022b. Speech synthesis with face embeddings. Applied Intelligence 52, 13 (2022), 14839–14852.
  • Yamamoto et al. (2020) Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. 2020. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6199–6203.
  • Yang et al. (2022a) SiCheng Yang, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu, Aolan Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang, et al. 2022a. Speech representation disentanglement with adversarial mutual information learning for one-shot voice conversion. arXiv preprint arXiv:2208.08757 (2022).
  • Yang et al. (2022b) Zhihan Yang, Zhiyong Wu, and Jia Jia. 2022b. Speaker characteristics guided speech synthesis. In Proceedings of the International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
  • Yang et al. (2023) Zhihan Yang, Zhiyong Wu, Ying Shan, and Jia Jia. 2023. What does your face sound like? 3D face shape towards voice. (2023).
  • Yuan et al. (2021) Siyang Yuan, Pengyu Cheng, Ruiyi Zhang, Weituo Hao, Zhe Gan, and Lawrence Carin. 2021. Improving zero-shot voice style transfer via disentangled representation learning. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Zhang et al. (2019) Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai. 2019. Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2019), 540–552.
  • Zhang et al. (2016) Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE signal processing letters 23, 10 (2016), 1499–1503.
  • Zhou et al. (2019) Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. 2019. Talking face generation by adversarially disentangled audio-visual representation. In Proceedings of the AAAI conference on artificial intelligence (AAAI), Vol. 33. 9299–9306.
  • Zhou et al. (2021) Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. 2021. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). 4176–4186.