
Expressive Talking Head Video Encoding
in StyleGAN2 Latent-Space

Trevine Oorloff and Yaser Yacoob
University of Maryland, College Park, MD 20742, USA
{trevine,yaser}@umd.edu
Abstract

While the recent advances in research on video re-enactment have yielded promising results, the approaches fall short in capturing the fine, detailed, and expressive facial features (e.g., lip-pressing, mouth puckering, mouth gaping, and wrinkles) which are crucial in generating realistic animated face videos. To this end, we propose an end-to-end expressive face video encoding approach that facilitates data-efficient high-quality video re-synthesis by optimizing low-dimensional edits of a single Identity-latent. The approach builds on StyleGAN2 image inversion and multi-stage non-linear latent-space editing to generate videos that are nearly comparable to input videos. While existing StyleGAN latent-based editing techniques focus on simply generating plausible edits of static images, we automate the latent-space editing to capture the fine expressive facial deformations in a sequence of frames using an encoding that resides in the Style-latent-space (StyleSpace) of StyleGAN2. The encoding thus obtained could be super-imposed on a single Identity-latent to facilitate re-enactment of face videos at $1024^2$. The proposed framework economically captures face identity, head-pose, and complex expressive facial motions at fine levels, and thereby bypasses training, person modeling, dependence on landmarks/keypoints, and low-resolution synthesis which tend to hamper most re-enactment approaches. The approach is designed with maximum data efficiency, where a single $W+$ latent and 35 parameters per frame enable high-fidelity video rendering. This pipeline can also be used for puppeteering (i.e., motion transfer). A high-quality 4K-video dataset was used and will be released. The project page is located at https://trevineoorloff.github.io/ExpressiveFaceVideoEncoding.io/.

Figure 1: The proposed framework is capable of capturing fine, detailed, and highly expressive facial features (e.g., lip-pressing, mouth puckering and gaping, gaze, wrinkles). Top: Demonstrates how our re-synthesis results compare with a few state-of-the-art models: FOMM [32], LIA [42], and StyleHEAT [45]. Bottom: Depicts the zoomed-in images of the ID-frame and the Synthesized frame generated through our approach using the encoding of the ID-frame (ID-latent) and 35 parameters capturing the facial deformations of the Groundtruth.
Figure 2: The multi-stage pipeline for encoding a video in latent-space. The (1) Pre-processing stage aligns the input sequence of frames, which are fed to the (2) GAN inversion stage to obtain the corresponding sequence of $W+$ latents. From these, the best inversion that also has a near-frontal head-pose is chosen as the ID-latent in the (3) ID-latent selection stage. The (4) Head-pose encoding stage encodes the yaw and pitch of the target frames with reference to the ID-latent while generating a series of head-pose adjusted ID-latents. Subsequently, the (5) Facial-attribute encoding stage encodes the facial deformations using 32 parameters anchored onto the head-pose adjusted ID-latents. Finally, the encoded parameters (35/frame) and the ID-latent are used to synthesize the re-enacted frames in the (6) Rendering stage.

1 Introduction

Talking-head re-enactment, which involves animating a static portrait image to mimic the changes in head-pose and other facial attribute deformations of a driving video while maximally preserving the identity across frames, has a wide range of applications such as AR/VR, telepresence, and movie production. Intuitively, to facilitate re-enactment, one has to decompose the motion from the identity of the driving sequence of frames, and to this end, most contemporary methods utilize facial landmark/keypoint-based [40, 32, 39], 3D facial representation-based [13, 49], and latent-based [42] approaches to encode the facial deformations. While these methods generate promising results and each has its own pros and cons (Sec. 2), the most common drawbacks of existing approaches include the limitation to low resolution (commonly $256^2$, and $512^2$ at most), the requirement of extensive training data and person modeling, and especially the inability to capture extreme poses and intricate expressive facial details (see Fig. 1), which detracts from the realism of re-enacted videos.

On the other hand, the recent advances in StyleGAN2-based inversion techniques [1, 5, 7, 30, 37] enable manipulation of high-resolution ($1024^2$) real-world images [2, 3, 4, 20, 31, 43] due to the highly disentangled property of their latent-spaces. However, such latent-based manipulation techniques are mostly limited to static images and focus on simply generating plausible edits (e.g., changes to smile, age, hair color, etc.). While recent research [10, 23, 45] has employed StyleGAN2 to generate high-resolution re-enactment videos, it utilizes 3D parametric models to capture facial deformations. While such priors are able to capture global facial attributes, they are not capable of capturing fine and intensely expressive facial deformations.

In order to bridge the gap between high-fidelity static portrait image synthesis/manipulation and face re-enactment of intense expressions and speech, we propose a novel end-to-end face video encoding approach that automates the latent-editing process to capture head-pose and fine, complex, expressive facial deformations using merely 35 parameters/frame that reside in the Style-latent-space (StyleSpace, $SS$) of StyleGAN2. We extend single-image generation models, namely StyleGAN2 [24] and StyleFlow [3], to the temporal dimension. Quantitative evaluation of the latent-spaces $Z$, $W$, $W+$, and $SS$ by [43] indicates that within StyleGAN2's latent-spaces, the StyleSpace has the best disentanglement, completeness, and informativeness. Thus, we perform edits in $SS$ as it enables control of individual facial attributes without re-training a network to enforce disentanglement [12]. Moreover, since the latent-spaces are sparse (i.e., only specific points in the space are visually valid and meaningful), we propose optimization frameworks that anchor the latent-space attribute editing to the real images. The computed latent paths between frames are non-linear and therefore avoid the limitations of common linear latent editors [42].

In this research, we focus on both re-synthesis and puppeteering of videos using a compact encoding scheme, with an emphasis on accurate reconstruction of expressive facial deformations. In re-synthesis, we encode a video using a low-dimensional representation of small edits of an Identity-latent (ID-latent). The proposed pipeline is capable of capturing and regenerating complex facial features as shown in Figs. 1 and 3 while achieving state-of-the-art performance at $1024^2$. Further, since the encoding is independent of the subject in the video, we can substitute the ID-latent (e.g., an inversion of a real face) of a different subject and apply the face deformation parameters to generate a high-fidelity puppeteering video.

Our video clip encoding is extremely compact: a single latent ($18\times512$) corresponding to an ID-latent and only 35 parameters per frame that control the head-pose (3 parameters) and the facial feature edits (32 parameters). The 70 bytes per frame are nearly half the state-of-the-art (see Tab. 5 [40]). The key contributions of the paper are:

  • a novel algorithm for high-resolution ($1024^2$) face video encoding for re-synthesis and puppeteering, with emphasis on precise reconstruction of both expressive and talking facial attributes, in contrast to common models that only facilitate talking-head re-enactment,

  • an extremely compact encoding of head-pose and facial attribute deformations,

  • a generative approach that employs optimization frameworks instead of person-centric data training,

  • a framework employing image inversion to anchor the sparse latent-edits to image-based constraints, facilitating accurate spatio-temporal modeling,

  • a novel method to find StyleSpace channels corresponding to facial attributes based on index sensitivity,

  • an approach that automates the latent-space editing process to capture facial attribute deformations in contrast to prevailing work on latent-space editing that simply illustrate plausible semantic visual results,

  • a high-quality (4K) video dataset that will be shared.

2 Related Work

2.1 Latent-Space Based Editing

Understanding the latent-space of a pre-trained GAN has led to better controllability over the generated output. Works such as [20, 31] explore the latent-space of StyleGAN to identify interpretable semantic directions that control attributes such as aging, smile, gender, and pose within the latent-space. However, the entangled nature of the latent-space limits the manipulation, as it often leads to undesirable artifacts.

StyleSpace [43], StyleFlow [3], and StyleRig [33] are a few prominent algorithms based on the StyleGAN2 architecture that yield impressive control over latent-based manipulations. The authors of StyleSpace analyzed $SS$ and formulated an algorithm to identify the style channels that control specific attributes by backtracking gradients. StyleFlow, on the other hand, uses a flow-based model conditioned on the attributes to enable non-linear and conditioned latent-space edits. Even though the StyleRig algorithm enables a rig-like control over the 3D semantic parameters of faces generated through StyleGAN, it has limited manipulative directions [33]. In contrast to all these latent editing approaches, which simply generate plausible edits to static images, our algorithm attempts to automate the latent editing to quantify facial deformations in the form of $SS$ edits.

2.2 Face Video Re-enactment

Controlling the facial attributes and their motion through facial keypoints/landmarks is popularly used in video re-enactment [18, 39, 48, 47, 32]. While these approaches provide strict guidance over the facial attributes, they are challenged to capture fine expressive facial details (e.g., hair, teeth, lip compression, wrinkle dynamics) and accessories (e.g., eyeglasses). Further, they depend on the accuracy of the landmarks and suffer in re-enactment video synthesis when the head and/or face geometries of the source and target considerably differ [36]. Wang et al. in [40] sought to address the latter through a 3D-keypoint representation which is in turn used to warp the source frame. Even though this method yields promising results, it requires extensive training (data and computational resources), the resolution is limited to $512^2$, and the generated frames lack sharpness and contain blocking artifacts.

Approaches such as [13, 15, 17, 27, 52] employ 3D facial models (e.g., 3DMM) to guide the synthesis and are excellent at capturing facial movements. Despite the potential of 3D model-based approaches to generate high-quality videos, they represent only the inner-face region; thus they are comparatively poor at constructing surrounding facial features (e.g., hair) or complex features such as teeth, wrinkles, and complex mouth motion, and they require 3D training data that is resource and computation intensive.

2.3 StyleGAN-based Video Synthesis

The ability to synthesize high-resolution photo-realistic images and the rich latent-space of StyleGAN are stimulating video synthesis research. [34] and [14] train a temporal architecture that is used to navigate the latent-space of a pre-trained StyleGAN to search for temporally coherent directions for synthesizing videos at $1024^2$. While the former is limited to generating random video clips, StyleVideoGAN facilitates re-enactment using a PCA-based approach to transform the learned motion trajectories to the source image. While Bounareli et al. in [10] propose a method to find controllable directions of the $W+$ space of StyleGAN2 with the help of a 3D model, synthesizing videos at $256^2$, the works of [45, 23] utilize 3D models to capture the facial motion and hence share the drawbacks of 3D prior-based models discussed above, despite generating $1024^2$ videos.

In addition to their inability to capture highly expressive facial attributes precisely, all these approaches attempt to learn a model that decomposes the motion-related content and hence require a training phase. In contrast, our model extends the inherently disentangled nature of the StyleSpace ($SS$) of a pre-trained StyleGAN2 to achieve this decomposition in our pipeline. Further, in contrast to the above StyleGAN-based approaches, which require the entire latent ($18\times512$) per frame, the proposed framework provides an extremely compact encoding scheme comprising only 0.38% of the parameters per frame (35 vs. $18\times512$) while generating videos at $1024^2$.

3 Methodology

Our approach consists of six stages: video pre-processing, GAN inversion, ID-latent selection, head-pose encoding, facial attribute encoding, and rendering. The entire flow is represented in Fig. 2 and utilizes the e4e encoder [35], StyleFlow, and StyleSpace (with significant changes to these components to achieve our objectives).

We use the following notation to describe the pipeline. Notations beginning with $L$ and $L^{ss}$ denote $W+$ latents and the corresponding $SS$ latents, respectively. $L^{ss}$ is obtained using the affine transform $\mathcal{A}(\cdot)$, i.e., $L^{ss}=\mathcal{A}(L)$. $I$ denotes a real image (groundtruth) and $S$ denotes a synthesized image from a latent. For example, $S_{t}=G(L_{t})$ describes the generation of an image from a latent, and the subscript refers to the frame at time $t$. $G$ is the original StyleGAN2 generator, but it is supplemented by two style generators, $G_{sf}$ for StyleFlow and $G_{ss}$ for StyleSpace. $E$ is the e4e encoder used for real image inversion into $W+$ space. $Y_{t}$ and $P_{t}$ are the optimal Yaw and Pitch used by $G_{sf}$ at time $t$. Finally, $\alpha_{t}$ is a 32-dimensional vector that controls the facial deformations of the generator $G_{ss}$, given a latent $L_{t}$.

The pre-processing stage generates a set of face images that are stabilized and aligned so that their inversion to latent-space achieves maximal identity preservation and continuity of spatio-temporal head and face motions. The inversion employs the e4e encoder to generate a sequence of latents, $L_{1},\dots,L_{t}$, in $W+$ space corresponding to the sequence of frames. The images generated from these latents serve as the basis for rigid and non-rigid optimizations, replacing the raw image input. They enable controlled editability in conjunction with image loss metrics. It is important to note that optimization constraints are applied in the image space and not in the latent/parameter space. In the third stage, a single latent from the sequence is selected as an ID-latent, $L_{ID}$, for generating the various head-poses of the person in the video.

L_{ID}=\operatorname*{argmax}_{L_{t}}\; ID_{similarity}\big(I_{t},G(L_{t})\big)   (1)

Using a single $L_{ID}$ as the anchor to perform head-pose and facial motion edits not only reduces the data requirement of rendering but also minimizes the identity variation across frames. In a re-enactment setting, the image corresponding to $L_{ID}$ functions as the single source image and the sequence of frames $\{I_{t}\}$ functions as the driving frames. Refer to Appendices C and D for further details on stages 2 and 3, respectively.
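To make Eq. 1 concrete, a minimal sketch of the ID-latent selection is given below. The `generator` and `id_similarity` callables are placeholders rather than the authors' implementation (e.g., the StyleGAN2 generator and a cosine similarity over ArcFace embeddings); Appendix D additionally restricts the candidates to near-frontal, blink-free frames.

```python
import torch

def select_id_latent(latents, frames, generator, id_similarity):
    """Pick the latent whose reconstruction best preserves identity (Eq. 1).

    latents:       per-frame W+ latents L_t (each of shape 18 x 512)
    frames:        the corresponding real frames I_t
    generator:     callable mapping a W+ latent to an image, i.e., G(L_t)
    id_similarity: callable returning a scalar identity score between two images
    """
    best_score, best_idx = -float("inf"), 0
    for t, (L_t, I_t) in enumerate(zip(latents, frames)):
        with torch.no_grad():
            S_t = generator(L_t)             # synthesized frame G(L_t)
        score = id_similarity(I_t, S_t)      # ID_similarity(I_t, G(L_t))
        if score > best_score:
            best_score, best_idx = score, t
    return latents[best_idx], best_idx
```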

The fourth stage finds, for each frame, the head transformation (i.e., $Y_{t}$ and $P_{t}$) in the StyleFlow latent-space needed to render $L_{ID}$ as close as possible to the synthesized image $G(L_{t})$ by minimizing,

\min_{Y_{t},P_{t}}\;\mathcal{L}\{G_{sf}(L_{ID},Y_{t},P_{t})\,,\,G(L_{t})\}.   (2)

$G_{sf}(L_{ID},Y_{t},P_{t})$ results in a new latent, $LH_{t}$, that captures the correct head-pose at time $t$ starting from $L_{ID}$. The fifth stage solves, for each frame, the set of facial deformations $\alpha_{t}$ in $SS$ that, when applied to $LH_{t}^{ss}$, matches $G(L_{t})$ as closely as possible. The result is a set of 32 parameters that achieve $G(L_{t})\approx G_{ss}(LH_{t}^{ss},\alpha_{t})$ through minimizing,

\min_{\alpha_{t}}\;\mathcal{L}\{G_{ss}(LH_{t}^{ss},\alpha_{t})\,,\,G(L_{t})\}.   (3)

Finally, an image is synthesized by applying,

S_{t}=G_{ss}(LH_{t}^{ss},\alpha_{t})=G_{ss}(G_{sf}(L_{ID},Y_{t},P_{t}),\alpha_{t}).   (4)

Thus, the re-enacted frame at time $t$ is synthesized using a fixed $L_{ID}$ and 34 style-controlling parameters (plus the initial Roll angle $R_{t}$ used in pre-processing).
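The per-frame encoding therefore reduces to two nested image-space optimizations anchored on the fixed $L_{ID}$. The sketch below is a simplified illustration of Eqs. 2-4, not the authors' code: `pose_edit`, `G`, `G_ss`, `A`, and `recon_loss` are assumed wrappers around StyleFlow, the StyleGAN2 generator, the StyleSpace-offset generator, the affine transform, and the masked image losses, while the step counts and learning rates are illustrative. `S_target` corresponds to $G(L_{t})$, the e4e reconstruction used as groundtruth.

```python
import torch

def encode_frame(L_ID, S_target, pose_edit, G, G_ss, A, recon_loss,
                 pose_steps=50, attr_steps=100):
    """Per-frame encoding sketch for Eqs. 2-4: find (Y_t, P_t), then alpha_t."""
    # Stage 4 (Eq. 2): search Yaw/Pitch by gradient descent on an image-space loss.
    yaw = torch.zeros(1, requires_grad=True)
    pitch = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([yaw, pitch], lr=0.05)
    for _ in range(pose_steps):
        opt.zero_grad()
        loss = recon_loss(G(pose_edit(L_ID, yaw, pitch)), S_target)
        loss.backward()
        opt.step()
    LH_ss = A(pose_edit(L_ID, yaw.detach(), pitch.detach()))  # head-pose adjusted SS latent

    # Stage 5 (Eq. 3): solve for the 32 StyleSpace offsets alpha_t.
    alpha = torch.zeros(32, requires_grad=True)
    opt = torch.optim.Adam([alpha], lr=0.02)
    for _ in range(attr_steps):
        opt.zero_grad()
        loss = recon_loss(G_ss(LH_ss, alpha), S_target)
        loss.backward()
        opt.step()

    # Eq. 4: the frame is reproduced from L_ID, (Y_t, P_t), and alpha_t alone.
    return yaw.item(), pitch.item(), alpha.detach()
```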

3.1 Video Pre-Processing

Face alignment is an important step in face image inversion regardless of whether an encoder or optimization approach is employed, since StyleGAN2 is a fixed-resolution architecture. Temporal consistency of the alignment is critical due to the role each frame plays in our optimizations. Slight misalignments may change the identity or head-pose, or misrepresent facial feature attributes (shape and dynamics). The alignment used in StyleGAN2 depends on the commonly used 68 facial landmarks, including mouth and eye coordinates for warping. However, these undergo dynamic changes in a video clip, which generates jitter and rescaling in face alignment. To avoid the impact of dynamic coordinates, [14] cropped the full face excluding the eye and mouth coordinates. We consider this insufficient to alleviate the combined effects of head-pose and facial motions. Instead, our alignment aims to: (1) completely stabilize the head when head-pose does not change between consecutive frames, so that non-rigid face motions are captured in a maximally aligned form, and (2) rely on inversion to capture the relative head alignment when the head-pose rotates out-of-plane.

We employ [8] for detecting faces and tracking features in a video clip. However, the landmarks are not sufficiently accurate for face alignment over a sequence of frames. Dense optical-flow captures a combination of rigid and non-rigid facial motions. However, since our objective is to only align the rigid head motion between frames, we employ a parametric optical-flow model [9] to register a frame at time $t$ to a key frame $k_{i}$ at time $i$ ($i<t$). When the rigid head motion is small or limited to the 2D plane, the registration is accurate for the duration (occasionally, several tens of frames), but upon out-of-plane head rotation, the registration requires adjusting the key frame to a new $k_{i+1}$. When the head out-of-plane rotation is rapid, consecutive frames may become key frames. (Further details in Appendix B).

3.2 Head-Pose Encoding

Temporally consistent head-pose is challenging to recover and synthesize. Head-pose is represented by three degrees of rotation, Yaw, Pitch, and Roll, computed with respect to a virtual point at the center of the head. There are numerous landmark and mesh-based approaches for estimating head-pose. Instead, we choose an analysis-by-synthesis approach to estimate the closest rendering of a latent to the target image (Eq. 2). StyleFlow proposed an effective system for a single latent-based edit of head-pose by controlling the Yaw and Pitch angles. The Roll angle is a 2D image-based transformation and is relegated to a pre-processing step necessary for face-alignment as required by StyleGAN2.

An important feature of StyleFlow is that the attribute editing direction is dependent and conditioned on the given latent (i.e., it is specific to a person and relevant attributes captured by the generator). This conditional architecture leads to improved disentangling and it also allows continuous parameter editing. Critically, the edit path is non-linear in the latent-space in contrast to the state-of-the-art that relies on linear and fixed directions that apply to all latents.

We re-formulate the head-motion as a head-pose matching problem between a rendered image of the real frame's encoded latent, $L_{t}$, and the rendered image of a rotated $L_{ID}$, which is solved as a minimization problem (Eq. 2). The minimization employs two losses, L2 and LPIPS, to search the Yaw-Pitch space using gradient descent. These losses are computed over a masked area of the face that is based on an 81-landmark model (an extension of the 68-landmark model to include the forehead). However, the eyes, mouth, and eyebrows are excluded in the L2 loss, since these non-rigid areas are not relevant to 3D head rotations. The outcome of this stage is an alignment of $L_{ID}$ to match the head-pose at time $t$, represented by a new latent $LH_{t}$ (in $W+$) that will be further edited to capture the non-rigid motions of the eyebrows, eyes, mouth, and chin.
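A sketch of the masked objective used in this Yaw-Pitch search is shown below, assuming two pre-computed binary masks and an LPIPS callable; how exactly the two losses are masked and weighted is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def headpose_loss(rendered, target, face_mask, rigid_mask, lpips_fn):
    """Masked L2 + LPIPS objective for the Yaw-Pitch search of Eq. 2.

    face_mask:  1 inside the 81-landmark face region (incl. forehead), 0 elsewhere
    rigid_mask: face_mask with the eyes, mouth, and eyebrows zeroed out, since
                these non-rigid regions are irrelevant to 3D head rotation
    lpips_fn:   perceptual-loss callable (e.g., lpips.LPIPS(net='vgg'))
    """
    l2 = F.mse_loss(rendered * rigid_mask, target * rigid_mask)
    perceptual = lpips_fn(rendered * face_mask, target * face_mask).mean()
    return l2 + perceptual
```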

3.3 Facial Attribute Encoding

The facial attribute encoding extends [43], where the authors demonstrate the highly disentangled nature of $SS$ and provide a few StyleSpace indices that have mostly disentangled control over facial attributes. The facial-attribute encoding $\alpha_{t}$ (32 parameters) of each frame is applied to the latent $LH^{ss}_{t}$, which is a transformation of $LH_{t}$ to $SS$ via $LH^{ss}_{t}=\mathcal{A}(LH_{t})$.

3.3.1 Choice of StyleSpace Indices

The StyleSpace indices are analyzed to ensure that maximally disentangled indices, which capture the complex and detailed expressive facial attributes shown in Figs. 1 and 3, are selected. For a specific facial feature $f\in\mathcal{F}$, we score each index $i=(l,c)$ using the index sensitivity $\Gamma_{f,i}$, which measures the change in image space for a unit change in the StyleSpace index. $\Gamma_{f,i}$ is defined as,

\Gamma_{f,i}=\frac{1}{|\{\delta_{k}\}|}\sum_{k}\left\{\frac{\mathcal{L}_{LPIPS}(S_{k}*M,\,S_{k-1}*M)}{|\delta_{k}-\delta_{k-1}|}\right\},   (5)

where $S_{k}=G_{ss}(L^{ss}_{ID}+\delta_{k}\mathbbm{1}_{i})$ is the synthesized image generated using $L_{ID}$ perturbed by $\delta_{k}$ at $SS$ index $i$, $M$ is the binary mask over the facial attribute considered, and $\mathbbm{1}_{i}=\{1$ when $(l,c)=i$; $0$ elsewhere$\}$. We choose $\{\delta_{k}\}$ to be a sequence of successive values with $|\{\delta_{k}\}|$ elements, and the subscript $k$ indicates the iterating index. Additionally, we calculate the index sensitivity over the whole face (i.e., $M$ is a matrix of ones covering the whole face), denoted by $\Gamma_{i}$. Subsequently, we rank the indices based on the $\Gamma_{f,i}$ and $\Gamma_{i}$ values and choose the indices that have a higher $\Gamma_{f,i}$ and a negligible $\Gamma_{i}$ based on simple thresholding. We repeat the scoring on multiple subjects and frames sampled from the dataset and obtain the prominent indices across the sampled data. This novel approach enables the selection of maximally disentangled StyleSpace indices corresponding to the specific facial attribute chosen. The list of facial attributes $\mathcal{F}$ and the set of StyleSpace indices $\mathcal{V}$ thus chosen are tabulated in Tab. 4 in the Appendix.
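A sketch of the index-sensitivity computation is given below; the per-layer list representation of the StyleSpace latent and the averaging over consecutive $\delta_k$ pairs (rather than normalizing by $|\{\delta_k\}|$) are simplifications of this illustration.

```python
import torch

def index_sensitivity(L_ID_ss, index, deltas, G_ss, mask, lpips_fn):
    """Index sensitivity Gamma_{f,i} of Eq. 5 for a single SS index i = (l, c).

    L_ID_ss: StyleSpace latent of the ID-frame as a list of per-layer style vectors
    deltas:  the sequence of successive perturbation values {delta_k}
    mask:    binary mask over the facial attribute (all-ones to obtain Gamma_i)
    """
    l, c = index
    ratios = []
    prev_img, prev_delta = None, None
    for delta in deltas:
        perturbed = [s.clone() for s in L_ID_ss]
        perturbed[l][c] = perturbed[l][c] + delta        # L_ID^ss + delta_k * 1_i
        with torch.no_grad():
            img = G_ss(perturbed)                        # S_k
        if prev_img is not None:
            change = lpips_fn(img * mask, prev_img * mask).mean()
            ratios.append(change / abs(delta - prev_delta))
        prev_img, prev_delta = img, delta
    return torch.stack(ratios).mean()                    # average over the {delta_k} pairs
```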

The significance of our $SS$ index selection process, as opposed to the algorithm proposed in [43], is as follows. We observed that the StyleSpace ($SS$) representation is not unique, i.e., optimizing

\min_{\alpha_{inv_{t}}}\;\mathcal{L}\{G_{ss}(LH_{t}^{ss}+\alpha_{t}+\alpha_{inv_{t}})\,,\,G_{ss}(LH_{t}^{ss})\}   (6)

does not necessarily yield $\alpha_{t}+\alpha_{inv_{t}}\approx 0$. Therefore, as [43] back-propagates to compute the gradient w.r.t. an $SS$ index, the gradients would be less accurate, since the $SS$ indices contributing to an identical facial deformation of two frames can differ (as the representation is not unique). Instead, we use a forward approach, perturbing each index separately and computing the corresponding deformation loss, thus directly computing the true gradient (the sensitivity in image space for a unit change of each $SS$ index), which is more accurate.
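The non-uniqueness can be probed empirically with a small optimization of the form of Eq. 6; the sketch below assumes the same hypothetical `G_ss` and `recon_loss` wrappers as before and reports $\|\alpha_{t}+\alpha_{inv_{t}}\|$.

```python
import torch

def probe_ss_nonuniqueness(LH_ss, alpha, G_ss, recon_loss, steps=200, lr=0.01):
    """Empirical check of Eq. 6: solve for alpha_inv and report ||alpha + alpha_inv||.

    A norm well above zero shows that two distinct sets of SS offsets can render
    (nearly) the same image, i.e., the StyleSpace representation is not unique.
    """
    target = G_ss(LH_ss, torch.zeros_like(alpha)).detach()   # G_ss(LH^ss_t)
    alpha_inv = torch.zeros_like(alpha, requires_grad=True)
    opt = torch.optim.Adam([alpha_inv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = recon_loss(G_ss(LH_ss, alpha + alpha_inv), target)
        loss.backward()
        opt.step()
    return (alpha + alpha_inv).detach().norm().item()
```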

Facial Deformation Attribute Encoding: We compute the optimal encoded latent values, $\alpha_{t}$, that edit facial attributes to capture the facial deformations. $\alpha_{t}$ represents the offset values from $LH^{ss}_{t}$ and is obtained through a per-frame optimization (Eq. 3) over the $SS$ indices, presented in Algorithm 1 in the Appendix. The reconstruction of the latent $L_{t}$ obtained from the e4e encoder is used as the groundtruth in the optimization and denoted by $S_{t}$, while the rendered re-enacted frame during the optimization is denoted by $\hat{S}_{t}$.

Initialization of indices ($LH^{ss}_{t}$): Due to the sparsity of the latent-space and because the optimization is over a multi-dimensional space, the optimization algorithm is highly likely to converge consecutive frames, which are nearby in image-space, onto local minima that are distant in the latent-space. The slight differences in the optimum point of consecutive frames could introduce jitter in re-enactment. Therefore, to bias the algorithm to solve for $\alpha_{t}$ in the vicinity of the previous frame's optimum, we initialize the $SS$ indices we optimize, $i=(l,c)\in\mathcal{V}$ of $LH^{ss}_{t}$, as,

LH^{ss}_{t}(l,c)=LH^{ss}_{t-1}(l,c),\;\forall(l,c)\in\mathcal{V}.   (7)

Index-specific learning rate, $\eta_{f,i}$: We observed that different subjects and indices have different sensitivities $\Gamma_{f,i}$ to a unit change in the StyleSpace (see Sec. E.3). This observation corroborates the non-linear nature of latent editing discussed in StyleFlow. Hence, using the same learning rate across all indices would result in an undue dominance of high-sensitivity indices, generating non-optimal results. Therefore, for each input video and each facial attribute, we compute an index-specific learning rate using,

\eta_{f,i}=\exp\left\{-1.5\,\Gamma_{f,i}\,/\max_{i\in\mathcal{V}_{f}}(\Gamma_{f,i})\right\},   (8)

which was obtained empirically. For each epoch, optimization is done in parallel for all the attributes, and the optimization over the indices corresponding to the gaze is skipped for frames where blinking is detected.
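The sketch below illustrates how the warm start of Eq. 7 and the index-specific learning rates of Eq. 8 plug into the per-frame optimization of Eq. 3, here for the indices of a single facial attribute. The `G_ss` and `total_loss` wrappers, the base learning rate, and the epoch count are assumptions of this sketch.

```python
import torch

def warm_start(LH_ss_t, LH_ss_prev, indices):
    """Eq. 7: copy the previous frame's values into the optimized SS channels of LH^ss_t."""
    for (l, c) in indices:
        LH_ss_t[l][c] = LH_ss_prev[l][c]
    return LH_ss_t

def optimize_alpha(LH_ss_t, S_target, G_ss, total_loss, gammas,
                   epochs=100, base_lr=1.0, gaze_mask=None):
    """Per-frame offsets alpha_t (Eq. 3) with index-specific learning rates (Eq. 8)."""
    eta = torch.exp(-1.5 * gammas / gammas.max())      # one rate per optimized SS index
    alpha = torch.zeros_like(gammas, requires_grad=True)
    for _ in range(epochs):
        if alpha.grad is not None:
            alpha.grad.zero_()
        loss = total_loss(G_ss(LH_ss_t, alpha), S_target)   # losses of Eq. 9
        loss.backward()
        with torch.no_grad():
            step = base_lr * eta * alpha.grad
            if gaze_mask is not None:
                step = step * gaze_mask            # freeze gaze indices on blink frames
            alpha -= step
    return alpha.detach()
```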

Loss Functions: The algorithm is optimized by minimizing over multiple losses. The total loss is defined as,

\mathcal{L}=\mathcal{L}_{m}+\mathcal{L}_{e}+\mathcal{L}_{p}+\mathcal{L}_{ID}+\mathcal{L}_{FP},   (9)

where the loss terms $\mathcal{L}_{ID}$ and $\mathcal{L}_{FP}$ represent the identity loss and the Face-Parsing loss, respectively, and the subscripts $m$, $e$, and $p$ correspond to the losses computed over extracted regions of the {mouth + chin/jaw}, {eyes + eyebrows}, and {pupil}, respectively.

\mathcal{L}_{m}=\mathcal{L}_{LPIPS_{m}}+\mathcal{L}_{L2_{m}}+\mathcal{L}_{IF_{m}},   (10)
\mathcal{L}_{e}=\mathcal{L}_{LPIPS_{e}}+\mathcal{L}_{L2_{e}}+\mathcal{L}_{IF_{e}},   (11)
\mathcal{L}_{p}=\mathcal{L}_{L2_{p}}+\mathcal{L}_{IF\_L2_{p}},   (12)

where $\mathcal{L}_{LPIPS}$, $\mathcal{L}_{L2}$, and $\mathcal{L}_{IF}$ represent the LPIPS loss, L2 loss, and inter-frame loss, respectively. Refer to Sec. E.4 for further details.
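A sketch of how these terms can be assembled is shown below. The exact form of the inter-frame loss $\mathcal{L}_{IF}$ is detailed in Sec. E.4; here a frame-difference formulation is assumed, and `id_loss`, `fp_loss`, the mask keys, and the unit weights are placeholders rather than the authors' settings.

```python
import torch
import torch.nn.functional as F

def region_loss(pred, target, pred_prev, target_prev, mask, lpips_fn, use_lpips=True):
    """Region terms of Eqs. 10-12: LPIPS + L2 + an inter-frame term over a binary mask."""
    l2 = F.mse_loss(pred * mask, target * mask)
    # Assumed inter-frame (IF) term: match the temporal change of the groundtruth.
    inter = F.mse_loss((pred - pred_prev) * mask, (target - target_prev) * mask)
    perceptual = lpips_fn(pred * mask, target * mask).mean() if use_lpips else 0.0
    return l2 + inter + perceptual

def total_loss(pred, target, pred_prev, target_prev, masks, lpips_fn, id_loss, fp_loss):
    """Total objective of Eq. 9: L = L_m + L_e + L_p + L_ID + L_FP."""
    L_m = region_loss(pred, target, pred_prev, target_prev, masks["mouth_chin"], lpips_fn)
    L_e = region_loss(pred, target, pred_prev, target_prev, masks["eyes_brows"], lpips_fn)
    L_p = region_loss(pred, target, pred_prev, target_prev, masks["pupil"], lpips_fn,
                      use_lpips=False)             # Eq. 12 uses only L2-type terms
    return L_m + L_e + L_p + id_loss(pred, target) + fp_loss(pred, target)
```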

3.4 Rendering

Once the encoding is complete, $L_{ID}$ and the time-series of the 35 parameters, $\{\alpha_{t},Y_{t},P_{t},R_{t}\}$, are transmitted to the renderer. To synthesize the re-enactment video, first $LH_{t}$ is obtained from $L_{ID}$ by adjusting for the head-pose using StyleFlow for each frame. Then $LH_{t}$ is transformed to $SS$, and the 32 offsets responsible for the facial attributes, $\alpha_{t}$, are applied to synthesize the image using the generator $G_{ss}$.

\hat{S}_{t}=G_{ss}(LH^{ss}_{t}+\alpha_{t}\mathbbm{1})   (13)
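On the decoding side, the renderer only needs $L_{ID}$ and the per-frame parameter stream. A minimal sketch is given below; `pose_edit`, `A`, `G_ss`, and `apply_roll` are assumed wrappers, and whether the Roll $R_{t}$ is re-applied at render time is an assumption of this sketch (the paper handles Roll in pre-processing).

```python
import torch

def render_video(L_ID, per_frame_params, pose_edit, A, G_ss, apply_roll):
    """Decode the transmitted stream (L_ID + 35 parameters/frame) into frames, per Eq. 13.

    per_frame_params: iterable of (alpha_t, Y_t, P_t, R_t) tuples
    pose_edit:        StyleFlow-style edit, (L_ID, Y, P) -> head-pose adjusted W+ latent
    A:                affine transform W+ -> StyleSpace
    G_ss:             generator applying the 32 SS offsets and synthesizing the image
    apply_roll:       2-D image rotation by the pre-processing Roll angle R_t
    """
    frames = []
    for alpha_t, Y_t, P_t, R_t in per_frame_params:
        with torch.no_grad():
            LH_t = pose_edit(L_ID, Y_t, P_t)     # head-pose adjusted latent
            S_t = G_ss(A(LH_t), alpha_t)         # Eq. 13
        frames.append(apply_roll(S_t, R_t))
    return frames
```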

There exists an inherent quality loss in the initial encoding, as real-world subjects are mostly out-of-domain for StyleGAN, resulting in notable deviations between the e4e-encoded frames and the real frames. [30] proposes fine-tuning StyleGAN's generator around "pivots" to improve the photo-realism of images while maintaining editability. Adapting this concept, we fine-tune only the layers post-StyleSpace, $G_{ss}$, by solving Eq. 14 using $\{LH^{ss}_{t},\alpha_{t}\}$ as pivots (in contrast to $W+$ latents in PTI) with the real frames as reference. The optimization is performed over the entire sequence of frames simultaneously, in contrast to the single-image tuning in PTI.

\min_{\theta}\;\sum_{t}\mathcal{L}\{G_{ss}(LH_{t}^{ss},\alpha_{t};\theta)\,,\,I_{t}\}   (14)
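A sketch of this sequence-level pivotal tuning is shown below, assuming `G_ss` is an `nn.Module` whose post-StyleSpace weights are trainable; in practice gradients may be accumulated per frame to bound memory, and the epoch count and learning rate are illustrative.

```python
import torch

def finetune_post_ss(G_ss, pivots, real_frames, recon_loss, epochs=5, lr=3e-4):
    """PTI-style tuning of the post-StyleSpace generator weights theta (Eq. 14).

    pivots: frozen per-frame pairs (LH^ss_t, alpha_t) obtained from the encoding;
            only G_ss's weights are updated, and the loss is accumulated over the
            whole sequence rather than a single image.
    """
    opt = torch.optim.Adam(G_ss.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = 0.0
        for (LH_ss_t, alpha_t), I_t in zip(pivots, real_frames):
            loss = loss + recon_loss(G_ss(LH_ss_t, alpha_t), I_t)
        loss.backward()
        opt.step()
    return G_ss
```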

4 Experiments and Results

Method res. L1↓ LPIPS↓ $\mathcal{L}_{ID}$↓ PSNR↑ SSIM↑ FID↓ FVD↓ $\rho_{AU}$↑ $\rho_{GZ}$↑ $\rho_{pose}$↑
Wang et al.* $512^2$ 2.42 0.030 0.087 32.8 0.957 12.0 82.3 0.881 0.965 0.983
StyleVid.GAN* $1024^2$ 4.04 0.109 0.104 28.8 0.926 28.8 223.3 0.739 0.884 0.979
Ours* $1024^2$ 1.96 0.026 0.067 34.1 0.960 13.6 79.8 0.899 0.971 0.987
FOMM $256^2$ 3.07 0.036 0.174 31.0 0.932 28.7 140.3 0.710 0.755 0.648
LIA $256^2$ 3.24 0.042 0.164 30.0 0.929 30.2 162.9 0.546 0.693 0.619
fs-vid2vid $512^2$ 5.75 0.093 0.158 25.2 0.900 42.4 359.6 0.571 0.784 0.629
StyleHEAT $1024^2$ 4.13 0.097 0.134 27.6 0.933 25.1 281.9 0.673 0.701 0.763
Ours $1024^2$ 1.99 0.030 0.097 34.2 0.963 15.9 85.2 0.771 0.834 0.880
Ours (ReStyle) $1024^2$ 2.01 0.031 0.099 34.0 0.959 16.9 93.9 0.767 0.831 0.843
Ours – PTIpost $1024^2$ 2.71 0.048 0.127 32.0 0.956 23.2 125.7 0.726 0.819 0.833
Table 1: Quantitative comparison of video re-synthesis against baselines. The top (*), middle, and bottom sections respectively consist of metrics computed for the 6 videos received upon requests to the authors, evaluation against the dataset of 150 videos, and results of ablations. We yield state-of-the-art results at $1024^2$ on all metrics while using only 0.38% of the latent-space parameters of StyleGAN2.

4.1 Dataset and Evaluation

We selected 150 video clips (4K videos) from the video-sharing site www.pexels.com that combine high visual quality with expressive head and facial motions that are present in common low-resolution datasets. Each video contains a single face performing significant face deformations, head motion, and speech. Additional details on the dataset are in Sec. F.1.

Method res. $\mathcal{L}_{ID}$↓ FID↓ FVD↓ $\text{FVD}_{M}$↓ $\rho_{AU+GZ}$↑
FOMM $256^2$ 0.153 77.0 396.8 103.0 0.501
LIA $256^2$ 0.174 82.3 406.0 112.4 0.527
fs-vid2vid $512^2$ 0.202 73.6 445.1 112.7 0.640
StyleHEAT $1024^2$ 0.181 81.0 437.5 109.8 0.667
Ours $1024^2$ 0.095 63.9 386.5 82.3 0.708
Table 2: Quantitative comparison of puppeteering against baselines evaluated across 50 puppet-puppeteer pairs. Our approach achieves the best performance across all metrics.

We compare our results against two state-of-the-art StyleGAN-based models (most relevant): StyleHEAT and StyleVideoGAN; a latent-based model: LIA; and three other state-of-the-art models (keypoint/landmark-based) that facilitate re-enactment: fs-vid2vid, FOMM, and Wang et al. [40]. Publicly available models were used for all algorithms except StyleVideoGAN and [40], for which the authors kindly processed six videos. Note: all algorithms were evaluated at their native resolution using multiple metrics scoring spatial quality, spatio-temporal quality and appearance, and temporal consistency of identity (further details in Sec. F.2).

Figure 3: Qualitative examples yielded through our approach (in addition to Fig. 1). The StyleSpace indices and the optimization procedure were carefully designed such that complex and fine facial details such as lip-pressing, mouth puckering, mouth gaping, and wrinkles around the eyes, mouth, nasal-bridge, and forehead are well-captured.
Figure 4: Qualitative evaluation of puppeteering, where the encoded parameters of the puppeteer are applied to the ID-latent of the puppet. It could be observed that even complex facial deformations are transferred well across different identities.

Referring to the top of Tab. 1, we achieve state-of-the-art performance at $1024^2$ with significantly improved re-synthesis results compared to StyleVideoGAN while utilizing only 0.38% of the latent-space parameters used by them (35 vs. $18\times512$ per frame). Further, our encoding scheme outperforms Wang et al. in almost all metrics except FID (yet comparable), despite the metrics of Wang et al. being computed at its native resolution of $512^2$. Moreover, evaluation across the full dataset (middle of Tab. 1) shows that our approach outperforms fs-vid2vid, FOMM, LIA, and StyleHEAT in all scores by large margins. It is critical to note that lower native resolutions [32, 39, 40, 42] significantly favor several metrics, since there is no penalty for loss of detail (e.g., L1, SSIM, FVD, etc.) relative to $1024^2$ metrics. Hence, it is essential to emphasize the qualitative analysis, which more accurately reflects the potential of our framework.

Figs. 1 and 3 illustrate, qualitatively, the capturing of fine facial details such as lip pressing, mouth puckering and gaping, and dynamic wrinkles around the eyes, mouth, nasal-bridge, and forehead, which enhance the photo-realism of the re-enacted videos but are not necessarily captured by the metrics (see Figs. 10 and 11 and the videos on the project page for more examples). To the best of our knowledge, such fine expressive details were not explicitly addressed by previous research.

Method Vid.1 Vid.2 Vid.3 Vid.4 Vid.5
StyleFlow 46.7 41.3 33.7 17.0 38.5
Ours 16.0 19.9 16.3 11.5 21.9
Table 3: Mean head-pose loss (↓) comparison between straightforward head-pose adjustment using StyleFlow vs. our approach on a few sample videos.

Similarly, as shown in Tab. 2, we achieve the best puppeteering results across all metrics considered. Further, Fig. 4 demonstrates the versatility of our method as even complex facial attribute deformations (e.g., lip pressing, puckering, wrinkles) of the driving frames are transferred successfully through the proposed framework.

4.2 Ablation Study

As ablations, we study several design choices in our pipeline, namely: the use of a different encoder, the significance of the head-pose encoding approach, using real frames as reference in facial attribute optimization, and the effect of the generator fine-tuning stage.

Using the ReStyle encoder [5] in place of e4e generates comparable results (Tab. 1), implying that the proposed scheme is functional irrespective of the encoder, provided that the inversion is within the editable domain of the latent-space.

Further, using real frames $\{I_{t}\}$ as reference for the facial-attribute encoding optimization (Sec. 3.3.1) instead of the synthesized frames $\{S_{t}\}$ resulted in visually sub-optimal results, requiring us to abandon tighter pixel-level metrics such as $\mathcal{L}_{L2}$, which are essential in capturing fine facial details such as wrinkles and gaze. Hence, we opted to use $\{S_{t}\}$ for the optimization stage. We suspect this behavior is caused by the natural noise present in real images, to which the StyleSpace optimization might be sensitive.

Even though StyleFlow is capable of directly generating a head-pose adjusted latent given $\{Y_{t},P_{t}\}$, the head-pose estimates (using OpenFace) for a video stream are not sufficiently accurate to render using StyleFlow, resulting in inaccurate poses and significant jitter. Our synthesis-based optimization approach, based on losses in image-space, generates more accurate head-pose images consistent with the reference frames (see Tab. 3).

While the re-synthesis results without the fine-tuning stage (Ours – PTIpost in Tab. 1) still outperform fs-vid2vid, FOMM, LIA, StyleVideoGAN, and StyleHEAT in almost all scores, the fine-tuning stage improves photo-realism, which is well reflected in the corresponding performance gains. This is justifiable given the tendency of real-world subjects to be out of the domain of StyleGAN and the inherent loss of the encoder.

4.3 Limitations

Despite the promising results, the proposed approach has a few limitations. As the pipeline is based on the StyleGAN2 architecture, it inherits the limitations of StyleGAN2 and its inversion methods (e.g., fixed resolution, alignment requirements, texture sticking). While StyleGAN3 could alleviate the issue of texture sticking, we chose StyleGAN2 for its richer and better disentangled latent-space, which is well structured and expressive for editing, as opposed to StyleGAN3 [6]. Further, the encoding pipeline is sensitive to occlusions, resulting in visual artifacts in the synthesized images. Additionally, certain scenarios with extreme facial deformations and profile views can still be challenging, which stems from the low representation of such cases in the FFHQ dataset used for training StyleGAN2.

5 Conclusion

We extend StyleGAN2's photo-realism and the disentanglement of its StyleSpace spatio-temporally to propose a novel end-to-end pipeline for latent-based expressive face video encoding, which enables high-fidelity ($1024^2$) video re-enactment using a single $W+$ latent and 35 parameters per frame. Our algorithm achieves state-of-the-art performance while using a fraction (0.38%) of the parameters used by StyleGAN latent-based approaches. To the best of our knowledge, we are the first (1) to automate latent-space editing (previously used merely to generate plausible facial edits) to capture extremely fine, rich, and complex facial deformations, and (2) to propose an extremely compact latent-based face video encoding scheme based on StyleGAN2 enabling re-enactment. The negative societal impact is discussed in Appendix G.

References

  • [1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4432–4441, 2019.
  • [2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8296–8305, 2020.
  • [3] Rameen Abdal, Peihao Zhu, Niloy J Mitra, and Peter Wonka. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (TOG), 40(3):1–21, 2021.
  • [4] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Only a matter of style: Age transformation using a style-based regression model. arXiv preprint arXiv:2102.02754, 2021.
  • [5] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Restyle: A residual-based stylegan encoder via iterative refinement. arXiv preprint arXiv:2104.02699, 2021.
  • [6] Yuval Alaluf, Or Patashnik, Zongze Wu, Asif Zamir, Eli Shechtman, Dani Lischinski, and Daniel Cohen-Or. Third time’s the charm? image and video editing with stylegan3. arXiv preprint arXiv:2201.13433, 2022.
  • [7] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit H Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. arXiv preprint arXiv:2111.15666, 2021.
  • [8] Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. Openface 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 59–66. IEEE, 2018.
  • [9] M.J. Black and Y. Yacoob. Recognizing facial expressions in image sequences using local parameterized models of image motion. International Journal of Computer Vision 25, 23–48, 1997.
  • [10] Stella Bounareli, Vasileios Argyriou, and Georgios Tzimiropoulos. Finding directions in gan’s latent space for neural face reenactment. arXiv preprint arXiv:2202.00046, 2022.
  • [11] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. CVPR, 2019.
  • [12] Chris Donahue, Zachary C Lipton, Akshay Balsubramani, and Julian McAuley. Semantically decomposing the latent spaces of generative adversarial networks. arXiv preprint arXiv:1705.07904, 2017.
  • [13] Michail Christos Doukas, Stefanos Zafeiriou, and Viktoriia Sharmanska. Headgan: One-shot neural head synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14398–14407, 2021.
  • [14] Gereon Fox, Ayush Tewari, Mohamed Elgharib, and Christian Theobalt. Stylevideogan: A temporal generative model using a pretrained stylegan. arXiv preprint arXiv:2107.07224v1, 2021.
  • [15] Ohad Fried, Ayush Tewari, Michael Zollhöfer, Adam Finkelstein, Eli Shechtman, Dan B Goldman, Kyle Genova, Zeyu Jin, Christian Theobalt, and Maneesh Agrawala. Text-based editing of talking-head video. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019.
  • [16] Manjary P Gangan, K Anoop, and VL Lajish. Distinguishing natural and computer generated images using multi-colorspace fused efficientnet. Journal of Information Security and Applications, 68:103261, 2022.
  • [17] Jiahao Geng, Tianjia Shao, Youyi Zheng, Yanlin Weng, and Kun Zhou. Warp-guided gans for single-photo facial animation. ACM Transactions on Graphics (TOG), 37(6):1–12, 2018.
  • [18] Kuangxiao Gu, Yuqian Zhou, and Thomas Huang. Flnet: Landmark driven fetching and learning network for faithful talking facial animation synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10861–10868, 2020.
  • [19] Shanyan Guan, Ying Tai, Bingbing Ni, Feida Zhu, Feiyue Huang, and Xiaokang Yang. Collaborative learning for faster stylegan embedding. arXiv preprint arXiv:2007.01758, 2020.
  • [20] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. arXiv preprint arXiv:2004.02546, 2020.
  • [21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [22] Felix Juefei-Xu, Run Wang, Yihao Huang, Qing Guo, Lei Ma, and Yang Liu. Countering malicious deepfakes: Survey, battleground, and horizon. International Journal of Computer Vision, pages 1–57, 2022.
  • [23] Wonjun Kang, Geonsu Lee, Hyung Il Koo, and Nam Ik Cho. One-shot face reenactment on megapixels. arXiv preprint arXiv:2205.13368, 2022.
  • [24] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
  • [25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
  • [26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  • [27] Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agarwal, Jens Fursund, and Hao Li. pagan: real-time avatars using dynamic textures. ACM Transactions on Graphics (TOG), 37(6):1–12, 2018.
  • [28] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
  • [29] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2287–2296, 2021.
  • [30] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. arXiv preprint arXiv:2106.05744, 2021.
  • [31] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9243–9252, 2020.
  • [32] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in Neural Information Processing Systems, 32:7137–7147, 2019.
  • [33] Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhofer, and Christian Theobalt. Stylerig: Rigging stylegan for 3d control over portrait images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6142–6151, 2020.
  • [34] Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N Metaxas, and Sergey Tulyakov. A good image generator is what you need for high-resolution video synthesis. arXiv preprint arXiv:2104.15069, 2021.
  • [35] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. ACM Transactions on Graphics (TOG), 40(4):1–14, 2021.
  • [36] Soumya Tripathy, Juho Kannala, and Esa Rahtu. Single source one shot reenactment using weighted motion from paired feature points. arXiv preprint arXiv:2104.03117, 2021.
  • [37] Rotem Tzaban, Ron Mokady, Rinon Gal, Amit Bermano, and Daniel Cohen-Or. Stitch it in time: Gan-based facial editing of real videos. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
  • [38] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
  • [39] Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Bryan Catanzaro, and Jan Kautz. Few-shot video-to-video synthesis. In NeurIPS, 2019.
  • [40] Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10039–10049, 2021.
  • [41] Xin Wang, Hui Guo, Shu Hu, Ming-Ching Chang, and Siwei Lyu. Gan-generated faces detection: A survey and new perspectives. arXiv preprint arXiv:2202.07145, 2022.
  • [42] Yaohui Wang, Di Yang, Francois Bremond, and Antitza Dantcheva. Latent image animator: Learning to animate images via latent space navigation. arXiv preprint arXiv:2203.09043, 2022.
  • [43] Zongze Wu, Dani Lischinski, and Eli Shechtman. Stylespace analysis: Disentangled controls for stylegan image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12863–12872, 2021.
  • [44] Jianjin Xu, Zheyang Xiong, and Xiaolin Hu. Frame difference-based temporal loss for video stylization. arXiv preprint arXiv:2102.05822, 2021.
  • [45] Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. Styleheat: One-shot high-resolution editable talking face generation via pretrained stylegan. arXiv preprint arXiv:2203.04036, 2022.
  • [46] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 325–341, 2018.
  • [47] Egor Zakharov, Aleksei Ivakhnenko, Aliaksandra Shysheya, and Victor Lempitsky. Fast bi-layer neural synthesis of one-shot realistic head avatars. In European Conference on Computer Vision, pages 524–540. Springer, 2020.
  • [48] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9459–9468, 2019.
  • [49] Lei Zhang and Chris Pollett. Facial expression video synthesis from the StyleGAN latent space. In Xudong Jiang and Hiroshi Fujita, editors, Thirteenth International Conference on Digital Image Processing (ICDIP 2021), page 7, Singapore, Singapore, June 2021. SPIE.
  • [50] Mingxu Zhang, Hongxia Wang, Peisong He, Asad Malik, and Hanqing Liu. Exposing unseen gan-generated image using unsupervised domain adaptation. Knowledge-Based Systems, 257:109905, 2022.
  • [51] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • [52] Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3661–3670, 2021.

Appendix

Appendix A Overview

The outline of the appendix is as follows.

  • Sec. B: Detailed steps on alignment in the pre-processing stage

  • Sec. C: Discussion on the GAN inversion stage

  • Sec. D: Additional details on identity-latent selection

  • Sec. E: Illustrated explanations of noteworthy sections of facial attribute encoding

  • Sec. F: Further details and examples of experiments, results, and limitations of the proposed framework

  • Sec. G: Discussion on Potential Negative Societal Impact

Figure 5: Key frames in a video sequence with head out-of-plane rotation, where the head moves to near profile and then back.

Appendix B Video Pre-Processing: Alignment

The alignment carried out in the pre-processing stage could be elaborated further using the three steps below.

  1. Detect eye blinking and compensate for its effect on the landmark location of the eyes. This improves StyleGAN2-based alignment by removing the sensitivity to eye-shape change during blinking.

  2. Registration of the face between a frame and a key frame uses a parameterized affine optical-flow model of the head [9], excluding the non-rigid face features (eyebrows, eyes, and mouth). The over-constrained optical-flow model is very effective at stabilizing the face between consecutive frames unless there are changes in the Yaw/Pitch of the head. We employ the mean L2 distance to automatically determine the quality of the inter-frame alignment over the rigid parts of the face (i.e., compute the residual error in RGB values of face stabilization). A mean distance beyond a fixed threshold indicates that the affine motion model is not successful at stabilizing the rigid part of the face, triggering step (3).

  3. Key frame change, which forces a new key frame to be the basis for future frames' face stabilization (aligned according to step (1)).

For optical-flow head registration, a new key frame is triggered when the mean RGB registration error over the face (excluding the eye, mouth, and eyebrow areas) exceeds 45.0 (if the inter-frame Yaw and Pitch change is less than $2^{\circ}$) or 30.0 (if the inter-frame Yaw or Pitch change exceeds $2^{\circ}$). The objective is to avoid forcing face registration when the head is moving out-of-plane. Instead, a change in the key frame is triggered, allowing the StyleGAN2 encoder to capture the new head-pose. Fig. 5 depicts the key frames from a short sequence where the head moves to near profile and then back.
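The key-frame switching rule therefore reduces to a simple threshold test on the registration residual, shown below with the thresholds quoted above (the function name is illustrative).

```python
def needs_new_keyframe(residual_rgb_error, d_yaw, d_pitch,
                       static_thresh=45.0, rotating_thresh=30.0, angle_thresh=2.0):
    """Key-frame switching rule of Appendix B.

    residual_rgb_error: mean RGB registration error of the affine optical-flow model
                        over the face, excluding the eye, mouth, and eyebrow areas
    d_yaw, d_pitch:     inter-frame Yaw/Pitch change in degrees
    Returns True when the affine model fails to stabilize the rigid face region,
    so a new key frame is triggered and the encoder captures the new head-pose.
    """
    rotating = abs(d_yaw) > angle_thresh or abs(d_pitch) > angle_thresh
    threshold = rotating_thresh if rotating else static_thresh
    return residual_rgb_error > threshold
```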

Appendix C GAN Inversion

Figure 6: Few examples of isolated instances where the e4e encoder fails. The examples depict: (a) poor identity, (b) incorrect gaze, (c) inability to capture extreme mouth movements, (d) deformations caused due to occlusions, (e) visual artifacts, and (f) flaws in facial features captured (open eyes while closed in real).

Two factors were considered in choosing an appropriate GAN inversion method: (1) faithful representation of the given image (i.e., minimal reconstruction loss), and (2) the ability to facilitate latent-space edits. [35] suggested that there exists a trade-off between these two factors, i.e., distortion and editability. Generally, inversion is done using a trained encoder and/or an optimization framework. While the former has better editability, it has a comparatively high reconstruction loss, and vice versa. Recently, [30] proposed to bridge the gap between the two by fine-tuning the generator, but this adds computational and information-transmission costs. We chose the e4e encoder, which was designed to facilitate the inversion of real images in proximity to the regions StyleGAN was trained on, thus mitigating the trade-off.

The e4e encoder [35], while producing state-of-the-art results in GAN inversion of real images, has a few failure cases. For certain subjects (e.g., Fig. 6 (a)), the identity of the encoded image deviates considerably from the real frame. In such cases, since we perform the inversion per-frame, the identity also tends to change across the frames of a single video clip. The identity change across frames could be due to poor convergence of the encoder resulting from a higher per-frame loss caused by the poor identity. Additionally, there exist cases where the e4e encoder fails to capture certain facial attributes successfully (e.g., Fig. 6 (b) and (c)), which could be due to the low representation of complex features in the StyleGAN2 training dataset (FFHQ). Further, certain visual artifacts and deformations tend to appear in cases similar to the examples shown in Fig. 6 (d), (e), and (f), which could be caused by occlusions (d) and the noisiness in the neighborhood of the inverted $W+$ latent.

However, the impact of most of these issues on the re-synthesis is mitigated as (1) we anchor our deformations with respect to a single ID-frame that has the highest identity match with the real and (2) utilize a generator fine-tuning stage (PTIpost) to minimize the identity disparity between the real and synthesized frames.

Appendix D Identity-Latent Selection

Figure 7: Identifying $L_{ID}$ is based on identity matching using ArcFace. (a) depicts the identity similarity scores computed between the encoded and real frames. In this case, (b) the best identity is at frame 21 while (c) the worst is at frame 172.

The per-frame inversion creates a series of latents. Depending on the extent of head motion, deformation in StyleGAN2 space is likely. Therefore, the choice of the ID-frame is of great significance, as it serves as the base identity for the face and head-pose deformations across the entire sequence of frames. Hence, we first use ArcFace [11] to compute the similarity between the source and the reconstructed images of the face, and then the latent that (1) is the closest face-match, (2) shows a near-frontal view of the person, and (3) contains no blink is chosen as the representative $L_{ID}$, the basis for re-enactment. An example plot depicting the variation of the identity similarity (computed based on ArcFace) is given in Fig. 7 (a), and the corresponding best and worst ID-frame candidates based on our criteria are shown in Fig. 7 (b) and (c).

Appendix E Facial Attribute Encoding

E.1 Head-Pose Encoding

The flow of the head-pose encoding is illustrated in Fig. 8. Moreover, to evaluate the significance of our optimization-based head-pose encoding approach, we compared our results post head-pose adjustment against the straightforward use of StyleFlow with the $\{Y_{t},P_{t}\}$ parameters computed using [8]. While quantitative results on sample videos are provided in Tab. 3, please refer to the supplementary videos for qualitative comparisons. It can be seen that our approach captures the head-pose well and has significantly lower jitter compared to the straightforward approach with StyleFlow.

E.2 Choice of StyleSpace Indices

The significance of the StyleSpace index selection process, as opposed to the algorithm proposed in [43], is as follows. We observed that the StyleSpace ($SS$) representation is not unique, i.e., optimizing

$\min_{\alpha_{inv_{t}}}\;\mathcal{L}\{G_{ss}(LH_{t}^{ss}+\alpha_{t}+\alpha_{inv_{t}}),\,G_{ss}(LH_{t}^{ss})\}$   (15)

does not necessarily yield $\alpha_{t}+\alpha_{inv_{t}}\approx 0$. Therefore, when [43] back-propagates to compute the gradient w.r.t. an $SS$ index, the gradients are less accurate, since the $SS$ indices contributing to an identical facial deformation in two frames can differ (as the representation is not unique). Instead, we use a forward approach, perturbing each index separately and computing the corresponding deformation loss, thus directly computing the true gradient (the sensitivity in the image space to a unit change of each $SS$ index), which is more accurate.
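A sketch of this forward, perturbation-based sensitivity computation is given below. It assumes the StyleSpace latent is held as a list of per-layer tensors, and that `generator_ss` and `loss_fn` (an image distance such as L2 or LPIPS) are supplied as callables; the perturbation magnitude `delta` is illustrative.

```python
import torch

@torch.no_grad()
def index_sensitivity(LH_ss, generator_ss, indices, loss_fn, delta=1.0):
    """Perturb each StyleSpace index (layer l, channel c) by `delta`, render,
    and measure the resulting image-space deformation against the base render."""
    base = generator_ss(LH_ss)
    gamma = {}
    for (l, c) in indices:
        perturbed = [s.clone() for s in LH_ss]   # copy the per-layer StyleSpace tensors
        perturbed[l][..., c] += delta            # unit-style perturbation of one channel
        gamma[(l, c)] = loss_fn(generator_ss(perturbed), base).item()
    return gamma
```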

In Fig. 12, we illustrate the facial deformations corresponding to the manipulation of each of the 32 StyleSpace indices tabulated in Tab. 4. A pair of images marked as $(l,c)$: $+/-$ is included for each StyleSpace index $(l,c)\in\mathcal{V}$, denoting the sign of the perturbation added to the respective StyleSpace index.

Facial Attribute, $\mathcal{F}$ | StyleSpace Indices, $\mathcal{V}$
Mouth | {6: 113, 202, 214, 259, 378, 501}, {11: 6, 41, 78, 86, 313, 361, 365}, {8: 17, 387}, {14: 12}, {15: 45}
Chin/Jaw | {5: 50, 505}, {6: 131}, {8: 390}
Eyes | {9: 63}, {11: 257}, {12: 82, 414}, {14: 239}, {17: 28}
Eyebrows | {8: 6, 28}, {9: 30}, {11: 320}
Gaze | {9: 409}
Table 4: StyleSpace indices corresponding to the deformation of facial attributes. The indices take the form $\{l: c_{1},c_{2},\dots\}$, where $l$ and $c$ denote the layer index and channel index of the StyleSpace, respectively.
Refer to caption
Figure 8: The per-frame head-pose optimization flow. The head-motion is re-formulated as a head-pose matching problem between a rendered image of the real frame's encoded latent, $L_{t}$, and the rendered image of a rotated $L_{ID}$ (defined as $LH_{t}$), which is solved as a minimization problem employing L2 and LPIPS losses to search the Yaw-Pitch space using gradient descent. These losses are computed over a masked area of the face excluding the eyes, mouth, and eyebrows, since these non-rigid areas are not relevant to 3D head rotations.
Refer to caption
Figure 9: (a) Index sensitivity and the corresponding (b) index specific learning rate. This figure represents values computed for an example subject over the {mouth+chin/jaw} indices.

E.3 Index Specific Learning Rate

The variation of the index sensitivity computed over the indices corresponding to the {mouth + chin/jaw} attributes is shown in Fig. 9 (a). The significant variations seen in the plot make it evident that the index sensitivities cannot simply be ignored and hence the indices cannot be treated identically during optimization. In order to alleviate the dominance of indices with a higher index sensitivity, we compute an index-specific learning rate, $\eta_{f,i}$, from the sensitivity $\Gamma_{f,i}$ as specified in Eq. 5 of the main-paper. The $\eta_{f,i}$ corresponding to the indices in Fig. 9 (a) are depicted in Fig. 9 (b). It can be seen that the $\eta_{f,i}$ of indices having a higher $\Gamma_{f,i}$ is comparatively lower than that of indices with a lower $\Gamma_{f,i}$, thus effectively alleviating the dominance.
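The exact form of the learning-rate mapping is Eq. 5 of the main paper, which is not reproduced here; the sketch below simply assumes an inverse-proportional normalization of the sensitivities $\Gamma_{f,i}$ (e.g., as returned by `index_sensitivity` above) to convey the idea that more sensitive indices receive smaller steps. The base rate and normalization are illustrative assumptions.

```python
import numpy as np

def index_learning_rates(gamma, base_lr=0.1):
    """Hypothetical per-index learning rates that down-weight highly sensitive
    indices: higher sensitivity -> lower rate (not the paper's exact Eq. 5)."""
    keys = list(gamma.keys())
    g = np.array([gamma[k] for k in keys], dtype=np.float64)
    eta = base_lr * g.mean() / (g + 1e-8)
    return dict(zip(keys, eta))
```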

E.4 Details on Optimization

Inputs:
  • Head-pose adjusted $W+$ latents: $LH_{t}$ and $LH_{t-1}$
  • Target frames: $S_{t-1}$ and $S_{t}$
  • Rendered frames: $\hat{S}_{1}$ and $\hat{S}_{t-1}$
  • StyleSpace of $t-1$: $LH^{ss}_{t-1}$ and $\alpha_{t-1}$
  • Optimizer $F^{\prime}$, number of epochs $N$, and $G_{ss}$

Initialization:
  • Obtain the StyleSpace latent, $LH^{ss}_{t}=\mathcal{A}(LH_{t})$
  • Initialize $LH^{ss}_{t}(l,c),\,\forall i=(l,c)\in\mathcal{V}$
  • Initialize $\alpha_{t}=[0,\dots,0]$
  • Compute the index-specific learning rates, $\eta_{f,i}$, and set $\eta=\{\eta_{f,i};\,\forall f\in\mathcal{F},\,i\in\mathcal{V}\}$

Optimization:
for n = [1:N] do
       $\hat{S}_{t}=G_{ss}\{LH^{ss}_{t}+\alpha_{t}\mathbbm{1}_{i}\}$
        where $\mathbbm{1}_{i}=\{1$ when $(l,c)\in\mathcal{V}$; $0$ elsewhere$\}$
       $\mathcal{L}=\mathcal{L}\{\hat{S}_{1},\,\hat{S}_{t-1},\,\hat{S}_{t},\,S_{t-1},\,S_{t}\}$
       $\alpha_{t}\leftarrow\alpha_{t}-\eta\,F^{\prime}(\nabla_{\alpha_{t}}\mathcal{L},\,\alpha_{t})$
end for

Output:
  • 32-dimensional $\alpha_{t}$

Algorithm 1 Optimization Flow for frame t

The face-deformation attribute-encoding algorithm in Sec. 3.3 of the main paper is presented in Algorithm 1. The AdamW [26] optimizer with AMSGrad [28] was utilized with an initial learning rate of $\eta=\{\eta_{f,i};\,\forall f\in\mathcal{F},\,i\in\mathcal{V}\}$, $(\beta_{1},\,\beta_{2})=(0.9,\,0.999)$, and $\epsilon=1e^{-8}$. The optimization was run for 100 epochs ($N=100$), and the learning rate was decayed every 10 epochs with a decay factor of 0.8 using a learning-rate scheduler for improved convergence. The optimization took approximately 1 min./frame on a single GTX 1080 Ti GPU. Additional details on the loss terms defined in Eqs. 9, 10, 11, and 12 of the main-paper are given below.
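A PyTorch sketch of this per-frame optimization loop, with the optimizer settings above, is given below. The per-index learning rates are realized as parameter groups (e.g., the `eta` dictionary from the earlier sketch); `generator_ss` and `loss_fn` (a closure combining the loss terms over $\hat{S}_{1}$, $\hat{S}_{t-1}$, $\hat{S}_{t}$, $S_{t-1}$, $S_{t}$) are assumed callables, and `apply_alphas` is a hypothetical helper implementing $LH^{ss}_{t}+\alpha_{t}\mathbbm{1}_{i}$.

```python
import torch
import torch.nn.functional as F

def apply_alphas(LH_ss, indices, alphas):
    """Add each alpha only at its (layer, channel) StyleSpace index."""
    edited = [s.clone() for s in LH_ss]
    for (l, c), a in zip(indices, alphas):
        onehot = F.one_hot(torch.tensor(c), edited[l].shape[-1]).to(edited[l])
        edited[l] = edited[l] + a * onehot
    return edited

def encode_frame(LH_ss_t, indices, eta, generator_ss, loss_fn,
                 n_epochs=100, device="cuda"):
    """Optimize the 32-dimensional alpha_t for one frame (sketch of Algorithm 1)."""
    alphas = [torch.zeros(1, device=device, requires_grad=True) for _ in indices]
    groups = [{"params": [a], "lr": eta[i]} for a, i in zip(alphas, indices)]
    opt = torch.optim.AdamW(groups, betas=(0.9, 0.999), eps=1e-8, amsgrad=True)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.8)

    for _ in range(n_epochs):
        opt.zero_grad()
        S_hat_t = generator_ss(apply_alphas(LH_ss_t, indices, alphas))
        loss = loss_fn(S_hat_t)          # combined spatial + inter-frame losses
        loss.backward()
        opt.step()
        sched.step()
    return torch.cat(alphas).detach()    # the encoded 32-dimensional alpha_t
```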

$\bm{\mathcal{L}_{LPIPS}}$: The LPIPS loss [51], which is known to capture perceptual similarities well [19, 29], was used to capture the structural details of the facial attributes between $S_{t}$ and $\hat{S}_{t}$. Nevertheless, $\mathcal{L}_{LPIPS}$ was not used in solving for the gaze ($\mathcal{L}_{p}$) as it is invariant to slight spatial changes and hence introduces a slight jitter when used.

$\bm{\mathcal{L}_{L2}}$: This denotes the L2 norm between $S_{t}$ and $\hat{S}_{t}$, and enables precise reconstruction (e.g., in the case of gaze).

$\bm{\mathcal{L}_{ID}}$: To mitigate the risk of changing the identity of the subject across frames while optimizing over the latent space, the identity loss is included as a regularization term. It is computed between $\hat{S}_{1}$ and $\hat{S}_{t}$.

$\bm{\mathcal{L}_{FP}}$: As we optimize over 32 indices in parallel, we noted occasional unwarranted nose, mouth, and chin/jaw deformations. To discourage such deformations, we use the Face-Parsing loss, the L2 norm of the difference between the masked face-parsing scores [46] of the rendered and target frames, instead of a facial-landmark coordinate loss (e.g., [8]). Face-parsing scores facilitate the gradient flow through the optimization and are more precise and stable across frames.

$\mathcal{L}_{FP}=\lVert FP(\hat{S}_{t})\ast M-FP(S_{t})\ast M\rVert_{2}$   (16)

where the function $FP(\cdot)$ yields face-parsing scores and $M$ denotes the binary mask of the face.
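A direct rendering of Eq. 16 in code is shown below; `FP` (the face-parsing network) and `mask` are assumed inputs, with the parsing scores shaped so that the mask broadcasts over them.

```python
import torch

def face_parsing_loss(FP, S_hat_t, S_t, mask):
    """Eq. 16: masked L2 norm between face-parsing score maps of the rendered
    and target frames. `FP` returns per-pixel parsing scores (e.g., B x C x H x W)."""
    return torch.norm(FP(S_hat_t) * mask - FP(S_t) * mask, p=2)
```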

$\bm{\mathcal{L}_{IF}}$: The inter-frame loss is a derivation of the Frame Difference-Based (FDB) loss proposed in [44], used to enforce temporal coherence between frames. We minimize this loss jointly with the other spatial losses to avoid enforcing temporal continuity a posteriori. Provided the target video is temporally coherent, this loss builds on the idea that the image-space and feature-space differences between consecutive frames embed the temporal coherence. We use LPIPS and L2 losses to compute the differences in the feature and image spaces, respectively.

$\mathcal{L}_{IF}=\mathcal{L}_{IF\_LPIPS}+\mathcal{L}_{IF\_L2}$   (17)
$\mathcal{L}_{IF\_*}=\mathcal{L}_{*}\{S_{t},S_{t-1}\}-\mathcal{L}_{*}\{\hat{S}_{t},\hat{S}_{t-1}\}$   (18)

where * denotes either LPIPS or L2.
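The sketch below renders Eqs. 17 and 18 directly, keeping the plain difference form of Eq. 18. The `perceptual` argument is assumed to be an LPIPS module (e.g., `lpips.LPIPS(net='alex')`), and the frames are assumed to be tensors in [-1, 1] as LPIPS expects.

```python
import torch

def inter_frame_loss(S_t, S_tm1, S_hat_t, S_hat_tm1, perceptual):
    """Eqs. 17-18: match the consecutive-frame differences of the synthesized
    pair to those of the target pair in feature (LPIPS) and image (L2) space."""
    l_lpips = perceptual(S_t, S_tm1).mean() - perceptual(S_hat_t, S_hat_tm1).mean()
    l_l2 = (S_t - S_tm1).pow(2).mean() - (S_hat_t - S_hat_tm1).pow(2).mean()
    return l_lpips + l_l2
```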

Appendix F Experiments and Results

F.1 Dataset

As stated in Sec. 4.1 of the main-paper, we compose a dataset consisting of 4K-resolution video clips sourced from www.pexels.com. The videos were chosen to include diverse subjects of various ethnicities, age groups, and facial geometries, performing significant head-pose movements and facial deformations (both expressions and speech). The results were computed on 150 videos chosen from the dataset, with a mean of 304 frames, a minimum of 100 frames, and a maximum of 1000 frames per video.

F.2 Evaluation Metrics

Refer to caption
Figure 10: Additional qualitative examples of re-synthesis demonstrating the versatility of our algorithm.

The following metrics were used for the quantitative evaluation of our re-enactment videos in comparison with baselines, which are tabulated in Tabs. 1 and 2 of the main-paper.

Mean L1-distance, L1: The per-pixel L1-distance was averaged across pixels, channels, and frames to obtain the score. The pixel values of the input images were in the range of [0,255].

Learned Perceptual Image Patch Similarity Loss, LPIPS: The metric was computed per-frame using the original implementation of [51] with the feature space of AlexNet [25].
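For reference, a sketch of this per-frame computation using the pip `lpips` package (which wraps the original implementation of [51]) is given below; the input frames are assumed to be tensors of shape (1, 3, H, W) scaled to [-1, 1].

```python
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")   # AlexNet feature space, as in the metric above

def lpips_score(frames_real, frames_fake):
    """Per-frame LPIPS averaged over the sequence."""
    with torch.no_grad():
        scores = [loss_fn(r, f).item() for r, f in zip(frames_real, frames_fake)]
    return sum(scores) / len(scores)
```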

Identity Loss, $\mathcal{L}_{ID}$: The identity loss was computed using

$\mathcal{L}_{ID}=1-\langle\phi(S_{t}),\phi(\hat{S}_{t})\rangle$   (19)

where $\phi$ represents the pretrained ArcFace network and $\langle\cdot,\cdot\rangle$ denotes the cosine similarity. While in re-synthesis (Tab. 1 in the main-paper) the loss was computed between the synthesized frame and the real frame, for puppeteering (Tab. 2 in the main-paper) the loss was computed between each frame and the puppet's ID-frame.
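Eq. 19 in code form is shown below; `phi` is assumed to be a pretrained ArcFace network returning one embedding per image, and the embeddings are normalized before taking the cosine similarity.

```python
import torch
import torch.nn.functional as F

def identity_loss(phi, S_t, S_hat_t):
    """Eq. 19: 1 - cosine similarity between ArcFace embeddings."""
    e_real = F.normalize(phi(S_t), dim=-1)
    e_fake = F.normalize(phi(S_hat_t), dim=-1)
    return 1.0 - (e_real * e_fake).sum(dim=-1).mean()
```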

Refer to caption
Figure 11: Additional qualitative examples of puppeteering generated by applying the encoded parameters computed for the puppeteer through our encoding algorithm onto the ID-latent of the puppet.

Peak Signal to Noise Ratio, PSNR: This was computed using the built-in function of Python's scikit-image package on images with pixel values in the range [0, 255].

Fréchet Inception Distance, FID: This metric, which measures the photo-realism of a set of images relative to a reference set, was computed based on the original implementation of [21] with a batch size of 100. Note: the input images are rescaled to $299\times 299$ at the input of the Inception network.

Fréchet Video Distance, FVD: The spatio-temporal perceptual score measured through FVD was computed using the original implementation of [38]. Due to resource limitations, video fragments of 120 frames were scored with a batch size of 8 and averaged to obtain the final FVD score. Note: the frames are rescaled to $224\times 224$ by the algorithm.

Fréchet Video Distance - Mouth, FVDM: Similar to FVD, with the exception of the metric being scored over the masked area of the mouth region.

Action Unit, Gaze, and Pose Correlations, $\rho_{AU}$, $\rho_{GZ}$, $\rho_{pose}$: These metrics measure the time-series correlation between the Action Unit activations, gaze angles, and yaw and pitch angles, respectively, of the synthesized and reference sequences, computed using OpenFace 2.0 [8]. They provide an insight into how well the facial deformations ($\rho_{AU}$), eye motion ($\rho_{GZ}$), and pose ($\rho_{pose}$) are captured by the algorithm in a spatio-temporal sense.
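A simple sketch of such a time-series correlation score is given below, assuming the OpenFace outputs have already been arranged as per-signal arrays (one row per Action Unit, gaze angle, or yaw/pitch angle); the use of Pearson correlation averaged over signals is an assumption for illustration.

```python
import numpy as np

def mean_timeseries_correlation(signals_ref, signals_syn):
    """Average Pearson correlation between reference and synthesized time series.
    Inputs are arrays of shape (num_signals, num_frames)."""
    rhos = []
    for ref, syn in zip(np.atleast_2d(signals_ref), np.atleast_2d(signals_syn)):
        if np.std(ref) > 0 and np.std(syn) > 0:   # skip constant signals
            rhos.append(np.corrcoef(ref, syn)[0, 1])
    return float(np.mean(rhos))
```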

Note: All metrics except FVD were computed per frame and averaged across all frames. Further, except for the identity loss and the correlation metrics, all metrics were computed over a masked region of the reference face in each frame.

F.3 Video Results

The additional examples of video re-synthesis and puppeteering depicted in Fig. 10 and Fig. 11, respectively, reaffirm the versatility of our approach. Video examples comparing against the state-of-the-art approaches can be viewed on the project page. In comparison to our results, visual artifacts, lack of sharpness, and incorrect pose and facial deformations can be observed in the re-synthesis and puppeteering examples of the baseline approaches.

F.4 Limitations

There are multiple scenarios where latent-based video encoding may fail: (1) limitations inherited from StyleGAN2 (e.g., fixed resolution, entanglements, alignment requirements, texture sticking); (2) pre-processing failures where the face is misaligned with respect to StyleGAN2's expectations; (3) extreme facial deformations and profile views, stemming from their low representation in the FFHQ dataset used to train StyleGAN2; (4) possible identity drift when editing with StyleFlow or in StyleSpace; (5) wearables such as eyeglasses, which can be challenging in some cases due to residual latent-space entanglement; and (6) occlusions, to which both latent-space inversion and editing are sensitive.

Appendix G Potential Negative Societal Impact

Since the proposed pipeline successfully captures fine, detailed, and expressive facial attributes, it improves the realism of face re-enactment. Thus, our model could be misused to create re-enactments with ill intent (e.g., defamation), and we strongly oppose such malicious use. Research on the detection of DeepFakes has progressively advanced as well [41, 22, 50, 16], and data from our model could be used to improve such methods, thus reducing the potential negative societal impact.

Refer to caption
Figure 12: Example face deformations resulting from the manipulation of each StyleSpace index $(l,c)\in\mathcal{V}$ in the negative (-) and positive (+) directions. It can be seen that the identity is preserved across all manipulation examples.