
Sample IDs Protection in Vertical Federated Learning

Abstract

In this paper, we propose a framework to pRotect sAmple IDs in Vertical fEderated Learning (Ravel).

1 Introduction

With increasing concerns about data security and user privacy in machine learning, federated learning (FL) [17] has become a promising solution to these challenges. Based on how the sensitive data are distributed among the participating parties, FL can be classified into different categories [31], notable among which are vertical FL and horizontal FL. Different from horizontal FL, where the data are partitioned by examples, vertical FL assumes that the data are partitioned by features (including labels). A typical example of vertical FL is when an online media platform A displays advertisements of an e-commerce company B to its users and charges B for each conversion (e.g., a user clicking the ad and buying the product). In this case, the two parties hold different features for the same set of users: party A has features on the users' media viewing records, while party B has the users' product browsing records and the conversion labels. The conversion labels are not available to party A because the users' buying processes happen entirely on party B's website or mobile app.

If both parties want to jointly train a deep learning model to predict the conversion rate based on the feature-partitioned data, they could sign legal agreements that allow them to share the data with each other. However, due to privacy and business concerns, companies are increasingly unwilling to share their data with other parties. Split learning [14, 29] enables the joint training of deep learning models without sharing sensitive data by splitting the execution of a deep learning model between the parties on a layer-wise basis. In vanilla split learning, the party without labels (called the passive party) sends the computation results of an intermediate layer (called the cut layer), rather than the raw data, to the party with labels (called the active party). The active party then completes the rest of the forward computation, computes the gradients based on the labels, and sends the gradients with respect to the cut layer back to the passive party. The passive party then completes the backpropagation from the cut-layer gradients using the chain rule.
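To make the protocol concrete, below is a minimal single-machine sketch of one vanilla split-learning step for binary classification; the party boundary is simulated by detaching the cut-layer activations, and the layer sizes and synthetic data are illustrative placeholders rather than the architecture used later in the experiments.

import torch
import torch.nn as nn

# Passive party (no labels) and active party (holds labels) own disjoint sub-networks.
passive_net = nn.Sequential(nn.Linear(32, 128), nn.ReLU())
active_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt_passive = torch.optim.SGD(passive_net.parameters(), lr=0.1)
opt_active = torch.optim.SGD(active_net.parameters(), lr=0.1)

x = torch.randn(1024, 32)                  # passive party's raw features
y = (torch.rand(1024, 1) < 0.25).float()   # active party's private labels

# Passive party: forward to the cut layer and send only the activation values.
h = passive_net(x)
h_received = h.detach().requires_grad_(True)   # what the active party sees

# Active party: finish the forward pass, compute the loss, backprop to the cut layer.
loss = loss_fn(active_net(h_received), y)
loss.backward()
g_cut = h_received.grad                    # cut-layer gradients sent back

# Passive party: continue backpropagation locally using the received gradients.
h.backward(g_cut)
opt_passive.step()
opt_active.step()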

At first glance, the process of split learning seems private because only the intermediate computation results at the cut layer, rather than raw features or labels, are communicated between the two parties. However, such a gradient-sharing scheme has been shown to be vulnerable in the setting of horizontal FL: in a recent work, Zhu et al. [33] showed that the central server can recover the raw features and labels of a device using the model parameters and gradients shared by that device. This raises the question of whether the raw data (specifically, the active party's labels) can also be leaked by sharing the gradients of the cut layer in the split learning (vertical FL) setting.

In this work, we first identify a simple but accurate method that uses the norm of the gradients communicated by the active party to uncover the labels in imbalanced datasets. We then discuss several protection techniques to mitigate this issue. Among them, we design a principled approach that directly maximizes the worst-case error of label detection. This theoretically justified technique strategically perturbs the gradient at random before communication. We experimentally demonstrate the effectiveness and competitiveness of the proposed protection method compared to other baselines.
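As a preview of the attack and of the leak AUC metric used later in the experiments, the sketch below scores each example by the 2-norm of its cut-layer gradient; here g_cut and y are fabricated stand-ins for the gradients and private labels (in the spirit of the split-learning sketch above) so that the snippet runs on its own.

import torch
from sklearn.metrics import roc_auc_score

def norm_attack_scores(g_cut: torch.Tensor) -> torch.Tensor:
    # Score each example by the 2-norm of its per-example cut-layer gradient;
    # on imbalanced data, positive examples tend to receive larger gradients.
    return g_cut.norm(dim=1)

y = (torch.rand(1024) < 0.25).float()                       # stand-in private labels
g_cut = torch.randn(1024, 128) * (1.0 + 3.0 * y)[:, None]   # stand-in gradients
leak_auc = roc_auc_score(y.numpy(), norm_attack_scores(g_cut).numpy())
print(f"leak AUC: {leak_auc:.3f}")  # 0.5 = random guess, 1.0 = labels fully leaked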

2 Related Work

Uncovering the raw data. Even though raw data are not shared across different parties in federated learning, they are still not secure when gradients and model parameters are shared among the parties. In the horizontal FL setting, Zhu et al. [33] showed that an honest-but-curious server can uncover the raw features and labels of a participating device by knowing the model architecture, the parameters, and the communicated gradient of the loss on the device's data. Building upon the techniques of Zhu et al. [33], Zhao et al. [32] showed that the ground-truth label of an example can be extracted by exploiting the relationship between the directions of the gradients of the weights connected to the logits of different classes. In this work, we study a different setting in FL: the two-party split learning setting for vertical FL [31], where neither party has access to the model architecture or model parameters on the other party's side. We show that one can recover the ground-truth labels (not the raw input features [30]) by using the magnitude (2-norm) instead of the direction of the gradients.

Privacy protection methods. Because sharing intermediate computation results (model parameters, representations, gradients) might leak the raw data, FL communication needs to proceed in a privacy-preserving manner. There are generally three classes of approaches: 1) cryptography-based methods such as Secure Multi-party Computation [9, 20, 6, 2] and homomorphic encryption [4, 25]; 2) system-based methods such as Trusted Execution Environments [26, 28]; 3) perturbation methods such as randomly perturbing the communicated message [1, 18], shuffling the messages [12, 8], reducing the message's numerical precision, and compressing and sparsifying the message [33]. Our work belongs to the third category: we add random perturbations to the gradients before communication to protect the labels. Many of the randomness-based privacy protection methods were proposed in the domain of horizontal FL, where the aim is to protect the privacy of each example or each device's dataset [1, 13, 19, 18, 5, 27, 22, 24]. In that setting, differential privacy (DP) [10, 11] is used to quantitatively measure a random mechanism's ability to protect the privacy of any individual record. To the best of our knowledge [16], the techniques introduced in this paper are the first randomness-based methods that protect the active party's label privacy, with the goal of maximizing detection error, during split learning in vertical FL.

2.1 Experimental Settings

In this section, we evaluate the proposed gradient perturbation methods on the Criteo (https://www.kaggle.com/c/criteo-display-ad-challenge/data) and Avazu (https://www.kaggle.com/c/avazu-ctr-prediction/data) datasets. Criteo is a real-world large-scale binary classification dataset with approximately 45 million user click records in online advertising. Avazu contains approximately 40 million entries (11 days of click/non-click records of Avazu data). Both datasets are highly imbalanced: 25% of the Criteo data and 17% of the Avazu data are in the positive class. We split each dataset randomly into two parts, with 90% for training and the rest for testing. We train a Wide&Deep model [7] where the passive party holds the embedding layers for the input features and four layers of a 128-unit ReLU-activated multilayer perceptron (the deep part), and the active party holds the last two layers of the deep part and the entire wide part of the model. In every iteration, the passive party sends a mini-batch of 1,024 examples' 128-dimensional cut-layer vectors to the active party, and the active party sends the gradients of the loss with respect to these vectors back to the passive party.

We compare our proposed protection techniques (max_norm and Marvell) with a baseline named iso, in which i.i.d. isotropic Gaussian noise is added to every per-example gradient $g$ in the communicated mini-batch. Every $g_i$ in the batch of size $B$ is perturbed with noise drawn from $\mathcal{N}\!\left(0,\; (s/d)\cdot\max_{i=1}^{B}\|g_i\|_2^2 \, I_{d\times d}\right)$, where $s$ is a parameter to tune.
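The sketch below is a minimal rendition of this baseline under our reading of the description above; the function name and tensor shapes are illustrative.

import torch

def iso_perturb(g: torch.Tensor, s: float) -> torch.Tensor:
    # g: (B, d) per-example cut-layer gradients; s: tuning parameter.
    # Per-coordinate noise variance is (s/d) times the largest squared
    # gradient 2-norm in the mini-batch.
    d = g.shape[1]
    sigma2 = (s / d) * g.norm(dim=1).pow(2).max()
    return g + sigma2.sqrt() * torch.randn_like(g)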

We quantitatively measure both the model's utility and its privacy by the AUC of the ROC: utility by the performance (test) AUC and privacy by the leak AUC of the attack, the latter estimated at every gradient update iteration using the mini-batch examples. A higher leak AUC means the labels are more likely to be leaked through the communicated gradients. A model with a good trade-off between utility and privacy should have a high performance AUC (low test loss) and a leak AUC close to 0.5 (random guess).

2.2 Robustness of Marvell

[Figure 1(a) and 1(b): leak AUC curves; see caption below.]

(c)

Method      Parameter   ACE
Marvell     0.1         0.01170
Marvell     0.2         0.01167
Marvell     0.3         0.01159
Marvell     0.4         0.01187
max_norm    -           0.01184
iso         1.0         0.01178
no_noise    -           0.01148
Figure 1: Figures (a) and (b) show the leak AUC of training with no protection and of Marvell with different values of the lower bound $L \in \{0.1, 0.2, 0.3, 0.4\}$, at the cut layer (layer 4) and at a previous layer (layer 1) respectively, on Criteo. Figure (c): comparison of Marvell and max_norm with the baseline method iso on ACE. In all the figures and the table, no_noise denotes standard training with no gradient perturbation as protection.

In this section, we show the robustness of Marvell in protecting different layers of the model and its effect on the model's reliability.

In Figure 1(a), we plot the leak AUC for every batch of communicated gradients at the cut layer (layer 4) throughout training. With no gradient perturbation, the leak AUC stays consistently around 0.9 through the majority of training. In contrast, when using our proposed technique (Marvell), we see a large reduction in leak AUC. The leak AUC becomes lower (more privacy protection) when we impose a larger lower bound L on the worst-case detection error. In addition, it is natural to ask whether the gradients of the layers before the cut layer can also leak the label information as the passive party keeps backpropagating towards the first layer. In Figure 1(b), we plot the leak AUC of the first layer's activation gradients for the same set of methods shown in Figure 1(a): the leak AUC remains at the same high level for the no-protection method, indicating that earlier layers' gradients can also leak the label information. However, our method with different lower bounds L still significantly reduces the leak AUC, even though the protection was analyzed at the cut layer.

Adaptive Calibration Error (ACE) [21] measures how well a model's predicted probabilities of outcomes reflect the true probabilities of those outcomes. We report the ACE in Figure 1(c) to show that the different perturbation methods have little effect on the reliability of the corresponding model's confidence in its predictions. Marvell and max_norm achieve ACE similar to no_noise, demonstrating our proposed protection methods' robustness with respect to the model's reliability of prediction.
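For concreteness, a rough sketch of how ACE can be computed for binary classification, in the spirit of Nixon et al. [21]: predictions are sorted into equal-mass bins and the average |accuracy - confidence| gap per bin is reported. The bin count and averaging details here are our assumptions rather than the exact configuration used in our evaluation.

import numpy as np

def adaptive_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    # Sort by predicted probability, split into equal-mass bins, and average the
    # per-bin |empirical accuracy - mean confidence| gap.
    order = np.argsort(probs)
    probs, labels = probs[order], labels[order]
    gaps = [abs(labels[idx].mean() - probs[idx].mean())
            for idx in np.array_split(np.arange(len(probs)), n_bins) if len(idx) > 0]
    return float(np.mean(gaps))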

2.3 Trade-off between Utility and Privacy

[Figure 2, panels (a)-(f): (a) norm attack on Criteo, (b) hint attack on Criteo, (c) hint attack on Criteo, (d) norm attack on Avazu, (e) hint attack on Avazu, (f) hint attack on Avazu; see caption below.]
Figure 2: Figure (a) shows the trade-off between leak AUC (norm attack) and test AUC on Criteo. Figures (b) and (c) show the trade-off between leak AUC (hint attack) and test loss and test AUC on Criteo, respectively. Figures (d), (e), and (f) show the corresponding trade-offs on the Avazu dataset. Each point represents the utility and privacy performance under a specific setting (for example, s for iso and sumKL for Marvell). Since max_norm and no_noise have no parameter to tune, they each have only one point in the corresponding figures. We set the number of hints to 5 and use the inner product to compute the pairwise distance for the hint attack.
[Figure 3, panels (a)-(c): (a) leak AUC of norm attack, (b) leak AUC of hint attack, (c) performance AUC; see caption below.]
Figure 3: Figures (a) and (b) show different protection methods' leak AUC under the norm attack and the hint attack, respectively, at the cut layer. Figure (c) shows different protection methods' performance AUC. The x-axis represents the date (ranging from Jan. to Feb. 2020), and the y-axis represents the corresponding daily average utility or privacy.

In this section, we show that Marvell and max_norm can more easily achieve a good trade-off between model performance (utility) and privacy than iso under the norm attack. In addition, we also test our proposed protection methods' effectiveness at preventing label leakage under more extreme scenarios. We introduce an attack named the hint attack, which is powerful but difficult to deploy in reality. The hint attack assumes that the passive party knows some positive instances (serving as hints) in each communicated mini-batch during training (the passive party cannot use the same hints repeatedly in different mini-batches, since this can easily be detected by the active party), and that the passive party can compute the pairwise distance (for example, inner product or cosine similarity) between the hints and the unknown-label instances based on their corresponding gradients. An instance is assigned a positive label if it is close to any hint. In our experiments, the hint attack with the number of hints set to 5 pushes the leak AUC close to 1.0 on both the Criteo and Avazu datasets. It is worth mentioning that the hint attack with the inner product is still effective on balanced datasets, since positive and negative gradients have opposite directions even when the dataset is balanced.
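A minimal sketch of the hint attack scoring under the description above (inner-product similarity to the known positives); the resulting scores can be turned into a leak AUC exactly as for the norm attack. Variable names are illustrative.

import torch

def hint_attack_scores(g_cut: torch.Tensor, hint_idx: torch.Tensor) -> torch.Tensor:
    # g_cut: (B, d) per-example cut-layer gradients;
    # hint_idx: indices of the known positive examples (hints) in this mini-batch.
    hints = g_cut[hint_idx]           # (k, d), e.g., k = 5 hints
    sims = g_cut @ hints.t()          # (B, k) pairwise inner products
    return sims.max(dim=1).values     # larger score => closer to some hint => likely positive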

As shown in Figure 2, the x-axis represents the corresponding model's leak AUC under a given attack (the closer to 0.5 the better), and the y-axis reflects the corresponding model's performance at the same time (test AUC in panels (a), (c), (d), and (f); test loss in panels (b) and (e)). Generally speaking, a protection technique that achieves a good trade-off between test AUC and privacy should fall into the upper-left area of the figure. We observe in Figure 2 (a) and (d) that Marvell and max_norm achieve a better trade-off between test AUC and leak AUC than iso in preventing the norm attack. However, max_norm becomes less effective than Marvell when dealing with the hint attack, as shown in Figure 2 (b), (c), (e), and (f). A model that achieves a good trade-off between test loss and privacy should fall into the lower-left area of the figure. In this case, Marvell achieves the best trade-off between leak AUC and test loss on both Criteo and Avazu, as shown in Figure 2 (b) and (e) respectively. Also, since we do not optimize AUC directly, it is possible for iso to have a worse test loss than Marvell while having a similar (but not better) test AUC. Overall, as shown in Figure 2, Marvell achieves a smaller test loss and a higher test AUC with approximately the same, if not better, leak AUC than max_norm and iso. This demonstrates that the optimization-based method achieves a better utility-privacy trade-off than the other comparison partners.


3 Conclusion

In this paper, we identify a simple norm-based method that uncovers the private labels of the active party from the communicated gradients in the two-party split learning problem for binary classification. We then propose a theoretically principled method to determine the best additive random perturbation to the gradients to reduce the probability of label leakage against a general class of label uncovering methods. We also conduct extensive experiments to demonstrate the effectiveness of our proposed protection methods.

We note that label leakage in the gradient-sharing scheme is a general phenomenon as long as prior knowledge of the gradient distributions is available to the passive party. It remains open whether the passive party can computationally efficiently exploit this leakage when a cryptographic scheme is applied to the learning protocol, as in [3, 15]. Another open problem is whether there exist other methods to uncover the label information from the gradients of not-necessarily-imbalanced datasets, and how our method could protect against those attacks theoretically and empirically. Finally, our protection technique Marvell is designed to maximize the overall detection error. It would be interesting to see whether similar ideas can be used to satisfy other protection requirements.

Acknowledgements We would like to express our thanks to the team members at ByteDance for providing the collaborative federated learning framework Fedlearner (https://github.com/bytedance/fedlearner). In addition, we would like to thank Liangchao Wu, Di Wu, and Xiaobing Liu for their discussions and support. The code is available at https://github.com/bytedance/fedlearner/tree/master/example/privacy/label_protection.

References

  • Abadi et al. [2016] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.
  • Agrawal et al. [2019] N. Agrawal, A. Shahin Shamsabadi, M. J. Kusner, and A. Gascón. Quotient: two-party secure neural network training and prediction. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 1231–1247, 2019.
  • Aono et al. [2016] Y. Aono, T. Hayashi, L. Trieu Phong, and L. Wang. Scalable and secure logistic regression via homomorphic encryption. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, pages 142–144, 2016.
  • Aono et al. [2017] Y. Aono, T. Hayashi, L. Wang, S. Moriai, et al. Privacy-preserving deep learning via additively homomorphic encryption. IEEE Transactions on Information Forensics and Security, 13(5):1333–1345, 2017.
  • Bhowmick et al. [2018] A. Bhowmick, J. Duchi, J. Freudiger, G. Kapoor, and R. Rogers. Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984, 2018.
  • Bonawitz et al. [2017] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1175–1191, 2017.
  • Cheng et al. [2016] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pages 7–10, 2016.
  • Cheu et al. [2019] A. Cheu, A. Smith, J. Ullman, D. Zeber, and M. Zhilyaev. Distributed differential privacy via shuffling. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 375–403. Springer, 2019.
  • Du et al. [2004] W. Du, Y. S. Han, and S. Chen. Privacy-preserving multivariate statistical analysis: Linear regression and classification. In Proceedings of the 2004 SIAM international conference on data mining, pages 222–233. SIAM, 2004.
  • Dwork [2006] C. Dwork. Differential privacy. In M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, editors, Automata, Languages and Programming, pages 1–12, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-35908-1.
  • Dwork et al. [2014] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
  • Erlingsson et al. [2019] Ú. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, K. Talwar, and A. Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2468–2479. SIAM, 2019.
  • Geyer et al. [2017] R. C. Geyer, T. Klein, and M. Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
  • Gupta and Raskar [2018] O. Gupta and R. Raskar. Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications, 116:1–8, 2018.
  • Kim et al. [2018] M. Kim, Y. Song, S. Wang, Y. Xia, and X. Jiang. Secure logistic regression based on homomorphic encryption: Design and evaluation. JMIR medical informatics, 6(2):e19, 2018.
  • Lyu et al. [2020] L. Lyu, H. Yu, and Q. Yang. Threats to federated learning: A survey. arXiv preprint arXiv:2003.02133, 2020.
  • McMahan et al. [2017a] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017a.
  • McMahan et al. [2017b] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017b.
  • McMahan et al. [2018] H. B. McMahan, G. Andrew, U. Erlingsson, S. Chien, I. Mironov, N. Papernot, and P. Kairouz. A general approach to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210, 2018.
  • Nikolaenko et al. [2013] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft. Privacy-preserving ridge regression on hundreds of millions of records. In 2013 IEEE Symposium on Security and Privacy, pages 334–348. IEEE, 2013.
  • Nixon et al. [2019] J. Nixon, M. Dusenberry, L. Zhang, G. Jerfel, and D. Tran. Measuring calibration in deep learning. CoRR, abs/1904.01685, 2019.
  • Pichapati et al. [2019] V. Pichapati, A. T. Suresh, F. X. Yu, S. J. Reddi, and S. Kumar. Adaclip: Adaptive clipping for private sgd. arXiv preprint arXiv:1908.07643, 2019.
  • Platt [1998] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. 1998.
  • Rajkumar and Agarwal [2012] A. Rajkumar and S. Agarwal. A differentially private stochastic gradient descent algorithm for multiparty classification. In N. D. Lawrence and M. Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 933–941, La Palma, Canary Islands, 21–23 Apr 2012. PMLR.
  • Sathya et al. [2018] S. S. Sathya, P. Vepakomma, R. Raskar, R. Ramachandra, and S. Bhattacharya. A review of homomorphic encryption libraries for secure computation. arXiv preprint arXiv:1812.02428, 2018.
  • Subramanyan et al. [2017] P. Subramanyan, R. Sinha, I. Lebedev, S. Devadas, and S. A. Seshia. A formal foundation for secure remote execution of enclaves. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 2435–2450, 2017.
  • Thakkar et al. [2019] O. Thakkar, G. Andrew, and H. B. McMahan. Differentially private learning with adaptive clipping. arXiv preprint arXiv:1905.03871, 2019.
  • Tramer and Boneh [2018] F. Tramer and D. Boneh. Slalom: Fast, verifiable and private execution of neural networks in trusted hardware. arXiv preprint arXiv:1806.03287, 2018.
  • Vepakomma et al. [2018] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar. Split learning for health: Distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564, 2018.
  • Vepakomma et al. [2019] P. Vepakomma, O. Gupta, A. Dubey, and R. Raskar. Reducing leakage in distributed deep learning for sensitive health data. arXiv preprint arXiv:1812.00564, 2019.
  • Yang et al. [2019] Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
  • Zhao et al. [2020] B. Zhao, K. R. Mopuri, and H. Bilen. idlg: Improved deep leakage from gradients. arXiv preprint arXiv:2001.02610, 2020.
  • Zhu et al. [2019] L. Zhu, Z. Liu, and S. Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems, pages 14774–14784, 2019.

Appendix

Proof of Theorem LABEL:thm:optimal_solution.

Based on the assumption made in Section LABEL:sec:_optimization, the optimization problem we want to solve is:

\[
\begin{aligned}
\min_{\Sigma_0,\,\Sigma_1}\quad & \mathrm{KL}\bigl(\mathcal{N}(\bar g^{(1)}, uI+\Sigma_1)\,\big\|\,\mathcal{N}(\bar g^{(0)}, vI+\Sigma_0)\bigr) + \mathrm{KL}\bigl(\mathcal{N}(\bar g^{(0)}, vI+\Sigma_0)\,\big\|\,\mathcal{N}(\bar g^{(1)}, uI+\Sigma_1)\bigr)\\
\text{subject to}\quad & \Sigma_0\Sigma_1 = \Sigma_1\Sigma_0,\\
& p\cdot\mathrm{tr}(\Sigma_1) + (1-p)\cdot\mathrm{tr}(\Sigma_0) \le P,\\
& \Sigma_1 \succeq 0, \quad \Sigma_0 \succeq 0.
\end{aligned}
\]
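For reference, the standard closed form of the KL divergence between two $d$-dimensional Gaussians, which is the identity applied in the next step, is

\[
\mathrm{KL}\bigl(\mathcal{N}(\mu_1, S_1)\,\big\|\,\mathcal{N}(\mu_0, S_0)\bigr)
= \tfrac{1}{2}\Bigl[\mathrm{tr}(S_0^{-1}S_1) + (\mu_0-\mu_1)^\top S_0^{-1}(\mu_0-\mu_1) - d + \ln\frac{\det S_0}{\det S_1}\Bigr].
\]

Summing the two directions cancels the log-determinant terms, so up to an additive constant and a factor of $1/2$, the symmetrized divergence consists of exactly the trace and quadratic terms appearing below.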

By writing out the analytical closed form of the KL divergence between two Gaussian distributions, the optimization can be written as:

\[
\begin{aligned}
\min_{\Sigma_0,\,\Sigma_1\in\mathrm{Sym}}\quad & \mathrm{tr}\bigl((\Sigma_1+vI)^{-1}(\Sigma_0+uI)\bigr) + \mathrm{tr}\bigl((\Sigma_0+uI)^{-1}(\Sigma_1+vI)\bigr)\\
& \quad + (\bar g^{(1)}-\bar g^{(0)})^\top\bigl[(\Sigma_1+vI)^{-1} + (\Sigma_0+uI)^{-1}\bigr](\bar g^{(1)}-\bar g^{(0)})\\
\text{subject to}\quad & \Sigma_0\Sigma_1 = \Sigma_1\Sigma_0,\\
& p\cdot\mathrm{tr}(\Sigma_1) + (1-p)\cdot\mathrm{tr}(\Sigma_0) \le P,\\
& \Sigma_1 \succeq 0, \quad \Sigma_0 \succeq 0.
\end{aligned}
\tag{1}
\]

By the commutativity constraint on the two positive semidefinite matrices $\Sigma_1$ and $\Sigma_0$, the two matrices can be simultaneously diagonalized by the same set of eigenvectors. We thus write:

\[
\Sigma_0 = Q^\top \mathrm{diag}\bigl(\lambda_1^{(0)},\ldots,\lambda_d^{(0)}\bigr)\,Q,
\qquad
\Sigma_1 = Q^\top \mathrm{diag}\bigl(\lambda_1^{(1)},\ldots,\lambda_d^{(1)}\bigr)\,Q,
\]

where $Q\in\mathbb{R}^{d\times d}$ is an orthogonal matrix and the eigenvalues $\lambda_i^{(0)},\lambda_i^{(1)}$ are nonnegative and arranged in decreasing order.

Using this alternative expression of $\Sigma_1$ and $\Sigma_0$, we can express the optimization in terms of $\{\lambda_i^{(1)}\}$, $\{\lambda_i^{(0)}\}$, and $Q$:

\[
\begin{aligned}
\min_{\{\lambda_i^{(1)}\},\{\lambda_i^{(0)}\},\,Q}\quad & \sum_{i=1}^{d}\frac{\lambda_i^{(0)}+u}{\lambda_i^{(1)}+v} + \sum_{i=1}^{d}\frac{\lambda_i^{(1)}+v}{\lambda_i^{(0)}+u}\\
& \quad + \bigl[Q(\bar g^{(1)}-\bar g^{(0)})\bigr]^\top \mathrm{diag}\Bigl(\ldots,\tfrac{1}{\lambda_i^{(0)}+u}+\tfrac{1}{\lambda_i^{(1)}+v},\ldots\Bigr)\,Q(\bar g^{(1)}-\bar g^{(0)})\\
\text{subject to}\quad & p\sum_{i=1}^{d}\lambda_i^{(1)} + (1-p)\sum_{i=1}^{d}\lambda_i^{(0)} \le P,\\
& -\lambda_i^{(1)} \le 0,\quad -\lambda_i^{(0)} \le 0, \qquad \forall i\in[d],\\
& \lambda_i^{(1)} \ge \lambda_j^{(1)},\quad \lambda_i^{(0)} \ge \lambda_j^{(0)}, \qquad \forall i<j,\\
& Q\ \text{orthogonal}.
\end{aligned}
\]

For any fixed feasible $\{\lambda_i^{(1)}\},\{\lambda_i^{(0)}\}$, the corresponding minimizing $Q$ sets its first row to be the unit vector in the direction of $\Delta g := \bar g^{(1)}-\bar g^{(0)}$. Thus, by first minimizing over $Q$, the optimization objective reduces to:

\[
\begin{aligned}
& \sum_{i=1}^{d}\frac{\lambda_i^{(0)}+u}{\lambda_i^{(1)}+v} + \sum_{i=1}^{d}\frac{\lambda_i^{(1)}+v}{\lambda_i^{(0)}+u} + \frac{\|\Delta g\|_2^2}{\lambda_1^{(0)}+u} + \frac{\|\Delta g\|_2^2}{\lambda_1^{(1)}+v}\\
={}& \sum_{i=2}^{d}\frac{\lambda_i^{(0)}+u}{\lambda_i^{(1)}+v} + \sum_{i=2}^{d}\frac{\lambda_i^{(1)}+v}{\lambda_i^{(0)}+u} + \frac{\lambda_1^{(1)}+v+\|\Delta g\|_2^2}{\lambda_1^{(0)}+u} + \frac{\lambda_1^{(0)}+u+\|\Delta g\|_2^2}{\lambda_1^{(1)}+v}.
\end{aligned}
\]

We see that for each pair of variables $(\lambda_i^{(1)},\lambda_i^{(0)})$ with $i\ge 2$, the function $\frac{\lambda_i^{(0)}+u}{\lambda_i^{(1)}+v}+\frac{\lambda_i^{(1)}+v}{\lambda_i^{(0)}+u}$ is strictly convex over the line segment $p\lambda_i^{(1)}+(1-p)\lambda_i^{(0)}=c$ for any positive $c$, and attains its minimum value at $\lambda_i^{(1)}=0$ when $v<u$ and at $\lambda_i^{(0)}=0$ when $v\ge u$. Suppose without loss of generality that $u<v$; then at the optimal solution we must have $\lambda_i^{(0)}=0$ for all $i\ge 2$. Under this condition, we notice that the function $x\mapsto\frac{u}{x+v}+\frac{x+v}{u}$ is strictly convex on the positive reals. Thus, by Jensen's inequality, the optimal solution's variables $\lambda_i^{(1)}$ must take the same value for all $i\ge 2$. As a result, we have proved that at the optimal solution we must have:

\[
\lambda_i^{(0)}=\lambda_j^{(0)} \quad\text{and}\quad \lambda_i^{(1)}=\lambda_j^{(1)}, \qquad \text{for } i,j\ge 2.
\]

Hence, we have the equivalent optimization problem:

\[
\begin{aligned}
\min_{\lambda_1^{(0)},\,\lambda_1^{(1)},\,\lambda_2^{(0)},\,\lambda_2^{(1)}}\quad & (d-1)\frac{\lambda_2^{(0)}+u}{\lambda_2^{(1)}+v} + (d-1)\frac{\lambda_2^{(1)}+v}{\lambda_2^{(0)}+u} + \frac{\lambda_1^{(0)}+u+\|\Delta g\|_2^2}{\lambda_1^{(1)}+v} + \frac{\lambda_1^{(1)}+v+\|\Delta g\|_2^2}{\lambda_1^{(0)}+u}\\
\text{subject to}\quad & p\lambda_1^{(1)} + p(d-1)\lambda_2^{(1)} + (1-p)\lambda_1^{(0)} + (1-p)(d-1)\lambda_2^{(0)} \le P,\\
& -\lambda_1^{(1)} \le 0,\quad -\lambda_1^{(0)} \le 0,\quad -\lambda_2^{(1)} \le 0,\quad -\lambda_2^{(0)} \le 0,\\
& \lambda_2^{(1)}-\lambda_1^{(1)} \le 0,\quad \lambda_2^{(0)}-\lambda_1^{(0)} \le 0.
\end{aligned}
\tag{2}
\]

Given the optimal solution $(\lambda_1^{(0)*},\lambda_2^{(0)*},\lambda_1^{(1)*},\lambda_2^{(1)*})$ to the above four-variable problem, we can set $Q$ to be any orthogonal matrix whose first row is the vector $\Delta g/\|\Delta g\|_2$. Plugging this back into the expressions of $\Sigma_1$ and $\Sigma_0$ gives the final result.

Thus the proof is complete. ∎


Remark.

By analyzing the KKT conditions of this four-variable problem, we find that the optimal solution must lie on the hyperplane $p\lambda_1^{(1)} + p(d-1)\lambda_2^{(1)} + (1-p)\lambda_1^{(0)} + (1-p)(d-1)\lambda_2^{(0)} = P$. Knowing from the proof above that, in addition, $\lambda_2^{(1)}=0$ if $u\ge v$ and $\lambda_2^{(0)}=0$ if $u<v$ reduces this further to a three-variable problem. Keeping one of the three variables fixed and optimizing the other two reduces the problem to a line search over a convex objective. Alternately fixing one of the three variables while optimizing the other two eventually converges and yields the optimal solution.
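As a concrete illustration (not the paper's implementation), the reduced four-variable problem (2) can also be handed directly to an off-the-shelf constrained solver. The sketch below assumes $u, v > 0$ and that $u$, $v$, $p$, $P$, $d$, and $\|\Delta g\|_2^2$ are given; the alternating line search described above exploits more structure and would typically be preferred.

import numpy as np
from scipy.optimize import minimize

def solve_reduced_problem(u, v, p, P, d, dg2):
    # Numerically solve problem (2); x = (lambda_1^(0), lambda_1^(1), lambda_2^(0), lambda_2^(1)).
    def objective(x):
        l10, l11, l20, l21 = x
        return ((d - 1) * ((l20 + u) / (l21 + v) + (l21 + v) / (l20 + u))
                + (l10 + u + dg2) / (l11 + v)
                + (l11 + v + dg2) / (l10 + u))

    constraints = [
        # power budget: p*l11 + p(d-1)*l21 + (1-p)*l10 + (1-p)(d-1)*l20 <= P
        {"type": "ineq",
         "fun": lambda x: P - (p * x[1] + p * (d - 1) * x[3]
                               + (1 - p) * x[0] + (1 - p) * (d - 1) * x[2])},
        {"type": "ineq", "fun": lambda x: x[1] - x[3]},  # lambda_2^(1) <= lambda_1^(1)
        {"type": "ineq", "fun": lambda x: x[0] - x[2]},  # lambda_2^(0) <= lambda_1^(0)
    ]
    x0 = np.full(4, P / (2.0 * d))  # feasible, strictly positive starting point
    res = minimize(objective, x0, method="SLSQP",
                   bounds=[(0.0, None)] * 4, constraints=constraints)
    return res.x

Given the returned eigenvalues, one would then pick any orthogonal $Q$ whose first row is $\Delta g/\|\Delta g\|_2$ and assemble $\Sigma_0$ and $\Sigma_1$ as in the proof above.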