Let’s consider more general nonlinear approaches to study teleconnections of climate variables
1 Introduction
The recent work [1] is concerned with the problem of extracting features from spatio-temporal geophysical signals. The authors introduce the complex rotated MCA (xMCA) to deal with lagged effects and non-orthogonality of the feature representation. This method essentially (1) transforms the signals to the complex plane with the Hilbert transform; (2) applies an oblique (varimax and Promax) rotation to remove the orthogonality constraint; and (3) performs the eigendecomposition in this complex space [2]. We argue that this method is essentially a particular case of the method called rotated complex kernel principal component analysis (ROCK-PCA) introduced in [3], where we proposed the same approach: first transform the data to the complex plane with the Hilbert transform and then apply the varimax rotation, with the only difference that the eigendecomposition is performed in the dual (kernel) Hilbert space [4, 5]. In essence, the latter allows us to generalize the xMCA solution by extracting nonlinear (curvilinear) features when nonlinear kernel functions are used. Hence, the solution of xMCA boils down to that of ROCK-PCA when the inner product is computed in the input data space instead of in the high-dimensional (possibly infinite-dimensional) kernel Hilbert space to which the data has been mapped [4, 5].
In this short correspondence we give a theoretical proof that xMCA is a special case of ROCK-PCA and provide quantitative evidence that more expressive and informative features can be extracted when working with kernels; results of the decomposition of global sea surface temperature (SST) fields are shown to illustrate the capability of ROCK-PCA to cope with nonlinear processes, unlike xMCA.
2 On the theoretical equivalence
Let us consider two datasets $\mathbf{X}\in\mathbb{R}^{n\times d_x}$ and $\mathbf{Y}\in\mathbb{R}^{n\times d_y}$ containing samples of two random variables $x$ and $y$ with the same temporal dimension $n$ and spatial dimensions $d_x$ and $d_y$, respectively. MCA maximizes the cross-covariance of the projected data, $\mathbf{U}^\top\mathbf{C}_{xy}\mathbf{V}$ with $\mathbf{C}_{xy}=\mathbf{X}^\top\mathbf{Y}$, subject to orthogonality constraints $\mathbf{U}^\top\mathbf{U}=\mathbf{I}$ and $\mathbf{V}^\top\mathbf{V}=\mathbf{I}$. MCA can be solved as a generalized eigenvalue problem or alternatively as a singular value decomposition (SVD). Actually, as noted in [1], MCA solves exactly the same problem as standard canonical correlation analysis (CCA) [6], which was extended to work with more than just two datasets in [7]. In practice, CCA has a limitation when working with more samples (spatial grid points) than dimensions (time observations), i.e. $d_x, d_y \gg n$, and hence typically requires an extra regularization. This is a well-known issue in standard CCA, but it does not play in favour of MCA either, as it involves the same (and very high) computational cost, i.e. that of solving an SVD of a matrix of size $d_x\times d_y$. This huge computational problem can actually be addressed by working in the dual (sample) space instead of the primal (variable) space where [1] define it. We note here that this is not only more computationally convenient but also allows us to generalize xMCA to nonlinear cases by means of the theory of kernels [5], and to show that it is a particular case of the kernel CCA (kCCA) method [8], directly connected to ROCK-PCA [3]. In what follows, we show these theoretical connections and demonstrate that kCCA not only generalizes CCA and MCA but, with the proper term arrangement, also reduces to solving a single eigenproblem as in kernel PCA approaches [5, 3].
2.1 MCA is equivalent to CCA
Both MCA and CCA solve the problem of finding orthogonal projections $\mathbf{u}$ and $\mathbf{v}$ of the data $\mathbf{X}$ and $\mathbf{Y}$, respectively, such that the cross-covariance of the projected data is maximized:

$$(\mathbf{u},\mathbf{v}) = \arg\max_{\mathbf{u},\mathbf{v}}\;\mathrm{cov}(\mathbf{X}\mathbf{u},\,\mathbf{Y}\mathbf{v}),$$

that is

$$(\mathbf{u},\mathbf{v}) = \arg\max_{\mathbf{u},\mathbf{v}}\;\mathbf{u}^\top\mathbf{X}^\top\mathbf{Y}\,\mathbf{v} = \arg\max_{\mathbf{u},\mathbf{v}}\;\mathbf{u}^\top\mathbf{C}_{xy}\,\mathbf{v}.$$
One can solve this problem by a singular value decomposition (SVD) of the cross-covariance matrix, $\mathbf{C}_{xy}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, as suggested in [1]. Alternatively, one can proceed in the standard way by introducing the orthogonality constraints through Lagrange multipliers, then deriving and equating to zero, which leads to the generalized eigenproblem

$$\begin{pmatrix}\mathbf{0} & \mathbf{C}_{xy}\\ \mathbf{C}_{yx} & \mathbf{0}\end{pmatrix}\begin{pmatrix}\mathbf{u}\\ \mathbf{v}\end{pmatrix}=\lambda\begin{pmatrix}\mathbf{C}_{xx} & \mathbf{0}\\ \mathbf{0} & \mathbf{C}_{yy}\end{pmatrix}\begin{pmatrix}\mathbf{u}\\ \mathbf{v}\end{pmatrix},$$

which can be better conditioned by solving the eigenproblem $\mathbf{C}_{xx}^{-1}\mathbf{C}_{xy}\mathbf{C}_{yy}^{-1}\mathbf{C}_{yx}\,\mathbf{u}=\lambda^2\mathbf{u}$. The equivalence between SVD and CCA has been widely studied too, and has led to interesting and controversial discussions about the appropriateness, superiority and interpretability of the feature projections computed by one method over the other, as pointed out in [9] and later nicely studied and summarized in [10].
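For concreteness, the following is a minimal NumPy sketch (not the authors' code) of MCA via the SVD of the cross-covariance matrix and of the ridge-regularized CCA eigenproblem just described; the data, sizes and variable names are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 200, 50, 40                      # time steps and spatial dimensions (toy sizes)
X = rng.standard_normal((n, dx))
Y = rng.standard_normal((n, dy))
X -= X.mean(axis=0)                          # center each spatial point in time
Y -= Y.mean(axis=0)

# MCA: SVD of the cross-covariance matrix C_xy (size dx x dy)
Cxy = X.T @ Y / (n - 1)
U, s, Vt = np.linalg.svd(Cxy, full_matrices=False)
u1, v1 = U[:, 0], Vt[0, :]                   # leading pair of spatial patterns
tx, ty = X @ u1, Y @ v1                      # associated expansion coefficients (time series)

# CCA: eigenproblem on Cxx^{-1} Cxy Cyy^{-1} Cyx (small ridge added for conditioning)
Cxx = X.T @ X / (n - 1) + 1e-6 * np.eye(dx)
Cyy = Y.T @ Y / (n - 1) + 1e-6 * np.eye(dy)
M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
rho2, W = np.linalg.eig(M)                   # rho2.real holds the squared canonical correlations
```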
2.2 Dual CCA helps with the computational cost
Solving the SVD or the generalized eigenproblem involved in MCA is very challenging because of the regularization needed and the computational cost involved. Note that in standard problems in the geosciences one has more samples (spatial grid points) than dimensions (time observations), and hence the cross-covariance matrix $\mathbf{C}_{xy}\in\mathbb{R}^{d_x\times d_y}$ is very large. However, this can be addressed by operating in the dual (sample) space instead, as is typically done in PCA/EOF analysis. Let us define the projection matrices $\mathbf{U}$ and $\mathbf{V}$ in the span of the data matrices $\mathbf{X}$ and $\mathbf{Y}$, that is $\mathbf{U}=\mathbf{X}^\top\boldsymbol{\alpha}$ and $\mathbf{V}=\mathbf{Y}^\top\boldsymbol{\beta}$, where $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are the new projection matrices, typically with $n\ll d_x$ and $n\ll d_y$. Now, the problem can be rewritten as

$$\max_{\boldsymbol{\alpha},\boldsymbol{\beta}}\;\boldsymbol{\alpha}^\top\mathbf{X}\mathbf{X}^\top\mathbf{Y}\mathbf{Y}^\top\boldsymbol{\beta},$$

that is

$$\max_{\boldsymbol{\alpha},\boldsymbol{\beta}}\;\boldsymbol{\alpha}^\top\mathbf{K}_x\mathbf{K}_y\,\boldsymbol{\beta},$$

where $\mathbf{K}_x=\mathbf{X}\mathbf{X}^\top$ and $\mathbf{K}_y=\mathbf{Y}\mathbf{Y}^\top$ are the Gram matrices (inner products between the data samples). One could use the SVD to solve this dual problem by simply decomposing the $n\times n$ matrix $\mathbf{K}_x\mathbf{K}_y$, or solve the equivalent dual CCA problem with (properly regularized) Gram matrices as before: $(\mathbf{K}_x+\epsilon\mathbf{I})^{-1}\mathbf{K}_y(\mathbf{K}_y+\epsilon\mathbf{I})^{-1}\mathbf{K}_x\,\boldsymbol{\alpha}=\lambda^2\boldsymbol{\alpha}$, or the analogous eigenproblem for $\boldsymbol{\beta}$. Obviously, a great computational advantage can be achieved by solving this dual problem: only two (much smaller) $n\times n$ matrices are actually computed and inverted. We introduce this technicality to help the use and adoption of MCA, but mainly to note that it will allow us to derive the nonlinear generalization of MCA that follows, which is equivalent to the method proposed in [3].
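A small sketch of this dual route follows, under the same toy setup as before (variable names are ours; normalization of the recovered patterns is omitted). Only $n\times n$ Gram matrices are formed, which is convenient when $d_x, d_y \gg n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 200, 50, 40
X = rng.standard_normal((n, dx)); X -= X.mean(axis=0)
Y = rng.standard_normal((n, dy)); Y -= Y.mean(axis=0)

Kx, Ky = X @ X.T, Y @ Y.T                    # n x n Gram matrices

# Dual SVD route: decompose the n x n product instead of the dx x dy cross-covariance
A, s, Bt = np.linalg.svd(Kx @ Ky)
alpha, beta = A[:, 0], Bt[0, :]

# Dual (regularized) CCA route: eigenproblem on (Kx + eps I)^{-1} Ky (Ky + eps I)^{-1} Kx
eps = 1e-3 * n
I = eps * np.eye(n)
M = np.linalg.solve(Kx + I, Ky) @ np.linalg.solve(Ky + I, Kx)
lam2, alphas = np.linalg.eig(M)

# Primal spatial patterns are recovered from the span of the data (up to normalization)
u1, v1 = X.T @ alpha, Y.T @ beta
```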
2.3 Kernelized CCA exists and generalizes MCA
Working with Gram matrices $\mathbf{K}_x$ and $\mathbf{K}_y$, this duality has been widely studied and exploited in multivariate statistics [11]. Yet, more interestingly, nonlinear versions can be derived using similar arguments within the field of kernel methods [5, 4, 12]. The framework of kernel multivariate statistics aims at extracting nonlinear projections while still working with linear algebra. Let us first consider a feature map $\boldsymbol{\phi}(\cdot)$ that projects the input data into a Hilbert feature space $\mathcal{H}$. The new mapped data set is defined as $\boldsymbol{\Phi}=[\boldsymbol{\phi}(\mathbf{x}_1),\ldots,\boldsymbol{\phi}(\mathbf{x}_n)]^\top$, and the features extracted from the input data will now be given by $\boldsymbol{\Phi}\mathbf{U}$, where the matrix $\mathbf{U}$ is of size $\dim(\mathcal{H})\times n_p$, with $n_p$ the number of extracted features.
To implement practical kernel MVA algorithms we need to rewrite the equations in terms of inner products in $\mathcal{H}$ only. For doing so, we rely on the availability of two kernel matrices $\mathbf{K}_x$ and $\mathbf{K}_y$ of dimension $n\times n$, and on the Representer Theorem [13], which states that the projection vectors can be written as a linear combination of the mapped training samples, i.e., $\mathbf{U}=\boldsymbol{\Phi}_x^\top\boldsymbol{\alpha}$ and $\mathbf{V}=\boldsymbol{\Phi}_y^\top\boldsymbol{\beta}$, the matrices $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ being the new arguments of the optimization. Note the resemblance with the previous dual reformulation of the problem when working with Gram matrices instead of covariance matrices. This is typically referred to as the kernel trick and has been used to develop nonlinear (kernel-based) versions of standard linear methods like PCA, PLS, and CCA/MCA. The nonlinear kernel PCA (kPCA) was introduced in [5] and later the kernel CCA (kCCA) was introduced in [8]. Actually, this nonlinear generalization of CCA, and hence of MCA, is not new and has been studied and used widely [14, 15, 16, 17]. Many improved variants have been proposed to cope with more than two datasets in multiview settings [8], with semisupervised learning problems [18, 4], as well as to deal with temporal domains efficiently [19]. It is easy to show that kCCA boils down to solving the generalized eigenproblem

$$\begin{pmatrix}\mathbf{0} & \mathbf{K}_x\mathbf{K}_y\\ \mathbf{K}_y\mathbf{K}_x & \mathbf{0}\end{pmatrix}\begin{pmatrix}\boldsymbol{\alpha}\\ \boldsymbol{\beta}\end{pmatrix}=\lambda\begin{pmatrix}\mathbf{K}_x\mathbf{K}_x & \mathbf{0}\\ \mathbf{0} & \mathbf{K}_y\mathbf{K}_y\end{pmatrix}\begin{pmatrix}\boldsymbol{\alpha}\\ \boldsymbol{\beta}\end{pmatrix},$$

or equivalently the eigenproblem $(\mathbf{K}_x+\epsilon\mathbf{I})^{-1}\mathbf{K}_y(\mathbf{K}_y+\epsilon\mathbf{I})^{-1}\mathbf{K}_x\,\boldsymbol{\alpha}=\lambda^2\boldsymbol{\alpha}$, with properly regularized matrices.
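As an illustration, here is a minimal kCCA sketch under the assumptions above (the function name and regularization scheme are ours, not from [8]): given two precomputed kernel matrices, it solves the regularized eigenproblem written above.

```python
import numpy as np

def kcca(Kx, Ky, eps=1e-3, n_components=2):
    """Toy kernel CCA: dual coefficients and canonical correlations from Kx, Ky."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix (centers data in feature space)
    Kx, Ky = H @ Kx @ H, H @ Ky @ H
    R = eps * n * np.eye(n)                        # ridge regularization
    M = np.linalg.solve(Kx + R, Ky) @ np.linalg.solve(Ky + R, Kx)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)[:n_components]
    alphas = vecs[:, order].real                   # dual coefficients for the X view
    corrs = np.sqrt(np.clip(vals[order].real, 0.0, None))  # canonical correlations
    return alphas, corrs

# Nonlinear features of X are then obtained as the projections Kx @ alphas.
```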
In kernel methods one has to choose the form of the kernel function that encodes the similarities between data points. It is customary to use the radial basis function (RBF) kernel, $k(\mathbf{x}_i,\mathbf{x}_j)=\exp(-\|\mathbf{x}_i-\mathbf{x}_j\|^2/(2\sigma^2))$, as it is a characteristic kernel with good theoretical and practical properties, it has only the length-scale parameter $\sigma$ to tune, and it generalizes the linear kernel by capturing all higher-order moments of similarity between time series. This is the kernel function we used in [3]. kCCA maximizes the same objective as linear CCA and MCA, yet in the feature space $\mathcal{H}$. kCCA is computationally efficient in our setting and can extract nonlinear relations. Also note that when a linear (dot-product) kernel is used for both domains, $\mathbf{K}_x=\mathbf{X}\mathbf{X}^\top$ and $\mathbf{K}_y=\mathbf{Y}\mathbf{Y}^\top$, one comes back to the equivalent yet more efficient CCA/MCA introduced before. Therefore, kCCA generalizes CCA (or MCA), and the solution is the same as in the kPCA schemes in [5, 3].
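The kernels can be computed, for instance, as in the following sketch, which reuses the kcca() function above (the value of sigma and the toy data are arbitrary choices, not values from [1] or [3]); with the linear kernel the solution reduces to the dual CCA/MCA of Section 2.2.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X, sigma=1.0):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    return np.exp(-cdist(X, X, metric="sqeuclidean") / (2.0 * sigma ** 2))

def linear_kernel(X):
    # With this kernel, kcca() recovers the linear dual CCA/MCA solution
    return X @ X.T

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
Y = rng.standard_normal((200, 40))
alphas_rbf, corr_rbf = kcca(rbf_kernel(X, sigma=5.0), rbf_kernel(Y, sigma=5.0))
alphas_lin, corr_lin = kcca(linear_kernel(X), linear_kernel(Y))
```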
2.4 Remarks
The xMCA proposed in [1] solves MCA (or alternatively CCA) in the complex plane and adds an extra oblique rotation therein, as proposed earlier in [3]. These important operations do not change the rationale that connects xMCA and ROCK-PCA, because both methods operate on the same data. The previous derivations can actually be carried out with complex data and followed by a rotation in Hilbert spaces: the same arguments and equations are valid when working with complex algebra in Hilbert spaces, cf. [20], and a rotation in Hilbert spaces can also be expressed solely in terms of kernel evaluations [5, 12].
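As a minimal illustration of the complexification step shared by both methods (a SciPy sketch, not the authors' implementation), each grid-point time series is extended to its analytic signal with the Hilbert transform before the (kernel) eigendecomposition and rotation:

```python
import numpy as np
from scipy.signal import hilbert

rng = np.random.default_rng(0)
n_time, n_space = 240, 100                    # e.g. 20 years of monthly maps (toy size)
X = rng.standard_normal((n_time, n_space))    # anomalies per grid point

Xc = hilbert(X, axis=0)                       # complex analytic signal along the time axis
# Xc.real is the original signal and Xc.imag its 90-degree phase-shifted copy, so lagged
# (phase) relations become amplitude/phase information for the decomposition.
Kx = Xc @ Xc.conj().T                         # a complex Gram matrix for the dual/kernel route
```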
3 On the information content of the components
A classical way to evaluate dimensionality-reduction methods like PCA or CCA is to quantify how much variance of the data is explained by retaining a number of principal components. A more sensible criterion is information, but its estimation is complicated in multivariate and high-dimensional problems because it typically implies multivariate density estimation. Nevertheless, kernel methods have provided an adequate framework to estimate independence between random variables and hence to quantify information [21]. Interestingly, there is a tight connection between kernel dependence estimation and kCCA: in [22], a regularized correlation operator was derived from the covariance and cross-covariance operators, and its largest singular value (the kernel canonical correlation, or kCCA) was used as a statistic to test independence. Later, the largest singular value of the cross-covariance operator, called the constrained covariance (COCO) [23], was proposed as an efficient alternative to kCCA that needs no regularization. A variety of empirical kernel quantities derived from bounds on the mutual information that hold near independence were also proposed, namely the kernel Generalised Variance (kGV) and the Kernel Mutual Information (kMI) [24]. Later, HSIC [25] was introduced as an extension of COCO that considers the entire spectrum of the cross-covariance operator, not just its largest singular value. An illustrative example of these measures on synthetic data is given in Table 1, which evidences that linear methods (like correlation, CCA and MCA) fail to estimate dependence, while their kernel versions capture the nonlinear data structure and thus provide better information measures. We show in Fig. 1 how a nonlinear version of CCA/MCA (kCCA) can yield more information for a wide range of values of the kernel length scale $\sigma$, thus generalizing its linear counterpart.
Table 1: Dependence measures estimated on three synthetic datasets with linear and kernel methods (the scatter plots that headed each column in the original are not reproduced here).

| Measure | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|
| Pearson's R | 0.94 | 0.04 | 0.06 |
| MCA | 1.64 | 0.00 | 0.00 |
| CCA | 0.94 | 0.04 | 0.06 |
| HSIC (linear kernels) | 3.42 | 0.25 | 0.00 |
| HSIC (RBF kernels) | 21.32 | 13.49 | 0.86 |
| kGV | 5.82 | 4.85 | 1.44 |
| kCCA | 0.95 | 0.96 | 0.42 |
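For reference, a minimal sketch of the biased HSIC estimator follows, using the standard trace formulation $\mathrm{HSIC}=(n-1)^{-2}\,\mathrm{tr}(\mathbf{K}_x\mathbf{H}\mathbf{K}_y\mathbf{H})$ with centering matrix $\mathbf{H}$ (the kernels can be built as in the rbf_kernel() sketch above; the scaling of the values in Table 1 may differ).

```python
import numpy as np

def hsic(Kx, Ky):
    """Biased HSIC estimate from two n x n kernel matrices."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    return np.trace(Kx @ H @ Ky @ H) / (n - 1) ** 2
```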
Figure 1: Information extracted by kCCA as a function of the kernel length scale $\sigma$, compared with its linear counterpart (panels not reproduced).
4 An empirical comparison
We showed that a kernel method generalizes the performance and yields more expressive (informative) features than its linear counterpart, while still relying on linear algebra. Yet, one may ask why nonlinear components are necessary in practice for the study of teleconnections and time-lagged correlations in the geosciences. We illustrate this with an experiment decomposing the global sea surface temperature (SST) from the Met Office HadISST1 reanalysis [26]. The same example was introduced in [1] with the linear xMCA and previously in [3] with the nonlinear PCA generalization, both operating in the complex domain and with an extra oblique rotation.
Decomposing the spatio-temporal datacube with each method allows us to extract similar principal components of SST, with only very minor differences. The explained variance is more scattered across the eigenspectrum with xMCA than with ROCK-PCA: for example, the explained variance of the first principal component (PC1), corresponding to the annual oscillation, is lower for xMCA than for ROCK-PCA. This is an under-representation issue in xMCA, since the annual amplitude is close to 8.5°C while the inter-annual variability is not larger than 0.11°C per decade, with a global average increase of 0.3°C from 1960 to 2020 [27, 28]. This causes an unrealistic attribution of relevance to the interdecadal trend (PC3 of xMCA in [1]) in terms of its share of the overall variance. This difference in the attribution of the explained variance emerges from the definition of MCA, since the extracted eigenvalues do not match the individual explained variance of the separated phenomena when more than a single variable is involved. The extracted components have the same physical interpretation in both cases, see Figures 2 and 3 in [3], but differ in the mixture of components.
Rotation of principal components with Varimax or Promax allows us to linearly maximize the variance representation of a subset of principal components. This can lead to a more representative subset of modes of variation if the represented physical phenomena are linearly separable. Yet, a nonlinear spatio-temporal mixture of physical phenomena will not be separable by a linear spatio-temporal decomposition plus an extra rotation as in xMCA. In this case, we can see that xMCA identifies the third PC as the interdecadal trend, but the spatial distribution of the variability is mixed with other patterns, such as the ENSO spatial pattern, whereas it should exhibit a more homogeneous distribution [27]. The right panel of Figure 2 shows the interdecadal trend identified by ROCK-PCA instead, with a spatial pattern more consistent with that previously reported elsewhere [27].
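For completeness, here is a standard real-valued varimax rotation sketch (the complex-domain rotation used by xMCA and ROCK-PCA operates on complex loadings, so this shows only the basic linear building block):

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Rotate a (variables x components) loading matrix to maximize the varimax criterion."""
    p, k = loadings.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the (Kaiser) varimax criterion with respect to the rotation
        G = loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                             # closest orthogonal rotation to the gradient
        var_new = s.sum()
        if var_new < var_old * (1.0 + tol):
            break
        var_old = var_new
    return loadings @ R, R
```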
In Figure 4, we show the difference between the linear and the nonlinear estimates of the covariance structure: the scatter plots show how the complex components obtained with ROCK-PCA (right) separate the data into its monthly variability, unlike in the linear case (left). By adding the nonlinear mapping of the data, we can differentiate physical phenomena that a linear rotation cannot disentangle. These results support the use of more general nonlinear approaches to decompose climatic data, since they allow unraveling physical phenomena mixed in spatio-temporal data and thus extracting more meaningful and informative representations.
Figure 4: Scatter plots of the complex components for the linear case (left) and the nonlinear ROCK-PCA case (right) (panels not reproduced).
5 Outlook
Both ROCK-PCA [3] and xMCA [1] extract features in the complex domain after an additional oblique rotation is applied, and both were proposed to deal with time-lagged processes in spatio-temporal data. Distinctly, ROCK-PCA can extract nonlinear features, as it operates in a reproducing kernel Hilbert space. We showed that xMCA is a particular case of kernel CCA when linearity is assumed, and we showed the tight connection between xMCA and ROCK-PCA. Working in the dual (sample) domain is not only computationally advantageous but also allowed us to derive the nonlinear version of xMCA which, after reformulation as an eigenproblem, reduces to solving ROCK-PCA when a single dataset is involved. ROCK-PCA is not only more efficient computationally, but also provides more expressive and informative components, which permit capturing meaningful physical processes more accurately. Using a linear kernel is indeed quite restrictive in general, as it has been widely demonstrated that components and modes of variability in Earth and climate data show nonlinear relations. Actually, in other works, we explored the potential of ROCK-PCA to characterize nonlinear cross-information between covariates and global teleconnections [29] and to identify causal footprints of climate phenomena [30]. In all these cases, nonlinear kernels were strictly necessary to achieve physically meaningful results. We hope that this clarification places the novelty and limitations of xMCA in context, motivates the use of nonlinear approaches to study teleconnections, and ultimately encourages contributions in this exciting field of statistical analysis of spatio-temporal Earth and climate data.
Acknowledgments
This work was supported by the "Understanding and Modeling of the Earth System with Machine Learning" (USMILE) project under the European Research Council (ERC) Synergy Grant 2019, Agreement No. 855187.
References
- [1] Niclas Rieger, Álvaro Corral, Estrella Olmedo, and Antonio Turiel. Lagged teleconnections of climate variables identified via complex rotated maximum covariance analysis. Journal of Climate, 34(24):9861–9878, 2021.
- [2] J. D. Horel. Complex principal component analysis: Theory and examples. Journal of Applied Meteorology and Climatology, 23(12):1660–1673, 1984.
- [3] Diego Bueso, Maria Piles, and Gustau Camps-Valls. Nonlinear PCA for spatio-temporal analysis of earth observation data. IEEE Transactions on Geoscience and Remote Sensing, 58(8):5752–5763, 2020.
- [4] Jerónimo Arenas-García and Kaare Brandt Petersen. Kernel Multivariate Analysis in Remote Sensing Feature Extraction, chapter 14, pages 327–352. John Wiley & Sons, Ltd, 2009.
- [5] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5):1299–1319, 07 1998.
- [6] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
- [7] Jon R Kettenring. Canonical analysis of several sets of variables. Biometrika, 58(3):433–451, 1971.
- [8] Pei Ling Lai and Colin Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(05):365–377, 2000.
- [9] Matthew Newman and Prashant D Sardeshmukh. A caveat concerning singular value decomposition. Journal of Climate, 8(2):352–360, 1995.
- [10] Steve Cherry. Singular value decomposition analysis and canonical correlation analysis. Journal of Climate, 9(9):2003–2009, 1996.
- [11] Ian T Jolliffe. Rotation of principal components: some comments. Journal of Climatology, 7(5):507–510, 1987.
- [12] José Luis Rojo-Álvarez, Manel Martínez-Ramón, Jordi Munoz-Mari, and Gustau Camps-Valls. Digital signal processing with Kernel methods. John Wiley & Sons, 2018.
- [13] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
- [14] Siamak Mehrkanoon and Johan A. K. Suykens. Regularized semipaired kernel CCA for domain adaptation. IEEE Transactions on Neural Networks and Learning Systems, 29(7):3199–3213, 2017.
- [15] Carlos Alzate and Johan A. K. Suykens. A regularized kernel CCA contrast function for ICA. Neural Networks, 21(2-3):170–181, 2008.
- [16] Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf, et al. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005.
- [17] Kenji Fukumizu, Francis R Bach, and Arthur Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8(2), 2007.
- [18] Matthew B. Blaschko, Christoph H. Lampert, and Arthur Gretton. Semi-supervised Laplacian regularization of kernel canonical correlation analysis. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 133–145. Springer, 2008.
- [19] Felix Bießmann, Frank C. Meinecke, Arthur Gretton, Alexander Rauch, Gregor Rainer, Nikos K. Logothetis, and Klaus-Robert Müller. Temporal kernel CCA and its application in multimodal neuronal data analysis. Machine Learning, 79(1):5–27, 2010.
- [20] Pantelis Bouboulis and Sergios Theodoridis. Extension of Wirtinger's calculus to reproducing kernel Hilbert spaces and the complex kernel LMS. IEEE Transactions on Signal Processing, 59(3):964–978, 2010.
- [21] A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, pages 63–77, 2005.
- [22] Francis Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
- [23] A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf, and N. Logothetis. Kernel constrained covariance for dependence measurement. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 112–119, New Jersey, 2005. Society for Artificial Intelligence and Statistics.
- [24] A. Gretton, R. Herbrich, and A. Hyvärinen. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005.
- [25] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Proc. 16th Intl. Conf. Algorithmic Learning Theory, pages 63–77. Springer, 2005.
- [26] N. A. Rayner, D. E. Parker, E. B. Horton, C. K. Folland, L. V. Alexander, D. P. Rowell, E. C. Kent, and A. Kaplan. Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century. Journal of Geophysical Research: Atmospheres, 108(D14), 2003.
- [27] Claire E. Bulgin, Christopher J. Merchant, and David Ferreira. Tendencies, variability and persistence of sea surface temperature anomalies. Scientific Reports, 10:7986, 2020.
- [28] Ge Chen, Baomin Shao, Yong Han, Jun Ma, and Bertrand Chapron. Modality of semiannual to multidecadal oscillations in global sea surface temperature variability. Journal of Geophysical Research: Oceans, 115(C3), 2010.
- [29] Diego Bueso, Maria Piles, and Gustau Camps-Valls. Cross-information kernel causality: Revisiting global teleconnections of ENSO over soil moisture and vegetation. In Proceedings of the 9th International Workshop on Climate Informatics: CI 2019, pages 172–176, Paris, 2019. NCAR.
- [30] Diego Bueso, Maria Piles, and Gustau Camps-Valls. Explicit Granger causality in kernel Hilbert spaces. Phys. Rev. E, 102:062201, Dec 2020.