
Let’s consider more general nonlinear approaches to study teleconnections of climate variables

D. Bueso, M. Piles and G. Camps-Valls
Image Processing Laboratory (IPL), Universitat de València, Spain
http://isp.uv.es -- {diego.bueso,maria.piles,gustau.camps}@uv.es

1 Introduction

The recent work [1] is concerned with the problem of extracting features from spatio-temporal geophysical signals. The authors introduce the complex rotated MCA (xMCA) to deal with lagged effects and non-orthogonality of the feature representation. This method essentially (1) transforms the signals to the complex plane with the Hilbert transform; (2) applies an oblique (Varimax and Promax) rotation to remove the orthogonality constraint; and (3) performs the eigendecomposition in this complex space [2]. We argue that this method is essentially a particular case of the rotated complex kernel principal component analysis (ROCK-PCA) introduced in [3], where we proposed the same approach: first transform the data to the complex plane with the Hilbert transform and then apply the Varimax rotation, with the only difference that the eigendecomposition is performed in the dual (kernel) Hilbert space [4, 5]. In essence, the latter allows us to generalize the xMCA solution by extracting nonlinear (curvilinear) features when nonlinear kernel functions are used. Hence, the solution of xMCA boils down to that of ROCK-PCA when the inner product is computed in the input data space instead of in the high-dimensional (possibly infinite-dimensional) kernel Hilbert space into which the data are mapped [4, 5].
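For concreteness, the two building blocks shared by both methods, the Hilbert transform and the eigendecomposition in the dual (sample) space, can be sketched in a few lines of Python. This is our own illustrative code with hypothetical helper names, not the reference implementation of [1] or [3]; the rotation step is sketched in Section 4.

```python
import numpy as np
from scipy.signal import hilbert

def complexify(X):
    """Analytic signal along the time axis (Hilbert transform); X has shape (n_time, n_space)."""
    return hilbert(X, axis=0)                        # complex array, same shape as X

def dual_decomposition(Z, kernel=None):
    """Eigendecomposition in the dual (R-mode) space.
    kernel=None uses linear inner products (the xMCA-like case);
    passing a kernel function on the complex data gives the ROCK-PCA setting."""
    K = Z @ Z.conj().T if kernel is None else kernel(Z)   # n x n Hermitian matrix
    lam, U = np.linalg.eigh(K)                       # eigh handles Hermitian matrices
    order = np.argsort(lam)[::-1]
    return lam[order], U[:, order]
```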

In this short correspondence we provide a theoretical proof that xMCA is a special case of ROCK-PCA, and quantitative evidence that more expressive and informative features can be extracted when working with kernels; results of the decomposition of global sea surface temperature (SST) fields illustrate the ability of ROCK-PCA to cope with nonlinear processes, unlike xMCA.

2 On the theoretical equivalence

Let us consider two datasets ${\bf X}_a\in\mathbb{R}^{n\times d_a}$ and ${\bf X}_b\in\mathbb{R}^{n\times d_b}$ containing samples from two random variables $A$ and $B$ with the same temporal dimension $n$ and spatial dimensions $d_a$ and $d_b$, respectively. MCA maximizes the cross-covariance ${\bf C}_{ab}=\frac{1}{n-1}\tilde{\bf X}_a^\top\tilde{\bf X}_b$ subject to the orthogonality constraints ${\bf U}_a^\top{\bf C}_{aa}{\bf U}_a={\bf U}_b^\top{\bf C}_{bb}{\bf U}_b=1$. MCA can be solved as a generalized eigenvalue problem or, alternatively, as a singular value decomposition (SVD). Actually, as noted in [1], MCA solves exactly the same problem as standard canonical correlation analysis (CCA) [6], which was extended to work with more than just two datasets in [7]. In practice, CCA has a limitation when working with more dimensions (spatial grid points) than samples (time steps), $n\ll d_a$ and $n\ll d_b$, and hence it typically requires extra regularization. This is a well-known issue in standard CCA, but it does not play in favour of MCA either, since MCA involves the same (and very high) computational cost, i.e. that of solving an SVD of a matrix of size $d_a\times d_b$. This huge computational problem can actually be addressed by working in the dual ($R$-mode) instead of the primal ($Q$-mode), where [1] define it. We note here that this is not only more computationally convenient but also allows us to generalize xMCA to nonlinear cases by means of the theory of kernels [5], and to show that it is a particular case of the kernel CCA (kCCA) method [8], directly connected to ROCK-PCA [3]. In what follows, we show these theoretical connections, and demonstrate that kCCA not only generalizes CCA and MCA but, with the proper term arrangement, also reduces to solving a single eigenproblem as in kernel PCA approaches [5, 3].

2.1 MCA is equivalent to CCA

Both MCA and CCA solve the problem of finding orthogonal projections ${\bf U}_a$ and ${\bf U}_b$ of the data ${\bf X}_a$ and ${\bf X}_b$, respectively, such that the cross-covariance of the projected data is maximized:

$$\max_{{\bf U}_a,{\bf U}_b}\left\{({\bf X}_a{\bf U}_a)^\top({\bf X}_b{\bf U}_b)\right\}\quad\text{s.t.}\quad({\bf X}_a{\bf U}_a)^\top{\bf X}_a{\bf U}_a=({\bf X}_b{\bf U}_b)^\top{\bf X}_b{\bf U}_b=1,$$

that is

$$\max_{{\bf U}_a,{\bf U}_b}\left\{{\bf U}_a^\top{\bf C}_{ab}{\bf U}_b\right\}\quad\text{s.t.}\quad{\bf U}_a^\top{\bf C}_{aa}{\bf U}_a={\bf U}_b^\top{\bf C}_{bb}{\bf U}_b=1.$$

One can solve this problem via the singular value decomposition (SVD) of the cross-covariance matrix, ${\bf C}_{ab}={\bf U}_a\bm{\Sigma}{\bf U}_b^\top$, as suggested in [1]. Alternatively, one can proceed in the standard way by introducing the orthogonality constraints, deriving and equating to zero, which leads to the generalized eigenproblem ${\bf C}_{ab}{\bf C}_{bb}^{-1}{\bf C}_{ab}^\top{\bf U}_a=\bm{\Lambda}^2{\bf C}_{aa}{\bf U}_a$, or to the better-conditioned eigenproblem ${\bf C}_{aa}^{-1}{\bf C}_{ab}{\bf C}_{bb}^{-1}{\bf C}_{ab}^\top{\bf U}_a=\bm{\Lambda}^2{\bf U}_a$. The equivalence between SVD and CCA has been widely studied too, and has led to interesting and controversial discussions about the appropriateness, superiority and interpretability of the feature projections computed by one method over the other, as pointed out in [9] and later nicely studied and summarized in [10].
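As a minimal illustration, both routes can be written in a few lines of NumPy for centered data matrices; this is our own sketch with illustrative names, not the implementation used in [1]:

```python
import numpy as np

def mca(Xa, Xb):
    """MCA via SVD of the cross-covariance. Xa: (n, da), Xb: (n, db), centered along time."""
    n = Xa.shape[0]
    Cab = Xa.T @ Xb / (n - 1)                       # cross-covariance, (da, db)
    Ua, s, Ubt = np.linalg.svd(Cab, full_matrices=False)
    return Ua, s, Ubt.T                             # left/right patterns and singular values

def cca(Xa, Xb, reg=1e-6):
    """Primal CCA: solve Caa^{-1} Cab Cbb^{-1} Cab^T Ua = Lambda^2 Ua (reg added for conditioning)."""
    n = Xa.shape[0]
    Caa = Xa.T @ Xa / (n - 1) + reg * np.eye(Xa.shape[1])
    Cbb = Xb.T @ Xb / (n - 1) + reg * np.eye(Xb.shape[1])
    Cab = Xa.T @ Xb / (n - 1)
    M = np.linalg.solve(Caa, Cab @ np.linalg.solve(Cbb, Cab.T))
    lam2, Ua = np.linalg.eig(M)                     # eigenvalues = squared canonical correlations
    order = np.argsort(lam2.real)[::-1]
    return lam2.real[order], Ua.real[:, order]
```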

2.2 Dual CCA helps in the computational cost

Solving the SVD or the generalized eigenproblem involved in MCA is very challenging because of the required regularization and computational cost. Note that in standard problems in the geosciences one has more dimensions (spatial grid points) than samples (time steps), and hence the cross-covariance matrix is very large, ${\bf C}_{ab}\in\mathbb{R}^{d_a\times d_b}$. However, this can be addressed by operating in $R$-mode instead, as is typically done in PCA/EOF analysis. Let us define the projection matrices ${\bf U}_a$ and ${\bf U}_b$ in the span of the data matrices ${\bf X}_a$ and ${\bf X}_b$, that is ${\bf U}_a={\bf X}_a^\top{\bf U}$ and ${\bf U}_b={\bf X}_b^\top{\bf U}$, where ${\bf U}\in\mathbb{R}^{n\times n}$ is the new projection matrix, with typically $n\ll d_a$ and $n\ll d_b$. Now, the problem can be rewritten as

$$\max_{{\bf U}}\left\{{\bf U}^\top{\bf X}_a{\bf X}_a^\top{\bf X}_b{\bf X}_b^\top{\bf U}\right\}\quad\text{s.t.}\quad{\bf U}^\top{\bf X}_a{\bf X}_a^\top{\bf X}_a{\bf X}_a^\top{\bf U}={\bf U}^\top{\bf X}_b{\bf X}_b^\top{\bf X}_b{\bf X}_b^\top{\bf U}=1,$$

that is

$$\max_{{\bf U}}\left\{{\bf U}^\top{\bf G}_{aa}{\bf G}_{bb}{\bf U}\right\}\quad\text{s.t.}\quad{\bf U}^\top{\bf G}_{aa}{\bf G}_{aa}{\bf U}={\bf U}^\top{\bf G}_{bb}{\bf G}_{bb}{\bf U}=1,$$

where ${\bf G}_{aa}={\bf X}_a{\bf X}_a^\top$ and ${\bf G}_{bb}={\bf X}_b{\bf X}_b^\top$ are the Gram matrices (inner products between the data samples). One could use the SVD to solve this dual ($R$-mode) problem simply as ${\bf G}_{aa}{\bf G}_{bb}={\bf U}_a\bm{\Sigma}{\bf U}_b^\top$, or solve the equivalent dual CCA problem with (properly regularized) Gram matrices as before: ${\bf G}_{aa}{\bf G}_{bb}({\bf G}_{bb}{\bf G}_{bb})^{-1}{\bf G}_{bb}{\bf G}_{aa}{\bf U}=\bm{\Lambda}{\bf G}_{aa}{\bf G}_{aa}{\bf U}$, or $({\bf G}_{aa}{\bf G}_{aa})^{-1}{\bf G}_{aa}{\bf G}_{bb}({\bf G}_{bb}{\bf G}_{bb})^{-1}{\bf G}_{bb}{\bf G}_{aa}{\bf U}=\bm{\Lambda}{\bf U}$. Obviously, a great computational advantage is achieved by solving this dual problem: only two (much smaller, $n\times n$) matrices are actually computed, and only one of them needs to be inverted. We introduced this technicality to help the use and adoption of MCA, but mainly to note that it allows us to derive the nonlinear generalization of MCA that follows, which is equivalent to the method proposed in [3].
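The dual route is straightforward to implement. The following is an illustrative sketch (our own naming, with a small Tikhonov term added for numerical stability), from which the primal projections are recovered as ${\bf U}_a={\bf X}_a^\top{\bf U}$ and ${\bf U}_b={\bf X}_b^\top{\bf U}$:

```python
import numpy as np

def dual_cca(Xa, Xb, reg=1e-6):
    """Dual (R-mode) CCA on Gram matrices; Xa (n, da) and Xb (n, db) are centered."""
    n = Xa.shape[0]
    Gaa = Xa @ Xa.T                                  # (n, n) Gram matrices
    Gbb = Xb @ Xb.T
    I = np.eye(n)
    # (Gaa Gaa)^{-1} Gaa Gbb (Gbb Gbb)^{-1} Gbb Gaa U = Lambda U
    M = np.linalg.solve(Gaa @ Gaa + reg * I,
                        Gaa @ Gbb @ np.linalg.solve(Gbb @ Gbb + reg * I, Gbb @ Gaa))
    lam, U = np.linalg.eig(M)
    order = np.argsort(lam.real)[::-1]
    return lam.real[order], U.real[:, order]         # dual projections of size n
```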

2.3 Kernelized CCA exists and generalizes MCA

Working with Gram matrices and this duality has been widely studied and proposed in multivariate statistics [11]. Yet, more interestingly, nonlinear versions can be derived by using similar arguments within the field of kernel methods [5, 4, 12]. The framework of kernel multivariate statistics is aimed at extracting nonlinear projections while still working with linear algebra. Let us first consider a feature map $\bm{\phi}:\mathbb{R}^d\rightarrow\cal F$ that projects the input data into a Hilbert feature space $\cal F$. The new mapped data set is defined as $\bm{\Phi}=[\bm{\phi}({\bf x}_1),\ldots,\bm{\phi}({\bf x}_n)]^\top$, and the features extracted from the input data will now be given by $\bm{\Phi}'=\bm{\Phi}{\bf U}$, where the matrix ${\bf U}$ is of size $\text{dim}({\cal F})\times n_f$.

To implement practical kernel MVA algorithms we need to rewrite the equations in terms of inner products in $\cal F$ only. For doing so, we rely on the availability of two kernel matrices ${\bf K}_{aa}=\bm{\Phi}_a\bm{\Phi}_a^\top$ and ${\bf K}_{bb}=\bm{\Phi}_b\bm{\Phi}_b^\top$ of dimension $n\times n$, and on the representer theorem [13], which states that the projection vectors can be written as a linear combination of the training samples, i.e. ${\bf U}=\bm{\Phi}_a^\top{\bf A}_a$ and ${\bf U}=\bm{\Phi}_b^\top{\bf A}_b$, with the matrices ${\bf A}_a$ and ${\bf A}_b$ being the new arguments of the optimization. Note the resemblance with the previous dual reformulation of the problem when working with Gram matrices instead of covariance matrices. This is typically referred to as the kernel trick and has been used to develop nonlinear (kernel-based) versions of standard linear methods like PCA, PLS, and CCA/MCA. The nonlinear kernel PCA (kPCA) was introduced in [5] and later the kernel CCA (kCCA) in [8]. Actually, this nonlinear generalization of CCA, and hence of MCA, is not new and has been studied and used widely [14, 15, 16, 17]. Many improved variants have been proposed to cope with more than two datasets in multiview settings [8], with semisupervised learning problems [18, 4], and to deal with temporal domains efficiently [19]. It is easy to show that kCCA boils down to solving ${\bf K}_{aa}{\bf K}_{bb}({\bf K}_{bb}{\bf K}_{bb})^{-1}{\bf K}_{bb}{\bf K}_{aa}{\bf A}=\bm{\Lambda}{\bf K}_{aa}{\bf K}_{aa}{\bf A}$, or equivalently the eigenproblem $({\bf K}_{aa}{\bf K}_{aa})^{-1}{\bf K}_{aa}{\bf K}_{bb}({\bf K}_{bb}{\bf K}_{bb})^{-1}{\bf K}_{bb}{\bf K}_{aa}{\bf A}=\bm{\Lambda}{\bf A}$, with properly regularized matrices.

In kernel methods one has to choose the form of the kernel function $k({\bf x},{\bf x}')$ that encodes the similarities between data points. It is customary to use the radial basis function (RBF) kernel, $k({\bf x},{\bf x}')=\exp(-\gamma\|{\bf x}-{\bf x}'\|^2)$, as it is a characteristic kernel with good theoretical and practical properties: it has only the parameter $\gamma$ to tune, and it generalizes the linear kernel by capturing all higher-order moments of similarity between time series. This is the kernel function we used in [3]. kCCA maximizes the same objective as linear CCA and MCA, yet in $\cal F$. kCCA is computationally efficient in our setting and can extract nonlinear relations. Also note that when a linear (dot-product) kernel is used for both domains, ${\bf K}_{aa}={\bf G}_{aa}={\bf X}_a{\bf X}_a^\top$ and ${\bf K}_{bb}={\bf G}_{bb}={\bf X}_b{\bf X}_b^\top$, one recovers the equivalent yet more efficient CCA/MCA introduced before. Therefore, kCCA generalizes CCA (or MCA) and the solution is the same as in the kPCA schemes in [5, 3].
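A minimal kCCA sketch with RBF kernels follows; it is our own illustrative implementation (the kernel centering and regularization choices are ours), and with linear kernels it reduces to the dual CCA sketched above:

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X, gamma):
    """RBF kernel matrix k(x, x') = exp(-gamma ||x - x'||^2)."""
    return np.exp(-gamma * cdist(X, X, 'sqeuclidean'))

def center_kernel(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Xa, Xb, gamma_a, gamma_b, reg=1e-3):
    """Kernel CCA: (Kaa Kaa)^{-1} Kaa Kbb (Kbb Kbb)^{-1} Kbb Kaa A = Lambda A."""
    n = Xa.shape[0]
    Kaa = center_kernel(rbf_kernel(Xa, gamma_a))
    Kbb = center_kernel(rbf_kernel(Xb, gamma_b))
    I = np.eye(n)
    M = np.linalg.solve(Kaa @ Kaa + reg * I,
                        Kaa @ Kbb @ np.linalg.solve(Kbb @ Kbb + reg * I, Kbb @ Kaa))
    lam, A = np.linalg.eig(M)
    order = np.argsort(lam.real)[::-1]
    return lam.real[order], A.real[:, order]
```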

2.4 Remarks

The xMCA proposed in [1] solves MCA (or alternatively CCA) in the complex plane and adds an extra oblique rotation therein, as proposed earlier in [3]. These important operations do not change the rationale that connects xMCA and ROCK-PCA, because both methods operate on the same data. The previous derivations can actually be carried out with complex data and after a rotation in Hilbert spaces: the same arguments and equations are valid when working with complex algebra in Hilbert spaces, cf. [20], and a rotation in Hilbert spaces can also be expressed solely in terms of kernel evaluations [5, 12].

3 On the information content of the components

A classical way to evaluate dimensionality reduction methods like PCA or CCA is to quantify how much variance of the data is explained by retaining a number of principal components. A more sensible criterion is information, but its estimation is complicated in multivariate and high-dimensional problems because it typically implies multivariate density estimation. Nevertheless, kernel methods have provided an adequate framework to estimate independence between random variables and hence to quantify information [21]. Interestingly, there is a tight connection between kernel dependence estimation and kCCA; in [22], a regularized correlation operator was derived from the covariance and cross-covariance operators, and its largest singular value (the kernel canonical correlation, or kCCA) was used as a statistic to test independence. Later, the largest singular value of the cross-covariance operator, called the constrained covariance (COCO), was proposed in [23] as an efficient alternative to kCCA that needs no regularization. A variety of empirical kernel quantities derived from bounds on the mutual information that hold near independence were also proposed, namely the kernel generalized variance (kGV) and the kernel mutual information (kMI) [24]. Later, HSIC [25] was introduced as an extension of COCO that considers the entire spectrum of the cross-covariance operator, not just its largest singular value. An illustrative example of these measures on synthetic data is given in Table 1, which evidences that linear methods (like correlation, CCA and MCA) fail at estimating dependence, while their kernel versions capture the nonlinear data structure and thus provide better information measures (a minimal HSIC sketch is given after Fig. 1). We show in Fig. 1 how a nonlinear version of CCA/MCA (kCCA) can yield more information for a wide range of values of the kernel length scale $\sigma$, thus generalizing its linear counterpart.

[Three synthetic scatter-plot examples, not reproduced here: left, highly correlated (and hence dependent) data; middle, dependent but uncorrelated data; right, uncorrelated and independent data.]
Pearson’s R 0.94 0.04 0.06
MCA 1.64 0.00 0.00
CCA 0.94 0.04 0.06
HSIC (linear kernels) 3.42 0.25 0.00
HSIC (RBF kernels) 21.32 13.49 0.86
kGV 5.82 4.85 1.44
kCCA 0.95 0.96 0.42
Table 1: Dependence estimates for three examples revealing high correlation (and hence high dependence) [left], high dependence but null correlation [middle], and zero correlation and dependence [right]. Pearson's correlation coefficient R, MCA, CCA and linear HSIC only capture second-order statistics, while the rest capture higher-order dependencies in general.
Figure 1: Mutual information as a function of the kernel length scale $\sigma$ for the linear case ($p=1$, dashed) and for kCCA (nonlinear CCA/MCA). As $\sigma$ increases, the nonlinear version approaches the solution of the linear one in all three cases. There is an optimal $\sigma$ to capture both correlation (left plot) and dependence (middle plot), while for the independence case (right plot) the difference is insignificant (note the low values of mutual information) across the whole range of the kernel length scale $\sigma$.
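To make the dependence measures of Table 1 concrete, the following is a minimal HSIC sketch contrasting linear and RBF kernels; the estimator, the kernel widths and the toy data are illustrative choices of ours, not the exact setup behind the table:

```python
import numpy as np
from scipy.spatial.distance import cdist

def _rbf(X, sigma):
    return np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma**2))

def hsic(X, Y, kernel='rbf', sigma=1.0):
    """Biased HSIC estimate: (n-1)^{-2} trace(K H L H), with H the centering matrix."""
    n = X.shape[0]
    K = X @ X.T if kernel == 'linear' else _rbf(X, sigma)
    L = Y @ Y.T if kernel == 'linear' else _rbf(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Toy example: a quadratic dependence that is invisible to linear measures.
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 1))
y = x**2 + 0.1 * rng.standard_normal((500, 1))
print(hsic(x, y, 'linear'), hsic(x, y, 'rbf', sigma=1.0))
```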

4 An empirical comparison

We have shown that a kernel method generalizes its linear counterpart and yields more expressive (informative) features, while still relying on linear algebra. Yet, one may ask why nonlinear components are necessary in practice for the study of teleconnections and time-lagged correlations in the geosciences. We illustrate this with an experiment decomposing the global sea surface temperature (SST) from the Met Office HadISST1 reanalysis [26]. The same example was introduced in [1] with the linear xMCA and previously in [3] with the nonlinear PCA generalization, both operating in the complex domain and with an extra oblique rotation.

Decomposing the spatio-temporal data cube with each method yields similar principal components of SST, with only very minor differences. The explained variance is more scattered across the eigenspectrum with xMCA than with ROCK-PCA: for example, the explained variance of the first principal component (PC1), corresponding to the annual oscillation, is $62.7\%$ for xMCA and $88.1\%$ for ROCK-PCA. This is an under-representation issue in xMCA, since the annual amplitude is close to 8.5$^\circ$C while the inter-annual variability is no larger than 0.11$^\circ$C per decade, with a global average increase of 0.3$^\circ$C from 1960 to 2020 [27, 28]. This causes an unrealistic attribution of relevance to the interdecadal trend (PC3 of xMCA in [1]), which accounts for $6.9\%$ of the overall variance. This difference in the attribution of the explained variance emerges from the definition of MCA, since the extracted eigenvalues do not match the individual explained variance of the separate phenomena when more than a single variable is involved. The extracted components have the same physical interpretation in both cases, see Figures 2 and 3 (adapted from [3]), but they differ in how the components are mixed.
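For reference, the explained-variance fractions quoted above are obtained from the eigen/singular spectrum in the usual way (illustrative snippet; the exact preprocessing in [1] and [3] may differ):

```python
import numpy as np

def explained_variance(eigvals):
    """Fraction of total variance attributed to each component."""
    eigvals = np.asarray(eigvals, dtype=float)
    return eigvals / eigvals.sum()

# explained_variance(lam)[0] would give the PC1 fraction (e.g. 0.881 for ROCK-PCA here).
```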

Figure 2: Global SST decomposition results by ROCK-PCA (adapted from [3]); compare with Fig. 5 and Fig. 8 in [1]. Spatial distribution of the first (a) and second (b) components, showing the spatial covariation amplitude normalized to one (top row), the spatial covariation phase (middle row), and the temporal component (bottom row). ENSO regions are also shown.

Rotating the principal components with Varimax or Promax allows us to linearly maximize the variance representation of a subset of principal components (a minimal sketch of the Varimax criterion is given below). This can lead to a more representative subset of modes of variation if the represented physical phenomena are linearly separable. Yet, a nonlinear spatio-temporal mixture of physical phenomena will not be separable by a linear spatio-temporal decomposition plus an extra rotation, as in xMCA. In this case, we can see that xMCA identifies the third PC as the interdecadal trend, but the spatial distribution of the variability is mixed with other patterns, such as the ENSO spatial pattern, whereas it should exhibit a more homogeneous distribution [27]. Figure 2 instead shows, in the right panel, the interdecadal trend identified by ROCK-PCA, with a more consistent spatial pattern, as previously reported elsewhere [27].
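For completeness, here is a minimal sketch of the (real-valued) Varimax rotation criterion; this is a generic, textbook-style implementation of ours for a loading matrix, not the complex-valued rotation code used in [1] or [3]:

```python
import numpy as np

def varimax(Phi, gamma=1.0, max_iter=100, tol=1e-8):
    """Varimax rotation of a (d, k) loading matrix Phi; returns rotated loadings and rotation."""
    d, k = Phi.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        L = Phi @ R
        # Gradient of the Varimax criterion (maximize the variance of squared loadings per column).
        G = Phi.T @ (L**3 - (gamma / d) * L @ np.diag(np.sum(L**2, axis=0)))
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                                   # closest orthogonal rotation to the gradient
        var_new = np.sum(s)
        if var_old != 0.0 and var_new / var_old < 1.0 + tol:
            break
        var_old = var_new
    return Phi @ R, R
```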

Figure 3: Spatial distribution of the normalized amplitude of the third component (top), the spatial covariation phase (middle), and the temporal component (bottom), obtained by ROCK-PCA [3]. The ENSO3.4 index is highly correlated with the real part of the third component ($\rho=0.81$). Compare with Fig. 10 in [1], where the corresponding PC is highly correlated with the ENSO Modoki index.

In Figure 4, we show the difference between the linear and the nonlinear estimates of the covariance matrix: the scatter plots show how the complex components obtained with ROCK-PCA (right) separate the data into monthly variability, unlike in the linear case (left). By adding the nonlinear mapping of the data, we can differentiate the different physical phenomena, which a linear rotation cannot disentangle. These results support the use of more general nonlinear approaches to decompose climate data, since they allow unraveling physical phenomena mixed in spatio-temporal data and thus extracting more meaningful and informative representations.

Figure 4: Scatter plots of the real and imaginary components of the linear (left) and nonlinear (right) covariance matrices of SST. Mapping the data into a kernel Hilbert space allows us to recover the monthly variability structure as compact, semantically meaningful clusters.

5 Outlook

Both ROCK-PCA [3] and xMCA [1] extract features in a complex domain and apply an additional oblique rotation, both of which were proposed to deal with time-lagged processes in spatio-temporal data. In contrast to xMCA, ROCK-PCA can extract nonlinear features, as it operates in a reproducing kernel Hilbert space. We showed that xMCA is a particular case of kernel CCA when assuming linearity, and we showed the tight connection between xMCA and ROCK-PCA. Working in the dual domain ($R$-mode) is not only computationally advantageous but also allowed us to derive the nonlinear version of xMCA which, after reformulation as an eigenproblem, reduces to solving ROCK-PCA when a single dataset is involved. ROCK-PCA is not only more computationally efficient, but also provides more expressive and informative components, which permit capturing meaningful physical processes more accurately. Using a linear kernel is indeed quite restrictive in general, as it has been widely demonstrated that components and modes of variability in Earth and climate data show nonlinear relations. Actually, in other works, we explored the potential use of ROCK-PCA to characterize nonlinear cross-information between covariates and global teleconnections [29] and to identify causal footprints of climate phenomena [30]. In all these cases, nonlinear kernels were strictly necessary to achieve physically meaningful results. We hope that this clarification places the novelty and limitations of xMCA in context, motivates the use of nonlinear approaches to study teleconnections, and ultimately encourages contributions in this exciting field of statistical analysis of spatio-temporal Earth and climate data.

Acknowledgments

This work was supported by the "Understanding and Modeling of the Earth System with Machine Learning" (USMILE) project, funded by the European Research Council (ERC) Synergy Grant 2019 under Agreement No. 855187.

References

  • [1] Niclas Rieger, Álvaro Corral, Estrella Olmedo, and Antonio Turiel. Lagged teleconnections of climate variables identified via complex rotated maximum covariance analysis. Journal of Climate, 34(24):9861 – 9878, 2021.
  • [2] J. D. Horel. Complex principal component analysis: Theory and examples. Journal of Applied Meteorology and Climatology, 23(12):1660 – 1673, 1984.
  • [3] Diego Bueso, Maria Piles, and Gustau Camps-Valls. Nonlinear PCA for spatio-temporal analysis of earth observation data. IEEE Transactions on Geoscience and Remote Sensing, 58(8):5752–5763, 2020.
  • [4] Jerónimo Arenas-García and Kaare Brandt Petersen. Kernel Multivariate Analysis in Remote Sensing Feature Extraction, chapter 14, pages 327–352. John Wiley & Sons, Ltd, 2009.
  • [5] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
  • [6] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
  • [7] Jon R Kettenring. Canonical analysis of several sets of variables. Biometrika, 58(3):433–451, 1971.
  • [8] Pei Ling Lai and Colin Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(05):365–377, 2000.
  • [9] Matthew Newman and Prashant D Sardeshmukh. A caveat concerning singular value decomposition. Journal of Climate, 8(2):352–360, 1995.
  • [10] Steve Cherry. Singular value decomposition analysis and canonical correlation analysis. Journal of Climate, 9(9):2003–2009, 1996.
  • [11] Ian T Jolliffe. Rotation of principal components: some comments. Journal of Climatology, 7(5):507–510, 1987.
  • [12] José Luis Rojo-Álvarez, Manel Martínez-Ramón, Jordi Munoz-Mari, and Gustau Camps-Valls. Digital signal processing with Kernel methods. John Wiley & Sons, 2018.
  • [13] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
  • [14] Siamak Mehrkanoon and Johan A. K. Suykens. Regularized semipaired kernel CCA for domain adaptation. IEEE Transactions on Neural Networks and Learning Systems, 29(7):3199–3213, 2017.
  • [15] Carlos Alzate and Johan A. K. Suykens. A regularized kernel CCA contrast function for ICA. Neural Networks, 21(2-3):170–181, 2008.
  • [16] Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf, et al. Kernel methods for measuring independence. JMLR, 2005.
  • [17] Kenji Fukumizu, Francis R Bach, and Arthur Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8(2), 2007.
  • [18] Matthew B. Blaschko, Christoph H. Lampert, and Arthur Gretton. Semi-supervised Laplacian regularization of kernel canonical correlation analysis. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 133–145. Springer, 2008.
  • [19] Felix Bießmann, Frank C. Meinecke, Arthur Gretton, Alexander Rauch, Gregor Rainer, Nikos K. Logothetis, and Klaus-Robert Müller. Temporal kernel CCA and its application in multimodal neuronal data analysis. Machine Learning, 79(1):5–27, 2010.
  • [20] Pantelis Bouboulis and Sergios Theodoridis. Extension of Wirtinger's calculus to reproducing kernel Hilbert spaces and the complex kernel LMS. IEEE Transactions on Signal Processing, 59(3):964–978, 2010.
  • [21] A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, pages 63–77, 2005.
  • [22] Francis Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
  • [23] A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf, and N. Logothetis. Kernel constrained covariance for dependence measurement. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 112–119, New Jersey, 2005. Society for Artificial Intelligence and Statistics.
  • [24] A. Gretton, R. Herbrich, and A. Hyvärinen. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005.
  • [25] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Proc. 16th Intl. Conf. Algorithmic Learning Theory, pages 63–77. Springer, 2005.
  • [26] N. A. Rayner, D. E. Parker, E. B. Horton, C. K. Folland, L. V. Alexander, D. P. Rowell, E. C. Kent, and A. Kaplan. Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century. Journal of Geophysical Research: Atmospheres, 108(D14), 2003.
  • [27] Claire E. Bulgin, Christopher J. Merchant, and David Ferreira. Tendencies, variability and persistence of sea surface temperature anomalies. Scientific Reports, 10:7986, 2020.
  • [28] Ge Chen, Baomin Shao, Yong Han, Jun Ma, and Bertrand Chapron. Modality of semiannual to multidecadal oscillations in global sea surface temperature variability. Journal of Geophysical Research: Oceans, 115(C3), 2010.
  • [29] Diego Bueso, Maria Piles, and Gustau Camps-Valls. Cross-information kernel causality: Revisiting global teleconnections of ENSO over soil moisture and vegetation. In Proceedings of the 9th International Workshop on Climate Informatics: CI 2019, pages 172–176, Paris, 2019. NCAR.
  • [30] Diego Bueso, Maria Piles, and Gustau Camps-Valls. Explicit Granger causality in kernel Hilbert spaces. Phys. Rev. E, 102:062201, 2020.