
Integrative decomposition of multi-source data by identifying partially-joint score subspaces

SeoWon Choi [email protected]  Sungkyu Jung [email protected]
Abstract

Analysis of multi-source data, in which data on the same objects are collected from multiple sources, is of rising importance in many fields, most notably in multi-omics biology. A novel framework and algorithms for integrative decomposition of such multi-source data are proposed to identify and sort out common factor scores in terms of whether the scores are relevant to all data sources (fully joint), to some data sources (partially joint), or to a single data source. The key difference between the proposed method and existing approaches is that raw source-wise factor score subspaces are utilized in the identification of the partially-joint block-wise association structure. To identify common score subspaces, which may be partially joint to some of the data sources, from noisy observations, the proposed algorithm sequentially computes one-dimensional flag means among source-wise score subspaces, then collects the subspaces that are close to the mean. The proposed decomposition boasts fast computational speed, and is superior to competing approaches in identifying the true partially-joint association structure and in recovering the joint loading and score subspaces. The proposed decomposition is applied to a blood cancer multi-omics data set containing measurements from three data sources. Our method identifies a latent score, partially joint to the drug panel and methylation profile data sources but not relevant to RNA sequencing profiles, which helps discover hidden clusters in the data.

keywords:
Multi-block data , Factor model , Principal angles , Data integration , Dimension reduction
journal: Computational Statistics and Data Analysis
with the numbers of columns of $(U_{(1),1}^{T}, U_{(2),1}^{T})^{T}$ and $U_{(1),2}$ being two and one, respectively.
With the block-wise sparse constraint imposed, the objective function (LABEL:rule4) for $U$ can be written separately for each data block, i.e.,

$$\|\widehat{Z}-U\widehat{W}^{T}\|_{F}^{2}=\sum_{k=1}^{K}\|\widehat{Z}_{k}-U_{(k)}\widehat{W}_{(k)}^{T}\|_{F}^{2}. \qquad (8)$$

Here, $U_{(k)}$ and $\widehat{W}_{(k)}$ are the column-wise concatenations of the $U_{(k),i}$'s and $\widehat{W}_{i}$'s, respectively, for $i\in\{i: k\in S_{i} \text{ and } \widehat{r}(S_{i})>0\}$. The minimizer of (8) is $\widehat{U}_{(k)}=\widehat{Z}_{k}\widehat{W}_{(k)}(\widehat{W}_{(k)}^{T}\widehat{W}_{(k)})^{-1}$, and the $\widehat{U}_{(k),i}$ for $i\in J(k)$ are obtained by disjoining $\widehat{U}_{(k)}$. By the block-wise sparse structure, we set $\widehat{U}_{(k),i}=0$ if $k\notin S_{i}$.
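As a sanity check, the block-wise least-squares update $\widehat{U}_{(k)}=\widehat{Z}_{k}\widehat{W}_{(k)}(\widehat{W}_{(k)}^{T}\widehat{W}_{(k)})^{-1}$ can be sketched in NumPy. The variable names, matrix orientations, and the noiseless recovery check below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def blockwise_loading(Z_k, W_k):
    """Minimize ||Z_k - U W_k^T||_F^2 over U for a single data block.

    Z_k : (p_k, n) signal matrix for block k (assumed orientation)
    W_k : (n, r)   column-wise concatenation of scores whose index sets
                   contain block k
    """
    # Solving the normal equations via lstsq is numerically more stable
    # than forming (W_k^T W_k)^{-1} explicitly.
    U_t, *_ = np.linalg.lstsq(W_k, Z_k.T, rcond=None)
    return U_t.T  # (p_k, r)

# Noiseless example: the update recovers the true loading matrix exactly.
rng = np.random.default_rng(0)
n, p_k, r = 50, 20, 3
W = rng.standard_normal((n, r))
U_true = rng.standard_normal((p_k, r))
Z = U_true @ W.T
U_hat = blockwise_loading(Z, W)
print(np.allclose(U_hat, U_true))  # True
```

Rows of the resulting $\widehat{U}_{(k)}$ can then be split (disjoined) column-wise into the individual $\widehat{U}_{(k),i}$ blocks, with $\widehat{U}_{(k),i}=0$ whenever $k\notin S_i$.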

3.3 Tuning Parameter Selection

The partially-joint structure identification, proposed in Section 3.1, depends heavily on the tuning parameter $\lambda\in[0,\pi/2)$. If $\lambda$ is too small, then all scores are identified as individual scores, specific to each data block. If $\lambda$ is too large, then individual and partially-joint scores may be falsely identified as fully-joint scores. We use data splitting to select the value of the tuning parameter $\lambda\in[0,\pi/2)$. For a single instance of data splitting, randomly split the $n$ samples of $X=[X_{1}^{T},\ldots,X_{K}^{T}]^{T}$ into two groups of equal proportions, the training set $X_{tr}=[X_{tr,1}^{T},\ldots,X_{tr,K}^{T}]^{T}$ and the test set $X_{test}=[X_{test,1}^{T},\ldots,X_{test,K}^{T}]^{T}$. Given the signal rank $r_{k}$ of each $X_{k}$, we then extract the training signal matrices $\widehat{Z}_{tr,k}$ for $k=1,\ldots,K$ using the rank-$r_{k}$ approximation of $X_{tr,k}$. For each $\lambda$ on the tuning parameter grid, we identify the partially-joint structure from the $\widehat{Z}_{tr,k}$'s, and obtain the partially-joint score $\widehat{W}_{tr,\lambda}$ and the partially-joint loading matrix $\widehat{U}_{tr,\lambda}$, as discussed in Sections 3.1 and 3.2. To assess the degree to which the estimates generalize to the test set, we first evaluate the score matrix for the test set, given the loading matrix estimates $\widehat{U}_{tr,\lambda}$ from the training data. The test score matrix $\widehat{W}_{test,\lambda}$ is defined as the minimizer $\widehat{W}_{test,\lambda}\in\mathbb{R}^{n_{test}\times\widehat{r}}$ of