
On the Computational Benefit of Multimodal Learning

Zhou Lu
Princeton University
[email protected]
Abstract

Human perception inherently operates in a multimodal manner. Similarly, as machines interpret the empirical world, their learning processes ought to be multimodal. The recent, remarkable successes in empirical multimodal learning underscore the significance of understanding this paradigm. Yet, a solid theoretical foundation for multimodal learning has eluded the field for some time. While a recent study by [13] has shown the superior sample complexity of multimodal learning compared to its unimodal counterpart, another basic question remains: does multimodal learning also offer computational advantages over unimodal learning? This work initiates a study on the computational benefit of multimodal learning. We demonstrate that, under certain conditions, multimodal learning can outpace unimodal learning exponentially in terms of computation. Specifically, we present a learning task that is NP-hard for unimodal learning but is solvable in polynomial time by a multimodal algorithm. Our construction is based on a novel modification to the intersection of two half-spaces problem.

1 Introduction

At the heart of human perception lies multimodality. This capability enables us to perceive and interrelate different facets of the same empirical object. It’s particularly important during the infantile stage of human development, where it helps unify disparate symbols, fostering comprehensive cognition as a foundation for adulthood. The analogy of raising a child in a “room of text” alone highlights the limitations of a unimodal approach; it’s bound to be counterproductive.

In the realm of machine learning, multimodality plays a role analogous to its significance in human cognition. Here, we view machine learning as the machine’s process of perception. Multimodal learning entails accumulating vast amounts of training data across various modalities and subsequently deploying the trained model to handle new unimodal tasks. This learning progression mirrors the transition from infancy to adulthood in humans. Empirical studies have consistently shown that models trained using multiple modalities often surpass finely-tuned unimodal models, even when evaluated on new unimodal data.

In spite of notable empirical successes, like Gato [17] and GPT-4 [15], the theoretical explanations of multimodal learning remain relatively underexplored. Thus, establishing a solid theoretical foundation becomes imperative.

A recent study by [13] set the stage for a broader understanding of the statistical advantages of multimodal learning. The research showed that multimodal learning achieves superior generalization bounds compared to unimodal learning, especially when the data exhibits both connection and heterogeneity. However, the question arose: does multimodal learning also present computational advantages?

Our work provides an affirmative answer. We show a computational separation between multimodal and unimodal learning. Specifically, we introduce a learning task, rooted in the intersection of two half-spaces problem, which poses an NP-hard challenge for any unimodal learning algorithm. Yet, this very task yields to a polynomial solution under a multimodal learning paradigm. This dichotomy demonstrates the potential exponential computational advantage of multimodal learning over its unimodal counterpart. Coupled with the statistical insights from [13], our findings further illuminate the vast potential of multimodal learning.

1.1 Related Works

Theoretical Multimodal Learning: Despite the empirical success of multimodal learning, a cohesive theoretical foundation was long missing in this area. Most existing theoretical findings are bound by specific assumptions and contexts. For instance, studies such as [23, 1, 7, 19] navigate multimodal learning within a multi-view framework, operating under the assumption that individual modalities are, in isolation, adequate for predictions. [20, 11] delve into algorithms pivoting on information-theoretical relationships across modalities. [18] considers the specific problem of the benefit of contrastive loss in multimodal learning with a linear data-generating model. [9] studies the generalization ability of multimodal learning in estimating the latent space representation.

A recent work [13] proposes a broad-based theory on the statistical guarantee of multimodal learning. They prove that multimodal learning admits an $O(\sqrt{m})$ improvement in generalization error over unimodal learning. This is achieved by dissecting the learning of the composition of two hypotheses, where the sum of complexities of the hypotheses is markedly smaller than that of their composition. Additionally, they pinpoint connection and heterogeneity amidst modalities as the two pivotal elements propelling these statistical advantages of multimodal learning.

Empirical Multimodal Learning: Applications of multimodal learning can be traced back to the last century, aiming to combine vision and audio data to improve the performance of speech recognition [22, 14]. As the field evolved, multimodal learning carved a niche in multimedia, enhancing capabilities in indexing and search functionalities [6, 12].

Recently, there has been a trend toward applying multimodal learning in deep learning practice, including modality generation [3, 8, 16] and large-scale generalist models [17, 15]. A consistently observed empirical phenomenon is that a multimodal model is able to outperform a finely-tuned unimodal model, even on unimodal population data.

2 Setting

In this section, we delineate the setup of multimodal learning and essential background on the intersection of two half-spaces problem.

2.1 Multimodal Learning Setup

In this paper, we restrict our focus to the fundamental, yet non-trivial, scenario of two modalities for a clear exposition, adopting the setup of [13]. Formally, the multimodal learning classification framework encompasses two modalities, denoted as $\mathcal{X},\mathcal{Y}\subset\mathbb{R}^{n}$, and a label space $\mathcal{Z}=\{\pm\}$. Consequently, every data point can be represented as a tuple $(x,y,z)$.

Given a hypothesis class $\mathcal{H}$ and a training dataset $(X,Y,Z)$ with $m$ data points $(x_{i},y_{i},z_{i})$, our aim in (proper) learning from $(X,Y,Z)$ is to output a hypothesis $h\in\mathcal{H}$ that minimizes the empirical risk:

$$\ell_{emp}=\frac{\sum_{i=1}^{m}\mathbf{1}_{h(x_{i},y_{i})\neq z_{i}}}{m}.$$

When each data point $(x,y,z)$ adheres to a specific data distribution $D$ over $(\mathcal{X},\mathcal{Y},\mathcal{Z})$, the goal of (properly) PAC-learning $(X,Y,Z)$ is to output a hypothesis $h\in\mathcal{H}$, such that the population risk

$$\ell_{pop}=\mathbb{E}_{(x,y,z)\sim D}[\mathbf{1}_{h(x,y)\neq z}]$$

is small with high probability. In addition, we mandate a bijective mapping between $x,y$ for any potential data point $(x,y,z)$.

For brevity, we occasionally write $(\mathcal{X},\mathcal{Y},\mathcal{Z})$ to denote the learning problem when it is clear from the context. The unimodal learning problems $(\mathcal{X},\mathcal{Z})$ and $(\mathcal{Y},\mathcal{Z})$ are defined analogously, with the modality $y$ or $x$ masked. In learning $(\mathcal{X},\mathcal{Z})$, we are given a hypothesis class $\mathcal{H}$ and a set $(X,Z)$ of training data with $m$ data points $(x_{i},z_{i})$. The empirical risk and the population risk are defined as follows, respectively:

$$\ell_{emp}=\frac{\sum_{i=1}^{m}\mathbf{1}_{h(x_{i})\neq z_{i}}}{m},\qquad\ell_{pop}=\mathbb{E}_{(x,y,z)\sim D}[\mathbf{1}_{h(x)\neq z}].$$
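As a concrete illustration of these two risks, here is a minimal Python sketch of ours (not part of the original setup), assuming labels are encoded as $\pm 1$ and a hypothesis is any callable returning $\pm 1$:

```python
# Minimal sketch of the empirical risks above; all names are illustrative.
import numpy as np

def empirical_risk_multimodal(h, X, Y, Z):
    """Fraction of training tuples (x_i, y_i, z_i) misclassified by h(x, y)."""
    preds = np.array([h(x, y) for x, y in zip(X, Y)])
    return float(np.mean(preds != Z))

def empirical_risk_unimodal(h, X, Z):
    """The same risk when modality y is masked: h sees x alone."""
    preds = np.array([h(x) for x in X])
    return float(np.mean(preds != Z))
```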

2.2 Intersection of Two Half-spaces

In our quest to demonstrate a computational separation between multimodal and unimodal learning, we set out to design a specific learning task that is NP-hard for unimodal learning but admits an efficient multimodal solution.

A candidate for such a problem is the ‘intersection of two half-spaces’ problem, formally defined below:

Definition 1 (Intersection of two half-spaces).

An instance of $\mathbf{IntHS}$ is a set of points in $\mathbb{R}^{n}$, each labeled either ‘+’ or ‘-’. The goal is to find an intersection of two half-spaces which correctly classifies the maximum number of points, where a ‘+’ point is classified correctly if it lies inside the intersection and a ‘-’ point is classified correctly if it lies outside of it.
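To fix an encoding (an assumption of ours, not from [10]), the following sketch scores an $\mathbf{IntHS}$ candidate $H_{1}\cap H_{2}$ with $H_{i}=\{x\,|\,r_{i}^{\top}x\leq c_{i}\}$ on a labeled point set:

```python
# A point is predicted '+' iff it lies inside both half-spaces.
import numpy as np

def inths_accuracy(points, labels, r1, c1, r2, c2):
    """points: (m, n) array; labels: length-m array over {+1, -1}."""
    inside = (points @ r1 <= c1) & (points @ r2 <= c2)
    preds = np.where(inside, 1, -1)
    return float(np.mean(preds == labels))
```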

Previous work has shown that PAC-learning this intersection is inherently NP-hard, even in the realizable setting, encapsulated in the following result:

Theorem 2 ([10]).

Let $\ell$ be any fixed integer and $\epsilon>0$ be an arbitrarily small constant. Then, given a set of labeled points in $\mathbb{R}^{n}$ with a guarantee that there is an intersection of two half-spaces that classifies all the points correctly, there is no polynomial time algorithm to find a function $f$ of up to $\ell$ linear threshold functions that classifies a $\frac{1}{2}+\epsilon$ fraction of points correctly, unless NP = RP.

A slightly weaker version of the above result which will be of use is the following:

Proposition 3.

Let $\epsilon>0$ be an arbitrarily small constant. Then, given a set of labeled points in $\mathbb{R}^{n}$ with a guarantee that there is an intersection of two half-spaces that classifies all the points correctly, there is no polynomial time algorithm to find an intersection of two half-spaces that classifies a $\frac{1}{2}+\epsilon$ fraction of points correctly, unless NP = RP.

It’s clear that Proposition 3 is a direct consequence of Theorem 2, since an intersection of two half-spaces is exactly a function of $\ell$ linear threshold functions with $\ell=2$. Throughout this paper we only consider proper learning, with our hypothesis class containing only intersections of two half-spaces.

3 A Computational Separation between Multimodal and Unimodal Learning

To demonstrate the computational benefit of multimodal learning, we present an instance in which both unimodal learning problems $(\mathcal{X},\mathcal{Z})$ and $(\mathcal{Y},\mathcal{Z})$ are NP-hard, while the multimodal learning problem $(\mathcal{X},\mathcal{Y},\mathcal{Z})$ can be solved efficiently. In particular, we require the existence of a bijective mapping $f:\mathcal{X}\to\mathcal{Y}$ satisfying $y=f(x)$ for any data point $(x,y,z)\in(\mathcal{X},\mathcal{Y},\mathcal{Z})$, so that the hardness result is purely computational. The task of constructing such an instance can be decomposed into three steps:

  1. We start by setting $(\mathcal{X},\mathcal{Z})$ as an NP-hard problem, in this case an instance of $\mathbf{IntHS}$.

  2. Based on $(\mathcal{X},\mathcal{Z})$, we construct a bijective mapping between $x,y$ to obtain a new NP-hard problem $(\mathcal{Y},\mathcal{Z})$ that preserves the $\mathbf{IntHS}$ structure.

  3. The bijective mapping should be designed carefully, such that the multimodal problem $(\mathcal{X},\mathcal{Y},\mathcal{Z})$ can be solved efficiently.

Below we describe the construction of the instance and the main idea behind it. A detailed proof is provided in the next section.

Step 1: We set one of the unimodal learning problems, say $(\mathcal{X},\mathcal{Z})$, as an instance of $\mathbf{IntHS}$. We denote any $\mathbf{IntHS}$ problem by $H_{1}\cap H_{2}$ with half-spaces $H_{1},H_{2}$ in $\mathbb{R}^{n}$, where each $H_{i}=\{x\,|\,r_{i}^{\top}x\leq c_{i}\}$ is determined by the unit vector $r_{i}$ and $c_{i}\in\mathbb{R}$.

Step 2: A critical observation is that any $\mathbf{IntHS}$ problem $H_{1}\cap H_{2}$ can be transformed into a new $\mathbf{IntHS}$ problem by applying a change of coordinates, under which each $x$ is mapped to a new point while the corresponding label $z$ remains the same. Let $Q\in\mathbb{R}^{n\times n}$ be any orthogonal matrix; we obtain $\hat{H}_{1}\cap\hat{H}_{2}$, where $\hat{H}_{i}=\{x\,|\,\hat{r}_{i}^{\top}x\leq c_{i}\}$, by setting $\hat{r}_{i}=Qr_{i}$. Letting $y=Qx$, we create a new NP-hard unimodal problem $(\mathcal{Y},\mathcal{Z})$, as $Q$ defines a bijective mapping from the set of all $\mathbf{IntHS}$ problems to itself.
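The following quick numerical check (our illustration; all names are ours) verifies this observation: since $Q^{\top}Q=I$, the margins satisfy $\hat{r}^{\top}y=r^{\top}Q^{\top}Qx=r^{\top}x$, so every point keeps its label under the coordinate change.

```python
# Numerical check that an orthogonal change of coordinates preserves labels.
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 100
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # a random orthogonal matrix
r = rng.standard_normal(n)
r /= np.linalg.norm(r)                            # unit normal of a half-space
X = rng.standard_normal((m, n))                   # m data points, stored row-wise

Y = X @ Q.T                                       # y_i = Q x_i
margins_x = X @ r                                 # r^T x_i
margins_y = Y @ (Q @ r)                           # r_hat^T y_i with r_hat = Q r
assert np.allclose(margins_x, margins_y)          # same margins, hence same labels
```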

Step 3: It remains unclear how the multimodal problem $(\mathcal{X},\mathcal{Y},\mathcal{Z})$ can be easy to learn. Our strategy is to design a special $Q$ for each $H_{1}\cap H_{2}$, encoding the information of $H_{1}\cap H_{2}$ into the transformation $Q$ itself. Ideally, with $n$ linearly-independent $x_{i}$, we can recover the matrix $Q$ by basic linear algebra. With the exact values of $r_{1},r_{2}$ in hand, we get $c_{1},c_{2}$ by listing the distances from all $x$ to the hyperplanes $r_{i}^{\top}x=0$ in $O(mn^{2})$ time. The obtained classifier achieves zero loss on the training data.

However, it’s challenging to directly encode the vectors $r_{1},r_{2}$ into the $n\times n$ matrix $Q$. There are two main obstacles. First, how to encode the information of $r_{1},r_{2}$ is unclear: $Q$ is constrained to be an orthogonal matrix, a constraint that may be violated by simply filling $r_{1},r_{2}$ into $Q$. Using more complicated encoding techniques may raise other concerns, such as the existence of a closed-form representation or whether decoding can be done efficiently. Second, the quality of such an encoding is questionable: even if we find a way to encode $r_{1},r_{2}$ into $Q$, we still need to make sure $(\mathcal{Y},\mathcal{Z})$ exhausts the set of all possible $\mathbf{IntHS}$ instances. Otherwise, although each $(\mathcal{Y},\mathcal{Z})$ problem is an $\mathbf{IntHS}$ instance, the set of all possible $(\mathcal{Y},\mathcal{Z})$ problems would be merely a subset of $\mathbf{IntHS}$, preventing us from directly applying the NP-hardness result.

Fortunately, we have a very simple remedy: enlarging the dimension from $n$ to $3n$, using the first $n$ coordinates for $\mathbf{IntHS}$ and the latter $2n$ coordinates to encode the information of $\mathbf{IntHS}$. Roughly speaking, we create $2n$ null coordinates with no effect on the $\mathbf{IntHS}$ problem, while they carry the information of $\mathbf{IntHS}$, which can only be retrieved by knowing both $x,y$. In particular, for any $\mathbf{IntHS}$ problem $H_{1}\cap H_{2}$, we set $Q$ as

$$Q=\begin{pmatrix}I_{n}&0&0\\ 0&\frac{r_{1}}{\sqrt{2}}&\cdots\\ 0&\frac{r_{2}}{\sqrt{2}}&\cdots\end{pmatrix}.$$

The vectors $r_{1},r_{2}$ are simply stacked and set as the first column of the second block. Since this column has unit norm, $Q$ can easily be completed to a feasible orthogonal matrix. The identity block $I_{n}$ ensures $(\mathcal{Y},\mathcal{Z})$ exhausts the set of all possible $\mathbf{IntHS}$ instances. The main result of this paper is given by the following theorem.

Theorem 4 (Computational separation).

There exists a multimodal learning problem $(\mathcal{X},\mathcal{Y},\mathcal{Z})$ which is PAC-learnable in polynomial time, while both unimodal learning problems $(\mathcal{X},\mathcal{Z})$ and $(\mathcal{Y},\mathcal{Z})$ are NP-hard, even if there is a bijective mapping $f:\mathcal{X}\to\mathcal{Y}$ such that $y=f(x)$ for all $(x,y,z)\sim(\mathcal{X},\mathcal{Y},\mathcal{Z})$.

Theorem 4 demonstrates that multimodal learning solves some learning tasks exponentially faster than unimodal learning. Such exponential separation explains the empirical superiority of multimodal learning from the perspective of computation, supplementing the statistical guarantees in [13].

Notably, the two pivotal factors behind the statistical benefit of multimodal learning in [13], namely connection and heterogeneity, are also evident in our construction. In particular, the mapping $Q$ between $\mathcal{X},\mathcal{Y}$ is bijective, meaning there exists a perfect connection between the two modalities. On the other hand, $\mathcal{X},\mathcal{Y}$ carry different information about the problem, useless in isolation but effective when put together, indicating strong heterogeneity.

4 Proof of Theorem 4

We first introduce the necessary ingredients for the construction of the learning problem. For each pair of unit vectors $v_{1},v_{2}\in\mathbb{R}^{n}$, there exist orthogonal matrices in $\mathbb{R}^{2n\times 2n}$ whose first column is $(\frac{v_{1}}{\sqrt{2}},\frac{v_{2}}{\sqrt{2}})$, since $\|(\frac{v_{1}}{\sqrt{2}},\frac{v_{2}}{\sqrt{2}})\|_{2}=1$. In particular, for each pair $v_{1},v_{2}$ we fix one such orthogonal matrix $F$, defining a function $F(v_{1},v_{2}):\mathbb{R}^{2n}\to\mathbb{R}^{2n\times 2n}$ as below:

$$F(v_{1},v_{2})=\begin{pmatrix}\frac{v_{1}}{\sqrt{2}}&\cdots\\ \frac{v_{2}}{\sqrt{2}}&\cdots\end{pmatrix}.$$

In addition, we define an orthogonal transformation matrix $Q(v_{1},v_{2})\in\mathbb{R}^{3n\times 3n}$ as

$$Q(v_{1},v_{2})=\begin{pmatrix}I_{n}&0\\ 0&F(v_{1},v_{2})\end{pmatrix}.$$

The matrix $Q(r_{1},r_{2})$ will serve as a fingerprint of an $\mathbf{IntHS}$ problem $H_{1}\cap H_{2}$.
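One concrete way to realize $F$ (our own choice; the construction only requires that some such orthogonal matrix be fixed per pair) is to complete the unit first column to an orthonormal basis via QR factorization:

```python
# Realize F(v1, v2) and Q(v1, v2) by completing the unit vector
# (v1/sqrt(2), v2/sqrt(2)) to an orthonormal basis of R^{2n} via QR.
import numpy as np

def make_F(v1, v2, seed=0):
    n = v1.shape[0]
    u = np.concatenate([v1, v2]) / np.sqrt(2)      # unit-norm first column
    rng = np.random.default_rng(seed)
    M = np.column_stack([u, rng.standard_normal((2 * n, 2 * n - 1))])
    F, _ = np.linalg.qr(M)                         # orthonormal columns
    if F[:, 0] @ u < 0:                            # QR may negate the first column;
        F[:, 0] = -F[:, 0]                         # a sign flip keeps orthonormality
    return F

def make_Q(r1, r2):
    n = r1.shape[0]
    Q = np.zeros((3 * n, 3 * n))
    Q[:n, :n] = np.eye(n)                          # identity on the IntHS coordinates
    Q[n:, n:] = make_F(r1, r2)                     # fingerprint block
    return Q
```

Decoding then reads off $r_{1}$ and $r_{2}$ (each scaled by $\sqrt{2}$) from the first column of $F$. We also define a variant of the intersection of two half-spaces problem.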

Definition 5 (Low-dimensional intersection of two half-spaces).

An instance of $\mathbf{IntHS}_{\lambda}$ is a set of points in $\mathbb{R}^{n}$, each labeled either ‘+’ or ‘-’, in which the labels only depend on the first $\lambda n$ coordinates, where $\lambda\in(0,1)$ is a constant. The goal is to find an intersection of two half-spaces which correctly classifies the maximum number of points, where a ‘+’ point is classified correctly if it lies inside the intersection and a ‘-’ point is classified correctly if it lies outside of it.

Lemma 6.

For every constant $\lambda>0$, learning $\mathbf{IntHS}_{\lambda}$ is NP-hard.

Proof.

We prove by reduction. Suppose for contradiction that $\mathbf{IntHS}_{\lambda}$ can be learnt in polynomial time. Then for each instance of $\mathbf{IntHS}$, we can create a new instance of $\mathbf{IntHS}_{\lambda}$ with dimension $\frac{n}{\lambda}$ by extension. In particular, each point $x\in\mathbb{R}^{\frac{n}{\lambda}}$ shares the same label as $x_{[1:n]}$ in the original $\mathbf{IntHS}$ instance. As a result, any classifier for $\mathbf{IntHS}_{\lambda}$ applies to the $\mathbf{IntHS}$ problem with the same accuracy, contradicting Proposition 3. ∎
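The padding step in this reduction is mechanical; the following small sketch (our illustration, with hypothetical names) lifts an $\mathbf{IntHS}$ point set into the higher-dimensional $\mathbf{IntHS}_{\lambda}$ instance:

```python
# Lift an IntHS instance in R^n to an IntHS_lambda instance in R^{n/lambda}
# by appending coordinates that the labels ignore; labels are unchanged.
import numpy as np

def pad_instance(points, lam, seed=0):
    """points: (m, n) array; returns an (m, n/lam) array of padded points."""
    m, n = points.shape
    rng = np.random.default_rng(seed)
    extra = rng.standard_normal((m, int(round(n / lam)) - n))
    return np.hstack([points, extra])
```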

Now we are ready to state the learning problem $(\mathcal{X},\mathcal{Y},\mathcal{Z})$: $m$ data points $(x_{i},y_{i},z_{i})$ are given, where $x_{i},y_{i}\in\mathbb{R}^{3n}$ represent the two modalities and $z_{i}=\pm$ is the label. It is guaranteed that there is an intersection of two half-spaces that classifies all the points correctly, with the supports of the defining unit vectors lying in the first $n$ coordinates. In other words, it is a realizable instance of $\mathbf{IntHS}_{\frac{1}{3}}$.

In particular, there are unit vectors $r_{1},r_{2}\in\mathbb{R}^{n}$ and constants $c_{1},c_{2}\in\mathbb{R}$ (unknown to the learner), such that all pairs $(x_{i},z_{i})$ can be perfectly classified by $\hat{H}_{1}\cap\hat{H}_{2}$, where $\hat{H}_{i}=\{x\,|\,\hat{r}_{i}^{\top}x\leq c_{i}\}$ and $\hat{r}_{i}=(r_{i},\mathbf{0}_{2n})$. Meanwhile, $y_{i}=Q(r_{1},r_{2})x_{i}$ holds for all data points, and all pairs $(y_{i},z_{i})$ can be perfectly classified by $\tilde{H}_{1}\cap\tilde{H}_{2}$, where $\tilde{H}_{i}=\{x\,|\,\tilde{r}_{i}^{\top}x\leq c_{i}\}$ and $\tilde{r}_{i}=Q(r_{1},r_{2})(r_{i},\mathbf{0}_{2n})$.
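To make the construction tangible, here is a generator sketch (ours, with arbitrary illustrative thresholds; it reuses make_Q from the sketch above) for a realizable instance:

```python
# Generate a synthetic realizable instance: labels depend only on the first
# n of the 3n coordinates, and y_i = Q(r1, r2) x_i embeds the fingerprint.
import numpy as np

def make_instance(n, m, seed=1):
    rng = np.random.default_rng(seed)
    r1, r2 = rng.standard_normal(n), rng.standard_normal(n)
    r1, r2 = r1 / np.linalg.norm(r1), r2 / np.linalg.norm(r2)
    c1 = c2 = 0.5                                  # arbitrary thresholds
    X = rng.standard_normal((m, 3 * n))
    inside = (X[:, :n] @ r1 <= c1) & (X[:, :n] @ r2 <= c2)
    Z = np.where(inside, 1, -1)
    Q = make_Q(r1, r2)                             # from the previous sketch
    Y = X @ Q.T                                    # row-wise y_i = Q x_i
    return X, Y, Z, (r1, c1, r2, c2)
```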

Define the hypothesis set $\mathcal{S}$ as

$$\mathcal{S}=\{h\,|\,h(x)=\textbf{sgn}(\min(c_{1}-r_{1}^{\top}x,c_{2}-r_{2}^{\top}x)),\ c_{i}\in\mathbb{R},\ \|r_{i}\|_{2}=1\},$$

which is exactly the set of all intersections of two half-spaces. We have the following results.

Lemma 7.

Properly learning $(\mathcal{X},\mathcal{Z})$ with $\mathcal{S}$ is NP-hard.

Proof.

It is a direct consequence of Lemma 6, noticing that $(\mathcal{X},\mathcal{Z})$ is an $\mathbf{IntHS}_{\frac{1}{3}}$ instance. ∎

Lemma 8.

Properly learning $(\mathcal{Y},\mathcal{Z})$ with $\mathcal{S}$ is NP-hard.

Proof.

Although $(\mathcal{Y},\mathcal{Z})$ is also an $\mathbf{IntHS}_{\frac{1}{3}}$ instance, we still need to verify that $(\mathcal{Y},\mathcal{Z})$ exhausts all possible $\mathbf{IntHS}_{\frac{1}{3}}$ instances (otherwise we cannot apply Lemma 6; for example, all $(\mathcal{Y},\mathcal{Z})$ might realize the same $\mathbf{IntHS}_{\frac{1}{3}}$ instance). Notice that $Q$ induces a mapping from the set of $\mathbf{IntHS}_{\frac{1}{3}}$ instances to itself, and it suffices to prove that this mapping is surjective. For any $\mathbf{IntHS}_{\frac{1}{3}}$ instance $\hat{H}_{1}\cap\hat{H}_{2}$ with $\hat{H}_{i}=\{x\,|\,\hat{r}_{i}^{\top}x\leq c_{i}\}$ and $\hat{r}_{i}=(r_{i},\mathbf{0}_{2n})$, because $\hat{r}_{i}$ has support only in the first $n$ coordinates, we have $Q(r_{1},r_{2})\hat{r}_{i}=\hat{r}_{i}$, so the instance is its own image under the mapping, proving surjectivity. ∎

Lemma 9.

Assume $m\geq 3n$. Then $(\mathcal{X},\mathcal{Y},\mathcal{Z})$ is properly learnable with $\mathcal{S}$ (applied to $x$ only) in $O(mn^{2})$ time, whenever there exist $3n$ data points with linearly-independent $x_{i}$.

Proof.

Consider the simple Algorithm 1, which consists of three steps:

  1. Find a set $S$ of linearly-independent $x_{i}$ (lines 2-6).

  2. Find $Q$ by solving the linear system induced by $S$ (lines 7-8).

  3. Rank the $x_{i}$ along the directions of $r_{1},r_{2}$ to get $c_{1},c_{2}$ (lines 9-10).

Step 1 runs in $O(mn^{2})$ time: by maintaining an orthonormalized basis of the span of $S$ (incremental Gram-Schmidt), testing whether a new point is linearly independent of $S$ takes $O(n|S|)=O(n^{2})$ time per candidate. Step 2 runs in $O(n^{3})$ time, which is the complexity of solving a system of linear equations. Step 3 runs in $O(mn)$ time. Under our assumption $m\geq 3n$, the total running time is $O(mn^{2}+n^{3}+mn)=O(mn^{2})$.

We still need to verify that the found classifier $h(x)$:

$$h(x)=\textbf{sgn}(\min(c_{1}-r_{1}^{\top}x,c_{2}-r_{2}^{\top}x))$$

does classify all data points correctly. By the construction of $Q$, we know there is a classifier $h^{*}(x)$ which classifies all data points correctly and shares the same $r_{i}$ with $h(x)$:

$$h^{*}(x)=\textbf{sgn}(\min(c^{*}_{1}-r_{1}^{\top}x,c^{*}_{2}-r_{2}^{\top}x)).$$

By the choice of $c_{1},c_{2}$, we have $c_{1}\leq c^{*}_{1}$ and $c_{2}\leq c^{*}_{2}$. Denoting $h_{+}=\{x\in\mathbb{R}^{3n}\,|\,h(x)=+\}$, we have

$$(h_{+}\cap X)\subset(h^{*}_{+}\cap X)=X_{+},$$

by the fact that $h_{+}\subset h^{*}_{+}$. Meanwhile, by the construction of $h(x)$, we have $X_{+}\subset h_{+}$, and further

$$X_{+}=(X_{+}\cap X)\subset(h_{+}\cap X).$$

As a result, $X_{+}=h_{+}\cap X$, which means $h(x)$ does classify all data points correctly. ∎

Algorithm 1 Learning by decoding
1:  Input: $m$ data points $(x_{i},y_{i},z_{i})$.
2:  Set $S=\{x_{1}\}$, $t=2$.
3:  while $|S|<3n$ do
4:     If $x_{t}$ is linearly independent of $S$, add $x_{t}$ to $S$.
5:     $t=t+1$.
6:  end while
7:  Solve the linear system $Qx_{i}=y_{i}$, $\forall x_{i}\in S$.
8:  Recover $r_{1},r_{2}$ from $Q$.
9:  Let $X_{+}$ be the set of all $x_{i}$ with $z_{i}=+$.
10:  Set $c_{i}=\max_{x\in X_{+}}r_{i}^{\top}x$.
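The following is a runnable reconstruction of Algorithm 1 (variable names and the rank-based independence test are ours; the incremental Gram-Schmidt test discussed in the proof would match the $O(mn^{2})$ budget more tightly). It expects data shaped as in the make_instance sketch above.

```python
# Reconstruction of Algorithm 1: decode Q from matched (x_i, y_i) pairs,
# read off r1, r2 from its fingerprint block, then fit c1, c2.
import numpy as np

def learn_by_decoding(X, Y, Z, n):
    d = 3 * n
    # Lines 2-6: collect d linearly independent x_i (keeping matching y_i).
    S_x, S_y = [], []
    for x, y in zip(X, Y):
        if np.linalg.matrix_rank(np.array(S_x + [x])) == len(S_x) + 1:
            S_x.append(x)
            S_y.append(y)
        if len(S_x) == d:
            break
    A, B = np.array(S_x), np.array(S_y)
    # Line 7: solve Q x_i = y_i for all x_i in S, i.e. A Q^T = B.
    Q = np.linalg.solve(A, B).T
    # Line 8: decode r1, r2 from the first column of the lower-right block.
    col = Q[n:, n]
    r1_hat = np.concatenate([np.sqrt(2) * col[:n], np.zeros(2 * n)])
    r2_hat = np.concatenate([np.sqrt(2) * col[n:], np.zeros(2 * n)])
    # Lines 9-10: set c_i as the maximum of r_i^T x over the positive points.
    X_plus = X[Z == 1]
    c1, c2 = np.max(X_plus @ r1_hat), np.max(X_plus @ r2_hat)
    return r1_hat, c1, r2_hat, c2
```

On data from make_instance, the returned parameters define $h(x)=\textbf{sgn}(\min(c_{1}-\hat{r}_{1}^{\top}x,c_{2}-\hat{r}_{2}^{\top}x))$, which attains zero empirical risk, as guaranteed by Lemma 9.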

Lemma 9 concerns only learnability on the training data; to extend this result to PAC-learnability, we introduce the following definition.

Definition 10.

A data distribution $D$ on $(\mathcal{X},\mathcal{Y},\mathcal{Z})$ is called non-degenerate if

$$\mathbb{P}_{(x_{i},y_{i},z_{i})\sim D,\,i\in[3n]}\left(\exists\lambda\neq\mathbf{0}\ \text{s.t.}\ \sum_{i=1}^{3n}\lambda_{i}x_{i}=0\right)=0.$$

Most distributions whose support has non-zero measure are non-degenerate, including common uniform and Gaussian distributions. We have the following result for PAC-learnability.

Lemma 11.

Assume $m\geq 3n$ data points are sampled from a non-degenerate distribution $D$. Then $(\mathcal{X},\mathcal{Y},\mathcal{Z})$ is properly PAC-learnable with $\mathcal{S}$ (applied to $x$ only) in $O(mn^{2})$ time. In particular, with probability at least $1-\delta$, the generalization error $\epsilon$ of Algorithm 1 is upper bounded by

$$\epsilon=O\left(\sqrt{\frac{n\log m+\log\frac{1}{\delta}}{m}}\right).$$
Proof.

By the assumption that $D$ is non-degenerate, with probability 1 there exist $3n$ data points with linearly-independent $x_{i}$. By Lemma 9, the learnt classifier achieves zero loss on the training data.

From classic statistical learning theory, the generalization error of such a classifier can be characterized by the VC-dimension of the hypothesis class.

Theorem 12 ([21]).

With probability at least $1-\delta$, for every $h$ in the hypothesis class $\mathcal{H}$, if $h$ is consistent with $m$ training samples, the generalization error $\epsilon$ of $h$ is upper bounded by

$$\epsilon=O\left(\sqrt{\frac{d\log m+\log\frac{1}{\delta}}{m}}\right),$$

where $d$ denotes the VC-dimension of $\mathcal{H}$.

We only need to determine the VC-dimension of the class of intersections of two half-spaces in $\mathbb{R}^{3n}$. It is well known that the VC-dimension of a single half-space is $O(n)$. [2] shows that the $k$-fold intersection of any VC-class of dimension $d$ has VC-dimension bounded by $O(dk\log k)$. Putting $d=O(n)$ and $k=2$ concludes the proof. ∎

5 Separation in Improper Learning

As an extension of our result on proper learning, we consider whether multimodality still possesses such an exponential computational benefit when the learner is allowed to output an arbitrary hypothesis beyond the hypothesis set $\mathcal{H}$, i.e. the improper learning setting.

The general problem of learning intersections of half-spaces, defined below, is known to be hard even in the improper learning setting.

Definition 13 (Intersection of half-spaces).

An instance of $\mathbf{IntHS}(N)$ is a set of points in $\mathbb{R}^{n}$, each labeled either ‘+’ or ‘-’. The goal is to find an intersection of $N$ half-spaces which correctly classifies the maximum number of points, where a ‘+’ point is classified correctly if it lies inside the intersection and a ‘-’ point is classified correctly if it lies outside of it.

We will rely on the following hardness result for improperly learning intersections of half-spaces.

Theorem 14 ([4, 5]).

If $q(n)$ is super-constant, i.e. $\lim_{n\to\infty}q(n)=\infty$, then there is no efficient algorithm that improperly learns intersections of $q(n)$ half-spaces in $\mathbb{R}^{n}$.

Using a similar analysis to Theorem 4, we obtain the following separation result.

Theorem 15 (Improper computational separation).

There exists a multimodal learning problem $(\mathcal{X},\mathcal{Y},\mathcal{Z})$ which is PAC-learnable in polynomial time, while both unimodal learning problems $(\mathcal{X},\mathcal{Z})$ and $(\mathcal{Y},\mathcal{Z})$ are NP-hard in the improper learning setting, even if there is a bijective mapping $f:\mathcal{X}\to\mathcal{Y}$ such that $y=f(x)$ for all $(x,y,z)\sim(\mathcal{X},\mathcal{Y},\mathcal{Z})$.

Proof.

The proof is identical to that of Theorem 4 except for two minor places:

  1. We need a strengthened version of Lemma 6, with $\lambda$ being $1/\textbf{poly}(n)$.

  2. The hard instance construction and the multimodal learning algorithm are slightly modified to accommodate the new lemma.

We begin with the strengthened version of Lemma 6. The definition of $\mathbf{IntHS}(N)_{\lambda}$ follows directly from Definition 5, and we won't repeat it here.

Lemma 16.

Given any super-constant $q(n)$ and any constants $C\geq 1$ and $c>0$, improperly learning $\mathbf{IntHS}(q(n))_{\frac{1}{Cn^{c}}}$ is NP-hard.

Proof.

We prove by reduction. Suppose for contradiction that $\mathbf{IntHS}(q(n))_{\frac{1}{Cn^{c}}}$ can be learnt in polynomial time. Then for each instance of $\mathbf{IntHS}(q(n))$, we create a new instance of $\mathbf{IntHS}(q^{\prime}(Cn^{c+1}))_{\frac{1}{Cn^{c}}}$ with dimension $Cn^{c+1}$ by extension, where $q^{\prime}(Cn^{c+1})=q(n)$ is still super-constant. In particular, each point $x\in\mathbb{R}^{Cn^{c+1}}$ shares the same label as $x_{[1:n]}$ in the original $\mathbf{IntHS}(q(n))$ instance. Since any polynomial of $Cn^{c}$ is also a polynomial of $n$, we conclude that any classifier for $\mathbf{IntHS}(q(n))_{\frac{1}{Cn^{c}}}$ applies to the $\mathbf{IntHS}(q(n))$ problem with the same accuracy, contradicting Theorem 14. ∎

Specifically, we will only use Lemma 16 with $C=1$, $c=1/2$. Now we are ready to state the learning problem $(\mathcal{X},\mathcal{Y},\mathcal{Z})$: $m$ data points $(x_{i},y_{i},z_{i})$ are given, where $x_{i},y_{i}\in\mathbb{R}^{n}$ represent the two modalities and $z_{i}=\pm$ is the label. It is guaranteed that there is an intersection of $\sqrt{n}-1$ half-spaces $H_{1},H_{2},\cdots,H_{\sqrt{n}-1}$ that classifies all the points correctly, with the supports of the defining unit vectors lying in the first $\sqrt{n}$ coordinates. In other words, it is a realizable instance of $\mathbf{IntHS}(\sqrt{n}-1)_{1/\sqrt{n}}$ (with $q(x)=x-1$).

In particular, there are unit vectors $r_{1},r_{2},\cdots,r_{\sqrt{n}-1}\in\mathbb{R}^{\sqrt{n}}$ and constants $c_{1},c_{2},\cdots,c_{\sqrt{n}-1}\in\mathbb{R}$ (unknown to the learner), such that all pairs $(x_{i},z_{i})$ can be perfectly classified by $\cap_{i}\hat{H}_{i}$, where $\hat{H}_{i}=\{x\,|\,\hat{r}_{i}^{\top}x\leq c_{i}\}$ and $\hat{r}_{i}=(r_{i},\mathbf{0}_{n-\sqrt{n}})$. Similarly, we can define the $Q$ matrix as

$$Q(v_{1},v_{2},\cdots,v_{\sqrt{n}-1})=\begin{pmatrix}I_{\sqrt{n}}&0\\ 0&F(v_{1},v_{2},\cdots,v_{\sqrt{n}-1})\end{pmatrix},$$

where the function $F(v_{1},v_{2},\cdots,v_{\sqrt{n}-1}):\mathbb{R}^{n-\sqrt{n}}\to\mathbb{R}^{(n-\sqrt{n})\times(n-\sqrt{n})}$ is chosen as below:

$$F(v_{1},v_{2},\cdots,v_{\sqrt{n}-1})=\begin{pmatrix}\frac{v_{1}}{\sqrt{\sqrt{n}-1}}&\cdots\\ \frac{v_{2}}{\sqrt{\sqrt{n}-1}}&\cdots\\ \cdots&\cdots\\ \frac{v_{\sqrt{n}-1}}{\sqrt{\sqrt{n}-1}}&\cdots\end{pmatrix}.$$

Meanwhile, $y_{i}=Q(r_{1},r_{2},\cdots,r_{\sqrt{n}-1})x_{i}$ holds for all data points, and all pairs $(y_{i},z_{i})$ can be perfectly classified by $\cap_{i}\tilde{H}_{i}$, where $\tilde{H}_{i}=\{x\,|\,\tilde{r}_{i}^{\top}x\leq c_{i}\}$ and $\tilde{r}_{i}=Q(r_{1},r_{2},\cdots,r_{\sqrt{n}-1})(r_{i},\mathbf{0}_{n-\sqrt{n}})$.
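The same QR-completion trick from Section 4 (again our own realization, not prescribed by the proof) extends to $\sqrt{n}-1$ vectors; the only change is the $\frac{1}{\sqrt{\sqrt{n}-1}}$ normalization that keeps the stacked first column at unit norm:

```python
# Realize F(v_1, ..., v_k) for k = sqrt(n) - 1: stack v_i / sqrt(k) as the
# unit first column and complete it to an orthonormal basis via QR.
import numpy as np

def make_F_multi(vs, seed=0):
    k = len(vs)
    u = np.concatenate(vs) / np.sqrt(k)            # norm^2 = k * (1/k) = 1
    d = u.shape[0]                                 # d = n - sqrt(n)
    rng = np.random.default_rng(seed)
    M = np.column_stack([u, rng.standard_normal((d, d - 1))])
    F, _ = np.linalg.qr(M)                         # orthonormal columns
    if F[:, 0] @ u < 0:                            # fix a possible sign flip
        F[:, 0] = -F[:, 0]
    return F
```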

Via the same argument as in Theorem 4, according to Lemma 16, both improperly learning $(\mathcal{X},\mathcal{Z})$ and improperly learning $(\mathcal{Y},\mathcal{Z})$ are hard.

We only need to show that $(\mathcal{X},\mathcal{Y},\mathcal{Z})$ can be learnt efficiently. The same algorithm is applied to decode all the $r_{i}$, and each $c_{i}$ is set as $\max_{x\in X_{+}}r_{i}^{\top}x$. The classifier we use is still

$$h(x)=\textbf{sgn}(\min(c_{1}-r_{1}^{\top}x,c_{2}-r_{2}^{\top}x,\cdots,c_{\sqrt{n}-1}-r_{\sqrt{n}-1}^{\top}x)).$$

By the construction of $Q$, we know there is a classifier $h^{*}(x)$ which classifies all data points correctly and shares the same $r_{i}$ with $h(x)$:

$$h^{*}(x)=\textbf{sgn}(\min(c^{*}_{1}-r_{1}^{\top}x,c^{*}_{2}-r_{2}^{\top}x,\cdots,c^{*}_{\sqrt{n}-1}-r_{\sqrt{n}-1}^{\top}x)).$$

By the choice of $c_{i}$, we have $c_{i}\leq c^{*}_{i}$ for all $i$. Denoting $h_{+}=\{x\in\mathbb{R}^{n}\,|\,h(x)=+\}$, we have

$$(h_{+}\cap X)\subset(h^{*}_{+}\cap X)=X_{+},$$

by the fact that $h_{+}\subset h^{*}_{+}$. Meanwhile, by the construction of $h(x)$, we have $X_{+}\subset h_{+}$, and further

$$X_{+}=(X_{+}\cap X)\subset(h_{+}\cap X).$$

As a result, $X_{+}=h_{+}\cap X$, which means $h(x)$ does classify all data points correctly. The rest of the proof on PAC learning follows easily from Theorem 4, and we omit it here. ∎

6 Conclusion

In this paper, we take a preliminary step towards unraveling the computational benefit of multimodal learning. We demonstrate an exponential separation in computation between multimodal and unimodal learning by constructing a variant of the intersection of two half-spaces problem, which is NP-hard for any unimodal algorithm but can be efficiently solved by a multimodal algorithm. Complementing the statistical merits of multimodal learning as shown in [13], our result provides a more comprehensive theoretical understanding of the power of multimodal learning.

The main limitation of this work, in our opinion, is the contrived nature of the argument that multimodal learning is tractable: the efficient learning scheme provided in this work only succeeds on a narrow, intricately designed class of problem instances. These results alone are not enough to suggest that the computational benefits of multimodal learning are a more general phenomenon.

We conclude with two questions as future directions to improve the preliminary results presented in this work.

  1. Can we show such a separation in computation for more natural learning problems? Ideally, a good efficient learning algorithm for the multimodal setting should have less dependence on the problem structure, such as ERM.

  2. Can we obtain a general sufficient condition for the computational benefit of multimodal learning? Even a polynomial improvement would be interesting.

References

  • [1] Massih R Amini, Nicolas Usunier, and Cyril Goutte. Learning from multiple partially observed views-an application to multilingual text categorization. Advances in neural information processing systems, 22, 2009.
  • [2] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
  • [3] Angel Chang, Will Monroe, Manolis Savva, Christopher Potts, and Christopher D Manning. Text to 3d scene generation with rich lexical grounding. arXiv preprint arXiv:1505.06289, 2015.
  • [4] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexity to improper learning complexity. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 441–448, 2014.
  • [5] Amit Daniely and Gal Vardi. From local pseudorandom generators to hardness of learning. In Conference on Learning Theory, pages 1358–1394. PMLR, 2021.
  • [6] Georgios Evangelopoulos, Athanasia Zlatintsi, Alexandros Potamianos, Petros Maragos, Konstantinos Rapantzikos, Georgios Skoumas, and Yannis Avrithis. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Transactions on Multimedia, 15(7):1553–1568, 2013.
  • [7] Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust representations via multi-view information bottleneck. arXiv preprint arXiv:2002.07017, 2020.
  • [8] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013.
  • [9] Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, and Longbo Huang. What makes multi-modal learning better than single (provably). Advances in Neural Information Processing Systems, 34:10944–10956, 2021.
  • [10] Subhash Khot and Rishi Saket. On hardness of learning intersection of two halfspaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 345–354, 2008.
  • [11] Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard Chen, Zihao Deng, Faisal Mahmood, Ruslan Salakhutdinov, and Louis-Philippe Morency. Quantifying & modeling feature interactions: An information decomposition framework. arXiv preprint arXiv:2302.12247, 2023.
  • [12] Rainer W Lienhart. Comparison of automatic shot boundary detection algorithms. In Storage and retrieval for image and video databases VII, volume 3656, pages 290–301. SPIE, 1998.
  • [13] Zhou Lu. A theory of multimodal learning. arXiv preprint arXiv:2309.12458, 2023.
  • [14] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 689–696, 2011.
  • [15] OpenAI. Gpt-4 technical report. arXiv, 2023.
  • [16] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International conference on machine learning, pages 1060–1069. PMLR, 2016.
  • [17] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
  • [18] Yunwei Ren and Yuanzhi Li. On the importance of contrastive loss in multimodal learning. arXiv preprint arXiv:2304.03717, 2023.
  • [19] Karthik Sridharan and Sham M Kakade. An information theoretic framework for multi-view learning. 2008.
  • [20] Xinwei Sun, Yilun Xu, Peng Cao, Yuqing Kong, Lingjing Hu, Shanghang Zhang, and Yizhou Wang. Tcgm: An information-theoretic framework for semi-supervised multi-modality learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 171–188. Springer, 2020.
  • [21] Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity: festschrift for alexey chervonenkis, pages 11–30. Springer, 2015.
  • [22] Ben P Yuhas, Moise H Goldstein, and Terrence J Sejnowski. Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine, 27(11):65–71, 1989.
  • [23] Changqing Zhang, Zongbo Han, Huazhu Fu, Joey Tianyi Zhou, Qinghua Hu, et al. Cpm-nets: Cross partial multi-view networks. Advances in Neural Information Processing Systems, 32, 2019.