Multimodal Federated Learning with Missing Modality
via Prototype Mask and Contrast
Abstract
In real-world scenarios, the challenge of random modality missing in multimodal federated learning (mFL) hampers the development of federated frameworks and significantly diminishes model inference accuracy. However, existing mFL methods are predominantly limited to specific scenarios with either unimodal clients or complete multimodal clients. They typically create modality-specific encoders on clients and train modality fusion modules on servers, suffering from severe task drift between clients and servers and struggling to generalize effectively in intricate modality-missing scenarios. In this paper, we introduce a prototype library into the FedAvg-based Federated Learning framework, thereby enabling mFL to alleviate the task drift and performance degradation resulting from modality missing during both training and inference. The proposed method utilizes prototypes as masks representing missing modalities to formulate a task-calibrated training loss and a model-agnostic uni-modality inference strategy. In addition, a proximal term based on prototypes is constructed to enhance local training. Extensive experiments demonstrate the state-of-the-art performance of our approach across a series of missingness settings. Code is available at https://github.com/BaoGuangYin/PmcmFL.
1 Introduction
Multimodal pre-trained models have exhibited superior performance in various downstream tasks (Wang et al., 2023b; Yu et al., 2022; Chen et al., 2023b), with the availability of large-scale data being a major contributing factor. However, collecting large-scale data in practical applications may result in privacy leakage. An alternative approach is to adopt a decentralized machine learning paradigm, such as federated learning (FL). In FL (McMahan et al., 2017), distributed clients collaborate to train a global model without sharing their private datasets. However, FL faces practical challenges when dealing with complex real-world multimodal data. One such challenge is the widely existing issue of missing modalities. The presence of missing modalities on each client constrains local training, thereby leading to a significant drop in global model inference accuracy. Furthermore, the non-independent and identically distributed (non-IID) nature of data among clients (Zhu et al., 2021; Mendieta et al., 2022; Li et al., 2020; Huang et al., 2022; Qu et al., 2022; Li et al., 2021) exacerbates the challenges of modality missing, highlighting the urgent need to address this issue in practical multimodal data.


Some pioneer works (Yu et al., 2023; Zhao et al., 2022; Chen & Zhang, 2022; Feng et al., 2023) attempt to carry out multimodal federated learning with missing modalities. These approaches employ modality-specific encoders for each modality, such as visual encoders for images and language encoders for text, and train them using unimodal or self-supervised training tasks. These local encoders are then either aggregated into a global encoder (Zhao et al., 2022), or the global encoder is trained by aligning the local encoders through knowledge distillation (KD) on public data (Yu et al., 2023). To perform downstream tasks that require modality fusion, the server needs additional training data to train a modality fusion module and a downstream task head. These previous works focused on simple scenarios of modality missing, consisting of unimodal clients and modality-complete multimodal clients (as shown in Figure 1(a)). However, it is common for each client to have partial data with missing modalities (Feng et al., 2023), leading to more intricate practical scenarios. An example is observed in social media platforms where users generate three types of data: images only, text only, and image-text pairs.
Facing such intricate scenarios of modality missing, the existing multimodal FL frameworks encounter the following issues: 1) Clients cannot obtain fused representations due to only modality-specific encoders being deployed. 2) Severe task drift occurs between clients and server, e.g., image/text classification on clients but visual question answering (VQA) on the server, resulting in inconsistent optimization directions. 3) Lack of any strategy for addressing modality missing. 4) The availability of the server’s downstream task datasets has also been controversial in FL. These motivate us to propose a new general multimodal FL framework for real-world modality missing and fused representations learning.
In this paper, we aim to empower the FL framework with the ability to handle intricate modality-missing scenarios during both training and inference. Moreover, we develop modality-interactive encoders and fusion modules for each client to avoid the necessity of training data on the server (cf. Figure 1(b)). To this end, we need to address three challenges: CH1. How to handle complex missing patterns for interactive encoder? CH2. How to perform representation fusion when data is non-IID and modalities are missing? CH3. How to ensure the performance of the global model when there are missing modalities during inference?
Specifically, we innovatively utilize the flexible Multiway Transformers (Wang et al., 2023b; Peng et al., 2022; Bao et al., 2022) to construct our multimodal encoders. By selecting different modality expert networks, Multiway Transformers can serve as both modality-specific encoders and interactive encoders, thereby adapting to complex patterns of modality missing (CH1). Inspired by prototype learning in FL (Liao et al., 2023; Lyu et al., 2023; Huang et al., 2023; Dai et al., 2023; Xu et al., 2023; Chen et al., 2023a), we propose a novel prototype-based multimodal FL framework, termed PmcmFL (Prototype Mask and Contrast for Multimodal FL), to achieve fused representation learning with non-IID data and missing modalities (CH2). In PmcmFL, we construct and maintain a prototype library. During training, the prototypes act as global prior knowledge of the missing modality to compensate for cross-modal fusion and calibrate task drift. The prototype library also introduces a proximal term based on prototypes to reduce heterogeneity among clients. Furthermore, we utilize prototypes for inference with missing modalities (CH3), where various matching algorithms are elaborately introduced to identify the prototype with the closest semantics.
Our experiments show that PmcmFL achieves state-of-the-art performance. Notably, it brings 0.2-3.7% accuracy improvements across different modality missing rates. Regarding inference with missing modalities, it achieves a remarkable 23.8% accuracy improvement.
The main contributions are summarized below:
- Our work stands out as the first attempt to empower our FL framework with the capability to alleviate the global model performance degradation resulting from modality missing during both training and inference.
- We propose PmcmFL to achieve task-calibrated training and higher inference accuracy when dealing with modality-incomplete data.
- To the best of our knowledge, we are the first to adopt the Multiway Transformer as a versatile encoder to address the complex patterns of modality missing.
2 Related Work

2.1 Multimodal Learning
Multimodal learning has attracted increasing attention from the research community. The model with dual-encoder architecture (Radford et al., 2021) uses separate encoders for each modality, with shallow modality interaction. On the contrary, the models with interactive-encoder architecture (Wang et al., 2023b; Kim et al., 2021) process input from different modalities and concentrate on modeling modality interactions via fuse tokens without modality-specific encoders.
In order to overcome the missing modality issue in multimodal learning, many methods have been developed. Ma et al. (Ma et al., 2021) propose the SMIL to train a feature reconstruction network using a meta-learning algorithm. Zhao et al. (Zhao et al., 2021) propose the MMIN to learn robust multimodal representations by training cascade residual autoencoders. Ma et al. (Ma et al., 2022) enhance the robustness of Transformer through multi-task learning and optimal fusion strategy search. Wang et al. (Wang et al., 2023a) propose the ShaSpec to generate features of missing modalities using a shared encoder. These methods have introduced additional branches and complex training algorithms to the model, which are unsuitable for FL.
2.2 Federated Learning with Prototype
A prototype refers to the centroid of the instances belonging to the same class (Snell et al., 2017). Due to their scalability, prototypes are widely used to solve various problems in FL (Liao et al., 2023; Lyu et al., 2023; Liu et al., 2023; Li et al., 2023). Tan et al. (Tan et al., 2022) propose FedProto to conduct FL using prototypes rather than aggregation. Huang et al. (Huang et al., 2023) utilize prototypes to solve domain shift in FL. Dai et al. (Dai et al., 2023) utilize prototypes to alleviate the performance decline caused by heterogeneous data in FL. Some other FL frameworks (Xu et al., 2023; Chen et al., 2023a), while not explicitly mentioning prototypes, still leverage the concept of prototypes to address corresponding issues. In this paper, we primarily utilize prototypes to address modality missing issues.
2.3 Multimodal Federated Learning
Combining multimodal learning with federated learning is a novel research problem, with the key challenge being how to perform modality interactions when there are missing modalities for each client. Zhao et al. (Zhao et al., 2022) propose FedIoT to train autoencoders for each modality on every client and give more aggregation weight to multimodal clients. Chen et al. (Chen & Zhang, 2022) propose FedMSplit to construct a dynamic graph for adaptively selecting multimodal client models where some modalities may be missing. Yu et al. (Yu et al., 2023) propose CreamFL to conduct cross-modal interaction using inter-modal and intra-modal contrast. However, these studies only considered scenarios involving either unimodal or modality-complete multimodal clients. Feng et al. (Feng et al., 2023) propose FedMultimodal as the first attempt to consider scenarios where modalities of partial data may be missing on each client. They address this by masking the missing modalities using zero tensors. In this paper, we consider scenarios similar to FedMultimodal, with a focus on an efficient and versatile method for addressing modality missing.
3 Methodology
3.1 Overview
We consider a federated learning setting including multimodal clients. Without loss of generality, we take images and text as instances of multimodal data. In Appendix A, we analyze how to generalize our framework to other multimodal tasks. We decouple the model architecture into a representation encoder $E$, a deep fusion layer $F$, and a task head $H$. In each image-text pair, the image or the text will be missing with a probability of $\eta$. In this case, all the images can be denoted as $\mathcal{D}^{I}=\{(x^{I}_{k,i}, y_{k,i})\}$, where $x^{I}_{k,i}$ is the $i$-th image on the $k$-th client and $y_{k,i}$ is its corresponding label. Similarly, all the text can be denoted as $\mathcal{D}^{T}=\{(x^{T}_{k,i}, y_{k,i})\}$, and all the image-text pairs can be denoted as $\mathcal{D}^{IT}=\{(x^{I}_{k,i}, x^{T}_{k,i}, y_{k,i})\}$. Furthermore, each client develops a local model $(E_{k}, F_{k}, H_{k})$ with the same architecture as the global model.
As shown in Figure 2, PmcmFL constructs or updates the prototype library on the server (operation ①) at the beginning of each federated communication round. Subsequently, the prototype library and the global model are broadcast to all participants. Each participating client then performs local training and updates its local prototypes. Using prototypes as masks of missing modalities, a task-calibrated training loss supervises the training (operation ②). Additionally, to tackle the non-IID issue and the resulting client drift, a prototype-based contrastive loss is proposed (operation ④). Finally, all participating clients transmit their local models and local prototypes back to the server, after which the global model is updated by aggregating those local models. During inference, the prototype library also assists inference with missing modalities by transmitting prototypes to the global model (operation ③). The complete algorithm is in Appendix B.
For local training, we use the Multiway Transformer as the encoder $E$. This encoder (Bao et al., 2022; Peng et al., 2022; Wang et al., 2023b) handles different missing patterns by switching among various encoding modes. When only image or text input exists, the multiway encoder functions as a modality-specific encoder, like ViT (Dosovitskiy et al., 2021) or BERT (Devlin et al., 2019). When inputting an image-text pair, the multiway encoder functions as an interactive encoder, enabling the representation of one modality to fuse information from the other modality. More details on the Multiway Transformer are in Appendix C.
3.2 Constructing Prototype Library
Prototypes compact the semantics of data within the same class. Considering communication overhead in FL, low-dimension representations are used to construct the prototype library. As for the compacting strategy, we naively use the class centroids as the prototypes.
Selecting Hidden Representations. For an image-text pair, we denote $Z \in \mathbb{R}^{N \times d}$ as the output of the Multiway Transformer encoder, where $N$ is the number of output tokens and $d$ is the dimension of each token. The image CLS token $z^{I}$ and the text CLS token $z^{T}$ are extracted from $Z$. We select the two CLS tokens along with the fused representation, i.e., $h = F(z^{I}, z^{T})$, for constructing prototypes.
Construction and Update. In the prototype library, we construct and maintain three types of prototypes: image prototypes $P^{I}$, text prototypes $P^{T}$, and fusion prototypes $P^{F}$. All prototypes are initialized with zero tensors or random tensors. Subsequently, after local training in each federated communication round, all participating clients construct local prototypes by computing class centroids. Finally, the global prototypes are aggregated from the local prototypes with the numbers of client samples as weights. The construction process is formalized as follows:
$$p^{m}_{k,c} = \frac{1}{|\mathcal{D}_{k,c}|} \sum_{(x,\,y) \in \mathcal{D}_{k,c}} z^{m}(x), \quad m \in \{I, T, F\}, \qquad (1)$$
$$\alpha_{k,c} = \frac{|\mathcal{D}_{k,c}|}{|\mathcal{D}_{c}|}, \qquad (2)$$
$$P^{m}_{c} = \sum_{k \in \mathcal{S}} \alpha_{k,c}\, p^{m}_{k,c}, \qquad (3)$$
where $z^{m}(x)$ denotes the modality-$m$ representation of sample $x$, $\mathcal{D}_{k,c}$ denotes the data of class $c$ on the $k$-th client, $\mathcal{S}$ denotes the set of participating clients, and $\mathcal{D}_{c}$ denotes all data of class $c$ across participating clients. For each $m \in \{I, T, F\}$, $p^{m}_{k,c}$ denotes the local prototypes and $P^{m}_{c}$ denotes the global prototypes.
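As a concrete illustration, the following is a minimal PyTorch sketch of prototype construction and aggregation; the function names and the assumption that per-client representations are precomputed are ours, not taken from the released code.

```python
import torch

def local_prototypes(reps: torch.Tensor, labels: torch.Tensor, num_classes: int):
    """Class centroids of one client's representations (Eq. (1)).

    reps:   [n, d] image, text, or fused representations
    labels: [n]    class indices
    Returns prototypes [num_classes, d] and per-class sample counts [num_classes].
    """
    d = reps.size(1)
    protos = torch.zeros(num_classes, d)
    counts = torch.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        counts[c] = mask.sum()
        if counts[c] > 0:
            protos[c] = reps[mask].mean(dim=0)
    return protos, counts

def aggregate_prototypes(local_protos, local_counts):
    """Sample-count-weighted aggregation into global prototypes (Eqs. (2)-(3))."""
    counts = torch.stack(local_counts)                    # [K, C]
    protos = torch.stack(local_protos)                    # [K, C, d]
    weights = counts / counts.sum(dim=0).clamp(min=1)     # per-class client weights
    return (weights.unsqueeze(-1) * protos).sum(dim=0)    # [C, d]
```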
3.3 Prototypes as Masks of Missing Modalities
When missing modalities occur, previous work (Feng et al., 2023) masks missing data with zero tensors. In PmcmFL, prototypes are employed as masks for missing modality representations, thereby incorporating global prior knowledge. Based on this, a task-calibrated training loss and a model-agnostic unimodal inference strategy are proposed.
Task-Calibrated Training Loss. As shown in Figure 2, during local training with complete image-text pairs, the training loss is defined:
$$\mathcal{L}_{task} = \ell\big(H(F(z^{I}, z^{T})),\ y\big), \qquad (4)$$
where $\ell$ denotes the task loss function. For classification tasks, cross-entropy is commonly used. In the case of local training with the text missing, a task-calibrated training loss is constructed using the text prototypes as masks:
$$\mathcal{L}^{T\text{-}miss}_{task} = \ell\big(H(F(z^{I}, P^{T}_{y})),\ y\big), \qquad (5)$$
where the text prototype $P^{T}_{y}$ has the same class as the label $y$. Similarly, the task-calibrated training loss is defined when images are missing:
$$\mathcal{L}^{I\text{-}miss}_{task} = \ell\big(H(F(P^{I}_{y}, z^{T})),\ y\big), \qquad (6)$$
With supervision from the task-calibrated loss, the entire model is trained on the original multimodal task, even when there are missing modalities on clients. In contrast, using zero masks results in a task-drifted loss. For example, in the case where the text is missing, the training loss with zero masks is $\ell\big(H(F(z^{I}, \mathbf{0})),\ y\big)$. Essentially, this represents a unimodal training task that depicts the relationship between the unimodal representation $z^{I}$ and the ground truth $y$, which leads to inconsistent optimization directions for model training.
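A minimal sketch of the task-calibrated loss for a text-missing batch, assuming the encoder returns image CLS representations and that `fusion` and `head` follow the decomposition above; all names are illustrative.

```python
import torch.nn.functional as F

def text_missing_loss(encoder, fusion, head, images, labels, text_protos):
    """Sketch of Eq. (5): the text prototype of the ground-truth class
    stands in for the missing text representation."""
    z_img = encoder(images)            # [B, d] image CLS representations
    z_txt = text_protos[labels]        # [B, d] prototypes used as masks
    logits = head(fusion(z_img, z_txt))
    return F.cross_entropy(logits, labels)
```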
Model-Agnostic Unimodal Inference. During inference with missing modalities, prototypes are employed as masks to assist the global model. Unlike training, where labels are available, we must find the corresponding cross-modal prototypes according to the embedded representations. To this end, a matching function $\mathcal{M}$ is established from hidden representations to prototypes. We use an example to describe the matching process of cross-modal prototypes. Without loss of generality, assume that the model only utilizes images for inference. First, the image is input for computing the image representation $z^{I}$. Subsequently, the image prototype $P^{I}_{c}$ of class $c$ is determined using the matching function $\mathcal{M}$. Following this, the corresponding text prototype $P^{T}_{c}$ is determined according to class $c$. Finally, the image representation $z^{I}$ and the text prototype $P^{T}_{c}$ are input into the subsequent modules.
As for the matching function $\mathcal{M}$, we propose both model-free and model-based matching. Model-free matching matches the closest prototype for each representation using a distance metric, such as the L1 or L2 distance, or cosine similarity. Model-based matching involves training tiny models, treating the search for cross-modal prototypes as either a classification or a retrieval task. These tiny models are trained only in the final round of federated communication, incurring negligible overhead to the FL framework. For the classification task, we train a shallow MLP-Classifier using cross-entropy loss. For the retrieval task, inspired by DALLE2 (Ramesh et al., 2022), we sequentially use the CLIP loss (Radford et al., 2021) and the MSE loss to train an MLP-Prior to achieve transfer and retrieval. We attempt three aggregation strategies for tiny models from different clients: selecting the model with the maximum number of client samples, model aggregation, and model ensemble.
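For intuition, a sketch of model-free matching for image-only inference follows; the cross-modal lookup simply uses the matched class index as the association, and all function names are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def image_only_inference(encoder, fusion, head, image, img_protos, txt_protos):
    """Model-free matching: find the nearest image prototype by cosine
    similarity, then fetch the text prototype of the same class."""
    z_img = encoder(image)                                   # [1, d]
    sims = F.cosine_similarity(z_img, img_protos, dim=-1)    # [C] match scores
    cls = sims.argmax().item()                               # matched class
    z_txt = txt_protos[cls].unsqueeze(0)                     # cross-modal prototype
    return head(fusion(z_img, z_txt)).argmax(dim=-1)         # predicted label
```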
Inspired by MixUp (Zhang et al., 2018), we propose ProtoMix to enrich the semantic information within prototypes used for inference. ProtoMix linearly combines the top-$k$ related prototypes with weights derived from applying the Softmax function to distance metrics, classification probabilities, or retrieval match scores.
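A minimal sketch of ProtoMix, assuming higher scores indicate better matches; the temperature parameter is illustrative.

```python
import torch

def proto_mix(scores, protos, k=10, temperature=1.0):
    """ProtoMix: softmax-weighted combination of the top-k matched prototypes.

    scores: [C] distance/classification/retrieval match scores (higher = better)
    protos: [C, d] candidate prototypes of the missing modality
    """
    top_scores, top_idx = scores.topk(k)
    weights = torch.softmax(top_scores / temperature, dim=0)      # [k]
    return (weights.unsqueeze(-1) * protos[top_idx]).sum(dim=0)   # [d] mixed prototype
```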
3.4 Prototypes as Learned Representation Targets
Due to the non-IID issue, the learned representation distributions in latent space significantly differ among clients (i.e., client drift), leading to an uncoordinated model aggregation. To achieve model aggregation without conflicts, all clients must theoretically have similar local representation distributions. In PmcmFL, prototypes are used as learned representation targets to guide clients in learning fused representation distributions clustered by class. For this reason, we introduce a proximal term based on the unidirectional CLIP contrastive loss to regularize client training.
Sampling a batch $\mathcal{B}$ from the local dataset, the proximal term can be denoted as:
$$\mathcal{L}_{proto} = -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \log \frac{\exp\big(\mathrm{sim}(h_{i}, P^{F}_{y_{i}})/\tau\big)}{\sum_{c=1}^{C} \exp\big(\mathrm{sim}(h_{i}, P^{F}_{c})/\tau\big)}, \qquad (7)$$
where $h_{i}$ is the $i$-th fused representation in the batch, $P^{F}_{y_{i}}$ is its corresponding fusion prototype, $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity (cosine similarity, following CLIP), $C$ is the number of classes, and $\tau$ is a temperature factor. As the global model gradually converges, the prototypes converge step by step. This proximal term works by pulling representations closer to the prototypes of the same class and pushing representations away from prototypes of different classes, so that the learned representations stay around the corresponding global prototypes.
Since prototypes are the centroids of class representations, maintaining a certain distance (i.e., fine-grained semantic difference) from each representation, we use CLIP loss to constrain their similarities rather than relying on MSE to constrain their distances. Although the prototype-based proximal term can theoretically be constructed on image representations, text representations, and fused representations separately, we only utilize the last one due to the difficulty in coordinating multiple proximal terms. In this case, the total training loss for each client is the sum of task loss and contrastive loss of fused representations:
$$\mathcal{L} = \mathcal{L}_{task} + \lambda\, \mathcal{L}_{proto}, \qquad (8)$$
where $\lambda$ is the weighting hyperparameter for $\mathcal{L}_{proto}$.
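A minimal sketch of the prototype contrastive term and the total client objective, matching Eqs. (7)-(8) as reconstructed above; the defaults for tau and lam follow the hyperparameters reported in the experiment section.

```python
import torch
import torch.nn.functional as F

def proto_contrast_loss(fused, labels, fusion_protos, tau=0.07):
    """Eq. (7): pull each fused representation toward the global fusion
    prototype of its class and push it away from the other prototypes."""
    sims = F.cosine_similarity(fused.unsqueeze(1),           # [B, 1, d]
                               fusion_protos.unsqueeze(0),   # [1, C, d]
                               dim=-1)                       # [B, C]
    return F.cross_entropy(sims / tau, labels)

def client_loss(task_loss, fused, labels, fusion_protos, lam=1.0):
    """Eq. (8): task loss plus the weighted prototype contrastive term."""
    return task_loss + lam * proto_contrast_loss(fused, labels, fusion_protos)
```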
4 Experiment
Table 1: Accuracy (%) on 5K testing samples under different training modality missing rates $\eta$.

| Methods | $\eta$=10% | $\eta$=20% | $\eta$=30% | $\eta$=40% | $\eta$=50% | Acc@sum |
|---|---|---|---|---|---|---|
| FedAvg + ignore | 56.444 | 53.936 | 49.990 | 45.024 | 38.898 | 244.292 |
| FedMultimodal | 55.360 | 52.804 | 49.246 | 46.378 | 41.034 | 244.822 |
| FedMultimodal + RM | 55.624 | 52.542 | 48.164 | 47.288 | 41.346 | 244.964 |
| FedProx + Mask | 52.506 | 49.546 | 46.502 | 46.862 | 41.140 | 236.556 |
| FedIoT + Mask | 55.010 | 52.180 | 47.558 | 47.582 | 41.838 | 244.168 |
| FedPAC + Mask | 55.028 | 50.742 | 46.916 | 45.930 | 40.242 | 238.858 |
| FedHKD + Mask | 56.286 | 53.546 | 50.544 | 47.832 | 43.376 | 251.584 |
| PmmFL (Ours) + FA | 56.144 | 54.366 | 50.556 | 48.796 | 46.168 | 256.030 |
| PmmFL (Ours) + HKD | 56.458 | 53.926 | 51.568 | 49.472 | 46.394 | 257.818 |
| PmcmFL (Ours) | 56.636* | 55.478* | 52.256* | 49.548* | 47.042* | 260.960* |
4.1 Experiment Setup
Dataset. Following CreamFL (Yu et al., 2023), we utilize the challenging VQAv2 dataset (Goyal et al., 2017) to evaluate our PmcmFL. For the efficiency of the experiments, we construct a tiny VQA task, one-tenth the scale of the original VQA task. The tiny VQA is a classification task of 310 classes, with 64,000 training samples and 5,000 testing samples. The training data is distributed to 30 clients, employing the Dirichlet distribution ($\alpha = 0.1$) for non-IID data partition (Hsu et al., 2019). Following suggestions from previous work (Feng et al., 2023), we set an equal training missing rate $\eta$ for each modality. More details on the dataset are in Appendix D.
Model. For the encoder $E$, we utilize the same architecture as BEIT3-base (Wang et al., 2023b) along with its pre-trained parameters. For the deep fusion layer $F$, the image CLS token and the text CLS token are concatenated and projected through an MLP. For the task head $H$, the fused representation is projected to logits using another MLP. More details on the model architecture are in Appendix E.
Baseline. We compare our PmcmFL with the most relevant existing FL frameworks: 1) FedAvg+Ignore, where training is conducted only on modality-complete data, ignoring data with missing modalities; 2) FedMultimodal (Feng et al., 2023), essentially an extension of FedAvg that uses a Zero Mask for missing modalities; 3) FedMultimodal+RM, which replaces the Zero Mask with a Gaussian Random Mask; 4) FedIoT+Mask, a FedAvg-style framework where a Mask is applied to missing modalities, using the aggregation strategy designed for multimodal FL in FedIoT (Zhao et al., 2022); 5) FedPAC+Mask, a framework based on FedAvg and a Mask for missing modalities, using the prototype-based feature alignment (FA) in FedPAC (Xu et al., 2023); 6) FedHKD+Mask, a framework based on FedAvg and a Mask for missing modalities, using the prototype-based hyper knowledge distillation (HKD) in FedHKD (Chen et al., 2023a). To demonstrate the effectiveness of our Prototype Contrast, we ablate it to obtain the variant PmmFL (PmcmFL without Contrast) and introduce the following two baselines: 7) PmmFL+FA and 8) PmmFL+HKD, which replace Prototype Contrast with Feature Alignment (FA) and Hyper Knowledge Distillation (HKD), respectively. We further compare with 9) CreamFL in simple modality-missing scenarios. More details on baselines are in Appendix F.
Implementation. To enable fair comparisons, our experiments remain consistent on method-agnostic hyperparameters, including federated communication rounds, local training epochs, client selection rate, learning rate, optimizer parameters, and batch size. For hyperparameters introduced by PmcmFL, to prevent them from overfitting to specific tasks, we only conduct a simple parameter search: only the weight $\lambda$ is searched within $\{5.0, 1.0, 0.5, 0.1, 0.01\}$ to find the optimal value. More implementation details for all experiments can be found in Appendix G.
Evaluation Metric. We fix the random seed and utilize optimal accuracy in the last ten communication rounds as the final performance. Note that all evaluations are conducted with modality-complete testing samples, except in experiments on inference with missing modalities (Sec. 4.4). Appendix H provides accuracy under varying degrees of modality missing during the inference phase.
Table 2: Ablation on Prototype Contrast (PC) and Prototype Mask (PM): accuracy (%) on 5K testing samples under different training modality missing rates $\eta$.

| PC | PM | $\eta$=10% | $\eta$=20% | $\eta$=30% | $\eta$=40% | $\eta$=50% | Acc@sum |
|---|---|---|---|---|---|---|---|
|  |  | 55.624 | 52.542 | 48.164 | 47.288 | 41.346 | 244.964 |
| ✓ |  | 55.948 | 53.588 | 50.982 | 47.422 | 44.454 | 252.394 |
|  | ✓ | 55.326 | 54.486 | 50.858 | 49.290 | 45.848 | 255.808 |
| ✓ | ✓ | 56.636 | 55.478 | 52.256 | 49.548 | 47.042 | 260.960 |
Table 3: Accuracy (%) on unimodal (image-only) testing data with different masking/matching strategies and ProtoMix settings.

| Methods | Mix-1 | Mix-10 | Mix-20 | Mix-best |
|---|---|---|---|---|
| Zero Mask | 0.996 | 0.996 | 0.996 | 0.996 |
| Random Mask | 3.662 | 3.662 | 3.662 | 3.662 |
| L1 Metric | 6.358 | 6.170 | 6.784 | 6.784 |
| L2 Metric | 5.454 | 5.238 | 5.782 | 6.106 |
| COS Metric | 5.524 | 6.062 | 6.676 | 6.938 |
| Classifier (max.) | 26.956 | 27.022 | 27.022 | 27.022 |
| Classifier (ens.) | 27.090 | 27.090 | 27.090 | 27.090 |
| Classifier (avg.) | 27.502 | 27.092 | 27.092 | 27.502 |
| MLP-Prior (max.) | 27.052 | 19.906 | 8.742 | 27.052 |
| MLP-Prior (ens.) | 27.260 | 27.280 | 27.280 | 27.280 |
| MLP-Prior (avg.) | 5.858 | 5.702 | 5.996 | 6.210 |
4.2 Main Results
Table 1 displays the accuracy of the global model in all baselines and our PmcmFL. Our PmcmFL framework generally achieves noticeable performance improvement over all baselines in all modality missing rates.
Compared to ignoring incomplete data, the Mask method significantly improves the performance of federated training with 50% modality missing (a 2.448% boost), indicating that Mask is effective in handling severe modality missing.
Compared to FedAvg, FedIoT leads to a performance decline (a 0.796% drop on Acc@sum), which indicates that assigning more aggregation weight to multimodal clients no longer works in our task scenarios. Similarly, FedPAC leads to a performance decline (a 6.106% drop on Acc@sum), which is because the feature alignment (FA) employed by FedPAC, a strong regularization that narrows the distance between representations and prototypes through MSE, disrupts the semantic distinctiveness of representations.
Compared to all the baselines, our PmcmFL achieves superior performance (9.376-24.404% boost on Acc@sum), demonstrating the effectiveness of our Prototype Mask and Contrast. Furthermore, as the modality missing rate increases, the performance improvement brought about by our PmcmFL tends to be more significant (a 0.192% boost in 10% missing but a 3.666% boost in 50% missing).
Compared with PmcmFL's variants, the slightly superior performance (3.14-4.93% boost on Acc@sum) indicates that our Prototype Contrast is more generalizable than Feature Alignment and Hyper Knowledge Distillation. We further validate the better performance of PmcmFL compared with CreamFL in scenarios with a 100% missing rate of either modality. The experimental results are in Appendix I.


4.3 Ablation Studies
We study the effect of each PmcmFL component during training. Table 2 displays the results of ablation studies under various training modality missing rates.
The results show that Prototype Contrast and Prototype Mask each contribute to performance improvement over the baseline, with gains of 7.430% and 11.168% on Acc@sum, respectively. Moreover, combining Prototype Contrast and Prototype Mask to assist local training yields a further performance boost.
4.4 Inference with Missing Modalities
Table 3 shows the testing accuracy of the global model under 100% text missing, intuitively demonstrating the performance of PmcmFL in handling inference with missing modalities. We use the global model trained with 30% modality missing, whose accuracy is 52.256% on modality-complete testing samples. For ProtoMix, results with the top-1, top-10, top-20, and best prototype mixes are reported.
We can clearly see that the inference accuracy of the Zero Mask and Random Mask, serving as baselines, is only 0.996% and 3.662%, respectively, which indicates that missing modalities severely impair the model's inference accuracy. In contrast, the Prototype Mask achieves the highest accuracy of 27.502% (a 23.840% boost), demonstrating that our method helps restore the model's inference accuracy. Moreover, the experiments also demonstrate the effectiveness of our ProtoMix strategy. More results and discussions can be found in Appendix J.
4.5 Qualitative Studies
We utilize T-SNE (Van der Maaten & Hinton, 2008) for visualization to qualitatively analyze the role of Prototypes Contrast in our FL framework.
Figure 3(a) visualizes fused representation distribution learned by different clients. We can observe that in FedAvg, some dots cluster by clients, indicating that each client forms its individual latent space. When local models are aggregated into a global model, individual latent spaces hinder the formation of a unified latent space. In contrast, in our PmcmFL, the heterogeneity of latent spaces among various clients is reduced due to prototypes acting as representation targets to guide local training.
Figure 3(b) visualizes the fused representation distribution learned by the global model. We randomly selected samples of three classes in the testing set. It can be observed that in FedAvg, dots from different classes are mixed, which is not conducive to subsequent classification tasks. In contrast, in our PmcmFL, dots from different classes cluster around their respective prototypes, which validates our starting point of using prototypes for constructing contrastive loss.


4.6 Overhead and Efficiency of Communication
In our PmcmFL, the parameters transmitted during each federated communication round include model weights and the prototype library.
Figure 4 illustrates the communication overhead and communication efficiency of different FL frameworks. As shown, in our tiny VQA task, PmcmFL introduces only negligible additional communication overhead (+0.32%). In the original VQA task, PmcmFL exhibits even lower communication overhead than FedHKD (+3.21% vs. +5.43%). Furthermore, we can observe that when transmitting the same amount of parameters, PmcmFL achieves optimal performance, demonstrating that our framework has the highest communication efficiency.
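For intuition, a back-of-the-envelope estimate of the extra overhead (the per-round library size follows from the settings above; the backbone size is our assumption, not a figure from the paper beyond the reported +0.32%):
$$\underbrace{3}_{\text{prototype types}} \times \underbrace{310}_{\text{classes}} \times \underbrace{768}_{\text{dimension}} \approx 7.1 \times 10^{5} \text{ transmitted values per round},$$
which is consistent with a +0.32% overhead relative to a transmitted backbone on the order of $2.2 \times 10^{8}$ parameters.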
4.7 Robustness to Federated Settings
Figure 5 illustrates the performance of various FL frameworks across different communication rounds, client selection rates, local training epochs, and learning rates. We conduct experiments with 50% modality missing while keeping all federated settings constant except the one being investigated. Four representative FL frameworks are compared: 1) FedAvg with ignoring modality-incomplete data; 2) FedAvg with the Gaussian Random Mask; 3) FedHKD, the existing framework with optimal performance; and 4) our PmcmFL. The experimental results demonstrate that, under different FL settings, our PmcmFL consistently achieves optimal performance, showcasing its robustness to FL settings.




5 Conclusion
In this paper, we investigate multimodal federated learning with uncertain modality missing in both training and inference. Our work introduces a prototype library to empower the FL framework with the capability to alleviate performance degradation resulting from modality missing. Based on the prototype library, we construct a task-calibrated training loss, a model-agnostic unimodal inference strategy, and a proximal term. The effectiveness of our framework has been validated through comparisons with state-of-the-art methods under various missingness settings.
Our forthcoming efforts will center on enriching the semantics of prototypes, improving prototype matching accuracy, and extending the Prototype Mask into a data augmentation method. Furthermore, we recognize the potential of using prototypes to tackle issues with missing or inaccurate labels, and we also intend to explore this avenue.
Impact Statements
This paper presents work whose goal is to advance the field of Machine Learning, thereby facilitating the practical application of multimodal learning in real-world scenarios. The dataset we use is publicly available and does not involve privacy issues.
References
- Bao et al. (2022) Bao, H., Dong, L., Piao, S., and Wei, F. BEiT: BERT pre-training of image transformers. In International Conference on Learning Representations, 2022.
- Chen et al. (2023a) Chen, H., Wang, C., and Vikalo, H. The best of both worlds: Accurate global and personalized models through federated learning with data-free hyper-knowledge distillation. In International Conference on Learning Representations, 2023a.
- Chen & Zhang (2022) Chen, J. and Zhang, A. Fedmsplit: Correlation-adaptive federated multi-task learning across multimodal split networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 87–96, 2022.
- Chen et al. (2023b) Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A. J., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J., Ding, N., Rong, K., Akbari, H., Mishra, G., Xue, L., Thapliyal, A. V., Bradbury, J., and Kuo, W. Pali: A jointly-scaled multilingual language-image model. In International Conference on Learning Representations, 2023b.
- Dai et al. (2023) Dai, Y., Chen, Z., Li, J., Heinecke, S., Sun, L., and Xu, R. Tackling data heterogeneity in federated learning with class prototypes. In Thirty-Seventh AAAI Conference on Artificial Intelligence, pp. 7314–7322, 2023.
- Devlin et al. (2019) Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186, 2019.
- Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
- Feng et al. (2023) Feng, T., Bose, D., Zhang, T., Hebbar, R., Ramakrishna, A., Gupta, R., Zhang, M., Avestimehr, S., and Narayanan, S. Fedmultimodal: A benchmark for multimodal federated learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4035–4045, 2023.
- Goyal et al. (2017) Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6325–6334, 2017.
- Hsu et al. (2019) Hsu, T. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. CoRR, abs/1909.06335, 2019.
- Huang et al. (2022) Huang, W., Ye, M., and Du, B. Learn from others and be yourself in heterogeneous federated learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10133–10143, 2022.
- Huang et al. (2023) Huang, W., Ye, M., Shi, Z., Li, H., and Du, B. Rethinking federated learning with domain shift: A prototype view. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16312–16322, 2023.
- Kim et al. (2021) Kim, W., Son, B., and Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 5583–5594, 2021.
- Li et al. (2023) Li, J., Li, F., Zhu, L., Cui, H., and Li, J. Prototype-guided knowledge transfer for federated unsupervised cross-modal hashing. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 1013–1022, 2023.
- Li et al. (2021) Li, Q., He, B., and Song, D. Model-contrastive federated learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10713–10722, 2021.
- Li et al. (2020) Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, 2020.
- Liao et al. (2023) Liao, X., Liu, W., Chen, C., Zhou, P., Zhu, H., Tan, Y., Wang, J., and Qi, Y. Hyperfed: Hyperbolic prototypes exploration with consistent aggregation for non-iid data in federated learning. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pp. 3957–3965, 2023.
- Liu et al. (2023) Liu, J., Zhan, Y., Luo, X., Chen, Z., Wang, Y., and Xu, X. Prototype-based layered federated cross-modal hashing. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1–2, 2023.
- Lyu et al. (2023) Lyu, F., Tang, C., Deng, Y., Liu, T., Zhang, Y., and Zhang, Y. A prototype-based knowledge distillation framework for heterogeneous federated learning. In 43rd IEEE International Conference on Distributed Computing Systems, pp. 1–11, 2023.
- Ma et al. (2021) Ma, M., Ren, J., Zhao, L., Tulyakov, S., Wu, C., and Peng, X. SMIL: multimodal learning with severely missing modality. In Thirty-Fifth AAAI Conference on Artificial Intelligence, pp. 2302–2310, 2021.
- Ma et al. (2022) Ma, M., Ren, J., Zhao, L., Testuggine, D., and Peng, X. Are multimodal transformers robust to missing modality? In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18156–18165, 2022.
- McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1273–1282, 2017.
- Mendieta et al. (2022) Mendieta, M., Yang, T., Wang, P., Lee, M., Ding, Z., and Chen, C. Local learning matters: Rethinking data heterogeneity in federated learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8387–8396, 2022.
- Peng et al. (2022) Peng, Z., Dong, L., Bao, H., Ye, Q., and Wei, F. BEiT v2: Masked image modeling with vector-quantized visual tokenizers. 2022.
- Qu et al. (2022) Qu, L., Zhou, Y., Liang, P. P., Xia, Y., Wang, F., Adeli, E., Fei-Fei, L., and Rubin, D. L. Rethinking architecture design for tackling data heterogeneity in federated learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10051–10061, 2022.
- Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.
- Ramesh et al. (2022) Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. CoRR, abs/2204.06125, 2022.
- Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. S. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.
- Tan et al. (2022) Tan, Y., Long, G., Liu, L., Zhou, T., Lu, Q., Jiang, J., and Zhang, C. Fedproto: Federated prototype learning across heterogeneous clients. In Thirty-Sixth AAAI Conference on Artificial Intelligence, pp. 8432–8440, 2022.
- Van der Maaten & Hinton (2008) Van der Maaten, L. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
- Wang et al. (2023a) Wang, H., Chen, Y., Ma, C., Avery, J., Hull, L., and Carneiro, G. Multi-modal learning with missing modality via shared-specific feature modelling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15878–15887, 2023a.
- Wang et al. (2023b) Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., and Wei, F. Image as a foreign language: BEIT pretraining for vision and vision-language tasks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19175–19186, 2023b.
- Xu et al. (2023) Xu, J., Tong, X., and Huang, S. Personalized federated learning with feature alignment and classifier collaboration. In International Conference on Learning Representations, 2023.
- Yu et al. (2022) Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., and Wu, Y. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022.
- Yu et al. (2023) Yu, Q., Liu, Y., Wang, Y., Xu, K., and Liu, J. Multimodal federated learning via contrastive representation ensemble. In International Conference on Learning Representations, 2023.
- Zhang et al. (2018) Zhang, H., Cissé, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
- Zhao et al. (2021) Zhao, J., Li, R., and Jin, Q. Missing modality imagination network for emotion recognition with uncertain missing modalities. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 2608–2618, 2021.
- Zhao et al. (2022) Zhao, Y., Barnaghi, P. M., and Haddadi, H. Multimodal federated learning on iot data. In Seventh IEEE/ACM International Conference on Internet-of-Things Design and Implementation, pp. 43–54, 2022.
- Zhu et al. (2021) Zhu, Z., Hong, J., and Zhou, J. Data-free knowledge distillation for heterogeneous federated learning. In Proceedings of the 38th International Conference on Machine Learning, pp. 12878–12889, 2021.
Appendix A Extend to Other Multimodal Tasks
In the main body, we illustrate the framework of PmcmFL using a text-image multimodal classification task (VQA) as an example and apply VQA to evaluate the framework’s performance. However, this does not imply that our PmcmFL is limited to specific modalities or tasks.
On the contrary, as the PmcmFL framework does not depend on specific model structures and only requires the model to be abstractly divided into modal encoders, fusion layers, and task heads, PmcmFL can be directly extended to almost all multimodal classification tasks.
In the table below, we list some datasets to which PmcmFL can be extended, including the modalities and tasks associated with these datasets. The table represents just the tip of the iceberg for multimodal tasks that require modality fusion.
| Datasets | Modalities | Tasks |
|---|---|---|
| CrisisMMD | Image, Text | Crisis Information Classification |
| Hateful-Memes | Image, Text | Hateful Content Detection |
| MOSI | Video, Audio, Text | Multimodal Sentiment Analysis |
| MELD | Audio, Text | Multimodal Sentiment Analysis |
| UCF101 | Video, Audio | Multimodal Action Recognition |
We can see that the datasets to which PmcmFL can be extended encompass inputs from various modalities and involve a variety of task types.
As for the extension strategy, taking the multimodal sentiment classification task on the MOSI dataset as an example, one simply needs to replace the image-text encoder described in PmcmFL with a video encoder, speech encoder, and text encoder. The fusion module can be specially designed for the task, along with a corresponding classification task head.
Appendix B Algorithm of PmcmFL
Algorithm 1 and Algorithm 2 illustrate the flowcharts of our PmcmFL during training and inference, respectively.
Input: Number of federated communication rounds $T$, number of clients $K$, number of local training epochs, the server model $w$, local models $w_{k}$, the local learning rate, the dataset $\mathcal{D}_{k}$ of the $k$-th client, and the fraction of clients that are selected to perform computation in each round.
Output: The final server model's parameters.
ServerExecutes:
Initialize the global model randomly;
Initialize the global prototypes with zero tensors;
for each communication round $t = 1, \dots, T$ do
  Construct or update the prototype library from the received local prototypes;
  Broadcast the global model and the prototype library to the selected clients;
  for each selected client $k$ in parallel do
    Perform local training with the task-calibrated loss and the prototype contrastive loss;
    Update the local prototypes by computing class centroids;
    Transmit the local model $w_{k}$ and the local prototypes back to the server;
  Aggregate the local models into the global model with client sample counts as weights;
end for
Input: Server model $w$, testing set.
Output: Labels for classification tasks.
if the input is a complete image-text pair then
  Compute the prediction using both modalities as usual;
else if only the image is available then
  Compute the image representation $z^{I}$;
  Search for the corresponding image prototype using the matching function $\mathcal{M}$;
  Find the corresponding text prototype through the association in the prototype library;
  Compute the prediction $H(F(z^{I}, P^{T}_{c}))$;
else if only the text is available then
  Compute the text representation $z^{T}$;
  Search for the corresponding text prototype using the matching function $\mathcal{M}$;
  Find the corresponding image prototype through the association in the prototype library;
  Compute the prediction $H(F(P^{I}_{c}, z^{T}))$;
end if
Appendix C Details on Multiway Transformer
We first introduce the differences between Multiway Transformers and regular Transformers. We also emphasize that although we use Multiway Transformers as encoders, any encoder architecture is applicable within the PmcmFL framework.
C.1 Multiway Transformer
In summary, a typical Transformer consists of Multi-Head Self-Attention layers (Attention), Feedforward Networks (FFN), Layer Normalizations, and Residual Connections. The difference between Multiway Transformers and regular Transformers lies in the Attention layers and the FFNs. Specifically, the Attention layers in Multiway Transformer are modality-shared self-attention, and the FFNs are Modality-Specific Feedforward Networks (also known as Modality-Specific Experts Networks), meaning each modality has its own FFN.
We use images and text as examples to illustrate how the Multiway Transformer switches encoding modes. The input images and text are initially processed into sequences of tokens via patchification or tokenization, together with positional encoding. The initial token sequences (the image token sequence, the text token sequence, and the image-text token sequence) can be represented as:
$$X^{I}_{0} = [x^{I}_{cls}, x^{I}_{1}, \dots, x^{I}_{N_{I}}], \qquad (9)$$
$$X^{T}_{0} = [x^{T}_{cls}, x^{T}_{1}, \dots, x^{T}_{N_{T}}], \qquad (10)$$
$$X^{IT}_{0} = [x^{I}_{cls}, x^{I}_{1}, \dots, x^{I}_{N_{I}}, x^{T}_{cls}, x^{T}_{1}, \dots, x^{T}_{N_{T}}], \qquad (11)$$
where $x^{I}_{i}$ denotes an image token, $x^{T}_{j}$ denotes a text token, $x^{I}_{cls}$ denotes the image CLS token, $x^{T}_{cls}$ denotes the text CLS token, and $N_{I}$ and $N_{T}$ denote the number of image tokens and text tokens, respectively.
We represent the Shared Self-Attention layer at the $l$-th encoding layer as $\mathrm{Attn}_{l}$. For the Modality-Specific Expert Networks at the $l$-th encoding layer, we use $\mathrm{FFN}^{I}_{l}$ and $\mathrm{FFN}^{T}_{l}$ to represent the vision expert network and the language expert network, respectively. For the sake of clarity, we omit Residual Connections and Layer Normalizations.
When dealing with only image inputs, Multiway Transformer utilizes the attention module and the visual expert network at each encoding layer:
$$X^{I}_{l} = \mathrm{FFN}^{I}_{l}\big(\mathrm{Attn}_{l}(X^{I}_{l-1})\big), \qquad (12)$$
When dealing with only text inputs, Multiway Transformer utilizes the attention module and the language expert network at each encoding layer:
$$X^{T}_{l} = \mathrm{FFN}^{T}_{l}\big(\mathrm{Attn}_{l}(X^{T}_{l-1})\big), \qquad (13)$$
When dealing with image-text pairs, Multiway Transformer first employs the shared self-attention module for cross-modal interaction:
$$\tilde{X}^{IT}_{l} = \mathrm{Attn}_{l}(X^{IT}_{l-1}), \qquad (14)$$
Subsequently, individual expert networks are applied to tokens from the corresponding modality:
$$X^{IT}_{l} = \big[\mathrm{FFN}^{I}_{l}(\tilde{X}^{I}_{l});\ \mathrm{FFN}^{T}_{l}(\tilde{X}^{T}_{l})\big], \qquad (15)$$
where $\tilde{X}^{I}_{l}$ and $\tilde{X}^{T}_{l}$ denote the image tokens and the text tokens in $\tilde{X}^{IT}_{l}$, respectively.
By utilizing Modality-Specific Expert Networks, Multiway Transformers can switch among different encoding modes.
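The following is a simplified PyTorch sketch of one Multiway block, intended only to illustrate the shared-attention / modality-specific-FFN routing; it is not the BEiT-3 implementation, and the pre-norm layout, shared FFN LayerNorm, and GELU expert FFNs are assumptions.

```python
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """Shared self-attention followed by modality-specific expert FFNs."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, num_img_tokens):
        # Shared attention over the (possibly concatenated) token sequence.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Route image and text tokens to their own expert FFNs.
        img, txt = x[:, :num_img_tokens], x[:, num_img_tokens:]
        img = img + self.ffn_img(self.norm2(img))
        out = [img]
        if txt.size(1) > 0:
            out.append(txt + self.ffn_txt(self.norm2(txt)))
        return torch.cat(out, dim=1)
```

With `num_img_tokens` equal to the full sequence length the block behaves as an image-only encoder, with `num_img_tokens=0` as a text-only encoder, and with a concatenated image-text sequence as an interactive encoder.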
C.2 Decoupling Framework and Model
It is important to highlight that PmcmFL is a universal framework, and the encoders used in our PmcmFL are not limited to the Multiway Transformer. In fact, our framework is flexible enough to incorporate encoders of other architectures.

Figure 6 illustrates the general architecture of multimodal models and the architecture we use. The same as ours, general multimodal models consist of three main parts: modality-specific encoders, a fusion module, and a task head.
Due to the similar architecture, general multimodal models can be easily developed in our PmcmFL. To elaborate, we can construct a prototype library by utilizing the outputs of modality-specific encoders and the fusion module. Based on the prototype library, PmcmFL can execute as usual.
It is worth noting that if the fusion module employs the Transformer architecture, it could be a wise move to integrate a module of bottleneck architecture after the modality-specific encoders to obtain low-dimension representation for computing prototypes, which will significantly reduce federated communication overhead.
The reason we adopt Multiway Transformer is due to its superior ability in handling multimodal interaction. In our PmcmFL, Multiway Transformer not only functions as modality-specific encoders but also functions as dual-stream interactive encoders to achieve better modality fusion.
Appendix D Details on the Dataset
In this section, we provide a detailed description of the construction process for tiny VQA, along with visualizations of the non-IID data partition and modality missingness.
D.1 Tiny VQA
Due to the typically slower convergence of models in federated learning, the training cost is significantly increased. The full scale of VQAv2 is too extensive for evaluating federated learning frameworks. Therefore, we evaluate on a randomly selected subset of VQAv2. This approach has been demonstrated in prior works, such as CreamFL.
Our tiny VQA is a scaled-down version, reducing both the label quantity and the training data volume to one-tenth of the original VQA task. The original VQA is commonly treated as a classification task with over 3,000 classes, supported by a vast training dataset exceeding 640,000 samples. We refine the labels by excluding those with fewer than 180 occurrences, reshaping VQA into a classification challenge with 310 classes. For training, we randomly choose 64,000 image-text pairs, equivalent to one-tenth of the combined original VQA training and validation sets. Additionally, our test set comprises 5,000 randomly selected image-text pairs, none of which overlap with the training set. All experiments are conducted on tiny VQA, except for the comparison with CreamFL.
D.2 Visualization for Data Distribution
To simulate real-world scenarios, we distributed the training data of tiny VQA across 30 clients, using a Dirichlet distribution with a hyperparameter of 0.1 as the basis for non-IID partitioning. Previous works involving non-IID data have mostly adopted this hyperparameter setting, which describes a severe non-IID scenario.
Additionally, to simulate uncertain modality missing, we assigned a fixed missing rate for each modality on every client, thereby conforming to the Bernoulli distribution. In our experiments, we considered five missing ratios: 10%, 20%, 30%, 40%, and 50%.
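A sketch of this simulation in NumPy follows; how to handle pairs in which both modalities happen to be dropped is not specified above and is left as an assumption (e.g., keep at least one modality or discard the pair), and all function names are illustrative.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=30, alpha=0.1, seed=0):
    """Assign sample indices to clients with a per-class Dirichlet split."""
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(num_clients))   # client proportions
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients

def drop_modalities(num_samples, missing_rate=0.5, seed=0):
    """Independently drop the image or the text of each pair (Bernoulli)."""
    rng = np.random.default_rng(seed)
    drop_img = rng.random(num_samples) < missing_rate
    drop_txt = rng.random(num_samples) < missing_rate
    return drop_img, drop_txt
```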
We visualized the data/modality distribution to illustrate the corresponding non-IIDness and modalities missingness as follows.


Visualization for Non-IIDness. We select 40 classes from a pool of 310 and visualize their distribution across various clients. Figure 7 displays the visualization results, indicating the heterogeneous class distribution across clients.
Visualization for Modality Missingness. We visualized the modality distribution on various clients under different modality missing rates. Figure 8 displays the visualization results, showing the diversified modality distribution.
Appendix E Details on the Model
In this section, we explain the model architecture used in our experiments.
For the Multiway Transformer encoder, we employ the architecture of BEIT3-base, which consists of 12 encoding layers. The detailed model architecture can be found in the BEIT3 GitHub repository (https://github.com/microsoft/unilm/tree/master/beit3). We make no modifications to the encoder architecture and utilize its pre-trained weights (the pre-training data does not include VQAv2). It is worth noting that when the input is an image-text pair, the original model only uses the image CLS token for subsequent tasks, whereas we utilize both the image CLS token and the text CLS token. We simply add an additional output to accommodate our PmcmFL.
For the fusion module, we first apply layer normalization to the image CLS token and the text CLS token separately, then concatenate them. Subsequently, a linear layer and activation layer are applied to obtain the corresponding fusion representation. The dimensions of both the CLS tokens and the fusion representations are 1×768.
For the task head, we use a linear layer to double the dimensions, followed by a layer normalization. After an activation layer, we then use another linear layer to reduce the dimensions to 310, which corresponds to the number of classification categories.
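A sketch of the fusion layer and task head as described above (layer-normalize each CLS token, concatenate, project; the head doubles the dimension before mapping to 310 classes); the GELU activation is an assumption, as the text only says "activation layer".

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """LayerNorm each CLS token, concatenate, and project back to 768-d."""
    def __init__(self, dim=768):
        super().__init__()
        self.norm_img, self.norm_txt = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())

    def forward(self, img_cls, txt_cls):
        return self.proj(torch.cat([self.norm_img(img_cls), self.norm_txt(txt_cls)], dim=-1))

class TaskHead(nn.Module):
    """Double the dimension, normalize, activate, then map to 310 classes."""
    def __init__(self, dim=768, num_classes=310):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.LayerNorm(2 * dim),
                                 nn.GELU(), nn.Linear(2 * dim, num_classes))

    def forward(self, fused):
        return self.mlp(fused)
```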
Appendix F Details on Baselines
In this section, we elaborate on the details of the baselines.
F.1 Principle of Selection
For training with modality missing, we propose two innovations: Prototype Mask and Prototype Contrast. Prototype Mask is a strategy to address modality missing, while Prototype Contrast is a strategy to alleviate client heterogeneity and enhance the global model performance. Based on these, we aim to select the most suitable baseline.
Handling modality missing. When modality is missing, the simplest approach is to ignore the incomplete data and only use the data with complete modalities for training. We refer to this approach as Ignore Missing. Previous works used zero tensors to replace the representation of the missing modality. We refer to this approach as zero Mask. A straightforward extension is to use a random tensor following a Gaussian distribution for masking, which we refer to as Random Mask. We consider these three strategies as baselines for handling the modality missing issue and compare them with our proposed Prototype Mask.
Improving the global model. Both non-independent and identically distributed (Non-IID) data and modality missing can lead to client heterogeneity, severely compromising the performance of the global model. FedAvg, as the fundamental federated learning algorithm, does not provide a solution to address non-IID issues. FedProx introduces a proximal term to constrain the updates of local models, demonstrating its effectiveness in mitigating the impact of non-IID scenarios. FedIoT tackles the problem from the perspective of model aggregation strategies, granting more aggregation weight to multimodal clients to alleviate the influence of modality-incomplete clients. Prototype-based Feature Alignment (FA) strategies in FedPAC and Prototype-based Hyper Knowledge Distillation (HKD) in FedHKD are both optimal approaches aimed at mitigating non-IID challenges. We consider these five federated learning frameworks as baselines for improving the global model and compare them with our proposed Prototype Contrast.
F.2 Implementation Details of Baselines
We will elaborate on the implementation details of each baseline.
FedAvg + Ignore. Within the FedAvg framework, the training process involves disregarding all training data on clients with incomplete modalities and using only the data with complete modalities for training.
FedMultimodal (FedAvg + Zero Mask). Based on FedAvg, we adopt the Zero Mask as the strategy for handling missing modalities. This approach is consistent with FedMultimodal.
FedMultimodal + RM (FedAvg + Random Mask). Based on FedAvg, we adopt the Random Mask as the strategy for handling missing modalities. Random tensors are obtained from a Gaussian distribution.
We compare three strategies for handling missing data under FedAvg and find that using the Random Mask yields the best results. Therefore, subsequent baselines all adopt Random Mask.
FedProx + Random Mask. Based on FedAvg, we adopt the Random Mask as the strategy for handling missing modalities. According to the relevant paper, the weight of the proximal term is set to 1.0.
FedIoT + Random Mask. Based on FedAvg, we adopt the Random Mask as the strategy for handling missing modalities. According to the paper of FedIoT, we set 100 times the aggregation weight for image-text pairs.
FedPAC + Random Mask. Based on FedAvg, we adopt the Random Mask as the strategy for handling missing modalities. We implement a feature alignment strategy on the fused representations. According to the paper of FedPAC, the hyperparameter of feature alignment is set to 1.0.
FedHKD + Random Mask. Based on FedHKD, we adopt the Random Mask as the strategy for handling missing modalities. Hyper knowledge distillation is applied to the fused representations. According to the FedHKD paper, the hyperparameter of hyper knowledge distillation is set to 0.05.
PmmFL + FA. Based on FedAvg, we adopt the Prototype Mask as the strategy for handling missing modalities. Feature alignment is applied to the fused representations, with the hyperparameter set to 1.0.
PmmFL + HKD. Based on FedAvg, we adopt the Prototype Mask as the strategy for handling missing modalities. Hyper knowledge distillation is applied in the fused representations, with the hyperparameter set to 0.05.
CreamFL. We use the results reported by CreamFL.
Appendix G Implementation Details
We will go through the implementation details of each experiment one by one.
Main Experiments.
The main experiment establishes consistent missing rates for each modality, specifically 0.1, 0.2, 0.3, 0.4, and 0.5. We conduct testing using the modality-complete test set and present the results in Table 1 (testing with partially missing modalities is reported in Appendix H). All experimental results are averaged over five runs with fixed random seeds. We consider a conventional federated setting for the main experiment. We set the number of communication rounds to 30. In each round, 5 clients are selected to participate in training (the client selection rate is 0.167). We referred to the BEIT3 GitHub repository for the learning rate and optimizer parameters. The learning rate is fixed at 3e-5. We use the AdamW optimizer with a weight decay of 1e-4, a momentum of 0.9, and betas of [0.9, 0.98]. For the training loss, cross-entropy is used as the task loss, and the temperature for the prototype contrastive loss is set to 0.07, following the CLIP GitHub repository. We only perform a simple hyperparameter search for the weight of the prototype contrastive loss, selecting the optimal value from the set [5.0, 1.0, 0.5, 0.1, 0.01]. All experiments are conducted on 4 NVIDIA RTX 4090 GPUs using distributed data parallelism, with a total batch size of 48 (12 per GPU).
Ablation Studies. We conducted ablation studies on the Prototype Mask and Prototype Contrast in our PmcmFL. The hyperparameter settings for the ablation experiments are the same as those for the main experiment. Note that Sec. 4.7 (Robustness to Federated Settings) can also be considered as an ablation study on federated settings.
Inference with Missing Modalities. To intuitively demonstrate the role of Prototype Masks during the inference phase, we conduct inference with one modality completely missing. The model used for this experiment is the best-performing model trained in the main experiment with 30% modality missing. Without loss of generality, we omit 100% of the text modality, so the experiment can also be regarded as image-only unimodal inference.
As described in the method section, a tiny model must be trained in the final round of federated communication to match representations to prototypes. For model-based matching, we train these tiny models on each client in the final round. The classification-based matching uses a 2.2M-parameter MLP, accounting for less than 1% of one communication round's overhead; it takes image representations as input for a 310-class classification task and is trained for 100 epochs on each client, with a single-GPU training time under 1 minute. Similarly, the retrieval-based matching uses a 2.9M-parameter MLP, constituting approximately 1% of one communication round's overhead; it takes image representations as input and generates the corresponding image prototypes. Its training involves 200 epochs with a contrastive loss followed by 500 epochs with an MSE loss, taking approximately 3-5 minutes on a single GPU.
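A minimal sketch of the two kinds of tiny matching models is given below. The hidden sizes and representation dimensions are illustrative and are not tuned to reproduce the 2.2M/2.9M parameter counts quoted above.

```python
import torch.nn as nn

class ClassifierMatcher(nn.Module):
    """Classification-based matching: map an image representation to one of 310 prototype classes."""
    def __init__(self, repr_dim=768, hidden=1024, num_classes=310):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(repr_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, image_repr):
        return self.net(image_repr)      # class logits over prototypes

class RetrievalMatcher(nn.Module):
    """Retrieval-based matching: regress an image representation onto its prototype
    (trained first with a contrastive loss, then with an MSE loss, as described above)."""
    def __init__(self, repr_dim=768, hidden=1024, proto_dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(repr_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, proto_dim))

    def forward(self, image_repr):
        return self.net(image_repr)      # predicted prototype
```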
The server then needs to select or integrate these tiny models from the clients. We experiment with three strategies: selecting the model from the client with the most samples, aggregating model parameters via FedAvg, and model ensembling. The first strategy chooses the model from the client with the most data samples among the 30 clients. The aggregation strategy combines the 30 tiny models through FedAvg. The ensemble strategy merges classification probabilities or retrieval scores from the individual models through weighted aggregation, with weights proportional to the sample counts of the respective clients.
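The sample-weighted ensemble strategy can be sketched as follows; client_models and client_sizes are placeholders for the collected tiny models and per-client sample counts, and the sketch is written for the classification-based matchers (retrieval scores would be ensembled analogously).

```python
import torch

def ensemble_match(image_repr, client_models, client_sizes):
    """Sample-weighted ensemble of the clients' tiny matchers.

    Class probabilities are averaged with weights proportional to each
    client's number of training samples; the argmax gives the matched prototype.
    """
    weights = torch.tensor(client_sizes, dtype=torch.float)
    weights = weights / weights.sum()
    scores = 0.0
    for w, model in zip(weights, client_models):
        scores = scores + w * torch.softmax(model(image_repr), dim=-1)
    return scores.argmax(dim=-1)   # index of the matched prototype for each sample
```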
Qualitative Studies. We use the best model trained in the main experiment with 50% modality missing for the qualitative experiments. The associated client models are also used.
To compare the heterogeneity among client models under FedAvg and PmcmFL, we randomly select three of the five client models and visualize the distributions of their fused representations on the test set. To compare the global models under FedAvg and PmcmFL, we select three categories (with corresponding labels 'red', 'blue', and 'green') from the 310 classes and visualize the distributions of their fused representations.
Communication Efficiency. The experiments on communication efficiency utilize the results derived from the main experiment with 50% modality missing.
Robustness to Federated Settings. This experiment can also be considered an ablation study on federated settings. We test PmcmFL's robustness to varying communication rounds, client selection rates, local training epochs, and learning rates. For communication rounds, we use [5, 10, 15, 20, 25, 30], with the client selection rate fixed at 0.1, local training epochs at 3, and the learning rate at 3e-5. For client selection rates, we use [0.1, 0.167, 0.233, 0.333], with 20 communication rounds, 3 local training epochs, and a learning rate of 3e-5. For local training epochs, we use [1, 3, 5, 7], with 20 communication rounds, a client selection rate of 0.1, and a learning rate of 3e-5. For learning rates, we use [1e-5, 3e-5, 5e-5, 7e-5], with 20 communication rounds, a client selection rate of 0.1, and 3 local training epochs.
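For reference, the four sweeps can be written as the following configuration grid, with values and defaults taken directly from the description above; the dictionary layout itself is merely illustrative.

```python
# Robustness sweeps: each entry varies one factor while the others stay at the listed defaults.
SWEEPS = {
    "communication_rounds": {"values": [5, 10, 15, 20, 25, 30],
                             "defaults": {"selection_rate": 0.1, "local_epochs": 3, "lr": 3e-5}},
    "selection_rate":       {"values": [0.1, 0.167, 0.233, 0.333],
                             "defaults": {"rounds": 20, "local_epochs": 3, "lr": 3e-5}},
    "local_epochs":         {"values": [1, 3, 5, 7],
                             "defaults": {"rounds": 20, "selection_rate": 0.1, "lr": 3e-5}},
    "lr":                   {"values": [1e-5, 3e-5, 5e-5, 7e-5],
                             "defaults": {"rounds": 20, "selection_rate": 0.1, "local_epochs": 3}},
}
```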
Appendix H Inference with Partial Modality Missing
We train models with modalities missing at rates of [10%, 20%, 30%, 40%, 50%] and perform inference on test sets with modalities missing at the same rates.
In baselines, Random Mask serves as the strategy for handling missing modalities, except for FedMultimodal (FedAvg + Zero Mask), which employs Zero Mask. In our PmcmFL, the Prototype Mask is used for handling missing modalities.
The experimental results are shown in Table 5. We compare our PmcmFL with existing approaches, so the two variants of our framework (PmmFL+FA and PmmFL+HKD) are not presented in the table.
As can be seen, PmcmFL achieves the highest accuracy sum at every training missing rate, demonstrating its superiority over existing approaches.
Table 5: Accuracy (%) of models trained with different modality missing rates (rows) and evaluated on test sets with different missing rates (columns).

| Methods | Training Missing Rate | Test 10% | Test 20% | Test 30% | Test 40% | Test 50% | Sum |
|---|---|---|---|---|---|---|---|
| FedAvg + Ignore | 10% | 52.016 | 50.756 | 46.812 | 42.254 | 36.720 | 228.558 |
| | 20% | 47.904 | 47.318 | 43.198 | 39.386 | 34.278 | 212.084 |
| | 30% | 42.490 | 43.138 | 39.250 | 36.340 | 31.802 | 193.020 |
| | 40% | 38.938 | 40.462 | 36.964 | 34.218 | 30.040 | 180.622 |
| | 50% | 33.760 | 36.180 | 33.406 | 30.492 | 26.926 | 160.764 |
| FedMultimodal | 10% | 52.822 | 50.288 | 47.234 | 44.478 | 39.766 | 234.588 |
| | 20% | 50.310 | 47.624 | 45.056 | 42.280 | 38.366 | 223.636 |
| | 30% | 47.252 | 44.832 | 43.174 | 40.288 | 37.242 | 212.788 |
| | 40% | 45.496 | 43.034 | 41.904 | 38.824 | 36.228 | 205.486 |
| | 50% | 42.588 | 39.828 | 40.034 | 36.872 | 34.658 | 193.980 |
| FedMultimodal + RM | 10% | 52.268 | 50.242 | 46.230 | 45.352 | 39.962 | 234.054 |
| | 20% | 49.176 | 47.602 | 43.954 | 43.368 | 38.460 | 222.560 |
| | 30% | 45.290 | 44.862 | 41.614 | 41.120 | 36.966 | 209.852 |
| | 40% | 42.554 | 42.958 | 39.844 | 39.862 | 35.870 | 201.088 |
| | 50% | 38.568 | 39.964 | 37.348 | 37.536 | 34.082 | 187.498 |
| FedProx + Mask | 10% | 50.382 | 47.392 | 44.790 | 45.234 | 39.940 | 227.738 |
| | 20% | 48.094 | 44.978 | 42.848 | 43.170 | 38.462 | 217.552 |
| | 30% | 45.690 | 42.592 | 40.808 | 41.514 | 36.936 | 207.540 |
| | 40% | 44.300 | 41.002 | 39.414 | 40.240 | 35.806 | 200.762 |
| | 50% | 42.132 | 38.350 | 37.372 | 38.136 | 34.138 | 190.128 |
| FedIoT + Mask | 10% | 52.714 | 49.910 | 45.698 | 45.664 | 40.604 | 234.590 |
| | 20% | 50.278 | 47.112 | 43.544 | 43.486 | 39.130 | 223.550 |
| | 30% | 47.572 | 44.694 | 41.266 | 41.456 | 37.430 | 212.418 |
| | 40% | 45.606 | 43.016 | 39.726 | 39.968 | 36.226 | 204.542 |
| | 50% | 43.164 | 40.000 | 37.420 | 37.494 | 34.430 | 192.508 |
| FedPAC + Mask | 10% | 52.044 | 48.640 | 45.266 | 43.950 | 39.142 | 229.042 |
| | 20% | 49.738 | 46.094 | 43.266 | 41.934 | 37.618 | 218.650 |
| | 30% | 47.082 | 43.558 | 41.084 | 39.692 | 36.282 | 207.698 |
| | 40% | 45.402 | 42.024 | 39.656 | 38.258 | 35.376 | 200.716 |
| | 50% | 42.430 | 39.236 | 37.460 | 35.840 | 33.872 | 188.838 |
| FedHKD + Mask | 10% | 53.774 | 51.708 | 48.298 | 46.060 | 42.256 | 242.096 |
| | 20% | 51.222 | 49.654 | 45.826 | 44.004 | 40.946 | 231.652 |
| | 30% | 48.232 | 47.498 | 43.186 | 42.064 | 39.654 | 220.634 |
| | 40% | 46.524 | 45.306 | 41.590 | 40.962 | 37.908 | 212.290 |
| | 50% | 43.632 | 43.358 | 38.790 | 38.522 | 37.394 | 201.696 |
| PmcmFL(Ours) | 10% | 53.810 | 52.846 | 49.710 | 47.206 | 45.106 | 248.678 |
| | 20% | 51.552 | 50.606 | 47.584 | 45.428 | 43.300 | 238.470 |
| | 30% | 48.148 | 47.452 | 44.916 | 42.932 | 41.054 | 224.502 |
| | 40% | 45.792 | 45.812 | 42.874 | 41.014 | 39.404 | 214.896 |
| | 50% | 43.490 | 43.018 | 39.912 | 39.132 | 38.264 | 203.816 |
Appendix I Comparison with CreamFL
We compare PmcmFL with CreamFL; because the task scenario differs from that of the main experiments, we present the results separately here.
Similar to CreamFL, we conduct experiments in the simple scenario containing only unimodal clients and modality-complete multimodal clients. We train models on the 3,000+ most frequent answers (the original VQA task) and report the inference accuracy on 5K test samples. Table 6 shows the results.
Table 6: Accuracy (%) comparison with CreamFL and other baselines.

| Methods | Accuracy (%) |
|---|---|
| FedAvg | 52.54 |
| FedIoT | 53.06 |
| FedMD | 57.43 |
| FedET | 59.90 |
| FedGEMS | 60.23 |
| CreamFL+Avg | 58.64 |
| CreamFL+IoT | 59.64 |
| CreamFL | 62.12 |
| PmcmFL(Ours) | 65.53 |
The results show that our framework also achieves SOTA performance, with a 3.41-point accuracy gain over CreamFL. This suggests that PmcmFL is equally applicable to the specific task scenarios considered by previous works and outperforms methods designed specifically for such scenarios.
We attribute this improvement mainly to the Multiway Transformer encoder, which excels at modality fusion and thus benefits the task. This reflects an advantage of our federated framework over CreamFL, which cannot train the modality fusion module on clients.
Appendix J Discussions on ProtoMix
We first present the test-set matching accuracy of each matching strategy and then analyze the results produced by ProtoMix.
J.1 Accuracy of Various Matching Strategies
Table 7 displays the matching accuracy of each matching strategy. The data presented corresponds directly to Table 3.
Table 7: Matching accuracy (%) of each matching strategy.

| Methods | Top-1 | Top-5 | Top-10 | Top-20 |
|---|---|---|---|---|
| L1 Metric | 2.46 | 5.94 | 10.78 | 28.46 |
| L2 Metric | 1.04 | 3.14 | 4.94 | 11.86 |
| COS Metric | 1.04 | 3.32 | 6.84 | 27.04 |
| Classifier(max) | 19.10 | 46.28 | 50.80 | 55.16 |
| Classifier(ens) | 22.34 | 54.38 | 62.72 | 69.14 |
| Classifier(avg) | 23.20 | 51.16 | 57.18 | 62.74 |
| MLP-Prior(max) | 19.56 | 41.92 | 42.62 | 43.46 |
| MLP-Prior(ens) | 21.70 | 47.92 | 52.38 | 56.16 |
| MLP-Prior(avg) | 0.36 | 4.32 | 5.36 | 9.72 |
We can see that matching accuracy differs substantially across strategies. The top-1 matching accuracy of the model-free strategies is only around 1.04-2.46%, whereas that of the model-based strategies is approximately 19.10-23.20% (except for MLP-Prior with parameter aggregation).
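For clarity, a minimal sketch of model-free matching with the L1, L2, and cosine metrics is shown below; the prototype library is assumed to be a tensor of class prototypes, and the function signature is ours.

```python
import torch
import torch.nn.functional as F

def model_free_match(query, prototypes, metric="l2", k=5):
    """Return the indices of the k prototypes closest to the query representation.

    query: [dim] representation of the available modality.
    prototypes: [num_classes, dim] prototype library.
    metric: 'l1', 'l2', or 'cos'.
    """
    if metric == "l1":
        dist = torch.cdist(query[None], prototypes, p=1).squeeze(0)
    elif metric == "l2":
        dist = torch.cdist(query[None], prototypes, p=2).squeeze(0)
    else:
        # Cosine distance = 1 - cosine similarity.
        dist = 1 - F.cosine_similarity(query[None], prototypes, dim=-1)
    return torch.topk(dist, k, largest=False).indices
```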
In addition, compared with the top-1 matching accuracy, the top-k (k=5, 10, 20) matching accuracy improves substantially. This suggests that considering more prototypes can yield more closely matched prototypes, which forms the basis of our ProtoMix.
Comparing with Table 3 in the main text, we observe a positive correlation between top-1 prototype matching accuracy and the accuracy of inference with missing modalities. This correlation is both reasonable and desired, and it indicates that improving prototype matching accuracy is key to higher inference accuracy under missing modalities. In fact, when we perform unimodal inference using the ground-truth prototypes (i.e., 100% matching accuracy), the accuracy reaches 43.314%, only 8.942% lower than modality-complete inference.
J.2 Analysis of ProtoMix
Figure 9 displays all possible ProtoMix results for the nine matching strategies.

We observe that almost all matching strategies benefit from ProtoMix, indicating that our strategy of augmenting prototype semantics is effective. However, the improvement brought by top-k ProtoMix is limited and does not track the top-k prototype matching accuracy, suggesting that there is still considerable room for improvement in ProtoMix.
Additionally, we observe significant differences in the performance of ProtoMix across matching strategies, which we analyze one by one. Two key factors influence the performance of ProtoMix: prototype matching accuracy and matching confidence.
For the three model-free matching strategies, both matching accuracy and confidence are relatively low. In this case, confidence-weighted fusion of the top-k prototypes yields prototypes with more accurate semantics.
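A minimal sketch of this confidence-weighted fusion is given below; normalizing the confidences into fusion weights is an illustrative choice and may differ from our exact implementation.

```python
import torch

def protomix(scores, prototypes, k=5):
    """Confidence-weighted fusion of the top-k matched prototypes.

    scores: [num_classes] matching scores (classification probabilities or retrieval scores).
    prototypes: [num_classes, dim] prototype library.
    """
    conf, idx = torch.topk(scores, k)        # top-k confidences and prototype indices
    weights = conf / conf.sum()              # normalize confidences into fusion weights
    return (weights[:, None] * prototypes[idx]).sum(dim=0)   # fused prototype, shape [dim]
```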
For the three classifier-based matching strategies, matching accuracy is moderate, but the confidence assigned to the top-1 match is exceptionally high. We observe that both a single classifier and the parameter-aggregated classifier achieve only around 19-23% top-1 matching accuracy yet assign over 99% confidence to the top-1 class. Consequently, confidence-weighted fusion of the top-k prototypes incorporates very little semantic information from the other prototypes, and the inference accuracy remains essentially unchanged regardless of how many prototypes are fused. This limitation stems from the poor generalization of the classifier, which overfits the uneven distribution of client data during local training.
For the three retrieval-based matching strategies, a single MLP-Prior achieves 19.56% prototype matching accuracy, and its top-1 retrieval score is less sharply peaked, which creates a gap between the maximum and ensemble variants. However, since parameter aggregation is not well suited to this non-classification task, the aggregated model exhibits low matching accuracy and confidence, leading to performance comparable to the model-free strategies.
Appendix K Additional Visualization
K.1 Representation Distribution among Clients
We visualize more representation distributions learned by different clients; Figure 10 displays the results. Compared with the baseline FedAvg, our PmcmFL reduces the heterogeneity of the feature space across clients (i.e., mitigates client drift).

K.2 Representation Distribution among Classes
We visualize more representation distributions learned by the global model across different classes; Figure 11 displays the results. Compared with the baseline, our PmcmFL clusters representations by class more distinctly.
