
Contrastive Domain Adaptation for Early Misinformation Detection: A Case Study on COVID-19

Zhenrui Yue, Huimin Zeng, Ziyi Kou, Lanyu Shang, and Dong Wang (University of Illinois Urbana-Champaign, USA)
(2022)
Abstract.

Despite recent progress in improving the performance of misinformation detection systems, classifying misinformation in an unseen domain remains an elusive challenge. To address this issue, a common approach is to introduce a domain critic and encourage domain-invariant input features. However, early misinformation often demonstrates both conditional and label shifts against existing misinformation data (e.g., class imbalance in COVID-19 datasets), rendering such methods less effective for detecting early misinformation. In this paper, we propose contrastive adaptation network for early misinformation detection (CANMD). Specifically, we leverage pseudo labeling to generate high-confidence target examples for joint training with source data. We additionally design a label correction component to estimate and correct the label shifts (i.e., class priors) between the source and target domains. Moreover, a contrastive adaptation loss is integrated in the objective function to reduce the intra-class discrepancy and enlarge the inter-class discrepancy. As such, the adapted model learns corrected class priors and an invariant conditional distribution across both domains for improved estimation of the target data distribution. To demonstrate the effectiveness of the proposed CANMD, we study the case of COVID-19 early misinformation detection and perform extensive experiments using multiple real-world datasets. The results suggest that CANMD can effectively adapt misinformation detection systems to the unseen COVID-19 target domain with significant improvements compared to the state-of-the-art baselines.

misinformation detection, domain adaptation
journalyear: 2022; copyright: acmlicensed; conference: Proceedings of the 31st ACM International Conference on Information and Knowledge Management, October 17–21, 2022, Atlanta, GA, USA; booktitle: Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM ’22), October 17–21, 2022, Atlanta, GA, USA; price: 15.00; doi: 10.1145/3511808.3557263; isbn: 978-1-4503-9236-5/22/10; ccs: Information systems → Data mining; ccs: Computing methodologies → Natural language processing

1. Introduction

Recent progress in developing natural language processing (NLP) models has led to significant improvements in text classification performance (Devlin et al., 2019; Liu et al., 2019). Nevertheless, misinformation detection remains one of the most elusive challenges in text classification, especially when the model is trained on a source domain but deployed to classify misinformation in a different target domain (Du et al., 2020; Zou et al., 2021; Zeng et al., 2022). A recent example can be found in the early spreading stage of the COVID-19 pandemic, where misinformation detection systems often fail to distinguish valuable information from large amounts of rumors and misleading posts on social media platforms, resulting in potential threats to public health and interest (Li et al., 2021).

Figure 1. Labeled source data and unlabeled target data are accessible for domain adaptation. Upon deployment, the model predicts misinformation in the target domain.

In this paper, we study the domain adaptation problem in early misinformation detection using COVID-19 data. Widespread COVID misinformation poses a major threat to the online ecosystem and potentially endangers public health. For example, Roozenbeek et al. (2020) demonstrate a strong correlation between COVID-19 misinformation and noncompliance with health guidance, as well as a reduced likelihood of receiving vaccines. To identify potential misinformation on social media platforms, one possible solution is to train supervised models via external knowledge or crowdsourcing (Pan et al., 2018; Shu et al., 2019a; Müller et al., 2020; Roitero et al., 2020; Hossain et al., 2020; Medina Serrano et al., 2020; Rashid and Wang, 2021; Kou et al., 2021; Shang et al., 2021; Kou et al., 2022a; Shang et al., 2022a, b; Kou et al., 2022b). Yet such learning approaches require large amounts of annotations or extensive computational resources, making existing methods ineffective for early misinformation detection in COVID-19, where label-rich misinformation data and domain knowledge are inaccessible for training (Li et al., 2021).

As a solution, domain adaptation methods can be used to adapt an existing misinformation detection system trained on the label-rich source domain to a label-scarce target domain (Du et al., 2020; Zou et al., 2021; Li et al., 2021). Figure 1 gives an overview of domain adaptation in misinformation detection. One common approach is to estimate the domain discrepancy (e.g., via an additional discriminator) and impose a penalty when source and target features are distinguishable. As such, the model learns domain-invariant features, thereby improving the classification performance in the target domain (Ganin et al., 2016; Tzeng et al., 2017; Kang et al., 2019). However, early misinformation of COVID-19 often demonstrates both conditional shift (i.e., $p(\bm{x}|y)\neq q(\bm{x}|y)$) and label shift (i.e., $p(y)\neq q(y)$) compared to existing misinformation datasets. Consequently, such large discrepancies often limit the adaptation performance or even lead to negative transfer using the aforementioned methods (see Section 4), rendering the existing methods less effective for early misinformation detection in COVID-19.

To improve the adaptation performance under large domain discrepancies, we propose contrastive adaptation network for early misinformation detection (CANMD). Specifically, we leverage a pretrained misinformation detection system to generate labeled target examples via pseudo labeling. To correct the label shift between pseudo labels and the target output distributions, we design a learnable rescaling component to estimate and correct the output probabilities of the pseudo labels. Then, joint training is performed by sampling source and target examples using a class-aware sampling strategy to resemble the target data distribution. Here, we propose a contrastive adaptation loss to reduce the discrepancy for intra-class data examples and enlarge the discrepancy for inter-class data examples. As such, the adapted misinformation detection system learns an invariant conditional distribution. Combined with the corrected class priors, we improve the adaptation performance by better approximating the joint distribution in the target domain. To demonstrate the effectiveness of the proposed CANMD, we study the case of COVID-19 early misinformation detection and perform extensive experiments with real-world datasets. In particular, we adopt five source misinformation datasets published before the COVID outbreak. We additionally adopt three COVID misinformation datasets from 2020 to 2022 as target datasets (i.e., CoAID, Constraint and ANTiVax (Cui and Lee, 2020; Patwa et al., 2021; Hayawi et al., 2022)) to evaluate the proposed CANMD. Our results suggest that CANMD effectively adapts misinformation detection systems to the COVID-19 target domain and consistently outperforms state-of-the-art baselines with an average improvement of 11.5% in balanced accuracy (BA).

We summarize the main contributions of our work (our implementation is publicly available at https://github.com/Yueeeeeeee/CANMD):

(1) We propose a learnable rescaling component in CANMD to correct the pseudo labels for improved estimation of the class priors. Additionally, we design a class-aware sampling strategy to resemble the target domain data distribution.

(2) CANMD learns a domain-invariant conditional distribution by reducing the intra-class discrepancy and enlarging the inter-class discrepancy via contrastive learning. Combined with the corrected priors, we improve the adaptation by approximating the joint distribution of the target data.

(3) To the best of our knowledge, CANMD is the first work that adopts pseudo labeling with label correction and contrastive domain adaptation for adapting misinformation detection systems to the unseen COVID-19 domain.

(4) We demonstrate the effectiveness of CANMD on multiple source and COVID-19 target datasets, where the proposed CANMD consistently outperforms the state-of-the-art baselines by a significant margin.

2. Related Work

2.1. Misinformation Detection

Existing methods in misinformation detection can be mainly divided into three coarse categories: (1) content-based misinformation detection: content-based models perform classification upon statements (i.e., claims) or multimodal input. For example, language models are used for misinformation detection based on linguistic features or structure-related properties (Wang, 2017b; Karimi and Tang, 2019; Das et al., 2021). Multimodal architectures are introduced to learn features from both textual and visual information for improved performance (Jin et al., 2017; Wang et al., 2018; Khattar et al., 2019); (2) social context-aware misinformation detection: Jin et al. (2016) leverage online interactions to evaluate the credibility of post contents. The propagation paths in the spreading stage can be used to detect fake news on social media platforms (Liu and Wu, 2018). Other context, such as user dynamics and publishers, is introduced to enhance social context-aware misinformation detection (Guo et al., 2018; Shu et al., 2019b); (3) knowledge-guided misinformation detection: Vo and Lee (2020) and Popat et al. (2018) propose to search for external evidence to derive informative features for content verification. Multiple evidence pieces are used to build graphs for fine-grained fact verification (Liu et al., 2020). Knowledge graphs introduce additional information to provide useful explanations for the detection results (Cui et al., 2020; Kou et al., 2022a, b). Nevertheless, the generalization and adaptability of such misinformation detection systems are not well studied. Therefore, we focus on domain adaptation of content-based language models for misinformation detection.

2.2. Domain Adaptation in Computer Vision

Domain adaptation methods are primarily studied on the image classification task (Long et al., 2013; Tzeng et al., 2014; Long et al., 2015, 2017; Kang et al., 2019). Existing approaches minimize the representation discrepancy between the source and target domains to encourage domain-invariant features, thereby improving the model performance on shifted input distributions (i.e., covariate shifts) (Long et al., 2013; Tzeng et al., 2014). Early domain adaptation methods utilize an additional distance term in the training objective to minimize the distance between the input marginal distributions (Tzeng et al., 2014; Long et al., 2015). Long et al. (2013) and Long et al. (2017) adapt source-trained models by optimizing the distance between the joint distributions across both domains. Similarly, a feature discriminator can be introduced to perform minimax optimization, where the regularization on domain generalization can be implicitly imposed when source and target features are distinguishable by the discriminator (Ganin et al., 2016; Tzeng et al., 2017). Class-aware and contrastive domain adaptation are proposed for fine-grained domain alignment; such methods adapt the conditional distributions by regularizing the inter-class and intra-class distances via an additional domain critic (Pei et al., 2018; Kang et al., 2019). Recently, calibration and label shifts have been studied for domain adaptation problems where the label distribution changes between the domains (Guo et al., 2017; Lipton et al., 2018; Azizzadenesheli et al., 2019; Alexandari et al., 2020). To the best of our knowledge, domain adaptation that considers both label correction and conditional distribution adaptation has not been studied in the current literature. Yet label shift is often observed in real-world scenarios (e.g., misinformation detection). As such, we propose an adaptation method that corrects the label shift and optimizes the conditional distribution to improve the out-of-domain performance.

2.3. Domain Adaptation in Text Classification

Currently, few methods are tailored for domain adaptation in misinformation detection; therefore, we review the related work in both text classification and misinformation detection. A domain-distinguishing task is proposed to post-train models for domain awareness and improved adversarial adaptation performance (Du et al., 2020). Energy-based adaptation is proposed via an additional autoencoder that generates source-like target features (Zou et al., 2021). Xu et al. (2019), Zhang et al. (2020) and Zhou et al. (2022) utilize domain adversarial training and multi-task learning to promote domain-invariant features in multimodal misinformation detection. Pseudo labeling is used to integrate domain knowledge via a weak labeling function for multi-source domain adaptation (Li et al., 2021). However, domain adaptation is not well studied for content-based misinformation detection, let alone the adaptation for COVID-related early misinformation. We develop CANMD, a domain adaptation method tailored for misinformation detection. Our method significantly improves the adaptation performance in COVID misinformation detection by: (1) correcting the label shift between the source and target data distributions; and (2) exploiting knowledge from the source domain by minimizing the distance between the source and target conditional distributions.

Figure 2. The proposed CANMD framework adapts the joint distribution in two stages: (1) pseudo labeling and label correction, in which we approximate the target output distribution with pseudo labels generated by the pretrained model and corrected by the proposed label correction component; and (2) contrastive adaptation, where we reduce intra-class discrepancy and enlarge inter-class discrepancy to minimize the distance between the conditional distributions from the source and target domain.

3. Methodology

3.1. Setup

Data: Given labeled source data and unlabeled target data, our setting aims at improving the misinformation detection performance in an unseen target domain. We define misinformation detection as a binary text classification task. For this purpose, labeled source data and unlabeled target data are available; we denote the source data with $\bm{\mathcal{X}}_{s}$ and the target data with $\bm{\mathcal{X}}_{t}$. Formally, each input example is defined by a tuple consisting of an input text $\bm{x}$ and a label $y$ with $y\in\{0,1\}$. Input $\bm{x}$ is considered as misinformation (i.e., $y=0$) if it contains primarily or entirely false information. Otherwise, the label is considered as non-misleading (i.e., $y=1$):

  • Source data: Labeled source data from $\bm{\mathcal{X}}_{s}$. Each example $(\bm{x}_{s},y_{s})\in\bm{\mathcal{X}}_{s}$ is defined by an input text $\bm{x}_{s}$ and a label $y_{s}$. We denote the joint distribution of $(\bm{x}_{s},y_{s})$ in $\bm{\mathcal{X}}_{s}$ with $\mathcal{P}$.

  • Target data: Unlabeled target data from $\bm{\mathcal{X}}_{t}$. For a target example $(\bm{x}_{t},y_{t})\in\bm{\mathcal{X}}_{t}$, we only have access to the input text $\bm{x}_{t}$. The ground truth label $y_{t}$ is not given for training. Similarly, we denote the joint distribution in $\bm{\mathcal{X}}_{t}$ with $\mathcal{Q}$.

Model: The classification model can be represented with a function $\bm{f}$. $\bm{f}$ takes text $\bm{x}$ as input and yields output logits. The output probability distribution is obtained by using the softmax function $\sigma$. Ideally, the model $\bm{f}$ can correctly predict the ground truth $y$ with the maximum logit value, namely $y=\arg\max\bm{f}(\bm{x})$.

Objective: The objective is to adapt a classifier $\bm{f}$ trained on the source distribution $\mathcal{P}$ to the target distribution $\mathcal{Q}$, such that the performance is maximized on the target data $\bm{\mathcal{X}}_{t}$. In other words, we minimize the negative log likelihood (NLL) loss between the model output distribution $\sigma(\bm{f}(\bm{x}_{t}))$ and the ground truth $y_{t}$ over $\bm{\mathcal{X}}_{t}$:

(1) \min_{\bm{f}}\mathbb{E}_{(\bm{x}_{t},y_{t})\sim\bm{\mathcal{X}}_{t}}\mathcal{L}_{\mathrm{nll}}(\sigma(\bm{f}(\bm{x}_{t})),y_{t}).
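To make the setup concrete, the following is a minimal sketch of the classifier $\bm{f}$ and the NLL objective in Equation 1, assuming PyTorch and the HuggingFace transformers library with a RoBERTa backbone (as used in Section 4); the function and variable names are illustrative and not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

# f: a binary classifier that maps input text to two output logits
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def nll_loss(texts, labels):
    """NLL loss between softmax(f(x)) and the ground-truth labels (Equation 1)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    logits = model(**batch).logits                  # f(x), shape (batch_size, 2)
    return F.cross_entropy(logits, torch.tensor(labels))
```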

3.2. Proposed Approach

To adapt the trained classifier to the target domain, we propose the contrastive adaptation network for early misinformation detection (CANMD) by approximating the target joint distribution. Unlike previous works with the assumption of covariate shift (Long et al., 2013; Tzeng et al., 2014; Long et al., 2015; Ganin et al., 2016), we consider label shift (i.e., $p(y)\neq q(y)$) and conditional shift (i.e., $p(\bm{x}|y)\neq q(\bm{x}|y)$) between the source domain and target domain. Since $p(\bm{x},y)=p(\bm{x}|y)p(y)$ (similarly, $q(\bm{x},y)=q(\bm{x}|y)q(y)$), our solution to adapt the misinformation detection model $\bm{f}$ to the target domain is two-fold: (1) label shift estimation and correction: we generate pseudo labels for target examples in $\bm{\mathcal{X}}_{t}$ and rescale the labels to correct the shift in the joint distribution caused by a shift in label proportion (Lipton et al., 2018; Azizzadenesheli et al., 2019; Alexandari et al., 2020). In other words, we estimate and correct $q(y)$. (2) Conditional distribution adaptation: using the pseudo labels generated from the previous step, we minimize the distance between the conditional distributions $p(\bm{x}|y)$ and $q(\bm{x}|y)$ with contrastive domain adaptation. As we reduce the distance between both conditional distributions (i.e., $p(\bm{x}|y)\approx q(\bm{x}|y)$), we improve the estimation of $q(\bm{x},y)=p(\bm{x}|y)q(y)$ using the corrected $q(y)$.

The reason for label shift correction lies in the rapidly changing dynamics of COVID-19 misinformation. For instance, the proportion of misinformation on social media platforms changes rapidly in different stages (i.e., label shift) (Cui and Lee, 2020; Patwa et al., 2021; Hayawi et al., 2022). Additionally, conventional misinformation detection systems often fail to distinguish COVID misinformation due to large domain discrepancies, as we demonstrate in Section 4. Therefore, the proposed CANMD comprises two stages: (1) pseudo labeling and label correction, where we perform pseudo labeling and correct the label distribution $q(y)$ in the target domain; and (2) contrastive adaptation, in which we reduce the intra-class discrepancy and enlarge the inter-class discrepancy among examples from both domains. By leveraging pseudo labels from the first stage, we estimate and minimize the discrepancy between the conditional distributions $p(\bm{x}|y)$ and $q(\bm{x}|y)$. CANMD is illustrated in Figure 2; unlike existing methods, we propose to correct the label shift and adapt the conditional distributions instead of solely reducing the feature discrepancy.

3.3. Pseudo Labeling and Label Correction

Given access to labeled source data $\bm{\mathcal{X}}_{s}$, we first pretrain the misinformation classification system. In practice, we can skip this step if the model is already trained on the source data. Once the source-trained model is ready, we generate the pseudo label $\hat{y}_{t}$ of an unlabeled target example $\bm{x}_{t}$ with $\arg\max\bm{f}(\bm{x}_{t})$.

However, the generated pseudo labels are noisy, which often causes deterioration of the domain adaptation performance, especially when there exists a large domain gap (Yue et al., 2021). Additionally, different label proportions between both domains (i.e., label shift) can result in biased pseudo labels and potential negative transfer (Alexandari et al., 2020). Inspired by calibration methods (Guo et al., 2017), we propose a label correction component to correct the distribution shift between the pseudo labels and the target output distribution. Specifically, for input $\bm{x}$, we design vector rescaling with learnable parameters $\bm{w}$ and $\bm{b}$ and compute the corrected output as follows:

(2) \sigma(\bm{w}\odot\bm{f}(\bm{x})+\bm{b}),

where $\odot$ represents the element-wise product and $\sigma$ represents the softmax function. In rare cases when the source and target output distributions result in large bias values in $\bm{b}$, we discard the bias to avoid generating a constant output probability (i.e., $\sigma(\bm{w}\odot\bm{f}(\bm{x})+\bm{b})\approx\sigma(\bm{b})$ when $\bm{b}$ consists of values much greater than $\bm{w}\odot\bm{f}(\bm{x})$). Thus the correction method reduces to:

(3) \sigma(\bm{w}\odot\bm{f}(\bm{x})).

To obtain the optimal $\bm{w}$ and $\bm{b}$, we optimize the parameters $\bm{w}$ and $\bm{b}$ in training to rescale the output probabilities and fit the distribution $q(y)$. Given the pretrained misinformation detection function $\bm{f}$ and input pair $(\bm{x}_{t},y_{t})$, we follow (Guo et al., 2017) and minimize the NLL loss w.r.t. the parameters using gradient descent, i.e.:

(4) \min_{\bm{w},\bm{b}}\mathbb{E}_{(\bm{x}_{t},y_{t})\sim\bm{\mathcal{X}}_{t}}\mathcal{L}_{\mathrm{nll}}(\sigma(\bm{w}\odot\bm{f}(\bm{x}_{t})+\bm{b}),y_{t}),

where $\bm{b}$ is discarded in the case of constant output probability or when the optimization fails. Similar to (Guo et al., 2017; Lipton et al., 2018; Alexandari et al., 2020), the validation set is used to optimize the vector rescaling parameters.

In sum, the pseudo labeling and label correction stage via the label correction component can be formulated as:

(5) \hat{y}_{t}=\arg\max\sigma(\bm{w}\odot\bm{f}(\bm{x})+\bm{b}).

After label correction, we filter the pseudo labels to select a subset of the target examples for contrastive adaptation. The pseudo labels are filtered according to the label confidence (i.e., $\max\sigma(\bm{w}\odot\bm{f}(\bm{x})+\bm{b})$); we preserve a target example if its confidence value is above the confidence threshold $\tau$. In the following, we denote the filtered target data with pseudo labels as $\bm{\mathcal{X}}^{\prime}_{t}$. By introducing the label correction component, we reduce the bias from $\bm{f}$ when the source and target label distributions shift. For example, COVID misinformation in the CoAID dataset consists of less than $10\%$ false claims (Cui and Lee, 2020). If $\bm{f}$ is pretrained on a balanced dataset, the pseudo labels would demonstrate a similar label distribution regardless of the label distribution in CoAID and result in noisy labels. CANMD rescales the model output using learnable parameters to adjust the pseudo labels, followed by confidence thresholding that selects high-confidence examples from a target-similar distribution.
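The following is a minimal sketch of this label correction stage (Equations 2 to 5), assuming PyTorch; `val_logits`/`val_labels` denote model outputs and labels on a held-out validation set used to fit the rescaling parameters, `target_logits` denotes the model outputs on unlabeled target examples, and the optimizer settings are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn.functional as F

def fit_rescaling(val_logits, val_labels, steps=500, lr=0.01, use_bias=True):
    """Learn w and b (Equation 4) so that softmax(w * f(x) + b) fits the label distribution."""
    w = torch.ones(val_logits.size(1), requires_grad=True)
    b = torch.zeros(val_logits.size(1), requires_grad=True)
    optimizer = torch.optim.Adam([w, b] if use_bias else [w], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(w * val_logits + b, val_labels)   # NLL w.r.t. w, b only
        loss.backward()
        optimizer.step()
    return w.detach(), b.detach()

def corrected_pseudo_labels(target_logits, w, b, tau=0.7):
    """Apply Equation 5 and keep only examples whose confidence exceeds tau."""
    probs = torch.softmax(w * target_logits + b, dim=-1)
    confidence, labels = probs.max(dim=-1)
    keep = confidence >= tau
    return labels[keep], keep                                    # pseudo labels and mask
```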

3.4. Contrastive Adaptation

We now describe the contrastive adaptation stage of CANMD. We perform mini-batch training and sample identical numbers of input pairs $(\bm{x},y)$ from both the source data $\bm{\mathcal{X}}_{s}$ and the pseudo-labeled target data $\bm{\mathcal{X}}^{\prime}_{t}$. To approximate the target data distribution and efficiently estimate the discrepancy, we adopt a class-aware sampling strategy to guarantee the same number of examples within each class. Specifically, a target batch is first sampled from the target data $\bm{\mathcal{X}}^{\prime}_{t}$. By counting the examples in the target batch, we sample from the source data $\bm{\mathcal{X}}_{s}$ the same number of examples from each class respectively. Consequently, we build mini-batches that comprise the same classes and identical numbers of examples in both domains. In this way, we resemble the target data distribution and incorporate the source domain knowledge during adaptation (Kang et al., 2019).
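As an illustration, the class-aware sampling step could be sketched as follows, assuming source examples are pre-grouped into per-class pools; the names are hypothetical and only serve to make the sampling rule concrete.

```python
import random
from collections import Counter

def class_aware_source_batch(source_by_class, target_labels):
    """Draw a source batch whose per-class counts match those of the target batch.

    source_by_class: dict mapping class id -> list of (text, label) source examples
    target_labels:   labels of the sampled (pseudo-labeled) target batch
    """
    counts = Counter(int(y) for y in target_labels)
    batch = []
    for cls, n in counts.items():
        batch.extend(random.sample(source_by_class[cls], n))
    random.shuffle(batch)
    return batch
```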

To estimate the domain discrepancy, we revisit the maximum mean discrepancy (MMD) distance. MMD estimates the distance between two distributions using samples drawn from them (Gretton et al., 2012). Given $\mathcal{P}$ and $\mathcal{Q}$, the MMD distance $\mathcal{D}$ is defined as:

(6) \mathcal{D}=\sup_{f\in\mathcal{H}}\big(\mathbb{E}_{x\sim\mathcal{P}}[f(x)]-\mathbb{E}_{y\sim\mathcal{Q}}[f(y)]\big),

where $f$ is a function (kernel) in the reproducing kernel Hilbert space $\mathcal{H}$. By introducing the kernel trick and empirical kernel mean embeddings (Long et al., 2015; Yue et al., 2021), we further simplify the estimation of the squared MMD distance between $\bm{\mathcal{X}}_{s}$ and $\bm{\mathcal{X}}^{\prime}_{t}$ as follows:

(7) \mathcal{D}^{\mathrm{MMD}}=\frac{1}{|\bm{\mathcal{X}}_{s}||\bm{\mathcal{X}}_{s}|}\sum_{i=1}^{|\bm{\mathcal{X}}_{s}|}\sum_{j=1}^{|\bm{\mathcal{X}}_{s}|}k(\phi(\bm{x}_{s}^{(i)}),\phi(\bm{x}_{s}^{(j)}))+\frac{1}{|\bm{\mathcal{X}}^{\prime}_{t}||\bm{\mathcal{X}}^{\prime}_{t}|}\sum_{i=1}^{|\bm{\mathcal{X}}^{\prime}_{t}|}\sum_{j=1}^{|\bm{\mathcal{X}}^{\prime}_{t}|}k(\phi(\bm{x}_{t}^{\prime(i)}),\phi(\bm{x}_{t}^{\prime(j)}))-\frac{2}{|\bm{\mathcal{X}}_{s}||\bm{\mathcal{X}}^{\prime}_{t}|}\sum_{i=1}^{|\bm{\mathcal{X}}_{s}|}\sum_{j=1}^{|\bm{\mathcal{X}}^{\prime}_{t}|}k(\phi(\bm{x}_{s}^{(i)}),\phi(\bm{x}_{t}^{\prime(j)})),

where we adopt the output at the [CLS] position of the transformer model as $\phi$. Here, the expectation is simplified by using mean embeddings of the drawn samples, and $k$ refers to the Gaussian kernel, i.e., $k(\bm{x}_{i},\bm{x}_{j})=\mathrm{exp}(-\frac{\|\bm{x}_{i}-\bm{x}_{j}\|^{2}}{\gamma})$. The estimated MMD distance is used to compute a contrastive adaptation loss. In particular, we leverage Equation 7 to regularize the distances among examples within the same class and examples from different classes. For this purpose, we define the class-aware MMD as:

(8) \mathcal{D}_{c_{1}c_{2}}^{\mathrm{MMD}}=\sum_{i=1}^{|\bm{\mathcal{X}}_{s}|}\sum_{j=1}^{|\bm{\mathcal{X}}_{s}|}\frac{\mathbbm{1}_{c_{1}c_{2}}(y_{s}^{(i)},y_{s}^{(j)})k(\phi(\bm{x}_{s}^{(i)}),\phi(\bm{x}_{s}^{(j)}))}{\sum_{l=1}^{|\bm{\mathcal{X}}_{s}|}\sum_{m=1}^{|\bm{\mathcal{X}}_{s}|}\mathbbm{1}_{c_{1}c_{2}}(y_{s}^{(l)},y_{s}^{(m)})}+\sum_{i=1}^{|\bm{\mathcal{X}}^{\prime}_{t}|}\sum_{j=1}^{|\bm{\mathcal{X}}^{\prime}_{t}|}\frac{\mathbbm{1}_{c_{1}c_{2}}(y_{t}^{\prime(i)},y_{t}^{\prime(j)})k(\phi(\bm{x}_{t}^{\prime(i)}),\phi(\bm{x}_{t}^{\prime(j)}))}{\sum_{l=1}^{|\bm{\mathcal{X}}^{\prime}_{t}|}\sum_{m=1}^{|\bm{\mathcal{X}}^{\prime}_{t}|}\mathbbm{1}_{c_{1}c_{2}}(y_{t}^{\prime(l)},y_{t}^{\prime(m)})}-2\sum_{i=1}^{|\bm{\mathcal{X}}_{s}|}\sum_{j=1}^{|\bm{\mathcal{X}}^{\prime}_{t}|}\frac{\mathbbm{1}_{c_{1}c_{2}}(y_{s}^{(i)},y_{t}^{\prime(j)})k(\phi(\bm{x}_{s}^{(i)}),\phi(\bm{x}_{t}^{\prime(j)}))}{\sum_{l=1}^{|\bm{\mathcal{X}}_{s}|}\sum_{m=1}^{|\bm{\mathcal{X}}^{\prime}_{t}|}\mathbbm{1}_{c_{1}c_{2}}(y_{s}^{(l)},y_{t}^{\prime(m)})},

with $\mathbbm{1}_{c_{1}c_{2}}(y_{1},y_{2})=\begin{cases}1,&\mathrm{if}\ y_{1}=c_{1},y_{2}=c_{2},\\0,&\mathrm{else},\end{cases}$ and $c_{1}$ and $c_{2}$ representing two classes. If $c_{1}$ and $c_{2}$ are the same class, then $\mathcal{D}_{c_{1}c_{2}}^{\mathrm{MMD}}$ estimates the intra-class discrepancy between the source and target domain. If $c_{1}$ and $c_{2}$ represent two different classes, then $\mathcal{D}_{c_{1}c_{2}}^{\mathrm{MMD}}$ evaluates the inter-class discrepancy between both domains.

Using the class-aware MMD distance, we define the contrastive loss $\mathcal{L}_{\mathrm{contrastive}}$ as a part of the optimization objective:

(9) \mathcal{L}_{\mathrm{contrastive}}=\mathcal{D}_{00}^{\mathrm{MMD}}+\mathcal{D}_{11}^{\mathrm{MMD}}-\frac{1}{2}(\mathcal{D}_{01}^{\mathrm{MMD}}+\mathcal{D}_{10}^{\mathrm{MMD}}).

As we consider binary classification in misinformation detection, $\mathcal{D}_{00}^{\mathrm{MMD}}$ and $\mathcal{D}_{11}^{\mathrm{MMD}}$ refer to the intra-class discrepancies of misinformation examples and credible examples in the representation space. $\mathcal{D}_{01}^{\mathrm{MMD}}+\mathcal{D}_{10}^{\mathrm{MMD}}$ represents the inter-class discrepancy; the distance between different classes is enlarged to avoid misclassification (by taking the negative value). Based on the combination of intra-class and inter-class discrepancy, the misinformation detection system $\bm{f}$ learns a feature representation that separates the input from different classes and confuses input from the same class. In other words, the loss pulls together same-class samples and pushes apart samples from different classes. As such, we minimize the distance between $p(\bm{x}|y)$ and $q(\bm{x}|y)$ based on the pseudo labels generated in the previous stage; we illustrate the adaptation process in Figure 3. Existing methods (left) that adapt marginal distributions can cause performance drops when there exist label shifts between the source and target domains (see green triangles). As opposed to such methods, CANMD (right) corrects the label shift and samples from the target distribution to adapt the conditional distributions, which yields an improved decision boundary for the target domain.
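For concreteness, a minimal sketch of the class-aware MMD (Equation 8) and the contrastive loss (Equation 9) with the Gaussian kernel could look as follows, assuming `phi_s`/`phi_t` are the [CLS] representations of a source and a target mini-batch, `y_s`/`y_t` are the corresponding labels and pseudo labels, and the bandwidth `gamma` is an illustrative choice.

```python
import torch

def gaussian_kernel(a, b, gamma=1.0):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / gamma), computed for all pairs
    return torch.exp(-torch.cdist(a, b) ** 2 / gamma)

def class_mmd(phi_s, y_s, phi_t, y_t, c1, c2, gamma=1.0):
    """Class-aware squared MMD D_{c1 c2} (Equation 8).

    Assumes both classes are present in both batches, which class-aware sampling guarantees.
    """
    s1, s2 = phi_s[y_s == c1], phi_s[y_s == c2]
    t1, t2 = phi_t[y_t == c1], phi_t[y_t == c2]
    return (gaussian_kernel(s1, s2, gamma).mean()
            + gaussian_kernel(t1, t2, gamma).mean()
            - 2 * gaussian_kernel(s1, t2, gamma).mean())

def contrastive_loss(phi_s, y_s, phi_t, y_t, gamma=1.0):
    """L_contrastive = D_00 + D_11 - 0.5 * (D_01 + D_10) (Equation 9)."""
    d00 = class_mmd(phi_s, y_s, phi_t, y_t, 0, 0, gamma)
    d11 = class_mmd(phi_s, y_s, phi_t, y_t, 1, 1, gamma)
    d01 = class_mmd(phi_s, y_s, phi_t, y_t, 0, 1, gamma)
    d10 = class_mmd(phi_s, y_s, phi_t, y_t, 1, 0, gamma)
    return d00 + d11 - 0.5 * (d01 + d10)
```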

Figure 3. Comparison between adapting marginal distributions (left) and conditional distributions (right). Combined with label correction and class-aware sampling, we improve the estimation of the joint distribution in the target domain.

By combining both the NLL loss for classification and the above contrastive loss, we formulate the overall optimization objective in our contrastive adaptation stage as follows:

(10) \mathcal{L}=\mathcal{L}_{\mathrm{nll}}+\lambda\mathcal{L}_{\mathrm{contrastive}},

where $\mathcal{L}_{\mathrm{nll}}$ incorporates knowledge from both the source and target domains (via pseudo labels), while $\mathcal{L}_{\mathrm{contrastive}}$ adapts the conditional distribution to the target domain. $\lambda$ can be chosen empirically.

Input: Source examples $\bm{\mathcal{X}}_{s}$, unlabeled target examples $\bm{\mathcal{X}}_{t}$, pretrained model $\bm{f}$, correction parameters $\bm{w}$ and $\bm{b}$;
Input: Number of iterations $N$, batch size $B$, confidence threshold $\tau$, scaling factor $\lambda$;
// 1. Pseudo Labeling and Label Correction
Compute the target examples with pseudo labels $\bm{\mathcal{X}}^{\prime}_{t}$;
Optimize the parameters $\bm{w}$ and $\bm{b}$ with Equation 4;
Correct the pseudo labels with Equation 5;
Filter $\bm{\mathcal{X}}^{\prime}_{t}$ with minimum confidence threshold $\tau$;
// 2. Contrastive Adaptation
for $i\in\{1,2,\dots,N\}$ do
    Sample a target batch $\{(\bm{x}_{t}^{\prime(1)},y_{t}^{\prime(1)}),\dots,(\bm{x}_{t}^{\prime(B)},y_{t}^{\prime(B)})\}$;
    Perform class-aware sampling to build a source batch $\{(\bm{x}_{s}^{(1)},y_{s}^{(1)}),\dots,(\bm{x}_{s}^{(B)},y_{s}^{(B)})\}$;
    Compute the loss with Equation 10;
    Perform backpropagation and update $\bm{f}$;
end for
Algorithm 1. Optimization loop of the proposed CANMD
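A condensed sketch of the contrastive adaptation loop (stage 2 of Algorithm 1) is given below, assuming PyTorch, the helper functions from the earlier sketches (class_aware_source_batch and contrastive_loss), and a hypothetical forward function that returns both the logits and the [CLS] features for a batch of texts; hyperparameter values are illustrative, not the authors' exact settings.

```python
import torch

def adapt_one_epoch(model, forward, source_by_class, target_loader, lam=0.01, lr=1e-5):
    """One epoch of joint training with the NLL and contrastive losses (Equation 10)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    nll = torch.nn.CrossEntropyLoss()
    for x_t, y_t in target_loader:                          # pseudo-labeled target batch
        src = class_aware_source_batch(source_by_class, y_t)
        x_s = [x for x, _ in src]
        y_s = torch.tensor([y for _, y in src])
        logits_s, phi_s = forward(x_s)                      # f(x) and [CLS] features
        logits_t, phi_t = forward(x_t)
        loss = nll(logits_s, y_s) + nll(logits_t, y_t) \
               + lam * contrastive_loss(phi_s, y_s, phi_t, y_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```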

3.5. Overall Framework

The overall framework of CANMD can be found in Figure 2. In the pseudo labeling and label correction stage, pseudo labels are first generated for all target examples using the pretrained misinformation detection system $\bm{f}$. Next, we optimize the learnable correction parameters and correct the generated pseudo labels; we also filter the target data with a minimum confidence threshold $\tau$ (chosen empirically). In the contrastive adaptation stage, we perform class-aware sampling to resemble the target data distribution. Then, we compute a contrastive adaptation loss to reduce the intra-class discrepancy and enlarge the inter-class discrepancy. For each epoch, the optimization process is described in Algorithm 1.

Different from previous domain adaptation work in text classification (Du et al., 2020; Li et al., 2021; Zou et al., 2021), we discard domain-adversarial training for efficient training and approximate the joint distribution of the target domain. In particular, we propose a label correction component to correct the estimation of $q(y)$ during pseudo labeling. This corresponds to the correction of pseudo labels in COVID misinformation, where we remove the potential bias in the pretrained model to reduce noisy labels in adaptation. Next, the model is adapted by minimizing the distance between the conditional distributions via the MMD distance, in which the model learns domain-confusing yet class-separating features. Put simply, we transfer the knowledge from the source misinformation data to the COVID domain via fine-grained alignment in the feature space. By estimating $q(y)$ and adapting $p(\bm{x}|y)\approx q(\bm{x}|y)$, we improve the estimation of $q(\bm{x},y)$ with $q(\bm{x},y)=p(\bm{x}|y)q(y)$, thereby improving the adaptation performance for misinformation detection in COVID-19.

4. Experiments

Datasets Negative Positive Avg. Len Type
Source Datasets
FEVER 29.59% 70.41% 9.4 Claim
GettingReal 8.79% 91.21% 738.9 News
GossipCop 24.19% 75.81% 712.9 News
LIAR 44.23% 55.77% 20.2 Claim
PHEME 33.99% 66.01% 21.5 Social Media
Target Datasets
CoAID 9.72% 90.28% 54.0 News / Claim
Constraint 47.66% 52.34% 32.7 Social Media
ANTiVax 38.15% 61.85% 26.2 Social Media
Table 1. Dataset details.

4.1. Experimental Design

Datasets: In our experiments, we adopt five source datasets released before the COVID outbreak and three COVID misinformation datasets to validate the proposed CANMD. For source datasets, we use FEVER (Thorne et al., 2018), GettingReal (Risdal, 2016), GossipCop (Shu et al., 2020), LIAR (Wang, 2017a) and PHEME (Buntain and Golbeck, 2017). For target datasets, we adopt CoAID (Cui and Lee, 2020), Constraint (Patwa et al., 2021) and ANTiVax (Hayawi et al., 2022), which were released in 2020, 2021 and 2022 respectively. Dataset details are provided in Table 1, where Negative and Positive are the proportions of misinformation and valid information in each dataset, Avg. Len represents the average token length of the text, and Type denotes the source type of the text (e.g., news, claim or social media).

Model: Following (Li et al., 2021; Zou et al., 2021; Liu et al., 2019), we select the commonly used RoBERTa as the misinformation detection model to perform the proposed CANMD. RoBERTa is a transformer-based language model with a modified pretraining method that improves the performance on various NLP downstream tasks.

Baselines: The pretrained misinformation detection model is first directly evaluated on the target test data as a naïve baseline. We additionally select two state-of-the-art baselines: DAAT and EADA (Du et al., 2020; Zou et al., 2021). DAAT leverages post-training on BERT to improve domain-adversarial adaptation. EADA performs energy-based adversarial domain adaptation with an additional autoencoder.

Evaluation: Similar to (Kou et al., 2022a; Li et al., 2021; Zou et al., 2021), we split the data into training, validation, and test sets with the ratio of 7:1:2. We adopt accuracy and F1 score to perform evaluation. We additionally introduce balanced accuracy (BA) to equivalently evaluate the adaptation performance in both classes, balanced accuracy is defined as the average value of sensitivity and specificity:

(11) \mathrm{BA}=\frac{1}{2}(\mathrm{TPR}+\mathrm{TNR})=\frac{1}{2}\left(\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}+\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}\right),

where TPR represents sensitivity and TNR represents specificity. TP and TN represent true positives and true negatives; FP and FN represent false positives and false negatives. In other words, balanced accuracy equally considers the accuracy of both classes, and thus yields a better estimation of adaptation performance under label shift.
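As a reference, balanced accuracy can be computed directly from the confusion matrix; the sketch below assumes scikit-learn, whose balanced_accuracy_score yields the same quantity.

```python
from sklearn.metrics import confusion_matrix

def balanced_accuracy(y_true, y_pred):
    """BA = 0.5 * (TP / (TP + FN) + TN / (TN + FP)), as in Equation 11."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# Equivalent: sklearn.metrics.balanced_accuracy_score(y_true, y_pred)
```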

Dataset  BA \uparrow  Acc. \uparrow  F1 \uparrow
Source Datasets
FEVER 0.7957 0.7957 0.8173
GettingReal 0.8459 0.9587 0.9776
GossipCop 0.7760 0.8693 0.9173
LIAR 0.6070 0.6322 0.7116
PHEME 0.8630 0.8674 0.8975
Target Datasets
CoAID 0.8892 0.9720 0.9846
Constraint 0.9350 0.9327 0.9323
ANTiVax 0.9303 0.9191 0.9291
Table 2. Supervised training results.
Source Dataset Target Dataset CoAID (2020) Constraint (2021) ANTiVax (2022)
Metric BA \uparrow Acc. \uparrow F1 \uparrow BA \uparrow Acc. \uparrow F1 \uparrow BA \uparrow Acc. \uparrow F1 \uparrow
FEVER (2018) None 0.5528 0.8980 0.9456 0.5633 0.5818 0.7057 0.5871 0.6219 0.7066
DAAT 0.5113 0.8920 0.9427 0.5485 0.5640 0.6791 0.5641 0.6255 0.7345
EADA 0.5185 0.9050 0.9499 0.5602 0.5780 0.7001 0.5223 0.6159 0.7531
CANMD 0.5236 0.8970 0.9454 0.6837 0.6827 0.6864 0.6501 0.6793 0.7485
GettingReal (2016) None 0.5000 0.9060 0.9507 0.5132 0.5360 0.6929 0.5022 0.6066 0.7546
DAAT 0.5314 0.9110 0.9531 0.5532 0.5734 0.7075 0.5057 0.6093 0.7557
EADA 0.5000 0.9060 0.9507 0.5466 0.5678 0.7077 0.5025 0.6070 0.7548
CANMD 0.5622 0.9150 0.9551 0.7440 0.7416 0.7370 0.5629 0.6289 0.7409
GossipCop (2018) None 0.5121 0.9020 0.9483 0.5559 0.5762 0.7096 0.5283 0.6262 0.7630
DAAT 0.5095 0.9060 0.9506 0.7178 0.7276 0.7806 0.6692 0.7161 0.7918
EADA 0.4994 0.9050 0.9501 0.5210 0.5430 0.6944 0.5509 0.6434 0.7709
CANMD 0.6487 0.9250 0.9598 0.7914 0.7921 0.8020 0.7410 0.7771 0.8321
LIAR (2017) None 0.6900 0.7320 0.8337 0.7255 0.7318 0.7706 0.5147 0.4846 0.4656
DAAT 0.5000 0.9060 0.9507 0.7607 0.7626 0.7795 0.6228 0.5778 0.5393
EADA 0.7019 0.9350 0.9650 0.7776 0.7794 0.7950 0.6184 0.6642 0.7490
CANMD 0.7164 0.9440 0.9699 0.8147 0.8140 0.8183 0.7554 0.7844 0.8338
PHEME (2017) None 0.5015 0.9000 0.9473 0.4787 0.4972 0.6456 0.5083 0.6103 0.7553
DAAT 0.5149 0.9070 0.9511 0.5227 0.5411 0.6763 0.5895 0.6498 0.7518
EADA 0.5095 0.9060 0.9506 0.4944 0.4969 0.6391 0.5411 0.6328 0.7632
CANMD 0.5152 0.8990 0.9466 0.5594 0.5654 0.6235 0.6166 0.6823 0.7797
Average None 0.5513 0.8676 0.9251 0.5673 0.5846 0.7049 0.5281 0.5899 0.6890
DAAT 0.5134 0.9044 0.9496 0.6206 0.6337 0.7246 0.5903 0.6357 0.7146
EADA 0.5458 0.9114 0.9533 0.5800 0.5930 0.7073 0.5470 0.6327 0.7582
CANMD 0.5932 0.9160 0.9554 0.7186 0.7192 0.7334 0.6652 0.7104 0.7870
Table 3. Main results of domain adaptation.

Implementation: The input text is preprocessed as follows: (1) special symbols like emojis are back-translated into English words; (2) hashtags, mentions and URLs are tokenized for posts on social media; (3) special characters are removed from the input text. For training, we first pretrain the misinformation detection model on the source dataset using the Adam optimizer without linear warm-up. We adopt a learning rate of 1e-5, a batch size of 24 and a maximum of 5 epochs. We validate the model after each epoch with the validation set and report metric scores on the test set. In the adaptation step of CANMD, we adopt the same training and evaluation methods. For batching, we sample 24 target examples and perform class-aware sampling to sample another 24 source examples. We select the confidence threshold $\tau$ from $0.6$ to $0.8$ and the scaling factor $\lambda$ from $0.001$ to $0.05$. For baseline methods, we adopt the identical training conditions as CANMD and follow the default hyperparameter selections in the original papers (Du et al., 2020; Zou et al., 2021).
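An illustrative sketch of this preprocessing pipeline is given below, assuming the Python emoji package for back-translating emojis; the regular expressions and placeholder tokens are simplified assumptions, not the authors' exact rules.

```python
import re
import emoji

def preprocess(text):
    """Back-translate emojis, tokenize hashtags/mentions/URLs, drop special characters."""
    text = emoji.demojize(text, delimiters=(" ", " "))   # emojis -> English words
    text = re.sub(r"https?://\S+", " URL ", text)        # tokenize URLs
    text = re.sub(r"@\w+", " MENTION ", text)            # tokenize mentions
    text = re.sub(r"#(\w+)", r" HASHTAG \1 ", text)      # tokenize hashtags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)          # remove special characters
    return re.sub(r"\s+", " ", text).strip()
```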

Figure 4. Qualitative results of domain adaptation. We denote Constraint with Const., FEVER with F, GettingReal with GR, GossipCop with GC, LIAR with L and PHEME with P. True examples are marked in orange while fake examples are in green. Panels: (a)–(e) F, GR, GC, L, P → Const. (DAAT); (f)–(j) F, GR, GC, L, P → Const. (EADA); (k)–(o) F, GR, GC, L, P → Const. (CANMD).

4.2. Quantitative Analysis

We first present the supervised training results on all datasets in Table 2; the evaluation results on the target datasets provide an upper bound for the adaptation performance. We observe the following: (1) the performance varies for different data types. For example, RoBERTa does not perform well when we train on claims without additional knowledge (e.g., LIAR provides short claims). Predicting misinformation on social media posts achieves better performance with an average BA of 0.9094 (i.e., PHEME, Constraint and ANTiVax), potentially due to the syntactic patterns of misinformation (e.g., second-person pronouns, swear words and adverbs (Li et al., 2021)). (2) For disproportionate label distributions, balanced accuracy (BA) better reflects the classification performance by weighting both classes equally. For example, the classification accuracy is 0.9587 on GettingReal, while the BA score drops drastically to 0.8459. This is because the dataset contains overwhelmingly true examples, and by only predicting one class, the model still achieves over 0.9 accuracy. Overall, the supervised performance depends heavily on the input data type and label distribution, suggesting the potential benefits of domain similarity in adapting misinformation detection systems.

Now we study the adaptation performance and report the results in Table 3. Each row represents a combination of a source dataset and a domain adaptation method, and each column includes metric scores on one target dataset. We report BA, accuracy and F1 metrics for all source-target combinations; the best results are marked in bold and the second best results are underlined. We report our observations: (1) due to the large domain discrepancies, domain adaptation for COVID misinformation detection is a non-trivial task. For instance, the naïve baseline (i.e., None) trained on FEVER achieves 0.5633 BA and 0.5818 accuracy on Constraint; (2) in rare cases, we observe negative transfer. For example, consistent performance drops can be found on the BA metric in the FEVER → CoAID experiments due to the discrepancy and the label shift across both domains. (3) Baseline adaptation methods achieve better results than the naïve baseline (i.e., None) in most cases. For example, the DAAT baseline achieves 4.7% and 6.4% relative improvements on BA and accuracy; similar improvements can be found using EADA. (4) On average, CANMD performs the best by achieving the highest performance in most scenarios and outperforming the best baseline method by 11.5% and 7.9% on BA and accuracy. (5) CANMD performs well even when the source and target datasets demonstrate significant label shifts. For instance, GettingReal consists of over 90% true examples while Constraint is a balanced dataset in the COVID domain. CANMD achieves 0.7440 BA and 0.7416 accuracy in GettingReal → Constraint, with 34.5% and 29.3% improvements compared to the strongest baseline (i.e., DAAT). Altogether, the results suggest that the proposed CANMD is effective in improving misinformation detection performance on out-of-domain data. By comparing the performance in source-target combinations with dissimilar label distributions (e.g., GossipCop → CoAID), CANMD is particularly effective in adapting models to target domains under significant label shifts due to label correction.

4.3. Qualitative Analysis

We also present qualitative examples of the learnt domain-invariant features generated by the baseline methods and CANMD. In particular, we first train the misinformation detection models with different domain adaptation methods. The trained models are evaluated on the test set, and we visualize the output at the [CLS] position of the transformer encoder. We select the correct predictions and perform t-SNE to generate the plots, which can be found in Figure 4. Constraint is a widely used COVID misinformation dataset; we thus perform the experiments with DAAT, EADA and CANMD using Constraint as the target dataset (Const.), where valid examples are in orange and misinformation examples are in green. For source datasets, we denote FEVER with F, GettingReal with GR, GossipCop with GC, LIAR with L and PHEME with P.

From the qualitative examples we observe the following: (1) all domain adaptation methods generate domain-invariant features in the representation space, as we do not observe clustering of examples in the plots. Nevertheless, different classes of examples are not clearly separated in most cases. (2) The proposed CANMD reduces intra-class distances while separating the true samples from the misinformation samples. For example, we observe a clear decision boundary between both classes in Figure 4(k) and Figure 4(l). (3) The label correction component in CANMD effectively corrects the feature distribution in the representation space. For instance, the target dataset Constraint has a balanced label distribution, while FEVER, GettingReal and GossipCop comprise over 70% credible examples (orange). Yet we observe comparable areas of credible and fake examples in Figure 4(k), Figure 4(l) and Figure 4(m), whereas the baseline methods demonstrate an overwhelmingly orange area (mirroring the source data distribution). In sum, CANMD exploits the source domain knowledge by enlarging inter-class margins and learning compact intra-class features; CANMD also corrects the label shifts to remove potential bias from the source data.

4.4. Performance Variations

We now discuss the potential reasons for the performance variations in the adaptation results. For this purpose, we refer to the adaptation results in Table 3 and the qualitative examples in Figure 4. In particular, we discuss how the adaptation performance varies across different source-target combinations.

First, the different adaptation performance can be traced back to the similarity between the source domain and the target domain. When the source and target datasets are alike, knowledge can be transferred to the target domain with less effort. An example can be found in LIAR → Constraint in Table 3, where both datasets have similar input types and text lengths. In this case, the naïve baseline achieves 0.7255 BA while CANMD can reach up to 0.8147 BA. In contrast, GettingReal → Constraint demonstrates a larger domain gap with different input types (News → Social Media) and lengths (738.9 → 32.7). Thus, we observe smaller improvements using CANMD, with 0.7440 and 0.7416 in BA and accuracy respectively. The variations in the improvements can be further attributed to the label shift between the source and target domains. When the source domain has a significantly different label distribution from the target domain, the bias may be carried over to the target domain during adaptation. An example can be found in GossipCop → CoAID, where the target label distribution is 10:90. As such, the baseline methods perform poorly on the balanced accuracy metric. Compared to the baseline methods, CANMD achieves 0.6487 BA in contrast to only circa 0.5 for the baselines, suggesting the effectiveness of the proposed label correction component. Therefore, the domain discrepancy and the label shift between the source and target domains are observed to be crucial for the expected adaptation performance in COVID-19 misinformation detection.

Dataset   BA \uparrow   Acc. \uparrow   F1 \uparrow
CoAID
CANMD 0.5932 0.9160 0.9554
-Label Correction 0.5749 0.9156 0.9553
-Contrastive Adaptation 0.5513 0.8676 0.9251
Constraint
CANMD 0.7186 0.7192 0.7334
-Label Correction 0.6745 0.6819 0.7366
-Contrastive Adaptation 0.5673 0.5846 0.7049
ANTiVax
CANMD 0.6652 0.7104 0.7870
-Label Correction 0.6077 0.6627 0.7572
-Contrastive Adaptation 0.5281 0.5899 0.6890
Table 4. Ablation study of CANMD.

4.5. Ablation Studies

Label Correction and Contrastive Adaptation: We evaluate the effectiveness of the label correction component and contrastive adaptation by comparing our results from CANMD to the results trained without label correction and without contrastive adaptation. We report the performance on the target datasets (averaged over all source datasets) in Table 4. Note that removing the contrastive adaptation component reduces CANMD to the naïve baseline, as the pseudo labeling and label correction are designed for the following contrastive adaptation stage. As expected, we observe performance drops when removing the label correction component in CANMD. On average, the performance drops by 6.1% and 3.6% in BA and accuracy when we remove the label correction component. Surprisingly, removing label correction slightly improves the F1 score on Constraint; this is because the model tends to predict true more frequently without label correction. As F1 does not consider true negatives, the value can be lower even if the model predicts with an increased fairness between both classes. Additionally, we observe that the majority of the improvements on CoAID come from the label correction component rather than the contrastive adaptation stage. A potential reason is the disproportionate label distribution in CoAID, where over 90% of the data examples are credible information. Thus, the performance improves significantly by correcting the label shift between CoAID and the source datasets.

Figure 5. Hyperparameter study for $\lambda$ and $\tau$.

Sensitivity Analysis of Hyperparameters: We study the sensitivity of the scaling factor $\lambda$ and the confidence threshold $\tau$. Similarly, we vary the input hyperparameter in adaptation and report the results on all target datasets. The reported numbers are BA values averaged across source datasets, see Figure 5. The results on hyperparameter $\lambda$ are comparatively stable, and the BA scores are less sensitive to changes of $\lambda$. For the confidence threshold $\tau$ in pseudo labeling, the results vary across datasets. For example, the performance peaks for Constraint and CoAID at $\tau=0.6$, while the results improve steadily with increasing $\tau$ for ANTiVax, suggesting further room for improvement on ANTiVax. Overall, the proposed CANMD is robust to changes of $\lambda$ and $\tau$ and consistently outperforms baselines even with sub-optimal hyperparameters.

5. Conclusion

In this paper, we propose a novel framework for domain adaptation in COVID-19 misinformation detection. We design CANMD and propose to perform label shift correction and contrastive learning for early misinformation detection in the COVID domain. Unlike existing methods, we design a label correction component and adapt the conditional distributions to improve the target domain performance. Extensive experiments demonstrate that CANMD is effective in domain adaptation for misinformation detection by achieving 11.5% average improvements and up to 34.5% improvements under label shifts between the source and target domains.

Acknowledgments

This research is supported in part by the National Science Foundation under Grant No. IIS-2202481, CHE-2105005, IIS-2008228, CNS-1845639, CNS-1831669. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

References

  • Alexandari et al. (2020) Amr Alexandari, Anshul Kundaje, and Avanti Shrikumar. 2020. Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning. PMLR, 222–232.
  • Azizzadenesheli et al. (2019) Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. 2019. Regularized learning for domain adaptation under label shifts. arXiv preprint arXiv:1903.09734 (2019).
  • Buntain and Golbeck (2017) Cody Buntain and Jennifer Golbeck. 2017. Automatically identifying fake news in popular twitter threads. In 2017 IEEE International Conference on Smart Cloud (SmartCloud). IEEE, 208–215.
  • Cui and Lee (2020) Limeng Cui and Dongwon Lee. 2020. Coaid: Covid-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885 (2020).
  • Cui et al. (2020) Limeng Cui, Haeseung Seo, Maryam Tabar, Fenglong Ma, Suhang Wang, and Dongwon Lee. 2020. Deterrent: Knowledge guided graph attention network for detecting healthcare misinformation. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 492–502.
  • Das et al. (2021) Sourya Dipta Das, Ayan Basak, and Saikat Dutta. 2021. A heuristic-driven ensemble framework for COVID-19 fake news detection. In International Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation. Springer, 164–176.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  • Du et al. (2020) Chunning Du, Haifeng Sun, Jingyu Wang, Qi Qi, and Jianxin Liao. 2020. Adversarial and Domain-Aware BERT for Cross-Domain Sentiment Analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4019–4028. https://doi.org/10.18653/v1/2020.acl-main.370
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The journal of machine learning research 17, 1 (2016), 2096–2030.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A Kernel Two-Sample Test. The Journal of Machine Learning Research 13, 1 (2012), 723–773.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. 2017. On calibration of modern neural networks. In International Conference on Machine Learning. PMLR, 1321–1330.
  • Guo et al. (2018) Han Guo, Juan Cao, Yazi Zhang, Junbo Guo, and Jintao Li. 2018. Rumor detection with hierarchical social attention network. In Proceedings of the 27th ACM international conference on information and knowledge management. 943–951.
  • Hayawi et al. (2022) Kadhim Hayawi, Sakib Shahriar, Mohamed Adel Serhani, Ikbal Taleb, and Sujith Samuel Mathew. 2022. ANTi-Vax: a novel Twitter dataset for COVID-19 vaccine misinformation detection. Public health 203 (2022), 23–30.
  • Hossain et al. (2020) Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. COVIDLies: Detecting COVID-19 Misinformation on Social Media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2020.nlpcovid19-2.11
  • Jin et al. (2017) Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, and Jiebo Luo. 2017. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM International Conference on Multimedia. 795–816.
  • Jin et al. (2016) Zhiwei Jin, Juan Cao, Yongdong Zhang, and Jiebo Luo. 2016. News verification by exploiting conflicting social viewpoints in microblogs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30.
  • Kang et al. (2019) Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. 2019. Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4893–4902.
  • Karimi and Tang (2019) Hamid Karimi and Jiliang Tang. 2019. Learning Hierarchical Discourse-level Structure for Fake News Detection. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 3432–3442. https://doi.org/10.18653/v1/N19-1347
  • Khattar et al. (2019) Dhruv Khattar, Jaipal Singh Goud, Manish Gupta, and Vasudeva Varma. 2019. MVAE: Multimodal variational autoencoder for fake news detection. In The World Wide Web Conference. 2915–2921.
  • Kou et al. (2022a) Ziyi Kou, Lanyu Shang, Yang Zhang, and Dong Wang. 2022a. HC-COVID: A Hierarchical Crowdsource Knowledge Graph Approach to Explainable COVID-19 Misinformation Detection. Proceedings of the ACM on Human-Computer Interaction 6, GROUP (2022), 1–25.
  • Kou et al. (2021) Ziyi Kou, Lanyu Shang, Yang Zhang, Christina Youn, and Dong Wang. 2021. FakeSens: A social sensing approach to COVID-19 misinformation detection on social media. In 2021 17th International Conference on Distributed Computing in Sensor Systems (DCOSS). IEEE, 140–147.
  • Kou et al. (2022b) Ziyi Kou, Lanyu Shang, Yang Zhang, Zhenrui Yue, Huimin Zeng, and Dong Wang. 2022b. Crowd, Expert & AI: A Human-AI Interactive Approach Towards Natural Language Explanation Based COVID-19 Misinformation Detection. In IJCAI.
  • Li et al. (2021) Yichuan Li, Kyumin Lee, Nima Kordzadeh, Brenton Faber, Cameron Fiddes, Elaine Chen, and Kai Shu. 2021. Multi-Source Domain Adaptation with Weak Supervision for Early Fake News Detection. In 2021 IEEE International Conference on Big Data (Big Data). IEEE, 668–676.
  • Lipton et al. (2018) Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. 2018. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning. PMLR, 3122–3130.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Liu and Wu (2018) Yang Liu and Yi-Fang Wu. 2018. Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Liu et al. (2020) Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2020. Fine-grained Fact Verification with Kernel Graph Attention Network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 7342–7351. https://doi.org/10.18653/v1/2020.acl-main.655
  • Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning. PMLR, 97–105.
  • Long et al. (2013) Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. 2013. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision. 2200–2207.
  • Long et al. (2017) Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. 2017. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning. PMLR, 2208–2217.
  • Medina Serrano et al. (2020) Juan Carlos Medina Serrano, Orestis Papakyriakopoulos, and Simon Hegelich. 2020. NLP-based Feature Extraction for the Detection of COVID-19 Misinformation Videos on YouTube. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020. Association for Computational Linguistics, Online. https://aclanthology.org/2020.nlpcovid19-acl.17
  • Müller et al. (2020) Martin Müller, Marcel Salathé, and Per E Kummervold. 2020. COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter. arXiv preprint arXiv:2005.07503 (2020).
  • Pan et al. (2018) Jeff Z Pan, Siyana Pavlova, Chenxi Li, Ningxi Li, Yangmei Li, and Jinshuo Liu. 2018. Content based fake news detection using knowledge graphs. In International Semantic Web Conference. Springer, 669–683.
  • Patwa et al. (2021) Parth Patwa, Shivam Sharma, Srinivas Pykl, Vineeth Guptha, Gitanjali Kumari, Md Shad Akhtar, Asif Ekbal, Amitava Das, and Tanmoy Chakraborty. 2021. Fighting an infodemic: COVID-19 fake news dataset. In International Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation. Springer, 21–29.
  • Pei et al. (2018) Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. 2018. Multi-adversarial domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Popat et al. (2018) Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2018. DeClarE: Debunking Fake News and False Claims using Evidence-Aware Deep Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 22–32. https://doi.org/10.18653/v1/D18-1003
  • Rashid and Wang (2021) Md Tahmid Rashid and Dong Wang. 2021. CovidSens: A vision on reliable social sensing for COVID-19. Artificial Intelligence Review 54, 1 (2021), 1–25.
  • Risdal (2016) Megan Risdal. 2016. Getting Real about Fake News. https://doi.org/10.34740/KAGGLE/DSV/911
  • Roitero et al. (2020) Kevin Roitero, Michael Soprano, Beatrice Portelli, Damiano Spina, Vincenzo Della Mea, Giuseppe Serra, Stefano Mizzaro, and Gianluca Demartini. 2020. The COVID-19 infodemic: Can the crowd judge recent misinformation objectively? In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1305–1314.
  • Roozenbeek et al. (2020) Jon Roozenbeek, Claudia R Schneider, Sarah Dryhurst, John Kerr, Alexandra LJ Freeman, Gabriel Recchia, Anne Marthe Van Der Bles, and Sander Van Der Linden. 2020. Susceptibility to misinformation about COVID-19 around the world. Royal Society Open Science 7, 10 (2020), 201199.
  • Shang et al. (2022b) Lanyu Shang, Ziyi Kou, Yang Zhang, Jin Chen, and Dong Wang. 2022b. A Privacy-aware Distributed Knowledge Graph Approach to QoIS-driven COVID-19 Misinformation Detection. In 2022 IEEE/ACM 30th International Symposium on Quality of Service (IWQoS). IEEE, 1–10.
  • Shang et al. (2021) Lanyu Shang, Ziyi Kou, Yang Zhang, and Dong Wang. 2021. A Multimodal Misinformation Detector for COVID-19 Short Videos on TikTok. In 2021 IEEE International Conference on Big Data (Big Data). IEEE, 899–908.
  • Shang et al. (2022a) Lanyu Shang, Ziyi Kou, Yang Zhang, and Dong Wang. 2022a. A Duo-generative Approach to Explainable Multimodal COVID-19 Misinformation Detection. In Proceedings of the ACM Web Conference 2022. 3623–3631.
  • Shu et al. (2019a) Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. 2019a. dEFEND: Explainable fake news detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 395–405.
  • Shu et al. (2020) Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2020. FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big Data 8, 3 (2020), 171–188.
  • Shu et al. (2019b) Kai Shu, Suhang Wang, and Huan Liu. 2019b. Beyond news contents: The role of social context for fake news detection. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 312–320.
  • Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355 (2018).
  • Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7167–7176.
  • Tzeng et al. (2014) Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474 (2014).
  • Vo and Lee (2020) Nguyen Vo and Kyumin Lee. 2020. Where Are the Facts? Searching for Fact-checked Information to Alleviate the Spread of Fake News. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 7717–7731. https://doi.org/10.18653/v1/2020.emnlp-main.621
  • Wang (2017a) William Yang Wang. 2017a. “Liar, Liar Pants on Fire”: A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648 (2017).
  • Wang (2017b) William Yang Wang. 2017b. “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Vancouver, Canada, 422–426. https://doi.org/10.18653/v1/P17-2067
  • Wang et al. (2018) Yaqing Wang, Fenglong Ma, Zhiwei Jin, Ye Yuan, Guangxu Xun, Kishlay Jha, Lu Su, and Jing Gao. 2018. EANN: Event adversarial neural networks for multi-modal fake news detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 849–857.
  • Xu et al. (2019) Brian Xu, Mitra Mohtarami, and James Glass. 2019. Adversarial domain adaptation for stance detection. arXiv preprint arXiv:1902.02401 (2019).
  • Yue et al. (2021) Zhenrui Yue, Bernhard Kratzwald, and Stefan Feuerriegel. 2021. Contrastive Domain Adaptation for Question Answering using Limited Text Corpora. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 9575–9593. https://doi.org/10.18653/v1/2021.emnlp-main.754
  • Zeng et al. (2022) Huimin Zeng, Zhenrui Yue, Yang Zhang, Ziyi Kou, Lanyu Shang, and Dong Wang. 2022. On Attacking Out-Domain Uncertainty Estimation in Deep Neural Networks. In IJCAI.
  • Zhang et al. (2020) Tong Zhang, Di Wang, Huanhuan Chen, Zhiwei Zeng, Wei Guo, Chunyan Miao, and Lizhen Cui. 2020. BDANN: BERT-based domain adaptation neural network for multi-modal fake news detection. In 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
  • Zhou et al. (2022) Honghao Zhou, Tinghuai Ma, Huan Rong, Yurong Qian, Yuan Tian, and Najla Al-Nabhan. 2022. MDMN: Multi-task and Domain Adaptation based Multi-modal Network for early rumor detection. Expert Systems with Applications 195 (2022), 116517.
  • Zou et al. (2021) Han Zou, Jianfei Yang, and Xiaojian Wu. 2021. Unsupervised Energy-based Adversarial Domain Adaptation for Cross-domain Text Classification. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Online, 1208–1218. https://doi.org/10.18653/v1/2021.findings-acl.103