
* These authors contributed equally to this work.

1 Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC, Canada
  {dwenlong,xiaoxiao.li}@ece.ubc.ca
2 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
  {yzhong22,qdou}@cse.cuhk.edu.hk

On Fairness of Medical Image Classification
with Multiple Sensitive Attributes via
Learning Orthogonal Representations

Wenlong Deng^1 (ORCID: 0009-0002-0545-6384), Yuan Zhong^2 (ORCID: 0009-0004-4909-8763), Qi Dou^2 (ORCID: 0000-0002-3416-9950), Xiaoxiao Li^1 (ORCID: 0000-0002-8833-0244)
Abstract

Mitigating the discrimination of machine learning models has gained increasing attention in medical image analysis. However, few works focus on fair treatment of patients with multiple sensitive demographic attributes, which is a crucial yet challenging problem for real-world clinical applications. In this paper, we propose a novel method for fair representation learning with respect to multiple sensitive attributes. We pursue independence between target and multi-sensitive representations by achieving orthogonality in the representation space. Concretely, we enforce column space orthogonality by keeping target information in the complement of a low-rank sensitive space. Furthermore, in the row space, we encourage the feature dimensions of target and sensitive representations to be orthogonal. The effectiveness of the proposed method is demonstrated with extensive experiments on the CheXpert dataset. To the best of our knowledge, this is the first work to mitigate unfairness with respect to multiple sensitive attributes in the field of medical imaging. The code is available at https://github.com/ubc-tea/FCRO-Fair-Classification-Orthogonal-Representation.

1 Introduction

With the increasing application of artificial intelligence systems to medical image diagnosis, it is important to ensure the fairness of image classification models and to investigate concealed model biases that may be encountered in complex real-world situations. Unfortunately, sensitive attributes (e.g., race and gender) accompanying medical images are prone to being inherently encoded by machine learning models [5] and can affect the model's discrimination properties [20]. Recently, fair representation learning has shown great potential, as it acts as a group-parity bottleneck that mitigates discrimination when generalized to downstream tasks. Existing methods [1, 4, 15, 16] have studied parity between privileged and unprivileged groups with respect to a single sensitive attribute, neglecting flexibility with respect to multiple sensitive attributes, whose conjunctions of unprivileged values might worsen discrimination. This is a crucial yet challenging problem hindering the applicability of machine learning models, especially for medical image classification, where patients often carry multiple demographic attributes.

Refer to caption
Figure 1: t-SNE [10] visualizations of (a) sensitive attribute and (b) target representations learned by our proposed method FCRO on the CheXpert dataset [7]. The sensitive embeddings capture subgroup variance. FCRO enforces fair classification on the target task by learning orthogonal target representations that are invariant across different demographic attribute combinations.

To date, it remains challenging to learn target-related representations that are both fair and flexible with respect to multiple sensitive attributes, despite some promising recent investigations. For instance, adversarial methods [1, 11] produce robust representations by formulating a min-max game between an encoder that learns class-related representations and an adversary that removes sensitive information from them. Disentanglement-based methods [4] achieve separation by minimizing the mutual information between target and sensitive attribute representations. These methods typically rely on carefully designed objectives; extending them to the multi-attribute setting requires additional loss functions, which must handle gradient conflict or interference. Methods using variational autoencoders [3] decompose the latent distributions of target and sensitive factors and penalize their correlation for disentanglement. However, aligning the distributions of the sensitive attributes is difficult or even intractable given the complex combinations of multiple factors. Besides, some fairness methods based on causal inference [12] or bi-level optimization [16] also learn debiased representations, but remain inflexible to multiple attributes. Recently, disentanglement has been interpreted as orthogonality of a decomposed target-sensitive latent representation pair [15], where a pair of orthogonal subspaces is predefined for the target and sensitive attribute representations. However, in the multi-sensitive-attribute setting, the dimension of the target space would be continuously compressed, and how to resolve this remains an open problem.

In this paper, we propose a new method to achieve Fairness via Column-Row space Orthogonality (dubbed FCRO) by learning fair representations with respect to multiple sensitive attributes for medical image classification. FCRO handles multiple sensitive attributes by encoding them into a unified attribute representation. It achieves a favorable trade-off between fairness and data utility (see illustrations in Fig. 1) via orthogonality in both the column and row spaces. Our contributions are summarized as follows: (1) We tackle the practical and challenging problem of fairness given multiple sensitive attributes for medical image classification. To the best of our knowledge, this is the first work to study fairness with respect to multiple sensitive attributes in the field of medical imaging. (2) We relax the independence of target and sensitive attribute representations to orthogonality, which is achieved by our proposed novel column and row losses. (3) We conduct extensive experiments on the CheXpert [7] dataset with over 80,000 chest X-rays. FCRO achieves a superior fairness-utility trade-off over state-of-the-art methods with respect to the multiple sensitive attributes race, sex, and age.

2 Methodology

2.1 Problem Formulation

Notations. We consider group fairness in this work. While preserving model utility for the privileged group, group fairness requires the equality of certain statistics, such as the predictive rate, across groups. We consider a binary classification problem with column-vector inputs $x \in \mathcal{X}$ and labels $y \in \mathcal{Y} = \{0, 1\}$. The multi-sensitive attribute $a \in \mathcal{A}$ is a vector of $m$ binary attributes sampled from their conjunction, i.e., the Cartesian product $\mathcal{A} = \prod_{i \in [m]} A_i$, where $[m] = \{1, \dots, m\}$ and $A_i \in \{0, 1\}$ is the $i$-th sensitive attribute. Our training data consist of tuples $\mathcal{D} = \{(x, y, a)\}$. We denote the classification model $f(x) = h_T(\phi_T(x))$ that predicts a class label $\widehat{y}$ given $x$, where $\phi_T: \mathcal{X} \mapsto \mathbb{R}^d$ is a feature encoder for target embeddings and $h_T: \mathbb{R}^d \mapsto \mathbb{R}$ is a scoring function. Similarly, we consider a sensitive attribute model $g(x) = \{h_{A_1}(\phi_A(x)), \dots, h_{A_m}(\phi_A(x))\}$ that predicts the sensitive attributes associated with input $x$. Given $n$ samples, the input data matrix is $X = [x_1, \dots, x_n]$, and we denote the feature representations $Z_T = \phi_T(X)$, $Z_A = \phi_A(X) \in \mathbb{R}^{d \times n}$.

Refer to caption
Figure 2: Overview of our proposed method FCRO. (a) The graphical model of orthogonal representation learning for fair medical image classification with multiple sensitive attributes. (b) The novel column-row space orthogonality. In the column space, we encourage the target model to learn representations in the complement of a low-rank sensitive space. In the row space, we enforce each row vector (feature dimension) of the target and sensitive attribute representations to be orthogonal to each other. (c) The overall training pipeline. We use a pre-trained multi-sensitive branch and propagate orthogonal gradients to the target encoder $\phi_T$.

Fairness with multiple sensitive attributes. A classifier predicts $y$ given an input $x$ by estimating the posterior probability $p(y|x)$. When inputs that are affected by their associated attributes (i.e., $\{A_1, \dots, A_m\} \rightarrow X$) are fed into the network, the posterior becomes $p(y|x, a)$. Since biased information from $\mathcal{A}$ is encoded, this can lead to unfair predictions. For example, in diagnosing a disease with sensitive attributes age, sex, and race, a biased classifier yields $p(\widehat{y} \mid A = (\mathrm{male}, \mathrm{old}, \mathrm{Black})) \neq p(\widehat{y} \mid A = (\mathrm{female}, \mathrm{young}, \mathrm{White}))$. In this work, we focus on equalized odds (ED), a commonly used and crucial criterion for fair classification in the medical domain [19]. In our case, ED with respect to multiple sensitive attributes can be formulated as follows:

$P(\widehat{Y}=y \mid A=\pi_1, Y=y) = P(\widehat{Y}=y \mid A=\pi_2, Y=y), \quad \forall \pi_1, \pi_2 \in \mathcal{A},\ y \in \mathcal{Y}.$  (1)

A classifier satisfies ED with respect to group $A$ and target $Y$ if the model prediction $\widehat{Y}$ and $A$ are conditionally independent given $Y$, i.e., $\widehat{Y} \perp A \mid Y$.

Fair representation. To enforce the above condition, we follow [15] and introduce a target embedding $z_T$ and multi-attribute embeddings $z_{A_i}$ generated from $x$. As in the causal structure graph of the classifier depicted in Fig. 2(a), our objective is to find a fair model that maximizes the log-likelihood of the distribution $p(y, a|x)$, where:

$p(y, a|x) = \frac{p(y|x,a)\, p(x|a)\, p(a)}{p(x)} = p(y|x)\, p(a|x)$  (2)
$\quad = p(y|z_T)\, p(z_T|x) \prod_{i \in [m]} p(a_i|z_{A_i})\, p(z_{A_i}|x),$  (3)

and we call $z_T$ the fair representation for the target task (e.g., disease diagnosis). We aim to maximize Eq. (3) under the conditional independence constraint to train a fair classifier. Notably, in the multi-sensitive-attribute setting, forcing $z_T$ to be independent of all $z_{A_i}, \forall i \in [m]$ is challenging and even intractable when $m$ is large. Therefore, we propose to encode the multiple sensitive attributes into a single compact encoding $z_A$ that remains predictive for classifying the attributes (i.e., $z_A \rightarrow \{a_1, \dots, a_m\}$). We can then rewrite Eq. (3) as a likelihood with the independence constraint on $z_T$ and $z_A$:

$p(y, a|x) = p(y|z_T)\, p(z_T|x)\, p(a|z_A)\, p(z_A|x).$  (4)

However, maximizing Eq. (4) raises two technical questions:
Q1: How do we satisfy the independence constraint between $z_T$ and $z_A$?
A1: We relax independence to orthogonality in a finite vector space. Unlike the predefined orthogonal space in [15], we enforce orthogonality in both the column spaces (Sec. 2.2) and the row spaces (Sec. 2.3) of $Z_T$ and $Z_A$.
Q2: How do we estimate $p(y|z_T)$, $p(z_T|x)$, $p(a|z_A)$, and $p(z_A|x)$?
A2: We train two convolutional encoders $z_T = \phi_T(x)$ and $z_A = \phi_A(x)$ to approximate $p(z_T|x)$ and $p(z_A|x)$, respectively, and two multi-layer perceptron classifiers $y = h_T(z_T)$ and $a = h_A(z_A)$ to approximate $p(y|z_T)$ and $p(a|z_A)$, respectively (Sec. 2.4). A concrete sketch of this two-branch design is given below.
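As a minimal illustration of A2, the PyTorch sketch below sets up the two branches, assuming the DenseNet-121 backbone and 128-dimensional representations reported in Sec. 3.1; the class and variable names are ours, not those of the released code.

```python
import torch
import torch.nn as nn
from torchvision import models

class Branch(nn.Module):
    """An encoder phi(.) followed by one scoring head h(.) per prediction task."""
    def __init__(self, feat_dim: int = 128, num_heads: int = 1):
        super().__init__()
        backbone = models.densenet121(weights=None)
        in_dim = backbone.classifier.in_features
        backbone.classifier = nn.Linear(in_dim, feat_dim)  # phi: X -> R^d
        self.encoder = backbone
        # one head for the target branch; m heads for the sensitive branch
        self.heads = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(num_heads))

    def forward(self, x):
        z = self.encoder(x)                    # embedding z in R^d (approx. p(z|x))
        return z, [h(z) for h in self.heads]   # scores (approx. p(y|z_T) or p(a_i|z_A))

target_branch = Branch(num_heads=1)     # f(x) = h_T(phi_T(x))
sensitive_branch = Branch(num_heads=3)  # g(x) = {h_{A_i}(phi_A(x))}, m = 3 here
```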

2.2 Column Space Orthogonality

First, we focus on the column space of the target and sensitive attribute representations. Column space orthogonality aims to learn target representations $Z_T$ that fulfill two aims: 1) have the least projection onto the sensitive space $\mathcal{S}_A$, and 2) preserve the representation power to predict $Y$.

Denote the target representation $Z_T = [\widetilde{z}_T^1, \widetilde{z}_T^2, \dots, \widetilde{z}_T^n]$ and the sensitive attribute representation $Z_A = [\widetilde{z}_A^1, \widetilde{z}_A^2, \dots, \widetilde{z}_A^n]$, where $\widetilde{z}^i \in \mathbb{R}^{d \times 1}$ is a column vector for $i \in [n]$. We represent the column spaces of $Z_T$ and $Z_A$ as $\mathcal{S}_T = \mathrm{span}(Z_T)$ and $\mathcal{S}_A = \mathrm{span}(Z_A)$, respectively. Aim 1 can be achieved by forcing $\mathcal{S}_T = \mathcal{S}_A^{\perp}$. Although both $\widetilde{z}_T, \widetilde{z}_A \in \mathbb{R}^d$, their coordinates may not be aligned, as they are generated by two separate encoders. Consequently, for finite $d$ there is no straightforward way to achieve $\mathcal{S}_T \perp \mathcal{S}_A$ by directly constraining $\widetilde{z}_T^i, \widetilde{z}_A^j$ (e.g., forcing $(\widetilde{z}_T^i)^{\top} \widetilde{z}_A^j = 0$). Aim 2 can be achieved by seeking a low-rank representation $\widetilde{\mathcal{S}}_A$ of $\mathcal{S}_A$ with rank $k \ll d$: since $\mathrm{rank}(\mathcal{S}_T) + \mathrm{rank}(\mathcal{S}_A) = d$ when $\mathcal{S}_T = \mathcal{S}_A^{\perp}$ holds, $\widetilde{\mathcal{S}}_A^{\perp}$ is then a high-dimensional space with sufficient representation power for the target embeddings. This is especially important with multiple sensitive attributes, as the total size of the space is $d$, and increasing the number of sensitive attributes would otherwise limit the capacity of $\mathcal{S}_T$ to learn predictive $\widetilde{z}_T$. To this end, we first find the low-rank sensitive attribute representation space $\widetilde{\mathcal{S}}_A$, and then encourage $Z_T$ to lie in its complement $\widetilde{\mathcal{S}}_A^{\perp}$.

Construct the low-rank multi-sensitive space. We apply singular value decomposition (SVD), $Z_A = U_A \Sigma_A V_A^{\top}$, to construct the low-rank space $\widetilde{\mathcal{S}}_A$, where $U_A \in \mathbb{R}^{d \times d}$ and $V_A \in \mathbb{R}^{n \times n}$ are orthogonal matrices whose columns are the left and right singular vectors $u_A^i \in \mathbb{R}^d$ and $v_A^i \in \mathbb{R}^n$, respectively, and $\Sigma_A \in \mathbb{R}^{d \times n}$ is a diagonal matrix of descending non-negative singular values $\{\delta_A^i\}_{i=1}^{\min\{n, d\}}$. We then take the $k$ most important left singular vectors to construct $\widetilde{\mathcal{S}}_A = [u_A^1, \dots, u_A^k]$, where $k$ controls how much sensitive information is captured in $\widetilde{\mathcal{S}}_A$. Notably, $\widetilde{\mathcal{S}}_A$ is agnostic to the number of sensitive attributes, because they share the same $Z_A$. For situations where the whole dataset is not available at once, we follow [8] and select the most important bases from both the bases of previous iterations and the newly constructed ones, yielding an accumulative variant that updates $\widetilde{\mathcal{S}}_A$ iteratively. As we do not observe significant performance differences between these two variants (Fig. 4(a)), we use and refer to the first one in this paper unless otherwise specified.
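A minimal sketch of this construction, assuming $Z_A$ is gathered as a $d \times n$ matrix over the training set (the function name is ours):

```python
import torch

def build_sensitive_space(Z_A: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Return the top-k left singular vectors of Z_A (d x n) as a d x k basis."""
    U, S, Vh = torch.linalg.svd(Z_A, full_matrices=False)  # Z_A = U diag(S) Vh
    return U[:, :k].detach()  # S~_A is kept fixed (detached) during target training
```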

Column orthogonal loss. With the low-rank space $\widetilde{\mathcal{S}}_A$ for multiple sensitive attributes, we encourage $\phi_T$ to learn representations in its complement $\widetilde{\mathcal{S}}_A^{\perp}$. Note that $\widetilde{\mathcal{S}}_A^{\perp}$ can also be interpreted as the kernel of the projection onto $\widetilde{\mathcal{S}}_A$, i.e., $\widetilde{\mathcal{S}}_A^{\perp} = \mathrm{Ker}(\mathrm{proj}_{\widetilde{\mathcal{S}}_A})$. Therefore, we minimize the projection of $Z_T$ onto $\widetilde{\mathcal{S}}_A$, which defines the column orthogonal loss:

$L_{corth} = c(Z_T, \widetilde{\mathcal{S}}_A) = \sum_{i=1}^{n} \frac{\left\| \widetilde{\mathcal{S}}_A^{\top} \widetilde{z}_T^{\,i} \right\|_2^2}{\left\| \widetilde{z}_T^{\,i} \right\|_2^2}.$  (5)

Since $\widetilde{\mathcal{S}}_A$ is a low-rank space, $\widetilde{\mathcal{S}}_A^{\perp}$ leaves abundant freedom for $\phi_T$ to extract target information, thus preserving predictive ability.
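Eq. (5) admits a direct vectorized form; below is a sketch under the same $d \times n$ layout (the function name is ours):

```python
import torch

def column_orth_loss(Z_T: torch.Tensor, S_A: torch.Tensor) -> torch.Tensor:
    """Eq. (5): squared norm of each column's projection onto S~_A (d x k),
    normalized by that column's squared norm; Z_T is d x n."""
    num = (S_A.T @ Z_T).pow(2).sum(dim=0)         # ||S~_A^T z_T^i||_2^2 per column
    den = Z_T.pow(2).sum(dim=0).clamp_min(1e-12)  # ||z_T^i||_2^2 per column
    return (num / den).sum()
```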

2.3 Row Space Orthogonality

Next, we study the row space of the target and sensitive attribute representations. Row space orthogonality aims to learn target representations $Z_T$ that have the least projection onto the sensitive row space $\widehat{\mathcal{S}}_A$; in other words, we want orthogonality in each feature dimension between $Z_T$ and $Z_A$. Denote the target representation $Z_T = [\widehat{z}_T^1; \widehat{z}_T^2; \dots; \widehat{z}_T^d]$ and the sensitive attribute representation $Z_A = [\widehat{z}_A^1; \widehat{z}_A^2; \dots; \widehat{z}_A^d]$, where $\widehat{z}^i \in \mathbb{R}^{1 \times n}$ is a row vector for $i \in [d]$. We represent the row spaces of the target and sensitive attribute representations as $\widehat{\mathcal{S}}_T = \mathrm{span}(Z_T^{\top})$ and $\widehat{\mathcal{S}}_A = \mathrm{span}(Z_A^{\top})$, respectively. Since the coordinates (i.e., the sample indices) of $\widehat{z}_A$ and $\widehat{z}_T$ are aligned, $\widehat{\mathcal{S}}_T \perp \widehat{\mathcal{S}}_A$ can be enforced directly by pushing $\widehat{z}_T^i$ and $\widehat{z}_A^j$ to be orthogonal for arbitrary $i, j \in [d]$.

Unlike in the column space, orthogonality here does not affect utility, since a row vector $\widehat{z}_T$ is not directly correlated with the target $y$. Specifically, we let the pairwise row vectors of $Z_T = [\widehat{z}_T^1; \widehat{z}_T^2; \dots; \widehat{z}_T^d]$ and $Z_A = [\widehat{z}_A^1; \widehat{z}_A^2; \dots; \widehat{z}_A^d]$ have zero inner products: for any $i, j \in [d]$, we minimize $\langle \widehat{z}_T^i, \widehat{z}_A^j \rangle$. We slightly modify the orthogonality by additionally subtracting the mean vectors $\mu_T$ and $\mu_A$ from $Z_T$ and $Z_A$, respectively, where $\mu = \mathbb{E}_{i \in [d]}\, \widehat{z}^i \in \mathbb{R}^{1 \times n}$. The orthogonality loss then naturally becomes a covariance loss:

$L_{rorth} = r(Z_T, Z_A) = \frac{1}{d^2} \sum_{i=1}^{d} \sum_{j=1}^{d} \left[ (\widehat{z}_T^{\,i} - \mu_T)(\widehat{z}_A^{\,j} - \mu_A)^{\top} \right]^2.$  (6)

In this way, the loss encourages each feature of $Z_T$ to be independent of the features in $Z_A$, thus suppressing the sensitive-encoded covariances that cause unfairness.
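Eq. (6) reduces to penalizing the squared entries of a $d \times d$ cross-covariance matrix; a sketch follows (the function name is ours, and $Z_A$ is detached as described in Sec. 2.4):

```python
import torch

def row_orth_loss(Z_T: torch.Tensor, Z_A: torch.Tensor) -> torch.Tensor:
    """Eq. (6): mean squared inner product between the mean-centered rows of
    Z_T and Z_A (both d x n)."""
    Zt = Z_T - Z_T.mean(dim=0, keepdim=True)  # subtract mu_T (1 x n)
    Za = Z_A - Z_A.mean(dim=0, keepdim=True)  # subtract mu_A (1 x n)
    cov = Zt @ Za.T                           # d x d matrix of row inner products
    return cov.pow(2).mean()                  # (1/d^2) * sum of squared entries
```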

2.4 Overall Training

In this section, we introduce the overall training scheme shown in Fig. 2(c). For the sensitive branch, since training a shared encoder risks sensitive-related features being used for target classification [4], we pre-train a separate $\{\phi_A, h_{A_1}, \dots, h_{A_m}\}$ for the multiple sensitive attributes using the sensitive loss $L_{sens} = \frac{1}{m} \sum_{i \in [m]} L_{A_i}$, where $L_{A_i}$ is the cross-entropy loss for the $i$-th sensitive attribute. Hence $p(z_A|x)$ and $p(a|z_A)$ in Eq. (4) can be obtained. The multi-sensitive space $\mathcal{S}_A$ is then constructed over the training data as in Section 2.2. For the target branch, we use a cross-entropy classification objective $L_T$ to supervise the training of $\phi_T$ and $h_T$, estimating $p(z_T|x)$ and $p(y|z_T)$ in Eq. (4), respectively. We place no additional constraints on $L_T$, so it can be replaced by any other task-specific loss. Finally, we apply the column and row orthogonality losses $L_{corth}$ and $L_{rorth}$ of Sections 2.2 and 2.3 to the representations, with $\mathcal{S}_A$ and $Z_A$ detached, to approximate independence between $p(z_A|x)$ and $p(z_T|x)$. The overall target objective is given as:

$L_{targ} = L_T + \lambda_c L_{corth} + \lambda_r L_{rorth},$  (7)

where $\lambda_c$ and $\lambda_r$ are hyper-parameters that weigh the orthogonality losses and balance fairness and utility.
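Putting the pieces together, below is a sketch of one target-branch update under Eq. (7), reusing `Branch`, `build_sensitive_space`, `column_orth_loss`, and `row_orth_loss` from the sketches above; the frozen pre-trained sensitive branch and the values $\lambda_c = 80$, $\lambda_r = 500$ follow Sec. 3.1, while the function itself is illustrative:

```python
import torch
import torch.nn.functional as F

lambda_c, lambda_r = 80.0, 500.0  # hyper-parameters from Sec. 3.1

def target_step(x, y, target_branch, sensitive_branch, S_A, optimizer):
    z_T, (logit_T,) = target_branch(x)   # estimates p(z_T|x) and p(y|z_T)
    with torch.no_grad():                # Z_A detached: no gradient to phi_A
        z_A, _ = sensitive_branch(x)
    Z_T, Z_A = z_T.T, z_A.T              # d x batch matrices
    loss = (F.binary_cross_entropy_with_logits(logit_T.squeeze(1), y.float())  # L_T
            + lambda_c * column_orth_loss(Z_T, S_A)                            # L_corth
            + lambda_r * row_orth_loss(Z_T, Z_A))                              # L_rorth
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```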

3 Experiments

Table 1: CheXpert dataset statistics and group positive rates $p(y=1|a)$ for pleural effusion with respect to the three sensitive attributes race, sex, and age.

Dataset     #Samples   Race (White / Non-white / gap)   Sex (Male / Female / gap)   Age (>60 / ≤60 / gap)
Original    127,130    .410 / .393 / .017               .405 / .408 / .003          .440 / .359 / .081
Augmented   88,215     .264 / .386 / .122               .254 / .379 / .125          .264 / .386 / .122

3.1 Setup

Dataset. We adopt the CheXpert dataset [7] to predict pleural effusion in chest X-rays, as it is crucial for the diagnosis of chronic obstructive pulmonary disease and has a high incidence. Subgroups are defined on the following binarized sensitive attributes: self-reported race and ethnicity, sex, and age. Note that the data bias (positive rate gap) is insignificant in the original dataset (see Table 1, row "Original"). To demonstrate the effectiveness of bias mitigation methods, we amplify the data bias by (1) dividing the data into groups according to the conjunction of multi-sensitive labels; (2) calculating the positive rate of each subgroup; and (3) sampling out patients to increase each subgroup's positive rate gap to 0.12 (see Table 1, row "Augmented"). We resize all images to $224 \times 224$ and split the dataset into a 15% test set and an 85% 5-fold cross-validation set.
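For illustration only, a simplified sketch of step (3) for one binarized attribute is given below; it assumes a pandas DataFrame with binary "label" and attribute columns (column names ours) and drops positives from the lower-rate group, which is one of several possible ways to reach the 0.12 gap:

```python
import pandas as pd

def amplify_gap(df: pd.DataFrame, attr: str, gap: float = 0.12) -> pd.DataFrame:
    """Drop positive samples from the lower-positive-rate group of `attr`
    until the positive-rate gap between its two groups is roughly `gap`."""
    g0, g1 = df[df[attr] == 0], df[df[attr] == 1]
    p0, p1 = g0["label"].mean(), g1["label"].mean()
    low, keep, target = (g0, g1, p1 - gap) if p0 < p1 else (g1, g0, p0 - gap)
    pos, neg = low[low["label"] == 1], low[low["label"] == 0]
    # choose n_pos so that n_pos / (n_pos + len(neg)) == target
    n_pos = int(target * len(neg) / (1.0 - target))
    return pd.concat([keep, neg, pos.sample(n=min(n_pos, len(pos)), random_state=0)])
```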

Evaluation metrics. We use the area under the ROC curve (AUC) to evaluate classifier utility. To measure fairness, we follow [13] and compute the subgroup disparity with respect to ED (denoted $\Delta_{\mathrm{ED}}$, based on the true positive rate (TPR) and true negative rate (TNR)) in Eq. (1) as:

$\Delta_{\mathrm{ED}} = \max_{y \in \mathcal{Y},\, \pi_1, \pi_2 \in \mathcal{A}} \left| P(\widehat{Y}=y \mid A=\pi_1, Y=y) - P(\widehat{Y}=y \mid A=\pi_2, Y=y) \right|.$  (8)

We also follow [20] and compare the subgroup disparity in AUC (denoted $\Delta_{\mathrm{AUC}}$), which gives a threshold-free fairness metric. Note that we evaluate disparities both jointly and individually: joint disparities assess multi-sensitive fairness by computing the disparity across subgroups defined by the combination of multiple sensitive attributes $\mathcal{A}$, whereas individual disparities are computed for a specific binary sensitive attribute $A_i$.
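A small sketch of computing the joint $\Delta_{\mathrm{ED}}$ of Eq. (8), assuming numpy arrays of binary labels/predictions and an integer subgroup id per sample that encodes the conjunction of the binarized attributes (names ours):

```python
import numpy as np

def ed_disparity(y_true: np.ndarray, y_pred: np.ndarray, subgroup: np.ndarray) -> float:
    """Eq. (8): the largest TPR or TNR gap over all pairs of subgroups."""
    tprs, tnrs = [], []
    for g in np.unique(subgroup):
        m = subgroup == g
        tprs.append((y_pred[m & (y_true == 1)] == 1).mean())  # P(Yhat=1|A=g,Y=1)
        tnrs.append((y_pred[m & (y_true == 0)] == 0).mean())  # P(Yhat=0|A=g,Y=0)
    return max(max(tprs) - min(tprs), max(tnrs) - min(tnrs))
```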

Table 2: Comparison of predicting pleural effusion on the CheXpert dataset. We report the mean (standard deviation) over 5-fold models trained with multiple sensitive attributes. AUC (higher is better) is the utility metric; fairness is evaluated using subgroup disparities (lower is better) defined on the multi-sensitive attributes {race, sex, age} jointly and on each attribute individually.

Methods       AUC           Joint                       Race                        Sex                         Age
                            Δ_AUC         Δ_ED          Δ_AUC         Δ_ED          Δ_AUC         Δ_ED          Δ_AUC         Δ_ED
ERM [17]      0.863 (.005)  0.119 (.017)  0.224 (.013)  0.018 (.009)  0.055 (.017)  0.046 (.008)  0.142 (.014)  0.023 (.004)  0.038 (.010)
G-DRO [14]    0.854 (.004)  0.101 (.012)  0.187 (.034)  0.015 (.003)  0.048 (.014)  0.034 (.010)  0.105 (.025)  0.035 (.002)  0.051 (.010)
JTT [9]       0.834 (.020)  0.103 (.017)  0.166 (.023)  0.019 (.008)  0.056 (.016)  0.026 (.002)  0.079 (.004)  0.017 (.006)  0.030 (.007)
Adv [18]      0.854 (.002)  0.089 (.009)  0.130 (.018)  0.017 (.004)  0.027 (.009)  0.022 (.003)  0.039 (.008)  0.016 (.004)  0.023 (.004)
BR-Net [1]    0.849 (.001)  0.113 (.025)  0.200 (.023)  0.018 (.008)  0.051 (.013)  0.037 (.012)  0.109 (.025)  0.027 (.006)  0.039 (.006)
PARADE [4]    0.857 (.002)  0.103 (.022)  0.193 (.032)  0.017 (.002)  0.052 (.010)  0.042 (.006)  0.104 (.023)  0.026 (.006)  0.031 (.011)
Orth [15]     0.856 (.007)  0.084 (.022)  0.177 (.016)  0.012 (.005)  0.045 (.012)  0.022 (.009)  0.083 (.012)  0.025 (.006)  0.032 (.005)
FCRO (ours)   0.858 (.001)  0.057 (.022)  0.107 (.013)  0.012 (.003)  0.033 (.008)  0.015 (.004)  0.024 (.008)  0.013 (.004)  0.019 (.006)

Implementation details. All methods use the same training protocol. We choose DenseNet-121 [6] as the backbone but replace its final layer with a linear layer to extract 128-dimensional representations. The optimizer is Adam with a learning rate of $1 \times 10^{-4}$ and a weight decay of $4 \times 10^{-4}$. We train for 40 epochs with a batch size of 128. We sweep a range of hyper-parameters for each method and empirically set $\lambda_c = 80$, $\lambda_r = 500$, and $k = 3$ for FCRO. We train models over 5 folds with different random seeds. In each fold, we rank all validation checkpoints by utility and, among the top 5, select the model with the lowest average $\Delta_{\mathrm{ED}}$ over the sensitive attributes.

Baselines. We compare our method with (i) G-DRO [14] and (ii) JTT [9], methods that seek low worst-group error via minimax optimization over group fairness and target task error; they can naturally be regarded as multi-sensitive fairness methods by defining subgroups over the conjunctions of multi-sensitive attributes. We also extend recent state-of-the-art fair representation learning methods from single sensitive attributes to multiple ones and compare with them, including (iii) Adv [18] and (iv) BR-Net [1], which achieve fair representations via disentanglement using adversarial training; (v) PARADE [4], a state-of-the-art method that adversarially eliminates mutual information between target and sensitive attribute representations; and (vi) Orth [15], the recent work closest to ours, which hard-codes orthogonal means for the sensitive and target prior distributions and re-parameterizes the encoder outputs on these orthogonal priors. Besides, we report (vii) ERM [17], the vanilla classifier trained without any bias mitigation techniques.

Refer to caption
Figure 3: (a) Subgroup calibration curves. We report the mean (line) and standard deviation (surrounding shadow) over subgroups defined by the conjunction of race, sex, and age; larger shadow areas represent more severe unfairness. (b) Class activation maps [2] generated by vanilla ERM [17] and FCRO (ours).

3.2 Comparison with Baselines

Quantitative results. We summarize the quantitative comparisons in Table 2. All bias mitigation methods improve fairness over ERM [17] at some cost of utility. While maintaining considerable classification accuracy, FCRO achieves significant fairness improvements both jointly and individually, demonstrating the effectiveness of our representation-orthogonality motivation. Compared with the best baseline performance in each metric, FCRO reduces subgroup classification disparity by 2.7% in joint $\Delta_{\mathrm{AUC}}$ and 2.3% in joint $\Delta_{\mathrm{ED}}$, and gains 0.5% in $\Delta_{\mathrm{AUC}}$ and 0.4% in $\Delta_{\mathrm{ED}}$ on average over the three sensitive attributes. As medical applications are sensitive to classification thresholds, we further show calibration curves, with the mean and standard deviation over subgroups defined on the conjunction of multiple sensitive attributes, in Fig. 3(a). The vanilla ERM [17] suffers from biased calibration among subgroups. Fairness algorithms help mitigate this, and FCRO shows the smallest deviation across subgroups and the most trustworthy classification.

Qualitative results. We present class activation maps [2] in Fig. 3(b). We observe that the vanilla ERM [17] model tends to look for sensitive evidence outside the lung regions, e.g., the breast, which introduces unfairness. FCRO focuses only on the pathology-related region for fair pleural effusion classification, which visually confirms the validity of our method.

Refer to caption
Figure 4: (a) Fairness-utility trade-off; the perfect point lies in the top-left corner. We report ablations and Pareto fronts from the hyper-parameter sweep. (b) Fairness when trained with various combinations of the three sensitive attributes: race (R), sex (S), and age (A). (c) AUC convergence with different ranks $k$ of $\mathcal{S}_A$. (d) Fairness and total variance (the percentage of sensitive information captured by $\mathcal{S}_A$) under different $k$.

3.3 Ablation Studies

Loss modules and hyperparameters. To evaluate the impact of the loss weights and the critical components of FCRO, we examine different hyper-parameter choices and alternative designs. As depicted in Fig. 4(a), we showcase the significance of the key components and the Pareto front (i.e., the set of optimal points) resulting from a sweep of the hyper-parameters $\lambda_c$ and $\lambda_r$. Increasing $\lambda_c$ and $\lambda_r$ results in a small decrease in AUC (0.5% and 0.12%, respectively), which is offset by significant gains in fairness (4% and 1.5%, respectively). This observation confirms that multiple sensitive attributes can be managed effectively without sacrificing accuracy. Besides, removing either the column or the row space orthogonality degrades joint $\Delta_{\mathrm{ED}}$ by 2.4% and 1.8%, respectively, though the ablated variants remain competitive. We also observe that the accumulative space construction introduced in Section 2.2 achieves comparable performance.

Training with different sensitive attributes. We present an in-depth ablation on multiple sensitive attributes in Fig. 4(b), where models are trained with various numbers and permutations of attributes. All methods perform reasonably better than ERM when trained with a single sensitive attribute, but FCRO brings significantly more benefit when trained with the union of discriminated attributes (e.g., Sex $\times$ Age), which consolidates FCRO's ability to handle multi-sensitive-attribute fairness. FCRO stands out among all methods.

Different rank $k$ for $\widetilde{\mathcal{S}}_A$. We show the effect of choosing different $k$ for column space orthogonality. As shown in Fig. 4(c), a lower rank $k$ benefits the convergence of the model and thus improves accuracy, which validates our insight in Section 2.2 that a lower sensitive-space rank improves the utility of target representations. In Fig. 4(d), we show that $k = 3$ is enough to capture over 95% of the sensitive information, and further increasing it brings little additional fairness benefit; we therefore choose $k = 3$ to achieve the best utility-fairness trade-off.

4 Conclusion and Future Work

This work studies an essential yet under-explored fairness problem in medical image classification, where each sample carries a set of sensitive attributes. We formulate this problem mathematically and propose a novel fair representation learning algorithm named FCRO, which pursues orthogonality between sensitive and target representations. Extensive experiments on a large public chest X-ray dataset demonstrate that FCRO significantly improves the fairness-utility trade-off both jointly and individually. Moreover, we show that FCRO performs stably under different situations through in-depth ablation studies. For future work, we plan to test the scalability of FCRO on an extremely large number of sensitive attributes.

Acknowledgments

This work is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), Public Safety Canada (NS-5001-22170), in part by NVIDIA Hardware Award, and in part by the Hong Kong Innovation and Technology Commission under Project No. ITS/238/21.

References

  • [1] Adeli, E., Zhao, Q., Pfefferbaum, A., Sullivan, E.V., Fei-Fei, L., Niebles, J.C., Pohl, K.M.: Representation learning with statistical independence to mitigate bias. In: IEEE/CVF Winter Conference on Applications of Computer Vision (2021)
  • [2] Chattopadhay, A., Sarkar, A., Howlader, P., Balasubramanian, V.N.: Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In: IEEE Winter Conference on Applications of Computer Vision (2018)
  • [3] Creager, E., Madras, D., Jacobsen, J.H., Weis, M., Swersky, K., Pitassi, T., Zemel, R.: Flexibly fair representation learning by disentanglement. In: International conference on machine learning. pp. 1436–1445. PMLR (2019)
  • [4] Dullerud, N., Roth, K., Hamidieh, K., Papernot, N., Ghassemi, M.: Is fairness only metric deep? evaluating and addressing subgroup gaps in deep metric learning. The International Conference on Learning Representations (2022)
  • [5] Glocker, B., Jones, C., Bernhardt, M., Winzeck, S.: Algorithmic encoding of protected characteristics in image-based models for disease detection. arXiv preprint arXiv:2110.14755 (2021)
  • [6] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708 (2017)
  • [7] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 590–597 (2019)
  • [8] Lin, S., Yang, L., Fan, D., Zhang, J.: Trgp: Trust region gradient projection for continual learning. International Conference on Learning Representations (2022)
  • [9] Liu, E.Z., Haghgoo, B., Chen, A.S., Raghunathan, A., Koh, P.W., Sagawa, S., Liang, P., Finn, C.: Just train twice: Improving group robustness without training group information. In: International Conference on Machine Learning (2021)
  • [10] Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
  • [11] Madras, D., Creager, E., Pitassi, T., Zemel, R.: Learning adversarially fair and transferable representations. In: International Conference on Machine Learning. pp. 3384–3393. PMLR (2018)
  • [12] Madras, D., Creager, E., Pitassi, T., Zemel, R.: Fairness through causal awareness: Learning causal latent-variable models for biased data. In: Proceedings of the Conference on Fairness, Accountability, and Transparency. pp. 349–358 (2019)
  • [13] Roh, Y., Lee, K., Whang, S.E., Suh, C.: Fairbatch: Batch selection for model fairness. International Conference on Learning Representations (2020)
  • [14] Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. International Conference on Learning Representations (2019)
  • [15] Sarhan, M.H., Navab, N., Eslami, A., Albarqouni, S.: Fairness by learning orthogonal disentangled representations. In: European Conference on Computer Vision. pp. 746–761. Springer (2020)
  • [16] Shui, C., Xu, G., Chen, Q., Li, J., Ling, C., Arbel, T., Wang, B., Gagné, C.: On learning fairness and accuracy on multiple subgroups. Advances in Neural Information Processing Systems (2022)
  • [17] Vapnik, V.: Principles of risk minimization for learning theory. In: Advances in Neural Information Processing Systems. vol. 4 (1991)
  • [18] Wadsworth, C., Vera, F., Piech, C.: Achieving fairness through adversarial learning: an application to recidivism prediction. arXiv preprint arXiv:1807.00199 (2018)
  • [19] Xu, Z., Li, J., Yao, Q., Li, H., Shi, X., Zhou, S.K.: A survey of fairness in medical image analysis: Concepts, algorithms, evaluations, and challenges. arXiv preprint arXiv:2209.13177 (2022)
  • [20] Zhang, H., Dullerud, N., Roth, K., Oakden-Rayner, L., Pfohl, S., Ghassemi, M.: Improving the fairness of chest x-ray classifiers. In: Conference on Health, Inference, and Learning. pp. 204–233. PMLR (2022)