
* These authors contributed equally to this work.

1 Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC, Canada
  {dwenlong,xiaoxiao.li}@ece.ubc.ca
2 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
  {yzhong22,qdou}@cse.cuhk.edu.hk

On Fairness of Medical Image Classification
with Multiple Sensitive Attributes via
Learning Orthogonal Representations

Wenlong Deng^1 (ORCID: 0009-0002-0545-6384), Yuan Zhong^2 (ORCID: 0009-0004-4909-8763), Qi Dou^2 (ORCID: 0000-0002-3416-9950), Xiaoxiao Li^1 (ORCID: 0000-0002-8833-0244)
Abstract

Mitigating the discrimination of machine learning models has gained increasing attention in medical image analysis. However, few works focus on fair treatment of patients with multiple sensitive demographic attributes, which is a crucial yet challenging problem for real-world clinical applications. In this paper, we propose a novel method for fair representation learning with respect to multiple sensitive attributes. We pursue independence between target and multi-sensitive representations by achieving orthogonality in the representation space. Concretely, we enforce column space orthogonality by keeping target information in the complement of a low-rank sensitive space. Furthermore, in the row space, we encourage the feature dimensions of target and sensitive representations to be orthogonal. The effectiveness of the proposed method is demonstrated with extensive experiments on the CheXpert dataset. To the best of our knowledge, this is the first work to mitigate unfairness with respect to multiple sensitive attributes in the field of medical imaging. The code is available at https://github.com/ubc-tea/FCRO-Fair-Classification-Orthogonal-Representation.

1 Introduction

With the increasing application of artificial intelligence systems to medical image diagnosis, it is important to ensure the fairness of image classification models and to investigate concealed model biases that may be encountered in complex real-world situations. Unfortunately, sensitive attributes (e.g., race and gender) accompanying medical images are prone to being inherently encoded by machine learning models [5] and can affect the model's discrimination properties [20]. Recently, fair representation learning has shown great potential, as it acts as a group-parity bottleneck that mitigates discrimination when generalized to downstream tasks. Existing methods [1, 4, 15, 16] have studied parity between privileged and unprivileged groups with respect to a single sensitive attribute, neglecting flexibility with respect to multiple sensitive attributes, whose conjunctions of unprivileged values might worsen discrimination. This is a crucial yet challenging problem hindering the applicability of machine learning models, especially for medical image classification, where patients often carry multiple demographic attributes.

Refer to caption
Figure 1: t-SNE [10] visualizations of (a) sensitive attribute and (b) target representations learned by our proposed method FCRO on the CheXpert dataset [7]. The sensitive embeddings capture subgroup variance. FCRO enforces fair classification on the target task by learning orthogonal target representations that are invariant across different demographic attribute combinations.

To date, it remains challenging to learn target-related representations that are both fair and flexible with respect to multiple sensitive attributes, despite some promising recent investigations. For instance, adversarial methods [1, 11] produce robust representations by formulating a min-max game between an encoder that learns class-related representations and an adversary that removes sensitive information from them. Disentanglement-based methods [4] achieve separation by minimizing the mutual information between target and sensitive attribute representations. These methods typically rely on carefully designed objectives; extending them to the multi-attribute setting requires additional loss functions, which must handle gradient conflict or interference. Methods using variational autoencoders [3] decompose the latent distributions of target and sensitive factors and penalize their correlation for disentanglement. However, aligning the distributions of the sensitive attributes is difficult or even intractable given the complex combinations of multiple factors. Besides, some fairness methods based on causal inference [12] or bi-level optimization [16] also learn debiased representations, but remain inflexible to multiple attributes. Recently, disentanglement has been interpreted as orthogonality of a decomposed target-sensitive latent representation pair [15], where a pair of orthogonal subspaces is predefined for the target and sensitive attribute representations. However, in the multi-sensitive-attribute setting, the dimension of the target space would be continuously compressed, and how to resolve this remains an open problem.

In this paper, we propose a new method to achieve Fairness via Column-Row space Orthogonality (dubbed FCRO) by learning fair representations with respect to multiple sensitive attributes for medical image classification. FCRO handles multiple sensitive attributes by encoding them into a unified attribute representation. It achieves a favorable trade-off between fairness and data utility (see illustrations in Fig. 1) via orthogonality in both the column and row spaces. Our contributions are summarized as follows: (1) We tackle the practical and challenging problem of fairness given multiple sensitive attributes for medical image classification. To the best of our knowledge, this is the first work to study fairness with respect to multiple sensitive attributes in the field of medical imaging. (2) We relax the independence of target and sensitive attribute representations to orthogonality, which is achieved by our proposed novel column and row losses. (3) We conduct extensive experiments on the CheXpert [7] dataset with over 80,000 chest X-rays. FCRO achieves a superior fairness-utility trade-off over state-of-the-art methods with respect to the multiple sensitive attributes race, sex, and age.

2 Methodology

2.1 Problem Formulation

Notations. We consider group fairness in this work. While preserving model utility for the privileged group, group fairness requires the equality of certain statistics, such as the predictive rate, across groups. We consider a binary classification problem with column-vector inputs $x \in \mathcal{X}$ and labels $y \in \mathcal{Y} = \{0, 1\}$. The multi-sensitive attribute $a \in \mathcal{A}$ is a vector of $m$ binary attributes sampled from their conjunction, i.e., the Cartesian product $\mathcal{A} = \prod_{i \in [m]} A_i$, where $[m] = \{1, \dots, m\}$ and $A_i \in \{0, 1\}$ is the $i$-th sensitive attribute. Our training data consist of tuples $\mathcal{D} = \{(x, y, a)\}$. We denote the classification model $f(x) = h_T(\phi_T(x))$ that predicts a class label $\widehat{y}$ given $x$, where $\phi_T: \mathcal{X} \mapsto \mathbb{R}^d$ is a feature encoder for target embeddings and $h_T: \mathbb{R}^d \mapsto \mathbb{R}$ is a scoring function. Similarly, we consider a sensitive attribute model $g(x) = \{h_{A_1}(\phi_A(x)), \dots, h_{A_m}(\phi_A(x))\}$ that predicts the sensitive attributes associated with input $x$. Given $n$ samples, the input data matrix is $X = [x_1, \dots, x_n]$, and we denote the feature representations $Z_T = \phi_T(X)$, $Z_A = \phi_A(X) \in \mathbb{R}^{d \times n}$.

Refer to caption
Figure 2: Overview of our proposed method FCRO. (a) The graphical model of orthogonal representation learning for fair medical image classification with multiple sensitive attributes. (b) The novel column-row space orthogonality. In the column space, we encourage the target model to learn representations in the complement of a low-rank sensitive space. In the row space, we enforce each row vector (feature dimension) of the target and sensitive attribute representations to be orthogonal to each other. (c) The overall training pipeline. We use a pre-trained multi-sensitive branch and propagate orthogonal gradients to the target encoder $\phi_T$.

Fairness with multiple sensitive attributes. A classifier predicts $y$ given an input $x$ by estimating the posterior probability $p(y|x)$. When inputs that are affected by their associated attributes (i.e., $\{A_1, \dots, A_m\} \rightarrow X$) are fed into the network, the posterior becomes $p(y|x, a)$. Since biased information from $\mathcal{A}$ is encoded, this can lead to unfair predictions. For example, in diagnosing a disease with sensitive attributes age, sex, and race, a biased classifier yields $p(\widehat{y} \mid A = (\mathrm{male}, \mathrm{old}, \mathrm{Black})) \neq p(\widehat{y} \mid A = (\mathrm{female}, \mathrm{young}, \mathrm{White}))$. In this work, we focus on equalized odds (ED), a commonly used and crucial criterion for fair classification in the medical domain [19]. In our case, ED with respect to multiple sensitive attributes can be formulated as follows:

$P(\widehat{Y}=y \mid A=\pi_1, Y=y) = P(\widehat{Y}=y \mid A=\pi_2, Y=y), \quad \forall \pi_1, \pi_2 \in \mathcal{A},\ y \in \mathcal{Y}.$  (1)

A classifier satisfies ED with respect to group $A$ and target $Y$ if the model prediction $\widehat{Y}$ and $A$ are conditionally independent given $Y$, i.e., $\widehat{Y} \perp A \mid Y$.

Fair representation. To enforce the above condition, we follow [15] and introduce a target embedding $z_T$ and multi-attribute embeddings $z_{A_i}$ generated from $x$. As in the causal structure graph of the classifier depicted in Fig. 2(a), our objective is to find a fair model that maximizes the log-likelihood of the distribution $p(y, a|x)$, where:

$p(y, a|x) = \frac{p(y|x,a)\, p(x|a)\, p(a)}{p(x)} = p(y|x)\, p(a|x)$  (2)
$\quad = p(y|z_T)\, p(z_T|x) \prod_{i \in [m]} p(a_i|z_{A_i})\, p(z_{A_i}|x),$  (3)

and we call $z_T$ the fair representation for the target task (e.g., disease diagnosis). We aim to maximize Eq. (3) under the conditional independence constraint to train a fair classifier. Notably, in the multi-sensitive-attribute setting, forcing $z_T$ to be independent of all $z_{A_i}, \forall i \in [m]$ is challenging and even intractable when $m$ is large. Therefore, we propose to encode the multiple sensitive attributes into a single compact encoding $z_A$ that remains predictive for classifying the attributes (i.e., $z_A \rightarrow \{a_1, \dots, a_m\}$). We can then rewrite Eq. (3) as a likelihood with the independence constraint on $z_T$ and $z_A$:

$p(y, a|x) = p(y|z_T)\, p(z_T|x)\, p(a|z_A)\, p(z_A|x).$  (4)

However, maximizing Eq. (4) raises two technical questions:
Q1: How do we satisfy the independence constraint between $z_T$ and $z_A$?
A1: We relax independence to orthogonality in a finite vector space. Unlike the predefined orthogonal space in [15], we enforce orthogonality in both the column spaces (Sec. 2.2) and the row spaces (Sec. 2.3) of $Z_T$ and $Z_A$.
Q2: How do we estimate $p(y|z_T)$, $p(z_T|x)$, $p(a|z_A)$, and $p(z_A|x)$?
A2: We train two convolutional encoders $z_T = \phi_T(x)$ and $z_A = \phi_A(x)$ to approximate $p(z_T|x)$ and $p(z_A|x)$, respectively, and two multi-layer perceptron classifiers $y = h_T(z_T)$ and $a = h_A(z_A)$ to approximate $p(y|z_T)$ and $p(a|z_A)$, respectively (Sec. 2.4). A concrete sketch of this two-branch design is given below.
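As a minimal illustration of A2, the PyTorch sketch below sets up the two branches, assuming the DenseNet-121 backbone and 128-dimensional representations reported in Sec. 3.1; the class and variable names are ours, not those of the released code.

```python
import torch
import torch.nn as nn
from torchvision import models

class Branch(nn.Module):
    """An encoder phi(.) followed by one scoring head h(.) per prediction task."""
    def __init__(self, feat_dim: int = 128, num_heads: int = 1):
        super().__init__()
        backbone = models.densenet121(weights=None)
        in_dim = backbone.classifier.in_features
        backbone.classifier = nn.Linear(in_dim, feat_dim)  # phi: X -> R^d
        self.encoder = backbone
        # one head for the target branch; m heads for the sensitive branch
        self.heads = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(num_heads))

    def forward(self, x):
        z = self.encoder(x)                    # embedding z in R^d (approx. p(z|x))
        return z, [h(z) for h in self.heads]   # scores (approx. p(y|z_T) or p(a_i|z_A))

target_branch = Branch(num_heads=1)     # f(x) = h_T(phi_T(x))
sensitive_branch = Branch(num_heads=3)  # g(x) = {h_{A_i}(phi_A(x))}, m = 3 here
```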

2.2 Column Space Orthogonality

First, we focus on the column space of the target and sensitive attribute representations. Column space orthogonality aims to learn target representations $Z_T$ that fulfill two aims: 1) have the least projection onto the sensitive space $\mathcal{S}_A$, and 2) preserve the representation power to predict $Y$.

Denote the target representation $Z_T = [\widetilde{z}_T^1, \widetilde{z}_T^2, \dots, \widetilde{z}_T^n]$ and the sensitive attribute representation $Z_A = [\widetilde{z}_A^1, \widetilde{z}_A^2, \dots, \widetilde{z}_A^n]$, where $\widetilde{z}^i \in \mathbb{R}^{d \times 1}$ is a column vector for $i \in [n]$. We represent the column spaces of $Z_T$ and $Z_A$ as $\mathcal{S}_T = \mathrm{span}(Z_T)$ and $\mathcal{S}_A = \mathrm{span}(Z_A)$, respectively. Aim 1 can be achieved by forcing $\mathcal{S}_T = \mathcal{S}_A^{\perp}$. Although both $\widetilde{z}_T, \widetilde{z}_A \in \mathbb{R}^d$, their coordinates may not be aligned, as they are generated by two separate encoders. Consequently, for finite $d$ there is no straightforward way to achieve $\mathcal{S}_T \perp \mathcal{S}_A$ by directly constraining $\widetilde{z}_T^i, \widetilde{z}_A^j$ (e.g., forcing $(\widetilde{z}_T^i)^{\top} \widetilde{z}_A^j = 0$). Aim 2 can be achieved by seeking a low-rank representation $\widetilde{\mathcal{S}}_A$ of $\mathcal{S}_A$ with rank $k \ll d$: since $\mathrm{rank}(\mathcal{S}_T) + \mathrm{rank}(\mathcal{S}_A) = d$ when $\mathcal{S}_T = \mathcal{S}_A^{\perp}$ holds, $\widetilde{\mathcal{S}}_A^{\perp}$ is then a high-dimensional space with sufficient representation power for the target embeddings. This is especially important with multiple sensitive attributes, as the total size of the space is $d$, and increasing the number of sensitive attributes would otherwise limit the capacity of $\mathcal{S}_T$ to learn predictive $\widetilde{z}_T$. To this end, we first find the low-rank sensitive attribute representation space $\widetilde{\mathcal{S}}_A$, and then encourage $Z_T$ to lie in its complement $\widetilde{\mathcal{S}}_A^{\perp}$.

Construct the low-rank multi-sensitive space. We apply singular value decomposition (SVD), $Z_A = U_A \Sigma_A V_A^{\top}$, to construct the low-rank space $\widetilde{\mathcal{S}}_A$, where $U_A \in \mathbb{R}^{d \times d}$ and $V_A \in \mathbb{R}^{n \times n}$ are orthogonal matrices whose columns are the left and right singular vectors $u_A^i \in \mathbb{R}^d$ and $v_A^i \in \mathbb{R}^n$, respectively, and $\Sigma_A \in \mathbb{R}^{d \times n}$ is a diagonal matrix of descending non-negative singular values $\{\delta_A^i\}_{i=1}^{\min\{n, d\}}$. We then take the $k$ most important left singular vectors to construct $\widetilde{\mathcal{S}}_A = [u_A^1, \dots, u_A^k]$, where $k$ controls how much sensitive information is captured in $\widetilde{\mathcal{S}}_A$. Notably, $\widetilde{\mathcal{S}}_A$ is agnostic to the number of sensitive attributes, because they share the same $Z_A$. For situations where the whole dataset is not available at once, we follow [8] and select the most important bases from both the bases of previous iterations and the newly constructed ones, yielding an accumulative variant that updates $\widetilde{\mathcal{S}}_A$ iteratively. As we do not observe significant performance differences between these two variants (Fig. 4(a)), we use and refer to the first one in this paper unless otherwise specified.
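A minimal sketch of this construction, assuming $Z_A$ is gathered as a $d \times n$ matrix over the training set (the function name is ours):

```python
import torch

def build_sensitive_space(Z_A: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Return the top-k left singular vectors of Z_A (d x n) as a d x k basis."""
    U, S, Vh = torch.linalg.svd(Z_A, full_matrices=False)  # Z_A = U diag(S) Vh
    return U[:, :k].detach()  # S~_A is kept fixed (detached) during target training
```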

Column orthogonal loss. With the low-rank space $\widetilde{\mathcal{S}}_A$ for multiple sensitive attributes, we encourage $\phi_T$ to learn representations in its complement $\widetilde{\mathcal{S}}_A^{\perp}$. Note that $\widetilde{\mathcal{S}}_A^{\perp}$ can also be interpreted as the kernel of the projection onto $\widetilde{\mathcal{S}}_A$, i.e., $\widetilde{\mathcal{S}}_A^{\perp} = \mathrm{Ker}(\mathrm{proj}_{\widetilde{\mathcal{S}}_A})$. Therefore, we minimize the projection of $Z_T$ onto $\widetilde{\mathcal{S}}_A$, which defines the column orthogonal loss:

$L_{corth} = c(Z_T, \widetilde{\mathcal{S}}_A) = \sum_{i=1}^{n} \frac{\left\| \widetilde{\mathcal{S}}_A^{\top} \widetilde{z}_T^{\,i} \right\|_2^2}{\left\| \widetilde{z}_T^{\,i} \right\|_2^2}.$  (5)

Since $\widetilde{\mathcal{S}}_A$ is a low-rank space, $\widetilde{\mathcal{S}}_A^{\perp}$ leaves abundant freedom for $\phi_T$ to extract target information, thus preserving predictive ability.
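Eq. (5) admits a direct vectorized form; below is a sketch under the same $d \times n$ layout (the function name is ours):

```python
import torch

def column_orth_loss(Z_T: torch.Tensor, S_A: torch.Tensor) -> torch.Tensor:
    """Eq. (5): squared norm of each column's projection onto S~_A (d x k),
    normalized by that column's squared norm; Z_T is d x n."""
    num = (S_A.T @ Z_T).pow(2).sum(dim=0)         # ||S~_A^T z_T^i||_2^2 per column
    den = Z_T.pow(2).sum(dim=0).clamp_min(1e-12)  # ||z_T^i||_2^2 per column
    return (num / den).sum()
```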

2.3 Row Space Orthogonality

Next, we study the row space of the target and sensitive attribute representations. Row space orthogonality aims to learn target representations $Z_T$ that have the least projection onto the sensitive row space $\widehat{\mathcal{S}}_A$; in other words, we want orthogonality in each feature dimension between $Z_T$ and $Z_A$. Denote the target representation $Z_T = [\widehat{z}_T^1; \widehat{z}_T^2; \dots; \widehat{z}_T^d]$ and the sensitive attribute representation $Z_A = [\widehat{z}_A^1; \widehat{z}_A^2; \dots; \widehat{z}_A^d]$, where $\widehat{z}^i \in \mathbb{R}^{1 \times n}$ is a row vector for $i \in [d]$. We represent the row spaces of the target and sensitive attribute representations as $\widehat{\mathcal{S}}_T = \mathrm{span}(Z_T^{\top})$ and $\widehat{\mathcal{S}}_A = \mathrm{span}(Z_A^{\top})$, respectively. Since the coordinates (i.e., the sample indices) of $\widehat{z}_A$ and $\widehat{z}_T$ are aligned, $\widehat{\mathcal{S}}_T \perp \widehat{\mathcal{S}}_A$ can be enforced directly by pushing $\widehat{z}_T^i$ and $\widehat{z}_A^j$ to be orthogonal for arbitrary $i, j \in [d]$.

Unlike in the column space, orthogonality here does not affect utility, since a row vector $\widehat{z}_T$ is not directly correlated with the target $y$. Specifically, we let the pairwise row vectors of $Z_T = [\widehat{z}_T^1; \widehat{z}_T^2; \dots; \widehat{z}_T^d]$ and $Z_A = [\widehat{z}_A^1; \widehat{z}_A^2; \dots; \widehat{z}_A^d]$ have zero inner products: for any $i, j \in [d]$, we minimize $\langle \widehat{z}_T^i, \widehat{z}_A^j \rangle$. We slightly modify the orthogonality by additionally subtracting the mean vectors $\mu_T$ and $\mu_A$ from $Z_T$ and $Z_A$, respectively, where $\mu = \mathbb{E}_{i \in [d]}\, \widehat{z}^i \in \mathbb{R}^{1 \times n}$. The orthogonality loss then naturally becomes a covariance loss:

$L_{rorth} = r(Z_T, Z_A) = \frac{1}{d^2} \sum_{i=1}^{d} \sum_{j=1}^{d} \left[ (\widehat{z}_T^{\,i} - \mu_T)(\widehat{z}_A^{\,j} - \mu_A)^{\top} \right]^2.$  (6)

In this way, the loss encourages each feature of $Z_T$ to be independent of the features in $Z_A$, thus suppressing the sensitive-encoded covariances that cause unfairness.
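Eq. (6) reduces to penalizing the squared entries of a $d \times d$ cross-covariance matrix; a sketch follows (the function name is ours, and $Z_A$ is detached as described in Sec. 2.4):

```python
import torch

def row_orth_loss(Z_T: torch.Tensor, Z_A: torch.Tensor) -> torch.Tensor:
    """Eq. (6): mean squared inner product between the mean-centered rows of
    Z_T and Z_A (both d x n)."""
    Zt = Z_T - Z_T.mean(dim=0, keepdim=True)  # subtract mu_T (1 x n)
    Za = Z_A - Z_A.mean(dim=0, keepdim=True)  # subtract mu_A (1 x n)
    cov = Zt @ Za.T                           # d x d matrix of row inner products
    return cov.pow(2).mean()                  # (1/d^2) * sum of squared entries
```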

2.4 Overall Training

In this section, we introduce the overall training scheme shown in Fig. 2(c). For the sensitive branch, since training a shared encoder risks sensitive-related features being used for target classification [4], we pre-train a separate $\{\phi_A, h_{A_1}, \dots, h_{A_m}\}$ for the multiple sensitive attributes using the sensitive loss $L_{sens} = \frac{1}{m} \sum_{i \in [m]} L_{A_i}$, where $L_{A_i}$ is the cross-entropy loss for the $i$-th sensitive attribute. Hence $p(z_A|x)$ and $p(a|z_A)$ in Eq. (4) can be obtained. The multi-sensitive space $\mathcal{S}_A$ is then constructed over the training data as in Section 2.2. For the target branch, we use a cross-entropy classification objective $L_T$ to supervise the training of $\phi_T$ and $h_T$, estimating $p(z_T|x)$ and $p(y|z_T)$ in Eq. (4), respectively. We place no additional constraints on $L_T$, so it can be replaced by any other task-specific loss. Finally, we apply the column and row orthogonality losses $L_{corth}$ and $L_{rorth}$ of Sections 2.2 and 2.3 to the representations, with $\mathcal{S}_A$ and $Z_A$ detached, to approximate independence between $p(z_A|x)$ and $p(z_T|x)$. The overall target objective is given as:

$L_{targ} = L_T + \lambda_c L_{corth} + \lambda_r L_{rorth},$  (7)

where $\lambda_c$ and $\lambda_r$ are hyper-parameters that weigh the orthogonality losses and balance fairness and utility.
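Putting the pieces together, below is a sketch of one target-branch update under Eq. (7), reusing `Branch`, `build_sensitive_space`, `column_orth_loss`, and `row_orth_loss` from the sketches above; the frozen pre-trained sensitive branch and the values $\lambda_c = 80$, $\lambda_r = 500$ follow Sec. 3.1, while the function itself is illustrative:

```python
import torch
import torch.nn.functional as F

lambda_c, lambda_r = 80.0, 500.0  # hyper-parameters from Sec. 3.1

def target_step(x, y, target_branch, sensitive_branch, S_A, optimizer):
    z_T, (logit_T,) = target_branch(x)   # estimates p(z_T|x) and p(y|z_T)
    with torch.no_grad():                # Z_A detached: no gradient to phi_A
        z_A, _ = sensitive_branch(x)
    Z_T, Z_A = z_T.T, z_A.T              # d x batch matrices
    loss = (F.binary_cross_entropy_with_logits(logit_T.squeeze(1), y.float())  # L_T
            + lambda_c * column_orth_loss(Z_T, S_A)                            # L_corth
            + lambda_r * row_orth_loss(Z_T, Z_A))                              # L_rorth
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```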

3 Experiments

Table 1: CheXpert dataset statistics and group positive rates $p(y=1|a)$ for pleural effusion with respect to the three sensitive attributes race, sex, and age.

Dataset     #Samples   Race (White / Non-white / gap)   Sex (Male / Female / gap)   Age (>60 / ≤60 / gap)
Original    127,130    .410 / .393 / .017               .405 / .408 / .003          .440 / .359 / .081
Augmented   88,215     .264 / .386 / .122               .254 / .379 / .125          .264 / .386 / .122

3.1 Setup

Dataset. We adopt the CheXpert dataset [7] to predict pleural effusion in chest X-rays, as it is crucial for the diagnosis of chronic obstructive pulmonary disease and has a high incidence. Subgroups are defined on the following binarized sensitive attributes: self-reported race and ethnicity, sex, and age. Note that the data bias (positive rate gap) is insignificant in the original dataset (see Table 1, row "Original"). To demonstrate the effectiveness of bias mitigation methods, we amplify the data bias by (1) dividing the data into groups according to the conjunction of multi-sensitive labels; (2) calculating the positive rate of each subgroup; and (3) sampling out patients to increase each subgroup's positive rate gap to 0.12 (see Table 1, row "Augmented"). We resize all images to $224 \times 224$ and split the dataset into a 15% test set and an 85% 5-fold cross-validation set.
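For illustration only, a simplified sketch of step (3) for one binarized attribute is given below; it assumes a pandas DataFrame with binary "label" and attribute columns (column names ours) and drops positives from the lower-rate group, which is one of several possible ways to reach the 0.12 gap:

```python
import pandas as pd

def amplify_gap(df: pd.DataFrame, attr: str, gap: float = 0.12) -> pd.DataFrame:
    """Drop positive samples from the lower-positive-rate group of `attr`
    until the positive-rate gap between its two groups is roughly `gap`."""
    g0, g1 = df[df[attr] == 0], df[df[attr] == 1]
    p0, p1 = g0["label"].mean(), g1["label"].mean()
    low, keep, target = (g0, g1, p1 - gap) if p0 < p1 else (g1, g0, p0 - gap)
    pos, neg = low[low["label"] == 1], low[low["label"] == 0]
    # choose n_pos so that n_pos / (n_pos + len(neg)) == target
    n_pos = int(target * len(neg) / (1.0 - target))
    return pd.concat([keep, neg, pos.sample(n=min(n_pos, len(pos)), random_state=0)])
```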

Evaluation metrics. We use the area under the ROC curve (AUC) to evaluate classifier utility. To measure fairness, we follow [13] and compute the subgroup disparity with respect to ED (denoted $\Delta_{\mathrm{ED}}$, based on the true positive rate (TPR) and true negative rate (TNR)) in Eq. (1) as:

$\Delta_{\mathrm{ED}} = \max_{y \in \mathcal{Y},\, \pi_1, \pi_2 \in \mathcal{A}} \left| P(\widehat{Y}=y \mid A=\pi_1, Y=y) - P(\widehat{Y}=y \mid A=\pi_2, Y=y) \right|.$  (8)

We also follow [20] and compare the subgroup disparity in AUC (denoted $\Delta_{\mathrm{AUC}}$), which gives a threshold-free fairness metric. Note that we evaluate disparities both jointly and individually: joint disparities assess multi-sensitive fairness by computing the disparity across subgroups defined by the combination of multiple sensitive attributes $\mathcal{A}$, whereas individual disparities are computed for a specific binary sensitive attribute $A_i$.
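A small sketch of computing the joint $\Delta_{\mathrm{ED}}$ of Eq. (8), assuming numpy arrays of binary labels/predictions and an integer subgroup id per sample that encodes the conjunction of the binarized attributes (names ours):

```python
import numpy as np

def ed_disparity(y_true: np.ndarray, y_pred: np.ndarray, subgroup: np.ndarray) -> float:
    """Eq. (8): the largest TPR or TNR gap over all pairs of subgroups."""
    tprs, tnrs = [], []
    for g in np.unique(subgroup):
        m = subgroup == g
        tprs.append((y_pred[m & (y_true == 1)] == 1).mean())  # P(Yhat=1|A=g,Y=1)
        tnrs.append((y_pred[m & (y_true == 0)] == 0).mean())  # P(Yhat=0|A=g,Y=0)
    return max(max(tprs) - min(tprs), max(tnrs) - min(tnrs))
```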

Table 2: Comparison of predicting pleural effusion on the CheXpert dataset. We report the mean (standard deviation) over 5-fold models trained with multiple sensitive attributes. AUC (higher is better) is the utility metric; fairness is evaluated using subgroup disparities (lower is better) defined on the multi-sensitive attributes {race, sex, age} jointly and on each attribute individually.

Methods       AUC           Joint                       Race                        Sex                         Age
                            Δ_AUC         Δ_ED          Δ_AUC         Δ_ED          Δ_AUC         Δ_ED          Δ_AUC         Δ_ED
ERM [17]      0.863 (.005)  0.119 (.017)  0.224 (.013)  0.018 (.009)  0.055 (.017)  0.046 (.008)  0.142 (.014)  0.023 (.004)  0.038 (.010)
G-DRO [14]    0.854 (.004)  0.101 (.012)  0.187 (.034)  0.015 (.003)  0.048 (.014)  0.034 (.010)  0.105 (.025)  0.035 (.002)  0.051 (.010)
JTT [9]       0.834 (.020)  0.103 (.017)  0.166 (.023)  0.019 (.008)  0.056 (.016)  0.026 (.002)  0.079 (.004)  0.017 (.006)  0.030 (.007)
Adv [18]      0.854 (.002)  0.089 (.009)  0.130 (.018)  0.017 (.004)  0.027 (.009)  0.022 (.003)  0.039 (.008)  0.016 (.004)  0.023 (.004)
BR-Net [1]    0.849 (.001)  0.113 (.025)  0.200 (.023)  0.018 (.008)  0.051 (.013)  0.037 (.012)  0.109 (.025)  0.027 (.006)  0.039 (.006)
PARADE [4]    0.857 (.002)  0.103 (.022)  0.193 (.032)  0.017 (.002)  0.052 (.010)  0.042 (.006)  0.104 (.023)  0.026 (.006)  0.031 (.011)
Orth [15]     0.856 (.007)  0.084 (.022)  0.177 (.016)  0.012 (.005)  0.045 (.012)  0.022 (.009)  0.083 (.012)  0.025 (.006)  0.032 (.005)
FCRO (ours)   0.858 (.001)  0.057 (.022)  0.107 (.013)  0.012 (.003)  0.033 (.008)  0.015 (.004)  0.024 (.008)  0.013 (.004)  0.019 (.006)

Implementation details. All methods use the same training protocol. We choose DenseNet-121 [6] as the backbone but replace its final layer with a linear layer to extract 128-dimensional representations. The optimizer is Adam with a learning rate of $1 \times 10^{-4}$ and a weight decay of $4 \times 10^{-4}$. We train for 40 epochs with a batch size of 128. We sweep a range of hyper-parameters for each method and empirically set $\lambda_c = 80$, $\lambda_r = 500$, and $k = 3$ for FCRO. We train models over 5 folds with different random seeds. In each fold, we rank all validation checkpoints by utility and, among the top 5, select the model with the lowest average $\Delta_{\mathrm{ED}}$ over the sensitive attributes.

Baselines. We compare our method with (i) G-DRO [14] and (ii) JTT [9], methods that seek low worst-group error via minimax optimization over group fairness and target task error; they can naturally be regarded as multi-sensitive fairness methods by defining subgroups over the conjunctions of multi-sensitive attributes. We also extend recent state-of-the-art fair representation learning methods from single sensitive attributes to multiple ones and compare with them, including (iii) Adv [18] and (iv) BR-Net [1], which achieve fair representations via disentanglement using adversarial training; (v) PARADE [4], a state-of-the-art method that adversarially eliminates mutual information between target and sensitive attribute representations; and (vi) Orth [15], the recent work closest to ours, which hard-codes orthogonal means for the sensitive and target prior distributions and re-parameterizes the encoder outputs on these orthogonal priors. Besides, we report (vii) ERM [17], the vanilla classifier trained without any bias mitigation techniques.

Refer to caption
Figure 3: (a) Subgroup calibration curves. We report the mean (line) and standard deviation (surrounding shadow) over subgroups defined by the conjunction of race, sex, and age; larger shadow areas represent more severe unfairness. (b) Class activation maps [2] generated by vanilla ERM [17] and FCRO (ours).

3.2 Comparison with Baselines

Quantitative results. We summarize the quantitative comparisons in Table 2. All bias mitigation methods improve fairness over ERM [17] at some cost of utility. While maintaining considerable classification accuracy, FCRO achieves significant fairness improvements both jointly and individually, demonstrating the effectiveness of our representation-orthogonality motivation. Compared with the best baseline performance in each metric, FCRO reduces subgroup classification disparity by 2.7% in joint $\Delta_{\mathrm{AUC}}$ and 2.3% in joint $\Delta_{\mathrm{ED}}$, and gains 0.5% in $\Delta_{\mathrm{AUC}}$ and 0.4% in $\Delta_{\mathrm{ED}}$ on average over the three sensitive attributes. As medical applications are sensitive to classification thresholds, we further show calibration curves, with the mean and standard deviation over subgroups defined on the conjunction of multiple sensitive attributes, in Fig. 3(a). The vanilla ERM [17] suffers from biased calibration among subgroups. Fairness algorithms help mitigate this, and FCRO shows the smallest deviation across subgroups and the most trustworthy classification.

Qualitative results. We present class activation maps [2] in Fig. 3(b). We observe that the vanilla ERM [17] model tends to look for sensitive evidence outside the lung regions, e.g., the breast, which introduces unfairness. FCRO focuses only on the pathology-related region for fair pleural effusion classification, which visually confirms the validity of our method.

Refer to caption
Figure 4: (a) Fairness-utility trade-off; the perfect point lies in the top-left corner. We report ablations and Pareto fronts from the hyper-parameter sweep. (b) Fairness when trained with various combinations of the three sensitive attributes: race (R), sex (S), and age (A). (c) AUC convergence with different ranks $k$ of $\mathcal{S}_A$. (d) Fairness and total variance (the percentage of sensitive information captured by $\mathcal{S}_A$) under different $k$.

3.3 Ablation Studies

Loss modules and hyperparameters. To evaluate the impact of the loss weights and the critical components of FCRO, we examine different hyper-parameter choices and alternative designs. As depicted in Fig. 4(a), we showcase the significance of the key components and the Pareto front (i.e., the set of optimal points) resulting from a sweep of the hyper-parameters $\lambda_c$ and $\lambda_r$. Increasing $\lambda_c$ and $\lambda_r$ results in a small decrease in AUC (0.5% and 0.12%, respectively), which is offset by significant gains in fairness (4% and 1.5%, respectively). This observation confirms that multiple sensitive attributes can be managed effectively without sacrificing accuracy. Besides, removing either the column or the row space orthogonality degrades joint $\Delta_{\mathrm{ED}}$ by 2.4% and 1.8%, respectively, though the ablated variants remain competitive. We also observe that the accumulative space construction introduced in Section 2.2 achieves comparable performance.

Training with different sensitive attributes. We present an in-depth ablation on multiple sensitive attributes in Fig. 4(b), where models are trained with various numbers and permutations of attributes. All methods perform reasonably better than ERM when trained with a single sensitive attribute, but FCRO brings significantly more benefit when trained with the union of discriminated attributes (e.g., Sex $\times$ Age), which consolidates FCRO's ability to handle multi-sensitive-attribute fairness. FCRO stands out among all methods.

Different rank $k$ for $\widetilde{\mathcal{S}}_A$. We show the effect of choosing different $k$ for column space orthogonality. As shown in Fig. 4(c), a lower rank $k$ benefits the convergence of the model and thus improves accuracy, which validates our insight in Section 2.2 that a lower sensitive-space rank improves the utility of target representations. In Fig. 4(d), we show that $k = 3$ is enough to capture over 95% of the sensitive information, and further increasing it brings little additional fairness benefit; we therefore choose $k = 3$ to achieve the best utility-fairness trade-off.

4 Conclusion and Future Work

This work studies an essential yet under-explored fairness problem in medical image classification, where each sample carries a set of sensitive attributes. We formulate this problem mathematically and propose a novel fair representation learning algorithm named FCRO, which pursues orthogonality between sensitive and target representations. Extensive experiments on a large public chest X-ray dataset demonstrate that FCRO significantly improves the fairness-utility trade-off both jointly and individually. Moreover, we show that FCRO performs stably under different situations through in-depth ablation studies. For future work, we plan to test the scalability of FCRO on an extremely large number of sensitive attributes.

Acknowledgments

This work is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), Public Safety Canada (NS-5001-22170), in part by NVIDIA Hardware Award, and in part by the Hong Kong Innovation and Technology Commission under Project No. ITS/238/21.

References

  • [1] Adeli, E., Zhao, Q., Pfefferbaum, A., Sullivan, E.V., Fei-Fei, L., Niebles, J.C., Pohl, K.M.: Representation learning with statistical independence to mitigate bias. In: IEEE/CVF Winter Conference on Applications of Computer Vision (2021)
  • [2] Chattopadhay, A., Sarkar, A., Howlader, P., Balasubramanian, V.N.: Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In: IEEE Winter Conference on Applications of Computer Vision (2018)
  • [3] Creager, E., Madras, D., Jacobsen, J.H., Weis, M., Swersky, K., Pitassi, T., Zemel, R.: Flexibly fair representation learning by disentanglement. In: International conference on machine learning. pp. 1436–1445. PMLR (2019)
  • [4] Dullerud, N., Roth, K., Hamidieh, K., Papernot, N., Ghassemi, M.: Is fairness only metric deep? evaluating and addressing subgroup gaps in deep metric learning. The International Conference on Learning Representations (2022)
  • [5] Glocker, B., Jones, C., Bernhardt, M., Winzeck, S.: Algorithmic encoding of protected characteristics in image-based models for disease detection. arXiv preprint arXiv:2110.14755 (2021)
  • [6] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708 (2017)
  • [7] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 590–597 (2019)
  • [8] Lin, S., Yang, L., Fan, D., Zhang, J.: Trgp: Trust region gradient projection for continual learning. International Conference on Learning Representations (2022)
  • [9] Liu, E.Z., Haghgoo, B., Chen, A.S., Raghunathan, A., Koh, P.W., Sagawa, S., Liang, P., Finn, C.: Just train twice: Improving group robustness without training group information. In: International Conference on Machine Learning (2021)
  • [10] Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
  • [11] Madras, D., Creager, E., Pitassi, T., Zemel, R.: Learning adversarially fair and transferable representations. In: International Conference on Machine Learning. pp. 3384–3393. PMLR (2018)
  • [12] Madras, D., Creager, E., Pitassi, T., Zemel, R.: Fairness through causal awareness: Learning causal latent-variable models for biased data. In: Proceedings of the Conference on Fairness, Accountability, and Transparency. pp. 349–358 (2019)
  • [13] Roh, Y., Lee, K., Whang, S.E., Suh, C.: Fairbatch: Batch selection for model fairness. International Conference on Learning Representations (2020)
  • [14] Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. International Conference on Learning Representations (2019)
  • [15] Sarhan, M.H., Navab, N., Eslami, A., Albarqouni, S.: Fairness by learning orthogonal disentangled representations. In: European Conference on Computer Vision. pp. 746–761. Springer (2020)
  • [16] Shui, C., Xu, G., Chen, Q., Li, J., Ling, C., Arbel, T., Wang, B., Gagné, C.: On learning fairness and accuracy on multiple subgroups. Advances in Neural Information Processing Systems (2022)
  • [17] Vapnik, V.: Principles of risk minimization for learning theory. In: Advances in Neural Information Processing Systems. vol. 4 (1991)
  • [18] Wadsworth, C., Vera, F., Piech, C.: Achieving fairness through adversarial learning: an application to recidivism prediction. arXiv preprint arXiv:1807.00199 (2018)
  • [19] Xu, Z., Li, J., Yao, Q., Li, H., Shi, X., Zhou, S.K.: A survey of fairness in medical image analysis: Concepts, algorithms, evaluations, and challenges. arXiv preprint arXiv:2209.13177 (2022)
  • [20] Zhang, H., Dullerud, N., Roth, K., Oakden-Rayner, L., Pfohl, S., Ghassemi, M.: Improving the fairness of chest x-ray classifiers. In: Conference on Health, Inference, and Learning. pp. 204–233. PMLR (2022)