Marzieh Ajirak^⋄, Cassandra Heiselman^†, Petar M. Djurić^⋄
^⋄Department of Electrical and Computer Engineering, Stony Brook University
^†Department of Obstetrics, Gynecology and Reproductive Medicine, Stony Brook University Hospital
© 2021 IEEE

Title

Abstract

Novel coronavirus disease 2019 (COVID-19) is spreading rapidly throughout the world, and although pregnant women present the same adverse outcome rates, they are underrepresented in clinical research. In this paper, we model categorical variables of 89 pregnant women who tested positive for COVID-19 within an unsupervised Bayesian framework. We model the data using latent Gaussian processes for density estimation of multivariate categorical data. The results show that the model can find latent patterns in the data, which in turn could provide additional insights into the study of pregnant women who are COVID-19 positive.

Keywords

Coronavirus disease, Categorical data, Gaussian process latent variable model, Distribution estimation

1 Introduction

The coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has become an unprecedented public health crisis. Around the world, many governments issued a call to researchers in machine learning (ML) and artificial intelligence (AI) to address high-priority questions related to COVID-19. This call was not unusual because ML methods are finding many uses in medical diagnosis. The ML field is rich with examples where predictive models are used, for example, to estimate disease severity and, consequently, the state of a patient’s health. These models employ data-driven algorithms that can extract features and discover complicated patterns that could not have been recognized or interpreted by humans [alimadadi2020artificial].

Much medical data is multivariate and categorical, typically representing examinations and test results of patients. Such data also contain random errors and systematic biases, and some entries are missing. Developing accurate models and efficient inference algorithms that overcome these challenges and extract as much information as possible from clinical data is therefore of great importance.

Pregnant women are a particularly important patient population to study because of their vulnerability to disease and their frequent underrepresentation in clinical research [Favre2020]. Data on SARS-CoV-2 and its effects in pregnancy have been relatively sparse. Using ML techniques to study this population during the pandemic can help build pregnancy-specific evidence to guide clinical recommendations.

The success of predictive methods depends largely on feature selection and data representation. A common approach is to have a clinician specify the variables and label the clinical data to be used as training sets. The ML method then learns a mapping from the data to the labels and is tested on new data sets. Although appropriate in many situations, a supervised definition of the features forfeits the opportunity to learn latent patterns and features [miotto2016deep]. To counter the subjectivity of defining the features, an unsupervised learning approach can be used to extract useful information from data. One might argue that a limitation of this approach is that patients are often represented in a low-dimensional space that summarizes the information available in the data, but with some loss of information [miotto2016deep].

In the ML literature, Gaussian processes (GPs) provide a data-efficient and powerful Bayesian framework for learning latent functions or patterns [rasmussen2003gaussian]. It can be shown that a GP is equivalent to a single-layer fully connected neural network with an i.i.d. prior over its parameters in the limit of infinite network width [lee2017deep]. GPs have been successfully applied to supervised and unsupervised tasks of learning from medical data [feng2018supervised]. In particular, GP latent variable models (GP-LVMs) are used for Bayesian unsupervised learning. They are endowed with the ability to automatically learn latent structures and dependencies in the data [titsias2010bayesian].

Applying GP-LVMs to categorical data is more involved. When the data consist of categorical variables, such as medical examinations, patients’ co-morbidities, and symptoms, each observation is a long vector of categorical variables with a huge number of possible realizations. With a limited number of observations, the resulting space is very sparsely populated.

The problem of sparsity usually occurs when the diversity of the dataset is poor and the number of patients is relatively small. Recent models [khan2012stick]-[gal2015latent] have achieved significant improvements in performance. For example, the authors of [gal2015latent] nonlinearly embed the observations in a continuous space using Gaussian processes.

In this paper, we explore modeling vectors of categorical variables within the Bayesian nonparametric framework, and in particular, we work with GP-LVMs. We focus on categorical latent GPs and use them in the context of GP-LVMs. We study data from pregnant women who tested positive for COVID-19 between March 13, 2020, and August 8, 2020, at Stony Brook University Hospital (SBUH).

Our contribution in the paper is in formulating the problem of density estimation of categorical data in test-positive COVID-19 pregnant women. The purpose of this effort was to discover hidden patterns among the women that could be used for prediction of outcomes based on a number of categorical variables. For example, we could use the found latent space to find similarities among patients and use them to predict the severity of the disease for a particular patient given the “location” of the patient in the latent space. Further, such studies could allow for easy visualization of the cohort of patients under consideration.

The remainder of this paper is organized as follows. In the next section, we provide a brief overview of GP-LVMs. In Section 3, we present the details of the model used for our multivariate categorical data analysis and its inference procedure. In Section 4, we show our results on both synthetic and real COVID-19 data. Finally, in Section 5, we conclude the paper with final remarks.

2 Background

3 Model Description

In this section we present the method based on GP-LVM for distribution estimation of multivariate categorical data, first proposed in [gal2015latent]. We first present the underlying generative model, and then we show how to infer the hidden model variables given the observed data.

3.1 Generative Model

We consider a generative model for a dataset $\mathbf{Y}$ with $N$ observations and $D$ categorical variables. We denote the $d$-th categorical variable in the $n$-th observation by $y_{nd}$. Each categorical variable $y_{nd}$ can take values, for example, from 1 to $K$. Each $y_{nd}$ is a sample from its own probability mass function, obtained by passing the weights $\mathbf{f}_{nd}=(f_{nd1},\ldots,f_{ndK})$ through the softmax in Eq. (5). We assume that each $f_{ndk}$ is a nonlinear function of the latent input $\mathbf{x}_{n}\in\mathbb{R}^{Q}$, where $Q$ is the dimension of $\mathbf{x}_{n}$. Therefore, $f_{ndk}=\mathcal{F}_{dk}\left(\mathbf{x}_{n}\right)$.

Since $\mathbf{x}$ and $\mathcal{F}$ are latent, we need to assign prior distributions to them. We assume a zero-mean Gaussian prior with variance $\sigma_{x}^{2}$ for the latent variables $\mathbf{x}$ and a Gaussian process (GP) prior for each of the functions $\mathcal{F}$. We also consider a set of $M$ variational inducing points. For each vector of latent function values $\mathbf{f}_{dk}$, we introduce a separate set of $M$ inducing variables $\mathbf{u}_{dk}$, evaluated at a set of inducing input locations $Z$. It is assumed that all $\mathbf{u}_{dk}$ are evaluated at the same inducing locations. The inducing variables are function values drawn from the GP prior and lie in the same latent space as the $F$ variables (Fig. LABEL:fig:_FU). The generative model can be summarized as

$x_{nq}\overset{\text{iid}}{\sim}\mathcal{N}\left(0,\sigma_{x}^{2}\right),$ (1)
$\mathcal{F}_{dk}\overset{\text{iid}}{\sim}\mathcal{GP}\left(0,\mathbf{K}_{d}\right),$ (2)
$f_{ndk}=\mathcal{F}_{dk}\left(\mathbf{x}_{n}\right),$ (3)
$u_{mdk}=\mathcal{F}_{dk}\left(\mathbf{z}_{m}\right),$ (4)
$p(y_{nd}=k)=\dfrac{\exp\left(f_{ndk}\right)}{\sum_{k^{\prime}=1}^{K}\exp\left(f_{ndk^{\prime}}\right)}.$ (5)
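
As an illustration of Eqs. (1)-(3) and (5), the following minimal sketch draws one dataset from the generative model. It is not the authors' implementation: the squared-exponential kernel, the use of the same kernel hyperparameters for every variable $d$, the jitter term, and all function and parameter names are assumptions made for this example. The inducing variables of Eq. (4) are omitted because they are only needed for inference.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel (an assumed choice for K_d)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def sample_categorical_gp(N=50, D=4, K=3, Q=2, sigma_x=1.0, seed=0):
    """Draw latent inputs X, function values F, probabilities P, and data Y."""
    rng = np.random.default_rng(seed)
    # Eq. (1): i.i.d. Gaussian latent inputs with variance sigma_x**2.
    X = rng.normal(0.0, sigma_x, size=(N, Q))
    # Eqs. (2)-(3): each F_dk evaluated at X is a zero-mean Gaussian vector
    # with covariance K(X, X); a small jitter keeps the Cholesky factor stable.
    Kxx = rbf_kernel(X, X) + 1e-6 * np.eye(N)
    L = np.linalg.cholesky(Kxx)
    F = np.einsum("nm,mdk->ndk", L, rng.normal(size=(N, D, K)))
    # Eq. (5): softmax over the K weights of each (n, d) pair.
    F_shift = F - F.max(axis=2, keepdims=True)
    P = np.exp(F_shift)
    P /= P.sum(axis=2, keepdims=True)
    # Inverse-CDF sampling of each categorical variable, labels in 1..K.
    U = rng.random(size=(N, D, 1))
    Y = np.minimum((U > np.cumsum(P, axis=2)).sum(axis=2), K - 1) + 1
    return X, F, P, Y

X, F, P, Y = sample_categorical_gp()
print(Y[:5])  # first five observations, one column per categorical variable
```

Sharing one kernel across all $d$ keeps the sketch short; the model itself allows a separate $\mathbf{K}_{d}$ for each categorical variable.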

3.2 Inference

The marginal log-likelihood is intractable because the latent inputs enter the covariance function of the GP nonlinearly and because of the softmax likelihood. We consider a variational approximation to the posterior distribution of $\mathbf{X},\mathbf{F}$ and $\mathbf{U}$ that factorizes as

q(\mathbf{X},\mathbf{F},\mathbf{U})=q(\mathbf{X})q(\mathbf{U})p(\mathbf{F}|\mathbf{X},\mathbf{U}).

(6)

By applying Jensen’s inequality, we can write a lower bound of the log-evidence (ELBO) as

\begin{aligned}
\log p(\mathbf{Y})&=\log\int p(\mathbf{X})p(\mathbf{U})p(\mathbf{F}|\mathbf{X},\mathbf{U})p(\mathbf{Y}|\mathbf{F})\mathrm{d}\mathbf{X}\mathrm{d}\mathbf{F}\mathrm{d}\mathbf{U}\\
&\geq-\mathrm{KL}(q(\mathbf{X})\|p(\mathbf{X}))-\mathrm{KL}(q(\mathbf{U})\|p(\mathbf{U}))\\
&\quad+\sum_{n=1}^{N}\sum_{d=1}^{D}\int q\left(\mathbf{x}_{n}\right)q\left(\mathbf{U}_{d}\right)p\left(\mathbf{f}_{nd}|\mathbf{x}_{n},\mathbf{U}_{d}\right)\\
&\qquad\cdot\log p\left(\mathbf{y}_{nd}|\mathbf{f}_{nd}\right)\mathrm{d}\mathbf{x}_{n}\mathrm{d}\mathbf{f}_{nd}\mathrm{d}\mathbf{U}_{d}:=\mathcal{L},
\end{aligned}

(7)

where

p\left(\mathbf{f}_{nd}|\mathbf{x}_{n},\mathbf{U}_{d}\right)=\prod_{k=1}^{K}\mathcal{N}\left(f_{ndk}\,|\,\mathbf{K}_{d,nM}\mathbf{K}_{d,MM}^{-1}\mathbf{u}_{dk},\;K_{d,nn}-\mathbf{K}_{d,nM}\mathbf{K}_{d,MM}^{-1}\mathbf{K}_{d,Mn}\right).

(8)
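
The conditional in Eq. (8) is the standard GP conditional given the inducing variables. The sketch below computes its mean and variance for a single latent input $\mathbf{x}_{n}$ and a single pair $(d,k)$. The squared-exponential kernel, the jitter added to $\mathbf{K}_{d,MM}$, and the names `rbf_kernel` and `sparse_conditional` are assumptions for illustration only.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel (an assumed choice for K_d)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def sparse_conditional(x_n, Z, u_dk, kernel, jitter=1e-6):
    """Mean and variance of f_ndk given inducing variables u_dk, as in Eq. (8).

    x_n: (Q,) latent input, Z: (M, Q) inducing inputs, u_dk: (M,) values.
    """
    M = Z.shape[0]
    Kmm = kernel(Z, Z) + jitter * np.eye(M)        # K_{d,MM} plus jitter
    Knm = kernel(x_n[None, :], Z)                  # K_{d,nM}, shape (1, M)
    Knn = kernel(x_n[None, :], x_n[None, :])[0, 0]
    A = np.linalg.solve(Kmm, Knm.T)                # K_{d,MM}^{-1} K_{d,Mn}
    mean = float(A[:, 0] @ u_dk)
    var = float(Knn - (Knm @ A)[0, 0])
    return mean, max(var, 0.0)                     # clip rounding noise

Z = np.linspace(-2.0, 2.0, 5)[:, None]
u = np.array([0.3, -1.0, 0.5, 2.0, -0.7])
print(sparse_conditional(np.array([0.25]), Z, u, rbf_kernel))
```

At an inducing location the conditional mean reproduces the corresponding inducing value and the variance collapses toward zero; far from all inducing locations it reverts to the prior mean and variance.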

The lower bound is still intractable because the expectation of the log softmax likelihood, $\log p\left(\mathbf{y}_{nd}\mid\mathbf{f}_{nd}\right)$, has no closed form. Therefore, we compute the lower bound $\mathcal{L}$ and its derivatives with the Monte Carlo method. We draw samples of $\mathbf{x}_{n},\mathbf{U}_{d}$ and $\mathbf{f}_{nd}$ from $q\left(\mathbf{x}_{n}\right),q\left(\mathbf{U}_{d}\right),$ and $p\left(\mathbf{f}_{nd}\mid\mathbf{x}_{n},\mathbf{U}_{d}\right)$, respectively, and estimate $\mathcal{L}$ with the sample average. We consider a mean-field variational approximation for the latent points, $q(\mathbf{X})$, and a joint Gaussian distribution for $q(\mathbf{U})$,

q(\mathbf{U})=\prod_{d=1}^{D}\prod_{k=1}^{K}\mathcal{N}\left(\mathbf{u}_{dk}|\boldsymbol{\mu}_{dk},\mathbf{\Sigma}_{d}\right),

(9)

q(\mathbf{X})=\prod_{n=1}^{N}\prod_{q=1}^{Q}\mathcal{N}\left(x_{nq}|m_{nq},s_{nq}^{2}\right),

(10)

where the covariance matrix $\mathbf{\Sigma}_{d}$ is shared across the $K$ categories of the same categorical variable $d$. The KL divergences in $\mathcal{L}$ can be computed analytically for the given variational distributions. We need to optimize the hyperparameters of each GP (the parameters of $\mathbf{K}_{d}$), the parameters $\boldsymbol{\mu}_{dk}$ and $\mathbf{\Sigma}_{d}$ of the variational distribution over the inducing variables $\mathbf{u}_{dk}$, and the means $m_{nq}$ and variances $s^{2}_{nq}$ of the latent inputs. The graphical model of the variational distributions for the inference part is shown in Fig. LABEL:graph2. For further details about the inference and learning hyperparameters refer to [gal2015latent].
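
Two ingredients of $\mathcal{L}$ can be sketched in a few lines: the closed-form KL term for the mean-field $q(\mathbf{X})$ of Eq. (10) against the prior of Eq. (1), and a Monte Carlo estimate of the expected log softmax likelihood. The sketch is a simplification, not the inference code of [gal2015latent]: it assumes that, for already drawn samples of $\mathbf{x}_{n}$ and $\mathbf{U}_{d}$, the $K$ entries of $\mathbf{f}_{nd}$ are independent Gaussians with given means and variances (as in Eq. (8)), and it uses zero-based labels. All names are hypothetical.

```python
import numpy as np

def kl_diag_gaussian(m, s2, sigma_x2=1.0):
    """KL( N(m, s2) || N(0, sigma_x2) ), summed over all entries of q(X)."""
    return 0.5 * np.sum((s2 + m**2) / sigma_x2 - 1.0 - np.log(s2 / sigma_x2))

def log_softmax(f):
    """Numerically stable log of the softmax in Eq. (5), along the last axis."""
    f = f - f.max(axis=-1, keepdims=True)
    return f - np.log(np.exp(f).sum(axis=-1, keepdims=True))

def mc_expected_loglik(y, f_mean, f_var, n_samples=1000, rng=None):
    """Sample average of log p(y | f) with f_k ~ N(f_mean[k], f_var[k]).

    y is a zero-based category index (an assumption of this sketch).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    eps = rng.normal(size=(n_samples, f_mean.shape[0]))
    f = f_mean + np.sqrt(f_var) * eps      # reparameterized samples of f_nd
    return log_softmax(f)[:, y].mean()

m = np.array([0.5, -0.2])
s2 = np.array([0.8, 1.2])
print(kl_diag_gaussian(m, s2))
print(mc_expected_loglik(1, np.array([0.0, 1.0, -1.0]), np.array([0.1, 0.1, 0.1])))
```

Writing the samples as a deterministic function of standard normal noise is one common way to obtain derivatives of the estimate with respect to the variational parameters; an automatic differentiation framework would be needed to use it for optimization.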

4 Experiments and results

4.1 Synthetic data

In this section, we present our results with synthetic data. We first generated a dataset $\mathbf{Y}$ for $N=100$ patients and 10 categorical variables as shown in Fig. LABEL:graph1. These patients belonged to two clusters. We assumed that each categorical variable takes the values 0 and 1. In this way, we only needed to estimate the probability of one category, since $\operatorname{Pr}\{\mathrm{k}=0\}=1-\operatorname{Pr}\{\mathrm{k}=1\}$. The probability that the $d$-th variable equals 1 for patient $x_{n}$ was proportional to the output of the function $f_{d}(x_{n})$, Fig. 1 (a). We generated samples $x_{n}$ from a mixture of two Gaussian distributions in order to see whether the model could detect the underlying clusters of patients. These two clusters are illustrated in Fig. 1 (a). Note that the true dimension of the latent space was one. However, we initialized the dimension at a higher value and let the model learn the lower dimension of the latent space. Figure 1 (b) shows the inferred latent space for the first two dimensions. Figure 1 (c) shows the training error. We can infer the dimensionality of the latent space by comparing the maximized ELBO for different initializations of $Q$.

Figure 1 (a): Ground truth: latent functions $\mathcal{F}_{dk}(x)$ for $d=1,\dots,10$ are shown in different colors.
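
A dataset of the kind described in this section can be generated along the following lines. This is a hedged sketch only: the cluster centers, the cluster standard deviation, the sinusoidal form of the latent functions $f_{d}$, and the logistic link (the two-category case of Eq. (5) with one weight fixed at zero) are assumptions, since the exact settings of the experiment are not specified here.

```python
import numpy as np

def make_synthetic(N=100, D=10, seed=0):
    """Binary data for N patients whose 1-D latent inputs form two clusters."""
    rng = np.random.default_rng(seed)
    # Latent inputs from an equally weighted mixture of two Gaussians
    # (centers and spread are assumed values).
    cluster = rng.integers(0, 2, size=N)
    centers = np.array([-2.0, 2.0])
    x = rng.normal(centers[cluster], 0.5)
    # Hypothetical smooth latent functions f_d(x) = 2 sin(a_d x + b_d).
    a = rng.normal(size=D)
    b = rng.normal(size=D)
    F = 2.0 * np.sin(np.outer(x, a) + b)           # shape (N, D)
    # Probability that variable d equals 1 (logistic link, an assumption).
    P1 = 1.0 / (1.0 + np.exp(-F))
    Y = (rng.random((N, D)) < P1).astype(int)
    return x, cluster, P1, Y

x, cluster, P1, Y = make_synthetic()
print(Y.shape, Y.mean(axis=0).round(2))
```

The cluster labels are returned only so that an inferred latent space can be checked against the ground truth; they would not be passed to the model.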

4.2 COVID-19 Data

We used data collected at SBUH from pregnant women who tested positive for COVID-19. The dataset comprised categorical data from 89 patients. All the variables are listed in Table 1. COVID-19 patients may have severe or mild symptoms, and some of them end up in an intensive care unit (ICU). Several studies have used ML algorithms to build models for detecting the severity of COVID-19. The authors of [yao2020severity] investigated the binary classification problem of distinguishing severe from mild cases. A problem with these methods is that they do not quantify the uncertainty of their predictions. Moreover, they require a subjective definition of features and outcomes. In most cases, disease severity is not adequately described by just two classes. Thus, it is important to study a range of severity classes, from mild (e.g., a common cough), through various intermediate levels, to severe cases that end up in the ICU. This is why it is important to employ methods such as categorical latent GPs, which determine the number of patient categories on their own, based on the patterns they discover in the patients’ data.

In this analysis, we mapped the patient population into a two-dimensional space using only the symptoms at the time of diagnosis. Then we labeled the points by severity of the outcomes. The latent space representing outcomes is shown in Fig. 2. By looking at the mapped data, we can easily see that there is a population of patients around $Q_{1}=0.1$, most of whom did not need hospitalization. In contrast, the patients who were admitted to the hospital are concentrated in the upper left corner of the space. Among them, three patients were admitted to the ICU after developing severe outcomes. We also observed one outlier: a patient who was admitted to the ICU but lies at the far right of the latent space. Note that the pregnant women under investigation could have been in the hospital for reasons other than COVID-19. Many of them tested positive while completely asymptomatic.

Finally, we mapped the patients into a two-dimensional latent space using all the categorical variables from Table 1. The latent points and the clustering of the patients are shown in Fig. 3. The intensity of the green color in the figure reflects the density of the population in the latent space. The highest density is around the rightmost group of patients. One might argue that there are three clusters of patients, of which the one in the middle is the least concentrated. These clusters allow us to study the relationships among the patients and, in particular, to understand why patients grouped in a cluster are similar to other patients in the same cluster.

Table 1: List of variables for COVID-19 patients

5 Conclusion

In this paper, we analyzed multivariate categorical data with models based on categorical latent Gaussian processes. With these models, we can discover much lower dimensional latent spaces that can facilitate classification, prediction and visualization. More specifically, we used a data-efficient Bayesian framework for clustering of high-dimensional categorical data. Our tests with synthetic data showed that the method is capable of finding latent structures of the data. Further, we applied the method to data obtained from test-positive COVID-19 pregnant women. There, too, the method discovered latent structures. These structures can be useful for many purposes including gaining important insights of high interest to physicians.