
The Effect of Multiple Imputation of Routine Pathology Variables on Laboratory Diagnosis of Hepatitis C Infection

Nidhi Menon (corresponding author; email address: [email protected]), Brett A. Lidbury & Alice Richardson
National Centre for Epidemiology & Population Health, Australian National University, Canberra, Australia
Abstract

Background: Pathology tests are central to modern healthcare in terms of diagnosis and patient management. Aggregated pathology results provide opportunities for research into fundamental and applied questions in health and medicine, but data analytic challenges appear since test profiles vary between medical practitioners, resulting in missing data. In this study we provide an analytical investigation of the laboratory diagnosis of Hepatitis C (HCV) infection and focus on how to maximize the predictive value of routine pathology data. We recommend using the Influx - Outflux measures to help construct the imputation model when using multiple imputation.

Methods: Data from 14,320 community patients aged 15 to 100 years were accessed via ACT Pathology (The Canberra Hospital, Australia). Influx and Outflux were calculated to identify which variables were potentially powerful predictors of missing values. Available Case analysis and Multiple Imputation were used to accommodate missing values in the dataset. Logistic regression models with stepwise selection were used to analyse the imputed datasets. The predictive power of all methods was compared.

Results: The predictive power of the models on multiply imputed data was similar to the power of the models based on complete data. The advantage of multiply imputed data was that it allowed for the inclusion of all the completed variables in the logistic models, thus identifying a broader selection of test results that could lead to the enhanced laboratory prediction of HCV.

Conclusions: Multiple imputation is an important statistical resource allowing all individuals in a study to contribute whatever data they have supplied to the analysis. MI in combination with the values of Influx and Outflux identifies potential predictors of HepC infection. Variables age, gender and alanine aminotransferase have been shown to be strong laboratory predictors of HCV infection.

Keywords: Hepatitis; logistic regression; multiple imputation.

Highlights:

  • Multiple imputation of missing data is a well-established statistical technique that can contribute to improving laboratory diagnosis.

  • Laboratory diagnosis of Hepatitis C infection can be achieved using partially observed routine pathology data through multiple imputation.

  • Areas under ROC curve achieved in our analysis were around 70%

  • Multiple imputation in combination with the Influx - Outflux allows novel variables to contribute to the prediction models.

  • Age, gender and alanine aminotransferase are strong laboratory predictors of HCV infection.

1 Introduction

Pathology laboratories generate large quantities of data representing human function, including analyses of blood chemistry and cells, infections, genetics and more. These data represent an under-utilized research resource, and while there are drawbacks (Richardson et al., 2016), they introduce a rich informatics and statistical substrate to assist clinical decision making. They also support the interrogation of databases to help answer research problems; for example, such databases have been successfully mined for enhanced laboratory prediction of infectious diseases (Richardson and Lidbury, 2013, 2017). Given the observational nature of the data collected, one drawback is that not every test is conducted for every patient. Therefore, pathology databases contain many missing values that potentially dilute the value of the database.

Missingness can be one of three types: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). Data are MCAR if the probability of missingness is not related to the value of the observation or to any other variables in the data. For most datasets, the assumption of MCAR is unlikely to be satisfied unless the data are missing by design. If the missingness depends only on the observed values of other variables in the dataset, and not on the unobserved value itself, then the data are said to be MAR. MCAR is a special case of MAR; thus, if the data are MCAR, they are also MAR. If the MAR assumption is violated, then the data are said to be MNAR (Rubright et al., 2014).
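The three mechanisms can be illustrated with a small simulation. The sketch below is illustrative only: the variables `x` and `y`, the missingness probabilities and the sample size are assumptions chosen for the example and are not taken from the study data. It deletes values of `y` under each mechanism and compares the mean of the remaining observed values with the mean of the full data.

```python
import random

def simulate_mechanisms(n=20000, seed=1):
    """Mean of y after deleting values under MCAR, MAR and MNAR (toy example)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0, 1)          # fully observed covariate
        y = x + rng.gauss(0, 1)      # variable that will receive missing values
        data.append((x, y))

    def observed_mean(p_missing):
        # Keep y only when a uniform draw is at least the missingness probability.
        kept = [y for x, y in data if rng.random() >= p_missing(x, y)]
        return sum(kept) / len(kept)

    return {
        "full": sum(y for _, y in data) / n,
        # MCAR: same probability of missingness for everyone.
        "MCAR": observed_mean(lambda x, y: 0.3),
        # MAR: probability depends only on the observed covariate x.
        "MAR": observed_mean(lambda x, y: 0.6 if x > 0 else 0.1),
        # MNAR: probability depends on the unobserved value y itself.
        "MNAR": observed_mean(lambda x, y: 0.6 if y > 0 else 0.1),
    }

means = simulate_mechanisms()
```

In this toy setting the observed mean under MCAR stays close to the full-data mean, while under MAR and MNAR it is pulled downwards. Under MAR the distortion can in principle be corrected by conditioning on `x`; under MNAR it cannot be corrected from the observed data alone.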

Missing data mechanisms are important since older methods of handling missing data (e.g. available case analysis) assume MCAR missingness while modern techniques to handle missing data such as maximum likelihood estimation (MLE) and multiple imputation assume only MAR.

1.1 Hepatitis C Virus

Viral hepatitis is a worldwide health concern, with an estimated 1.45 million deaths attributable to it globally in 2013 (Stanaway et al., 2016). More recent modelling (Blach et al., 2017) has also shown that the global prevalence of Hepatitis C virus (HCV) has reached an estimated 71.1 million infections. HCV affects the liver and can cause permanent and ultimately fatal liver cancer (El-Serag, 2012). HCV causes inflammation of the liver and over time this can lead to liver diseases such as fibrosis and cirrhosis. HCV is a blood-borne virus and in developed countries it is often transmitted through risky behaviour such as sharing needles. Indeed, in Australia in 2007 (Razali et al., 2007), 90% of new and 80% of existing infections were transmitted in this way.

While there are advances in the treatment of people with HCV, the human costs of HCV, in terms of reduction of quality of life and well being and through occupational and social discrimination and isolation, remain significant (Vietri et al., 2013). Guidelines for the care and treatment of HCV have recently been updated (Organization et al., 2018). The financial costs of the virus for medical and hospital care, lost productivity, and the need for social support are also an increasing proportion of healthcare expenditure in Australia. While the diagnosis of HCV is most confidently made through the immunoassay (for Study of Liver et al., 2014), having a powerful predictive model will help focus on early detection of HCV infection particularly for cases where the specific immunoassay has not been ordered.

1.2 Multiple imputation

Analysis of multivariate data is often negatively affected by incompleteness. Reduced statistical power, increase in standard errors, complications in data handling and analysis, and introduction of bias are some of the common problems associated with missing data (Horton and Lipsitz, 2001).

Several methods have been proposed over the years to handle the problem of missing data, for example single imputation using means or medians (Little and Rubin, 2002) and the EM algorithm (Dempster et al., 1977). While single imputation techniques are easy to implement, they treat the missing values as if they were known, thus eliminating any uncertainty that missing values bring to the dataset. They do not preserve the inherent variability of the imputed data and can be severely biased (Dempster et al., 1977). For these reasons we do not pursue single imputation in this paper but instead show the usefulness of Multiple Imputation (MI) (Rubin, 1987), where the multiple imputed datasets are analysed using standard statistical procedures and the estimates from these analyses are then combined to produce overall estimates with more appropriate standard errors. These estimates and standard errors reflect the true variation and uncertainty in the data better than the estimates obtained by deleting any observations with missing values, also known as available case analysis. MI is less biased than single imputation methods, increases the efficiency of estimation and also preserves the variability in the dataset. Hence, it is the preferred method to fill in missing values in datasets such as the one used in this study. It should be noted that the goal of imputation is not to replace or recover the missing values, as they are unknown, but to produce valid analytical results in the presence of missing values.

Despite its conception in the 1980s, MI has only recently come into general use across medical research due to advances in statistical computation. Even with the evolution of MI and its statistical advantages, the technique has had limited application in laboratory diagnostics. The use of MI in clinical chemistry has been supported (Janssen et al., 2009; Waljee et al., 2013) when values of a variable in an existing prediction model are missing. The study presented herein extends this previous approach by imputing variables with an unknown relationship to the outcome of interest.

The primary goal of this study is to show the importance of multiple imputation when dealing with missing values in routinely collected pathology data. A secondary goal of this paper is to contribute to knowledge of laboratory prediction of HCV by identifying novel combinations of routinely measured biomarkers that are highly predictive of HCV. This focus on an existing resource which provides a low-cost approach to knowledge discovery is important in the context of low-middle income countries, where laboratory resources may be limited (Shang et al., 2013).

2 Materials and Methods

2.1 Data

The dataset employed in this study was made available by ACT Pathology at The Canberra Hospital, Canberra, Australia. Patient identifiers were removed and only laboratory ID numbers provided. The dataset contained the 18,625 pathology requests between 1997 and 2007 that included a request for either the assay for Hepatitis B (HepB) or Hepatitis C (HepC) or both (Figure 1). There were 4,296 patients for whom 60% or more of the other assays (Table 1) were missing, and these were removed before analysis commenced due to the large proportion of missingness. Another 3,546 individuals were missing a HCV immunoassay (HepC) outcome despite a request having been made and so were excluded from the analysis. The result of the HCV immunoassay is recorded as positive if the test is positive for the presence of HCV antibodies. The same dataset has been analysed previously (Richardson and Lidbury, 2013, 2017) to ascertain the interaction between virus, outcome, pre-processing and method on the performance of decision tree ensembles and support vector machines.

For access to de-identified patient data, this study had human ethics approval granted by The Australian National University Human Ethics Committee (2012/349) and the ACT Health Human Research Ethics Committee (ETHLR.11.016).

Figure 1: Flow of patients into the HepC study (To be printed in Colour)
Table 1: Diagnostic pathology variables included in the regression analysis, with summary statistics (n = 10,774).

| Variable | Description | Mean (SD), HepC = 0 (n = 10102) | Mean (SD), HepC = 1 (n = 672) | p-value |
|---|---|---|---|---|
| Age | Patient’s age in years | 44.72 (19.36) | 40.09 (14.41) | <0.001 |
| Sex | Patient’s gender (Males), Frequency (%) | 5188 (51.36%) | 251 (37.35%) | 0.00013 |
| ALB | Albumin | 42.84 (5.80) | 42.09 (5.88) | 0.0004 |
| Sodium | Sodium | 139.63 (3.22) | 139.66 (3.06) | 0.9927 |
| K | Potassium | 4.00 (0.45) | 4.07 (0.47) | <0.001 |
| RCC | Red cell count | 4.53 (0.64) | 4.68 (0.64) | <0.001 |
| RDW | Red cell distribution width | 13.88 (1.78) | 14.07 (1.63) | <0.001 |
| ALT | Alanine aminotransferase | 1.46 (0.38) | 1.64 (0.42) | <0.001 |
| ALKP | Alkaline phosphatase | 1.92 (0.21) | 1.94 (0.18) | <0.001 |
| Crea | Creatinine | 1.93 (0.16) | 1.90 (0.13) | <0.001 |
| GGT | Gamma-glutamyl transferase | 1.61 (0.44) | 1.69 (0.43) | <0.001 |
| Urea | Blood urea | 0.72 (0.20) | 0.65 (0.19) | <0.001 |
| Plt | Platelets | 2.39 (0.20) | 2.39 (0.18) | 0.2179 |
| WCC | White cell count | 0.87 (0.17) | 0.90 (0.15) | <0.001 |
| TBil | Total bilirubin | 1.03 (0.30) | 0.98 (0.31) | <0.001 |
| Mono | Monocytes | 0.74 (0.20) | 0.76 (0.15) | <0.001 |
| Eos | Eosinophils | 0.38 (0.18) | 0.39 (0.18) | 0.014 |
| Bas | Basophil | 0.18 (0.09) | 0.19 (0.08) | <0.001 |
| Lymph | Lymphocytes | 1.38 (0.34) | 1.48 (0.53) | <0.001 |

2.2 Statistical Analysis

This is a cross-sectional dataset where each individual appears only once, with a non-monotone pattern of missingness. Percentage missingness is calculated for each variable and we examine the pattern of missingness visually to get an impression of the extent of incompleteness. These patterns of missingness influence the amount of information that can be transferred between variables (Van Buuren, 2012). For example, imputations can be more precise if complete information is available for the other variables for the observations that are to be imputed.
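As a minimal sketch of this first descriptive step, the percentage of missing values per variable and the frequency of each missingness pattern can be tabulated as follows. The function name `missingness_summary` and the toy records are hypothetical and are not part of the study data; `None` marks a test that was not performed.

```python
from collections import Counter

def missingness_summary(rows, variables):
    """Return (percent missing per variable, Counter of missingness patterns)."""
    n = len(rows)
    pct = {}
    for v in variables:
        n_missing = sum(1 for row in rows if row.get(v) is None)
        pct[v] = 100.0 * n_missing / n
    # A pattern is a tuple of booleans, True where the variable is missing.
    patterns = Counter(
        tuple(row.get(v) is None for v in variables) for row in rows
    )
    return pct, patterns

# Toy records (hypothetical values, not from the study).
toy = [
    {"Age": 34, "ALT": 41.0, "Urea": 5.1},
    {"Age": 52, "ALT": None, "Urea": None},
    {"Age": 47, "ALT": 30.0, "Urea": None},
    {"Age": 61, "ALT": None, "Urea": None},
]
pct_missing, pattern_counts = missingness_summary(toy, ["Age", "ALT", "Urea"])
```

Each distinct key of `pattern_counts` corresponds to one row of a missingness-pattern chart such as Figure 2.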

Van Buuren (2012) proposed the influx and outflux statistics to streamline the construction of imputation models so that they are less dependent on the subjective choices of the imputer and the analyst. These statistics quantify how each variable connects to the others. Let $r_{i,a}$ denote the response indicator, equal to 1 if variable $X_{a}$ is observed for observation $i$ and 0 if it is missing. As described by Van Buuren (2012), for a pair of variables $(X_{a},X_{b})$ in a sample of $n$ observations with $p$ variables, the influx coefficient for the variable $X_{a}$ is defined as

$$I_{a}=\frac{\sum_{b=1}^{p}\sum_{i=1}^{n}(1-r_{i,a})\,r_{i,b}}{\sum_{b=1}^{p}\sum_{i=1}^{n}r_{i,b}}$$

The value of $I_{a}$ depends on the proportion of missing data in the variable $X_{a}$: if $X_{a}$ has no missing values, $I_{a}=0$ (as for Age and Sex in Table 3). The numerator is the number of variable pairs $(X_{a},X_{b})$ for which $X_{a}$ is missing and $X_{b}$ is observed, and the denominator is the total number of observed data cells. Thus, a high influx indicates that the missing entries of the variable are well connected to the observed data.

The second measure proposed by van Buuren is the outflux coefficient. For the same pair of variables $(X_{a},X_{b})$, the outflux coefficient indicates how useful $X_{a}$ is for imputing missing values in $X_{b}$. Like influx, outflux depends on the extent of missing values in the variable $X_{a}$: a completely observed variable has $O_{a}=1$. A variable with higher outflux is better connected to the missing data, and can be more useful in imputing other variables (Van Buuren, 2012). The outflux coefficient is given by

$$O_{a}=\frac{\sum_{b=1}^{p}\sum_{i=1}^{n}r_{i,a}(1-r_{i,b})}{\sum_{b=1}^{p}\sum_{i=1}^{n}(1-r_{i,b})}$$
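The two coefficients can be computed directly from a matrix of response indicators. The following is a minimal sketch, assuming that influx is normalised by the total number of observed cells and outflux by the total number of missing cells, so that a completely observed variable has influx 0 and outflux 1, as seen for Age and Sex in Table 3. The function name `flux` and the toy indicator matrix are hypothetical.

```python
def flux(R):
    """Influx and outflux per variable.

    R is a list of rows; each row is a list of 0/1 response indicators
    (1 = observed, 0 = missing), one entry per variable.
    """
    n, p = len(R), len(R[0])
    total_observed = sum(sum(row) for row in R)
    total_missing = n * p - total_observed
    influx, outflux = [], []
    for a in range(p):
        # Pairs (a, b) within a row where X_a is missing and X_b is observed.
        missing_observed = sum((1 - row[a]) * row[b] for row in R for b in range(p))
        # Pairs (a, b) within a row where X_a is observed and X_b is missing.
        observed_missing = sum(row[a] * (1 - row[b]) for row in R for b in range(p))
        influx.append(missing_observed / total_observed if total_observed else 0.0)
        outflux.append(observed_missing / total_missing if total_missing else 0.0)
    return influx, outflux

# Toy indicator matrix: 4 observations, 3 variables; the first is complete.
R_toy = [
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
]
influx_toy, outflux_toy = flux(R_toy)
```

In the toy matrix the complete first variable receives influx 0 and outflux 1, while the third variable, which is observed only in the fully complete row, receives outflux 0 because it is never available when another variable is missing.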

Fluxplots display both measures together, making interpretation easier for the imputer and the analyst. They can be used to identify variables that clutter the imputation model, thus making the process of constructing the imputation model less subjective. Variables in the lower areas of the fluxplot that are not used for analysis can be removed from the data prior to imputation. While a substantial literature focuses on developing proper imputation models, there is an evident lack of illustrative examples that describe the imputation process in practice and that rely on quantitative measures, so that the resulting imputation models are as impartial as possible to the imputer and the analyst.

Multiple imputation is used to address missing data. In this method, each missing value is replaced by two or more imputed values to represent the statistical uncertainty about which value to impute. It involves three stages: imputation, analysis and pooling. Final inference on the regression coefficient estimates is made on the pooled result using Rubin’s combination rules (Little and Rubin, 2002). Variables with an outflux coefficient above 0.90, i.e. variables that occur in the top left corner of the flux plot, are used as predictors in the imputation model. We employ a stepwise procedure and count the number of times each variable occurs as a predictor in the model across the imputed datasets. This method is also referred to as the impute-then-select method (Yang et al., 2005). A supermodel is constructed using the variables that occur across all imputed datasets. For all other variables, we use the multivariate Wald test and the likelihood ratio test to determine whether the variable should be included in the final model.
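The pooling stage can be sketched for a single regression coefficient. Under Rubin's rules the pooled estimate is the mean of the per-imputation estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the numbers below are hypothetical and are not results from this study.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                         # pooled point estimate
    within = mean([se ** 2 for se in std_errors])   # average within-imputation variance
    between = variance(estimates)                   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between          # total variance
    return q_bar, sqrt(total)

# Hypothetical estimates of one coefficient from m = 3 imputed datasets.
pooled_beta, pooled_se = pool_rubin([0.30, 0.35, 0.40], [0.05, 0.05, 0.05])
```

The pooled standard error is larger than any of the individual standard errors because it also carries the variation between imputations, which is how MI propagates the uncertainty due to missing values.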

Multiple Imputation using Chained Equations (MICE) (Van Buuren, 2012) has emerged in the literature as a routine method for handling missing data in both continuous and binary variables. Descriptive statistics (mean and standard deviation) for all iterations are used to obtain a plot to check imputation convergence. If the mean and standard deviation of the imputed variables appear to have settled at particular values, the imputation process is deemed to have converged. There is a lack of formal tests of convergence (Su et al., 2011), so the plots of statistics such as the mean and standard deviation are used to provide visual information about convergence. A backward stepwise model selection procedure is employed in the analysis phase after imputation.

All analyses were performed using the statistical software R (version 3.4.0) (R Core Team, 2021). The package ‘mice’ (Van Buuren and Groothuis-Oudshoorn, 2011) was used for multiple imputation. The package ‘ROCR’ (Sing et al., 2005) was used to produce Receiver Operating Characteristic (ROC) curves to test the predictive ability of the imputed models and to calculate the area under the curve (AUC). The packages ‘nortest’ (Gross and Ligges, 2015) and ‘psych’ (Revelle, 2022) were used to perform the Anderson-Darling test of normality and to obtain summary statistics respectively. The package ‘MASS’ (Venables and Ripley, 2002) was used to perform $\chi^{2}$ and Mann-Whitney U tests.

Table 2: Percentage of missing values for each variable, by HepC status (n = 10,774).

| Variable | Percentage Missing, HepC = 0 (n = 10102) | Percentage Missing, HepC = 1 (n = 672) |
|---|---|---|
| Age | 0 | 0 |
| Sex | 0 | 0 |
| Urea | 18.71 | 13.39 |
| TBil | 18.71 | 4.46 |
| ALT | 18.67 | 4.46 |
| Crea | 18.65 | 13.39 |
| GGT | 18.63 | 4.02 |
| Potassium | 17.52 | 12.35 |
| ALKP | 17.43 | 3.57 |
| Sodium | 17.19 | 11.61 |
| ALB | 16.41 | 3.57 |
| Bas | 2.52 | 2.23 |
| Eos | 1.97 | 1.04 |
| Plt | 0.86 | 0.15 |
| Mono | 0.53 | 0.30 |
| Neut | 0.53 | 0.30 |
| Lymph | 0.53 | 0.30 |
| WCC | 0.50 | 0 |
| MCHC | 0.23 | 0 |
| Hct | 0.21 | 0 |
| RCC | 0.21 | 0 |
| RDW | 0.17 | 0.45 |
| MCV | 0.16 | 0 |
| Mch | 0.16 | 0 |
| Hb | 0.12 | 0 |
Figure 2: Patterns of missingness in predictor variables by combination. Each row of the chart represents a different combination of missing values.
Blue = a variable has no missing values, red = a variable has missing values. See Table 2 for abbreviation of biomarker names. (To be printed in Colour)
Figure 3: Fluxplot for HepC data: Outflux v/s Influx
Table 3: Influx and outflux of multivariate missing data patterns.

| Variable | Proportion Observed | Influx | Outflux | FICO |
|---|---|---|---|---|
| Age | 1.000 | 0.000 | 1.000 | 0.454 |
| Sex | 1.000 | 0.000 | 1.000 | 0.454 |
| ALT | 0.865 | 0.101 | 0.346 | 0.369 |
| ALB | 0.882 | 0.086 | 0.387 | 0.381 |
| ALKP | 0.875 | 0.093 | 0.365 | 0.376 |
| Crea | 0.840 | 0.126 | 0.314 | 0.350 |
| TBil | 0.865 | 0.102 | 0.345 | 0.369 |
| GGT | 0.866 | 0.101 | 0.348 | 0.369 |
| Urea | 0.839 | 0.127 | 0.310 | 0.349 |
| Sodium | 0.862 | 0.105 | 0.348 | 0.366 |
| K | 0.858 | 0.109 | 0.343 | 0.364 |
| Hb | 0.999 | 0.000 | 0.992 | 0.453 |
| RCC | 0.998 | 0.001 | 0.987 | 0.453 |
| MCV | 0.999 | 0.001 | 0.992 | 0.453 |
| Hct | 0.998 | 0.001 | 0.987 | 0.453 |
| RDW | 0.998 | 0.001 | 0.991 | 0.453 |
| Mch | 0.999 | 0.001 | 0.992 | 0.453 |
| MCHC | 0.998 | 0.001 | 0.987 | 0.453 |
| Plt | 0.993 | 0.005 | 0.956 | 0.450 |
| WCC | 0.996 | 0.002 | 0.964 | 0.452 |
| Neut | 0.996 | 0.002 | 0.961 | 0.451 |
| Lymph | 0.996 | 0.002 | 0.961 | 0.451 |
| Mono | 0.996 | 0.002 | 0.961 | 0.451 |
| Eos | 0.983 | 0.014 | 0.939 | 0.444 |
| Bas | 0.978 | 0.019 | 0.928 | 0.442 |
| hepout | 0.752 | 0.250 | 0.791 | 0.274 |

The variables are imputed on the raw scale. Log or square root transformations are applied to those variables with highly skewed distributions using passive imputation (Von Hippel, 2007). Passive imputation is a method to handle derived variables in imputation by transforming the imputed values of the variable. Even though a Normal distribution is not essential in the predictors, transformations were applied to ensure predictors were of similar order of magnitude across analysis methods. Complex transformations (like Box-Cox power transformations) were not considered for the final model for clarity of interpretation. Finally, logistic regression is used to predict HCV status on the basis of age, sex and the 23 biomarkers.

3 Results

There were over 100 different patterns of missingness in the predictor variables across the 14,320 patients (Figure 2a). There were 7,820 individuals (54.6%) with complete data on age, sex and all 23 biomarkers.
Percentages of missingness (Figure 2b, Table 2) occur in one of three ways. For Sodium, Potassium, creatinine (Crea) and Urea, there is approximately 15% missingness, whether or not a patient was HCV positive. For serum albumin (ALB), alanine aminotransferase (ALT), alkaline phosphatase (ALP), gamma glutamyl transferase (GGT) and total serum bilirubin (TBil), there is about 15% missingness amongst HepC negative, and less than 1% missingness amongst HepC positive. For the remainder (Age, Sex, platelets (Plt), monocytes (Mono), eosinophils (Eos), basophils (Bas), lymphocytes (Lymph)), there was only occasional missingness of data.

The area under the ROC curve (AUROC) (Figure 5a) for the available cases logistic regression model on the validation dataset was 72%; that is, the model has a 72% chance of ranking a randomly chosen individual with a positive HepC assay above a randomly chosen individual with a negative assay. The model identified the variables Age, Sex, Sodium, RDW, Potassium, log(ALT), log(ALP), log(Crea), log(TBil), log(Urea) and sqrt(Lymph) as significant ($p<0.05$; see Table 5).
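The AUROC can be read as the proportion of positive-negative pairs in which the positive case receives the higher predicted score, with ties counted as one half. The sketch below computes it by direct pairwise comparison. It is an illustration of the quantity and not the ‘ROCR’ implementation used in the study; the function name `auroc`, the scores and the labels are hypothetical.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A pair counts 1 if the positive scores higher, 0.5 if tied, 0 otherwise.
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities and true HepC status (1 = positive).
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.1]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = auroc(toy_scores, toy_labels)
```

The pairwise form is quadratic in the sample size, so for a dataset of this study's size a rank-based formula or a dedicated package would be the practical choice.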

As stated earlier, variables with higher outflux are potentially powerful predictors and can be used to impute variables with missing values. As indicated by the flux plot (Figure 3), variables in the top left corner have more complete data. Variables closer to the diagonal have more balanced values of influx and outflux. The group below the diagonal, with higher values of influx, depends highly on the imputation model. In this study, we include variables with outflux values $>0.90$ in the imputation model. These comprise: Age, Sex, Hb, RCC, MCV, Hct, RDW, Mch, MCHC, Plt, WCC, Mono, Eos, Bas, Neut and Lymph.

We fit a stepwise logistic model to predict HepC and count the number of times each variable occurred in the model in each of the imputed datasets. This was done for 5 and for 20 imputations. We observe that variables that did not occur in any of the models when there were 5 imputations occur at low frequencies when there were 20 imputations.

Figure 4: Percentage of missingness across variables in the dataset. Variables coloured pink have between 0 and 10% missing values, variables in blue have between 11 and 25% missing values. (To be printed in Colour)
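Table 4: Tabulation of the number of occurrences in imputation model across all imputed datasets.

| Variable | No. of Occurrences (5 imputations) | No. of Occurrences (20 imputations) |
|---|---|---|
| Age | 5 | 20 |
| ALB | 5 | 20 |
| ALT | 5 | 20 |
| Lymphocytes | 5 | 20 |
| RCC | 5 | 20 |
| RDW | 5 | 20 |
| Sex | 5 | 20 |
| TBil | 5 | 20 |
| Urea | 4 | 20 |
| Crea | 5 | 18 |
| K | 3 | 15 |
| MCV | 4 | 14 |
| Basophils | 5 | 13 |
| Hb | 3 | 13 |
| MCHC | 3 | 13 |
| Sodium | 4 | 13 |
| Hct | 2 | 9 |
| ALKP | 0 | 7 |
| Monocytes | 2 | 6 |
| Eosinophils | 0 | 5 |
| Mch | 3 | 5 |
| WCC | 0 | 3 |
| Neut | 0 | 2 |
| Plt | 0 | 1 |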

Variables that occurred in all imputation models were included in the supermodel. For the other variables, the likelihood ratio test was applied to decide whether the variable should be included in the final model. Variables with p-value $<0.05$ were selected for the final supermodel. Table 4 shows the number of occurrences of each variable when the number of imputations was 5 and 20.
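The first part of this selection rule can be sketched as a simple partition of the variables by how often they were selected. The function name `split_by_selection` is hypothetical; the counts are a subset of the 5-imputation column of Table 4. The subsequent likelihood ratio test for the intermediate group is not shown, since it requires the fitted models.

```python
def split_by_selection(counts, m):
    """Partition variables by how often stepwise selection kept them.

    counts maps variable name -> number of imputed datasets (out of m)
    in which the variable was selected.
    """
    always = sorted(v for v, c in counts.items() if c == m)       # enter the supermodel directly
    sometimes = sorted(v for v, c in counts.items() if 0 < c < m)  # candidates for a formal test
    never = sorted(v for v, c in counts.items() if c == 0)        # never selected
    return always, sometimes, never

# Subset of Table 4, 5-imputation column.
counts_m5 = {"Age": 5, "ALT": 5, "Urea": 4, "K": 3, "ALKP": 0, "WCC": 0}
always_in, to_test, never_in = split_by_selection(counts_m5, m=5)
```

With 20 imputations the same function would place Urea in the first group and ALKP and WCC in the second, which reflects the greater stability of selection noted above.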

Table 5: Regression coefficients for logistic regression models. Log = log transformation applied before regression analysis. Sqrt = square root transformation applied before regression analysis.
Variable CC: β (SE) CC: p-value MICE (m = 5): β (SE) MICE (m = 5): p-value MICE (m = 20): β (SE) MICE (m = 20): p-value
Intercept 3.408 (2.455) 0.1651 31.577 (8.937) <0.001 -0.7547 (3.2853) 0.8183
Age -0.016 (0.003) <0.001 -0.016 (0.003) <0.001 -0.0169 (0.003) <0.001
Sex 0.664 (0.109) <0.001 0.516 (0.112) <0.001 0.5043 (0.0956) <0.001
RDW 0.087 (0.026) <0.001 0.142 (0.023) <0.001 0.1592 (0.0254) <0.001
Log(ALT) 1.415 (0.123) <0.001 0.3504 (0.0481) <0.001 0.3436 (0.0445) <0.001
Sqrt(Lymph) 0.312 (0.112) 0.0056 0.3912 (0.099) <0.001 0.3406 (0.0943) <0.001
ALB -0.016 (0.009) 0.0644 -0.044 (0.008) <0.001 -0.0473 (0.0084) <0.001
Potassium 0.383 (0.112) <0.001 0.1439 (0.0962) 0.1346
Sodium -0.035 (0.016) 0.0292
Sqrt(Bas) 0.701 (0.361) 0.0523
Log(ALKP) -0.921 (0.269) <0.001
Log(Crea) -1.320 (0.558) 0.0180 -0.3904 (0.195) 0.0046
Log(TBil) -1.114 (0.170) <0.001 -0.3945 (0.065) <0.001 -0.4031 (0.0656) <0.001
Log(Urea) -1.261 (0.338) <0.001 -0.251 (0.135) 0.0652 -0.4706 (0.114) <0.001
Sqrt(Mono) -0.174 (0.275) 0.5251
Hb 0.0913 (0.022) <0.001 0.0774 (0.0335) 0.0211
MCHC -0.0902 (0.024) <0.001
MCV -0.2328 (0.0889) 0.0089 0.0712 (0.0549) 0.0194
RCC -2.107 (0.698) 0.0025 -2.0800 (0.7416) 0.0051
Mch 0.503 (0.2762) 0.0688 -0.4026 (0.1620) 0.0131
Hct 4.3233 (11.205) 0.0699
Figure 5: Receiver Operating Characteristic (ROC) curves for each method of analysis. The diagonal line represents an AUROC of 50%. (To be printed in Colour)
(a) Available Case analysis
(b) MICE (m = 5)
(c) MICE (m = 20)

The parameter estimates from the multiple imputation methods are automatically pooled and presented in Table 5. The AUROCs of the five-fold imputed models have a mean of 72% (SD 0.003). We increased the number of imputations from 5 to 20 and observed that variable selection was more consistent, reducing the randomness in the variable selection process. The AUROCs of the twenty-fold imputed models have a mean of 71% (SD 0.002).

4 Discussion

Discussion of the results will involve both the meaning of the results for multiple imputation in general, and the results for the diagnosis of hepatitis C in particular. The focus of the study is to improve the diagnostic process by developing a powerful predictive model for HepC without discarding any information collected during laboratory tests due to the presence of missing values, and using the completed dataset for developing this model.

The percentage of missing values displayed three patterns (see Results). Biomarkers with around 15% missingness included sodium, potassium, creatinine and urea, which are associated with the Urea-Electrolytes-Creatinine battery of tests. These are routinely requested for most laboratory investigations. Biomarkers such as ALB, ALT, ALP, GGT and TBil displayed around 15% missingness for those found HCV negative, and less than 1% missingness for those found HCV positive. These are routine liver function tests and would be primary markers sought when HepC is suspected. The remaining biomarkers are from the full blood count and are routinely requested, and so records are complete for almost all individuals. A recent paper (Lidbury, 2018) has been published on this strategy, but with the goal of avoiding painful and expensive liver biopsy.

The multiply imputed analyses are largely consistent with the available case analysis (Richardson and Lidbury, 2013, 2017) (Table 5). However, three variables in the single imputation model were not identified as important disease markers in the analysis of the imputed datasets; hence some unnecessary variables were removed via MI. Although the ROC curves (Figure 5) produced by the MI methods did not yield high absolute accuracy, the ten key predictors common to all the multiply imputed regression models suggest a pattern of routine, easily accessible pathology markers which predict HCV infection. In addition to early detection, the ability to use quality routine pathology data in a predictive model where immunoassay is not easily available is a key outcome, especially when handling population health issues for rural or remote areas in third world countries.

The regression models suggest that increased levels of ALT, RDW and lymphocytes increased the odds of detecting HCV infection. All models presented include the variable ALT, which implies that ALT is one of the most useful routine test predictors of HCV infection. The finding for ALT coincides with current medical practice (Holmes et al., 2013), as cases with elevated ALT levels are more likely to be infected with HCV. However, ALT elevation is not specific for viral hepatitis (Kwo et al., 2017) and so an increased level of ALT in the blood is used as a guide to request specific second-tier tests that may include HepC immunoassay.
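Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in the predictor. As a worked illustration, the sketch below converts the Age and RDW coefficients reported for MICE (m = 20) in Table 5 into odds ratios. The helper name `odds_ratio` is hypothetical, and the calculation assumes the other predictors are held fixed.

```python
from math import exp

def odds_ratio(beta, units=1.0):
    """Multiplicative change in odds for a change of `units` in the predictor."""
    return exp(beta * units)

# Coefficients taken from the MICE (m = 20) column of Table 5.
beta_age = -0.0169
beta_rdw = 0.1592

or_age_per_year = odds_ratio(beta_age)          # one additional year of age
or_age_per_decade = odds_ratio(beta_age, 10)    # ten additional years of age
or_rdw_per_unit = odds_ratio(beta_rdw)          # one additional unit of RDW
```

An odds ratio below 1 for Age corresponds to lower odds of a positive HepC assay at older ages, and an odds ratio above 1 for RDW corresponds to higher odds at higher RDW, in line with the direction of the effects described in the text.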

RDW has also been previously identified as having a relationship with HCV in a small-scale Chinese in-patient study (Shang et al., 2013); the current analysis supports that finding in a much larger Australian community cohort. Lymphocytes have also been identified as playing a role in HCV infection in a systematic review of epidemiological studies (He et al., 2016).

Age is another strong predictor of HCV status, with increasing age decreasing the chance of being HCV positive. Evidence from Europe (Cacoub et al., 2016) and Australia (Faustini et al., 2010) supports this. The increased prevalence in this study of HCV diagnosis in younger patients suggests that clinicians could use this evidence to be proactive in seeking information on the key variables identified in this study in the younger age groups. Although the laboratory supplying the data is housed on a hospital campus, the organisation provides pathology services to the public as well as inpatients, hence the sample contains many community-living persons with a wide range of clinical indications leading to the request for an HCV immunoassay.

The multiple imputation methods employed here assume MAR. However, the results of this study are limited by the likelihood that MNAR is actually the type of missingness observed in pathology data. Policy initiatives such as those recently employed for Vitamin D (Boyages, 2016) provide external reasons to reduce the number of pathology tests ordered in a variety of situations. Another likely reason that MNAR would arise in the context of this study is that clinicians are likely to use their clinical judgement to order some tests and not others. This clinical judgment is now also being enhanced by data mining (Lidbury et al., 2015). Non-clinical characteristics can also drive choice of tests requested (Smellie et al., 2002). Practice characteristics and other clinical notes are not captured in the available data.

A second and related limitation of this multiple imputation analysis is that were MAR to be assumed, the parameter estimates from the imputed data are very sensitive to the model employed for the probability of response (Little and Rubin, 2002). The probability of response can be modelled by using linear regression models for continuous variables, logistic regression models for binary variables, and multinomial logistic regression models for categorical variables with more than two classes.

A final limitation of this analysis concerns the rarity of the outcome: only about 6% of subjects (672 of 10,774; Table 1) tested positive for HCV. Whilst the rarity of the outcome supports our view that inclusion bias is likely to be small, it does raise issues around the stability of classical estimation algorithms. Logistic regression for rare events may require the use of penalised likelihood instead of maximum likelihood (King and Zeng, 2001).

5 Conclusion

Faced with missing data in a routine pathology laboratory database, we successfully used multiple imputation (MI) followed by logistic regression to develop a prediction model for HCV-positive assay results. The coefficient estimates of the regression models built on the imputed datasets were pooled to obtain overall estimates. These logistic regression models, on average, explained about 70% of the variation in the imputed dataset. The regression models built on imputed datasets have AUROCs similar to that of the regression model generated using the complete dataset (73% on the complete dataset compared with an average of 71% to 72% on the multiply imputed datasets). However, the regression models on the imputed datasets identified more predictors as significant in predicting HCV infection. MI therefore leads to richer prediction models, and not using MI may mean missing predictors that have a useful relationship with the outcome of interest. This study suggests that applying MI to laboratory data in the diagnostic process for HCV infection identifies combinations of routinely measured biomarkers that are integral to the prediction of HCV infection.
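The pooling of coefficient estimates across imputed datasets mentioned above is conventionally done with Rubin's rules (Rubin, 1987). The following is a minimal sketch of those rules for a single coefficient, assuming each imputed dataset yields one estimate and its squared standard error. The function name `pool_rubin` and the input numbers are hypothetical and are not results from this study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimate from each imputed dataset.
    variances: the squared standard error of that estimate in each dataset.
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)        # average of the m estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical inputs for illustration only.
pooled_estimate, total_variance = pool_rubin([0.50, 0.70, 0.60], [0.04, 0.04, 0.04])
print(pooled_estimate, total_variance ** 0.5)  # pooled estimate and its standard error
```

The total variance combines the average sampling variance with the extra uncertainty due to imputation, so the pooled standard error is never smaller than the one obtained by treating a single imputed dataset as if it were complete.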

Acknowledgements: The authors wish to thank Dr Gus Koerbin and staff at ACT Pathology for their support of this project. They also thank Mr Sheikh Faisal who undertook the initial analysis of this data as part of his Master’s degree research.

Financial support: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

  • Blach et al. (2017) Blach, S., S. Zeuzem, M. Manns, I. Altraif, A.-S. Duberg, D. H. Muljono, I. Waked, S. M. Alavian, M.-H. Lee, F. Negro, et al. (2017). Global prevalence and genotype distribution of hepatitis C virus infection in 2015: a modelling study. The Lancet Gastroenterology & Hepatology 2(3), 161–176.
  • Boyages (2016) Boyages, S. C. (2016). Vitamin D testing: new targeted guidelines stem the overtesting tide. The Medical Journal of Australia 204(1), 18.
  • Cacoub et al. (2016) Cacoub, P., C. Comarmond, F. Domont, L. Savey, A. C. Desbois, and D. Saadoun (2016). Extrahepatic manifestations of chronic hepatitis C virus infection. Therapeutic Advances in Infectious Disease 3(1), 3–14.
  • Dempster et al. (1977) Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–22.
  • El-Serag (2012) El-Serag, H. B. (2012). Epidemiology of viral hepatitis and hepatocellular carcinoma. Gastroenterology 142(6), 1264–1273.
  • Faustini et al. (2010) Faustini, A., P. Colais, E. Fabrizi, A. M. Bargagli, M. Davoli, D. Di Lallo, A. Di Napoli, P. Pezzotti, C. Sorge, R. Grillo, et al. (2010). Hepatic and extra-hepatic sequelae, and prevalence of viral hepatitis C infection estimated from routine data in at-risk groups. BMC Infectious Diseases 10(1), 1–13.
  • European Association for the Study of the Liver (2014) European Association for the Study of the Liver (2014). EASL clinical practice guidelines: management of hepatitis C virus infection. Journal of Hepatology 60(2), 392–420.
  • Gross and Ligges (2015) Gross, J. and U. Ligges (2015). Package ‘nortest’: Five omnibus tests for testing the composite hypothesis of normality. Dortmund, Germany: Dortmund University. R package version 1.0.4.
  • He et al. (2016) He, Q., Q. He, X. Qin, S. Li, T. Li, L. Xie, Y. Deng, Y. He, Y. Chen, and Z. Wei (2016). The relationship between inflammatory marker levels and hepatitis C virus severity. Gastroenterology Research and Practice 2016.
  • Holmes et al. (2013) Holmes, J., A. Thompson, and S. Bell (2013). Hepatitis C: an update. Australian Journal of General Practice 42(7), 452.
  • Horton and Lipsitz (2001) Horton, N. J. and S. R. Lipsitz (2001). Multiple imputation in practice: comparison of software packages for regression models with missing variables. The American Statistician 55(3), 244–254.
  • Janssen et al. (2009) Janssen, K. J., Y. Vergouwe, A. R. T. Donders, F. E. Harrell Jr, Q. Chen, D. E. Grobbee, and K. G. Moons (2009). Dealing with missing predictor values when applying clinical prediction models. Clinical Chemistry 55(5), 994–1001.
  • King and Zeng (2001) King, G. and L. Zeng (2001). Logistic regression in rare events data. Political Analysis 9(2), 137–163.
  • Kwo et al. (2017) Kwo, P. Y., S. M. Cohen, and J. K. Lim (2017). ACG clinical guideline: evaluation of abnormal liver chemistries. The American Journal of Gastroenterology 112(1), 18–35.
  • Lidbury (2018) Lidbury, B. A. (2018). Predicting liver disease post hepatitis virus infection: In silico pathology and pattern recognition. EBioMedicine 35, 10–11.
  • Lidbury et al. (2015) Lidbury, B. A., A. M. Richardson, and T. Badrick (2015). Assessment of machine-learning techniques on large pathology data sets to address assay redundancy in routine liver function test profiles. Diagnosis 2(1), 41–51.
  • Little and Rubin (2002) Little, R. J. and D. B. Rubin (2002). Statistical analysis with missing data, Volume 793. John Wiley & Sons.
  • World Health Organization (2018) World Health Organization (2018). Guidelines for the care and treatment of persons diagnosed with chronic hepatitis C virus infection. World Health Organization.
  • R Core Team (2021) R Core Team (2021). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
  • Razali et al. (2007) Razali, K., H. H. Thein, J. Bell, M. Cooper-Stanbury, K. Dolan, G. Dore, J. George, J. Kaldor, M. Karvelas, J. Li, et al. (2007). Modelling the hepatitis C virus epidemic in Australia. Drug and Alcohol Dependence 91(2-3), 228–235.
  • Revelle (2022) Revelle, W. (2022). psych: Procedures for Psychological, Psychometric, and Personality Research. Evanston, Illinois: Northwestern University. R package version 2.2.3.
  • Richardson et al. (2016) Richardson, A., B. M. Signor, B. A. Lidbury, and T. Badrick (2016). Clinical chemistry in higher dimensions: machine-learning and enhanced prediction from routine clinical chemistry data. Clinical Biochemistry 49(16-17), 1213–1220.
  • Richardson and Lidbury (2013) Richardson, A. M. and B. A. Lidbury (2013). Infection status outcome, machine learning method and virus type interact to affect the optimised prediction of hepatitis virus immunoassay results from routine pathology laboratory assays in unbalanced data. BMC Bioinformatics 14(1), 1–8.
  • Richardson and Lidbury (2017) Richardson, A. M. and B. A. Lidbury (2017). Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines. BMC Medical Informatics and Decision Making 17(1), 1–11.
  • Rubin (1987) Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys, Volume 81. John Wiley & Sons.
  • Rubright et al. (2014) Rubright, J. D., R. Nandakumar, and J. J. Glutting (2014). A simulation study of missing data with multiple missing X’s. Practical Assessment, Research, and Evaluation 19(1), 10.
  • Shang et al. (2013) Shang, G., A. Richardson, M. E. Gahan, S. Easteal, S. Ohms, and B. A. Lidbury (2013). Predicting the presence of hepatitis B virus surface antigen in Chinese patients by pathology data mining. Journal of Medical Virology 85(8), 1334–1339.
  • Sing et al. (2005) Sing, T., O. Sander, N. Beerenwinkel, and T. Lengauer (2005). ROCR: visualizing classifier performance in R. Bioinformatics 21(20), 3940–3941.
  • Smellie et al. (2002) Smellie, W., M. Galloway, D. Chinn, and P. Gedling (2002). Is clinical practice variability the major reason for differences in pathology requesting patterns in general practice? Journal of Clinical Pathology 55(4), 312–314.
  • Stanaway et al. (2016) Stanaway, J. D., A. D. Flaxman, M. Naghavi, C. Fitzmaurice, T. Vos, I. Abubakar, L. J. Abu-Raddad, R. Assadi, N. Bhala, B. Cowie, et al. (2016). The global burden of viral hepatitis from 1990 to 2013: findings from the global burden of disease study 2013. The Lancet 388(10049), 1081–1088.
  • Su et al. (2011) Su, Y.-S., A. Gelman, J. Hill, and M. Yajima (2011). Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Journal of Statistical Software 45, 1–31.
  • Van Buuren (2012) Van Buuren, S. (2012). Flexible imputation of missing data. CRC Press.
  • Van Buuren and Groothuis-Oudshoorn (2011) Van Buuren, S. and K. Groothuis-Oudshoorn (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45, 1–67.
  • Venables and Ripley (2002) Venables, W. N. and B. D. Ripley (2002). Modern applied statistics with S-PLUS. Springer Science & Business Media.
  • Vietri et al. (2013) Vietri, J., G. Prajapati, and A. C. El Khoury (2013). The burden of hepatitis C in Europe from the patients’ perspective: a survey in 5 countries. BMC Gastroenterology 13(1), 1–8.
  • Von Hippel (2007) Von Hippel, P. T. (2007). Regression with missing Ys: an improved strategy for analyzing multiply imputed data. Sociological Methodology 37(1), 83–117.
  • Waljee et al. (2013) Waljee, A. K., A. Mukherjee, A. G. Singal, Y. Zhang, J. Warren, U. Balis, J. Marrero, J. Zhu, and P. D. Higgins (2013). Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 3(8), e002847.
  • Yang et al. (2005) Yang, X., T. R. Belin, and W. J. Boscardin (2005). Imputation and variable selection in linear regression models with missing covariates. Biometrics 61(2), 498–506.