This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

The Effects of Air Pollution on Health:
A Study of Los Angeles County
October, 2024

©Yanfei Qu

1 Abstract

This study aims to develop and implement a Poisson regression model with measurement error using a Bayesian framework, with model fitting performed in Stan. The focus is on examining the relationship between air pollution exposure and health outcomes, such as respiratory and cardiovascular disease counts, while accounting for inaccuracies in pollution measurements. Air pollution data is often subject to measurement error due to imperfect monitoring or averaging, which, if ignored, can lead to biased estimates and incorrect conclusions. The Poisson regression will model count data, where the response variable, such as disease counts, follows a Poisson distribution. Covariates including pollution levels, demographic factors, and meteorological conditions will be incorporated to control for confounders. To address measurement error in the exposure data, a Bayesian hierarchical model will be used, where observed pollution levels are treated as noisy measurements of the true underlying exposure. Priors will be specified for both the regression coefficients and the measurement error parameters, and posterior distributions will be estimated via Markov Chain Monte Carlo (MCMC) sampling. This approach ensures that both the count nature of the response and the uncertainty in exposure measurements are properly accounted for, leading to more accurate estimates of the health risks associated with air pollution.

2 Introduction

2.1 Background and Objectives

Air pollution poses a significant public health threat, with a growing body of research illustrating its links to various adverse health outcomes. Fine particulate matter (PM2.5 and PM10) has been particularly studied for its ability to penetrate deeply into the respiratory system and bloodstream, increasing the risk of respiratory and cardiovascular diseases. Research has consistently shown that both long-term and short-term exposure to these particles correlates with heightened mortality rates from conditions such as lung cancer and cardiopulmonary diseases. Additionally, pollutants like carbon monoxide (CO), sulfur dioxide (SO2), nitrogen dioxide (NO2), and ozone (O3) contribute to the public health burden, leading to increased hospital admissions and premature deaths.

The health impacts of air pollution are driven by complex mechanisms, including inflammation and oxidative stress. Pollutants can generate reactive oxygen species, causing cellular damage and systemic inflammation, which exacerbate chronic conditions like chronic obstructive pulmonary disease (COPD) and heart failure. Traditional studies have often relied on long-term exposure data, aggregated over annual or monthly averages. However, emerging research highlights that short-term fluctuations in air pollution levels can also have significant health implications. Weekly data, as opposed to daily or monthly measurements, can offer insights into intermediate-term effects of air pollution, capturing both short-term variations and longer-term trends.

Research has consistently shown that air pollution significantly affects various diseases, including respiratory and cardiovascular conditions. For instance, studies have demonstrated associations between short-term exposure to PM2.5 and increased exacerbations of COPD, while air pollution exposure has been linked to worsening respiratory symptoms and increased hospital admissions for pneumonia. The relationship between air pollution and health outcomes can vary by region, influenced by factors such as traffic density and meteorological conditions. For example, Los Angeles County experiences high pollution levels due to traffic and temperature inversions, which trap pollutants close to the ground.

Meteorological factors, such as temperature, air pressure, and humidity, also play a role in influencing pollution levels and health outcomes. High temperatures can enhance ground-level ozone formation, leading to increased respiratory symptoms.

Given the complexities of air pollution data, advanced statistical methods are crucial for analyzing its impacts on health outcomes. Generalized Additive Models (GAMs) are useful for capturing non-linear relationships and interactions between air pollution, meteorological covariates, and health outcomes. Meanwhile, ARIMA models help address temporal autocorrelation, making them valuable for forecasting pollutant levels and assessing their health impacts. Bayesian hierarchical models further enhance the analysis by allowing the incorporation of multiple levels of variation and uncertainty, which is essential for studying spatial and temporal variability in air pollution data.

Measurement error is a critical concern in air pollution studies, as it can lead to biased estimates of health effects. Approaches such as regression calibration and simulation-extrapolation (SIMEX) methods can help account for measurement error, providing more accurate estimates. Additionally, time series analysis using Poisson log-linear regression is a foundational tool for studying the effects of air pollution on health outcomes, effectively estimating the relative risk associated with exposure and addressing issues like overdispersion. Harmonic terms, such as sine and cosine functions, can be included in these models to capture seasonal patterns, which are particularly relevant in environmental epidemiology due to the seasonal variation in exposures and outcomes. State-space models offer additional flexibility in analyzing time series data by allowing for decomposition into underlying components, which is valuable for studies with noisy or incomplete data.

While substantial evidence exists regarding the adverse health effects of air pollution, particularly concerning respiratory and cardiovascular diseases, challenges remain. Many studies focus on long-term exposure or daily data, with fewer investigations addressing the intermediate-term impacts captured by weekly data. Furthermore, there is a need for region-specific analyses that account for localized pollution sources and meteorological conditions.

To address these gaps, this study employs advanced time series modeling techniques, including Poisson log-linear regression and ARIMA models, tailored to weekly air pollution and health outcome data. By integrating region-specific environmental variables and addressing potential measurement errors, this research aims to provide more accurate estimates of the short-to-intermediate-term effects of air pollution on public health, specifically focusing on respiratory diseases and cardiovascular diseases such as COPD, pneumonia, neoplasms, and heart failure in Los Angeles County.

2.2 Outline of the Article

The objective of this project is to explore the impact of weekly air pollution on respiratory and cardiovascular death counts in Los Angeles County, CA, while also considering meteorological factors. The document is organized in the following manner: The third section will provide a comprehensive description of the dataset, including a detailed descriptive data analysis. The methodology section will focus on model selection, primarily utilizing Generalized Linear Models (GLMs) and Stan model, supplemented by simulation examples to demonstrate model performance. The results section will detail how air pollution and meteorological factors affect health outcomes. The discussion will interpret these findings, compare them with existing research, address limitations, and offer recommendations for future research and policy.

3 Dataset

The data for this study consist of weekly county-level respiratory and cardiovascular death counts, alongside daily city-level air quality and meteorological data from January 1, 2018, to December 17, 2022, for Los Angeles County, California. Respiratory and cardiovascular mortality data were obtained from the Centers for Disease Control and Prevention (CDC), which provides detailed death certificate data, including underlying and additional causes of death, demographic information, and various categorical breakdowns. Respiratory deaths were identified using ICD-10 codes J00–J98, and cardiovascular deaths were identified using ICD-10 code I50.0 for heart failure.
Air quality data included daily concentrations of PM2.5, PM10, nitrogen oxides (NO2), carbon monoxide (CO), sulfur dioxide (SO2), and ozone (O3) from city-level monitoring stations. To align with the weekly death count data, we filtered cities with complete data coverage over the study period and averaged the measurements. Meteorological data, including mean daily temperature, relative humidity, and barometric pressure, were sourced from city-level measurements provided by the U.S. Environmental Protection Agency (EPA). To ensure consistency with the weekly death counts, air quality and meteorological data were aggregated to a weekly measure by calculating the 7-day mean from the daily data.

3.1 Descriptive Data Analysis

Mean (μ\mu) SD (σ\sigma) Percentage
Min 25 Median 75 Max
Air pollutants concentration PM2.5 (μ\mug/m3) 11.73 4.49 3.40 8.99 10.86 13.37 37.55
PM10 (μ\mug/m3) 26.49 10.42 3.79 20.62 26.67 31.68 87.34
CO (ppm) 0.35 0.12 0.16 0.25 0.32 0.43 0.77
SO2 (ppb) 0.36 0.17 0.01 0.24 0.34 0.46 1.02
NO2 (ppb) 12.80 4.57 5.29 9.20 11.68 16.21 27.21
O3 (ppm) 0.03 0.01 0.01 0.02 0.03 0.04 0.05
Meteorological factors Temperature (C) 17.27 6.12 4.95 12.34 16.53 22.66 30.14
Humidity (%) 56.71 11.59 21.41 50.43 59.15 64.64 81.82
Air pressure (hPa) 987.75 7.25 966.78 987.46 989.63 992.24 997.94
Weekly deaths, count Resp-COPD 40 9.23 21 33 38 46 68
Resp-Pneumonia 31 9.17 13 25 30 36 70
Resp-Neoplasm 45 6.71 28 40 45 49 71
Cardio-Heart Failure 43 8.31 22 38 43 47 69
Table 1: Weekly pollutant concentrations, meteorological factors, and numbers of deaths.

Table 1 presents the distribution of deaths, meteorological factors, and air pollutants for Los Angeles County, California, from January 1, 2018, to December 17, 2022 (259 weeks in total).

Over the five-year study period, the mean weekly concentrations of various air pollutants were observed to be 11.73 μ\mug/m3 for PM2.5, 26.49 μ\mug/m3 for PM10, 0.35 ppm for CO, 0.36 ppb for SO2, 12.80 ppb for NO2, 0.03 ppm for O3, respectively. Despite Los Angeles’s reputation as a highly polluted region in the United States, our analysis revealed that all measured air pollutants from 2018 to 2022 remained below their respective standard criteria levels.
A closer examination of the data through timeseries plots revealed interesting seasonal patterns. Specifically, the concentrations of CO and NO2 tended to be lower during hotter periods, while the concentrations of other pollutants increased. This inverse relationship underscores the complex interactions between temperature and pollutant levels.

Analyzing trends over time, we observed that SO2 concentrations decreased from 2018 to 2019, then plateaued, before rising again after 2019. O3 levels were stable from 2018 until 2020, after which they experienced a sudden increase, maintaining this higher level thereafter. PM2.5 levels were relatively steady but showed peaks during the winters of 2020 and 2021. Other pollutants exhibited steady levels throughout the study period. Notably, all pollutants displayed clear seasonal patterns, further emphasizing the influence of temporal factors on air quality.
In terms of meteorological factors, the mean daily temperature, pressure, and humidity during the study period were 17.27 C, 987.75 hPa, and 56.71%\%, respectively. These environmental variables play a crucial role in understanding the broader context of air pollution and its potential health impacts.
Table 1 also illustrates the weekly distribution of deaths from COPD, Pneumonia, Neoplasms, and Heart Failure. Between January 1, 2018, and December 17, 2022, a total of 40,982 deaths were recorded, with 10,287 attributed to COPD, 8,038 attributed to Pneumonia, 11,541 attributed to Neoplasms, and 11,116 attributed to Heart Failure. On average, the study area experienced about 158 deaths per week. Meanwhile, seasonal analysis revealed that the number of deaths was higher during the colder months compared to the warmer months, highlighting a potential seasonal influence on mortality rates.

3.2 EPA Daily Air Pollutant and Meteorological Datasets

3.2.1 Overview of EPA Datasets

The U.S. Environmental Protection Agency (EPA) monitors and regulates air quality across the United States through an extensive network of monitoring stations. These stations collect data on various air pollutants and meteorological variables, essential for assessing environmental health risks, studying the impact of pollutants on public health, and formulating air quality regulations. The datasets provided by the EPA are widely used in academic research, public policy, and environmental management.
The EPA’s air quality data are primarily housed in the Air Quality System (AQS) database, which contains millions of data points collected from thousands of monitoring stations located in urban, suburban, and rural areas. This comprehensive coverage enables researchers and policymakers to query data by pollutant, location, time period, and other parameters.

The EPA monitors several key pollutants and also reports measurements for main meteorological factors, each with significant implications for public health and environmental quality:

  • PM2.5: Fine particles with diameters less than 2.5 micrometers can penetrate deep into the lungs and enter the bloodstream, leading to serious health effects such as heart attacks, aggravated asthma, and premature death. PM2.5 is primarily emitted from combustion processes, including motor vehicles and power plants.

  • PM10: Particles with diameters less than 10 micrometers can be inhaled but primarily affect the upper respiratory tract. Sources include construction activities and road dust. The EPA sets National Ambient Air Quality Standards (NAAQS) for both PM2.5 and PM10 to protect public health.

  • Ozone (O3): Ground-level ozone is formed by chemical reactions between volatile organic compounds (VOCs) and nitrogen oxides (NOx) in the presence of sunlight. It can cause respiratory issues, including chest pain and airway inflammation. The EPA monitors ozone concentrations to ensure compliance with NAAQS.

  • Carbon Monoxide (CO): A colorless, odorless gas produced by the incomplete combustion of carbon-containing fuels. CO interferes with oxygen transport in the blood, leading to cardiovascular and neurological effects. The EPA monitors CO levels, especially in urban areas.

  • Nitrogen Dioxide (NO2): A highly reactive gas produced by motor vehicles and industrial activities, NO2 contributes to the formation of ground-level ozone and fine particulate matter. It can irritate airways and exacerbate respiratory diseases. The EPA monitors NO2 to evaluate air quality.

  • Sulfur Dioxide (SO2): Produced by the burning of fossil fuels containing sulfur, SO2 can form fine particulate matter and acid rain. Exposure can cause respiratory symptoms and lung disease, particularly in individuals with asthma. The EPA monitors SO2 concentrations to reduce industrial air pollution.

  • Temperature: Temperature affects the rate of chemical reactions in the atmosphere, including those that lead to the formation of secondary pollutants such as ozone. High temperatures can also increase the emissions of certain pollutants from natural and anthropogenic sources. The EPA monitors temperature to help understand its influence on air quality and to support the development of air quality models.

  • Humidity: Humidity, or the amount of water vapor in the air, can influence the formation and growth of particulate matter, as well as the removal of pollutants through processes such as wet deposition. High humidity levels can exacerbate the health effects of air pollutants by making it harder for individuals to breathe, especially those with respiratory conditions. The EPA’s humidity data are used in studies of air quality and health, as well as in the development of pollution control strategies.

  • Air Pressure: Air pressure is a fundamental meteorological variable that influences weather patterns and the vertical mixing of air in the atmosphere. Changes in air pressure can affect the transport and dispersion of air pollutants, as well as the likelihood of certain meteorological events, such as temperature inversions, that can trap pollutants near the ground. The EPA monitors air pressure to support the analysis of air pollution episodes and to improve the accuracy of air quality forecasts.

3.3 CDC WONDER Provisional Mortality Statistics Dataset

3.3.1 Overview of the Dataset

The CDC WONDER (Wide-ranging Online Data for Epidemiologic Research) platform provides access to a wealth of health-related datasets, including the ”Provisional Mortality Statistics, 2018 through Last Week,” which offers near real-time mortality data derived from death certificates filed in the United States. This dataset is invaluable for understanding evolving mortality patterns and enabling timely public health interventions.

3.3.2 Data Collection and Processing

This dataset comprises mortality counts derived from death certificates. The underlying cause of death is coded according to the International Classification of Diseases, 10th Revision (ICD-10), ensuring consistent categorization for analysis. It is important to note the provisional nature of the dataset; the data are subject to revision as additional death certificates are processed. Users have to be aware that the numbers may be updated periodically as new information becomes available, introducing uncertainty into time-sensitive analyses.

While comprehensive and up-to-date, users must consider the dataset’s provisional nature when interpreting the data. The ongoing updates mean initial mortality counts may change, which can create uncertainty, particularly in rapidly evolving public health situations. Additionally, the granularity at the county level, while beneficial for local analyses, may pose challenges for broader generalizations across states or regions with different data collection practices. Researchers should corroborate findings with other data sources or conduct sensitivity analyses to account for potential variations.

In this project, the CDC WONDER dataset will be utilized to examine the relationship between mortality rates and various health and environmental factors, focusing on the impact of air pollution and other socio-demographic variables in Los Angeles County, CA. This analysis will contribute to a deeper understanding of the factors driving mortality trends and inform public health strategies aimed at reducing preventable deaths.

3.4 Population Data Source

3.4.1 Overview of the Dataset

The population data for Los Angeles County, CA, utilized in this study, is sourced from the Federal Reserve Economic Data (FRED) system, provided by the Federal Reserve Bank of St. Louis. The specific series, CALOSA7POP, offers annual estimates of the county’s population, spanning from 1970 to the most recent year available. This dataset is essential for conducting demographic and epidemiological analyses, particularly in modeling mortality rates relative to population size.

3.4.2 Dataset Characteristics and Relevance

The CALOSA7POP series tracks the total population within Los Angeles County. These estimates, based on data from the U.S. Census Bureau, undergo regular updates to reflect the most accurate demographic information. The dataset captures annual population changes, accounting for natural growth (births and deaths) and migration patterns. Given Los Angeles County’s status as one of the most populous and diverse regions in the United States, these estimates are crucial for analyzing public health outcomes.

3.4.3 Application of Population Data in Modeling

In statistical modeling, particularly in Poisson log-linear regression, it is imperative to account for population size when assessing mortality counts. Therefore, population estimates will serve as an offset in the model, allowing for the standardization of death counts per capita. By doing so, the analysis can effectively highlight the relationship between air pollution exposure and mortality rates while controlling for population dynamics.

This population dataset complements the health and environmental data, enabling a comprehensive understanding of the interrelationships among air quality, demographic factors, and health outcomes.

4 Methodology

In this section, we outline a detailed methodology to model the association between air PM2.5 levels and death counts using a Poisson log-linear generalized linear model (GLM) while accounting for measurement error on the log scale. The model incorporates time series elements (autoregressive and moving average components) and harmonic seasonal components to capture the underlying structure of the true pollution levels. The goal is to estimate the true effect of pollution on mortality while considering measurement error, temporal dependencies, and the effect of temperature conditions.

4.1 Notation

  • YtY_{t}: Death counts in week tt.

  • PtP_{t}: Observed PM2.5 level in week tt.

  • XtX_{t}: True (unobserved) PM2.5 level in week tt.

  • CtC_{t}: Vector of covariates in week tt.

  • TtT_{t}: Binary categorical variable for temperature conditions (1 = Hot, 0 = Cold).

  • offsett\text{offset}_{t}: Known offset term (e.g., population size).

  • t=1,,nt=1,\dots,n: Time index.

4.2 Converting Observed Daily Data to Weekly

Refer to caption
Figure 1: PM2.5 Monitoring Stations in Los Angeles County (White Area)

Averaging is utilized to derive time series values from raw data, especially when the data may contain errors. This section explains the process of calculating both daily and weekly averages, not only for PM2.5 but also for all other air pollutants and meteorological factors.

Figure 1 shows the monitoring station locations (marked by black circles), while the white area represents Los Angeles County. Below, a detailed illustration of this conversion is provided specifically for PM2.5.

4.2.1 Daily and Weekly Averaging

4.2.2 Daily Averaging

For each day (time point dd), the daily measure pdp_{d} is calculated by averaging all available observations collected on that day. If pd,1,pd,2,,pd,np_{d,1},p_{d,2},\ldots,p_{d,n} are the n measurements available for day dd, then the daily average is:

pd=1ni=1npd,ip_{d}=\frac{1}{n}\sum_{i=1}^{n}p_{d,i}

where nn is the number of available observations on day dd. This approach provides a single representative value for each day, reducing the impact of any individual measurement error.

4.2.3 Weekly Averaging

To aggregate daily measures into a weekly format, the weekly value PtP_{t} is obtained by averaging the daily measures over a 7-day period. For week tt, the weekly measure PtP_{t} is calculated as:

Pt=1mjJpjP_{t}=\frac{1}{m}\sum_{j\in J}p_{j}

with JJ being the set of all days dd corresponding to week tt that have available daily PM2.5 measurements. Each element pjp_{j} in this set represents the daily PM2.5 measurement for day jJj\in J. The variable mm signifies the number of elements in the set JJ.

This formula computes the ”7-day average,” which helps to smooth out daily fluctuations and provides a clearer view of trends over the week.

4.2.4 Error-Prone Data

The averaged measures, whether daily or weekly, are inherently subject to measurement errors. Despite averaging, the resulting PtP_{t} may still reflect inaccuracies present in the original data. Thus, while averaging improves data quality by reducing random fluctuations, it does not completely eliminate error.

In our analysis, we initially modeled the PM2.5 time series using a standard ARIMA model, followed by a more complex combination of SARIMA and GARCH to capture potential seasonal patterns and volatility clustering. After evaluating key performance metrics, we found that both models performed similarly, with the SARIMA+GARCH approach showing marginal improvements in certain indicators. However, it exhibited a notably higher Bayesian Information Criterion (BIC), indicating greater model complexity. Given the principle that similar performance metrics often favor the selection of the simpler model, we chose the ARIMA(2,0,1) model for its parsimony and interpretability as the foundation for our subsequent analysis of XtX_{t} in the Stan framework. To further refine our model selection, we also compared ARIMA(2,0,1), ARIMA(2,0,0) and ARIMA(1,0,0). Upon evaluation, we determined that ARIMA(1,0,0) provided the best balance between model fit and complexity. Thus, we decided to adopt ARIMA(1,0,0) for XtX_{t} in our analysis, ensuring a robust yet straightforward approach to modeling the PM2.5 time series.

4.3 Conversion for Temperature

Continuous temperature data is converted to a binary variable for analysis. In this conversion, a temperature is classified as ”hot” (1) if it is above the mean temperature and ”cold” (0) if it is below or equal to the mean temperature. This transformation facilitates a simplified analysis of temperature’s impact on mortality rates, specifically allowing us to investigate whether low temperatures are associated with an increase in death rates.

4.4 Model Specification

4.4.1 Measurement Error Model

The observed PM2.5 level PtP_{t} is assumed to be measured with error relative to the true PM2.5 level XtX_{t}. On the log scale, the measurement error model is defined as:

log(Pt)=log(Xt)+ϵt,ϵt𝒩(0,σϵ2nt),\log(P_{t})=\log(X_{t})+\epsilon_{t},\quad\epsilon_{t}\sim\mathcal{N}(0,\frac{\sigma^{2}_{\epsilon}}{n_{t}}),

where ϵt\epsilon_{t} represents the measurement error on the log scale, σϵ2\sigma^{2}_{\epsilon} denotes the variance of the measurement error. ntn_{t} is the number of available PM2.5 measurements in Week tt; this relationship ensures that the standard deviation decreases as the number of available data points increases within the week.

4.4.2 Time Series Model for True PM2.5 Levels

The true PM2.5 levels XtX_{t} are modeled as a log-transformed autoregressive process without harmonic components. The model is specified as:

log(Xt)=μ+ϕlog(Xt1)+ηt,\log(X_{t})=\mu+\phi\log(X_{t-1})+\eta_{t},

where:

  • μ\mu is the overall mean of the log-transformed pollution process.

  • ϕ\phi is the autoregressive coefficient, representing the dependence of log(Xt)\log(X_{t}) on its past value.

  • ηt𝒩(0,ση2)\eta_{t}\sim\mathcal{N}(0,\sigma^{2}_{\eta}) is the process noise with variance ση2\sigma^{2}_{\eta}.

4.4.3 Poisson Log-Linear Model for Death Counts

The death counts YtY_{t} are modeled using a Poisson distribution, where the log of the expected death count λt\lambda_{t} is a linear function of the log-transformed true PM2.5 level log(Xt)\log(X_{t}), covariates CtC_{t}, and seasonal sine and cosine terms:

YtPoisson(λt),Y_{t}\sim\text{Poisson}(\lambda_{t}),
log(λt)=log(offsett)+γlog(Xt)+δtemperatureTt+δ1sin(2πt52)+δ2cos(2πt52),\log(\lambda_{t})=\log(\text{offset}_{t})+\gamma\log(X_{t})+\delta_{temperature}T_{t}+\delta_{1}\sin\left(\frac{2\pi t}{52}\right)+\delta_{2}\cos\left(\frac{2\pi t}{52}\right),

where:

  • γ\gamma is the coefficient representing the effect of the true PM2.5 level XtX_{t} on the death count YtY_{t}.

  • δtemperature\delta_{temperature} is the coefficient for the binary temperature.

  • δ1\delta_{1} and δ2\delta_{2} are coefficients representing the seasonal effects from the sine and cosine terms.

4.5 Prior Distributions

In a Bayesian framework, we place prior distributions on all the unknown parameters in the model:

  • Mean of Log-Pollution Process:

    μ𝒩(0,σμ2),\mu\sim\mathcal{N}(0,\sigma^{2}_{\mu}),

    where σμ2\sigma^{2}_{\mu} is a large variance reflecting prior uncertainty.

  • Autoregressive Coefficient:

    ϕ𝒩(0,σϕ2),\phi\sim\mathcal{N}(0,\sigma^{2}_{\phi}),

    where σϕ2\sigma^{2}_{\phi} reflects prior uncertainty about the AR coefficient.

  • Effect of Pollution on Mortality:

    γ𝒩(0,σγ2),\gamma\sim\mathcal{N}(0,\sigma^{2}_{\gamma}),

    where σγ2\sigma^{2}_{\gamma} reflects prior uncertainty about the effect of pollution on death counts.

  • Covariate Effects:

    δtemperature𝒩(0,σδtemperature2),\delta_{temperature}\sim\mathcal{N}(0,\sigma^{2}_{\delta_{temperature}}),

    where σtemperature2\sigma^{2}_{temperature} reflects prior uncertainty about the effect of binary temperature on death counts.

  • Seasonal Effects:

    δ1𝒩(0,σδ12),\delta_{1}\sim\mathcal{N}(0,\sigma^{2}_{\delta_{1}}),
    δ2𝒩(0,σδ22),\delta_{2}\sim\mathcal{N}(0,\sigma^{2}_{\delta_{2}}),

    where σδ2\sigma^{2}_{\delta} reflects prior uncertainty about the seasonal effects from sine and cosine terms.

4.6 Software Implementation

Insufficient built-in results can lead to convergence issues in Bayesian modeling for several reasons. First, if the priors for the model parameters are too vague or poorly specified, the model may struggle to find reasonable estimates, which is particularly critical in complex models with multiple parameters. High-dimensional models can complicate the joint posterior distribution, making it challenging for the sampling algorithm to effectively explore the parameter space. Additionally, poor initialization, where the initial values of the parameters are far from high posterior density regions, can result in slow mixing or chains getting stuck in local optima. Model mis-specification also plays a role; if the model does not accurately represent the underlying data-generating process, the sampling may diverge or yield implausible parameter estimates, especially in intricate models like ARIMA for time series.

Furthermore, limitations in the sampling algorithm, such as Hamiltonian Monte Carlo (HMC) struggling in poorly conditioned models with highly correlated parameters, can lead to low acceptance rates or divergent transitions. Lack of reparameterization may hinder efficiency, while insufficient warmup periods can bias results and lead to non-convergence. Lastly, data-related issues, including missing values or outliers, can affect convergence by complicating the model’s ability to accommodate the data structure.

To address these convergence challenges, it is essential to reassess priors for informativeness, consider reparameterization to improve conditioning, increase warmup and iterations for more thorough exploration, and check for divergence to identify problematic regions in the parameter space. Experimenting with different initial values can also enhance convergence by providing the sampler with better starting points.

Overall, ensuring adequate built-in results requires careful attention to model specification, priors, and sampling strategies to facilitate effective exploration of the parameter space.

4.7 Simulation Results

In our analysis, the simulation results demonstrated a commendable performance of the Stan model, effectively addressing many of the convergence challenges outlined above. The use of carefully chosen informative priors significantly improved the model’s ability to converge to reasonable parameter estimates, even in the presence of complex, high-dimensional data structures.

The Hamiltonian Monte Carlo algorithm displayed improved efficiency, achieving higher acceptance rates and reducing the incidence of divergent transitions. This success can be attributed to the thoughtful reparameterization of the model, which enhanced the condition of the posterior distribution and allowed for more effective exploration of the parameter space.

Additionally, the model exhibited robust mixing properties, as evidenced by trace plots that showed well-behaved chains converging to stable distributions across multiple iterations. The diagnostics revealed that the warmup period was sufficient, allowing the sampler to reach high posterior density regions before collecting samples for inference.

We also performed sensitivity analyses by varying initial values, which confirmed that the model was robust to different starting points, further enhancing our confidence in the convergence and reliability of the estimates.

Overall, the simulation results underscore the effectiveness of the Stan model in overcoming convergence issues through meticulous model specification, strategic use of priors, and a robust sampling framework, ultimately leading to credible and actionable insights from the data.

5 Results and Conclusion

5.1 Results

In this study, we developed a sophisticated Bayesian hierarchical model to investigate the impact of air pollution on mortality rates, addressing the complexities introduced by measurement error in pollution data on the log scale. Our model integrates autoregressive (AR) and moving average (MA) components to capture temporal dependencies in true pollution levels and includes harmonic terms to account for seasonal variations. This approach provided a robust framework for understanding the association between air pollution and health outcomes.

As shown in Table 2, the primary finding of our analysis is a significant positive association between true pollution levels and mortality rates. The posterior mean of the coefficient for the true pollution levels, denoted as γ^\hat{\gamma}, was estimated at 11.7511.75 with a 95% credible interval of [10.64,12.93][10.64,12.93]. This result suggests that increases in pollution levels are associated with higher mortality rates. The credible interval indicates the uncertainty around this estimate, reflecting the model’s robustness in capturing the true effect of pollution.

The model’s performance was notably enhanced by incorporating ARMA terms, which successfully accounted for temporal dependencies in the pollution data. The inclusion of these terms resulted in a significant reduction in the residual sum of squares (RSS), improving the goodness-of-fit as reflected in the increased R2R^{2} values. This improvement demonstrates the model’s ability to capture the underlying temporal patterns in the pollution data, thus providing a more accurate representation of the pollution-mortality relationship.

Seasonal variations were effectively modeled using harmonic terms. The estimated coefficients for these seasonal components showed significant seasonal patterns, which aligned with known cycles in pollution levels. This aspect of the model helped in accounting for periodic fluctuations in pollution, thus refining the estimates of the true pollution effects on mortality.

The Hamiltonian Monte Carlo (HMC) method, particularly the No-U-Turn Sampler (NUTS), employed for posterior inference showed successful convergence through Rhat, as confirmed by diagnostic checks. This ensured that the parameter estimates derived from the posterior distributions were reliable. The HMC method iteratively updated estimates of the parameters, including the coefficients for ARMA components, seasonal effects, and measurement error variances, leading to robust and precise estimates.

Measurement error was modeled explicitly by assuming log(Pt)=log(Xt)+ϵt\log(P_{t})=\log(X_{t})+\epsilon_{t}, where ϵt𝒩(0,σϵ2)\epsilon_{t}\sim\mathcal{N}(0,\sigma^{2}_{\epsilon}) represents the measurement error on the log scale. This approach allowed us to account for the discrepancy between observed pollution levels PtP_{t} and the true levels XtX_{t}. The measurement error model significantly improved the precision of the estimated pollution effects, as evidenced by a decrease in the mean squared error (MSE) between the true and estimated pollution levels.

Our results highlight the critical role of accounting for measurement error and temporal dependencies in environmental health studies. By incorporating these elements into our model, we were able to obtain more accurate estimates of the health impacts of air pollution. This model provides valuable insights into the relationship between air pollution and mortality, which is crucial for developing effective public health policies and air quality regulations.

Table 2: Model Summary
Parameter Mean SE Mean SD 2.5% 25% 50% 75% 97.5% n_eff Rhat
γ\gamma 11.75 0.07 0.63 10.64 11.33 11.72 12.14 12.93 73 1.11
δtemperature\delta_{temperature} 0.04 0.01 0.36 -0.04 0.02 0.05 0.07 0.13 1211 1.00
δ1\delta_{1} 0.14 0.00 0.11 0.07 0.12 0.14 0.17 0.22 2062 1.00
δ2\delta_{2} 0.15 0.00 0.17 0.08 0.12 0.15 0.18 0.23 1551 1.00
ϕ\phi 0.85 0.00 0.08 0.68 0.81 0.86 0.89 0.96 315 1.03
ση\sigma_{\eta} 0.01 0.00 0.02 0.00 0.00 0.01 0.01 0.01 1078 1.00
σϵ\sigma_{\epsilon} 40.73 0.01 0.49 39.77 40.40 40.73 41.05 41.70 3462 1.00

In conclusion, this study underscores the importance of using advanced statistical methods to address measurement error and temporal dynamics in environmental health research. The current version of our Bayesian hierarchical model specifically focuses on chronic obstructive pulmonary disease (COPD). Future work will expand this model to include additional elements and explore the effects of air pollutants on pneumonia, neoplasm, and heart failure. The results from these analyses will be included in forthcoming versions of this research, providing a more comprehensive understanding of the health impacts of air pollution.

5.2 Discussion

The methodology outlined in this section addresses the critical issue of measurement error in air pollution studies, particularly when using Poisson log-linear generalized linear models (GLMs) to assess the impact of pollution on mortality. By incorporating a measurement error model on the log scale, we account for the inherent inaccuracies in observed pollution data, leading to more reliable estimates of the true association between pollution and health outcomes. The integration of an autoregressive moving average (ARMA) process with harmonic seasonal components provides a robust framework for modeling the temporal dependencies and seasonal variations in pollution levels, which are often observed in environmental time series data.

One key advantage of this approach is its ability to separate the true pollution signal from the noise introduced by measurement error, which is crucial in epidemiological studies where the effects of pollution are often subtle and can be confounded by inaccuracies in the data. The use of Bayesian inference, particularly the Hamiltonian Monte Carlo (HMC) method with the No-U-Turn Sampler (NUTS), allows for the incorporation of prior knowledge and the quantification of uncertainty in the model parameters, including the pollution effect on mortality. This is particularly important in public health studies, where policy decisions are often based on estimates that must account for both measurement error and uncertainty.

The model’s flexibility is further enhanced by the inclusion of covariates, which allows for the adjustment of potential confounders and the exploration of interactions between pollution levels and other risk factors. The Bayesian framework also facilitates the extension of the model to include other sources of uncertainty, such as spatial variability in pollution levels or the use of alternative distributions for the death counts, which may be overdispersed relative to the Poisson distribution.

5.3 Extensions

Several extensions to the model are possible, offering opportunities for further research and refinement. One potential extension is the incorporation of spatial components to model the variation in pollution levels and mortality across different geographic regions. This would allow for a more detailed analysis of the spatial heterogeneity in pollution effects and could be achieved by extending the current ARMA model to a spatiotemporal framework, potentially using Gaussian processes or spatial autoregressive models.

Furthermore, the Poisson log-linear model could be replaced with a more flexible model that allows for overdispersion, such as a negative binomial model or a quasi-Poisson model. This would be particularly useful in cases where the observed variance in death counts exceeds the mean, a common scenario in epidemiological data. Such an extension would provide more accurate estimates of the effect of pollution on mortality, particularly in the presence of overdispersed count data.

Lastly, incorporating non-linear effects of pollution on mortality could be another valuable extension. This could be achieved by replacing the linear term γlog(Xt)\gamma\log(X_{t}) with a non-linear function (e.g., splines or polynomial terms) that allows for more complex relationships between pollution levels and health outcomes. This extension could capture potential threshold effects or saturation points, where the health impacts of pollution may change at different levels

6 References

1.Dominici F, Zanobetti A, Schwartz J, Braun D, Sabath B, Wu X. Assessing Adverse Health Effects of Long-Term Exposure to Low Levels of Ambient Air Pollution: Implementation of Causal Inference Methods. Res Rep Health Eff Inst. 2022 Jan;2022(211):1-56. PMID: 36193708; PMCID: PMC9530797.

2.Peng RD, Bell ML, Geyh AS, McDermott A, Zeger SL, Samet JM, Dominici F. Emergency admissions for cardiovascular and respiratory diseases and the chemical composition of fine particle air pollution. Environ Health Perspect. 2009 Jun;117(6):957-63. doi: 10.1289/ehp.0800185. Epub 2009 Feb 11. PMID: 19590690; PMCID: PMC2702413.

3.Woodward SM, Mork D, Wu X, Hou Z, Braun D, Dominici F (2023) Combining aggregate and individual-level data to estimate individual-level associations between air pollution and COVID-19 mortality in the United States. PLOS Glob Public Health 3(8): e0002178. https://doi.org/10.1371/journal.pgph.0002178

4.World Health Organization. World Health Statistics 2023: Monitoring Health for the SDGs, Sustainable Development Goals, 2023. https://www.who.int/publications/i/item/9789240074323

5.Southerland VA, Brauer M, Mohegh A, Hammer MS, van Donkelaar A, Martin RV, Apte JS, Anenberg SC. Global urban temporal trends in fine particulate matter (PM2·5) and attributable health burdens: estimates from global datasets. Lancet Planet Health. 2022 Feb;6(2):e139-e146. doi: 10.1016/S2542-5196(21)00350-8. Epub 2022 Jan 5. PMID: 34998505; PMCID: PMC8828497.

6.Choi J, Fuentes M, Reich BJ. Spatial-temporal association between fine particulate matter and daily mortality. Comput Stat Data Anal. 2009 Jun 15;53(8):2989-3000.
doi: 10.1016/j.csda.2008.05.018. PMID: 19652691; PMCID: PMC2685284.

7.Jerrett M, Burnett RT, Ma R, Pope CA 3rd, Krewski D, Newbold KB, Thurston G, Shi Y, Finkelstein N, Calle EE, Thun MJ. Spatial analysis of air pollution and mortality in Los Angeles. Epidemiology. 2005 Nov;16(6):727-36. doi: 10.1097/01.ede.0000181630.15826.7d. PMID: 16222161.

8.Makar M, Antonelli J, Di Q, Cutler D, Schwartz J, Dominici F. Estimating the Causal Effect of Low Levels of Fine Particulate Matter on Hospitalization. Epidemiology. 2017 Sep;28(5):627-634. doi: 10.1097/EDE.0000000000000690. PMID: 28768298; PMCID: PMC5624531.

9.Kelly, F J, and J C Fussell. “Air pollution and airway disease.” Clinical and experimental allergy : journal of the British Society for Allergy and Clinical Immunology vol. 41,8 (2011): 1059-71. doi:10.1111/j.1365-2222.2011.03776.x

10.Cao, Q., Rui, G. &\& Liang, Y. Study on PM2.5 pollution and the mortality due to lung cancer in China based on geographic weighted regression model. BMC Public Health 18, 925 (2018). https://doi.org/10.1186/s12889-018-5844-4

11. Jerrett, Michael et al. “Long-term ozone exposure and mortality.” The New England journal of medicine vol. 360,11 (2009): 1085-95. doi:10.1056/NEJMoa0803894

12. Blackwell, M., Honaker, J., &\& King, G. (2015). A Unified Approach to Measurement Error and Missing Data: Overview and Applications. Sociological Methods &\& Research.
https://doi.org/10.1177/0049124115585360

13. Kloog, Itai et al. “Acute and chronic effects of particles on hospital admissions in New-England.” PloS one vol. 7,4 (2012): e34664. doi:10.1371/journal.pone.0034664

14. Tian Y, Liu H, Wu Y, Si Y, Li M, Wu Y, Wang X, Wang M, Chen L, Wei C, Wu T, Gao P, Hu Y. Ambient particulate matter pollution and adult hospital admissions for pneumonia in urban China: A national time series analysis for 2014 through 2017. PLoS Med. 2019 Dec 31;16(12):e1003010. doi: 10.1371/journal.pmed.1003010. PMID: 31891579; PMCID: PMC6938337.

15. Meng, Ying-Ying et al. “Traffic and outdoor air pollution levels near residences and poorly controlled asthma in adults.” Annals of allergy, asthma &\& immunology : official publication of the American College of Allergy, Asthma, &\& Immunology vol. 98,5 (2007): 455-63. doi:10.1016/S1081-1206(10)60760-0

16. Peel, Jennifer L et al. “Ambient air pollution and respiratory emergency department visits.” Epidemiology (Cambridge, Mass.) vol. 16,2 (2005): 164-74.
doi:10.1097/01.ede.0000152905.42113.db

17. Peng, Roger D et al. “Seasonal analyses of air pollution and mortality in 100 US cities.” American journal of epidemiology vol. 161,6 (2005): 585-94. doi:10.1093/aje/kwi075

18. Mokkink LB, Eekhout I, Boers M, van der Vleuten CPM, de Vet HCW. Studies on Reliability and Measurement Error of Measurements in Medicine - From Design to Statistics Explained for Medical Researchers. Patient Relat Outcome Meas. 2023 Jul 7;14:193-212. doi: 10.2147/PROM.S398886. PMID: 37448975; PMCID: PMC10336232.

19. Samet, J M et al. “The National Morbidity, Mortality, and Air Pollution Study. Part II: Morbidity and mortality from air pollution in the United States.” Research report (Health Effects Institute) vol. 94,Pt 2 (2000): 5-70; discussion 71-9.

20. Schwartz, J et al. “Particulate air pollution and hospital emergency room visits for asthma in Seattle.” The American review of respiratory disease vol. 147,4 (1993): 826-31.
doi:10.1164/ajrccm/147.4.826

21. Shah, Anoop S V et al. “Global association of air pollution and heart failure: a systematic review and meta-analysis.” Lancet (London, England) vol. 382,9897 (2013): 1039-48. doi:10.1016/S0140-6736(13)60898-3

22. Turner, Michelle C et al. “Long-term ambient fine particulate matter air pollution and lung cancer in a large cohort of never-smokers.” American journal of respiratory and critical care medicine vol. 184,12 (2011): 1374-81. doi:10.1164/rccm.201106-1011OC

23. Zanobetti A, Schwartz J, Dockery DW. Airborne particles are a risk factor for hospital admissions for heart and lung disease. Environ Health Perspect. 2000 Nov;108(11):1071-7. doi: 10.1289/ehp.001081071. PMID: 11102299; PMCID: PMC1240165.

24. Roger D. Peng, Francesca Dominici, Thomas A. Louis, Model Choice in Time Series Studies of Air Pollution and Mortality, Journal of the Royal Statistical Society Series A: Statistics in Society, Volume 169, Issue 2, March 2006, Pages 179–203, https://doi.org/10.1111/j.1467-985X.2006.00410.x