A simulation study of confounding in generalized linear models for air pollution epidemiology.

Confounding between the model covariates and causal variables (which may or may not be included as model covariates) is a well-known problem in regression models used in air pollution epidemiology. This problem is usually acknowledged but hardly ever investigated, especially in the context of generalized linear models. Using synthetic data sets, the present study shows how model overfit, underfit, and misfit in the presence of correlated causal variables in a Poisson regression model affect the estimated coefficients of the covariates and their confidence levels. The study also shows how this effect changes with the ranges of the covariates and the sample size. There is qualitative agreement between these study results and the corresponding expressions in the large-sample limit for the ordinary linear models. Confounding of covariates in an overfitted model (with covariates encompassing more than just the causal variables) does not bias the estimated coefficients but reduces their significance. The effect of model underfit (with some causal variables excluded as covariates) or misfit (with covariates encompassing only noncausal variables), on the other hand, leads to not only erroneous estimated coefficients, but a misguided confidence, represented by large t-values, that the estimated coefficients are significant. The results of this study indicate that models which use only one or two air quality variables, such as particulate matter [less than and equal to] 10 microm and sulfur dioxide, are probably unreliable, and that models containing several correlated and toxic or potentially toxic air quality variables should also be investigated in order to minimize the situation of model underfit or misfit. ImagesFigure 1Figure 2Figure 3Figure 4Figure 5Figure 6Figure 7Figure 8

contet of neralized linear models. Using sythetic dat sets, the present sudy shows how model overfit, udri and misfit in the presence of correted causalvariables in a Poison regression model affet the estmated coefficients of the covarates and their confidence levs.LThe study also shows how tis effect changs with thae rnes tifthe cvriates anxd the sample size. There is qpualitative agrent between the su results and the g expressiom m the large-sample limit o the ordiny in er models. Confonding of oarate n a overfitted mode (with cvari ates encompasing more tbha just the causal variables) does not bias the estimated coefficients but reduces their s iin Te efft of model underfit (with some caual v les l as covariates) or misfit (with coriates encompassing only non l ies), on the oter hand, leads to not only erroneo estimated coeffcients but a misguided confidence, represented by large t-values, thhat the estimt coefient are sgnificnt results of t study indicate that modes which iuse only one or two air quality variables, such as partlate matter .10 pm and sulfur dioxider probayunreliable, andlthat modes containing several correlated and toxic or poten tially toxic air quality viables should a e investigate in order to minimize the situation of model underfit or isi Imorkb air pollution, confunding effecs, lic timeseries studies, epidemiological studies, generalized linear models, model misfit, Poisson regression, simulation.
Environ Hedt/IPenpect107:2. 17-222 (1999). [Online 8PFebruury 1999] hptleh nl.s. h *ni/o/ 7p217-2 benab~t.btml The EPA (1) recently promulgated the National Ambient Air Quality Standards for the mass concentrations of particulate matter <10 pm (PM10) and <2.5 pm (PM2.5). The key rationale for these standards came from the epidemiological studies in the past few years associating particulate air pollution [represented in these studies primarily by the ambient concentrations of either the total suspended particulate (TSP) or PM10] with daily mortality and morbidity. Both the mortality and morbidity studies are almost exclusively ecological time-series studies regressing the daily events of mortality or morbidity against the ambient air quality for given urban areas (2). However, for many urban areas where the same or similar data sets were reanalyzed, different or contradictory conclusions often resulted (3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14). This fact highlights the difficulty of establishing a causal relation between ambient PM concentrations at their present levels and a given health endpoint through regression models alone.
In most existing Poisson regression models for countable data such as mortality and morbidity, in addition to the inclusion of some weather parameters, ambient concentrations of some air pollutants, often only one (PM1O or TSP) or two (PM1O or TSP, and SO2) so far, were included as covariates.
Models were then selected based on some goodness-of-fit criteria. However, different choices of covariates and different formulations of the model led to different selected models with different accompanying conclusions, and it is often impossible to ascertain which model is more correct or reliable. One major issue among many is the issue of confounding or collinearity, which we take to mean the presence of significant correlation between a covariate in a regression model and another covariate that may or may not be causal, or a causal variable that may or may not be a covariate in the model. This issue is always acknowledged but almost never investigated. In fact, after acknowledgment of a potential confounding problem, most researchers went on to draw conclusions based on the significance of the estimated coefficients associated with given covariates, often oblivious of or ignoring the fact that confounding can invalidate the conclusions altogether. It is also not often appreciated that uncovering similar regression results for many areas does not necessarily reduce or remove the problem of confounding because the same confounding problem may occur in many areas. However, in ascertaining if PM among many pollutants is indeed a causal agent of daily mortality or morbidity, the problem of confounding cannot be ignored, especially when correlation between ambient concentrations of different air pollutants can be large. Table 1 shows an example of typical correlation coefficients among the air quality variables and meteorological variables that have been considered as covariates in regression studies. The data are from Pittsburgh (Allegheny County), Pennsylvania, for the summer and winter seasons from 1989 to 1991. The pollutants are all daily maximum 1-hr concentrations, with the exception of PM1O, which is a daily-averaged concentration. The meteorological variables are for the hour of the day when the dry-bulb temperature reaches the maximum. Table 1 shows that relatively high correlation can exist between different air quality and meteorological variables, and that the correlation is seasonally dependent. For example, the correlation between 03 and PM1O is high in the summer but very low in the winter. On the other hand, the correlation between CO and PM0O is moderately high in the summer and quite high in the winter. Also, both 03 and PM1O are well correlated with dry-bulb temperature in the summer but less well correlated in the winter. Depending on the choice of covariates in a model, different types of model biases can occur. They include underfitting (with some causal variables being excluded as covariates), overfitting (with covariates including more than just the causal variables) and misfitting (with all covariates being noncausal variables). The effect of such a model bias on the estimated coefficients of the chosen covariates in an ordinary least squares (OLS) linear model was briefly described by Seber (15). Chen and Zhang (16) have provided the means and variances of the estimated coefficients in the large-sample limit (number of observations being much greater than one) for a biased OLS model. However, no closed form is available to describe the impact of model bias due to underfitting, overfitting, and misfitting of covariates on the estimated coefficients of a generalized linear model (GLM) (17). This knowledge has great theoretical and practical importance in view of the role of Poisson regression in the PM epidemiological studies. It is difficult to study the confounding effects because no one knows what a correct model should be. However, synthetic data sets based on a known causal model can serve as a correct model for the study. The purpose of this paper is to investigate how model bias impacts the estimated coefficients and, more important, how covariate confounding, range of fluctuation of the covariates, and sample size affect this impact in a Poisson regression, assuming that we know a priori the exact causal relationship. Our approach is to construct synthetic data sets of a Poisson variate whose mean is determined by a known, exact linear model containing no more than two covariates that have a range of correlation coefficients between them. We then estimate the coefficients of the covariates for a Poisson regression model, biased or otherwise, applied to the synthetic data sets. In addition, the range of fluctuation of one of the covariates and the sample size will also be varied to see how they influence the t-values of the estimated coefficients. Serial correlation of the dependent and independent variables is not considered, as its inclusion would complicate the synthetic data set generation and would not add any new insight or alter the conclusion significantly. Seasonal cycles, long-term trends, and measurement error are not explicitly considered, as they are not a necessary component of confounding. However, they are relevant because they influence the correlation between the different air quality and meteorological variables, as well as the ranges of the variables. Therefore, they influence the extent of confounding. We also present the closed form results for the OLS models and describe the protocol for the construction of the synthetic data sets and the results of different Poisson regressions applied to the data sets.

Models
Before the simulation study of GLM, we present a brief review of the results extracted from Chen and Zhang (16). P=[X TX X Ty, (2) where Y = (y1 ., y)T. X = (xi,),xp, n is the number of observations or the sample size.
Unbiased. If the data are actually generated from a model identical to Equation 1, we call the equation unbiased for the data. With these assumptions, we have in the large-sample limit (n> >1) the expectation and variance of the estimated coefficients, where E(x) = g and var(x) = 11 .
For the case of two covariates without an intercept, we have  (11) where E(x1) = p,, E(x2) = 12X var(xl) = var(x2) = 112, and cov(xl, x2) = P11'121 with p being the correlation coefficient between x1 and x2. Biasedt If the data are generated from the model y = P3x1 +e j=1 (12) with t. p, we call the model (Eq. 1) biased for the data. Consider the following two biased cases: For t >p, or in the case of a model underfit, we have in the large-sample limit, x(vz -VzxVx VxZ yT )]VX 1 (14) where Vx whr X= E(XXT), VZ = E(ZZT), VZX = E(ZXT), and V -z= E(XZT) with Z= (xp+ I, -,-xt) and y=(Pp~1 '... ) For t = 2 and p = 1, X and Z are both one-dimensional random variables. We have (p) 2 2 'lX + tx (15) (5) var(~) 41 + (6) [8~~~2 _F -(P4X4 + g.z) +g2,R)-1] 2 r2 + g2 (16) where E(x) = g, E(z) = p_, var(x) = For the variances, we have the same results as in Equations 10 and 1 1: Misfit. If the data are generated by X, alone but we fit the model using X2, then we have in the large-sample limit var(x2) = rj22, and cov(x, x2) = PTl1Tl2 The above results show a complicated relation between the estimated coefficients, together with their variances, of a biased linear model and the true coefficients and covariance matrices of the explanatory variables. It would be rather hopeless to find a set of analytical expressions for the comparable situations in a GLM.

Protocol for the Construction of Synthetic Data Sets
To keep the scope of work manageable in the simulations, we considered no more than two covariates. For example, the two covariates, xl and x2, could correspond to PMIO and CO, respectively. The dependent variable, y, could be considered as the daily mortality. In generating the synthetic data, we assumed an exact, causal, log-linear relationship between y and xi, with the coefficients for xi being J3, = 0.0005 and P2 = 0.005. These are hypothetical values used for illustrative purposes only. If we assumed the units of micrograms per cubic meter and parts per million by volume (ppmV), respectively, for xl and x2, then PI = 0.0005 corresponds to a relative risk of 1.05 or a mortality increase of 5% per 100 pg/m3 increase in PM1O, whereas fl2= 0.005 corresponds to a relative risk of 1.005 or a mortality increase of 0.5% per ppmV increase in CO. We further assumed the intercept, a, of the exact log-linear model to be 3.132. This value corresponds to an average daily mortality of about 23.
Both xl and x2 were assumed to follow a bivariate lognormal distribution with means and standard deviations (SDs) extracted from the logarithmically transformed PM1O and CO data for Pittsburgh during 1989-1991. In the logarithmic space, the corresponding means were 3.5 and 0.87 and the corresponding SDs were 0.619 and 0.475 for x1 and x2, respectively. In the concentration space, the above information essentially recovers the observed means of 40.22 pg/m3 and 2.68 ppmV, and the observed SDs of 26.25 pg/m3 and 1.41 ppmV, for PM1O and CO, respectively. In the generation of the synthetic data sets, the SD of xl in the logarithmic space, denoted was held fixed at 0.6 while that of x2, denoted rn2, was allowed to vary from 0.2 to 1.0. With both n, and 2 being <1, they are roughly proportional to the corresponding SDs in the concentration space. Note that because of the logarithmic transformation, the 1js here are not identical to the ils described for OLMs. As a measure of confounding between the two variates, the correlation coefficient, p, between the variates in the logarithmic space was also allowed to vary. Again, because of the small (but realistic) values chosen for both il and Ti2, the correlation coefficients between the two variates in the concentration space are typically no more than 10% less than p for p = 0.5 and 0.9 and are essentially 0 for p = 0.
Because the correlation coefficients between any two explanatory variables in Table 1 are generally positive, only positive ps were considered in our simulations. For each realization, the values of x1 and x2 in the logarithmic space were generated using an S-Plus random number generator (MathSoft, Seattle, WA) for a bivariate normal distribution on an IBM RS6000 mainframe. The antilogarithms of these values were used to determine the value, m, of an exact model, log(m) = a + Plxl + P2x2. In fact, m serves as the mean of the daily mortality. With this mean, the Poisson variate, y, was generated using the S-Plus random number generator for the Poisson distribution. A collection ofy values with a sample size, n, constitutes the synthetic data set to be used for Poisson regression: log[E(y)] = a + Plxl + 2x2 (26) The sample size was also allowed to vary from 365 to 7 x 365, corresponding to a period of 1-7 years. To assure that the results of the Poisson regressions were stable, the procedure for each synthetic data set generation and the subsequent regression was performed for a total of 100 times. The means of the 100 repetitions are reported in "Results." No significant differences were found between the means with 100 repetitions and those with 1,000 repetitions.
In the Poisson regression study, the unbiased regression model contained both x1 and x2, as in the exact model. Several biased regression models were considered. For the case of model underfit, the synthetic data sets were constructed using the exact model containing both x1 and x2; the regression model assumed only x1 as the covariate. For model overfit, the synthetic data sets were constructed using only xl, whereas the regression model assumed both xI and x2 to be the covariates. For model misfit, two cases were considered. First, the synthetic data sets were based on only xl; the regression model contained only x2 as the covariate. Second, only x2 was used in the synthetic data sets; only x1 was the covariate in the regression. The latter is not equivalent to the former because we always allowed only the range of x2 to vary.

Results
The impact of 1) confounding or correlation, p, between xl and x2; 2) the data range or SD, Ti2, of x2; and 3) the sample size, n, on the outcome of the Poisson regression will be presented in the same order as Biased. In the underfit case, the synthetic data sets contain the effect of both xl and Nrof observations), j2 (standard deviation of x2, whereas the regression model contains rrelation coefficient between log xl and log onl x as the covariate. In the ordina lin d Poisson regression using the same covari-1 ear regression, Equation 15 indicates that E(Pi) increases with p and 112, or more precisely, E(P1) is asymptotically PI plus a term 0 . ; that is linearly related to pq2 This is quali- also increases with 112, and this increase is P112 enhanced by an increasing p (Fig. 2). As the sample size, n, increases, t(,81) increases as n for an underfitting Poisson regression con-well (Figs. 2, 3)i. If x were used in the .xact model containing xl and x2. modl, tn2 2n regression model, then based on Equations 15 and 16 and the parameters of the exact rent from the coefficients of the exact model, one would expect E(2) to be essenel. This is consistent with Equation 9 tially proportional to P/n2 and t(J2) to he OLMs. However, both t(P1) and again increase with p. decrease with increasing p. This is The above result has a profound impli- when the number of covariates used is small (e.g., one or two). If a causal variable such as CO is missing in the regression model and the variable is highly correlated with a covariate (e.g., PM) in the regression model, then the regression model will indicate a strong but erroneous association of the dependent variable or effect (the daily mortality, for example) with the covariate. In fact, the estimated coefficient of the covariate will be compromised by the size of the actual coefficient of the missing variable, the range of the missing variable, as well as the magnitude of the correlation coefficient between the covariate and the missing variable. The t-value of the estimated coefficient also increases with the correlation coefficient and the range of the missing variable. In addition, increasing the sample size (to several years of data, for example) also increases the t-value, actually making the erroneous association appear more convincing.
In the overfit case, the synthetic data sets were constructed using only x1; the regression model contains both x1 and x2 as the covariates. In agreement with Equation 19 for the ordinary linear regression, the estimated coefficients of both covariates are not significantly different from their exact values, being zero for E(k2). They are not affected by the correlation coefficient between the two covariates. The t(P1), on the other hand, decreases with increasing p, and does not depend strongly on 2. Figure 4 shows t(P1) as a function of p and n, with TI2 held constant at 0.2. As expected, t(k) is essentially zero. If the exact model contains only x2, one expects t(P3I) to be zero and t(k) to be increasing with TI2 and decreasing with p.
The above result indicates that overfitting should not lead to a serious bias in the estimated coefficients of the covariates, but the correlation between the causal and redundant covariates will reduce the significance of the estimated covariates.
Misfit. In the first misfit case, xi was used in the exact model and x2 was the Figure 5. Behavior of E,,2 as a function of n and P/112 for a misfitting Poisson regression containing x2 as the covariate to describe data created from an exact model containing xl. covariate in the regression model. Even though x2 plays no role in the dependent variable of the synthetic data sets in the regression model x2 influences E(732) and t (A2) through p. For the ordinary linear regression, Equation 24 shows that E(P2) increases with increasing p and decreases with increasing TI2. Figure 5 shows E(P2) as a function of P/TI2. The significance of the estimate, 42(), increases with p and n, but, interestingly, not withnT2 (Fig. 6).
In the second misfit case, x2 was used in the exact model and xl was the covariate in the regression model. In this case, the variation in T12 directly impacts the dependent variable in the synthetic data sets. Figure 7 shows E(,A) and t(,A) as an increasing function of PT12. Again, no causal meaning can be attached to the magnitude of the esti-mated~oefficient. Even so, Figure 8 shows that t (p3) can be large and increasing with p and T12, in contrast with t(P2) above, which has little or no dependence on 112' Model misfit is agan a likely occurrence. The result of the misfit is a set of totally meaningless estimated coefficients, yet with increasing significance as the sample size and the ranges of the missing causal variables increase and as the correlation between the covariates and the true causal variables increases. The potential for misleading inference in model  Figure 8. Behavior of 9N as a function of n, n2 and p for a misfitting Poisson regression containing xl as the covariate to describe data created from an exact model containing x2. estimated coefficients of the covariates and the significance of the estimated coefficients in a generalized linear model. The study also shows how this impact changes with the ranges of the covariates and the sample size. There is qualitative agreement between the study results and the corresponding expressions for the largesample limit in the ordinary linear models. The study results are highly relevant to the present active investigations of an association between ambient air pollutant concentrations (especially PM) and daily mortality and morbidity. Modeling bias is a likely occurrence when regression models are used in an effort to identify the causes of a health outcome in an uncontrolled environment. This occurrence can lead to seriously erroneous conclusions when confounding exists between the relevant covariates or between some covariates and causal variables that may or may not be present in the model. The main effect of confounding for model overfit is a reduction in the significance of the estimated coefficients. The effect of model underfit or misfit (a more common occurrence), on the other hand, leads not only to erroneous estimated coefficients, but a misguided confidence, represented by large t-values, that the estimated coefficients are significant. The results of this study indicate that models that use only one or two air quality variables such as PMIO and SO2 are likely unreliable, and that models containing several correlated and toxic or potentially toxic air quality variables should also be investigated in order to minimize the situation of model underfit or misfit. It is also possible that models containing more pollutants as covariates may have estimated coefficients that are unphysical or counter-intuitive. Such a situation would call for controlled experiments to establish a causal relation between a pollutant or multiple pollutants and a health end point.