### Bootstrap-after-Bootstrap Model Averaging for Reducing Model Uncertainty in Model Selection for Air Pollution Mortality Studies

[do action=”authors”]Steven Roberts, Michael A. Martin[/do][do action=”affiliations”]School of Finance and Applied Statistics, College of Business and Economics, Australian National University, Canberra, Australian Capital Territory, Australia[/do][do action=”citation-string”]*Environ Health Perspect* 118:131-136 (2010). http://dx.doi.org/10.1289/ehp.0901007 [online 17 September 2009] [/do]

### Abstract

[do action=”abstract”]

**Background:** Concerns have been raised about findings of associations between particulate matter (PM) air pollution and mortality that have been based on a single “best” model arising from a model selection procedure, because such a strategy may ignore model uncertainty inherently involved in searching through a set of candidate models to find the best model. Model averaging has been proposed as a method of allowing for model uncertainty in this context.

**Objectives:** To propose an extension (double BOOT) to a previously described bootstrap model-averaging procedure (BOOT) for use in time series studies of the association between PM and mortality. We compared double BOOT and BOOT with Bayesian model averaging (BMA) and a standard method of model selection [standard Akaike’s information criterion (AIC)].

**Method:** Actual time series data from the United States are used to conduct a simulation study to compare and contrast the performance of double BOOT, BOOT, BMA, and standard AIC.

**Results:** Double BOOT produced estimates of the effect of PM on mortality that have had smaller root mean squared error than did those produced by BOOT, BMA, and standard AIC. This performance boost resulted from estimates produced by double BOOT having smaller variance than those produced by BOOTand BMA.

**Conclusions:** Double BOOT is a viable alternative to BOOT and BMA for producing estimates of the mortality effect of PM.

[/do][do action=”abstract”]**Key words:** air pollution, Bayesian, bootstrap, model averaging, mortality, particulate matter[/do]

[do action=”notes-rule-above”]Address correspondence to S. Roberts, School of Finance and Applied Statistics, Australian National University, Canberra, ACT 0200, Australia. Telephone: 61-2-6125-3470. Fax: 61-2-6125-0087. E-mail: steven.roberts@anu.edu.au[/do][do action=”notes”]This work was supported by the Australian Research Council (DP0878988).[/do][do action=”notes”]The authors declare they have no competing -financial interests.[/do][do action=”notes”] Received 20 May 2009; accepted 17 September 2009; online 17 September 2009.[/do]

Over the past decade, time series studies that have investigated the association between daily variations in particulate matter (PM) air pollution and daily variations in mortality have become commonplace (Breitner et al. 2009; Kelsall et al. 1997; Roberts 2004). Studies conducted in Europe and North America have found statistically significant associations between increases in daily PM concentrations and increases in daily mortality (Samoli et al. 2008). One common feature of these time series studies is that myriad modeling choices must be made to arrive at an “optimal” model from which an estimate of the association between PM and mortality can be obtained. This array of choices means there are potentially many candidate models for investigating the association between daily PM and mortality. In some studies, models that are selected because they optimize a particular model selection criterion are used to infer a relationship between PM and mortality (Draper 1995; Goldberg et al. 2006; Kelsall et al. 1997). In this context, concerns have been raised in the literature about statistical issues that may arise from the process of selecting a single model from among a potentially large number of competing candidates (Clyde 2000; Koop and Tole 2004; National Research Council 1998). The procedure of selecting a single “best” model may ignore the model uncertainty, which is inherently involved in searching through the set of candidate models to determine the best one. Ignoring model uncertainty is problematic because it reflects statistical variation not captured within the single chosen model, and failure to account for this variation can increase the chance of erroneously concluding a statistically significant association between PM and mortality (Clyde 2000; National Research Council 1998).

Model averaging in both Bayesian and frequentist forms has been proposed as a means of allowing for model uncertainty in time series studies of PM and mortality (Clyde 2000; Koop and Tole 2004, 2006; Martin and Roberts 2006). Model-averaging procedures assign probabilities or weights to each candidate model that reflect the degree to which the model is supported by the data. These probabilities can be used to produce “weighted” average estimates of the association between PM and mortality that explicitly incorporate information from each candidate model. This process of explicitly incorporating each candidate model into the estimation process produces estimates that incorporate the variation inherent in the model selection process. Clyde (2000) and Koop and Tole (2004, 2006) implemented Bayesian model-averaging (BMA) techniques to estimate the association between air pollution and mortality. Martin and Roberts (2006) implemented model averaging using a bootstrap-based procedure and showed that it is competitive with BMA in that context. Previous investigations have also used the bootstrap in the context of time series studies of air pollution, including investigations of the effect of concurvity in generalized additive models (Figueiras et al. 2005; Ramsay et al. 2003).

In this paper, we discuss a double bootstrap model-averaging (double BOOT) approach that extends and improves the bootstrap model-averaging (BOOT) procedure that was implemented in Martin and Roberts (2006).

### Materials and Methods

** Materials.** The data used in this report were obtained from the publicly available National Morbidity, Mortality, and Air Pollution Study database (Zeger et al. 2006). The data consist of daily time series of mortality, temperature, dew point temperature, and PM air pollution measures for five United States (U.S.) cities for the period 19992000. The mortality data are daily counts of non-accidental deaths of individuals ≥ 65 years of age. The measure of ambient PM used is the ambient 24-hr concentration of PM of < 2.5 µm in aerodynamic diameter (PM

_{2.5}) measured in micrograms per cubic meter.

The five U.S. cities included in this study—Birmingham, Alabama; Orlando, Florida; Seattle, Washington; St. Louis, Missouri; and Tampa, Florida—were selected because they had nearly complete PM_{2.5} data over the period of investigation. For these cities, the number of days missing PM_{2.5} concentrations over the 730-day period of investigation ranged from 2 to 18 days. Missing PM_{2.5} concentrations were imputed using the average of the previous and subsequent days’ concentrations.

** Methods.** We investigated model averaging in the context of additive Poisson log-linear models. Under these models, the daily mortality counts are modeled as independent Poisson random variables with mean µ

*on day*

_{t}*t*

log(µ* _{t}*) = confounders(α)

*+ θPM*

_{t}_{2.5}

*, [1]*

_{t,j}where confounders(α)* _{t}* represent other time-varying variables related to daily mortality, and PM

_{2.5}

_{t}_{,}

*is the PM*

_{j}_{2.5}concentration on day

*t*

*j*, for a specific time lag

*j*; α is a tuning parameter—as α increases, so too does the flexibility of the smooth functions used to adjust for the effects of the confounders. Adjusting for confounders is important to avoid spurious findings of an association between PM

_{2.5}and mortality (Bell et al. 2004). Commonly used confounders include weather variables, such as temperature and dew point temperature, and time (Dominici et al. 2003). Our focus in Model [1] on a PM

_{2.5}exposure measure, which corresponds to a specific lag of PM

_{2.5,}is consistent with recent time series studies (Dominici et al. 2006, 2007; Peng et al. 2009). Models of the same general form as Model [1] are commonly used in time series studies of the adverse health effects of PM (Dominici et al. 2007; Peng et al. 2006; Roberts 2005).

Using Model [1] involves selecting a value of α and a lag of PM_{2.5}. For example, if *p* values of α and *q* lags of PM_{2.5} were thought plausible, then *K* = *p* × *q* candidate models could be fitted and assessed with respect to some model selection criterion. If the *K* candidate models are fitted and a single “best” model chosen, the common practice of reporting the statistical characteristics of the winning model effectively ignores the statistical variation suffered as a result of the model selection procedure itself.

In the paragraphs that follow, we describe Akaike’s information criterion (AIC; Akaike 1973) and outline the bootstrap (BOOT) method used by Martin and Roberts (2006) and our extension that refines this method.

AIC is commonly used for model selection in time series studies of the association between PM and mortality (Goldberg et al. 2006; Samoli et al. 2008). It takes a measure of the lack of fit of a model and adds a penalty for the number of parameters in the model. Specifically, AIC is defined as

AIC = 2(maximum log-likelihood) + 2(number of parameters). [2]

To use AIC for model selection, the model with the smallest value of AIC among the candidate models is selected. Further details on AIC, including a discussion of its derivation, can be found in numerous articles (e.g., Burnham and Anderson 2004). In the context of the models considered in this paper, the number of parameters is an increasing function of α.

The BOOT method used by Martin and Roberts (2006) proceeds through the following steps:

- Fit the
*K*candidate models defined by Model [1]. Select as “best” the model with the smallest value of AIC, which is denoted*M**. We also define*M*to represent candidate model_{i}*i*fitted to the observed mortality time series data, for*i*= 1, 2,…,*K*. - Extract the mean adjusted, standardized Pearson residuals (Davison and Hinkley 1997a) and the estimated mean mortality counts from the best model
*M**, which was obtained in step 1. In our context, the mean adjusted, standardized Pearson residuals ξare defined as_{t}- where
*T*is the length of the mortality time series,*y*is the observed mortality count on day_{t}*t*,*û*is the estimated mean mortality count on day_{t}*t*, and*h*is the leverage for the observation on day_{t}*t*.

- where
- Use the stationary bootstrap to generate
*B*resamples of the residuals ξ_{1},…, ξobtained in step 2. The stationary bootstrap is implemented using the approach of Politis and Romano (1994). Under this approach, the stationary bootstrap resamples blocks of data of random length, where the length of each block has a geometric distribution._{T} - Create
*B*bootstrap replicate mortality time series by adding the estimated mean mortality counts from step 2 to each of the*B*resampled residual series generated in step 3. This process is carried out using the following formula:

*d** = µ_{t}^{ˆ}+_{t}_{√}^{—}µ^{ˆ}ξ_{t}*,_{t}*t*= 1, …,*T*, [4]- where ξ
_{1}*,…, ξ* is a resampled residual series, and_{T}*d*_{1}*,…,*d** is the resultant bootstrap replicate mortality time series. The_{T}*d*_{1}*,…,*d** are rounded to the nearest integer before proceeding to step 5._{T}

- where ξ
- Using each of the
*B*bootstrap replicate mortality time series, repeat step 1 with the observed mortality time series data replaced by the bootstrap replicate mortality time series, each time tabulating which of the*K*models is “best” based on AIC. - Assign a weight
*w*equal to the proportion of the_{i}*B*times that the model was selected as best in step 5, to each of the*K*candidate models. - Use the weights obtained in step 6 to compute a “bootstrap weighted” estimate for the effect of PM
_{2.5}on mortality:*w*_{1}*θ*+…+_{1}*w*, where_{K}θ_{K}*θ*is the estimated effect of PM_{i}_{2.5}on mortality obtained from*M*._{i}

In step 3, the stationary bootstrap is used to allow the resampled residuals to mimic the dependence structure of the original residual process under the notion that, although adjacent data points might suffer dependence, blocks of sufficient length may be close to independent of one another. Based on our earlier work, the stationary bootstrap is implemented using a mean block length of size 10 (Martin and Roberts 2006). Lahiri (2003) provides additional information on the use of resampling methods for dependent data. It is important to note that the replicate mortality time series generated in step 4 are not Poisson distributed, but this issue is not of particular concern because the observed mortality time series will also not be Poisson distributed. Indeed, some studies explicitly allow for the non-Poisson nature of the observed mortality time series via quasi-likelihood estimation (Goldberg et al. 2006; Samoli et al. 2008). In our context, the overdispersion estimated within the framework of a Poisson generalized linear model was mild. Thus, we did not consider a quasi-likelihood approach necessary. Further information on residual-based resampling for generalized linear models can be found in Davison and Hinkley (1997b).

Our extension to BOOT described above (termed “double BOOT”) uses a second bootstrap layer after step 6. The second bootstrap layer involves generating another *B* bootstrap replicate mortality time series that are based on the weights *w _{i}* found for each model in the first bootstrap layer. For each of the

*K*candidate models, this procedure involves generating

*Bw*replicate mortality time series using model

_{i}*M*as the basis for the bootstrap procedure described above, for each

_{i}*i*= 1, 2,…,

*K*. As before, based on this new set of

*B*replicate mortality time series, updated weights are constructed for each model based on the proportion of times it was selected as best based on AIC.

The procedure for implementing double BOOT is as follows:

- Perform steps 16 above of the BOOT method.
- For each of the
*i*= 1 to*K*candidate models, construct*Bw*replicate mortality time series using the procedure described in steps 24 of BOOT with_{i}*M** replaced by*M*. This process will produce_{i}*B*=*Bw*_{1}+…+*Bw*second-layer bootstrap replicate mortality time series._{K} - Fit the
*K*candidate models to each of the*B*replicate mortality time series, each time noting which of the*K*models is “best” based on AIC. - Assign a weight
*w** to each of the_{i}*K*candidate models. For each model, the weights are calculated as the proportion of the*B*times the model was selected as best in the preceding step. - Use the weights
*w** to compute a double-bootstrap weighted estimate for the effect of PM_{i}_{2.5}on mortality:*w*_{1}**θ*+…+_{1}*w**_{K}*θ*, where_{K}*θ*is the estimated effect of PM_{i}_{2.5}on mortality obtained from*M*._{i}

A rationale for this proposed extension to BOOT can be provided through a simple example. Consider a setting where there are only two candidate models, and model 1 is judged as best based on AIC. Now suppose the original BOOT procedure is implemented resulting in weights of *w*_{1} = 0.51 and *w*_{2} = 0.49 being assigned to models 1 and 2, respectively. The original BOOT procedure simply uses these weights to produce an average effect estimate. However, the weights of 0.51 and 0.49 can be interpreted as the data providing essentially equal support for the two candidate models. This outcome poses the question of whether it is desirable for the bootstrap replicate mortality time series to be constructed solely on the basis of model 1 when, in fact, according to the evidence given by the weights, the two models are almost equally supported by the data. Double BOOT offers a solution to this problem by performing a second layer of bootstrapping that uses a bootstrap data-generating process to weight each of the original candidate models according to their prevalence (measured through *w _{i}*) as “best” models among the original

*B*bootstrap replicate series. The logic used here could be extended to the case of many competing models where it seems reasonable to perform a second layer of bootstrapping based on how well each candidate model is supported by the data, rather than a single layer where the bootstrapping is based on a single model that essentially assumes full support from the available data. The difference in the double BOOT weights compared with the original BOOT weights would depend on a number of factors, including the number of candidate models that are “close” in terms of support offered by the data and the similarity of these models in terms of model structure. Irrespective of the change in the double BOOT weights, we believe the reweighting to be important—inherent to the success of the bootstrap is the premise that the data-generating process should mimic the true underlying process as closely as possible. In the case of something as complex as a model-selection process, the weights effectively measure a state of belief about the set of candidate models. Thus, our bootstrap resamples mimic that state of belief by generating data sets arising from a variety of candidate models in proportion to our confidence that such models are the correct ones.

The use of the bootstrap to tune another initial bootstrap algorithm has a long history. For example, Efron (1983) used a second level of bootstrap resampling to reduce the bias of the apparent error rate of a linear discriminant rule. Efron termed his method a “double bootstrap” because it involved a second layer of *B* resamples to bias correct an initial bootstrap bias-corrected estimate. Beran (1987) and Hall (1986) discussed the use of second-level resampling to correct for coverage error in confidence intervals. Hall and Martin (1988) proposed a general framework for bootstrap iteration for which the second-level resamples were used to estimate and correct for the error in the original bootstrap procedure. Loh (1987) also used a second layer of bootstrap resamples to correct confidence interval endpoints. However, the methods of Beran (1987), Hall (1986), and Loh (1987) differ in the way the bootstrap critical points are modified. In our approach, the first-layer bootstrap resamples are used to generate an initial set of weights for the set of candidate models. In one way, these weights can themselves be considered as outputs from the initial bootstrap procedure. But, of course, these weights are not “correct” because of the way the bootstrap resamples are constructed in the generalized linear model context. Because the resampling is based on model residuals, there is a tendency for the initial bootstrap step to favor (i.e., give higher weight to) the model from which the original residuals were obtained. Our second-layer bootstrap -resampling is directed at addressing this problem, by using the information gleaned from the initial bootstrap step as a starting point to constructing second-level resamples based on residuals not from a single model fit, but rather from a weighted set of plausible candidate models. Our method is a fully frequentist analog of the bootstrap-after-Bayesian model averaging approach proposed by Buckland et al. (1997). In their paper, the authors had observed that a single-layer bootstrap model averaging approach tended to favor the initial model on which resamples were based. They suggested that an initial Bayesian model averaging (BMA) step could be used to provide a weighted set of models from which resamples could be based in a second bootstrap model selection step. Our method takes a fully frequentist approach by adopting bootstrap methods at both steps.

The form of BMA that will be used in our paper is based on AIC as described in Clyde (2000). In the context of Model [1], BMA based on AIC proceeds by assigning each candidate model *i* a posterior probability given by the following formula:

where AIC* _{i}* is the AIC for candidate model

*i*and

*K*is the number of candidate models.The estimated mortality effect is obtained by weighting the PM effect estimates obtained from each model by its posterior probability.

In the context of our analyses, it is worth discussing the interpretation of the weighted average effect estimates obtained from BOOT, double BOOT, and BMA. These quantities, which are obtained by weighting estimates of the effect of an increase in PM_{2.5} on a single day’s mortality, may be viewed as weighted or model-averaged estimates of the effect of an increase in PM_{2.5} on a single day’s mortality. However, care should be taken when using model averaged estimates because the interpretation of particular parameters may change when other variables, such as copollutants, are added to the model (Lukacs et al. 2009; Thomas et al. 2007). Indeed, not all researchers would agree with the process of averaging estimates obtained using different lags of PM_{2.5}. Some advocate that model averaging is best suited for making predictions (Thomas et al. 2007). In this regard, we also investigate the predictive performance of the three model-averaging procedures considered in this article.

### Results

We used the statistical package R along with packages “boot,” “splines,” and “tseries” for all the analyses (R Development Core Team 2009). Computational constraints meant that producing estimates of the standard errors (SEs) for values presented in Tables 1 and 2 was not feasible, and the provision of SEs for simulated values is not common practice in studies of this kind.

** Simulation study.** We used the 730 days of data from Seattle, Washington, along with the specification of Model [1] to generate mortality time series where the effect of PM

_{2.5}on mortality was known. Generating mortality time series was achieved by producing mortality counts on day

*t*that were Poisson distributed with mean µ

_{t}We considered three different specifications of confounders(α = 1.2)_{t}

**Specification A**

S_{t,}_{1}(time, df = 8α) + S_{t,}_{2}(temp, df = 6α)
+ S_{t,}_{3}(dew, df = 3α),

**Specification B**

S_{t,}_{1}(time, df = 4α) + S_{t,}_{2}(temp, df = 6α)
+ S_{t,}_{3}(dew, df = 3α),

**Specification C**

S_{t,}_{1}(time, df = 8α) + S_{t,}_{2}(temp, df = 6α)
+ S_{t,}_{3}(dew, df = 3α)
+ S_{t,}_{4}(temp_{13}, df = 6α).

In the above equations, θ is the known PM_{2.5} effect, and temp, temp_{13}, and dew represent the current day’s temperature, temperature of the previous 3 days, and current day’s dew point temperature, respectively. The functions S_{t}_{,}* _{j}*() are smooth natural cubic spline functions with the indicated degrees of freedom. To ensure that the degrees of freedom take integer values, the values of 8α, 6α, 4α, and 3α are rounded to the nearest integer. To find realistic representations of the S

_{t}_{,}

*(), we fitted Model [6], using each specification of confounders(α = 1.2)*

_{j}*to the actual Seattle data using a Poisson log-linear generalized linear model with an offset term allowing the effect of PM*

_{t}_{2.5}to be set equal to θ. The offset term allows a term to be included in a generalized linear model with a known, rather than an estimated, coefficient value. We used the fitted values from these models to generate daily Poisson mortality estimates that incorporate a known PM

_{2.5}effect θ. Three values of θ: 0, 0.0003, and 0.001 were considered.

To implement model averaging, a set of candidate models was required. We considered two sets of candidate models that were defined by Model [1] with α taking 10 equally spaced values ranging from α = 0.3 to α = 3, confounders(α)* _{t}* as defined in specification A, and either three lags of PM

_{2.5}(PM

_{2.5}

_{t}_{,0}, PM

_{2.5}

_{t}_{,1}, PM

_{2.5}

_{t}_{,2}) or one lag of PM

_{2.5}(PM

_{2.5}

_{t}_{,1}). In the case of three lags of PM

_{2.5}, we have a set of 10 × 3 = 30 candidate models, and in the case of one lag of PM

_{2.5}, a set of 10 × 1 = 10 candidate models. Similar methods for defining the tuning parameter α for time and weather variables have been used in previous investigations (Dominici et al. 2004; Roberts 2004). The number of parameters estimated in each candidate model is equal to the total number of degrees of freedom used in the S

_{t}_{,}

*() plus 1 for the intercept and 1 for the estimated PM*

_{j}_{2.5}effect.

For mortality generated using the confounders(α = 1.2)* _{t}* specification A, the “true” model is contained among both sets of candidate models, but for mortality generated using confounders(α = 1.2)

*specifications B and C, this is not the case. In specification B, the degrees of freedom used for time have been halved for each value of α compared with the candidate models; whereas, specification C includes temp*

_{t}_{13}, a variable that is not included in any of the candidate models. These latter two situations are perhaps more realistic because in practice no candidate model would correspond to the true model.

In the simulations, *B* = 1,000 was used in BOOT and for both layers of double BOOT. The simulations were conducted by generating sets of 1,000 mortality time series defined by Model [6] with α = 1.2, one of the confounder specifications A, B, or C, and θ and then by applying BOOT, double BOOT, and BMA using the two sets of candidate models [i.e., with 3 lags of PM_{2.5} (30 candidate models total) or 1 lag of PM_{2.5} (10 candidate models total)]. Table 1 contains the results of these simulations. In the simulations involving 30 candidate models, it is evident from the smaller root-mean-squared error (RMSE) values that double BOOT has superior performance to that of both BOOT and BMA. The breakdown of RMSE into bias and SE components shows that the improvement in performance offered by double BOOT is principally due to the lower SE of the estimates obtained by this method. In the simulations involving 10 candidate models, the methods offer similar performance.

For the simulations with 30 candidate models and confounders(α = 1.2)* _{t}* = specification A, we investigated the use of standard AIC model selection (results not shown) by basing estimates on the single model selected as “best” based on AIC. Performance, as measured by RMSE, was substantially worse than that of double BOOT, BOOT, and BMA, with the average values of RMSE of approximately 1.90 for each of the three scenarios considered.

As a final comparison we compared the predictive performance of the three methods using both simulated and actual mortality data. For each mortality time series, we randomly removed 100 observations and applied BOOT, double BOOT, and BMA to the remaining data to obtain predictions for the removed observations. The predictions were computed as weighted averages of the predictions obtained from each candidate model weighted by the weight or probability assigned to that model. The performance of each method was based on the predictive mean squared error (PMSE) computed as {(*y*_{1} ˆ*y*_{1})^{2} +…+ (*y*_{100} ˆ*y*_{100})^{2}}/100, where *y _{i}* and ˆ

*y*are the actual and predicted mortality estimates, respectively. For a given mortality time series, we repeated the process of randomly removing 100 data points and computing the PMSE 100 times. Table 2 reports the number of times (out of 100) that each method had a better predictive performance than alternative methods based on lower PMSE. It is clear that double BOOT has predictive performance superior to that of BOOT, with double BOOT having a smaller PMSE about 70% of the time. The results also provide support for double BOOT versus BMA, with double BOOT providing the same or better predictive performance in two of the three model-specific simulations and in three of the five city-specific simulations.

_{i}** Application.** Tables 3 and 4 show the results of applying the three model-averaging methods and standard AIC to the five cities described above. We calculated the SE values in Table 3 using equation 4 of Burnham and Anderson (2004). For these five cities, the estimates obtained from the three model-averaging methods were similar and the conclusions drawn about the association between PM

_{2.5}and mortality would be essentially the same. However, the results also illustrate that the estimates obtained from standard AIC can be significantly different to those obtained from model averaging. The SEs assigned to the estimates obtained from standard AIC are smaller because these SEs do not take into account the model selection process that was used to find the single best model.

The reason for the differences in the estimates obtained from the three model-averaging
methods based on 30 candidate models compared with standard AIC is a result of the model-averaging methods assigning nonnegligible weights to a number of candidate models. Within each city, the three model-averaging techniques tended to assign nonnegligible weights to three models corresponding to the three different lags of PM_{2.5} but the same level of confounder adjustment α. Comparing the weights obtained from BOOT and double BOOT illustrates that the second bootstrap layer can result in substantial changes to the weights assigned to each model. For example, for Seattle and Tampa in some situations the weights assigned to candidate models differ by approximately 40%.

### Discussion

We have illustrated that double BOOT model averaging can offer benefits over BMA and BOOT for both estimation and prediction. The benefits were particularly noticeable for double BOOT compared with BOOT. This increased performance was attributable to a reduction in the variance of the estimates obtained from double BOOT compared with BOOT and BMA. An interesting observation was that the bias of the estimates obtained from double BOOT was larger than the estimates obtained from BOOT and BMA when the “true” model was contained among the candidate models. This was not the case, however, when the “true” model was not among the candidate models because the double BOOT procedure tended to give less weight to the true model as a consequence of the second bootstrap layer moving some of the weight from the true model to other plausible models. Of course, this phenomenon could not occur in the simulations where the “true” model was not among the candidate models, and the result was that double BOOT had slight improvements in terms of lower bias.

A report of particular relevance to the present study is that of Buckland et al. (1997) who investigated various forms of bootstrap model averaging, including the BOOT method in the present investigation. Buckland et al. (1997) and Claeskens and Hjort (2008) each provide excellent introductory treatments of the issues surrounding model selection and model averaging. Burnham and Anderson (2002) showed that AIC can be derived as a Bayesian result and that the AIC-based BMA weights used in the present paper correspond to posterior model probabilities. Unlike the implementation in this report, BMA can also be implemented by explicitly assigning prior model probabilities (Hoeting et al. 1999; Koop and Tole 2004). In the present setting, AIC-based BMA has the advantage of using objective prior distributions (Clyde 2000) and ease of implementation, compared with explicitly assigned prior model probabilities. An obvious disadvantage of AIC-based BMA is that is does not allow for the incorporation of prior information about the importance of a variable.

It is important to note that the use of BMA applied to time series studies of air pollution and mortality, and in particular the approach of Koop and Tole (2004), has received some criticism in the literature (Crainiceanu et al. 2008; Thomas et al. 2007). In this study we have attempted to avoid these same criticisms by ensuring that when illustrating our proposed averaging method we did so over a range of plausible candidate models, ensuring that a measure of air pollution exposure is included in each candidate model, focusing on single-pollutant models, and also investigating predictive performance. We are of the view that a carefully applied model-averaging procedure can provide useful insight into understanding air pollution health effects by, for example, providing information on how much the data support various models, helping practitioners to appreciate and allow for the effects of model selection and uncertainty, and in some circumstances providing improved estimators of air pollution health effects. However, we are also of the view that the use of model averaging does not negate the need for careful planning and data-gathering processes along with detailed investigations of models arising from a suitably rich set of initial covariates to find an initial and sufficiently rich plausible set of candidate models. We also believe that future comparisons of results obtained from model averaging with traditional methods such as standard AIC would prove valuable.

### Attached Files

### References

- Akaike H. 1973 Information theory as an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory (Petrov BN, Csaki F, eds); Budapest:Akademiai Kiado, 267–281.
- Bell ML, Samet JM, Dominici F. 2004. Time-series studies of particulate matter. Annu Rev Public Health 25:247–280.
- Beran R. 1987. Prepivoting to reduce level error of confidence sets. Biometrika 74:457–468.
- Breitner S, Stölzel M, Cyrys J, Pitz M, Wölke G, Kreyling W, et al. 2009. Short-term mortality rates during a decade of improved air quality in Erfurt, Germany. Environ Health Perspect 117:448–454.
- Buckland ST, Burnham KP, Augustin NH. 1997. Model selection: an integral part of inference. Biometrics 53:603–618.
- Burnham KP, Anderson DR. 2002. Model Selection and Multimodal Inference: A Practical Information-Theoretic Approach. 2nd ed. New York:Springer.
- Burnham KP, Anderson DR. 2004. Multimodel Inference: understanding AIC and BIC in model selection. Sociol Method Res 33:261–304.
- Claeskens G, Hjort NL. 2008. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. New York:Cambridge University Press.
- Clyde M. 2000. Model uncertainty and health effect studies for particulate matter. Environmetrics 11:745–763.
- Crainiceanu CM, Dominici F, Parmigiani G.. 2008. Adjustment uncertainty in effect estimation. Biometrika 95:635–651.
- Davison AC, Hinkley DV. 1997a. Bootstrap Methods and Their Application. Cambridge:Cambridge University Press, 331–333.
- Davison AC, Hinkley DV. 1997b. Bootstrap Methods and Their Application. Cambridge:Cambridge University Press, 326–384.
- Dominici F, McDermott A, Hastie TJ. 2004. Improved semiparametric time series of models air pollution and mortality. J Am Stat Assoc 99:938–948.
- Dominici F, Peng RD, Bell ML, Pham L, McDermott A, Zeger SL, et al. 2006. Fine particulate air pollution and hospital admission for cardiovascular and respiratory diseases. JAMA 295:1127–1134.
- Dominici F, Peng RD, Zeger SL, White RH, Samet JM. 2007. Particulate air pollution in the United States: did the risks change from 1987 to 2000? Am J Epidemiol 166:880–888.
- Dominici F, Sheppard L, Clyde M. 2003. Health effects of air pollution: a statistical review. Int Stat Rev 71:243–276.
- Draper D. 1995. Assessment and propagation of model uncertainty. J R Stat Soc Series B Stat Methodol 57:45–97.
- Efron B. 1983. Estimating the error rate of a prediction rule: improvement on cross-validation. J Am Stat Assoc 78:316–331.
- Figueiras A, Roca-Pardiñas J, Cadarso-Suárez C. 2005. A bootstrap approach to avoid the effect of concurvity in generalised additive models in time series studies of air pollution. J Epidemiol Community Health 59:881–884.
- Goldberg MS, Burnett RT, Yale J-F, Valois M-F, Brook JR. 2006. Associations between ambient air pollution and daily mortality among persons with diabetes and cardiovascular disease. Environ Res 100:255–267.
- Hall P. 1986. On the bootstrap and confidence intervals. Ann Stat 14:1431–1452.
- Hall P, Martin MA. 1988. On bootstrap resampling and iteration. Biometrika 75:661–671.
- Hoeting JA, Madigan D, Raftery AE, Volinsky CT. 1999. Bayesian model averaging: a tutorial. Stat Sci 14:382–417.
- Kelsall JE, Samet JM, Zeger SL, Xu J. 1997. Air pollution and mortality in Philadelphia, 19741988. Am J Epidemiol 146:750–762.
- Koop G, Tole L. 2006. An investigation of thresholds in air pollution- mortality effects. Environ Modell Softw 21:1662–1673.
- Koop G, Tole L. 2004. Measuring the health effects of air pollution: to what extent can we really say that people are dying from bad air? J Environ Econ Manage 47:30–54.
- Lahiri SN. 2003. Resampling Methods for Dependent Data. New York:Springer-Verlag.
- Loh W-Y. 1987. Calibrating confidence coefficients. J Am Stat Assoc 82:155–162.
- Lukacs PM, Burnham KP, Anderson DR. 2009. Model selection bias and Freedman’s paradox. Ann Inst Stat Math; doi: 10.1007/s10463-009-0234-4.
- Martin MA, Roberts S. 2006. Bootstrap model averaging in time series studies of particulate matter air pollution and mortality. J Expo Sci Environ Epidemiol 16:242–250.
- National Research Council. 1998. Research Priorities for Airborne Particulate Matter. Washington, DC:National Academy Press.
- Peng RD, Bell ML, Geyh AS, McDermott A, Zeger SL, Samet JM, et al. 2009. Emergency admissions for cardiovascular and respiratory diseases and the chemical composition of fine particle air pollution. Environ Health Perspect 117:957–963.
- Peng RD, Dominici F, Louis TA. 2006. . Model choice in multi-city time series studies of air pollution and mortality. J R Stat Soc Ser A 169:179–203.
- Politis DN, Romano JP. 1994. The stationary bootstrap. J Am Stat Assoc 89:1303–1313.
- R Development Core Team. 2009. Vienna:R Foundation for Statistical Computing. R: A Language and Environment for Statistical Computing. Available: http://www.R-project.org [accessed 9 November 2009]
- Ramsay TO, Burnett RT, Krewski D. 2003. The effect of concurvity in generalised additive models linking mortality to ambient particulate matter. Epidemiology 14:18–23.
- Roberts S. 2004. Interactions between particulate air pollution and temperature in air pollution mortality time series studies. Environ Res 96:328–337.
- Roberts S. 2005. An investigation of distributed lag models in the context of air pollution and mortality time series analysis. J Air Waste Manage Assoc 55:273–282.
- Samoli E, Peng R, Ramsay T, Pipikou M, Touloumi G, Dominici F, et al. 2008. Acute effects of ambient particulate matter on mortality in Europe and North America: results from the APHENA study. Environ Health Perspect 116:1480–1486.
- Thomas DC, Jerrett M, Kuenzli N, Louis TA, Dominici F, Zeger S, et al. 2007. Bayesian model averaging in time-series studies of air pollution and mortality. J Toxicol Environ Health A 70:311–315.
- Zeger SL, McDermott A, Dominici F, Peng R, Samet J. 2006. Internet-Based Health and Air Pollution Surveillance System. Communication 12. Boston, MA:Health Effects Institute.

**Tags**: January 2010

## Announcements

Come visit *EHP* Editor-in-Chief Sally Darney at the 29th Annual Scientific Conference of the International Society of Environmental Epidemiology, to be held in Sydney, Australia, 24–28 September 2017. She looks forward to talking with *EHP* authors, reviewers, and editors. Contact Sally directly to make an appointment with her.

*EHP* is pleased to announce that “Maternal and Cord Blood Manganese Concentrations and Early Childhood Neurodevelopment among Residents near a Mining-Impacted Superfund Site,” published in *EHP* on 28 June 2017, has been selected by the Children’s Environmental Health Network (CEHN) as its August 2017 Article of the Month. (more…)