Constituents of Household Air Pollution and Risk of Lung Cancer among Never-Smoking Women in Xuanwei and Fuyuan, China

Background: Lung cancer rates among never-smoking women in Xuanwei and Fuyuan in China are among the highest in the world and have been attributed to the domestic use of smoky (bituminous) coal for heating and cooking. However, the key components of coal that drive lung cancer risk have not been identified. Objectives: We aimed to investigate the relationship between lifelong exposure to the constituents of smoky coal (and other fuel types) and lung cancer. Methods: Using a population-based case–control study of lung cancer among 1,015 never-smoking female cases and 485 controls, we examined the association between exposure to 43 household air pollutants and lung cancer. Pollutant predictions were derived from a comprehensive exposure assessment study, which included methylated polycyclic aromatic hydrocarbons (PAHs), which have never been directly evaluated in an epidemiological study of any cancer. Hierarchical clustering and penalized regression were applied in order to address high colinearity in exposure variables. Results: The strongest association with lung cancer was for a cluster of 25 PAHs [odds ratio (OR): 2.21; 95% confidence interval (CI): 1.67, 2.87 per 1 standard deviation (SD) change], within which 5-methylchrysene (5-MC), a mutagenic and carcinogenic PAH, had the highest individual observed OR (5.42; 95% CI: 0.94, 27.5). A positive association with nitrogen dioxide (NO2) was also observed (OR: 2.06; 95% CI: 1.19, 3.49). By contrast, neither benzo(a)pyrene (BaP) nor fine particulate matter with aerodynamic diameter ≤2.5μm (PM2.5) were associated with lung cancer in the multipollutant models. Conclusions: To our knowledge, this is the first study to comprehensively evaluate the association between lung cancer and household air pollution (HAP) constituents estimated over the entire life course. Given the global ubiquity of coal use domestically for indoor cooking and heating and commercially for electric power generation, our study suggests that more extensive monitoring of coal combustion products, including methylated PAHs, may be warranted to more accurately assess health risks and develop prevention strategies from this exposure. https://doi.org/10.1289/EHP4913


Introduction
The incidence of lung cancer among female never smokers in Asia is among the highest in the world (Couraud et al. 2012;Epplein et al. 2005;Pelosof et al. 2017;Samet et al. 2009). There is a large body of evidence identifying household air pollution (HAP) as a key risk factor for lung cancer among never smokers. Approximately half of the world's population is still exposed to HAP from domestic cooking and/or heating with solid fuels (wood, coal, biomass) ( Barone-Adesi et al. 2012;Lan et al. 2002;Sisti et al. 2012). The lung cancer rates in nonsmoking women in Xuanwei, China, are among the highest in the world (Barone-Adesi et al. 2012;Chapman et al. 1988;Mumford et al. 1987). Residents of Xuanwei live primarily in rural areas, working as subsistence farmers who use solid fuels for domestic heating and cooking. Previous epidemiological and experimental studies have shown that the cause of this excess cancer risk is the use of bituminous coal, locally referred to as smoky coal, for indoor cooking and heating (Barone-Adesi et al. 2012;Chapman et al. 1988;Ho et al. 2016;Large et al. 2009;Lui et al. 2017aLui et al. , 2017bMumford et al. 1987Mumford et al. , 1989Tian 2005;Tian et al. 2008). When compared with alternative fuels typically available in the area [including a carboniferous anthracite (smokeless) coal], smoky coal use was associated with a 100-fold increase in lung cancer mortality among nonsmoking women (Barone-Adesi et al. 2012).
The key drivers of this excess risk remain elusive however, because no study to date has been able to estimate specific exposures for individuals in this population over their lifetime and then directly link those exposure estimates to lung cancer risk.
Given the widespread global use of coal for domestic heating and cooking and for power generation, the striking lung cancer excess in this region of China provides an important scientific and public health opportunity to further our mechanistic understanding of the carcinogenicity of coal, more comprehensively evaluate its health risks, and develop strategies for prevention. We therefore conducted a comprehensive large-scale populationbased lung cancer case-control study and a complementary exposure assessment study in never-smoking women in the counties of Xuanwei and Fuyuan (Downward et al. 2014a(Downward et al. , 2014bDownward 2015;Downward et al. 2016Downward et al. , 2017Hu et al. 2014;Seow et al. 2016) We evaluated an array of HAP constituents, including traditional markers of air pollution, such as particulate matter with aerodynamic diameter less ≤2:5 lm (PM 2:5 ), black carbon (BC), nitrogen dioxide (NO 2 ), sulfur dioxide (SO 2 ), and the 16 priority U.S. Environmental Protection Agency (U.S. EPA) polycyclic aromatic hydrocarbons (PAHs), including benzo-(a)pyrene (BaP) (Downward et al. 2014b(Downward et al. , 2017Hu et al. 2014;Seow et al. 2016). To more comprehensively assess risks from specific PAHs, we analyzed an additional 23 PAHs, including multiple methylated subspecies, and also measured radon, 23 trace elements (including arsenic), and crystalline silica (quartz) in air (Downward 2015;Downward et al. 2017).

Data Collection
Study design. We conducted a population-based case-control study of lung cancer among never-smoking women from Xuanwei and Fuyuan, China. The enrollment methods have been described previously (Wong et al. 2019). Cases were enrolled from the six hospitals that diagnosed the majority of lung cancer cases in Xuanwei and Fuyuan. Cases were defined as newly diagnosed lung cancers [International Classification of Disease, Ninth Revision (ICD-9; CDC 2013) code 162] among female never smokers between the years of 2006 and 2013. Cases were required to be aged 18-79 y old (at the time of diagnosis), currently living in Xuanwei or Fuyuan, have lived in these counties for at least 1 y, and have had no previous history of cancer diagnoses. Cases were initially identified on the basis of clinical presentation and radiological imaging and subsequently provided sputum samples for cytological analysis and tuberculosis (TB) testing. A subgroup of cases had more extensive diagnostic testing; however, cases without a confirmed histological or cytological diagnosis of lung cancer, clinical evidence, or radiologic findings (i.e., chest X-ray, computed tomography scan), and lack of evidence for other conditions (e.g., TB) were used as the basis for the diagnosis of lung cancer. Cases that were not confirmed based on diagnostic tests were followed up for vital status and cause of death for at least 3 y after enrollment and classified as highprobability lung cancer cases if they did not have another diagnosis explaining their initial presentation or cause of death. A total of 1,060 eligible lung cancer cases were considered, of which 706 were confirmed and 354 were high-probability cases.
We also enrolled a total of 498 population-based controls, frequency matched to cases by age ( ± 5 y). The case-control ratio was partly determined by financial and practical considerations (control interviews were much more difficult to arrange), while the total sample size was considered sufficient for statistical power based on the very strong associations between smoky coal use and lung cancer risk in this region (Barone-Adesi et al. 2012;Lan et al. 2008) Controls were never smokers who did not have a previous cancer diagnosis, were currently living in Xuanwei or Fuyuan, and had lived in these counties for >1 y.
Control sampling was based on four geographic levels from largest to smallest, as follows: a) commune, b) administrative villages ("dadui"), c) natural villages/settlements, and d) individuals. Information on the population density across age strata was available for all communes in Xuanwei and Fuyuan from the National Bureau of Statistics of China. Along with the predicted age distribution of cases in these provinces, this information was used to randomly select natural villages nested within administrative villages and communes from which controls should be sampled. Subsequently, the field team of interviewers visited each natural village, selected three potential participants fitting the eligibility criteria, and randomly recruited one of them.
Cases and controls were interviewed using a detailed questionnaire that collected information on residential history, fuel use, and established or suspected risk factors for lung cancer.
We excluded 41 cases and 13 population-based controls because they reported using fuels or stove types that were not included in the exposure survey that was used for exposure assessment (i.e., coal deposit or stove type used) and an additional 4 cases because of missing information on one of the questionnaire items that was used to adjust for potential confounding in the models ("having sufficient food before marriage"). Participation rates were 84% for cases and 89% for controls. The present investigation was based on data from 1,015 cases and 485 controls.
Ethics. All participants provided written informed consent prior to participating in the study. The study protocol was approved by the institutional review boards of the National Cancer Institute and China National Environmental Monitoring Center.
Questionnaire. Information on demographics, lifestyle factors (including exposure to secondhand smoke), socioeconomic status (SES) (including food sufficiency before marriage), medical history, and household characteristics (including lifetime history of fuel and stove use, type of solid fuels used, and the mine from which coal was sourced) was collected through an administered questionnaire.
Coal deposit assignment and estimation of household air pollutant exposures. The geographic location of coal deposits used by study subjects and associated lung cancer risk by deposit are shown in Figure S1. Coal deposits were classified as smoky coal (i.e., coking coal, one-third coking, meager lean, gas fat) or smokeless coal based on the Chinese state standard coal classification (Chen 2000). Coal deposits from which the household coal originated were assigned on the basis of self-reported fuel sources. Lung cancer risks based on the aggregation of coal deposits that shared common characteristics were previously analyzed and are shown in Figure S1 (Wong et al. 2019). If participants reported that their coal source was a local coal mine, deposits were assigned through Bayesian predictive modeling, where proximity and preference of inhabitants of the same village were the primary predictors.
The Bayesian predictive modeling used a multinomial choice model for each village, which included all mines within a 30-km radius of the village center (selected because this represented the upper 90% of all observed distances between subjects and their reported coal mine). Probabilities for each mine were derived based on the observed choices of other individuals within the same village and a linear effect of straight-line distance to the village center (selecting for mines closer to the participant's village). To this end, probabilities resulted in 50% lower odds of selection for each additional 10-km increase in distance from the village center. To take into account information on reported coal type while allowing reporting errors to occur (previously reported to be approximately 10%) (Downward et al. 2014a), mines that produced the coal type reported were assigned fivefold higher odds of selection.
Selection probabilities for each deposit layer were calculated by summing up selection probabilities for individual mines that were mining that layer. The current analyses used the most likely predicted deposit layer for each participant who reported using coal as their primary fuel source.
Exposure assessment was performed through determinant based modeling as previously described (Downward et al. 2014b, Environmental Health Perspectives 097001-2 127(9) September 2019 2016; Hu et al. 2014;Seow et al. 2016). Briefly, we enrolled 163 households and their never-smoking female heads, selected from 30 villages throughout Xuanwei and Fuyuan, for an exposure survey in which multiple indoor and personal air samples were collected. Village and subject selection was targeted to represent the major geographical regions, solid fuels, and stove designs in use and to provide a population reflective of that within the case-control study. Personal and indoor measurements were collected over two sequential 24-h periods in 2008 and 2009. Samples of PM 2:5 , BC, quartz, trace elements, and particle-bound PAHs and methylated PAHs (mPAHs) were collected on 37-mm Teflon filters. Indoor measurements of NO 2 and SO 2 were collected using passively diffusing filters (Ogawa). Radon was evaluated through the deployment of 90-d passively collecting detectors. Approximately 50% of subjects were visited again in a second season to allow seasonal adjustment of findings. Because levels of crystalline silica, trace elements, and radon were not associated with either fuel use or the use of different coal deposits, they were not included in any subsequent analyses (Downward 2015).
An overview of the constituents measured, by stove and fuel type, is given in Tables S1 and S2. Predictive linear mixed-effect models were constructed for each pollutant, where villages and individual subjects were assigned as random effects (Downward 2015Hu et al. 2014;Seow et al. 2016). Model construction was performed under a supervised stepwise procedure wherein variables were individually considered for model inclusion on the basis of goodness of fit and Akaike information criteria. Additional information is available in Text S1.
These predictive models were applied to the self-reported histories of stove and fuel use within the case-control data set to assign annual geometric mean predicted exposure levels. For case-control subjects who reported using more than one fuel/ stove in a given year, predictions from each permutation were combined based on the proportion of time spent using each permutation (e.g., a person using smoky coal and wood in equal amounts would have their final assignment calculated from a 50% smoky coal and 50% wood prediction).

Statistical Analysis
We followed two approaches to investigate the relation between exposure to HAP and lung cancer case-control status. We first identified clusters of highly correlated HAP constituents using a hierarchical clustering method that is described in more detail in the next paragraph. Regression models were then fitted that related case-control status to cluster scores. For the second approach, we fitted models using the full set of HAP constituents (n = 43). All models were adjusted for age, self-reported exposure to secondhand smoke, and self-reported food sufficiency before marriage as an indicator of SES. Food sufficiency is the most appropriate metric of SES in this population, which comprises mostly rural subsistence farmers with nonguaranteed food security and relatively little contrast in education.
Cluster analysis. We performed a hierarchical cluster analysis on exposures of the controls after a Box Cox transformation to reduce skewness and improve symmetry of the distributions (Box and Cox 1964). The distance matrix was calculated using standard Euclidean distances, and the complete linkage method was used to determine the cluster sequence (Defays 1977). The number of clusters to extract was decided by maximizing the average silhouette width of the resulting clusters (Rousseeuw 1987). For clusters that consisted of a single exposure, a cluster score was calculated by assigning the (transformed) exposure level. For clusters that consisted of more than one exposure, the cluster score was calculated by assigning the first component score from a principal component analysis on all (transformed) exposures in that cluster. To evaluate whether clusters and their associations with cancer differed notably between the overall study population and those using smoky coal, cluster analysis was performed separately for the full study population and for those using smoky coal.
Effect estimation. We used regularized logistic regression to evaluate the association between estimated exposures and lung cancer case-control status. We used multipollutant models because of concerns about uncontrolled confounding by moderate or highly correlated coexposures in single-exposure models (Agier et al. 2016). Investigating health outcomes associated with correlated exposures using multipollutant models is challenging because of the sometimes low precision of coefficient estimates that results from the large number of potential multicollinear covariates and the potential for introducing rather than removing bias (Schisterman et al. 2017;Weisskopf et al. 2018). To address these challenges and improve interpretability of the results, we used a Bayesian regularized regression method to fit our multipollutant models. We preferred a Bayesian approach over a frequentist regularization method (like the well-known lasso or elastic net) because the full posterior distribution allows more complete inference and assessment of model uncertainty (Casella et al. 2010).
We used the horseshoe estimator as originally suggested by Carvalho et al. (2010) by formulating shrinkage priors on the regression coefficients in a Bayesian regression model. We followed recommendations for further modifications of the horseshoe prior and hyperparameter settings that were suggested by Piironen and Vehtari (2017) and implemented in the R package brms (Bürkner 2017) (R version 2.3.1; R Development Core Team). The hyperparameter for the scale of the global shrinkage parameter was selected assuming that all of the confounders, but no more than 50% of the exposures, were associated with lung cancer, while the degrees of freedom (df) for this parameter were left at the default value of 1. The scale parameter for the slab (i.e., the regularization prior for coefficient values of effective variables) was set to 2.5, the df for the slab to 4, and the df for the local shrinkage parameters to 1. We increased the adapt_delta parameter of the no-u-turn sampler (Betancourt 2018) to 0.99 to avoid divergent transitions. We checked the sensitivity of our results to these assumptions by comparing them with results from models that were more strongly regularized (i.e., under the assumption that only a single exposure was related to the outcome) but found that results were not very sensitive to the exact choice. An additional sensitivity analysis using models with priors that implied very little regularization [i.e., a student t-test prior with 7 df and a scale of 2.5 (Gelman et al. 2008)] resulted in point estimates that were broadly similar to the regularized models but had considerably wider confidence intervals (CIs) ( Table S3).
We used 4 chains and 4,000 iterations per chain, discarding the first 2,000 iterations as burn-in. The model was checked for convergence using the Gelman-Rubin diagnostic (Brooks and Gelman 1998).
Receiver operating characteristic analysis. To assess model fit of the regularized regression models and to investigate to what extent the models based on estimated pollutant levels improved on models based on coal deposit information only ( Figure S1), we calculated the area under the curve (AUC) for the prediction of lung cancer case status using a receiver operating characteristic (ROC) analysis (Fawcett 2006). Case-control probabilities were estimated using 10-fold cross-validation to avoid overfitting of the prediction model. Models were adjusted for the same covariates as in the main analysis above using the approach described by Yu et al. (2012), who suggest including estimated covariate effects as a fixed offset in the fold-specific prediction models.
Sensitivity analyses. Exposure contrast was partly due to differences between subjects who mainly used smokeless coal or wood and subjects who mainly used smoky coal. Although this may improve the precision of our regression estimates, it also complicates interpretation because of the potential for introducing aggregation bias and unmeasured confounding (Navidi et al. 1994). We therefore conducted sensitivity analyses that excluded participants who had used smokeless coal and/or wood for more than 2 y. A total of 859 cases (85%) and 182 controls (38%) had not used smokeless coal and/or wood for more than 2 y (i.e., the smoky coal subpopulation).
We also investigated the potential for a lagged effect between modeled exposures and outcome. However, the rank correlations between 10-y lagged cumulative exposure and unlagged cumulative exposure was >0:98 for all HAP constituents. Given this high correlation, no additional lag analysis was performed.

Population Demographic and Exposure Characteristics
An overview of participant characteristics for cases and controls is provided in Table 1 with commune-specific populations in Table  S4. A notable difference was that 96% of cases vs. 79% of controls had ever used smoky coal. Additionally, controls were more likely to have experienced food sufficiency than cases (62% vs. 57%, respectively). An overview of estimated exposure to all 43 retained HAPs by case-control status is provided in Figure S2. In general, cases experienced higher exposures of individual constituents than controls. A heatmap showing Spearman correlation coefficients between the different exposures is shown in Figure S3.
Hierarchical cluster analysis identified six exposure clusters based on data from all controls. An overview of the clusters, alongside pollutant abbreviations, is presented in Table 2. Following review of the clusters, they were designated as follows: PAH25 [consisting of 25 PAHs, including BaP and 5-methylchrysene (5-MC)], WB7 (consisting of seven wood burningassociated exposures, including PM 2:5 and retene), PAH7, PAH2, and two single-exposure clusters consisting of NO 2 and SO 2 , respectively. Correlations between clusters ranged from −0:51 (PAH2:PAH7) to 0.64 (PAH25:NO 2 ) with a median coefficient of 0.38 (Table S5). Table S6 shows the correlations between the different clusters and between each individual exposure and each cluster. Within each cluster, the individual exposure metrics are naturally positively correlated with each other. Within the PAH25 cluster in particular, correlations were typically greater than 0.9 ( Figure S3).
The cluster analysis for the subset of participants that used only or mainly smoky coal resulted in the identification of five clusters, which were designated PAH29 + (29 PAHs plus NO 2 and SO 2 ), PAH2, PAH5 + , WB2, and a single-exposure cluster consisting of retene. Correlations between clusters ranged from −0:44 to 0.79 with a median coefficient of 0.48 (Table S7). Within each cluster, the individual exposure metrics again positively correlated with each other (Table S8).

Lung Cancer and Household Air Pollution Components
Results from the regularized regression analysis of the relationship between the above clusters and lung cancer are presented in Figure 1 for both the full study population and the smoky coal subpopulation (see Tables S9 and S10 for numerical values). In the full population, the PAH25 and NO 2 clusters were the most positively associated with lung cancer with odds ratios (ORs) of 2.21 (95% CI: 1.67, 2.87) and 2.27 (95% CI: 1.48, 3.55 for a 1 standard deviation increase in exposure, respectively). Exposures to pollutants in clusters WB7 and SO 2 were associated with a lower odds of lung cancer [ORs: 0.24 (95% CI: 0.19, 0.31) and 0.55 (95% CI: 0.43, 0.70), respectively], suggesting that these may reflect use of noncoal fuels or less toxic coals.
When examining the results of the regularized regression analysis upon the clusters identified within the smoky coal subpopulation ( Figure 1B; numerical values available in Table S10), positive (albeit nonsignificant) ORs were observed for PAH29 + A graphical comparison of the coefficients from the (regularized) regression models fitted to the full population data or data from the smoky coal subpopulation is provided in Figure S4. Overall, there was high agreement between both models (Pearson's correlation of 0.8), and the positive associations observed in the full population typically remained in the smoky coal subpopulation.

Comparison of Exposure-Specific Models to Models Based on Coal Deposit Layers
We used ROC curve analysis to determine whether exposurebased models provided better discrimination between cases and controls than models based only on information on most frequently used coal deposit layers (see Figure S1). ROC curves and AUCs are given in Figure 4A, and they show that models based on cumulative exposure to individual pollutants provided better model predictions than models based on coal deposit information only (AUC of 79% vs. 72%). In contrast, predictive performance as measured by AUC improved only slightly when information on coal deposits was added to a model based on cumulative exposure to individual pollutants only. However, a comparison of model predictions for this combined model to that from the model based only on individual pollutant information indicated that the modeled pollutants could not fully predict lung cancer case-control status in subjects using coal from deposit 9 ( Figure 4B), as evidenced by the symbols for this deposit visibly being off of the diagonal. This deposit was associated with the very highest risk of lung cancer in previous analyses (see also Figure S1) (Wong et al. 2019), and the estimated residual OR for subjects using coal from this deposit was estimated to be 2.77 (95% CI: 1.12, 6.85) in a model that also accounted for individual exposure effects. By contrast, the ORs for the other deposits ranged from 0.84 to 1.38, and all had 95% CIs crossing 1 (Table S12).
For the smoky coal subpopulation, ROC analysis ( Figure 4C,D) showed that the exposure-based model (AUC = 72%) provided even larger improvements over a model based on coal deposit information only (AUC = 57%), while again indicating that the effect may be underestimated for subjects using coal from deposit 9 (OR: 1.58; 95% CI: 0.91, 4.11; Table S13). Overall, these analyses show that modeled exposures were able to explain much of the lung cancer likelihood attributable to specific types of coal in this region, but that the very highest-risk coal and its combustion products may have additional carcinogenic characteristics that remain to be identified.

Discussion
We conducted the largest population-based case-control study of lung cancer among never-smoking females in Xuanwei and Fuyuan, China, to date to identify the coal combustion emission constituents that drive the exceptionally high lung cancer rates in this region. A parallel comprehensive exposure assessment study, designed to sample coal deposit types and patterns of use representative of cases and controls, measured coal combustion exposures in this region. This data was then used to estimate lifelong exposure to an extensive number of compounds that are potentially toxic constituents of coal combustion in this region of China as well as standard constituents of air pollution. We identified that lung cancer was most strongly associated with two factors, namely, a cluster of 25 PAHs and NO 2 .
Research to date on the cause of the extraordinarily high lung cancer rate in this region has identified the use of a particular type of bituminous coal, which was formed during the late Permian geologic time period, as the main driver of the observed risk (Large et al. 2009). However, studies to date have not been able to identify which HAP constituents are responsible for the observed risk. Understanding which individual or groups of constituents directly contribute to the development of lung cancer in this area is of great importance to further our understanding of lung cancer etiology.
In a previous analysis, we have shown that the association between the domestic use of smoky coal in Xuanwei and Fuyuan could vary from a twofold to a more than 30-fold increase in risk depending on the coal deposit sourced (Wong et al. 2019). In the current study, we show that this is largely driven by a cluster of parent PAHs and substituted derivatives, which include methylated PAHs. We attempted to isolate the signal of individual HAP components by employing a regularized regression model and identified 5-MC, benz(a)anthracene, and NO 2 as having the highest individual ORs in both the total study population and the smoky coal subpopulation.
Previously, Mumford et al. (1989Mumford et al. ( , 1995 conducted a series of experimental studies of smoky coal combustion constituents from Xuanwei, reported that alkylated PAHs (which included 5-MC) are major mutagens in Xuanwei indoor air, and proposed that these may explain the very high rates of lung cancer in this region. There are several carcinogenic metabolites in 5-MC, and it is associated with malignancies in laboratory studies and is present in tobacco smoke and diesel soot extracts (Hecht et al. 1974(Hecht et al. , 1985Hutzler et al. 2011;Pataki et al. 1983;Richter-Brockmann and Achten 2018). In addition, a recent study of coal from other parts of China showed that the method of preparation prior to combustion can have a drastic impact on the generation of methylated PAHs (Chen et al. 2015). The consistency in findings between the current paper and previous experimental studies suggests that methylated PAHs (including 5-MC) may play an important role in the very high lung cancer rates in this region. However, we should note that the high internal correlation within the PAH25 cluster means that identifying any single variable as the sole responsible variable is extremely challenging and that the high loading on 5-MC may represent a statistical, instead of a biological, effect.
We found that NO 2 was associated with lung cancer independently of other variables, both in the cluster and regularized regression models. The observed association with NO 2 is of interest, as studies of outdoor air pollution have provided consistent evidence of a relationship between NO 2 , which is not thought to be a carcinogen itself, as a proxy for traffic-sourced air pollution exposure and lung cancer (Hamra et al. 2015). Similarly, it is possible that here, NO 2 is acting as a surrogate for a HAP component, which is relevant for lung cancer risk but is not reflected in any other constituents we have measured.
Previous research has proposed alternative potential drivers for the lung cancer risk in this region. Studies of uncombusted coal and controlled burning experiments have identified crystalline silica/quartz, either working alone or paired with volatile compounds, as a potential etiological agent. Silica is a known lung carcinogen, largely identified as such within the occupational setting (Steenland et al. 2001). However, it is unlikely that silica can explain a substantial part of the large lung cancer excess in Xuanwei and Fuyuan, given that lung cancer risk associated with high occupational silica exposure, where levels are two orders of magnitude higher than here, is only elevated by 50% (Vermeulen et al. 2011). Exposure to BaP has also been implicated in the region as an exposure of note (Mumford et al. 1987). BaP is well recognized as a lung carcinogen and has frequently been identified to be present in high amounts in the HAP of smoky coal homes, including in our exposure assessment study (Downward et al. 2014b). However, BaP was also observed to be in comparable amounts in wood-burning homes, indicating that its role as a driver of lung cancer in Xuanwei may be more limited. This is borne out by our analysis, which, in regularized regression, finds null ORs for BaP in both the full and smoky  Table S2 for complete listing of compounds in each cluster). Note: CI, confidence interval. coal populations. Other components of coal combustion, which could potentially contribute to lung cancer, include elemental composition of HAP and the presence of radon in the air. We previously measured multiple elements in coal and HAP samples including arsenic, vanadium, and lead, and found that the elemental composition of HAP was not related to fuel or stove type (Downward et al. 2014a;Downward 2015). Radon measurements were collected from 57 households with an average radon concentration of 3:4 pCi=L and with no evidence of a difference between smoky and smokeless coal. The average detected radon level was below the action level of 4 pCi=L as set by the World Health Organization and corresponds with only an approximate 20% increase in lung cancer [National Research Council (U.S.) Committee on Evaluation of EPA Guidelines for Exposure to Naturally Occurring Radioactive Materials 1999], and, as with silica, this is not enough to explain the large excess risk of lung cancer in this region.
We employed ROC analysis to assess how well the different models discriminate between cases and controls in this study under different modeling circumstances. In general, the use of cumulative HAP constituent exposure information substantially improved model predictions vs. models using information only on coal deposits. However, our models were unable to predict the entirety of the cancer risk in subjects using coal from deposit 9, which is located in central Laibin, where coal deposits are associated with the highest lung cancer rates in Xuanwei ( Figure S1). This suggests that while we have identified many of the major agents linked to lung cancer risk in much of this region, there may be additional components of exposure within the central Laibin area that have not yet been fully characterized. In addition to exposures, gene-environment interactions may also play a role. For example, perturbations in the GTSM1 gene (which has a role in PAH metabolism) has been related to increased lung cancer risk in Xuanwei (Lan et al. 2000).
Our study had notable strengths. First, we had high participation rates. Second, we used empirical models derived from a comprehensive exposure survey of middle-aged and elderly neversmoking women in this region to predict over 40 HAP constituents for participants across their life course. The ability to assign individual exposure levels across the life course is especially unique to this study, as no case-control study to date has been able to confidently link HAP concentrations to individual study participants. Many of the previous attempts to implicate groups or individual compounds to the observed risks were on an ecological level, which are prone to biases. Third, the study population was composed of Chinese women who never smoked, which removes potential confounding by active smoking and sex.
Our study has a number of limitations. Due to the high correlations between the different constituents, especially among PAHs and methylated PAHs, we cannot discount the possibility that other constituents also contribute to the observed ORs but, due to chance or having a relatively larger measurement error, were not retained in the models. Additionally, the high correlations between the PAHs, including those with strong negative effects (e.g., retene), are likely to result in effect estimates and their uncertainties being inflated, which is observed with the very wide uncertainties of the regularized regression among the smoky coal subpopulation.
An additional limitation is related to the life course approach. As with many case-control studies, recall bias is possible, especially when recalling events over the life course. Additionally, residential and household changes raise the potential for exposure misclassifications. Information regarding changes in residence, Figure 3. Odds ratios for individual pollutants and lung cancer in the smoky coal subpopulation. Estimates were rescaled to reflect the estimated effect for astandard deviation increase in exposure in the full population to allow direct comparison with estimated effects in Figure 2. The (rescaled) point estimate for retene was off the scale (indicated by the "<" symbol).  Figure S1. Predictions for deposit information is derived from logistic regression-adjusted forage and food sufficiency before marriage or 20 y of age and reported by Wong et al. (2019). coal sources, and exposure-relevant changes in household characteristics (e.g., chimneys) were collected as part of the current study and were included in the exposure models. Residents of Xuanwei and Fuyuan typically have limited residential mobility; therefore, events such as changes in address and the installation of improved stoves represent major (and therefore memorable to both cases and controls) life events. Additionally, measurements in the exposure study were collected in 2008-2009, while the exposures relevant to cancer risk likely occurred decades prior. To address this issue, we analyzed coal samples collected in the late 1980s from representative coal deposits in Xuanwei and coal samples collected from the same deposits in 2008-2013 and found that coal samples collected from these eras were comparable (Downward 2015;Large et al. 2009). In summary, this is the first analytic epidemiologic study to comprehensively study the role of specific HAP constituents experienced over the entire life course and lung cancer risk. Cluster analysis indicated that the main drivers of lung cancer were NO 2 and a cluster of 25 PAHs. Within the PAH cluster, was some evidence to implicate 5-MC as a potential driver of lung cancer, which is especially relevant given that 5-MC is part of the family of methylated PAHs that have been found experimentally to be the major class of mutagens in combustion products of smoky coal in Xuanwei. Given the ubiquity of coal use for indoor cooking and heating in China and other countries and for electric power generation worldwide, our study suggests that more extensive monitoring of coal combustion products, including methylated PAHs, may be warranted to more accurately assess health risks from this exposure. Finally, a fuller understanding of the presence and determinants of production of these compounds is needed to minimize exposure, which could have important public health and prevention implications in China and elsewhere.