Using Birth Cohort Data to Estimate Prenatal Chemical Exposures for All Births around the New Bedford Harbor Superfund Site in Massachusetts

Background: Children born near New Bedford, Massachusetts, have been prenatally exposed to multiple environmental chemicals, in part due to an older housing stock, maternal diet, and proximity to the New Bedford Harbor (NBH) Superfund site. Chemical exposure measures are not available for all births, limiting epidemiologic investigations and potential interventions. Objective: We linked biomonitoring data from the New Bedford Cohort (NBC) and birth record data to predict prenatal exposures for all contemporaneous area births. Methods: We used prenatal exposure biomarker data from the NBC, a population-based cohort of 788 mother–infant pairs born during 1993–1998 to mothers living near the NBH, linked to their corresponding Massachusetts birth record data, to build predictive models for cord serum polychlorinated biphenyls (expressed as a sum, ΣPCBs), p,p′-dichlorodiphenyl dichloroethylene (DDE), hexachlorobenzene (HCB), cord blood lead (Pb), and maternal hair mercury (Hg). We applied the best fit models (highest pseudo R2), with multivariable smooths of continuous variables, to predict exposure biomarkers for all 10,270 births during 1993–1998 around the NBH. We used 10-fold cross validation to validate the exposure models and the bootstrap method to characterize sampling variability in the exposure predictions. Results: The 10-fold cross-validated R2 for the ΣPCBs, DDE, HCB, Pb, and Hg exposure models were 0.54, 0.40, 0.34, 0.46, and 0.40, respectively. For each exposure model, multivariable smooths of continuous variables improved the fit compared with linear models. Other variables with significant effects on exposure estimates were paternal education, maternal race/ethnicity, and maternal ancestry. The resulting exposure predictions for all births had variability consistent with the NBC measured exposures. Conclusions: Predictive models using multivariable smoothing explained reasonable amounts of variance in prenatal exposure biomarkers. Our analyses suggest that prenatal chemical exposures can be predicted for all contemporaneous births in the same geographic area by modeling available biomarker data for a subset of that population. https://doi.org/10.1289/EHP4849


Introduction
Many potential epidemiologic analyses are constrained by the ability to obtain reliable individual-level chemical exposure estimates. It is cost prohibitive and logistically challenging to measure exposures for large populations, and for retrospective analyses or investigations using administrative data sets, it is often impossible. In multiple contexts, researchers have developed strategies to assign exposures to large populations by leveraging measurements for a subset of locations or participants. For example, in air pollution studies (Hoek et al. 2008), researchers collect measurements at a representative subset of geographic locations and construct land use regression (LUR) models to explain variability in those measurements. LUR models evaluate the relationship between observed air pollution concentrations and predictor variables such as land use and traffic conditions in a multivariable regression model to estimate concentrations at unmonitored locations. A defining feature of LUR models intended for epidemiologic studies is that they use as predictors covariates that are available for all geographic locations under study so that all study participants can have exposure assignment. Although less common, this regression approach can also be used to evaluate the relationship between measured biomarkers and predictor variables such as sociodemographic covariates in a subset of the population to predict exposure levels for the entire population.
Analogous to LUR models, we developed exposure regression models for biomarkers of prenatal exposure to organochlorines and metals using data obtained from pregnant women enrolled in the New Bedford Cohort (NBC) and predicted prenatal exposures for all contemporaneous births in the study area. The NBC is an ongoing birth cohort study of 788 mother-infant pairs born during 1993-1998 to mothers residing in one of the four towns adjacent to the polychlorinated biphenyl (PCB)-contaminated New Bedford Harbor (NBH) Superfund site (Figure 1) in Massachusetts. The NBH was designated a Superfund site by the U.S. Environmental Protection Agency (EPA) in 1982 because of extensive PCB contamination from local electronics manufacturing facilities (U.S. EPA 2015). Heavy metals, including lead (Pb), are also present in NBH sediments (Shine et al. 1995) and the area contains a number of other hazardous waste sites and industrial sources of pollution. The region's old housing stock (and associated Pb paint and Pb-containing water distribution systems) and relatively low socioeconomic status (SES) also contributes to Pb exposure risk (Friedrich 2000). Despite these local sources of chemical contamination, diet remains an important route of chemical exposure for the population; for example, fish intake correlates with PCB and methylmercury (MeHg) biomarker levels in the NBC (Choi et al. 2006;Sagiv et al. 2012b).
We modeled PCBs, p,p 0 -dichlorodiphenyl dichloroethylene (DDE), hexachlorobenzene (HCB), Pb, and mercury (Hg) because they have been associated with several childhood health outcomes of interest in this population. Prenatal exposure to organochlorines, including PCBs, has been associated with impaired neurodevelopment (Grandjean et al. 2001;Sagiv et al. 2008Sagiv et al. , 2010Sagiv et al. , 2012a, altered immune function (Dewailly et al. 2000), acute respiratory infections (Dallaire et al. 2004(Dallaire et al. , 2006, and decreased birthweight (Govarts et al. 2012). Impaired neurological development has been also observed with prenatal exposure to Hg (Orenstein et al. 2014;Sagiv et al. 2012b) and Pb (Bellinger et al. 1987;Hu et al. 2006). Although the exposure routes to contaminants in this community include diet and old housing stock, proximity to the harbor has been associated with increased exposure to airborne PCBs (Martinez et al. 2017) as well as with important nonchemical stressors, including low SES and a poor home environment (Vieira et al. 2017). The significance of predicting prenatal exposures for all contemporaneous births in this community is that it allows for future epidemiologic analyses of health outcomes in a large study population for which it is impractical or infeasible to obtain exposure biomarkers.

Study Population
New Bedford Cohort. During 1993During -1998 New Bedford area newborns and their mothers were recruited as participants in the NBC birth cohort study. During pregnancy, the NBC mothers lived in one of the four towns surrounding the NBH Superfund site, including New Bedford, Acushnet, Fairhaven, and Dartmouth. The NBC was designed to assess relationships of prenatal and early life chemical exposures with subsequent child neurodevelopment (Orenstein et al. 2014;Sagiv et al. 2010Sagiv et al. , 2012aSagiv et al. , 2012b. Study infants' prenatal exposures to organochlorines (PCBs, DDE, and HCB) were measured in cord serum and exposure to Pb was measured in cord whole blood; maternal peripartum hair was analyzed for total Hg as a proxy biomarker for MeHg exposure. The majority of Hg in hair is methylated and it correlates with MeHg levels in blood and integrates exposure over a longer time period than in blood, thereby decreasing measurement error in individual exposure assignments (Bartell et al. 2004;IPCS 1990). For PCBs, as previously, we focused on the sum of four prevalent PCB congeners: 118,138,153,and 180 (RPCB 4 ). NBC cord blood samples were collected at birth with one aliquot centrifuged for removal of the serum fraction and a second aliquot collected for whole blood analyses. Organochlorines were measured in the cord serum at the Harvard T.H. Chan School of Public Health Organic Chemistry Laboratory (Boston, MA) using liquid-liquid extraction with extracts analyzed by gas chromatography with electron capture detection. Details of the method and quality assurance/ quality control performance are reported elsewhere and include excellent reproducibility and sensitivity; for example, for most PCB congeners, detection limits were <0:01 ng=g (Korrick et al. 2000). Because of insufficient volume, lipid content could not be measured in NBC cord serum, so organochlorine concentrations are expressed on a wet weight basis as nanograms per gram serum.
Metal measurements were done at the Harvard T.H. Chan School of Public Health Trace Metals Laboratory (Boston, MA). Cord whole blood Pb levels were measured using isotope dilution inductively coupled plasma-mass spectrometry with excellent sensitivity [limit of detection (LOD) of 0:02 lg=dL] and precision (<5%). As a proxy for MeHg, total Hg was measured in maternal peripartum hair samples using atomic absorption spectroscopy with an average LOD of 50 ng=g and excellent quality control performance (Sagiv et al. 2012b).
For both organochlorine and metal analyses, we used quantifiable values below the detection limit to optimize statistical power and avoid bias associated with censoring at the method detection limit (Kim et al. 1995). Questionnaire and medical record data on sociodemographic and birth outcome characteristics were available for NBC participants and are shown in Table 1. After excluding three sets of twins, the analysis included data for 782 NBC mother-infant pairs.
Massachusetts Birth Record Cohort. The Massachusetts Birth Record Cohort (MBRC) includes data from birth certificate data for all births in Massachusetts during 1993-1998, including the mothers recruited into the NBC. Birth records in Massachusetts are linked to several subsequent health outcomes data sets including maternal and child hospital discharge records, emergency department data, the birth defects registry, newborn hearing screening data and the Massachusetts Healthy Start, Early Intervention, and Women, Infants and Children (WIC) Programs (Barfield et al. 2008;Girguis et al. 2016Girguis et al. , 2018Khalili et al. 2018;Kotelchuck 2010;Manning et al. 2011). All Massachusetts children born in the four NBC towns were identified (n = 10,270). Many of the same sociodemographic variables available in the NBC are also available for the MBRC (Table 1).

NBC Chemical Exposure Models
Exposure models were built to explain variability in measured NBC exposure biomarkers as a function of covariates available in the MBRC. Births in the NBC cohort were linked to Massachusetts birth record data. Previously developed prenatal exposure models for the NBC (Fabian et al. 2013(Fabian et al. , 2016 were used as a reference although these models did not exclusively focus on covariates available in the MBRC and did not include all exposures in the present analysis. Table 1 shows that the continuous variables available in the birth records are highly correlated with the same characteristics from the NBC study questionnaires and medical record reviews (Pearson correlation) and that there was high percentage agreement among categorical variables. Previous lactation is not directly available in the MBRC, but a proxy measure was constructed using breastfeeding initiation at the hospital and parity. Mothers with previous live births who initiated breastfeeding at the hospital were coded as having previous lactation. There was 85% agreement between this categorical proxy variable and the previous lactation variable available in the NBC data ( Table 1). The other birth record variable considered in the exposure models that was not directly available in the NBC study was adequacy of prenatal care. For the purposes of comparing data collected in the NBC to data in the birth record, Table 1 includes NBC data on prenatal vitamin use as a proxy for adequate prenatal care.
In addition to the available birth record data, we calculated proximity measures for all geocoded birth addresses. Residential distance in meters to the nearest major roads (Class = 1-4) was estimated using geographic information systems (GIS) and road segments obtained from the Massachusetts Department of Transportation. Major roads include limited-access highways, multilane highways, numbered routes, and major roads. Residential distance to roadway was included in the analysis for the cord blood Pb model, as was age of home, which was obtained from the Massachusetts tax assessor database for 2016/2017. Distance to the harbor was also calculated using GIS and was included in the RPCB 4 , DDE, HCB, and Hg models based on previous spatial associations with important sociodemographic predictors (Vieira et al. 2017). Geocoded addresses were also used to determine the median block group-level household income for the NBC and the MBRC using 2000 decennial U.S. Census data (U.S. Census Bureau 2000).
Exposure models were built for RPCB 4 , DDE, HCB, Pb, and Hg using generalized additive models (GAMs) of complete data for each chemical exposure biomarker. The following GAM framework was used for each exposure model: where y(exposure) is the predicted log exposures for a given chemical, S(x1, x2, . . ., xN) is the multivariable loess (locally weighted scatter plot smoothing) term(s) for the N continuous predictors, b 0 is the vector of parameters, and z is the vector of parametric covariates. The GAM was selected as the primary modeling approach because of its flexibility and ease of interpretation (Hastie and Tibshirani 1990); specifically, a loess smooth adapts to changes in data density and the model reduces to a linear regression without the smooth.
The exposure models were built using log-transformed chemical biomarkers due to their skewed distribution. We considered variables in Table 1 as predictors, starting with the full model and using backward elimination to select the final variables for inclusion. First, all continuous variables were modeled using univariable and multivariable smooth terms as well as linear terms, and the model structure that maximized the pseudo R 2 was Table 1. Characteristics of mother-infant pairs born during 1993-1998 to mothers residing near the New Bedford Harbor from three sources: the New Bedford Cohort (NBC) from study data (n = 782), the NBC Massachusetts birth record data (n = 779), and the Massachusetts Birth Record Cohort (MBRC) birth record data (n = 10,270). 26:3 ± 5:4 2 5 :9 ± 5:4 2 6 :0 ± 5:9 1.00 Maternal pregnancy weight gain (kg) (mean ± SD) 15:1 ± 6:4 1 3 :6 ± 5:7 1 3 :6 selected. We calculated the pseudo R 2 as 1 -(residual deviance/ null deviance) for variable selection. Based on the higher pseudo R 2 , all continuous variables in the exposure models were modeled with smooth terms. We then eliminated categorical variables with p-values generated from the F-statistic that were greater than 0.2 in order of highest p-value if they did not increase the pseudo R 2 value. Variables that were tested but not retained in the final exposure models are indicated as not significant (NS) in Tables 2 and 3. Continuous predictors in the exposure models included space and time represented by latitude and longitude of residential address, infant year of birth, and maternal age at birth. Results of a previous NBC study showed infant year of birth was strongly associated with PCB exposures (Choi et al. 2006). Furthermore, exposure to lipophilic organochlorines often increases with maternal age due to bioaccumulation (Axelrad et al. 2009;Birch et al. 2014). Birthweight, maternal weight gain (in kg), distance measures, and year the home was built were also tested as continuous predictors. Because the age of the home is often associated with the potential presence of Pb in plumbing or paint, we considered only build year in the Pb exposure model; build year was not   Table 3) as a predictor in the Hg or organochlorine models. Categorical variables considered in all exposure models included maternal race/ethnicity and ancestry, parental education, mother's marital status, block group income, parity, adequate prenatal care, maternal smoking and alcohol consumption during pregnancy, and previous lactation. Breastfeeding was tested in all models except the Hg model because it is not believed to be relevant for excretion of Hg (Table 3). The distributions for all variables are presented in Table 1.
We calculated a predictive 10-fold cross-validation R 2 value by randomly dividing the NBC data into 10 groups. For each fold, 1 group was used as the test data and the remaining groups combined as the training data set. The models were fit on the training set and evaluated against the test set. Each group served as a test set once. The correlation between the observed exposures and predicted exposures for the 10 test sets were then averaged to calculate the 10-fold cross-validation R 2 value. The optimal degree of smoothing in the GAMs was chosen by maximizing the 10-fold cross-validation R 2 for each model.

Exposure Predictions for the MBRC
Each NBC exposure model was used to predict the exposures for all births with complete data in the MBRC from 1993 through 1998 for the study towns of New Bedford, Acushnet, Fairhaven, and Dartmouth; we calculated the mean, median, and the 25th and 75th percentiles of those estimates. To account for the effects of sampling variability on our NBC exposure prediction model, we also applied the bootstrap method to each NBC exposure model to generate a distribution of plausible exposure biomarker values. We drew repeated samples 1,000 times, with replacement, from the NBC. The exposure model was refit each time, arriving at a different set of coefficients for the model terms, and applied to data from the MBRC to obtain 1,000 predicted log-transformed exposure values. After back-transforming those values, we calculated the median, the 25th and 75th percentiles, and the corresponding interquartile range (IQR) for the population and individual bootstrapped exposure estimates. For the individual estimates, the median and the 25th and 75th percentiles were calculated across the 1,000 bootstrapped values predicted for each birth and then averaged. For the population estimates, the median and the 25th and 75th percentiles were calculated across the entire cohort population of 10,270 births for each bootstrap, resulting in a distribution of 1,000 values for each statistic (the median and the 25th and 75th percentiles). We then took the mean and 95% probability intervals (PIs) of each statistic's distribution to assess the population exposures. Because the predictive characteristics of the population varied, we expected that the IQR of population exposures would be wider than the IQR for the exposures predicted for each individual birth using the bootstrap method. All calculations were performed using R (version 3.4.4; R Development Core Team), and the "gam" package was used for the models.

Exposure Model Performance
We also compared the performance of the GAMs to individual model and ensemble results from the SuperLearner package in R. SuperLearner evaluates the performance of multiple machine learning models, creates an optimal weighted average of the best performing models referred to as an ensemble, and provides comparisons of the ensemble to the other models using cross validation (Kennedy 2017). Using the SuperLearner function, we first fit the following models simultaneously: GAM, lasso, randomForest, XGBoost, SVM, and the mean of Y. The output provides a risk estimate for each as a measure of model performance, with lower risk estimates indicating better performance. The function also creates an ensemble with a weighted average of the individual models. We then calculated a predictive 10-fold cross-validation R 2 , using the ensemble to compare with the GAM models we developed for each exposure. To evaluate the performance of the individual models in the SuperLearner to the ensemble, we used the CV.SuperLearner function, which calculates the standard errors of each via a nested 10-fold cross validation. The risk estimates and standard errors were then plotted for comparison.

NBC and MBRC Characteristics
We successfully linked 779 of the 782 NBC births to their Massachusetts birth record data. When the birth record characteristics for the NBC subset were compared with the entire MBRC for the study area, we saw very similar proportions. In general, sociodemographic and behavioral characteristics collected in the NBC were similar to those available for the MBRC from the same time period (Table 1). In both data sets, mothers were on average 26 y of age, and the majority of mothers were non-Hispanic white and married at the time of the child's birth. The MBRC had a higher proportion of maternal ancestry from other than the Azores/Portugal or Cape Verde compared with the NBC variable (66% vs. 45%). Agreement between the two data sources for maternal ancestry was generally good (84%), and the higher proportion of missing data in the NBC may account for differences in the proportion of other ancestry.
Reported smoking during pregnancy was high in both the NBC (26%) and the MBRC (23%), and there was 90% agreement between the two data sources, but there were significant differences in reported alcohol consumption during pregnancy (18% in the NBC vs. 2% in the MBRC; 79% agreement). In addition, differences in maternal education levels were observed, with 16.8% of the NBC reporting less than a high school education versus 28.6% of the MBRC. This difference may be due to the amount of missing data for this variable in the NBC (7.3% vs. 0.5% in the linked Massachusetts birth data) given that the percentage of mothers with at least a high school education was similar between the NBC data and the linked Massachusetts birth data (76.0% and 76.4% respectively). With paternal education, 24.0% of the NBC compared with 33.1% of the MBRC reported less than a high school education. However, when using the linked Massachusetts birth data for paternal education among the NBC subset, the proportion with a high school education was similar to the larger MBRC.

NBC Exposure Models
The 10-fold cross-validated R 2 values for the RPCB 4 , DDE, HCB, Pb, and Hg exposure models were 0.54, 0.40, 0.34, 0.46, and 0.40, respectively (Tables 2 and 3). Tables 2 and 3 present the estimates of the percentage difference in biomarker levels of PCB, DDE, HCB, Pb, and Hg associated with categorical variables modeled parametrically. For all the exposure models except Pb, maternal ancestry was a significant predictor, with higher biomarker levels seen for mothers born in the Azores/Portugal. Compared with non-Hispanic whites, Hispanics and non-Hispanic women in other nonwhite racial groups had higher DDE levels. Conversely, non-Hispanic whites had higher Hg levels compared with other groups. For mothers who smoked during pregnancy, there was a significant increase in cord blood Pb compared with mothers who did not smoke, consistent with smoking's association with blood Pb in other populations (Deutch and Hansen 1999;Rodosthenous et al. 2017;Saoudi et al. 2018). Paternal education was a significant predictor of PCB levels, with a 14% increase [95% confidence interval (CI): 2%, 28%] for infants whose fathers had less than a high school level education. Cord blood Pb levels were 21% higher (95% CI: 4%, 40%) among infants whose mothers had not completed high school compared with those whose mothers had at least a high school education.
To assess the effect of continuous covariates modeled using smooths, exposures were predicted by varying one covariate at a time while holding the other variables constant at the median value (Table 4). Older mothers tended to have higher exposure estimates, whereas nonlinear effects in maternal age were observed for HCB and Hg, with women at the 75th percentile value of maternal age (30 y) having higher exposures than the oldest mothers (41 y of age). Although a change from the 25th to 75th percentile value for maternal age nearly doubled the prediction for DDE (0.232 to 0:422 ng=g serum), an interquartile change in year of infant birth had little effect on the predictions (0.302 to 0:304 ng=g serum). Nonlinear effects were also observed with birth weight. Cord blood Pb level predictions were higher for older homes compared with newer homes, and higher at the 25th than 75th percentile of distance to nearest road (471 m and 1:4 lg=dL compared with 1,620 m and 1:2 lg=dL). Varying location, represented by longitude and latitude, also resulted in different exposure predictions, with the highest exposures generally at lower latitudes (closer to the ocean, Figure 1).

Predicted Exposures for the MBRC
Biomarker levels predicted for the MBRC using the entire NBC (prior to bootstrapping) showed similar distributions as the NBC biomarkers ( Figure 2). Table 5 presents the mean and 95% PIs of the median and the 25th and 75th percentiles of the predicted biomarker concentrations and corresponding IQR across the entire population of the Massachusetts birth cohort using the bootstrap method. Based on these predicted biomarkers, it was possible to characterize highly exposed subpopulations. For example, women with ancestry from the Azores/Portugal had predicted median PCB exposure of 0:22 ng=g, whereas women with other ancestry had predicted median PCB exposure of 0:16 ng=g. In addition, women above the 90th percentile in PCBs were disproportionately older mothers with ancestry from the Azores/ Portugal. In addition, for each birth, we found the median, the 25th and 75th percentiles, and IQR difference across the 1,000 bootstrapped exposure estimates predicted for that birth and then averaged over all the births ( Table 6). The average individual IQRs (Table 6) are much smaller than the population IQRs (Table 5) as expected given that the characteristics for predicting exposures vary across the population but not at the individual level.

Exposure Model Performance Comparison
The 10-fold cross-validated R 2 values for the RPCB 4 , DDE, HCB, Pb, and Hg exposure models using the GAMs we developed (0.54, 0.40, 0.34, 0.46, and 0.40, respectively) were similar to the 10-fold cross-validated R 2 values using the ensemble output from the SuperLearner package in R (cross-validated R 2 values of 0. 54, 0.45, 0.33, 0.44, 0.32, respectively). The results of the nested SuperLearner cross validation showed that the GAM model performance was statistically similar to other modern modeling approaches, including lasso, randomForest, and the ensemble (see Figures S1-S5); the risk estimates and standard errors overlap. In building the GAM exposure models, modeling continuous variables using nonlinear smooths increased the pseudo R 2 compared with linear models (see Tables S1-S5).

Discussion
The exposure models built in this study explained variability across several different chemicals using predictors available from the MBRC linked to biomarkers measured in the NBC, allowing for both exposure predictions for epidemiological studies and identification of susceptible subgroups. Our previous research with the NBC indicated that the range of exposures modeled in the MBRC, as well as the differential exposures associated with key sociodemographic covariates, are sufficient to be associated with adverse child health outcomes in epidemiologic analyses (Sagiv et al. 2010(Sagiv et al. , 2012a(Sagiv et al. , 2012b. In considering susceptible subgroups, a significant predictor across all exposure models, with the exception of Pb, was maternal ancestry from the Azores/ Portugal. This could be due to distinctive dietary patterns, such as fish consumption (FAO 2011), or to differential early life exposures to bioaccumulative chemicals among mothers either born in these countries or exposed transgenerationally via the previous generation's exposures. The fact that maternal ancestry was a more consistent predictor than race/ethnicity emphasizes the potential importance of more granular demographic predictors when available. Location should also be considered in exposure models when available. We observed nonlinear associations with longitude and latitude, with births closer to the coastline, where industrial activities are located in this community, having higher exposures. Individuals living closer to the coastline may also be more likely to have higher seafood consumption, which is a known route of exposure for some of these contaminants. Inverse associations were generally observed in parous compared with nulliparous mothers, consistent with chemical elimination from breastfeeding or pregnancy (ATSDR 2004;Nickerson 2006). Although some of these associations do not reflect modifiable risk factors, they do allow for differential exposure assignment for epidemiology, and potential exploration of underlying exposure pathways for which intervention may be possible.
The strength of paternal (as well as maternal) educational attainment in predicting many biomarker levels suggests that this is also an indicator of household SES within this population, especially in the absence of individual-level or household-level measures of income. Inclusion of residential distance to the nearest major road and build year for home at birth as predictors of Pb reinforces previous findings that residential exposure pathways persist, especially in lower-income urban settings with an older housing stock. One limitation of the analysis was that paternal education and home build year were missing for a large proportion of the birth records. Although our exposure models used complete data only, future epidemiologic analyses that use the exposure predictions may consider multiple imputation methods to address the issue of missing data.
Another limitation of the analysis is that variables in the NBC such as diet that could have increased the variance explained by the exposure models were not available in the MBRC. Diet, including fish and organ meat consumption, are important pre-dictors for many of the exposures we targeted for study (Choi et al. 2006). Although maternal ancestry may have served as an indicator of dietary habits, having additional food intake information may have improved our models and increased their ability to ascertain routes of exposure. That said, administrative data sets rarely have dietary and other behavioral information, so our ability to explain significant variance in the absence of these data is notable.
Last, our exposure regression models may not be generalizable to other geographic areas. Although residential distance to the harbor (a term specific to the New Bedford area) was not associated with umbilical cord serum PCB levels in a previous analysis (Choi et al. 2006), we included it in the exposure models (except for Pb) to capture any residual sociodemographic effects based on results from a spatial study that showed associations with proximity to the harbor (Vieira et al. 2017). Other sociodemographic variables might have different meanings in different settings. That said, the approach is generalizable to any setting where biomarker measurements and sociodemographic information are available for a contemporaneous subset of study area participants.
We used GAMs to build our exposure models but acknowledge that other models with comparable flexible modeling features may have performed similarly. We were able to increase the pseudo R 2 value of each model, compared with linear models, by smoothing continuous variables. Using the SuperLearner package in R, we showed that the GAM performed as well as other modeling techniques, including lasso and randomForest, and that the 10-fold cross-validated R 2 values for the GAM exposure models we developed were similar to the ensemble results from the SuperLearner. Moreover, GAMs have been applied previously to NBC data (Vieira et al. 2017) and can yield valuable insight regarding functional forms that facilitate model interpretability. Our cross-validated R 2 values showed comparable performance to exposure regression models used in other successful epidemiologic studies, including simulation studies that illustrated the robustness of exposure-outcome associations (Avanasi et al. 2016;Baxter et al. 2010;Shin et al. 2011).
Although there are several approaches for building exposure models, we used the model that resulted in the highest R 2 value to decrease the bias in our estimates. Using this method would suggest putting all of the variables into the model, given that increasing variables increases the R 2 , but we allowed our sample size to change due to missingness from each variable to determine which variables in the model would give the largest pseudo R 2 value. Although using a liberal variable selection method decreased the bias in our estimates and is generally recommended for model selection when a subset of the data is available (Rubin 1996), it contributed to uncertainty in our exposure predictions. The bootstrap method accounted for sampling variability without assuming a parametric distribution, as reflected by the 95% PIs shown in Table 5. Those intervals are reasonably narrow, indicating that the central tendency and IQR of predicted exposure are fairly stable. Our approach does not account for other sources of uncertainty. For example, our model was not able to predict all the variation in observed biomarker concentrations; unexplained residual variation contributes to the overall uncertainty in biomarker predictions but is not assessed by ordinary bootstrap estimation methods. Future work might include generating parametric or nonparametric bootstrap prediction distributions for more comprehensive evaluation of the effects of individual exposure uncertainty in epidemiological analyses.
In summary, our study showed how multiple prenatal chemical exposures can be estimated in a general population administrative data set by modeling available measured biomarker data Table 6. Mean of the median, the 25th and 75th percentiles of the predicted biomarker concentrations, and interquartile range (IQR) across the 1,000 bootstrapped results for each individual birth in the MBRC (n = 10,270).

Biomarker
Median 25th  for a subset of the population using GAMs and key sociodemographic and behavioral information. Our models can be used to predict exposure biomarker concentrations in future epidemiological investigations of health outcomes. In addition, they yielded valuable insights about residents in a community near a Superfund site who may have been at increased risk of exposure to a range of chemicals.