Preliminary analysis indicates that measurement data are consistent with a log-normal distribution. If Z denotes the measured value of an analyte and is log-normally distributed, denoted Z ~ LN(μ, σ2), then by definition log(Z) is a normal random variable with mean μ and variance σ2, denoted log(Z) ~ N(μ, σ2) (Singh et al. 1997). Suppose X = (X0, …, XK)t is a column vector of covariates, with X0 = 1, and β = (β0, …, βK)t is a column vector of regression parameters, where t denotes vector transpose. If data are complete, then a linear regression equation has the form log(Z) = βtX + ε, where ε ~ N(0, σ2). For each X, the model implies that Z is log-normally distributed with log-scale mean βtX; that is, Z ~ LN(βtX, σ2).
Regression analysis in control data.
We evaluate the association between analyte concentration and pesticide use by fitting a linear regression model of the logarithm of the analyte level on subject characteristics. Regression (independent) covariates include indicator variables for season of sample collection, presence of oriental rugs, study center, sex, age (< 45, 45–64, ≥ 65 years), race (African American, Caucasian, other), type of home (single family, townhouse/duplex/apartment, other), year of home construction (< 1940, 1940–1959, 1960–1979, ≥ 1980), and educational level (< 12, 12–15, ≥ 16 years). As in Colt et al. (2004)
, covariates vary slightly with analyte. Models also include five variables describing the use of insect treatment products: ever/never used products to treat for crawling insects, flying insects, fleas/ticks, termites, and lawn/garden insects. We use data from current homes only.
Regression analysis is hampered by the presence of measurements known only within bounds. We assume that the probability distributions of measurements below the DL (more precisely, within the LB and UB interval) depend only on observed data; that is, the interval-measured concentrations arise from the same distributions that generate the measured values. Let F(•) be the cumulative distribution function and f(•) the probability density function for a log-normal random variable. Suppose Xi is the covariate vector for the ith of i = 1, …, n study subjects. Interval measurements (LBi, UBi) are recorded for i = 1, …, n0 individuals, whereas a specific Zi measurement is recorded for i = n0 + 1, …, n individuals. LB and UB are subscripted to allow different DLs. Using a Tobit regression approach (Gilbert 1987; Persson and Rootzen 1977; Tobin 1958), the log-likelihood function has the form

log L(β, σ2) = Σi=1…n0 log[F(UBi; βtXi, σ2) − F(LBi; βtXi, σ2)] + Σi=n0+1…n log f(Zi; βtXi, σ2). [1]
The first summand derives from the n0 interval-measured values and involves the difference of the cumulative distribution function F evaluated at UB and at LB; that is, the probability that the measurement lies between the LB and UB. The second summand derives from the n1 = n − n0 detected values. Maximum likelihood estimates (MLEs) for β and their covariance matrix are obtained by maximizing Equation 1 and computing the inverse information matrix using standard methods.
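The Tobit log-likelihood of Equation 1 can be sketched in Python as follows. The function name and argument layout are ours, and scipy's normal distribution is applied on the log scale, using the fact that the log-normal CDF is F(z) = Φ((log z − μ)/σ) and its density is φ((log z − μ)/σ)/(zσ).

```python
import numpy as np
from scipy.stats import norm

def tobit_loglik(beta, sigma, X_int, lb, ub, X_det, z_det):
    """Sketch of Equation 1: interval-censored observations contribute
    log of the probability mass between LB and UB; detected values
    contribute the log of the log-normal density."""
    mu_int = X_int @ beta   # log-scale means for interval-measured subjects
    mu_det = X_det @ beta   # log-scale means for detected subjects
    # First summand: P(LB_i < Z_i < UB_i), evaluated on the log scale
    p = (norm.cdf((np.log(ub) - mu_int) / sigma)
         - norm.cdf((np.log(lb) - mu_int) / sigma))
    ll_interval = np.sum(np.log(p))
    # Second summand: log-normal log-density phi((log z - mu)/sigma)/(z*sigma)
    ll_detected = np.sum(norm.logpdf((np.log(z_det) - mu_det) / sigma)
                         - np.log(z_det * sigma))
    return ll_interval + ll_detected
```

Maximizing this function over (β, σ) with a general-purpose optimizer would reproduce the direct-estimation approach; in the article the same maximization is carried out with SAS PROC LIFEREG.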
Imputation of missing concentrations.
If the goal is to evaluate the association between pesticide use and analyte levels in carpet dust, represented by the β parameters, then the Tobit regression of Equation 1 is sufficient and no imputation is required. For further analysis or for graphical display, however, it is useful to generate values for measurements below DLs. We consider several different approaches, including inserting DL/2, inserting E[Z | Z < DL], or using a single or multiple imputation (Little and Rubin 1987).
A multiple imputation procedure is carried out as follows. Using all data (measured concentrations, missing data types I–III, and covariates), we create the log-likelihood function of Equation 1, solve for the MLEs of β and σ2 (denoted β̂ and σ̂2), and impute a value by randomly sampling from a log-normal distribution with the estimated parameters. However, in selecting fill-in values we cannot ignore that β̂ and σ̂2 are themselves estimates with uncertainties. We therefore do not use β̂ and σ̂2
for the imputation, but rather β̃ and σ̃2
, which are estimated from a bootstrap sample of the data (Efron 1979
). Bootstrap data are generated as described below by sampling with replacement, and represent a sample from the same universe as the original data. We repeat the process to create multiple data sets, which are then independently analyzed and combined in a way that accounts for the imputation. Differences in regression results in the multiple data sets reflect variability due to the imputation process.
This procedure, however, omits a source of variability. We have tacitly assumed that the LB and UB are fixed and known in advance. When there are no interfering compounds (missing type I), the assumption is justified because the DL is determined before the GC/MS dust analysis. When there are interfering compounds (missing types II and III), the assumption cannot be fully justified because the bounds depend on the amount of interference and therefore are random. In the NHL data, we assume this uncertainty is small relative to other uncertainties. The imputation proceeds as follows:
Step 1: Create a bootstrap sample and obtain estimates β̃ and σ̃2 based on Equation 2. Bootstrap data are generated by sampling with replacement n times from the n subjects. Sampling "with replacement" selects one record at random and then "puts it back" and selects a second record. After n repetitions, some subjects are selected multiple times, whereas other subjects are not selected at all. If wi is the number of times the ith subject is sampled, then the log-likelihood function for the bootstrap data is

log L(β, σ2) = Σi=1…n0 wi log[F(UBi; βtXi, σ2) − F(LBi; βtXi, σ2)] + Σi=n0+1…n wi log f(Zi; βtXi, σ2). [2]
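The resampling in step 1 can be sketched as follows (the function name is ours): counting how often each subject is drawn yields the weights wi that multiply each subject's term in the bootstrap log-likelihood.

```python
import numpy as np

def bootstrap_weights(n, rng):
    """Sample n subject indices with replacement; w[i] counts how many
    times subject i was drawn, i.e., its weight w_i in the bootstrap
    log-likelihood (subjects never drawn get weight 0)."""
    idx = rng.integers(0, n, size=n)
    return np.bincount(idx, minlength=n)
```

Each bootstrap replicate then maximizes the weighted log-likelihood with these wi, yielding one draw of β̃ and σ̃2.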
Step 2: Impute analyte values based on sampling from LN(β̃tX, σ̃2). For the ith subject with an interval measurement, assign the value

Zi = F−1(Ui; β̃tXi, σ̃2), where Ui ~ Unif[F(LBi; β̃tXi, σ̃2), F(UBi; β̃tXi, σ̃2)].

This quantity consists of various elements. F(LBi; β̃tXi, σ̃2) and F(UBi; β̃tXi, σ̃2) are the cumulative probabilities at LBi and UBi, respectively, based on the parameters β̃ and σ̃2; both values lie between zero and one. Ui is selected randomly from a uniform distribution on the interval [a, b], denoted Unif[a, b], in particular the interval [F(LBi; β̃tXi, σ̃2), F(UBi; β̃tXi, σ̃2)]. The inverse cumulative distribution function, F−1(•), then yields the required imputed value in original units between LBi and UBi. Repeat using the same β̃ and σ̃2 for each missing value. Detected values are not altered.
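The draw in step 2 can be sketched as follows (the function name is ours). scipy parameterizes the log-normal by s = σ and scale = exp(μ), and by construction the imputed value falls between the bounds.

```python
import numpy as np
from scipy.stats import lognorm

def impute_interval(lb, ub, mu, sigma, rng):
    """Step 2 sketch: draw U ~ Unif[F(LB), F(UB)] under LN(mu, sigma^2),
    then invert the CDF; the result lies between LB and UB."""
    dist = lognorm(s=sigma, scale=np.exp(mu))  # log-normal with log-mean mu
    u = rng.uniform(dist.cdf(lb), dist.cdf(ub))
    return dist.ppf(u)   # inverse CDF returns original (unlogged) units
```

Because U is confined to [F(LBi), F(UBi)], every fill-in value respects the measurement bounds, unlike a naive unconstrained draw from LN(β̃tXi, σ̃2).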
Step 3: Repeat steps 1 and 2 to create M
plausible (or “fill-in”) data sets. Remarkably, M
need not be large, and a recommended value is between 3 and 5, with larger values if greater proportions of data are missing (Little and Rubin 1987
; Rubin 1987
). We select M
= 10 to fully account for the variance from the imputation.
Step 4: Fit a regression model to each of the M
data sets and obtain M
sets of parameter estimates and covariance matrices. Combine the M
sets of estimates to account for the imputation (Little and Rubin 1987
; Schafer 1997
). The imputation procedure results in confidence intervals (CIs) that are wider than those from the single-imputation fill-in approach.
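For a single scalar parameter, the combining rule of step 4 (Little and Rubin 1987) can be sketched as follows; the function name is ours.

```python
import numpy as np

def combine_mi(estimates, variances):
    """Rubin's rules sketch: pool M point estimates and their
    within-imputation variances into a combined estimate and total
    variance (within + between, with the (1 + 1/M) correction)."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    M = len(est)
    qbar = est.mean()           # pooled point estimate
    W = var.mean()              # average within-imputation variance
    B = est.var(ddof=1)         # between-imputation variance
    T = W + (1 + 1 / M) * B     # total variance
    return qbar, T
```

The between-imputation term B is what widens the CIs relative to the single-imputation fill-in approach: it captures the extra spread in estimates across the M plausible data sets.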
We conducted a simulation study, using a simple regression model with zero intercept and no covariates, to evaluate the imputation approaches, the effects of the proportion of data below the DL, and sample size. We generated data sets of size n by sampling from a log-normal distribution with parameters (μ, σ2), and defined the DL such that in expectation a proportion p of the samples falls below it; that is, DL = F−1(p; μ, σ2). The simulation involves 5,000 independent data sets for each set of parameters. We compared five approaches: a) direct estimation (Tobit regression) of MLEs (μ̂ and σ̂2) using Equation 1; b) multiple imputation with allowance for uncertainty in model parameters; c) single imputation based on a random fill-in value for each datum below the DL, using MLEs (μ̂ and σ̂2) from Equation 1; d) insertion of DL/2 for all data below the DL; and e) insertion of E[Z | Z < DL] for data below the DL, with the expected value based on the MLEs (μ̂ and σ̂2) from Equation 1. For approaches b) through e), estimators are the mean and variance of the logarithm of the observed and imputed data, with adjustment for multiple imputation in b). We compare results with estimates based on complete data.
For the NHL example, we use SAS (SAS System for Windows, version 8.2; SAS Institute Inc., Cary, NC) to generate bootstrap samples, fit linear regressions (PROC REG), solve log-likelihood Equations 1 and 2 (PROC LIFEREG), and combine results from multiple data sets (PROC MIANALYZE). The simulation was conducted using MATLAB (version 7.0; MathWorks Inc., Natick, MA).