Some useful statistical methods for model validation.

Although formal hypothesis tests provide a convenient framework for displaying the statistical results of empirical comparisons, standard tests should not be used without consideration of underlying measurement error structure. As part of the validation process, predictions of individual blood lead concentrations from models with site-specific input parameters are often compared with blood lead concentrations measured in field studies that also report lead concentrations in environmental media (soil, dust, water, paint) as surrogates for exposure. Measurements of these environmental media are subject to several sources of variability, including temporal and spatial sampling, sample preparation and chemical analysis, and data entry or recording. Adjustments for measurement error must be made before statistical tests can be used to empirically compare environmental data with model predictions. This report illustrates the effect of measurement error correction using a real dataset of child blood lead concentrations for an undisclosed midwestern community. We illustrate both the apparent failure of some standard regression tests and the success of adjustment of such tests for measurement error using the SIMEX (simulation-extrapolation) procedure. This procedure adds simulated measurement error to model predictions and then subtracts the total measurement error, analogous to the method of standard additions used by analytical chemists.

empirical comparisons. As part of the process of validation, this statistical procedure strengthens the confidence in the model. For the integrated exposure uptake biokinetic (IEUBK) model, the environmental exposure media are air, water, diet, residential yard soil, and residential dust including multiple source contributions from paint, school or day care, and secondary residences. Measurements of exposure to any of these media are likely to be inaccurate estimates of actual exposure because of such factors as analytical error, repeat sampling variability, and location variability, and are not likely to completely characterize a child's actual long-term lead intake from that medium at that particular point in time. This measurement error is likely to be large enough to substantially attenuate the estimated relationship between observed blood lead and blood lead that is predicted from the model using the noisy input variables associated with that child's exposure. Similar effects are likely to occur in all modeling efforts, including the linear slope factor models that have been developed for long-term adult lead exposure by Bowers et al. (1) and by the U.S. Environmental Protection Agency (U.S. EPA) (2). In the worst case, measurement error may completely obscure the relationship between observed and predicted blood lead. Off-the-shelf statistical remedies for the problem of measurement error correction are not readily available. For the simple regression comparisons, the simulation and extrapolation (SIMEX) method proposed by Carroll et al. (3) may be adequate to estimate the true parameters relating observed and predicted values.
When empirical data are used to evaluate the model, there are several conventional statistical tests that can be applied to test the null hypothesis that the model output is wrong. These involve showing that some form of the predicted value does not equal the same form of the observed value (e.g., typical predicted ≠ typical observed). The simplest empirical comparison is a regression of observed values on predicted values (the usual variables are blood lead concentration, the logarithm of blood lead, or the exceedance of blood lead over a health-based level of concern). If the usual assumptions of normal residuals and linearity are satisfied, we would test slope = 1, intercept = 0. But even for a well-calibrated model evaluated against an independent dataset, the typical result is a slope less than 1 and an intercept greater than 0, even when the observed and predicted means are equal. The most plausible explanation, in our opinion, is that the data that are generally available as input for such models are not concurrent measurements of lead concentrations or loadings in environmental media in the residential or occupational setting in which that individual subject is believed to receive the exposure measured as blood lead.
If the purpose of the regression comparison is a formal test of the hypothesis (slope = 1, mean [observed] = mean [predicted]), then the distributional properties (normal, log-normal, etc.) of the adjusted estimates are critical in making accurate inferences. This may be even more critical in the logistic regression version of the test, comparing predicted and observed incidence of elevated blood leads, which also requires additional model assumptions about the intrinsic inter- and intraindividual variability of blood lead.
Measurement errors include sampling and analytical biases, instrument reading and recording errors, and temporal and spatial sample collection discrepancies.
These errors may show diverse forms with diverse consequences, but in general are likely to introduce distortions of the predicted values without regard to the form of the predictive model. On the other side, the outcome measures against which the predictions are to be compared, in this case blood lead concentration, are also subject to measurement errors, including sampling and subject selection biases.

Limitations of Statistical Hypothesis Tests
Because of our concerns about the validity of formal statistical tests in the face of measurement error of unknown attributes, the validation strategy document for the IEUBK model (4) recommends that formal pass-fail statistical tests not be applied in empirical comparisons. The role of hypothesis testing in scientific inference has been hotly debated since its earliest uses, and remains a controversial subject for both statisticians and the subject-matter scientists who use statistical methods. However, if caution is used, formal hypothesis-testing methods for predictive models may be extremely helpful in diagnostic studies that estimate the range of conditions beyond which one might encounter some model inadequacies, or circumstances in which supplementary information needs to be collected. We have elaborated on five major areas of concern.
Observational data may not (11), and Mushak (12) in this monograph provide many examples of the assessment and interpretation of several modifying factors in evaluation of lead models.
Classical statistical tests mistakenly assume that the predicted values, prediction intervals, and classification of elevated blood lead concentrations based on the model are statistically accurate predictors of what they purport to predict. Both systematic and random errors in epidemiologic studies influence the accuracy of the predictors. Random errors may occur when single, individual samples do not take into account temporal variability; when spatial samples of yard soil and house dust do not represent the actual play areas and contact surfaces of the child; when exposure has been modified by environmental factors such as groundcover or dust loading; or by behavioral factors such as housecleaning practices, choice of play area, hand-washing frequency, and mouthing of nonfood objects. Random errors of sample collection and processing may also occur when the sample is contaminated by other environmental media, when the sample is modified during transit and storage, or when the sample data are misrecorded.
Systematic errors might occur if instruments are miscalibrated, measurements are taken at inappropriate locations or seasons, no measurements or estimates are made of nonresidential exposure, or the subjects are not representative of the same sociodemographic or ethnic groups as the model.

Statistical Tests of Hypotheses That Evaluate Predicted Values
We will illustrate a few of the potential problems in applying formal statistical goodness-of-fit tests without considering the possible effects of measurement error. Does the log of the observed value minus the log of the model-predicted value equal zero (in which case the predictions are relatively unbiased)?
To establish a general notation for these tests, the following algebraic notation is used:
$d = Y - M$, [1]
where d = prediction error, Y = observed value (e.g., blood lead or log blood lead), and M = modeled value analogous to Y (M is a model prediction not necessarily derived from an optimized fit of observed values).
Of the many possible statistical hypotheses of model adequacy that can be tested using a set of paired values of an observation Y and its model prediction M, five are listed below as null hypotheses H0(1) to H0(5). Figure 1 illustrates the graphical interpretation of these five statistical hypotheses, plus two additional hypotheses described in the next section.
H0(1): mean d = 0
Hypothesis H0(1) expresses a common concept: Although some differences between observed values and predicted values are expected, there should be no difference between the mean observation and the mean prediction from a good model. Mean could be replaced by some other measure of typical value, such as the median or geometric mean, if prediction errors (d) have an asymmetric or heavy-tailed distribution.
H0(2): mean Y = mean M
Or, if Y = log blood lead, then H0(2): geometric mean Y = geometric mean M
Hypothesis H0(2) is more general than H0(1) and addresses a common situation in epidemiologic studies in which data sets may have some records in which the data are not paired. That is, the environmental measurements for calculating a value of M are available, but not a corresponding blood lead observation Y; or conversely, Y is available, but there is not enough environmental data to calculate a corresponding value of M. If the missing values are missing completely at random and the existing data are representative of the missing data, then the observed and modeled
Figure 1. (A) illustrates H0(1), where the difference between observed and predicted blood lead concentrations is expected to equal zero. (B) illustrates H0(2); the observed mean equals the predicted mean. (C) expands H0(1) to show that, according to H0(3), the slope of observed vs. predicted should equal one and pass through the origin. (D,E) represent H0(4) and H0(5), which are two different logarithmic transformations of H0(3), and test the condition that the slope = 1 and intercept = 0. These are the two tests that are used in this paper for the SIMEX procedure for measurement error adjustments. (F,G) illustrate two additional hypotheses that evaluate intervals and ranges, both of which are important for risk assessment. Hypothesis H0(6) tests the accuracy of predicting a specific upper tail interval (e.g., 95%), and hypothesis H0(7) tests the validity of the prediction of the number of children with blood lead concentrations exceeding 10 µg/dL. Correctly predicted blood lead concentrations are in quadrants I and III; incorrectly predicted are in II
Step 1: Fit a straight line to paired Y and M values by ordinary least squares, producing an estimated intercept A, a slope estimate B, and an estimated residual standard deviation S.
Step 2: Calculate the mean paired difference $\bar{d}$ and the variance of the predicted values $S_M^2$.
Step 3: Reject H0(3) and conclude that predicted values are not close enough to the observed values if
$N(\bar{d} - 0)^2 + (N-1)\,S_M^2\,(B-1)^2 > 2\,S^2\,F_{2,N-2}$, [2]
where $F_{2,N-2}$ is the appropriate upper tail percentile of Fisher's F distribution with 2 degrees of freedom for the numerator and N-2 for the denominator.
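Steps 1 through 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `joint_test`, the synthetic data, and the seed are assumptions made here, and the test statistic is written in the usual least-squares form in which the quadratic term is divided by twice the residual variance S^2. Because the numerator has 2 degrees of freedom, the upper tail of the F distribution has a closed form, so no statistical library is needed.

```python
import math
import random


def mean(values):
    return math.fsum(values) / len(values)


def joint_test(y, m):
    """Test intercept = 0 and slope = 1 jointly for paired (M, Y) values.

    Returns (A, B, F, p), where F is referred to an F(2, N-2) distribution.
    """
    n = len(y)
    mean_y, mean_m = mean(y), mean(m)
    sxx = math.fsum((mi - mean_m) ** 2 for mi in m)
    sxy = math.fsum((mi - mean_m) * (yi - mean_y) for mi, yi in zip(m, y))
    b = sxy / sxx                              # Step 1: slope estimate B
    a = mean_y - b * mean_m                    # Step 1: intercept estimate A
    rss = math.fsum((yi - a - b * mi) ** 2 for mi, yi in zip(m, y))
    s2 = rss / (n - 2)                         # Step 1: residual variance S^2
    d_bar = mean_y - mean_m                    # Step 2: mean paired difference
    var_m = sxx / (n - 1)                      # Step 2: variance of predictions
    quad = n * d_bar ** 2 + (n - 1) * var_m * (b - 1.0) ** 2
    f_stat = quad / (2.0 * s2)                 # Step 3: compare with F(2, N-2)
    # With 2 numerator degrees of freedom the upper tail is closed-form.
    p_value = (1.0 + 2.0 * f_stat / (n - 2)) ** (-(n - 2) / 2.0)
    return a, b, f_stat, p_value


# Hypothetical log-scale data: equal means but an attenuated slope of 0.4.
rng = random.Random(1)
m_demo = [rng.gauss(1.5, 0.5) for _ in range(200)]
y_demo = [0.9 + 0.4 * mi + rng.gauss(0.0, 0.3) for mi in m_demo]
a_hat, b_hat, f_stat, p_value = joint_test(y_demo, m_demo)
print(f"A={a_hat:.2f} B={b_hat:.2f} F={f_stat:.1f} p={p_value:.2g}")
```

The synthetic data are built so that the mean difference is close to zero while the slope is far from 1, which is the pattern the text describes as typical: H0(1) would not be rejected, but the joint test rejects H0(3).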
However, if the standard deviation of the error is not constant, but is proportional to the predicted value M as sometimes happens, then it would be preferable to mathematically stabilize the variance by fitting the linear model to a logarithmic transformation on both sides, as shown in H0(5). Note that the logarithmic transformation of the linear relationship between Y and M in hypothesis H0(5) is not the same as the linear relationship in logarithms shown in H0(4). In H0(5), the hypothesis is that the relationship between Y and M is linear at all values of M but will not pass through the origin (Y = 0 when M = 0) unless A = 0. In H0(4), the hypothesis is that the relationship between Y and M always passes through the origin but is linear only if L = 1. In this respect, all three hypotheses, H0(3) through H0(5), test different sets of assumptions.
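For reference, the two logarithmic forms contrasted above can be written side by side. This is a restatement under assumptions taken from the parameter names used in the text: K and L are read as the intercept and slope of H0(4), A and B as those of H0(5), and $\varepsilon$ (a symbol introduced here) is a residual on the log scale.

```latex
\begin{align*}
H_0(4):\quad & \log Y = K + L \log M + \varepsilon,
  && \text{test } K = 0,\ L = 1,\\
H_0(5):\quad & \log Y = \log(A + B M) + \varepsilon,
  && \text{test } A = 0,\ B = 1.
\end{align*}
```

On the original scale the first form is $Y = e^{K} M^{L} e^{\varepsilon}$, which always gives Y = 0 at M = 0 but is a straight line only when L = 1. The second is $Y = (A + BM)\,e^{\varepsilon}$, which is a straight line for every M but passes through the origin only when A = 0.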
As with any regression analysis, residuals should be carefully examined for outliers, curvilinearity, and trends. Nonlinear parametric models that include linearity as a special case can also be used to diagnose curvilinearity. Analysis of variance tests for curvilinearity require multiple observations with the same or similar predicted values.

Two Hypotheses That Evaluate Intervals or Ranges
Many statistical procedures exist for the interesting situation in which the hypotheses tested involve ranges or intervals of values. Two are presented here as descriptive null hypotheses:
H0(6): The model produces accurate prediction intervals (percent Q; 90% or 95%, for example) for blood lead.
H0(7): The model correctly predicts the number of lead-poisoned children.
These must be translated into testable statistical hypotheses with an assumption about the distribution of blood lead. The model prediction M may be thought of as a point estimate of blood lead. The prediction interval defines a range of blood lead concentrations within which the specified percentage Q of the lead-exposed individuals with predicted blood lead M are expected. For the IEUBK model, the Q percent prediction interval is defined by the lines from $M e^{-z \log(\mathrm{GSD})}$ to $M e^{z \log(\mathrm{GSD})}$.
A formal statistical hypothesis test for H0(6) might be based on a multinomial contingency table assessment that the intended Q percent prediction interval contains about Q percent (e.g., 90%) of the observed blood leads. The total number of observations (N) is divided into three groups: L is the number of observations below the lower line, I is the number of observations between the prediction interval lines, and H is the number of observations above the upper prediction interval line. Therefore,
$N = L + I + H$. [3]
The null hypothesis would then have the form
$E\{L\} = E\{H\} = N\,(100 - Q)/200$. [4]
The statistical translation of H0(7) is more difficult. A useful tabular framework is shown in Table 1. The blood lead level of concern (LOC) is defined by criteria described by the Centers for Disease Control and Prevention (13). Elevated blood lead means any blood lead concentration that is at least as large as the LOC. Table 1 uses the following definitions:
A = number of children with both observed and predicted blood lead less than the LOC;
B = number of children with elevated observed blood lead and predicted blood lead less than the LOC;
C = number of children with observed blood lead less than the LOC who are predicted to have blood lead equal to or greater than the LOC;
D = number of children with both observed and predicted blood lead equal to or greater than the LOC.
Many appropriate figures of merit can be calculated from this table. A and D are accurate classifications that we wish to maximize; B and C are inaccurate classifications that we wish to minimize. In classical epidemiology terms (14), sensitivity is the proportion of children with elevated blood lead that will be classified correctly by the prediction model (Equation 5), and specificity is the proportion of children without elevated blood lead who are correctly classified by the prediction model (Equation 6). Many investigators are also concerned about the false positive rate (proportion of children classified as likely to have elevated blood lead who are observed to have nonelevated blood lead), denoted FPR, and the false negative rate (proportion of children classified as likely to have nonelevated blood lead who are observed to have elevated blood lead), denoted FNR. These can be calculated from Table 1:
$\text{sensitivity} = D/(B + D)$, [5]
$\text{specificity} = A/(A + C)$, [6]
$\mathrm{FPR} = C/(C + D)$, [7]
$\mathrm{FNR} = B/(A + B)$. [8]
There is clearly a trade-off among these criteria, which can be optimized by combining them into a single index or criterion based on, for example, the costs of incorrect decisions (B or C) versus correct decisions (A or D). It is likely that many public health investigators would prefer to
Environmental Health Perspectives * Vol 106, Supplement 6 * December 1998
(15). Formulating these hypotheses suggests useful ways to display data for empirical comparisons. The next section will demonstrate why we recommend that great care should be used when actually performing any of these tests. Our concerns are not merely hypothetical. Neither the blood lead data used for comparisons nor the input data used in model predictions can be assumed to be without blemish. Errors in data used as model input can seriously distort formal statistical tests that may be used in model evaluation. Carroll et al. (3) provide a much more detailed discussion of these effects.
Summarized briefly, noisy input data can distort the empirical comparison in either direction, but usually in the direction of attenuating the apparent predictiveness of the model. They describe a statistical methodology for removing some of the distortion. Stepping through the problem systematically, we note that differences between observed and predicted values generally have greater variability than variability in the observed values alone, due to input errors in the predicted values. The model propagates this uncertainty about input values into uncertainty about the model output. The effects may be characterized mathematically (see Equation 1):
d = observed value - noisy modeled value,
which can be expanded to
d = (observed value - true modeled value) + (true modeled value - noisy modeled value).
In general, the second term may be expected to add both random and systematic biases to an empirical comparison structured like H0(1) or H0(2). The consequences are more serious for regression-structured evaluations such as H0(3) through H0(5). Several authors (3,9,16) demonstrated that the simple regression of observed on predicted value with noisy model predictions caused by input errors will attenuate the slope (B or L) of the linear regression of the observed values Y on the corresponding predictions M. Both the regression and the correlation coefficients assume values closer to zero than the true values, and there would be a corresponding change in the intercept terms (A or K). The usual case is that the slope estimate B or L decreases to a value less than 1, with a corresponding intercept A or K greater than zero. This may imply that the model does not adequately predict the observations. Furthermore, regression tests on data with larger variability and larger standard errors for parameters may produce lower significance in hypothesis tests. Tests of a linear versus nonlinear relationship between Y and M may also be distorted, usually toward a more linear relationship than really exists.
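The attenuation described above is easy to reproduce by simulation. The sketch below is illustrative only: the standard deviations, sample size, and seed are hypothetical values chosen here, all quantities are on the log scale, and the measurement error is assumed to be additive and independent of the true prediction. Under those assumptions the slope is expected to shrink by the classical factor sigma_p^2 / (sigma_p^2 + sigma_m^2).

```python
import math
import random


def ols_slope_intercept(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    mx = math.fsum(x) / len(x)
    my = math.fsum(y) / len(y)
    sxx = math.fsum((xi - mx) ** 2 for xi in x)
    sxy = math.fsum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


rng = random.Random(42)
n = 2000
# Hypothetical log-scale standard deviations: true predictor, measurement
# error in the predictor, and residual scatter of observed blood lead.
sigma_p, sigma_m, sigma_e = 0.5, 0.33, 0.3

true_pred = [rng.gauss(1.5, sigma_p) for _ in range(n)]        # error-free prediction
observed = [t + rng.gauss(0.0, sigma_e) for t in true_pred]    # observed blood lead
noisy_pred = [t + rng.gauss(0.0, sigma_m) for t in true_pred]  # prediction from noisy inputs

slope_true, icpt_true = ols_slope_intercept(true_pred, observed)
slope_noisy, icpt_noisy = ols_slope_intercept(noisy_pred, observed)
expected_factor = sigma_p ** 2 / (sigma_p ** 2 + sigma_m ** 2)

print(f"error-free predictor: slope={slope_true:.2f} intercept={icpt_true:.2f}")
print(f"noisy predictor:      slope={slope_noisy:.2f} intercept={icpt_noisy:.2f}")
print(f"expected attenuation factor: {expected_factor:.2f}")
```

The model generating the data is exactly right in this simulation (true slope 1, true intercept 0), yet the regression on the noisy predictions shows a slope well below 1 and a positive intercept.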
Similar effects occur when regression comparisons are made of logistic (binary) and categorical (grouped) data. Such tests are usually performed when numeric differences of observed and predicted values are replaced by indicator variables such as coding 1 for inside and 0 for outside the prediction intervals in testing Ho(6). Likewise, Ho(7) might be evaluated by coding observed blood lead less than the LOC as 0 and elevated blood lead as 1, and regressing these on predicted risk or logits for elevated blood lead for each subject. Predictor measurement error will distort these comparisons.
Finally, even the contingency table formulations of Ho(6) or Ho(7) shown in Table 2 are likely to be biased because the predicted values will be misclassified into the wrong category.
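The figures of merit for H0(7) discussed earlier in this section follow directly from the four cell counts of the cross-classification. The sketch below is an illustration rather than the authors' procedure. It assumes that A and D are the concordant cells, that B counts children observed at or above the LOC but predicted below it, and that C counts children observed below the LOC but predicted at or above it. The function names and the example counts are hypothetical.

```python
def cross_classify(observed, predicted, loc=10.0):
    """Return the cell counts (A, B, C, D) for paired blood lead values."""
    a = b = c = d = 0
    for y, m in zip(observed, predicted):
        if y < loc and m < loc:
            a += 1          # correctly predicted below the LOC
        elif y >= loc and m < loc:
            b += 1          # elevated, but predicted below the LOC
        elif y < loc and m >= loc:
            c += 1          # not elevated, but predicted elevated
        else:
            d += 1          # correctly predicted elevated
    return a, b, c, d


def classification_metrics(a, b, c, d):
    """Sensitivity, specificity, FPR, and FNR as defined in the text."""
    return {
        "sensitivity": d / (b + d),   # elevated children classified correctly
        "specificity": a / (a + c),   # nonelevated children classified correctly
        "FPR": c / (c + d),           # predicted elevated, observed nonelevated
        "FNR": b / (a + b),           # predicted nonelevated, observed elevated
    }


# Hypothetical counts, not taken from any table in this paper.
metrics = classification_metrics(a=200, b=20, c=25, d=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Misclassification of predicted values caused by noisy inputs moves children between the rows of this table, which is why all four indices are affected by measurement error.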

A Numerical Example: IEUBK Model Comparisons
We use the dataset that was evaluated by Hogan et al. (11) to focus the reader's attention on measurement error correction and other theoretical aspects of the methodological issues of comparing model predictions with observed blood lead data, not on a particular model or a particular epidemiology study. The dataset contains a large number of observations from a cross-sectional epidemiology study, with particular emphasis on children less than 6 years of age. We demonstrate several tests of the IEUBK model. The tests would be equally appropriate for any other predictive child blood lead model with similar input data. The IEUBK model is intended to describe the distribution of blood lead concentrations expected when all sources of the child's environmental lead exposure have been identified. The data, however, only contain information about the child's residential lead exposure. Therefore, for the purposes of demonstrating some of the statistical evaluation methods described in the preceding section, we use some ancillary information (i.e., the number of hours per week that the caretaker reported the child as present at home). The majority of cases were reported to spend all of the time (168 hr/week) at home. It is highly unlikely that all these children spent all their time inside or in the immediate vicinity of their residence. On the other hand, it is likely that for most of the time these children spent in other locations, the lead exposure was essentially the same. Therefore, we report only the records for the 282 children who met these criteria, and which had sufficient data (age, soil lead or house dust lead, blood lead) to allow calculation of an IEUBK-predicted blood lead, and empirical comparison with observed blood lead.

Preliminary Evaluation and Data Screening
The observed logarithms of blood lead are shown in Figure 2 against the IEUBK predictions, with 80% prediction intervals derived from the IEUBK model run, assuming a geometric standard deviation (GSD) of 1.6. The line log(observed) = log(predicted) is shown at the center of the interval, corresponding to H0(5) with K = 0 and L = 1. Figure 2 shows only 275 points. Based on several tests, we deleted seven points that appear to be outliers.
Table 2. Observed versus predicted blood lead by level-of-concern category.
Figure 2. Horizontal axis: log-predicted blood level, µg/dL.

Differences between Observed and Predicted Values
The normal probability plot of the cumulative distribution is shown in Figure 3. The central 68 to 70% (from z = -1 to z = 1) is nearly linear. The upper and lower tails, however, are linear with a much flatter slope. This suggests that the differences are not normally distributed, but might be the mixture of at least two roughly normal distributions, one with much greater variability than the other.

Empirical Comparisons Using Counting Data
Tables 2 through 4 show several other comparisons that may be useful alternatives in presenting the results. Table 2 indicates the extent to which the predictions are, on the whole, unbiased: the number of predicted values higher than the observed in any given blood lead category is about the same as the number of observed values higher than the predicted values in the analogous (transposed) category. Table 3 reduces the information in Table 2 into three 2 x 2 tables, again showing the desired symmetry or lack of significant bias. Table 4 shows how the graphical information in Figure 2 can be used in a formal test for the adequacy of a prediction interval. The visual impression from Figure 2 is that more than 20% of the observations lie outside the prediction interval. In fact, as shown in Table 4, only about 55% of the observations lie inside the 80% prediction intervals. This suggests, again, that there may be a subpopulation of children whose blood lead concentrations do not fit a lognormal distribution with a GSD of 1.6.
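One way to formalize the prediction interval comparison is a chi-square goodness-of-fit test on the three counts below, inside, and above the interval. The sketch below is one possible formalization, not necessarily the test behind Table 4. It assumes a two-sided 80% interval with z = 1.2816 and GSD = 1.6, and it uses hypothetical counts chosen only so that about 55% of 275 observations fall inside the interval. With three categories the statistic has 2 degrees of freedom, for which the upper tail probability is exp(-x/2).

```python
import math


def interval_counts(observed, predicted, gsd=1.6, z=1.2816):
    """Count observations below, inside, and above the prediction interval."""
    half_width = z * math.log(gsd)
    low = inside = high = 0
    for y, m in zip(observed, predicted):
        if y < m * math.exp(-half_width):
            low += 1
        elif y > m * math.exp(half_width):
            high += 1
        else:
            inside += 1
    return low, inside, high


def interval_chi_square(low, inside, high, q=80.0):
    """Chi-square test that a Q percent interval holds Q percent of the data."""
    n = low + inside + high
    expected_tail = n * (100.0 - q) / 200.0     # expected count in each tail
    expected_mid = n * q / 100.0                # expected count inside
    chi2 = ((low - expected_tail) ** 2 / expected_tail
            + (inside - expected_mid) ** 2 / expected_mid
            + (high - expected_tail) ** 2 / expected_tail)
    p_value = math.exp(-chi2 / 2.0)             # chi-square upper tail, 2 df
    return chi2, p_value


# Hypothetical counts: 151 of 275 (about 55%) inside the 80% interval.
chi2, p_value = interval_chi_square(low=60, inside=151, high=64)
print(f"chi-square = {chi2:.1f}, p = {p_value:.2g}")
```

A large statistic here says only that the interval is too narrow for these data. As the text notes, that can reflect a more variable subpopulation rather than a failure of the central prediction.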

Adjusting the Regression Test for Measurement Error
The initial regression results deviate substantially from the null hypotheses H0(3) and H0(4), with a
A useful and quite general method for dealing with measurement error in nonlinear regression has recently been proposed and shown to be generally valid (3,17). This is the SIMEX method. The concept is very simple: if measurement error biases the estimate, then adding more measurement error should increase the bias. The relationship between the expected value of the estimated coefficient (B or L) and the true coefficient, with and without measurement error, respectively, can be described well in large samples by the equation
$E\{B\} = \beta\,\sigma_P^2 / (\sigma_P^2 + \sigma_M^2)$, [9]
where $E\{B\}$ = expected value of the estimated coefficient, $\beta$ = true coefficient, $\sigma_M$ = measurement error standard deviation of the predictor, and $\sigma_P$ = standard deviation of the true predictor. In this equation, as $\sigma_M$ approaches zero, the expected value of the estimated coefficient approaches the true coefficient ($\beta$). The true predictors and the true predictor standard deviation cannot be observed because they depend on the true values of the input variables such as the true time-weighted and soil ingestion rate-weighted soil lead concentration, the true time-weighted and dust ingestion rate-weighted dust lead concentration, and so on, which cannot be truly measured. However, if some estimate of the model standard deviation, $\sigma_P$, is available, then the observed slopes such as B or L can be adjusted empirically by fitting the slope attenuation model to a set of simulated measurements that are even more noisy than the real data, and extrapolating the observations backward to the known or inferred value of $\sigma_M$, assigned a negative effect as shown below. We demonstrate this method as a test of the linear empirical comparison regression model, fitted in a logarithmic form, shown above as H0(5). The OLS fit to predicted values would look like a straight line on a nontransformed plot but like a curved line on a log-log scale.
The SIMEX procedure was carried out by the following steps:
Step 1: Estimate the slope Bp and intercept B0 in a nonlinear least-squares regression model
log(observed blood lead) = log(B0 + Bp * predicted blood lead) + error.
Step 2: For each predicted value M, generate a standard normal random variate Z, and calculate a randomized predicted value with additional log-normally distributed error corresponding to M,
$M_{ran} = M\,e^{Z \sigma_M}$. [10]
This assumes that the measurement errors are log-normally distributed, with median or geometric mean equal to 1 and GSD = exp($\sigma_M$). In these examples, we used $\sigma_M$ in steps of 0.1 from 0.1 to 1.0. Note that $\sigma_M$ is a purely hypothetical value that brackets the range of plausible measurement error in log(predicted blood lead), not the log GSD of the population of true measurement errors or the population of predicted values M.
Step 3: Simulation. Repeat Step 2 many times for each set of N simulated predictors $M_{ran}$.
Step 4: Extrapolation. The extrapolation data set consists of 25 values of simulated B0 and Bp pairs for each of 10 values of $\sigma_M$, and the nonlinear LS fit at $\sigma_M$ = 0. We fitted the following nonlinear extrapolation model with parameters G0, G1, G2 to the 251 (25 x 10 + 1) values of B0:
$B_0 = G_0 + G_1 / (G_2 + \sigma_M^2)$, [11]
and an analogous model with parameters P0, P1, P2 to the 251 values of Bp,
$B_p = P_0 + P_1 / (P_2 + \sigma_M^2)$. [12]
The values for the G and P parameters, which are outputs from the SAS PROC NLIN using the SIMEX method, are given in Table 5. The fitted models and their 95% confidence intervals are shown in Figures 4 and 5 as the smooth curves, covering the range of observed and simulated data. Several alternative fits were carried out, evaluating different weights and variance-stabilizing transformations. The models shown had weight 25 for the nonlinear LS value and no transformation, which produced the smallest extrapolation confidence bands.
The same models were then extrapolated by subtracting out hypothetical values of the true standard error. The extrapolation part of the analysis is shown by the dashed curves in the figures, which follow
$B_0 = G_0 + G_1 / (G_2 - \sigma_M^2)$, [13]
$B_p = P_0 + P_1 / (P_2 - \sigma_M^2)$, [14]
derived from the smooth-fitted curves. These functions would expand to B0 = -∞ and Bp = ∞ as $\sigma_M^2$ approaches G2 or P2. However, much smaller values of the true measurement error standard deviation $\sigma_M$ are appropriate. The next step is to select the boundary conditions for $\sigma_M$ and apply these to the B0 and Bp equations using the G and P parameters generated by the simulation.
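The simulation and extrapolation steps can be sketched compactly in Python. This is a deliberately simplified illustration, not a reproduction of the SAS PROC NLIN analysis. It makes several assumptions: it uses the straight-line log-log form of H0(4) instead of the nonlinear model of Step 1; it adjusts the slope only; it drops the constant term of the extrapolant so that slope = P1 / (P2 + added variance), which makes 1/slope linear in the added variance and fittable by ordinary least squares; and the data, seed, and function names are hypothetical.

```python
import math
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx = math.fsum(x) / len(x)
    my = math.fsum(y) / len(y)
    sxx = math.fsum((xi - mx) ** 2 for xi in x)
    sxy = math.fsum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def simex_slope(log_pred, log_obs, sigma_m0, rng, reps=25):
    """SIMEX-adjusted slope of log_obs on log_pred.

    sigma_m0 is the assumed measurement error SD already present in log_pred.
    """
    added_var = [0.0]
    inv_slope = [1.0 / ols_slope(log_pred, log_obs)]
    for step in range(1, 11):                    # added SD = 0.1, 0.2, ..., 1.0
        sd = 0.1 * step
        slopes = []
        for _ in range(reps):                    # simulation step
            # Multiplying M by exp(Z * sd) adds Z * sd on the log scale.
            noisy = [w + sd * rng.gauss(0.0, 1.0) for w in log_pred]
            slopes.append(ols_slope(noisy, log_obs))
        added_var.append(sd ** 2)
        inv_slope.append(len(slopes) / math.fsum(slopes))
    # Extrapolation step: fit 1/slope = c0 + c1 * added variance, then
    # evaluate at added variance = -sigma_m0^2 (error "subtracted out").
    c1 = ols_slope(added_var, inv_slope)
    c0 = (math.fsum(inv_slope) - c1 * math.fsum(added_var)) / len(added_var)
    return 1.0 / (c0 - c1 * sigma_m0 ** 2)


# Hypothetical data with a true slope of 1 on the log scale.
rng = random.Random(7)
sigma_p, sigma_m0, sigma_e = 0.5, 0.33, 0.3
true_log_pred = [rng.gauss(1.5, sigma_p) for _ in range(1000)]
log_obs = [t + rng.gauss(0.0, sigma_e) for t in true_log_pred]
log_pred = [t + rng.gauss(0.0, sigma_m0) for t in true_log_pred]

naive = ols_slope(log_pred, log_obs)
adjusted = simex_slope(log_pred, log_obs, sigma_m0, rng)
print(f"naive slope = {naive:.2f}, SIMEX-adjusted slope = {adjusted:.2f}")
```

As in the text, no observed value is altered: noise is only added to the predictions, and the adjustment comes entirely from extrapolating the fitted trend back past zero added error. The result depends directly on the assumed value of `sigma_m0`, so that choice needs independent justification.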
The IEUBK model GSD value of 1.6 reflects a composite of measurement input errors, reflecting: a) environmental exposure concentration errors, and b) variability in biologic and behavioral factors reflected by different absorption coefficients, compartment volumes and transfer coefficients, intake and ingestion rates, and other idiosyncratic exposures. Variability in environmental exposure, denoted GSDE, is the most appropriate component of measurement error to be evaluated for risk assessment. The environmental media concentrations or loadings are usually the most important health risk determination numbers that are factored into site-specific remediation decisions. Biologic and behavioral variability are unavoidable components, but we assume that these can be characterized together as a log-normal component with a GSD denoted GSDB. Assuming that biologic and behavioral variability are independent of environmental measurement variability, then the following model applies:
(5)] are compared in Figure 6. Hypothesis H0(4) assumes a linear regression for log(observed blood lead) versus log(model blood lead), and hypothesis H0(5) assumes a linear relationship between observed and modeled blood lead, which is fitted after logarithmic transformation of both sides. On the log-log plot of Figure 6, the hypothesis H0(4) alternatives are straight lines. Note that when there is no adjustment for measurement error, the unadjusted OLS fit has an intercept of 0.9 and a slope of 0.4, whereas after adjustment for measurement error with log(GSD) = 0.33, the SIMEX adjustment gives an intercept of 0.5 and a slope of 0.7. This is much closer to the null hypothesis, although the difference between observed and predicted blood lead is still substantial for some risk assessment applications. A measurement error GSD larger than 1.4 in the model values would be needed to bring the curves closer, and cannot be justified based on empirical evidence discussed earlier.
The hypothesis H0(5) alternatives are curved lines on Figure 6. Note that when there is no adjustment for measurement error, the OLS fit gives an intercept of 3 and a slope of 0.47, whereas after adjustment for measurement error with log(GSD) = 0.33, the SIMEX adjustment gives an intercept of 0.2 and a slope of 1.0. This is much closer to the null hypothesis line (observed = predicted), with an expected intercept of 0 and slope of 1.0, and the difference between observed and predicted blood lead using this measurement error correction method is negligible for risk assessment applications. A measurement error GSD of about 1.4 in the model values seems to be appropriate. The IEUBK model is only slightly nonlinear at blood lead less than 25, so that the family of linear alternatives in hypothesis H0(5) may be more realistic. An important aspect of this procedure is that no individual observed values were changed. The variability due to measurement error was enhanced by a method similar to standard additions, then extrapolated to a preselected value for the GSD using an equation derived from a SIMEX application of SAS PROC NLIN. We may therefore accept the statistical hypothesis and conclude that with corrections for measurement error, the IEUBK model provides a satisfactory prediction of typical blood lead concentrations for children exposed to residential lead in this residential situation. This process also raises the possibility that there is a small subpopulation of children with blood lead concentrations either much higher or much lower than those predicted by the model with a standard GSD of 1.6.
Figure 6. The application of the SIMEX measurement error correction procedure to two hypotheses, H0(4) and H0(5). Line A is the theoretical condition where observed = predicted (intercept = 0, slope = 1). Lines B and C are the uncorrected and corrected forms of H0(4), and lines D and E are the uncorrected and corrected forms of H0(5). Horizontal axis: log-predicted blood level, µg/dL.
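The split of the overall GSD into environmental and biologic components discussed in this section can be illustrated numerically. The sketch below assumes the standard result for a product of independent log-normal factors, namely that log-scale variances add: (ln GSD)^2 = (ln GSDE)^2 + (ln GSDB)^2. This may not match the exact form of the equation used by the authors, and the function name is hypothetical.

```python
import math


def biologic_gsd(gsd_total=1.6, gsd_env=1.4):
    """GSD of the biologic and behavioral component implied by the total
    GSD and the environmental measurement GSD, assuming independence."""
    var_b = math.log(gsd_total) ** 2 - math.log(gsd_env) ** 2
    if var_b < 0:
        raise ValueError("environmental GSD cannot exceed the total GSD")
    return math.exp(math.sqrt(var_b))


gsd_b = biologic_gsd(1.6, 1.4)
print(f"implied biologic/behavioral GSD = {gsd_b:.2f}")
```

Under this assumption, a total GSD of 1.6 combined with an environmental measurement GSD of about 1.4 leaves a biologic and behavioral component of similar size, which gives a sense of why a measurement error GSD much larger than 1.4 would be hard to justify.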

Conclusions
Hypothesis tests can be a useful statistical tool for model validation. Several forms of statistical hypotheses were presented that are structured to show the level of confidence that the hypothesis is not rejected. Although they can never show that a model is right (model verification), these hypothesis tests can be used to show that a specific application of the model is not wrong. In this sense, model validation is a process of adding strength to our belief in the predictiveness of a model by repeatedly showing that it is not blatantly wrong in specific applications.
When a statistical test of observed versus predicted values fails to achieve the desired level of confidence, the problem may be with the observed data (usually the result of measurement error) or the model code (usually the specification of one or more key parameters). Recent developments in the statistical field of measurement error correction (3) have provided a tool for reducing the apparent effects of measurement error in the regression model.
In a single application of this measurement error correction procedure, this report has shown that hypothesis tests performed after measurement error correction can reverse the conclusion from rejection to acceptance of the statistical hypothesis, thus further validating the model and increasing the confidence that the model is not wrong. It is important to note that the measurement error correction procedure does not adjust any specific observation or drop any observation from the dataset. It uses the method of standard additions to adjust the slope and intercept of the regression between observed and predicted values.
Multiple regression models and related multiequation structural equation (pathway) models may require more sophisticated approaches. The study of measurement error effects using latent variable methods (20) is time consuming and labor intensive, requiring computer tests of several hours to days in length, using standard statistical packages such as SAS PROC CALIS (18). Unfortunately, intrinsically nonlinear models cannot be handled with existing packages.
There is also a need to evaluate and rank different model specification tests for empirical models when predictor variables are error-prone. Some recently developed methods for comparing different structural equation model specifications use residual curvilinearity (21). The effects of design matrix measurement errors on specification tests using residuals or studentized residuals from not-so-large samples are unknown. Cross-validation and bootstrap methods ought to be useful but may also need adjustments for measurement error effects.