Biostatistical issues in the design and analysis of multiple or repeated genotoxicity assays.

Tests for genotoxic or mutagenic effects of chemicals have prompted efficient biostatistical methods for the quantification of dose-response data, especially from the Ames Salmonella/microsome assay. A decision about the genotoxicity of a compound is, however, always based on several assays, and results from multiple or repeated genotoxicity assays have to be combined either qualitatively or, even better, quantitatively. The latter problem is considered here, and issues for design and analysis are addressed. General recommendations for designing genotoxicity assays are given. A long-known methodology for combining quantitative parameters from different experiments is updated and other statistical methods suitable for the combined analyses of multiple assays are presented. Some aspects of design and analysis are elucidated on count data from unscheduled DNA synthesis assays.


Introduction
The increasing number of chemicals, their spread into the human environment, and their consumption by humans urge quantitative evaluations of their potential adverse effects. For this reason, short-term tests (STT) have become a widespread biological assay for detecting and assessing genotoxic and mutagenic effects. Growing awareness of genetic factors related to human diseases and the identification of proto-oncogenes and tumor-suppressor genes have sparked renewed interest in the mechanisms of genotoxicity of environmental agents.
Biostatistics has contributed to the design and analysis of genotoxicity assays in important fields: Trend tests have been developed to test for the presence or absence of genotoxic effects, and they superseded multiple pairwise testing. Nonparametric methods replaced parametric ones, suspending the assumption of a Gaussian normal distribution. Transformations were used to deal with variance heterogeneity. Weighted regressions were applied for fitting dose-response models that had been established either as empirical statistical models or as structural mathematical models motivated by biological considerations. Methods for coping with overdispersed data and tests for checking the distributions of the data were developed. Outlier detection and use of historical control information have been established for quality control. Methods for the analysis of a single assay have been summarized recently (1). There have also been suggestions and improvements for the design of genotoxicity assays [see the guidelines of the United Kingdom Environmental Mutagen Society (UKEMS) (2)]. These are mostly intuitive and empirically proven methods rather than theories, and they may be called "statistical common sense." In practice, genetic toxicologists do not conduct only one single assay. Usually, they repeat an assay several times, either under identical or varying conditions. This may be done to confirm previous results or to cope with the fact that genotoxicity of a compound can be expressed in different ways. The Ames test, for example, has used several tester strains sensitive to different types of mutations. Thus, results from multiple or repeated genotoxicity assays have to be combined somehow.

[Author's affiliation: Biostatistics, German Cancer Research Center, Im Neuenheimer Feld 280, D-6900 Heidelberg, Germany. This paper was presented at the International Biostatistics Conference on the Study of Toxicology, held May 13-25, 1991, in Tokyo, Japan.]
Decision making on the presence or absence of genotoxicity is supported formally by statistical methods of multiple comparisons, and there may be further progress through the use of Bayes methods. On the other hand, there is also a need for a quantitative combination of results from several assays. Linear models and, more recently, generalized linear models (GLIMs) can be used if the design of the experiment was regular enough. In other cases, long-known methods of weighted means are useful. Their use for genotoxicity assays will be described below. Before dealing with the question of how to combine estimates, some general design considerations for genotoxicity assays are given.

Design and Conduct of Genotoxicity Assays
This section addresses and illustrates basic elements of experimental design. More details on various assays (bacterial and mammalian cell colony and fluctuation, in vitro and in vivo chromosomal aberration, sister chromatid exchange, Drosophila, and dominant lethal) can be found in Kirkland (2). Basic biostatistical elements in designing genotoxicity assays are listed in Table 1.

Table 1. Basic biostatistical elements in designing genotoxicity assays.
- Experimental unit: cell, cell culture in petri dish, animal in an in vivo assay
- End point: observability, measurability, identifiability
- Conditions of experimentation: inoculum size, parallel survival assay, incubation prior to treatment, treatment in nonnutrient medium, treatment after growth in nutrient medium
- Design: two-sample, many-to-one sample, dose response, controls
- Sources of bias
- Methods of statistical evaluation

Statistical analysis requires the specification of an end point, which might be a frequency of counts, a mutation rate, etc. Questions of observability, measurability, and identifiability have to be addressed in some cases, e.g., when a mutation rate has to be calculated from a mutation assay and a parallel survival assay. The number of cells seeded on a plate (inoculum size) and other conditions of experimentation affect the outcome. The recognition of sources of variability is important. We distinguish within-assay variability and between-assay variability. Within-assay variability contributes to the sampling variation and may be caused by dilution, weighing, or pipetting errors, variability in experimental handling, variability of cell division rates in different plates, or counting errors. Between-assay variability contributes to reproducibility and may be caused by physical and chemical properties of the agents, their storage and preparation, or changing growth conditions. More general between-assay variability may be caused by "historical" changes of the protocol, by fluctuation in the staff of the laboratory, or by a genetic drift of the biological material.
A second important aspect is statistical bias: a systematic change of the end point variable, usually to higher or lower values than expected under ideal experimental conditions. There is no guaranteed protection against biases, but there are possibilities to reduce or at least recognize them through lessons learned from clinical-trials methodology: Running assays in several laboratories (multicentricity) increases the testing capacity, allows the assessment of interlaboratory variability, and increases the representativity of the result. Blind evaluation is possible by using coded chemicals and coded dose groups, and randomization between laboratories and of the order of experimentation might be possible.

Experimentation
Formal experimental requirements have to deal with reproducibility: Physical and chemical properties of test compounds have to be well characterized and controlled. A genetic drift of the biological material during prolonged culturing has to be recognized early. Induction, preparation, and storage of compounds and solvents is a major technical point. Decisions have to be made, for example, between agar-based and liquid-based assays. Use of auxiliary exogenous metabolic activation (e.g., S9 mix in the Ames test) or selective agents must be considered because most target cells have only limited endogenous metabolic capacity. Guaranteeing a stable and low spontaneous mutant frequency becomes a major point when, at the same time, sufficient numbers of cells have to be plated to avoid zero counts. It has been recommended that the number of cells plated initially should assure that a complete set of zero counts occurs with probability not higher than 5% (3). Replicated culturing is basic for statistically evaluable repeated measurements, but a sound statistical estimation of variability requires separate, stable preparation and treatment, not merely splitting the same mixture.
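The 5% zero-count recommendation can be made concrete under a Poisson assumption for spontaneous mutant counts. The sketch below is illustrative only; the function name and the example spontaneous frequency are not from the text. It solves exp(-k·N·f) ≤ 0.05 for the inoculum size N per plate, given k replicate plates and spontaneous mutant frequency f:

```python
import math

def min_cells_plated(mutant_freq, n_plates, p_all_zero=0.05):
    """Smallest number of cells N to plate per plate so that, under a
    Poisson model with spontaneous mutant frequency mutant_freq, the
    probability that all n_plates control plates show zero mutant
    counts, exp(-n_plates * N * mutant_freq), is at most p_all_zero."""
    return math.ceil(-math.log(p_all_zero) / (n_plates * mutant_freq))

# e.g., spontaneous mutant frequency 1e-7 and 3 replicate plates
n = min_cells_plated(1e-7, 3)
```

With these illustrative numbers, roughly ten million cells per plate would be needed before an all-zero set of controls becomes acceptably unlikely.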

Design
Factors that are a potential source of confounding include the number of cells at inoculation, the number of replications, the number of treatment/dose groups, the interval of dosing, and the choice of controls. Mahon et al. (3) recommend a minimum of three dose levels and two separate cultures in each dose group. Blank negative controls and solvent negative controls should be used (4). Positive controls should be incorporated routinely for quality control. Some general requirements are listed in Table 2. Important elements of statistical design are randomization and assessment of results under blind conditions. The effort and time spent to check for the application of these elements will result in more reliable and reproducible results. Two basic designs can be distinguished: (a) testing for a difference between the treated (exposed) and the untreated controls, and (b) establishment of a dose-response relationship with the aim of quantifying genotoxic potency. Another benchmark is the choice between parametric and nonparametric models. Although this can still be decided when the experiment is over, it is wise to consider it in the design phase because of its impact on the optimal determination of dose groups and the number of replicates per dose.

End Points and Data Structure
The structure of the data of a genotoxicity experiment depends on the type of the experiment and on the experimental units. Defining factors are shown in Table 3. Mostly, the end point is either a count (e.g., count of revertants, count of aberrations, count of sister chromatid exchanges), or it is a proportion, if the counts have to be related to a baseline number (e.g., the number of surviving mutants among all survivors). Quantitative data are usually hierarchically structured by treatments (dose groups), solvents, replicated cell cultures, and repeated measurements taken at individual cells. Further stages on top of this hierarchy may be different laboratories or other target cell strains.

Table 2. General requirements of designing a dose-response assay.

Sampling Model
The predominant question about sampling models has centered around the appropriate class of statistical distributions for the observed count data. This was triggered by the observation of extra-Poisson variation in the Ames test. Concurrent to methods coping with this so-called overdispersion is the use of transformations toward normality or the analysis of means obtained from an appropriate large number of measurements.
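Extra-Poisson variation among replicate plate counts can be screened for with the classical variance (dispersion) test, which compares the scaled sum of squared deviations to a chi-square distribution. This is a minimal illustration of the idea, not a method prescribed in the text, and the replicate counts are invented:

```python
def poisson_dispersion(counts):
    """Classical dispersion (variance) test for extra-Poisson
    variation among replicate plate counts at one dose: the statistic
    D = sum((y - ybar)^2) / ybar is approximately chi-square with
    n - 1 degrees of freedom under the Poisson hypothesis; values far
    above n - 1 indicate overdispersion."""
    n = len(counts)
    ybar = sum(counts) / n
    d = sum((y - ybar) ** 2 for y in counts) / ybar
    return d, n - 1

# invented replicate revertant counts at a single dose
d, df = poisson_dispersion([18, 25, 41, 12, 33])
```

Here D is about five times its degrees of freedom, the kind of excess spread that motivates overdispersion corrections or transformations toward normality.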

Dose-Response Model
The primary choice of a dose-response model is between a parametric and a nonparametric functional species. Nonparametric methods may be preferred if no agreement on a common sampling model can be found or ifone looks for statistical models that are valid under different experimental conditions (laboratory, tester strains, age, and status oftest compound). On the other hand, a parametric dose-response model provides an easier way to obtain mutagenic potency measures.

Control of Variability
Weighing, pipetting, transferring microbial cells between vessels and plates, and clumping of cells are factors usually contributing to a high variability. Other sources are varying toxicities on plates, different rates of cell division, dilution or counting errors, variable operators' skills, and calendar time. In vivo experiments are further burdened by genetic differences between animals. Use of negative and positive controls is generally advised to control for day-to-day and animal-to-animal variability. Negative control data should lie in an acceptable range and should be compared with historical control data. Positive control data, on the other hand, should confirm the effectiveness of the entire assay. Table 4 concerns the use of control information in the process of deciding about genotoxicity.

Multiple Experiments in Genotoxicology: An Example
The multiplicity of genotoxicity assays is shown clearly in the investigations performed by the U.S. National Toxicology Program (5). Recently, 42 further chemicals were examined using the Ames Salmonella test in four laboratories (up to 3 per chemical), with 3 solvents (up to 2 per chemical), 5 tester strains, 3 S-9 mix categories (none, hamster, rat), and with as many as 4 repetitions per laboratory. This would have led to 1680 assays for one chemical if the maximum number of possible combinations had been used. Of course, most of the possible combinations were not realized because of a reduced number of laboratories, a choice strategy for tester strains, and the choice of the S-9 mixes (and costs). In fact, most chemicals are tested in one laboratory and with one solvent only, which reduces this number to 140 possible combinations. Actually, the total number of dose-response experiments for a genotoxic investigation is usually below 100. For tribromomethane (bromoform), Zeiger (5) reported 98 dose-response experiments. Reasons for multiple experiments vary.
Duplicates are run for confirmation; random inclusion of known positive and negative controls is used for monitoring and controlling the quality of the laboratory; repeated tests are run if unexpected or conflicting results were obtained (6). Table 5 gives the nomenclature for the methods discussed in the following section.

Combination of Estimates
Note that before combining results, it must be verified that the results are suitable for combination. This is not easy and may be only partially solvable by statistical tests of heterogeneity or trend. Experimental comparability should be addressed in cooperation with the biologist. On the other hand, there may be situations where one has to come to a conclusion based on a series of estimates even if doubts about comparability remain.

One Factorial Set of Experiments
Let us consider I assays, where each has led to an effect estimate, m_i, with a variance estimate, v_i, i = 1,...,I [see Cochran (7)]. In some cases we also assume that the estimate v_i is based on f_i degrees of freedom and is stochastically independent of m_i. An additive model for the estimate m_i is assumed: m_i = mu + a_i + e_i, where a_i is a between-assay effect with variance component sigma_a^2 and e_i a within-assay error with variance sigma_i^2.

Unweighted Mean. Variances of the unweighted mean are given separately for sigma_a^2 = 0 and sigma_a^2 != 0 in Table 6. For the degrees of freedom, see Cochran (7).

Grand Mean. Weighting by the degrees of freedom of the variances or by sample sizes, n_i, gives the grand mean (Table 6). Variance and degrees of freedom are obtained similarly to the unweighted mean.

Semiweighted Means. Use of the weights w_i = 1/(v_i + s_a^2) is known as semiweighting (Table 6). The within-assay variance components, sigma_i^2, are estimated by the variances, v_i. More difficult is the estimation of the between-assay component sigma_a^2, for which Rao's MINQUE procedure (10) can be used.

Variance-Weighted Means. If sigma_a^2 = 0, the weighting reduces to w_i = 1/v_i. This variance-weighted mean depends heavily on experiments of high accuracy, and assays with a large variance have almost no influence. To counteract this, a so-called partially weighted mean was introduced: the assays are subdivided into a class of low-precision assays, weighted by their respective large variances, and a class of high-precision assays, weighted by a mean of those small variances. Note also the direct correspondence of weighted means to methods of meta-analysis, as well as their relation to Bayesian methods when the choice of a weighting scheme can be related to the choice of a prior distribution.
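The weighting schemes above can be sketched in a few lines. The snippet below computes the unweighted, variance-weighted (sigma_a^2 = 0), and semiweighted means for a set of per-assay estimates; for simplicity it estimates sigma_a^2 by a moment (DerSimonian-Laird-type) estimator rather than the MINQUE procedure cited in the text, and the numbers are illustrative:

```python
def combine_estimates(m, v):
    """Combine effect estimates m[i] with variance estimates v[i]
    from I assays.  Returns the unweighted mean, the variance-weighted
    mean (valid if sigma_a^2 = 0), and the semiweighted mean with
    weights 1 / (v[i] + sigma_a^2).  The between-assay component
    sigma_a^2 is estimated by a simple moment estimator here, not by
    MINQUE."""
    I = len(m)
    m_u = sum(m) / I                                      # unweighted
    w = [1.0 / vi for vi in v]
    m_w = sum(wi * mi for wi, mi in zip(w, m)) / sum(w)   # variance-weighted
    # moment estimator of sigma_a^2 (truncated at zero)
    q = sum(wi * (mi - m_w) ** 2 for wi, mi in zip(w, m))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    s2a = max(0.0, (q - (I - 1)) / c)
    sw = [1.0 / (vi + s2a) for vi in v]
    m_s = sum(wi * mi for wi, mi in zip(sw, m)) / sum(sw)  # semiweighted
    return m_u, m_w, m_s

# illustrative per-assay slope estimates and variances
means, variances = [2.9, 3.1, 2.6], [0.04, 0.09, 0.25]
m_u, m_w, m_s = combine_estimates(means, variances)
```

When the between-assay component is estimated as zero, the semiweighted mean coincides with the variance-weighted mean, which illustrates the nesting of the two schemes.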

Multifactorial Set of Assays: Combination of Groups of Estimates
Multiple assays are usually structured by several factors, and it often becomes necessary to combine estimates over some of these factors. Problems arise if the number of replications is small. An ad hoc solution is then a resampling method: from each of the I groups, one estimate is sampled randomly, and the mean, m*, of those I values is determined together with a variance estimate, v*. The random sampling can be repeated many times, like a bootstrap procedure. The total mean, m_b, of all repeatedly calculated means, m*, gives the estimate of the grand effect. A variance estimate can be obtained as the sum of the "bootstrap" variance of the m* around m_b and the mean of the variance estimates v* between the I groups. For details see Edler (8).
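The resampling scheme might be sketched as follows; the precise form of the per-draw variance estimate v* is my reading of the text, so treat this as an illustrative sketch rather than the published algorithm of Edler (8):

```python
import random
import statistics

def resample_combine(groups, n_boot=1000, seed=1):
    """Ad hoc resampling combination for groups with few replications:
    repeatedly draw one estimate per group, record the mean m_star of
    the I drawn values and a variance estimate v_star of that mean;
    the grand effect is the mean m_b of all m_star values, with a
    variance combining the bootstrap spread of m_star around m_b and
    the average of the v_star values."""
    rng = random.Random(seed)
    m_stars, v_stars = [], []
    for _ in range(n_boot):
        draw = [rng.choice(group) for group in groups]
        m_stars.append(statistics.fmean(draw))
        v_stars.append(statistics.variance(draw) / len(draw))
    m_b = statistics.fmean(m_stars)
    var_b = statistics.pvariance(m_stars) + statistics.fmean(v_stars)
    return m_b, var_b

# degenerate check: singleton groups make every draw identical
m_b, var_b = resample_combine([[2.0], [3.0], [4.0]], n_boot=100)
```

In the degenerate singleton case the bootstrap spread vanishes and only the between-group component contributes, which makes the behavior of the two variance terms easy to see.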

Example: DNA Damage Repair Short-Term Assays
Unscheduled DNA synthesis [UDS (11)] is a type of short-term test that uses the fact that specific cells (e.g., human fibroblasts) are able to synthesize DNA beyond S-phase, between phases G1 and G2 (12). UV-induced synthesis of DNA between G1 and G2 suggests repair of damaged DNA. In fact, most cells incorporate 3H-TdR into DNA during all stages of the cell cycle after damage. A distinction between S-phase and non-S-phase is achieved by preexposure labeling, resulting in heavily labeled S-phase cells, and postexposure labeling, resulting in lightly labeled non-S-phase cells representing UDS.
The experimental setup for an in vitro UDS assay may be as follows (13): Cells are taken from living tissue, incubated, and grown with antibiotics in medium in tissue culture flasks. Growth should be permitted until confluency to avoid replicating nuclei with their enormous 3H-TdR uptake. Next, the cells are labeled with 3H-TdR to obtain heavily labeled S-phase cells. Then they are exposed to the chemical carcinogens. They are labeled again, and autoradiograms are taken after washing, fixing, and drying.
Use of radioactively labeled thymidine allows the application of autoradiography. The autoradiograms themselves require developing, fixing, washing, drying, and staining of the specimen. This enables one to quantify the repair capacity of cells after exposure to damaging agents as well as the amount of damage, which is assumed to correspond to the amount of repair. More experimental details were given by Cleaver (14), who calculated the mean number of grain counts of labeled cells, adjusted for background by subtracting the mean of grain counts in fields of equal size outside the cell nucleus.
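Cleaver's background adjustment amounts to a simple subtraction, sketched here with invented grain counts:

```python
def net_grain_counts(nuclear_counts, background_counts):
    """Background adjustment in Cleaver's style: subtract the mean
    grain count observed in equal-sized fields outside the nucleus
    from each nuclear grain count."""
    background_mean = sum(background_counts) / len(background_counts)
    return [count - background_mean for count in nuclear_counts]

# invented grain counts: four nuclei, three background fields
net = net_grain_counts([12, 9, 15, 7], [2, 3, 1])
```

Net counts can be negative for weakly labeled nuclei; whether to truncate them at zero is an analysis decision, not part of the adjustment itself.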
In vivo UDS in rat hepatocytes as a complementary short-term assay to the mouse bone marrow cytogenetic assay was described by Margolin and Risko (15). They analyzed the end points, sources of variability, and the role of historical controls.

Autoradiography
To understand the variability of the data obtained by autoradiographic methods, a short description of the method is in order. Basically, autoradiography is a photographic method used to determine the distribution of radioactivity in a specimen containing radioactive material. During autoradiography, the radioactive specimen is placed in contact with a photographic emulsion consisting of grains of silver halide, usually bromide. The photographic emulsion is suspended in a gelatin matrix, almost always coated on a glass plate or a film of cellulose acetate or polyester resin. Ionizing radiation liberates electrons, which initiate a reduction of silver ions to metallic silver at the sites where radioactivity interacts with the emulsion. Photographic development enhances the effect catalytically, by reduction of additional silver ions in the immediate vicinity of interaction sites. Unaffected silver ions are removed by a fixing solution. The distribution of metallic silver corresponds to the distribution of radioactivity in the specimen. Experimental variations are possible in the type and duration of the contact between photographic emulsion and radioactive specimen. Thus one may distinguish between temporary and permanent contact, using the sprinkling, slapping, dipping, floating, or stripping technique for establishing the contact (16). The emulsion is fixed and stained after some exposure and development time. Location and intensity of radioactivity of the specimen are indicated by black spots or grains of metallic silver. The end points of the evaluation are the silver grains made visible by this method and their number per cell nucleus. These grains are evaluated microscopically or by image analysis. The quantitative end point is the number and the areas of the grains. The selection procedure for cell identification and counting per nucleus has to be defined; random selection is preferred, and "blindness" should be ensured.
The main sources of confounding are background radioactivity and grains generated by sources other than the experimentally controlled radioactivity. These may result from prolonged development of the emulsion, exposure to daylight, radiation from the laboratory environment or cosmic radiation, pressure, chemography, metal ions, static electricity, and differences in the concentration of soluble bromide ions (17). The presence of background grains poses a problem for the analysis of autoradiographic counts. In dose-response experiments, the background can be subsumed under the control group (dose = 0) as long as background intensity does not depend on the dose. Ishikawa and coworkers (18) used as a graphic display a plot of the mean number of grain counts versus the logarithm of the dose. This concept was further developed by Thielmann et al. (13). Among several other transformations investigated, the mean versus log-dose gave qualitatively the best results. Plotting the mean number of grain counts versus the logarithm of the dose, a parameter, G0, describing the linear increase of the mean number of grains resulting from a dose increase by the factor e = 2.72, was used as the potency measure. The simple linear regression has the advantage of allowing a straightforward evaluation of repeated experiments. A normal distribution can be assumed because a large number of cells can be evaluated. An investigation of individual animal net grain counts for the in vivo UDS rat hepatocyte assay revealed that mean net grain counts of two or more animals may be considered as normally distributed (15).

Linear Regression Model for Mean Counts
Data for a UDS dose-response assay are the numbers of grain counts, Y_ij, per nucleus j (j = 1,...,n_i) and dose group i (i = 1,...,I). The increase of the mean number of grain counts per nucleus with dose is usually concave, suggesting a logarithmic transformation of the dose, as discussed above. Toxicity or saturation effects, which are not well understood, may cause a downturn of the dose-response curve at high doses. A recursive step-down procedure was used to cope with this. Let the model y_i = alpha + beta ln d_i, i = 1,...,I, be given for the mean number of grain counts, possibly after subtraction of the mean of the zero-dose group. The successive regression equations y_i = alpha + beta ln d_i, i = 1,...,I-r, are then evaluated for r = 0, 1,...,I-3, and the highest doses d_i are discarded as long as a selection criterion improves, such as the minimum estimated standard error (standard deviation of the residuals). If the procedure stops at r = R, the resulting model y_i = alpha + beta ln d_i, i = 1,...,I-R, is evaluated by simple linear regression.
Another selection procedure could be based on the method of Simpson and Margolin (19). The slope estimate, beta, is used as the measure of repair capacity. Compared to more complex adaptive procedures, this simple linear model for the mean number of grain counts per dose has the advantage of allowing a straightforward evaluation of repeated evaluations and repeated experiments per day, over several days, or even over several laboratories. Because variance homogeneity might not hold in general, weighted regression methods may be indicated. Note that the mean counts, Y_i, are no longer independent when the zero-dose mean, Y_0, has been subtracted. However, the differences are independent of Y_0, and hence the estimation of the slope and of the error variance is unaffected. G0 values of each assay were obtained by linear regression, and results of two to three assays were combined by an unweighted mean into a common G0 value for each cell strain.
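The step-down regression on log-dose can be sketched as follows. This is a minimal implementation of the procedure described above, using the residual standard error as the selection criterion; the stopping-rule details are my own reading of the text, and the data are invented:

```python
import math

def stepdown_slope(doses, mean_counts, min_doses=3):
    """Step-down linear regression of mean grain counts on ln(dose):
    the highest dose is dropped as long as the residual standard
    error of the fit decreases, but at least min_doses doses are
    kept.  Returns the slope estimate (G0) and the number of doses
    retained."""
    def fit(x, y):
        # simple least squares; returns slope and residual std. error
        n = len(x)
        xbar, ybar = sum(x) / n, sum(y) / n
        sxx = sum((xi - xbar) ** 2 for xi in x)
        sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        b = sxy / sxx
        a = ybar - b * xbar
        rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
        return b, math.sqrt(rss / (n - 2))
    x = [math.log(d) for d in doses]
    y = list(mean_counts)
    b, se = fit(x, y)
    while len(x) > min_doses:
        b2, se2 = fit(x[:-1], y[:-1])
        if se2 >= se:
            break
        x, y, b, se = x[:-1], y[:-1], b2, se2
    return b, len(x)

# illustrative: near log-linear rise over four doses, downturn at the top dose
doses = [math.e ** k for k in range(5)]
means = [0.0, 1.1, 1.9, 3.0, 1.0]
g0, kept = stepdown_slope(doses, means)
```

For these data the top dose is discarded and the slope is estimated from the remaining four doses, mimicking the removal of a toxicity-induced downturn.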
Deviations from dose linearity are observed frequently. A simple device is to use a piecewise linear regression, for example, by distinguishing two dose regions. Table 8 shows the unweighted means of an evaluation of 11 selected strains of volunteers from a large-scale evaluation (13). Slope estimates, G0, had been obtained from two to three dose-response assays by linear regression as described above. G0 for strains S1 and S5 had a high precision, in contrast to strains S9, S11, and to some extent S10, which had a low precision because of a high interassay variability. The semiweighted mean over these unweighted means resulted in a combined G0 of 3.0 for all normal strains, with variance estimated as 0.07, whereas the partially weighted mean gave a combined G0 = 2.9 with variance 0.09. Without the three low-precision strains, we obtain a semiweighted mean of 2.9 with variance 0.07 and a partially weighted mean of 2.8 with variance 0.05. The variance component was estimated by the MINQUE procedure (10).
Dose-response experiments for the in vivo UDS assay were analyzed by Margolin and Risko (15) by simple linear regression, as long as the dose-response curve showed no downturn at high doses. If there was such a downturn, a simple quadratic regression was applied. In both cases, a measure of mutagenicity was calculated from the estimated regression parameters.

Alternative Methods for UDS Count Data
A nonlinear regression model E[Y_ij] = mu(x_i, beta) = c * lambda(x_i, beta) can be applied if mu or lambda can be specified as a structural dose-response relationship. The covariate x_i can contain arbitrary factor information, i.e., data from rather general designs can be analyzed this way. If it can be shown that the count data Y_ij follow a Poisson distribution, Poisson regression methods can be applied (20). If the dependence on the covariate can be expressed via a link function, the solution is also obtained by generalized linear models (GLIMs). Engel (21) applied quasi-likelihood methods to the analysis of count data from nested designs. The log-quasi-likelihood l(mu; y) satisfies the equation dl/dmu = (y - mu)/V(mu), where V(mu) is the variance function. Two types of mean-variance relationships have been found to be important for count data: V(mu) = sigma^2 mu or V(mu) = sigma^2 mu^2. A design where a random factor, B, is nested within a second random factor, A, was considered, as well as a design with two fixed factors, A and B, for data Y_ijk, i = 1,...,I; j = 1,...,J; k = 1,...,K, satisfying a negative binomial distribution with parameters (alpha_ij, p_ij). Here alpha denotes the shape parameter of the hidden mixing distribution, and p = theta/(1 + theta), where theta is the scale parameter of that distribution. Two cases are considered for the second design: (a) only alpha_ij depends on the two factors (and theta is independent of A and B); (b) only theta_ij depends on the two factors. Case a corresponds to a constant variance-to-mean ratio. Case b can be embedded into a GLIM only if alpha is known, because otherwise the distribution does not belong to an exponential family.
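If a Poisson sampling model is adopted, the quasi-likelihood score equation with V(mu) = mu and a log link leads to the standard iteratively reweighted least squares (IRLS) algorithm for Poisson regression. The following is a generic sketch of that algorithm, not Engel's nested-design analysis; the design matrix and counts are invented:

```python
import numpy as np

def poisson_irls(X, y, n_iter=25):
    """Poisson regression with log link fitted by iteratively
    reweighted least squares; solves the quasi-likelihood score
    equations with variance function V(mu) = mu."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # start from the constant model
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu         # working response
        w = mu                          # IRLS weights for Poisson/log link
        XtWX = X.T @ (w[:, None] * X)
        beta = np.linalg.solve(XtWX, X.T @ (w * z))
    return beta

# illustrative: counts growing log-linearly in dose (invented data)
dose = np.array([1.0, 2.0, 4.0, 8.0])
X = np.column_stack([np.ones_like(dose), np.log(dose)])
y = np.array([5.0, 10.0, 20.0, 40.0])
beta = poisson_irls(X, y)   # converges close to (ln 5, 1) for these data
```

Because the covariate matrix X may carry arbitrary factor columns, the same loop covers the "rather general designs" mentioned above; only the construction of X changes.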

Conclusions
Biostatistics has made important contributions to an unbiased and efficient analysis of the Ames Salmonella assay, and despite the variety of genotoxicity assays, this methodology seems to be applicable or adjustable to a considerable number of them. One has to account for considerable variability of the outcome variable of an assay because of factors acting during the experimental process as well as conditions varying between experiments. Variability can be partially controlled by statistical methods. This necessitates designs with negative and positive controls and the use of replicates. Otherwise, common biometric principles of experimental design apply to genotoxicity assays. This includes comprehensive, formal planning of the whole investigation. Blind evaluation, reference evaluation, and principles of randomization should be established, and repeated assays should be planned at the beginning of an investigation. Sequential methodology may be helpful and should be explored further. Assays of a planned investigation should be checked for heterogeneity of the distribution of the outcome values (e.g., means and variances). Weighted means are presented in this contribution as an elementary method for combining estimates obtained from genotoxicity assays into summary measures. They can be applied stepwise in higher-order designs. Statistical regression models may be applied to well-designed factorial experiments and to studies with multivariate covariables.