Some applications of categorical data analysis to epidemiological studies.

Several examples of categorized data from epidemiological studies are analyzed to illustrate that more informative analysis than tests of independence can be performed by fitting models. All of the analyses fit into a unified conceptual framework that can be performed by weighted least squares. The methods presented show how to calculate point estimate of parameters, asymptotic variances, and asymptotically valid chi 2 tests. The examples presented are analysis of relative risks estimated from several 2 x 2 tables, analysis of selected features of life tables, construction of synthetic life tables from cross-sectional studies, and analysis of dose-response curves.


Introduction
During the past 25 years there has been a remarkable resurgence in the development of statistical methods for the analysis of categorized data. The methods available are comparable in flexibility and analytical power to those commonly used for intervally scaled data.
It is a measure of the development in this area that well written reasonably comprehensive text books are available; for example Bishop,Fienberg,and Holland (1), and Fienberg (2). Powerful computer programs for a variety of analyses are also available from several sources. Methods of analysis based on weighted least squares, maximum likelihood and minimum discrimination information are being vigorously pursued.
These methods may yield different analytical results in small samples even though they are asymptotically equivalent. However, the small sample properties of these approaches still represents an area of research where not much definitive is known.
In the authors' opinion, the importance of these developments lie in their ability to go beyond the usual tests of association by permitting the examination of more specific hypotheses about single or several multiway tables. Their generality allows data of different types for which the conventional hypoth-*Departmnent of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27514. eses of association or homogeneity are of little interest to be analyzed in ways that are more relevant.
In most cases estimation and testing can be performed by weighted least squares (WLS), maximum likelihood (ML), or minimum discrimination information. In some instances WLS and ML can be combined in useful ways.
In this paper we shall present some examples of the application of WLS methods to data arising in epidemiological investigations.

Notation
To fix ideas, consider the hypothetical data shown in Table 1 and the expected cell probabilities shown in Table 2. The development which follows is based on the methodology discussed in more detail in Grizzle, Starmer, and Koch (3), which hereafter is abbreviated as GSK.  However, for some types of data, if some of the nij = 0, S will be of rank less than u. Therefore, if difficulty is created by an occasional nij = 0, we follow Berkson (4) and suggest that it be replaced by l/r.
This has the effect of making the estimate of m.u be l/rni., which is the extension of Berkson's procedure to the multinomial case. However, we have made no extensive investigation of the effect of this rule in the multinomial case such as Berkson did for the binomial case. Alternatively, the combined WLS and ML methods described in Koch et al. (5) can be used to bypass this type of difficulty (see Example 4).

Estimation and Testing
We assume that (1) where X is a known design matrix (which may be different from the usual design matrix in the sense of reflecting a multivariate framework when more than one function is constructed within each population as will be illustrated later) of rank v c u and /3 is a vector of unknown parameters. Several workers have shown that if the hypothesized model fits the data, a best asymptotic normal (BAN) estimate of /8 is given by b, when b is the vector which minimizes (F -Xb)' S-1 (F -Xb). The minimum value of this form may be used to test the fit of the model F(7r) = Xf3. Given that the presumed model provides an adequate fit to the data, a test of the hypothesis Ho: C,8 = 0 is produced by conventional methods of weighted multiple regression, where C is a (d x v) matrix of arbitrary constants of full rank d ' v.
The test statistic for the fit of the model is which has asymptotically a (central) x2-distribution with D.F. = (uv) if the model fits, where b = (X'S-1X)-l X'S-1F. Given the model, the test of the hypothesis Ho: C,8 = 0 is produced by which has asymptotically a x2-distribution with D.F. = d if Ho is true. In many cases there is only one population, and the objective of the statistical analysis is to study the relationships among several ways ofclassification of Environmental Health Perspectives the sample units. Many tests appropriate to this problem can be formulated as This fits into the general framework by setting X = Iu, the identity matrix so that b = F and using C = lu so that the test statistic is F'S-1F, which has asymptotically a x2-distribution with D. F. = u if Ho is true.

Special Cases off(r)
The form of S depends on H and through H on the function F(n-). Therefore for each family of functions F(r), S will be different.
is of rank u c s(r -1); the ar1ij are arbitrary constants. where the ayj and kay are the appropriate elements of A and K, respectively. Here, K is a matrix of arbitrary constants of rank t sj c rs. Some care must be exercised to make sure that the H associated with the functions described above is offull rank (i.e., of rank u for the linear case and of rank t for the logarithmic case). The

Ixamination of Several Distributions
Multiway contingency tables often have some of the ways of classification fixed in advance. These tables correspond to designs in which the response to each 'treatment' is represented by a multiway table whose entries have a multinomial distribution. Often the problem is to determine how functions of the multiway table being considered as the response depend on the design variables. An example of data of this type is given in Table 3.
These data were collected in an international study of atherosclerosis (6). The complete table is an example with two dependent variables (responses), infarct and myocardial scar, which we denote by i andj, respectively, and two independent variables (in the regression sense), age and the combination location and race, denoted k and 1. Within each combination of the independent variables we have a 2 x 2 subtable with observed cell frequencies as shown in Table 4. Neither of the two marginal totals for this subtable is considered fixed and the corresponding expected cell probabilities are 7r.kl wherelij ZTikl = 1 for all combinations of k and 1. If infarct and scar are independent within this subtable 7T1lklT22klI17T2klT21kl = 1 (7) or taking the logarithm Pkl = In 7lTlkl -In T12kl -In T21k1 + In lT22ki =0 (8) This suggests using Ukl, the sample estimate of Tkl, as a measure of association. In this problem it would be informative to investigate how Uki depends on age and the combination race-location.
We consider the model E(uk) = .* + a** + S*, (9) where ,u,* is the overall mean effect, at, k = 1,2,3,4, is the effect of the k-th age group and ,f3*, I = 1,2,3, is the effect of the l-th location-race combination. We reparametrize to incorporate the restrictions a*k = 0 and f31 = 0, which is equivalent to calculating the estimates from: where the elements of u are the appropriate Ukl values taken from Table 3. We substituted (1/4) for the zero nl143 in order to avoid a singular S matrix. The remainder of the analysis is identical to weighted multiple regression. In this case the analysis is particularly simple since S is a diagonal matrix with lij (l1lnUkd) on the diagonal.
The residual from fitting the model is the age x location-race interaction with respect to the measure of association u. In addition, we can investigate how al 1a2 = Xf a31 182 this measure of association depends on age and the location-race combination.
The estimated parameters and the analysis of variance are shown in Tables 5 and 6, respectively.
Given the model the residual sum of squares is distributed as x2 with D.F. = 6; thus we conclude that the model fits the data adequately. The various other sums of squares are calculated using the general hypothesis form Ce = 0. Thus yields the test of homogeneity of age groups. To test for approximate linearity of the age effects (note that the last age group covers only a 5-year span), we chose the linear contrast -3a' -at + al! + 3al = 0. Taking into account the restrictions on the estimates, we find the contrast is estimated by -6CYl -4a&2 -2a3. For testing we might equally well choose 3ai + 2a2 + a3, which implies that we could use C = (0, 3, 2, 1, 0, 0,). We can produce other tests similarly.
We conclude from the analysis that there is no age Environmental Health Perspectives E{u} = E by location-race interaction and that the measure of association varies linearly with age. The major difference in race and location combination is between Oslo residents and New Orelans residents as shown by the test statistic for the contrast (,81 -2,81 + f83 = 0). The above method of analysis can be extended readily to contingency tables having subtables of any number of dimensions with or without some of the probabilities being zero because of a priori contraints. More than one function of the probabilities of the subtables can be considered dependent variables as in regression and we can examine how they depend on the independent variables. Example 2: Analysis of Selected Features of Life Table. This example has been discussed previously by Koch, Johnson, and Tolley (7). They show how to set up a life table in contingency table form that is amenable to analysis by WLS methods. Then they proceed to show the analysis of selected features of a life table can be examined in more detail.
The data were originally discussed by Zippin (8) in the context of a project carried out by the End Results Group of the National Cancer Institute, to compare two classification schemes for breast cancer-one due to the International Union Against Cancer and the other due to the American Joint Committee on Cancer Staging and End Results Reporting (9). The data on a subgroup of 1233 of the original 2039 were analyzed by Cutler and Myers (10) as a statistical examination ofthe classification of the extent of disease in cancer of the breast.
In their analysis, the cases were classified into 18 subgroups corresponding to the following classification: b. More than 2 cm but less than 4 cm (T2) c. More than 4 cm (T3) Table 7 shows the five year survival rates, and their standard errors for each of the 13 groups.
The variation in the five year survival rates with respect to the factors S, N, and T can be investigated by using the WLS methodology in GSK. Let F (Qr) = G (12) denote the vector of u = 18 five-year survival rates. Then linear regression models of the form F (r) = X,/ (13) can be fitted by weighted least squares where X is a specified (18 x v) coefficient or design matrix of rank v c 18, f8 is the corresponding vector of parameters to be estimated, and weighting is with respect to the reciprocals of the appropriate estimated variances. A goodness-of-fit test of the model and tests of hypotheses are obtained by applying the methods outlined above with F and S replaced by G and VG. The first model fitted has a matrix of independent variables. This is a saturated model since the number of effects (i.e., columns of Xi) is equal to the dimension of G. The definition of the columns of Xi is to some exent arbitrary. However the tests shown in Table 8 are unique in that any other definition of the columns of Xi that preserved the same vector spaces for the respective sets of effects would yield the same sums of squares. The analysis shown in Table 8 leads to conclusions similar to those obtained by Culter and Myers.
Because of the lack of significance of the two and three way interactions, it is reasonable to fit a model which contains main effects only. This model has a matrix of independent variates X2 which is shown below. X2 = 1i 1 -1 The results of this analysis are displayed in Table 9. As hoped, the residual goodness of fit is not significant. Tests on the parameters corresponding to X2 indicate significant (a = 0.01) main effects for nodes and tumor size. Predicted values based on X2 appear in column 5 of Table 7. Large residuals with absolute values in excess of 0.05 occur for SiNoT3, SiNiTi, SJN1T2, SiN1T3, S2NoTi, S2NoT3, S2N1T2. Although the goodness-of-fit test for this model is nonsignificant, its predictive value is not good. From this point onwards Koch, Johnson, and Tolley take a different approach. They note that the most important sources of variation are the main effects of nodes and tumor size, but that in addition, the main effects of skin fixation and the node x tumor size interaction have sizeable sums of squares. They proceed to examine node status within each degree of skin fixation and tumor size within the S x N classifications. Ultimately they arrive at the model defined by X3 where with DF = 2.76 implies that this model is a good fit.
The predicted values derived on the basis OfX3 are given in Table 7. In addition, the residuals for this  Table 7 and are all less than 0.05 in absolute value except for the S2N1T2 combination. Thus, they conclude that this model provides a reasonably complete explanation of the variation of the five-year survival rates as a function of the factors skin fixation, node status, and tumor size. The standard errors of the predicted values are estimated by the square roots of the diagonal elements of the matrix VG) = X[X'VG-lX]-YX' (16) They are shown in Table 10. An alternative to following a cohort and observing them until the occurrence of some well defined event is the study of separate samples drawn from nonoverlapping subpopulations each of which corresponds to a specific range of values (e.g., age range) for the overall time period of exposure to risk for the occurrence of the vital event of interest. This sampling is presumed to take place cross-sectionally at a single instant in time when subjects are examined for the presence of the event. An example of such data is shown in Table 11. These data, from Ashford and Sowden (11), are from a survey of working coal miners at a representative sample of collieries distributed throughout the United Kingdom. Each subject was classified as to whetherhe reported the symptoms ofbreathlessness and wheeze. Then under the assumption of (1) no remission from the event of interest, i.e., once an individual is observed with the event, he is never observed subsequently without it; (2) migration behavior being statistically independent of the event of interest; (3) occurrence rates for the event of interest being constant over time; the data for the period study in Table 11 can be regarded as a synthetic life table. An additional convenient assumption is (4) the survival rates associated with the respective time points throughout thej-th interval are symmetrically distributed with respect to the midpoint of the interval.
When the conditions (1)-(4) are applicable, the synthetic life table associated with a period studycan be analyzed by the same probability models that underly cohort studies. A result of these assumptions is that cohort studies and period studies as well as life table analysis and contingency table analysis can be unified into a common methodological framework.
In the analysis that follows, due to Freeman, Freeman and Koch (12), the Weibull distribution was fitted to a cross-sectional (period) study. In this context when t represents the time to the occurrence of an event of interest (e.g., a death or the detection of a tumor), then the Weibull cumulative distribution function may be written as: G(t1I,,S, w) = 1 -exp { -,u(tw)6} ,u, 8 0 for t 2 w; (17) with the interpretation of the parameters, (A,S,w) dependent on the type of data being analyzed. Most applications of Weibull type distributions is in circumstances in which a carcinogen is applied in a relatively uniform and continuous manner (for example, weekly skin paintings), and the variable of interest is the time to appearance of a tumor. Peto, Lee, and Paige (13) interpret the Weibull parameters as follows: ,u is a rate-determining scale parameter and 8 and w characterize the process by which the tumor develops. Accordingly, hypotheses concerning ,u are appropriate for examining whether carcinogens have different intensities, while differences among the 8 indicate different processes. From Eq. (17), it follows that ln {1 -G(tig, 8, w)} = -,u(tw)8 (18) wheret refers either to time or age. It is assumed that w has a fixed known value (e.g., w = 0) which can be justified. As a result, a linear model involving the parameters (In ,u) and 8 can be obtained by multiplying both sides of Eq. (17) by (-1) and then applying logarithmic transformations a second time which results in = In i + 8 ln (tw) (19) The weighted least-squares methodology in GSK can be used to fit the model (19) to sample estimates of 0(t) for which consistent estimates of variance can be obtained by methods described in Forthofer and Koch (14). To illustrate this method the data in Table   11 are used. These data have also been analyzed in a number of other papers (15)(16)(17). Letx denote age as it ranges from birth (x = 0) to 100 years in five-year intervals. Let j index these five-year age groups by corresponding to the right endpoint of the age group so that j' = {X 1,2,3, forx= 5, 10, 15, .. ., respectively. Since this is a period study, it is required that the parameter w be used to shift the survival probabilities back to the midpoint of the age interval. Let Pj denote the pro-  Table 12. The goodness of fit x2statistic indicates a satisfactory fit and the test of degeneracy of the model (Ho: 8 = 0) is highly significant CIable 13).
The original data given by Ashford and Sowden make it possible to consider two symptoms. Using the formulation for the breathlessness margin, it is possible to fit Weibull models to each margin and some appropriate measure of association. Freeman, Freeman, and Koch (12) present this method in more detail.  The data in Table 14 which appeared in Sugiura and Otake (18) show the number of deaths from leukemia (LD) observed at the Atomic Bomb Casualty Commission (ABCC) and the number of individuals who did not die from leukemia (NLD) during  according to age at the time of the atomic bomb and the estimated radiation dosage. The analysis that follows is taken from Landis, Heyman, and Koch (19).
The data in Table 14 involve s = 6 subpopulations (dose), r = 2 response categories (survival status), and q = S levels of the covariable (age).
For the first step of the analysis the logit model f(hi)h= In (ffhi1I7rhi2) = y + Ah + f (25) forh = 1,2, . ..,Sandi = 1,2, . . . ,6isassumed using either the IPF algorithm discussed in Bishop et al. (1) or the Newton-Raphson iteration procedure illustrated by Sugiura and Otake (18). For these data in Table 14, the maximum likelihood cell estimates under this model are as shown in Table 15. The goodness of fit statistic for model is QL = 27.8 for the likelihood ratio test or QP = 27.6 for the Pearson chi-square test. Alternatively, using the WLS approach outlined in GSK, the goodness-of-fit statistic is QwLs = 22.9. In each case, these statistics asymptotically follow the chi-square distribution with DF = 20 under the hypothesis of no dose x age interaction. As a result, this hypothesis is accepted at the a = 0.10 level of statistical significance. However, it should be noted that many of the expected cell frequencies in Table  15 are rather small (i.e., less than 5), which casts doubt on the strict validity of the goodness-of-fit statistic. We will, however, assume the model with no second order interaction is adequate.  16 -16 (28) where B' = [05, Is], 1r denotes a vector of r ones and Or denotes a vector of r zeros. The estimated parameters, together with their estimated standard errors, are shown in Table 16 and the ANOVA table is  shown in Table 17. These results are identical to those obtained by Sugiura and Otake.
By the two-stage ML/WLS method of Koch et al. (5), stepwise model-fitting techniques can be used to arrive at a model which best reflects the variation in the data with a minimum number of parameters. Since age was shown to have a nonsignificant effect it was removed from the model. The new analysis shows that the smoothed probability of leukemia on the logit scale was best characterized by a linear model on the square root of the midpoint on the radiation dosage scale. The final model had a matrix Table 14. Deaths from leukemia observed at ABCC   The ANOVA table is shown in Table 18. Analogous results would be anticipated for a ML fit by Newton-Raphson or some other direct optimization method.
Finally, the observed and fitted proportions of death from leukemia are shown in Table 19 for each mdiation dose subpopulation.  cNot defined because FARM is applied to no dose x age interaction model predicted frequencies.