Designing case-control studies.

Identification of confounding factors, evaluation of their influence on cause-effect associations, and the introduction of appropriate ways to account for these factors are important considerations in designing case-control studies. This paper presents designs useful for these purposes, after first providing a statistical definition of a confounding factor. Differences in the ability to identify and evaluate confounding factors and estimate disease risk between designs employing stratification (matching) and designs randomly sampling cases and controls are noted. Linear logistic models for the analysis of data from such designs are described and are shown to liberalize design requirements and to increase relative risk estimation efficiency. The methods are applied to data from a multiple factor investigation of lung cancer patients and controls.


Introduction
Case-control studies play an essential role in studying cause-effect relationships in human populations (1)(2)(3). Applications of these studies are becoming more and more complex, as was pointed out by McKinlay (4) in her recent review, with emphasis increasingly being given to the investigation and estimation of multivariate sources of variation. Thus modern multivariate statistical techniques could and should be applied in both the design and analysis of such case-control studies. This requires that statisticians understand many important ideas traditionally developed in epidemiology and that epidemiologists obtain a knowledge of complicated multivariate statistical techniques. It is hoped that this paper, written by a mathematical statistician beginning the study of epidemiology, may aid epidemiologists and statisticians in their mutual understanding.
The paper reviews recent developments in the design of case-control studies, including confounding, overmatching, and effect modification from a theoretical viewpoint after introducing a statistical definition of a confounding factor. Methods of identification of confounding factors, evaluation of their influence on the measurement of cause-effect associations, and a method to control for their influence are discussed. Linear logistic models to aid in this process are introduced and applied to the analysis of a set of data from lung cancer patients and controls.

Case-Control Studies
Let us consider the exposure and disease association in the population. Table 1 provides an example of the distribution of a rare disease and exposure to a single substance in the population; the prevalence rate of disease is 55/100,000, and half the population is exposed to the factor.
If the marginal column totals are fixed, then we have the cell probabilities given in Table 2. Table 2 suggests that if equal numbers of exposed and unexposed individuals were to be followed, well over 10,000 unexposed persons would be required before cases of disease could be expected. This type of follow-up study is called a prospective or cohort study.

Table 3. Probabilities of the exposure in Table 1 when the marginal of the disease is fixed.

                Unexposed   Exposed   Total
Disease         0.1         0.9       1
Disease-free    0.5         0.5       1

On the other hand, if the row marginal totals are fixed, then we have the cell probabilities given in Table 3. These numbers suggest that under 100 diseased and disease-free individuals would be required. Such a study is called a retrospective, or case-control study, since past exposure to the factor is determined retrospectively among diseased and disease-free individuals. MacMahon and Pugh (3) have discussed several reasons for their preference of the terms "cohort" and "case-control" over the terms "prospective" and "retrospective." We shall follow their preference throughout this paper.
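The contrast between fixing the exposure margin (cohort) and fixing the disease margin (case-control) can be sketched numerically. The counts below are hypothetical: they are chosen only to agree with the stated prevalence of 55/100,000, the half-exposed population, and the rounded entries of Table 3, and are not taken from Table 1 itself.

```python
# Hypothetical population counts (an assumption, not the paper's Table 1):
# 100,000 persons, half exposed, 55 diseased, split 50 exposed / 5 unexposed.
population = {
    ("disease", "exposed"): 50,
    ("disease", "unexposed"): 5,
    ("disease_free", "exposed"): 49_950,
    ("disease_free", "unexposed"): 49_995,
}

def conditional(table, fixed_axis):
    """Cell probabilities with one margin fixed (0 = disease status, 1 = exposure)."""
    totals = {}
    for key, n in table.items():
        totals[key[fixed_axis]] = totals.get(key[fixed_axis], 0) + n
    return {key: n / totals[key[fixed_axis]] for key, n in table.items()}

# Cohort view: probability of disease status given exposure status.
cohort_view = conditional(population, fixed_axis=1)
# Case-control view: probability of exposure given disease status.
case_control_view = conditional(population, fixed_axis=0)

risk_unexposed = cohort_view[("disease", "unexposed")]
print(f"P(D | unexposed)            = {risk_unexposed:.5f}")
print(f"unexposed persons per case  = {1 / risk_unexposed:.0f}")
print(f"P(exposed | diseased)       = {case_control_view[('disease', 'exposed')]:.3f}")
print(f"P(exposed | disease-free)   = {case_control_view[('disease_free', 'exposed')]:.3f}")
```

Under these assumed counts the cohort probabilities are of order 1 in 10,000, while the case-control probabilities are of order 0.1 to 0.9, which is why far fewer subjects are needed when the disease margin is fixed.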
Case-control studies may be, as was shown in the above example and had been pointed out by Mantel and Haenszel (1), the only feasible approach to the study of cause-effect association for especially rare diseases, since a cohort study may prove too expensive to consider, and the study size required to obtain a respectable number of cases completely unmanageable.
Both case-control studies and cohort studies are able to study only cause-effect association, not prove cause-effect relationships. Mantel and Haenszel (1) have warned that "the findings of a retrospective study are necessarily in the form of statements about association between diseases and factors, rather than about cause and effect relationships." Such studies play an important role in the chain of scientific investigation of suspected cause-effect relationships. They are a part of the cyclic process of formulating hypotheses, examining the hypotheses against existing data, and then (testing) the hypotheses through various epidemiologic and experimental studies. The most significant purpose of epidemiology is the prevention of disease. For that purpose it may not be necessary to identify the causal factors precisely.
Recognition of a cause-effect association, which is sometimes called epidemiologic association, can play an essential role in the prevention of disease. MacMahon and Pugh (3) made this point as follows: "The evaluation of the causal nature of a relationship, in the absence of direct experiment, is neither easy nor objective. Differences of opinion resulting from the subjective assembly and interpretation of evidence are common. Caution in judging relationships to be causal is laudable. On occasion, however, such caution appears to be carried to an unrealistic extreme. When the derivation of experimental evidence is either impracticable or unethical, there comes a point in the accumulation of evidence when it is more prudent to act on the basis that the association is causal rather than to await further evidence. If there is controversy or argument, it should center around the decision as to where this point lies, and not on the unanswerable question of whether the causal hypothesis is not proven." When marked increases in disease frequency in a short period of time are observed, sudden exposure to a single factor can generally be suspected, and it would not be difficult to elucidate the cause-effect association by case-control studies. Applications of such studies to the more difficult problems of cancer epidemiology were begun in 1950, and the usefulness of this approach was established in the much-publicized studies clarifying the smoking and lung cancer relationship. Since the publication of the milestone paper by Mantel and Haenszel (1), which provided a methodology for the design and analysis of the modern case-control study, studies have been undertaken to examine cause-effect associations with cancers of almost all sites.
Application of case-control studies to cancer epidemiology requires careful attention in the design of such studies, since effects of confounding variables such as sex and age, measurement errors, selection of controls, etc., could exaggerate or mask the association. One limitation of the case-control study is that it often depends upon information retrieved from the memories of individuals, or from poorly written documents. Because of these problems, case-control studies are often considered to be inferior to cohort studies. But where cancer epidemiology is concerned, this may not be true. Such problems may just be common features of studying human populations. Even if we could devise randomization or stratification in cohort studies, we would have no choice but to await the onset of disease. If the disease had a long latency period, follow-up could prove to be difficult or impossible, and we could expect to face problems similar to those generated by case-control studies. Problems associated with case-control studies have been discussed by many authors, including Mantel and Haenszel (1), Cochran (5), MacMahon and Pugh (3), Lilienfeld (6), and others. A review and an extensive list of papers on the design and analysis of observational studies have been published by .
In the following sections we shall use the tools of theoretical statistics to examine various ideas which were introduced mainly by epidemiologists; emphasis will be placed on confounding, effect modification, and the logistic linear model, all of which are important in the design and analysis of case-control studies.

A Measure of Association
We shall introduce a measure of association between an exposure and the disease we wish to study. Since a primary goal of a case-control study is to reach the same conclusion as would have been obtained from a cohort study, if one had been done under complete control, we choose to define the measure within a prospective framework. Let $P(D|E)$ [$P(D|\bar{E})$] be the probability of disease in an individual previously exposed (unexposed) to a factor, and let $P(\bar{D}|E)$ [$P(\bar{D}|\bar{E})$] be the probability of being disease-free for an individual previously exposed (unexposed) to the factor. The relative risk RR of disease due to the factor is defined by Eq. (1):

$$RR = \frac{P(D|E)}{P(D|\bar{E})} \qquad (1)$$

Cornfield (7) showed that if the prevalence of the disease is small enough, the relative risk can be approximated by the odds ratio

$$\psi = \frac{P(D|E)\,P(\bar{D}|\bar{E})}{P(D|\bar{E})\,P(\bar{D}|E)} \qquad (2)$$

It follows from Bayes' theorem that $\psi$ can be rewritten as

$$\psi = \frac{P(E|D)\,P(\bar{E}|\bar{D})}{P(\bar{E}|D)\,P(E|\bar{D})}$$

This representation shows that $\psi$ may be estimated by a case-control study. It provides, therefore, a rationale for replacing an idealized cohort study with a case-control framework. Berkson (8) pointed out that the relative risk measure has several drawbacks. However, the other measures do not have the invariance property of $\psi$, or of a function of it, and require outside knowledge which is frequently unobtainable from a case-control study. This and other problems of measures of association are discussed in Fleiss (9).
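A short numerical check may make the two properties of the odds ratio concrete: it is the same whether computed from disease probabilities given exposure or from exposure probabilities given disease, and it is close to the relative risk when the disease is rare. The counts are hypothetical and serve only as an illustration.

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

# Hypothetical population counts (assumed for illustration only).
a, b = 50, 49_950   # exposed:   diseased, disease-free
c, d = 5, 49_995    # unexposed: diseased, disease-free

# Prospective quantities: probabilities of disease given exposure status.
p_d_given_e = a / (a + b)
p_d_given_not_e = c / (c + d)
relative_risk = p_d_given_e / p_d_given_not_e
psi_prospective = odds(p_d_given_e) / odds(p_d_given_not_e)

# Retrospective quantities: probabilities of exposure given disease status,
# which are what a case-control study can estimate.
p_e_given_d = a / (a + c)
p_e_given_not_d = b / (b + d)
psi_retrospective = odds(p_e_given_d) / odds(p_e_given_not_d)

print(f"relative risk           = {relative_risk:.4f}")
print(f"odds ratio (cohort)     = {psi_prospective:.4f}")
print(f"odds ratio (case-ctrl)  = {psi_retrospective:.4f}")
```

Both odds ratios reduce to the cross-product ratio ad/(bc), so they agree up to floating-point rounding; the gap between the odds ratio and the relative risk shrinks as the disease becomes rarer.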

Confounding Factors
It is well known that an exposure and disease association, such as that between smoking and lung cancer, is often influenced by such factors as sex, age, ethnic group, and others. Epidemiologists often term them confounding factors. The influence of confounding factors must be eliminated, either through procedures for selecting controls (by matching the controls with respect to the relevant factors) or in the analysis. However, neither an explicit definition of confounding factor nor a definitive method of evaluating its influence upon exposure and disease association has been given. In fact, which factors among many should be selected for case-control matching in studying exposure and disease association remains one of the most confusing and troublesome problems in the design of case-control studies. For example, matching on those factors known or strongly suspected to be related to disease occurrence was suggested by Mantel and Haenszel (1) and Worcester (10), among many others, whereas Miettinen (11) suggested matching on factors related to both exposure and disease. Hardy and White (12) emphasized matching factors related to exposure, although they generally agreed with Miettinen.
Care must be taken in using this terminology. As was pointed out by Fisher and Patil (13), the phrases "related to disease" and "related to exposure," as used in the Miettinen article, are ambiguous and can be understood in several different ways. To resolve this difficulty, we shall give a statistical definition of "confounding factor" and consider its relation to "relatedness."
Let $z$ be a third variable. Assume for simplicity that $z$ is a dichotomous variable (such as sex) taking on two values, $z_1$ (male) and $z_2$ (female). Let $P(D|E,z)$, $P(D|\bar{E},z)$, $P(\bar{D}|E,z)$, and $P(\bar{D}|\bar{E},z)$ be the probabilities of being diseased or disease-free among individuals exposed or unexposed to the factor E, as a function of $z$. Then the odds ratio within the stratum $z$ is

$$\psi(z) = \frac{P(D|E,z)\,P(\bar{D}|\bar{E},z)}{P(D|\bar{E},z)\,P(\bar{D}|E,z)}$$

and the odds ratio $\psi$ of Eq. (2) can be written as

$$\psi = \frac{\left[\sum_z P(D|E,z)\,g(z)\right]\left[\sum_z P(\bar{D}|\bar{E},z)\,h(z)\right]}{\left[\sum_z P(D|\bar{E},z)\,h(z)\right]\left[\sum_z P(\bar{D}|E,z)\,g(z)\right]}$$

where $g(z)$ [$h(z)$] is the distribution of $z$ in the exposed (unexposed) population. We may take $g(z) = h(z)$ by such devices as stratification or matching, yet it is clear that $\psi$ is influenced by the distribution of $z$. It is not necessary that $\psi = \psi(z_1) = \psi(z_2)$ hold. For example, let us consider the data given in Table 4.
In the above example it would be reasonable to accept $\psi(z_1) = \psi(z_2) = 5.06$ as a proper association of the exposure and the disease, and to consider $\psi = 1.05$ as an improper association biased by the confounding factor $z$; in other words, we may say that the influence of the confounding factor $z$ on $\psi$ is blocked by the stratification on $z$.
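The same phenomenon can be reproduced with a small calculation. The counts below are hypothetical (they are not the data of Table 4): both strata have an odds ratio of 5, yet the table pooled over $z$ gives an odds ratio below 1. The Mantel-Haenszel estimator (1) is used here as one common choice of summary statistic across strata; the function and variable names are illustrative.

```python
def odds_ratio(a, b, c, d):
    """a, b = exposed and unexposed cases; c, d = exposed and unexposed controls."""
    return (a * d) / (b * c)

def mantel_haenszel(tables):
    """Mantel-Haenszel summary odds ratio over a collection of 2 x 2 tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

# Hypothetical strata: (exposed cases, unexposed cases,
#                       exposed controls, unexposed controls)
strata = {
    "z1": (30, 10, 150, 250),
    "z2": (20, 140, 5, 175),
}

stratum_or = {z: odds_ratio(*table) for z, table in strata.items()}

# Collapse over z by adding the corresponding cells of the two tables.
pooled_table = tuple(sum(cells) for cells in zip(*strata.values()))
pooled_or = odds_ratio(*pooled_table)

summary_or = mantel_haenszel(strata.values())

print("stratum odds ratios:", stratum_or)
print(f"pooled odds ratio (ignoring z): {pooled_or:.3f}")
print(f"Mantel-Haenszel summary:        {summary_or:.3f}")
```

In this constructed example $z$ is associated with both exposure and disease, so ignoring it distorts the association, while the stratified summary recovers the common within-stratum value.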
Stratification is applied regularly to block the influence of confounding variables. Note that the matched pairs design is an extreme form of stratification, where only a case and a control are in each stratum. Generally, a 2 x 2 table is constructed for each stratum, the odds ratio is estimated and tested, and a summary statistic is calculated to summarize results obtained from all strata. Identification of confounding variables is a most difficult step in this procedure. Even if we could identify them successfully, we occasionally must ignore some factors whose influence on the association is not strong, especially if the number of cases is not large. For example, if the number of confounding variables were 10, then we would have to distribute cases among at least $2^{10} = 1024$ strata, an unfortunate situation if the number of cases were, for example, 300 or so. Therefore, in designing such a study, identification of confounding variables that exist in studying the exposure-disease relationship, evaluation of the strength of their influence, and introduction of efficient devices, such as matching, stratification, or others, to block their influence on the measure of association are essential. Next we shall consider the work of Miettinen in relation to the term confounding factor as defined above. The terms "related" and "unrelated" are defined as follows.
DEFINITION: $z$ is said to be related to disease when at least one of the probabilities $P(D|E,z)$ and $P(D|\bar{E},z)$ depends on $z$, i.e., altering the value of $z$ changes the probability of disease among exposed or among unexposed individuals. $z$ is said to be related to exposure when at least one of the probabilities $P(E|D,z)$ and $P(E|\bar{D},z)$ depends on $z$. If $z$ is not related to disease, i.e., neither $P(D|E,z)$ nor $P(D|\bar{E},z)$ depends on $z$, $z$ is said to be unrelated to disease. Similarly, if $z$ is not related to exposure, $z$ is said to be unrelated to exposure.
It may be proved under general conditions that $\psi(z) = \psi$ for any value of $z$ if and only if $z$ is unrelated to at least one of the entities exposure and disease. Therefore from our definition of a confounding factor we are led to the same conclusion as that of Miettinen: a confounding factor is one related to both exposure and disease. Although it is difficult to check whether the variable $z$ is related to exposure in a case-control framework, it would be extremely difficult to check whether $z$ is related to disease.
Note that $P(D|E,z)$ is the absolute risk of disease due to exposure to the factor. Generally, it is impossible to study absolute risk from a case-control framework unless further information is obtained from outside knowledge.
Fortunately, however, the interpretation of "related" which will be given below makes it possible to identify a confounding factor and to evaluate its influence on a cause-effect association, even from a case-control study. Let us consider Table 5 showing the joint distribution of exposure to a factor in cases and in controls. Therefore, $z$ is a confounding factor if and only if both of Eqs. (10) and (11) are violated. The magnitude of the violations reflects the strength of the influence of the confounding factor. The last point will be discussed further in the remaining sections. Note that the above table for the joint distribution is not stratified on $z$. So long as stratification and 2 x 2 table analysis are used in a case-control study, it is not feasible to check whether the factor is related to disease or not.

Environmental Health Perspectives

Overmatching
Miettinen (11) has considered another important problem: overmatching. If a factor $z$ is unrelated to exposure, nothing is changed by matching on $z$. Thus matching is futile. However, if a factor $z$ is unrelated to disease but related to exposure, matching by $z$ decreases the efficiency (i.e., increases the variance) of the estimated relative risk, although it does not change the value of the estimated relative risk itself. It can be proved that the stronger the relation to exposure, i.e., the larger the value of $\psi(Ez|D)$, the greater the decrease in efficiency. Thus in such a situation matching is harmful and should be avoided. This is the situation of overmatching discussed by Miettinen.
MacMahon and Pugh (3) suggested another case of overmatching: "Variables intermediate in the causal pathway between the study factor and the disease should not be matched. For example, if smoking altered blood cholesterol, which in turn was causally associated with cardiovascular disease, smoking would be considered a cause of cardiovascular disease. Yet, in a case-control study, if cases and controls are matched on cholesterol levels, no association of the disease with smoking would emerge." This suggests that, although blood cholesterol is a confounding factor, it should not be used for matching. Here we find one weakness of our statistical definition of a confounding factor. It is not feasible in the present framework to check whether the factor is intermediate in the causal pathway or not. This is essentially a point which must be resolved through medical knowledge.
As an illustration, let us consider the data summarized in Table 6. We have $\psi(z_1) = \psi(z_2) = 1.0$, whereas $\psi = 5.0$. This would be an example of overmatching if $z$ were an intermediate factor in the causal pathway. However, if this is not the case, avoidance of matching provides a spurious association. From Table 6B we have $\psi(Dz) = 405.8$. These figures indicate that $z$ is related to both exposure and disease, i.e., that $z$ is a confounding factor. Studying the relation of $z$ to the disease could be more important than studying the present exposure and disease association, since the large values of $\psi(Dz|E)$ and $\psi(Dz|\bar{E})$ indicate that $z$ is a predictor of the disease. It might be suspected that $\psi(z_1) = \psi(z_2) < \psi$ because the data were matched on a predictor of the disease. However, this is not true. Roughly, the strength of the influence of the confounding factor $z$ upon cause-effect association can be measured by the absolute value of $\tau$, and if $\tau < 0$, then $\psi(z_1) = \psi(z_2) > \psi$. This suggests that when $z$ is a predictor of the disease, it is not its role as predictor but rather its relation to the exposure that leads to an under- or overestimation problem. Thus the strength of the association of $z$ with disease status is not logically related to overmatching. In concluding this section we emphasize the necessity of checking whether a factor which is identified by our statistical methods as a confounding factor is an intermediate factor in the causal pathway before matching upon it.

Effect Modification
The statement "$z$ is related to exposure" was defined in the previous section by "at least one of $P(E|D,z)$ and $P(E|\bar{D},z)$ depends on $z$." It is not unnatural to suppose that the influence of $z = z_i$ on the exposure probability among cases is equal to that among controls for any fixed $z$, so that if $P(E|D,z)$ depends on $z$, $P(E|\bar{D},z)$ also depends on $z$, and vice versa. The principle of pairwise matching (stratification), where a control with the same value of $z$ as a case is selected for comparison, seems to have been based upon this idea. Cox's model (14) to prove the optimality of the McNemar test for matched pairs data, Cornfield and Haenszel's discussion (15) of the relative risk estimator for matched pairs data, Gart's method (16) of calculating a summary statistic by estimating the common odds ratio by strata, and many other studies have all assumed it implicitly or explicitly. However, this is not true in general. Miettinen (17) noted this fact and introduced effect modification for the purpose of explaining it. That effect modification is equivalent to second-order interaction is well known among statisticians.
Let $\psi(z_1)$ and $\psi(z_2)$ be the relative risks of disease associated with exposure in strata $z_1$ and $z_2$. If $\psi(z_1) \neq \psi(z_2)$, then we can say that the effect of exposure upon disease status in stratum $z_1$ is not equal to that in stratum $z_2$ (because of the existence of second-order interaction). Such a factor $z$ has been called an effect modifier. The magnitude of effect modification is measured by either e.m.
where $\psi(Ez|D)$, $\psi(Ez|\bar{D})$, and $\psi(Dz|E)$ are defined as in the previous section. Effect modification will be discussed further in the next section.

A Model with Two Risk Factors to Illustrate Confounding and Effect Modification
The following discussion regarding the joint effect of two risk factors in inducing disease should clarify understanding of confounding and effect modification.
Let A and B be factors suspected of inducing disease. Let us suppose for simplicity that both of them are dichotomous. Table 7 D[4,B) is the relative risk due to B among those exposed to A. If these relative risks
where $\gamma = 0$, $\gamma > 0$ and $\gamma < 0$ indicate no interaction, positive interaction, and negative interaction, respectively.
Next, let us recast this example in a case-control framework. The data are presented in Table 8. Further, let us accept Cornfield's assumption that the prevalence of this disease is small enough so that the relative risks are approximated by the corresponding odds ratios. Model (15) in a prospective framework is, therefore, equivalent to (18) and (19) in a case-control framework, under Cornfield's assumption. The parameters of interest are $\lambda_A$, the log relative risk of A, $\lambda_B$, the log relative risk of B, and their interaction $\gamma$. Thus parameters $\mu_A$, $\mu_B$, and $\alpha$ are nuisance parameters introduced by the case-control framework.
Finally, let us suppose that cases and controls have been stratified in the design by means of the factorA, i.e., unexposed and exposed to A. Then we have Table 9.
Relative risks due to B within strata $\bar{A}$ and $A$ are given by $\psi(\bar{A}) = \exp\{\lambda_B\}$ and $\psi(A) = \exp\{\lambda_B + \gamma\}$, respectively. Summarizing the above discussion, we may conclude that $\mu_A$ and $\lambda_A$ are deleted from model (19) when we stratify on factor A; in other words, as has been well known, we should not stratify (or match) cases and controls on a factor that is under investigation. $\alpha$ and $\gamma$ may not be deleted from model (19) after stratification; in other words, odds ratios for B within strata $\bar{A}$ and $A$ are not equal, unless there is no interaction between A and B in the sense of relative risk. Since the factor A as considered in the framework of model (20) is identical to the variable $z$ discussed in the previous sections, we may say that Miettinen's effect modifier is a factor $z$ that has some interaction with the factor under investigation. The discussion above regarding confounding variables is illustrated by model (20) as follows. Let us set $z = A$, $z_1 = \bar{A}$ and $z_2 = A$.
If $\lambda_z = \gamma = 0$, then $z$ is not a confounding factor.
Further, since $\alpha = 0$, $\alpha > 0$ and $\alpha < 0$ if and only if the joint distributions of exposure to B and $z$ among the cases are independent, positively correlated, and negatively correlated, respectively, we have: $\psi = \psi(z_1) = \psi(z_2)$ if and only if the joint distributions of exposure to B and $z$ in the cases are independent; $\psi > \psi(z_1) = \psi(z_2)$ if and only if those of B and $z$ in the cases are positively correlated; and $\psi < \psi(z_1) = \psi(z_2)$ if and only if those of B and $z$ in the cases are negatively correlated. In the first of these cases, $z$ is not a confounding factor.
If $\gamma \neq 0$, then $z$ is a confounding factor.
The strength of the influence of the confounding factor upon exposure and disease association may be measured by Eq.
Many of the studies cited above have assumed essentially that $\gamma = 0$. Note that the application of the maximum likelihood method to the model with $\gamma = 0$ provides the same summary odds ratio as Gart (16). However, if further risk factors were ignored in the study, $\gamma = 0$ still could not be expected even if A were definitely known not to induce disease, since the value of $\gamma$ could be influenced by some ignored factor which had interaction with the factor under investigation. Further, suppose that both A and B are (strong) risk factors and have no synergistic relationship in inducing disease. Then $\gamma$ should be negative since it measures interaction on the multiplicative scale, whereas a synergistic relationship is measured on the additive scale (3).
The model of Eq. (20) agrees with a special case of that considered by Prentice (2). He called $z$ (i.e., factor A) a confounding factor if $\alpha \neq 0$. However, this may not be true. A counterexample is given in Table 10. Here $\psi(z_1) = \psi(z_2) = \psi = 6$, so $z$ is not a confounding factor, yet $\alpha = -0.85$.
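The roles of the log relative risk of B and the interaction parameter can be illustrated by reading them off two stratum-specific odds ratios, using the relations stated above: the odds ratio for B is $\exp\{\lambda_B\}$ among those unexposed to A and $\exp\{\lambda_B + \gamma\}$ among those exposed to A. The counts are hypothetical, and the simple plug-in estimates below are only a sketch, not the estimation procedure used in the paper.

```python
import math

def log_odds_ratio(cases_b, cases_not_b, controls_b, controls_not_b):
    """Log odds ratio for factor B from one 2 x 2 table of cases and controls."""
    return math.log((cases_b * controls_not_b) / (cases_not_b * controls_b))

# Hypothetical counts within each stratum of A:
# (cases with B, cases without B, controls with B, controls without B)
stratum_not_a = (20, 80, 10, 90)   # unexposed to A
stratum_a = (60, 40, 20, 80)       # exposed to A

# Log relative risk of B among those unexposed to A (rare-disease assumption).
lambda_b = log_odds_ratio(*stratum_not_a)

# Interaction on the multiplicative scale: difference of the two log odds ratios.
gamma = log_odds_ratio(*stratum_a) - lambda_b

print(f"odds ratio for B, unexposed to A: {math.exp(lambda_b):.2f}")
print(f"odds ratio for B, exposed to A:   {math.exp(lambda_b + gamma):.2f}")
print(f"gamma = {gamma:.3f}")
```

A positive value of gamma here means that the relative risk of B is larger among those exposed to A, which in the terminology above makes A an effect modifier for B.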

Classification and Stratification
In the model of Eq. (19) controls are selected from a population comparable to the population of cases; then it is determined into which of the classes $AB$, $A\bar{B}$, $\bar{A}B$ or $\bar{A}\bar{B}$ they fall. On the other hand, in the model of Eq. (20), a predetermined number of controls are selected among those individuals who have $A$ and $\bar{A}$, respectively, and they are then classified according to whether they have $B$ or $\bar{B}$. Therefore, we could say that the first model is based on classification, whereas the second model is based on stratification. The difference lies in the sampling strategies. The first model provides a relative risk, not only for factor B but also for factor A. Even though A is thought not to induce disease, we may find the relative risk greater than 1. Investigation of the reason could often provide further information. For example, place of residence is normally not a risk factor for lung cancer, yet we might find the relative risk for some location greater than 1. Investigation could reveal the presence of certain suspect industries in the region. Or perhaps we will find a relative risk for A of 1 but with $\gamma$ greater (smaller) than zero. Such a finding would be especially interesting, since it would suggest that factor A alone is not the risk factor, but that it amplifies (diminishes) the relative risk of B if it operates together with B. A significant advantage of the classification model is its flexibility. It permits us to identify and to evaluate the influence of confounding factors. It also provides estimates of relative risks free from the influence of these factors. Further, as will be seen in a subsequent section, it also provides estimates of relative risks adjusted for combinations of factors. Generally, the model (19) provides more information than the model (20).
A drawback of the sampling strategy which leads to model (19) is that the estimates of $\lambda_B$ and $\gamma$ are likely to be influenced by any bias present in the selection of controls. This should be seriously considered in a case-control study, since it further complicates the usual difficulties in selecting controls. Another advantage of stratification is that we can increase our precision in estimating $\lambda_B$ and $\gamma$ by selecting an appropriate number of controls from each stratum. Summarizing the above discussion, we recommend the following strategy: (1) stratify cases and controls by means of confounding variables which are definitely known not to induce disease and which are not of interest to the investigation; (2) classify cases and controls by means of confounding variables whose role in the induction of disease is known or suspected. An analytic model for this approach will be discussed in the next section. The analysis based on the model of Eq. (19) is weak when the number of cases and controls is small, since the usual methods for estimation of parameters employ asymptotic approximations. In this case Breslow's (20) recent approach is useful. He has given an exact analysis, considering all the marginal totals of the two 2 x 2 tables in Table 8 to be fixed. The model which he applied is the linear model for the log odds ratio, Eq. (22), which is derived from our model, Eq. (20). He made use of a computer program to carry out the exact analysis. However, as the number of cases and controls becomes large, computation time becomes prohibitive.

A Model Taking into Account Classification and Stratification Simultaneously, Where One Factor Assumes More Than Two Values
Let us consider a model with simultaneous stratification and classification, where one variable can take on more than two values. We shall consider first a situation where there are two factors A and B. Let us suppose B is dichotomous, whereas A is trichotomous, with possible values $A_0$, $A_1$, $A_2$. Table 11 summarizes the probability distributions for cases and controls, where both are classified on A and B.
An analytic model for Table 11 is given in Eqs. Next, let us expand on Table 11 by stratifying on certain confounding variables $z$ and $w$, such as age and sex. Let us denote by $P_{ijk}(z,w)$ the probability $P_{ijk}$ in the stratum specified by $z$ and $w$. Then the analytic model is given by Eqs. Normally it would be rare to have information beyond third-order interactions. In the simpler case, similar results to those based on the above model could be obtained by applying a model which ignores the $\beta$ and $\delta$ parameters in the above model. Parameters for these models can be estimated by the weighted least squares method of Grizzle, Starmer, and Koch (21), or by the method of maximum likelihood intensively discussed in the book of Bishop, Fienberg, and Holland (19). The number of parameters in these models looks excessive. But I suggest that it is better to start from a saturated model and to undertake an iterative process to reach the most appropriate and simplest model that could explain the structure of data in detail: starting from the above model, first estimate all parameters, then examine them, deleting those that do not contribute significantly, and finally develop a simplified model. The approach would be especially useful if a case-control study were an exploratory one intended to locate causal factors. If it is a confirmatory study, then we should use, of course, all information obtained from previous studies as well as existing knowledge to establish a simpler model for the initial model. Statistical methods, such as the "Akaike information criterion" (AIC) (22), all possible regressions (23), stepwise regressions (24), etc., can be applied to determine how many parameters should be included in the model. In my experience, the method employed by Grizzle, Starmer, and Koch is the handiest and most efficient for that purpose, although special care is necessary in applying the method if empty cells exist.
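As a much simplified illustration of the weighted least squares idea, consider fitting the one-parameter model "the log odds ratio is the same in every stratum" to a set of 2 x 2 tables. With inverse-variance weights this reduces to a weighted mean of the stratum log odds ratios (often attributed to Woolf), and the weighted residual sum of squares serves as a chi-square check of fit. This is a sketch in the spirit of the approach, not the full procedure of Grizzle, Starmer, and Koch; the counts and function name are hypothetical.

```python
import math

def wls_common_log_odds_ratio(tables):
    """Weighted least squares fit of a common log odds ratio.

    Each table is (a, b, c, d) = (exposed cases, unexposed cases,
    exposed controls, unexposed controls). Every cell must be nonzero,
    because the large-sample variance 1/a + 1/b + 1/c + 1/d is undefined
    for empty cells.
    """
    log_ors, weights = [], []
    for a, b, c, d in tables:
        log_ors.append(math.log((a * d) / (b * c)))
        weights.append(1.0 / (1 / a + 1 / b + 1 / c + 1 / d))
    total_weight = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, log_ors)) / total_weight
    std_error = math.sqrt(1.0 / total_weight)
    # Residual chi-square with (number of strata - 1) degrees of freedom.
    lack_of_fit = sum(w * (y - estimate) ** 2 for w, y in zip(weights, log_ors))
    return estimate, std_error, lack_of_fit

# Hypothetical strata with a common odds ratio of 5.
tables = [(30, 10, 150, 250), (20, 140, 5, 175)]
estimate, std_error, lack_of_fit = wls_common_log_odds_ratio(tables)
print(f"common odds ratio = {math.exp(estimate):.2f}")
print(f"standard error of log odds ratio = {std_error:.3f}")
print(f"lack-of-fit chi-square = {lack_of_fit:.3g}")
```

The division by cell counts in the weights shows directly why empty cells require special care with this method: a zero cell makes the variance estimate, and hence the weight, undefined.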

Number of Cases and Controls
Generally, therefore, the number of confounding factors and the number of levels of each factor to be considered in the study are determined on the basis of the number of cases. If the group of cases is not large, then we must ignore some confounding factors or decrease the number of levels of certain factors, e.g., by collapsing the age categories into wider ranges for each stratum. If this process is suspected of introducing serious bias, we may have to switch to pair-matching. However, a well-known difficulty of matched pairs design lies in the selection of controls. Cochran (5) has estimated that the reservoir from which controls are to be selected must be at least six times the size of the number of cases. Prentice (2) proposed a method to liberalize the study design substantially and increase the estimating efficiency. This is a method of adjusting for the unavailability of a corresponding matched individual statistically in the analysis. The model proposed in the last section has the same property as Prentice's, when individuals are matched on $z$ and $w$.
Special attention must be paid to the empty cells before collapsing the exposure categories or otherwise changing procedures in order to eliminate them, since they are likely to provide considerable information; for example, if the exposure categories are ordered in some way and there is a strong dose-response relationship with respect to that ordering, then extreme cells for the controls could well be empty. If such is the case the number of controls should be increased to eliminate the empty cells; if no such dose-response relationship is seen, then reliance on the previously discussed stratification on a selected set of confounding variables would be suitable. An advantage of the models discussed in previous sections is that even if, say, 10% of the cells for cases and controls are empty, we can use the information obtained from the 90% of the cells that are not empty to estimate parameters which will represent the structure of the data satisfactorily.
It is not yet well established how to determine how many cases are necessary when several confounding factors are taken into account. The number depends both on financial restrictions and on the purpose of the study. Let us ignore the former and consider only the latter. Let A and B be suspected (dichotomous) risk factors which are of interest. If A is the target factor, then the familiar method discussed intensively in the book of Fleiss (9) may be applied to a 2 x 2 table, obtained by ignoring the factor B, to get a rough estimate of the required number of cases. If A and B are equally important factors and the investigation is intended to determine the effects of both A and B, as well as their interactions, in inducing disease, then a test of the degree of interaction could help to determine the required number of cases. If there is a priori evidence that interaction does not exist, a rough estimate of the required number could be obtained by applying the above method to two 2 x 2 tables, one obtained by ignoring the factor B and the other by ignoring factor A, and by using the larger number.
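The rule of thumb above (one 2 x 2 calculation per factor, then take the larger number) can be sketched as follows. This is only an illustration under stated assumptions: it uses the common normal-approximation formula for comparing two proportions without continuity correction, with equal numbers of cases and controls, which may differ in detail from the procedure in Fleiss (9). The function names, the default significance level and power, and the example exposure proportions and odds ratios are all hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cases_needed(p0, odds_ratio, alpha=0.05, power=0.80):
    """Approximate number of cases (with an equal number of controls)
    needed to detect `odds_ratio` in a single 2 x 2 table.

    p0 is the assumed exposure proportion among controls. The formula is
    the usual two-proportion normal approximation (two-sided test, no
    continuity correction); it is an assumption of this sketch.
    """
    # Exposure proportion among cases implied by p0 and the odds ratio.
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))
    ) ** 2
    return ceil(numerator / (p1 - p0) ** 2)


def cases_needed_two_factors(p0_a, or_a, p0_b, or_b, **kwargs):
    """No-interaction case: size each marginal 2 x 2 table separately
    (ignoring the other factor) and use the larger number."""
    return max(cases_needed(p0_a, or_a, **kwargs),
               cases_needed(p0_b, or_b, **kwargs))


if __name__ == "__main__":
    # Hypothetical inputs: 30% of controls exposed to A, 10% exposed to B,
    # and an odds ratio of 2 to be detected for each factor.
    print(cases_needed_two_factors(0.30, 2.0, 0.10, 2.0))
```

The rarer exposure usually drives the required number, which is why the larger of the two marginal estimates is taken.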

An Illustrative Example of the Method
Information on lifetime smoking and occupational histories for 101 white male coastal Georgia residents diagnosed with lung cancer during 1970-76, and for 203 white male age- and residence-matched hospital controls diagnosed with conditions other than lung cancer or lung disease, was obtained by personal interview.* Each case and control was classified into one of three smoking levels based on his cigarette smoking history: (1) none or light (< 1/2 pack/day), including individuals who quit smoking at least 10 years before diagnosis; (2) moderate (1/2 to 1 1/2 packs/day); (3) heavy (2 or more packs/day). Each individual was also categorized (yes/no) as to whether he had ever been employed in each of the shipbuilding and construction industries. The resulting responses are listed in Table 12.
*The data presented here, which were provided to the author by Dr. William J. Blot, represent only a part of a complete case-control study; they were selected for illustrative purposes and should not be used to draw inferences about cancer risk. A detailed description of the Georgia study and a full report of the results is given elsewhere (25).
A model with 22 parameters, similar to the one discussed in the last section, was set up as a preliminary model. A stepwise procedure was carried out, using the weighted least-squares method of Grizzle, Starmer, and Koch (21), and 10 of the 22 parameters were eliminated, leaving the model of Eq. (31) as the one best reflecting the structure of the data given in Table 12:
$$\log\frac{P_{1ijl}}{P_{1000}} = \log\frac{P_{0ijl}}{P_{0000}} + (2-i)\,i\,\lambda_{A(1)} + \frac{i(i-1)}{2}\,\lambda_{A(2)} + j\,\lambda_{B} + l\,\lambda_{C} + \frac{i(3-i)\,l}{2}\,\gamma_{AC}, \qquad i = 1, 2;\ j = 0, 1;\ l = 0, 1 \tag{31}$$
where $\lambda_{A(1)}$ and $\lambda_{A(2)}$ are the log relative risks due to moderate and heavy smoking, respectively, as compared to none or light; $\lambda_{B}$ is the log relative risk due to employment in the shipbuilding industry as compared to nonemployment in the industry; and $\lambda_{C}$ is the log relative risk due to employment in the construction industry as compared to nonemployment in the industry. The interactions of $A_1$ and C and of $A_2$ and C are assumed to be equal and are represented by $\gamma_{AC}$; the other parameters are nuisance parameters introduced by the case-control framework.
[Table 13 and Table 14 appear here.]
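The coefficients in Eq. (31) act as an indicator coding: for each smoking level i, shipbuilding status j, and construction status l they switch the five substantive parameters on or off. The short sketch below tabulates those multipliers and evaluates the resulting log relative risk of a cell against the baseline cell (0, 0, 0). The function names and the parameter values are hypothetical (they are not the estimates of Table 13), and allowing i = 0 as the baseline level is an assumption of the sketch.

```python
def eq31_coefficients(i, j, l):
    """Multipliers of the five substantive parameters of Eq. (31)
    for smoking level i, shipbuilding j, and construction l."""
    if i not in (0, 1, 2) or j not in (0, 1) or l not in (0, 1):
        raise ValueError("expected i in {0,1,2}, j in {0,1}, l in {0,1}")
    return {
        "lambda_A1": (2 - i) * i,          # 1 only for moderate smoking
        "lambda_A2": i * (i - 1) // 2,     # 1 only for heavy smoking
        "lambda_B": j,                     # shipbuilding main effect
        "lambda_C": l,                     # construction main effect
        "gamma_AC": i * (3 - i) * l // 2,  # 1 for any smoker exposed to C
    }


def log_relative_risk(i, j, l, params):
    """Log relative risk of cell (i, j, l) versus cell (0, 0, 0),
    i.e. Eq. (31) with the nuisance term moved to the left-hand side."""
    coef = eq31_coefficients(i, j, l)
    return sum(coef[name] * params[name] for name in coef)


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    params = {"lambda_A1": 1.0, "lambda_A2": 2.0, "lambda_B": 0.6,
              "lambda_C": 0.9, "gamma_AC": -1.4}
    for i in (1, 2):
        for l in (0, 1):
            # Smoking effect within a stratum of B and C: cell versus the
            # nonsmoking cell of the same stratum (j cancels out).
            effect = (log_relative_risk(i, 0, l, params)
                      - log_relative_risk(0, 0, l, params))
            print(f"i={i}, l={l}: log RR due to smoking = {effect:.2f}")
```

Because no term couples i with j, the smoking effect computed this way is the same in both shipbuilding strata, while the $\gamma_{AC}$ term shifts it among those exposed to construction.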
Note to Table 12: (a) Predictions from the model of Eq. (31). Factors: A, smoking ($A_0$ = nonsmoking or light, $A_1$ = moderate, $A_2$ = heavy); B, shipbuilding ($B_0$ = unexposed, $B_1$ = exposed); C, construction ($C_0$ = unexposed, $C_1$ = exposed).
The influence of an empty cell in Table 12 was determined to be negligible: replacing the zeroes with values smaller than 1/6 had almost no effect on the computation. To check the validity of the model, the number of cases and controls in each category was predicted from Eq. (31) by using estimated values for the parameters. The predicted values, summarized in the second and fourth rows of Table 12, agree fairly well with the original data. Therefore, the model appears to describe the structure of the data well.
Suppose that our primary interest is in the association between exposure to A and lung cancer, with B and C as additional factors. The relative risks due to exposure to A, adjusted for B and C, have the structure represented in Table 15. Estimates of these relative risks are obtained by substituting the values for the parameters shown in Table 13. Table 15 shows that the relative risks within stratum $B_0$ are equal to those within stratum $B_1$. This results because in the model of Eq. (31) $\gamma_{A(i)B} = 0$, i = 1, 2; i.e., B is not an effect modifier. B is, however, a confounding factor, since $\alpha_{A(2)B} \neq 0$. [...] The relative risk due to A among those exposed to C is modified to a quarter of that among those unexposed to C. The smaller value of the relative risk due to A among those exposed to C occurs for the reason discussed above. The influence of C on the association is estimated by $\max(|\tau_{A(1)}|, \ldots)$.
If the Mantel-Haenszel method were applied in the analysis, we would have to ignore either B or C, or to pool $A_1$ and $A_2$, since there are several cells in Table 12 whose entries are quite small. Because A is our study factor, we would prefer no pooling of A. In that case, B should be ignored, since its $\tau$ value is quite close to zero compared to the corresponding value for C.
Next, let us assume that B is our study factor, and A and C additional factors. The relative risks due to exposure to B, adjusted for A and C, have the structure represented in Table 16. All the entries in a row are equal. This results because A and C are not effect modifiers, i.e., in Eq. (31) $\gamma_{A(i)B} = \gamma_{BC} = 0$, i = 1, 2. Since $\alpha_{A(2)B} \neq 0$, A is a confounding factor whose influence on the association is estimated by max [...] (33). The fact that $\tau_{A(2)}$ is negative indicates that an underestimate of the relative risk will result if A is ignored in the study. Since $\alpha_{BC} = 0$, C is not a confounding factor in the association of exposure to B and lung cancer. Thus, if the Mantel-Haenszel method were applied in the analysis, C should be ignored for two reasons: (1) C is related to disease but unrelated to exposure; (2) the problem of small cell entries discussed above. When C is ignored and the Mantel-Haenszel method is applied, the summary statistic is 1.87, which is fairly close to $\exp\{\lambda_{B}\} = 1.93$.
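For readers who want to reproduce a Mantel-Haenszel comparison of this kind, the summary estimate can be computed with a few lines of code. The sketch below is a minimal illustration of the standard Mantel-Haenszel summary odds ratio over strata of a confounder; the function name and the stratum counts are hypothetical and are not taken from Table 12.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio.

    `strata` is an iterable of 2 x 2 tables, one per level of the
    stratifying factor, each given as (a, b, c, d) with
      a = exposed cases,    b = unexposed cases,
      c = exposed controls, d = unexposed controls.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("summary odds ratio is undefined for these strata")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts: exposure to B, stratified on three levels of A.
    tables = [(6, 20, 10, 60), (9, 30, 8, 50), (7, 25, 4, 30)]
    print(round(mantel_haenszel_or(tables), 2))
```

Strata in which every entry of one diagonal is zero contribute nothing to one of the two sums, which is one way the small-cell problem mentioned above shows up in practice.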

Conclusion
Identification of confounding factors, evaluation of their influence on the exposure-disease association, and the introduction of proper devices, such as matching, stratification, classification and others, into the design to block the influence of these factors are very important in designing case-control studies. We presented a theoretical review of recent developments in this area, based on a statistical definition of a confounding factor. With such a definition, medical knowledge is required to determine whether or not confounding factors identified by our methods are intermediate factors in the causal pathway between the study factor and the disease. If a confounding factor is an intermediate factor we should not match on it (overmatching); if not, we must introduce some device to block its influence. Stratification, or the matched pairs design as its extreme form, has been the main design device for blocking the influence of confounding factors. However, the identification of a confounding factor and the evaluation of the strength of its influence on the association are not feasible from data selected by such sampling strategies. Identification and evaluation can be achieved, instead, through random sampling of cases and controls from a population and their classification into categories based on known and suspected confounding factors. This paper suggested stratification of cases and controls on those confounding variables which are definitely known not to induce disease and which are not of interest in the study, and classification of cases and controls on confounding variables which are known or suspected to induce disease. Logistic linear models were introduced for the combined purpose of identification of confounding factors, evaluation of their influence on the relative risk, and analysis of the data. They are extensions of the well-known logistic model for 2 x 2 table analysis, as applied in a case-control study by Prentice (2).
The paper recommends starting from such a model, then following an iterative process to derive the simplest appropriate model that explains the structure of the data in detail. If the study is a preliminary one, the resulting model can be used to identify the confounding factors and evaluate the strength of their influence on the cause-effect association in preparation for a follow-up study. Estimates of relative risks and interactions are also obtained, and relative risks adjusted for combinations of factors follow by simple manipulation of the estimated relative risks from the model. In contrast to the method of Prentice (2), this approach requires only a single computer calculation, not successive iterations, but it shares with Prentice (2) the ability to adjust for the unavailability of matches for some individuals, if pair-matching is applied to certain confounding factors such as age. Thus it can substantially liberalize the study design and increase estimating efficiency.
Bishop, Fienberg, and Holland (19) have discussed thoroughly the analysis of frequency data by log linear models. The definition of a confounding factor given in this paper is identical to their concept of "collapsibility of categories." Thus their general approach could be used quite effectively in case-control studies. Statisticians may prefer their approach. However, it could result in statistical manipulation that is useless to epidemiologists unless statisticians understand precisely the traditional ideas which have been developed in epidemiology. We hope that the discussions in the present paper will help statisticians to understand such ideas and apply them in their epidemiological research.
Case-control studies are becoming more complex in design and analysis, where, as was pointed out by McKinlay (2), "emphasis is increasingly being given to the investigation and estimation of multivariate sources of variation rather than simply being restricted to the removal of bias from a single comparison." Although the design and analysis of case-control studies using logistic linear models as introduced in the present paper seem complicated, such models, as well as the log linear model discussed by Bishop, Fienberg, and Holland (19), will play a central role in such studies.