Statistical analyses of the relative risk.

Let P1 be the probability of a disease in one population and P2 be the probability of a disease in a second population. The ratio of these quantities, R = P1/P2, is termed the relative risk. We consider first the analyses of the relative risk from retrospective studies. The relation between the relative risk and the odds ratio (or cross-product ratio) is developed. The odds ratio can be considered a parameter of an exponential model possessing sufficient statistics. This permits the development of exact significance tests and confidence intervals in the conditional space. Unconditional tests and intervals are also considered briefly. The consequences of misclassification errors and of ignoring matching or stratifying are also considered. The various methods are extended to the combination of results over strata. Examples of case-control studies testing the association between HL-A frequencies and cancer illustrate the techniques. The parallel analyses of prospective studies are given. If P1 and P2 are small and the sample sizes are large, the appropriate model is a Poisson distribution. This yields an exponential model with sufficient statistics. Exact conditional tests and confidence intervals can then be developed. Here we consider the case where two populations are compared adjusting for sex differences as well as for strata (or covariate) differences such as age. The methods are applied to two examples: (1) testing in the two sexes the ratio of relative risks of skin cancer in people living in different latitudes, and (2) testing over time the ratio of the relative risks of cancer in two cities, one of which fluoridated its drinking water and one of which did not.


Retrospective Studies
McKinlay (1) has reviewed the more general aspects of the design and analysis of retrospective studies.
The odds ratio has a history of use in statistical theory independent of this argument. Fisher (2) showed that it is the parameter in the noncentral distribution of the exact conditional test. The definition of interaction in 2 x 2 x k contingency tables (9,10) is the equality of the odds ratio over the several 2 x 2 tables. Other interesting properties of this measure are given in Armitage's (11) review.

Conditional Inference on $\psi$
Let $x_1$ be the number of individuals of the first population in a sample of $n_1$ cases and let $x_2$ be the number of individuals of the first population in a sample of $n_2$ controls. We assume $x_1$ and $x_2$ to be independent binomial variates with parameters $p_1$ and $p_2$, respectively. It is convenient to reparameterize to a logistic form,

$$p_1 = \frac{\exp\{\beta + (\lambda/2)\}}{1 + \exp\{\beta + (\lambda/2)\}}, \qquad p_2 = \frac{\exp\{\beta - (\lambda/2)\}}{1 + \exp\{\beta - (\lambda/2)\}}.$$

This implies that $\psi = e^{\lambda}$ and that $\lambda = \ln(p_1/q_1) - \ln(p_2/q_2)$, the difference of the logits. In other words, inferences on $\lambda$ are equivalent to those on $\psi$ (or $R$ under the rare disease assumption). The likelihood for this model is

$$L(\lambda, \beta) = \binom{n_1}{x_1}\binom{n_2}{x_2} \exp\{[(x_1 - x_2)\lambda/2] + (x_1 + x_2)\beta\}\, g(\lambda, \beta). \tag{5}$$

The form of this likelihood implies that $x_1$ and $x_1 + x_2$ are a minimal set of sufficient statistics. Inferences on $\lambda$ should be made by considering the conditional likelihood of $x_1$ given that $x_1 + x_2 = x_.$ is fixed (12).
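This yields

$$h(x_1 \mid x_.; \psi) = \frac{\binom{n_1}{x_1}\binom{n_2}{x_. - x_1}\psi^{x_1}}{\sum_i \binom{n_1}{i}\binom{n_2}{x_. - i}\psi^{i}}, \tag{6}$$

where the sum is over all values of $i$ possible given the fixed margins. The exact significance test and the exact lower and upper confidence limits, $\psi_L$ and $\psi_U$, are based on the tail probabilities of this distribution (4,13). Note that these limits will always agree with the exact significance test in the sense that if $P < \alpha/2$, $\psi_0 \notin (\psi_L, \psi_U)$ and if $P > \alpha/2$, $\psi_0 \in (\psi_L, \psi_U)$. Cox (14) [see also Birch (15)] showed that the conditional maximum likelihood estimator of $\psi$, $\hat\psi_{cml}$, is given by the solution to the equation

$$x_1 = E(X_1 \mid x_.; \hat\psi_{cml}), \tag{9}$$

where $X_1$ indicates the random variable whose expectation is calculated from Eq. (6). It is interesting to note that this is not, in general, equal to $\hat\psi_{aml} = \hat p_1 \hat q_2/(\hat p_2 \hat q_1)$, where the usual sample proportions are substituted for these parameters. Computer programs for calculating all these exact conditional methods are available (16,17).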
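The conditional distribution of $X_1$ given $x_.$ (a noncentral hypergeometric distribution with weights proportional to $\binom{n_1}{i}\binom{n_2}{x_. - i}\psi^i$) and the exact limits can be sketched numerically. The following is a minimal illustration, not the cited programs (16,17): the function names and the small 2 x 2 table are hypothetical, and the limits are taken as the values of $\psi$ at which each tail probability equals $\alpha/2$. The search bracket is adequate only for small tables with no zero cells.

```python
from math import comb, exp

def cond_pmf(n1, n2, x_dot, psi):
    """Conditional distribution of X1 given x1 + x2 = x_dot for odds ratio psi."""
    lo, hi = max(0, x_dot - n2), min(n1, x_dot)
    w = {i: comb(n1, i) * comb(n2, x_dot - i) * psi ** i for i in range(lo, hi + 1)}
    total = sum(w.values())
    return {i: v / total for i, v in w.items()}

def upper_tail(x1, n1, n2, x_dot, psi):
    """P(X1 >= x1 | x_dot; psi); increases with psi."""
    return sum(p for i, p in cond_pmf(n1, n2, x_dot, psi).items() if i >= x1)

def lower_tail(x1, n1, n2, x_dot, psi):
    """P(X1 <= x1 | x_dot; psi); decreases with psi."""
    return sum(p for i, p in cond_pmf(n1, n2, x_dot, psi).items() if i <= x1)

def solve_increasing(f, target, lo=-10.0, hi=10.0, iters=200):
    """Bisection on log(psi) for a function f(psi) that increases with psi."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(exp(mid)) < target:
            lo = mid
        else:
            hi = mid
    return exp((lo + hi) / 2)

def exact_limits(x1, n1, x2, n2, alpha=0.05):
    """Exact conditional limits: each tail probability is set equal to alpha/2."""
    x_dot = x1 + x2
    psi_l = solve_increasing(lambda s: upper_tail(x1, n1, n2, x_dot, s), alpha / 2)
    psi_u = solve_increasing(lambda s: -lower_tail(x1, n1, n2, x_dot, s), -alpha / 2)
    return psi_l, psi_u

# Hypothetical table: 7 of 10 cases and 3 of 10 controls from the first population.
psi_l, psi_u = exact_limits(7, 10, 3, 10)
print(upper_tail(7, 10, 10, 10, 1.0), psi_l, psi_u)
```

In this hypothetical table the one-tailed exact P exceeds 0.025, and correspondingly the 95% limits cover $\psi = 1$, which illustrates the agreement between test and limits noted above.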
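The asymptotic approximations to these methods are rather easy to apply. The chi-square test of $\psi = 1$ is calculated by using the first two moments of the hypergeometric distribution. With the continuity correction, this is

$$\chi_c^2 = \frac{\{|x_1 - E(X_1 \mid x_.; 1)| - \tfrac{1}{2}\}^2}{\mathrm{Var}(X_1 \mid x_.; 1)}, \tag{10}$$

where, writing $N = n_1 + n_2$, the hypergeometric moments are $E(X_1 \mid x_.; 1) = n_1 x_./N$ and $\mathrm{Var}(X_1 \mid x_.; 1) = n_1 n_2 x_. (N - x_.)/[N^2 (N - 1)]$. For the general case of $\psi = \psi_0$, it is necessary to find the value $\hat X_1$ for which, in the asymptotic distribution, $E(X_1 \mid x_.; \psi_0) = \hat X_1$. This is the solution to the equation

$$\frac{\hat X_1 (n_2 - x_. + \hat X_1)}{(n_1 - \hat X_1)(x_. - \hat X_1)} = \psi_0, \tag{11}$$

which is a quadratic in $\hat X_1$. Here we assume, implicitly, that $x_. \le n_1$, and that we choose the solution to the quadratic, $\hat X_1$, such that $0 \le \hat X_1 \le \min(n_1, x_.)$. The variance of $\hat X_1$ is given by

$$\mathrm{Var}(\hat X_1) = \left[\frac{1}{\hat X_1} + \frac{1}{x_. - \hat X_1} + \frac{1}{n_1 - \hat X_1} + \frac{1}{n_2 - x_. + \hat X_1}\right]^{-1}$$

(4,18), as first noted by Fisher (2). One may use Eq. (6) ... The asymptotic confidence limits are found by solving quartic equations (4,13). As an alternative to solving quartics, Gart (19), following a suggestion by Cornfield (4), has given a much simpler method for solving Eq. (16) based on an iterative procedure involving only the solution of quadratics. Gart and Thomas (20) assessed, in the conditional space, the various approximate methods of finding confidence limits for $\psi$. They found Cornfield's method [Eq. (14)] to be clearly superior to logit or Fieller Theorem methods. They recommended that Cornfield's method be used for 95% limits if the minimum of $\hat X_1$, $x_. - \hat X_1$, $n_1 - \hat X_1$, and $n_2 - x_. + \hat X_1$ in Eqs. (17) ...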
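The fitted value $\hat X_1$ for a given $\psi_0$ can be obtained in closed form. Cross-multiplying the defining relation $\hat X_1 (n_2 - x_. + \hat X_1) = \psi_0 (n_1 - \hat X_1)(x_. - \hat X_1)$ gives the quadratic $(\psi_0 - 1)\hat X_1^2 - [\psi_0 (n_1 + x_.) + (n_2 - x_.)]\hat X_1 + \psi_0 n_1 x_. = 0$. The sketch below solves it and evaluates the reciprocal-sum variance; the function names and the numbers are hypothetical, and the admissible root is taken to lie between $\max(0, x_. - n_2)$ and $\min(n_1, x_.)$.

```python
from math import sqrt

def fitted_x1(n1, n2, x_dot, psi0):
    """Admissible root of X(n2 - x_dot + X) = psi0 (n1 - X)(x_dot - X)."""
    if psi0 == 1.0:
        return n1 * x_dot / (n1 + n2)      # the equation is linear when psi0 = 1
    a = psi0 - 1.0
    b = -(psi0 * (n1 + x_dot) + (n2 - x_dot))
    c = psi0 * n1 * x_dot
    disc = sqrt(b * b - 4 * a * c)
    roots = [(-b - disc) / (2 * a), (-b + disc) / (2 * a)]
    lo, hi = max(0, x_dot - n2), min(n1, x_dot)
    return next(r for r in roots if lo <= r <= hi)

def asymptotic_variance(n1, n2, x_dot, x1_hat):
    """Reciprocal of the sum of reciprocals of the four fitted cells."""
    cells = (x1_hat, x_dot - x1_hat, n1 - x1_hat, n2 - x_dot + x1_hat)
    return 1.0 / sum(1.0 / cell for cell in cells)

# Hypothetical margins: n1 = 40 cases, n2 = 60 controls, x_dot = 30, psi0 = 2.5.
x1_hat = fitted_x1(40, 60, 30, 2.5)
print(x1_hat, asymptotic_variance(40, 60, 30, x1_hat))
```

For $\psi_0 = 1$ the fitted value reduces to $n_1 x_./N$, the mean of the central hypergeometric distribution.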

Unconditional Inference on $\psi$
If we wish to consider inferences in the unconditional space (where $x_.$ is not considered fixed), the asymptotic methods considered above still hold. It is appropriate to drop the 1/2 correction from the $\chi_c^2$, and from the calculation of the confidence limits [Eq. (16)]. As before, the uncorrected chi-square tests will also agree with uncorrected confidence limits in having $P < \alpha/2$ when the limits do not cover $\psi_0$ and vice versa.
Recently Miettinen (6) has introduced a "test-based" method for finding confidence limits. For the relative risk this involves first calculating the chi-square test of the 2 x 2 table, $\chi^2$. The confidence limits for $\psi$ are similar to those based on a logit transformation with the logit variances estimated indirectly from the $\chi^2$ test statistic:

$$\psi_{U,L} = \hat\psi_{aml}^{\,1 \pm (\chi_{\alpha/2}/\chi)}. \tag{18}$$

This method will always agree in covering $\psi = 1$ when $P > \alpha/2$ and not covering $\psi = 1$ when $P < \alpha/2$, where $P$ is calculated from the chi-square test. It is not clear from Miettinen's papers (6,7) whether he is recommending his method in the conditional or unconditional space. Since it is essentially the logit method, it is likely to perform similarly to that method in the conditional space. Gart and Thomas (20) found the logit method to yield much too narrow limits (i.e., true confidence coefficient $< 1 - \alpha$) in the conditional space. In the unconditional space Halperin (8) pointed out that Miettinen's argument strictly holds only when $\psi = 1$ (or $\psi = \psi_0$ if the chi-square is applied to testing $\psi = \psi_0$). He stated that asymptotically the true confidence coefficient will be less than $1 - \alpha$ for any $\psi \ne 1$ (or $\psi_0$). Miettinen (7) replied "that the principle, properly understood, is flawless . . .", and that a "systematic evaluation of the accuracy of test-based confidence limits are [sic] obviously needed." Apparently he claims his limits are adequate for $\psi \approx 1$. Gart and Thomas (23) have evaluated exactly (by enumeration of all the points) the lower limit of the unconditional confidence coefficient for both Cornfield's method [Eq. (16)] (omitting the 1/2 correction) and Miettinen's test-based limits [Eq. (18)] (omitting the continuity correction from the chi-square test). The results are given in Table 1.
It is seen that Cornfield's and Miettinen's methods have the same confidence coefficients for $p_1 = p_2$ or $\psi = 1$. However, as $p_1$ and $p_2$ diverge, Cornfield's method is as close or closer to 0.95 than is Miettinen's method. In every case of Table 1 where $\psi \ne 1$, Miettinen's method yields coefficients less than 0.95 and the deviation is quite large for $\psi \gg 1$. In such cases we see the test-based limits are not correct in principle, in asymptotic theory, nor in exact evaluation. One can only agree with Halperin's (8) assertion that they "should not be used since methods not suffering this disability, even if computationally somewhat more onerous, are always available." The exact or Cornfield's limits can be quickly found by program (16,17), and the iterative method suggested by Gart (19) is not that difficult to do on a small pocket calculator. In any case, the true appropriate "test-based" interval is the modification of the chi-square test given by Cornfield's method [Eq. (16)], since that equation reduces to the usual chi-square statistic [Eq. (10)] when $\psi = 1$. Fisher (13) showed this relation very clearly.

Misclassification Errors
All the results given above assume only sampling variation is present. The cases and controls are assumed to be classified correctly as to their population (smoking status, genetic type, etc.). Consider now the effect of misclassification. In the cases let $\theta_1$ be the probability that a first population individual (e.g., smoker) be classified in the second population (e.g., nonsmoker), and let $\phi_1$ be the probability that a second population individual (e.g., nonsmoker) be classified as a first population individual (e.g., a smoker). In the controls let the corresponding probabilities be $\theta_2$ and $\phi_2$. The apparent odds ratio, $\psi_a$, would then be given by

$$\psi_a = \frac{[p_1(1 - \theta_1) + q_1\phi_1]\,[p_2\theta_2 + q_2(1 - \phi_2)]}{[p_2(1 - \theta_2) + q_2\phi_2]\,[p_1\theta_1 + q_1(1 - \phi_1)]}.$$

Bross (24), in a significant paper, considered the case where the misclassification errors were equal in the cases and controls: $\theta_1 = \theta_2 = \theta$ and $\phi_1 = \phi_2 = \phi$.
Under this assumption, if we define the populations such that $\psi > 1$, it can be shown that $\psi_a \le \psi$, the true odds ratio.
The equality holds only when $\psi = 1$ or $\phi = \theta = 0$. Thus under this model the observed odds ratio can only underestimate the true odds ratio. If $\theta_1 \ne \theta_2$ and/or $\phi_1 \ne \phi_2$ it is not possible to make any general statements. Depending on the parameters, $\psi_a$ may be less than or greater than $\psi$. Goldberg's (25) paper is an excellent discussion of the problems of misclassification and contains an extensive bibliography [see also Copeland et al. (26)].
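The direction of these effects is easy to check numerically. In the sketch below the apparent probability of being classified in the first population is taken to be $p(1 - \theta) + q\phi$, which follows from the definitions of the error probabilities; the function names and all parameter values are hypothetical.

```python
def odds_ratio(p1, p2):
    """True odds ratio for case and control probabilities p1 and p2."""
    return p1 * (1 - p2) / (p2 * (1 - p1))

def apparent_odds_ratio(p1, p2, theta1, phi1, theta2, phi2):
    """Odds ratio computed from the misclassified (apparent) probabilities."""
    a1 = p1 * (1 - theta1) + (1 - p1) * phi1   # cases classified in population 1
    a2 = p2 * (1 - theta2) + (1 - p2) * phi2   # controls classified in population 1
    return odds_ratio(a1, a2)

psi = odds_ratio(0.6, 0.3)
# Equal errors in cases and controls (the situation considered by Bross)
equal_errors = apparent_odds_ratio(0.6, 0.3, 0.1, 0.1, 0.1, 0.1)
# Unequal errors: here the apparent odds ratio exceeds the true one
unequal_errors = apparent_odds_ratio(0.6, 0.3, 0.0, 0.2, 0.2, 0.0)
print(psi, equal_errors, unequal_errors)
```

With equal error rates the apparent value falls between one and the true odds ratio; with the unequal rates chosen here it is larger than the true value, illustrating why no general statement is possible in that case.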

Heterogeneity of Response
The methods derived above assume a binomial distribution. This implies that in both the case and control groups each individual has the same probabilities, $p_1$ and $p_2$, respectively, of being from population 1. In practice this is seldom the situation. Quite often, the cases and controls are matched or paired by age, sex, and other factors. Then it is reasonable to consider not a single $p_1$ and a single $p_2$ but a series of them for each pair, $p_{1k}$, $k = 1, 2, \ldots, n$, and $p_{2k}$, $k = 1, 2, \ldots, n$. The usual model assumed is that $\psi = (p_{1k}q_{2k})/(p_{2k}q_{1k})$, i.e., that the odds ratio is constant over the pairs (27). The pairing is often ignored in estimating the odds ratio, the results being pooled into a single 2 x 2 table. Let $\hat\psi_p$ be the pooled estimator. Armitage (11) showed that asymptotically

$$E(\hat\psi_p) = \psi\, w/w', \tag{21}$$

where $w_k = p_{2k}/q_{2k}$, $w = (\sum q_{1k}w_k)/(\sum q_{1k})$, and $w' = (\sum q_{2k}w_k)/(\sum q_{2k})$. Using the fact that $q_{2k}/q_{1k} = (1 + \psi w_k)/(1 + w_k)$ is an increasing function of $w_k$, Armitage goes on to show that $w' > w$ and thus that $1 < E(\hat\psi_p) < \psi$. Siegel and Greenhouse (28) showed a slightly less general result and give other references on the topic. Thus it is concluded that pooling can only underestimate the odds ratio.
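Armitage's result can be verified numerically for any chosen set of pair-specific probabilities. The sketch below uses hypothetical control probabilities and a hypothetical common odds ratio; it computes $\psi w/w'$ from the weighted means and compares it with the cross-product ratio of the summed probabilities, which is the large-sample value of the pooled estimator.

```python
def pooled_expectation(psi, p2):
    """Return (psi * w / w', w, w') for control probabilities p2[k] and common psi."""
    w_k = [p / (1 - p) for p in p2]                  # w_k = p2k / q2k
    q2 = [1 - p for p in p2]
    q1 = [1 / (1 + psi * w) for w in w_k]            # since p1k / q1k = psi * w_k
    w_bar = sum(q * w for q, w in zip(q1, w_k)) / sum(q1)
    w_prime = sum(q * w for q, w in zip(q2, w_k)) / sum(q2)
    return psi * w_bar / w_prime, w_bar, w_prime

def pooled_cross_product(psi, p2):
    """Cross-product ratio of the probabilities summed over the pairs."""
    q1 = [1 / (1 + psi * p / (1 - p)) for p in p2]
    sum_p1, sum_q1 = len(p2) - sum(q1), sum(q1)
    sum_p2, sum_q2 = sum(p2), len(p2) - sum(p2)
    return (sum_p1 * sum_q2) / (sum_p2 * sum_q1)

p2_values = [0.1, 0.3, 0.5, 0.7]    # hypothetical heterogeneity over pairs
expected, w_bar, w_prime = pooled_expectation(4.0, p2_values)
print(expected, pooled_cross_product(4.0, p2_values))
```

With these hypothetical values the pooled expectation is close to 2.9 although the common odds ratio is 4, so the attenuation from ignoring the pairing can be substantial.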

Differential Bias in the Selection of Cases and Controls
In the case we have just considered, the heterogeneity of the response was balanced by having a one-to-one correspondence in the matched pairs. If there are differing biases in the selection of cases and controls, the direction of the bias in the pooled odds ratio cannot be predicted. Consider the rather extreme, hypothetical examples (29) shown in Table 2. It is seen that the pooled estimator may give a completely different picture than the estimators from the individual strata. We consider appropriate models for the combination of strata in a subsequent section.
Environmental Health Perspectives 1.60

Application
The human leukocyte antigen system (HL-A) is a genetic system having several alleles. The allele HL-A2 was implicated by several authors as predisposing children to acute lymphocytic leukemia (ALL). Rogentine et al. (30) did a case-control study with fifty patients ($n_1$) who were examined at the National Cancer Institute. The control group was 200 ($n_2$) normal blood donors. As the genetic trait is autosomal, there was no need for matching on sex. Race was considered; all the cases and controls were white. Among the cases, 36 ($x_1$) individuals belonged to the HL-A2 population (72%) and among the controls 83 ($x_2$) possessed this antigen (41.5%). For these numbers we find from Eq. (9) that $\hat\psi_{cml} = 3.61$ and $\hat\psi_{aml} = [(36)(117)]/[(83)(14)] = 3.62$. The exact test of $\psi = 1$ yields a one-tailed $P = 0.00009$ while the approximate test [Eq. (10)] yields $P = 0.00011$. The 95% and 99% confidence limits are given in Table 3. It is noted that both the exact and approximate $P$ are $< 0.005$, and thus both corresponding 99% confidence intervals exclude $\psi = 1$. Note also that the discrepancies between the exact and approximate methods are small throughout.
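The point estimates and tests for these data can be recomputed with a short script. This is a sketch, not the programs cited in the text: the conditional estimate is found by bisection on $\ln\psi$, the exact test is the upper tail of the central hypergeometric distribution, and the approximate test is assumed to use the hypergeometric mean and variance with a 1/2 continuity correction. All variable and function names are illustrative.

```python
from math import comb, erfc, exp, sqrt

n1, x1 = 50, 36       # cases, and cases possessing HL-A2
n2, x2 = 200, 83      # controls, and controls possessing HL-A2
x_dot, n = x1 + x2, n1 + n2

psi_aml = (x1 * (n2 - x2)) / (x2 * (n1 - x1))

def weights(psi):
    """Unnormalized conditional probabilities of X1 = i given x_dot."""
    return {i: comb(n1, i) * comb(n2, x_dot - i) * psi ** i
            for i in range(max(0, x_dot - n2), min(n1, x_dot) + 1)}

def cond_mean(psi):
    w = weights(psi)
    return sum(i * v for i, v in w.items()) / sum(w.values())

# Conditional ML estimate: solve x1 = E(X1 | x_dot; psi); the mean increases with psi
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    if cond_mean(exp(mid)) < x1:
        lo = mid
    else:
        hi = mid
psi_cml = exp((lo + hi) / 2)

# Exact one-tailed test of psi = 1
w1 = weights(1.0)
p_exact = sum(v for i, v in w1.items() if i >= x1) / sum(w1.values())

# Continuity-corrected normal approximation using the hypergeometric moments
mean0 = n1 * x_dot / n
var0 = n1 * n2 * x_dot * (n - x_dot) / (n ** 2 * (n - 1))
z = (abs(x1 - mean0) - 0.5) / sqrt(var0)
p_approx = 0.5 * erfc(z / sqrt(2))

print(psi_aml, psi_cml, p_exact, p_approx)
```

The printed values can be compared with those reported in the text (3.62, 3.61, 0.00009, and 0.00011, respectively).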
One explanation for the observed association between HL-A2 and ALL considered by Rogentine et al. (30) was that possession of HL-A2 conferred susceptibility to ALL. However, the 50 patients typed for the HL-A system were not an unbiased sample of the patients admitted over the period of years considered. In a later paper Rogentine et al. (31) analyzed a larger series of 137 patients cared for at NIH between 1962 and 1971. The HL-A typing was not begun until 1969, so that those diagnosed between 1962 and 1968 must necessarily have survived one year. For 1962-69, there were 32 in the typed series among the total of 279 patients (or 11%). In the ... Whenever one selects by a nonrandom mechanism from a larger set of cases, a danger of bias is present.
Here there is a bias in calendar date of admission and the fact that the cases must survive to the time the study began. Also they must survive to reach a research hospital.

Combination over Strata
Consider now the situation where the cases and controls may each be divided into several matched strata or blocks. Let $k = 1, 2, \ldots, K$ be the index for the strata. Extend the notation used above by adding a subscript $k$, i.e., $p_{1k}$, $p_{2k}$, $n_{1k}$, $n_{2k}$, $\beta_k$; $k = 1, 2, \ldots, K$. However, let $\lambda$ remain constant over the strata. This implies that $\psi_k = (p_{1k}q_{2k})/(p_{2k}q_{1k}) = e^{\lambda} = \psi$ is constant over the strata. Thus we are interested in making inferences on the common odds ratio, $\psi$, in the several 2 x 2 tables. Other models are possible, but this one has several convenient consequences: (1) it leads to a minimal set of sufficient statistics for which conditional inference is valid; (2) it is equivalent to the "no interaction" model of Bartlett (9) for 2 x 2 x K tables; (3) Radhakrishna (32) has shown this model to be robust with respect to asymptotic efficiency relative to other possible models, such as differences in the logarithms or the arc sines of the square roots of the $p$; (4) asymptotic unconditional inference can be conveniently employed using the logit transformation (33).
The methods for analyzing this model have been extensively reviewed by Gart (19). We shall limit ourselves here to sketching the main points. ... (23) in parallel with the previous result. In addition, here there is an exact test of the model that $\psi$ is constant. This test is based on the conditional distribution of $x_{11}, x_{12}, \ldots, x_{1K}$, given that $x_{1.}$ and $x_{.1}, x_{.2}, \ldots, x_{.K}$ are all fixed (9). All these conditional "exact" methods are implemented in a program by Thomas (17).
It should be noted that the above theory yields the usual methodology for the matched case, $n_{1k} = n_{2k} = 1$ for all $k$. An exact form of McNemar's test is given by Eq. (22), and $\hat\psi_{cml}$ is the ratio of unlike pairs (27).
The set of equations solved in finding $\hat\psi_{aml}$ is the same as those cited by Bartlett (9) and Norton (10).

Unconditional Inference on $\psi$ for the Combined Case
The analysis of the common $\psi$ in the unconditional space is most easily handled using logit arguments. Woolf (33) used logits to find point estimators, interval estimators, and interaction tests for this model. Haldane (40), Anscombe (41), Gart and Zweifel (42), and Cox (21) modified and improved somewhat the approximations to the mean and variance of the logit. Gart (29) showed the logit point estimator to be efficient and derived two other efficient point and interval estimators for the common $\psi$. Plackett (43), Grizzle, Starmer, and Koch (44), and Cox (21) extended the arguments of Woolf to more complex situations.
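The logit approach can be sketched as follows, using the familiar weighted form in which each stratum's log odds ratio is weighted by the reciprocal of its estimated variance, the sum of the reciprocals of the four cells. The function name, the `add` argument (which allows a constant such as 1/2 to be added to every cell, in the spirit of the modifications mentioned above), and the tables are all hypothetical.

```python
from math import exp, log, sqrt

def woolf(tables, add=0.0, z=1.959964):
    """tables: (a, b, c, d) per stratum, with stratum odds ratio a*d / (b*c).

    Returns the combined estimate, its lower and upper limits, and the
    interaction (heterogeneity) chi-square on K - 1 degrees of freedom.
    """
    logits, wts = [], []
    for cells in tables:
        a, b, c, d = (v + add for v in cells)
        logits.append(log(a * d / (b * c)))
        wts.append(1.0 / (1 / a + 1 / b + 1 / c + 1 / d))
    y_bar = sum(w * y for w, y in zip(wts, logits)) / sum(wts)
    se = 1.0 / sqrt(sum(wts))
    chi_het = sum(w * (y - y_bar) ** 2 for w, y in zip(wts, logits))
    return exp(y_bar), exp(y_bar - z * se), exp(y_bar + z * se), chi_het

# Hypothetical strata (cases in population 1, cases in 2, controls in 1, controls in 2)
strata = [(10, 20, 5, 40), (12, 18, 8, 35), (9, 25, 4, 45)]
psi_comb, psi_low, psi_high, chi_het = woolf(strata)
print(psi_comb, psi_low, psi_high, chi_het)
```

When every stratum has the same sample odds ratio, the combined estimate equals that value and the interaction chi-square is zero.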

An Application of Combination of Strata Methodology
It has long been known that nasopharyngeal carcinoma (NPC) has an elevated incidence among Chinese and Chinese-related populations. This has led to the speculation that this susceptibility has a genetic basis. Simons et al. (45) studied its association with the joint occurrence of HL-A2 and SIN-2 in a case-control study among the Chinese population of Singapore. The data and analyses are reproduced in Table 5. Note that the strata here refer to the dialect spoken by the individuals involved. The combined analyses (last column) clearly show an association of this genetic trait with NPC. The analyses of the individual dialect groups show only the Hokkien and Teochew groups with a significant $P$ value. However, the interaction tests do not show any significant differences among the individual estimators of $\psi$. For comparison purposes the analysis of the pooled estimator is shown in the penultimate row of Table 5. As the probabilities vary little among the strata, the results differ little from the combined analyses. The authors note that the cases with SIN-2 had poorer survival than those without it. Thus it is not thought that survival bias is present in this study. The authors suggest that the HL-A2-SIN-2 occurrence may confer susceptibility to NPC.

Prospective Studies

Binomial Analyses
In prospective studies the proportion of cases or deaths is directly observed in the two populations, say, $\hat P_1 = y_1/N_1$ and $\hat P_2 = y_2/N_2$. The unconditional maximum likelihood estimate of $R$ is $\hat R = \hat P_1/\hat P_2$. One can test $R = 1$ by the Fisher-Irwin exact test, as $P_1Q_2/(P_2Q_1) = 1$ is equivalent to $R = 1$. However, for the binomial model, $y_.$ is not an appropriate ancillary statistic for $R$ (46), and thus conditional inference on $R$ is not possible. Buhrman (47) showed that when inverse sampling is used (sampling until $N_1 - y_1$ non-cases and $N_2 - y_2$ non-cases are found in the respective populations), exact conditional confidence limits for $R$ can be derived (48,49). These do not appear to be useful in the usual epidemiological problem. Thomas and Gart (50) derived "exact" limits for $R$ from the exact limits on $\psi$ assuming all the marginals fixed. These limits have the required confidence coefficient in the conditional space if $P_1$ and $P_2$ are such that $N_1P_1 + N_2P_2 = y_.$, but they may be below the required value otherwise. Katz et al. and Gart and Thomas (23) investigated the Gart-Thomas limits in the unconditional space and found them to yield true confidence coefficients near the nominal values. Katz et al. also suggested basing limits on the log transform, i.e., letting $\ln \hat R = \ln \hat P_1 - \ln \hat P_2$ be approximately normally distributed, and found useful limits on $R$ in a fashion analogous to Woolf's limits on $\psi$.
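Limits based on the logarithmic transformation can be sketched as below. The variance of $\ln \hat R$ is not given in the text; the sketch assumes the usual large-sample (delta method) binomial estimate, $1/y_1 - 1/N_1 + 1/y_2 - 1/N_2$. The function name and the counts are hypothetical.

```python
from math import exp, log, sqrt

def log_limits_rr(y1, big_n1, y2, big_n2, z=1.959964):
    """Limits for R = P1/P2 treating ln(R_hat) as approximately normal."""
    r_hat = (y1 / big_n1) / (y2 / big_n2)
    # assumed delta-method variance of ln(P1_hat) - ln(P2_hat)
    se = sqrt(1 / y1 - 1 / big_n1 + 1 / y2 - 1 / big_n2)
    return r_hat, exp(log(r_hat) - z * se), exp(log(r_hat) + z * se)

# Hypothetical cohorts: 30 cases among 1000 and 15 cases among 1000.
r_hat, r_low, r_high = log_limits_rr(30, 1000, 15, 1000)
print(r_hat, r_low, r_high)
```

As with the logit limits for the odds ratio, these limits are symmetric about the estimate on the logarithmic scale.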
The logarithmic transformation may be used in the unconditional space to analyze the combination of R's over several strata. Combined estimators, confidence limits, and "interaction" (equality of the R) tests follow in a manner completely analogous to Woolf's (33) results. Radhakrishna (32) also investigated the robustness and power of the combined test of a common R.

Analysis of R in the Poisson Model
As most diseases have a low incidence and usually prospective studies are concerned with large populations, it is appropriate to approximate the distribution of the numbers of cases by the Poisson distribution. If we also assume an exponential regression of the mean of the Poisson variables on the population and strata effects (52,53), the model yields minimal sufficient statistics. This model assumes that $R = P_{1k}/P_{2k} = e^{\alpha}$ ($k = 1, 2, \ldots, K$), that is, that $R$ is constant over strata. The likelihood of this model yields $K + 1$ sufficient statistics, $y_{1.}$ and $y_{.k}$, $k = 1, 2, \ldots, K$. Conditioning on the $y_{.k}$ yields a product of $K$ independent binomial distributions exactly analogous to the hypergeometric distributions found in the binomial analyses of $\psi$. The details of the estimation of the common $R$, exact and approximate tests of $R = 1$ [cf. standardized mortality ratio (54)], and tests of the model are given in Gart (55). Gart (55) extended this model to comparing populations and adjusting for the sexual composition of the two populations. Letting the second subscript denote sex (1 = male, 2 = female), we assume the model of Eq. (29), where $\sigma_k$ is the sex effect on the incidence. This model assumes the population ratios within sexes are constant over age and strata, that is,

$$R_{.1k} = P_{11k}/P_{21k} = e^{\alpha} = P_{12k}/P_{22k} = R_{.2k}, \quad k = 1, 2, \ldots, K. \tag{30}$$

The sex ratios are constant within strata, but may vary over strata [see Gart (55) for details of the analyses of this model, particularly for testing $\alpha = 0$, the lack of population differences].
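The conditional argument for a common $R$ can be illustrated with a short sketch. It assumes that the number of cases in population $i$ and stratum $k$ is Poisson with mean $N_{ik}P_{ik}$, so that, given the stratum total $y_{.k}$, the count $y_{1k}$ is binomial with probability $RN_{1k}/(RN_{1k} + N_{2k})$. The estimate then solves $\sum y_{1k} = \sum y_{.k} RN_{1k}/(RN_{1k} + N_{2k})$. The function name and the data are hypothetical, and this is not presented as the procedure of Gart (55).

```python
from math import exp

def common_ratio(strata, lo=-10.0, hi=10.0, iters=200):
    """strata: (y1, N1, y2, N2) per stratum; conditional ML estimate of a common R."""
    y1_total = sum(s[0] for s in strata)

    def expected_y1(r):
        # sum over strata of y.k times the conditional binomial probability
        return sum((y1 + y2) * r * n1 / (r * n1 + n2) for y1, n1, y2, n2 in strata)

    # expected_y1 increases with r, so bisect on log(r)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if expected_y1(exp(mid)) < y1_total:
            lo = mid
        else:
            hi = mid
    return exp((lo + hi) / 2)

# Hypothetical data in which the rate ratio is exactly 2 in each stratum.
data = [(20, 1000, 10, 1000), (60, 2000, 15, 1000)]
r_common = common_ratio(data)
print(r_common)
```

With a single stratum the estimate reduces to the simple ratio of the observed rates, $(y_1/N_1)/(y_2/N_2)$.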

Analysis of the Ratios of R in the Poisson Model
The more interesting model to consider is whether the ratio of population ratios is constant over the sexes,

$$\psi_k(P) = R_{.1k}/R_{.2k} = (P_{11k}P_{22k})/(P_{21k}P_{12k}) = \psi(P), \quad k = 1, 2, \ldots, K.$$

[... Eqs. (6) and (22)]. The fixed marginals in the $K$ 2 x 2 tables here are $y_{1.k}$, $y_{2.k}$, $y_{.1k}$, and $y_{.2k}$. The parameter is not $\psi$ but $\psi(P)\psi_k(N)$, where

$$\psi_k(N) = \frac{N_{11k}N_{22k}}{N_{12k}N_{21k}}, \quad k = 1, 2, \ldots, K. \tag{33}$$

If the $\psi_k(N) = 1$ for all $k$, the two analyses are exactly equivalent, with the total cases playing the role of the $n$. However, if they are not, the test of $H_0$: $\psi(P) = 1$ involves a noncentral distribution of each of $K$ hypergeometric variates. For this case, Gart (55,56) has derived the detailed tests of $\psi(P) = 1$, point estimators, and tests of the model. If all the $\psi_k(N) \approx 1$, the binomial analysis and this analysis will yield quite similar results. If one is comparing disproportionate populations, such as native-born and immigrant populations, the $\psi_k(N)$ may depart considerably from unity and the Poisson analysis is the more appropriate.
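The role of the population cross-product ratio can be seen in a small numerical sketch. Assuming the expected number of cases in each cell is $N_{ijk}P_{ijk}$, the cross-product ratio of the case counts in a stratum estimates $\psi(P)\psi_k(N)$ rather than $\psi(P)$, so a crude stratum estimate of $\psi_k(P)$ divides by $\psi_k(N)$. The function names and all numbers are hypothetical (first index population, second index sex).

```python
def cross_product(t11, t12, t21, t22):
    """Cross-product ratio (t11 * t22) / (t12 * t21) of a 2 x 2 array."""
    return (t11 * t22) / (t12 * t21)

def psi_p_estimate(cases, populations):
    """Case cross-product ratio divided by the population cross-product ratio."""
    return cross_product(*cases) / cross_product(*populations)

# Hypothetical stratum with disproportionate populations
populations = (1000, 2000, 1500, 1200)      # N11, N12, N21, N22
rates = (0.03, 0.01, 0.02, 0.01)            # P11, P12, P21, P22
cases = tuple(n * p for n, p in zip(populations, rates))   # expected case counts

psi_n = cross_product(*populations)
print(cross_product(*cases), psi_n, psi_p_estimate(cases, populations))
```

Here the case cross-product ratio is well below one even though the ratio of population ratios is 1.5, because the populations are far from proportionate.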

Applications of the Poisson Model
We consider two cases in which population based data are used to test hypotheses concerning cancer incidence.
Example 1: Non-melanoma Skin Cancer. Scotto et al. (57) studied the incidence of skin cancer other than melanoma among whites in four areas of the U.S.A. Latitude greatly affects the incidence of this disease. It is of interest to compare the incidence in the northernmost population studied, Minneapolis-St. Paul, with the southernmost, Dallas-Fort Worth. These data are presented in Table 6. Clearly Dallas-Fort Worth has higher rates than Minneapolis-St. Paul. For males the average ratio is $\hat R_{.1.} = 2.73$ ($z = 23.38$, $p < 0.0001$) and for females the average ratio is $\hat R_{.2.} = 2.23$ ($z = 15.80$, $p < 0.0001$). It is also clear that male incidence is higher than the female incidence. However, is the ratio of area incidences higher for males than for females? Or equivalently, is the relative risk for males to females higher in the southern population than in the northern population? This is answered by testing whether $\psi(P) = 1$ (or $\Delta = 0$). The asymptotic maximum likelihood estimator of $\psi(P)$ is $\hat\psi(P) = 1.19$, which, it should be noted, is not simply the ratio of the $\hat R$. The combined normal deviate test yields $z = +2.44$ ($p = 0.0073$). Table 6 also gives the individual estimates and tests by age group. In two of the early age groups and the oldest age group $\hat\psi_k(P)$ is less than one. These are based on such small numbers as not to be significantly different from one. The three age groups spanning 45-75 yield $\hat\psi_k(P)$ appreciably greater than one and each yields a normal deviate test which is nearly significant. There is not, however, any significant variation among the $\hat\psi_k(P)$, as the goodness of fit test is nowhere near significance. These data clearly show that the North-South incidence ratio is significantly higher for males than for females. It ... (58) attempted to link cancer to the artificial fluoridation of the water supplies. In examining this question, Hoover et al. (59) used, among other data, the cancer incidence data of the Second (60) and Third (61) National Cancer Surveys. The second survey was done in 1947-48 and the third survey was done in 1969-71. Hoover et al. noted that Denver was not fluoridated in 1947-48, but by 1955, 66% of the area was fluoridated. On the other hand, Birmingham was largely unfluoridated throughout the time period, being only 3.2% fluoridated in 1970. Thus if fluoridation has an effect on cancer incidence we might expect the rates in Denver to increase relative to those in Birmingham over this time period.
As once again we have a comparison of a northern and a southern city, Hoover et al. (59) excluded the skin cancers whose rates, we have just seen, are greatly affected by differences in latitude. As male ratios are more likely to be affected by occupational considerations, we shall consider here only the female rates (although Hoover et al. considered both). The data are given in Table 7. It is to be noted that the second survey was of one year's duration while the third survey was a three-year survey. This difference does not affect the comparisons within surveys or the analyses of the ratio of ratios, but it would affect the direct comparison of surveys. In Table 7 we should note that the cities play the roles the sexes previously played and the surveys play the roles the populations previously played.
Using the binomial analyses, Hoover et al. found the $\hat\psi$ (Denver to Birmingham) for the second survey to be 1.02, and for the third survey to be 1.07. Since the 95% confidence limits (17,37) for $\psi$ in the second survey (0.91-1.15) entirely covered those for the third survey (1.02-1.13), they took this as "indicating no statistically significant differences." The populations considered here are not proportionately distributed by survey time; the $\psi_k(N)$ vary somewhat. They range from 0.539 in the youngest age group to 1.274 in the oldest age group. Thus the Poisson analysis is preferred when testing $\psi(P) = 1$.
The results in Table 7 agree, in the main points, with Hoover et al. The relative risk of Denver to Birmingham is found to be 1.02 in the second survey and 1.07 in the third survey. The asymptotic ML estimator of the ratio of ratios is found to be 0.969. The normal deviate test of $\psi(P) = 1$ yields $z = -0.46$ ($p = 0.323$). The goodness of fit test yields a chi-square of 6.484, with 7 degrees of freedom, so there is no indication that the $\psi_k(P)$ vary significantly over the age groups. Thus this test also concludes that there is no indication of a significant increase in the cancer rates among females in Denver as compared to Birmingham in the period following fluoridation in Denver. This test also yields a similar nonsignificant finding for the male comparison.
I am grateful to Joseph Scotto, John L. Young, and Robert N. Hoover for making available to me the detailed data used in the last two examples. I am grateful to Alroy M. Smith for computer programming and Sue Tiffany for typing of the manuscript.