Estimation of gene frequency and test for Hardy-Weinberg equilibrium in the HLA system.

This paper concerns the testing for Hardy-Weinberg equilibrium and the estimation of gene frequency in the human leukocyte antigens (HLA) system. An extensive simulation study for both testing and estimation is given for investigating the performance of the projection method by Eguchi and Matsuura, which has a closed form, and the method is asymptotically equivalent to the maximum likelihood method. We compare our projection test statistic with the likelihood ratio test and the single degree of freedom chi-square test suggested by Nam and Gart. Actual mean square errors of the projection estimator of gene frequency under the Hardy-Weinberg equilibrium are compared with the maximum likelihood estimator and some other estimators recently discussed by Nam.


Introduction
The human leukocyte antigen (HLA) system has been observed, not only in human genetics and anthropology but also in biostatistics. Farewell and Dahlberg (1) investigated some statistical methodology including the analyses for associating particular diseases with some genotypes in the HLA system. For these statistical analyses or gene frequency estimations, one of the most fundamental assumptions is the Hardy-Weinberg law. Eguchi and Matsuura (2) proposed a projection method for the testing and estimation problem. The method is associated with a geometric interpretation similar to least-square method in the linear regression model. The key idea is to construct the regression plane, which is tangent to the (gene frequency) parameter space in the Hardy-Weinberg equilibrium. Thus, the method is based on the projection of sufficient statistics onto the regression plane. The test statistic can be regarded as the residual sum of squares and the estimator for gene frequency regarded as the least-square estimator. This paper describes the structure of HLA data used in the statistical analysis and gives brief reviews on the methods used to test the Hardy-Weinberg equilibrium and estimate gene frequencies in the HLA system. Furthermore, the test statistics and the gene frequency estimator given by Eguchi and Matsuura (2) are compared with those either recently proposed or those formally used in the practical fields in a simulation study. The test statistics that are compared include the ordinal likelihood ratio statistics and the single degree of freedom test statistic proposed by Nam and Gart (3). The simple Bernstein's method extended by Yasuda and Kimura (4); the gene-counting method developed by Smith et al. (5)(6)(7) and extended by Yasuda and Kimura (4); and the bias reduced method proposed by Nam (8) are included for comparing the gene frequency estimators.
Structure of the HLA Data and Model under the Hardy-Weinberg Law The HLA system consists of several linked loci on chromosome 6. A large number of alleles in the population are at each locus, but the complete set of alleles at a particular locus has not yet been identified. New loci and alleles have been recognized at the International HLA Workshop and their names are decided by the World Health Organization (WHO) Nomenclature Committee. By 1984 the loci A, B, C, D, DR, DP, and DQ were recognized [see Albert et al. (9) for all alleles found at each locus]. Throughout this paper, we consider the alleles at a fixed locus and treat the phenotypic data.
Using the notation of Yasuda and Kimura (4) and Nam and Gart (3), m -1 known alleles (or antigens) at a given locus are denoted as A1, A2, ..., Am_ 1, and a pool of unidentified alleles denoted by 0, which is considered a recessive allele in the generalized ABO-like blood system. Since chromosomes are paired, an individual will have two alleles. Thus, the possible combinations of genotypes are AAj, AiO, AjAk, and 00, 1 -im -1, 1 S jkm -1, and the total genotype is m(m + 1)/2. The indices i, j, and k are run, as previously mentioned, and we sometimes omit their ranges.
For the serologically defined antigens (for example HLA-A, -B, -C, and -DR), typing panels-where the known antibody corresponding to the antigen Ai for some i is included-are used for detecting the HLA type of an individual, which is decided by an antigenantibody reaction. The mixed lymphocyte reaction is used for HLA-D locus. The alleles A1, A2,..., Amr1 are codominant, therefore, if an individual possesses two distinct antigens Aj and Ak, the phenotype is denoted by AjAk, which is identical to the genotype. However, if only one antigen Ai is detected for a person, it is not possible to distinguish the genotype AjAi from the genotype A%O without typing other family members, so we denote phenotype of this person as Ai. If no antigen is detected, the person can be considered as having two unknown alleles that are different from the known alleles Ai, i = 1, 2, . . . , m -1. In this case the genotype is referred to as 00 and the phenotype denoted by 0. We sometimes call the phenotype 0 double blanks. Accordingly, the total phenotype becomes (m2m + 2)l2.
There are two possibilities in constructing the genotype AjAk, 1 -j < km -1: one assumes that the allele Aj comes from the mother and Ak, from the father; the other is the reverse with the allele Ak coming from mother and vice versa. According to the Hardy-Weinberg equilibrium or random mating, the probabilities of the genotype AjAk in the population are given by PjPk + PkPj = 2P,Pk, 1 -j < k -< m -1, where pi is the gene frequency of the codominant allele Ai, 1 -i -i m -1. Here the frequency of a pool of unknown alleles mr-1 is denoted by r, so that p Pi + r = 1. Similarly, probabilities of genotypes AjAi, A4O, and 00 are given by p2i, 2pir and r2, respectively, where i = 1, 2,..., m -1. For the phenotypic data the probability of the phenotype Ai can be taken by merging the probabilities of the genotypes AjAi and AjO, thus we obtain p2 + 2pir for i = 1, 2, . . ., m -1. Probabilities of the phenotypes AjAk and 0 are the same as those genotypes AjAk and 00, respectively. In the population of sample size N, the observed numbers of phenotypes Ai, AjAk, and 0 are written by ni, n1k, and noo, respectively. Define nkj = njk, 1 -j < k m m -1. For the convenience notation, let Gi be the sum of the observations with the phenotype Ai, that is m-1 Gi = ni + E nij for i = 1,2, . . . , m-1 jXij The next section investigates the testing problem based on the phenotypic data.

Testing the Hardy-Weinberg Equilibrium
As noted in the "Introduction," the notion of gene frequency is reasonable only when the population is subject to the Hardy-Weinberg law. Therefore, if this assumption does not hold, resulting estimates of gene frequencies will be misleading whatever method is used. Thus, a check of this assumption should be included in the analysis of gene frequencies.
In a population of sample size N, the counts ni, n1k, and noo, 1 S i S m -1, 1 -j < km -1, respectively, have a multinomial distribution with cell parameter vector where 7ro + E -ri + E Trjk = 1. If the population is subject to the Hardy-Weinberg law, then the vector r is in the subsurface which is the subsurface in the space HI of cell parameter vectors. Thus the null hypothesis that the population is under the Hardy-Weinberg law is expressed as H: Tr E IIHw and the alternatives as K: r e H1-rIHW-Testing the hypothesis H is usually accomplished by means of a goodness-of-fit likelihood ratio or Pearson chi-square statistic with a null distribution characterized by chi-square distribution with (m -)(m -2)/2 degrees of freedom. Let L(wa) be the likelihood of r, then the likelihood ratio statistic is given by XLR = 2{log L(wa) -log L(sp*)l where I is the full maximum likelihood estimator (MLE) on xn, and p* is the MLE ofp = (Pi, P2, ... pm-_)T under the Hardy-Weinberg law. Here, it is intractable to obtain the MLE p*, which may need an iteration method. Note that the minimum chi-square statistic also requires such an iteration technique. To obtain the MLE iteratively, Yasuda and Kimura (4) extended the gene counting method to the generalized ABO-like system. The gene counting method was devised by Ceppellini, Siniscalco and Smith (5) and develop by Smith (6,7). Using the gene counting method the estimates at the kth step are given by ()_Gi p(k -1) ni8~k ) + ki (nt Alk) = 2N +(k-1) + 2r(k-1) 12N) (1) for i= 1,2,..., m -1. and rn-i r(k-1) = 1 -E p(k-1). i=i Here, the Bernstein estimator extended by Yasuda and Kimura (4) may be used for the first step, which is given by (2) and r= Vi Nam When m = 3, or the ABO system, XZ reduces to the statistic given by Stevens (10). Though the statistic X2D does not require the iterative manner, Eguchi and Matsuura (2) theoretically pointed out that the statistic X2D essentially causes an acceptance region outside fLHW whatever significance level one chooses when m > 3. This point will be observed in the simulation study here.
Eguchi and Matsuura (2)  2N -PjPk 1 j < k S m -1, and X6 is the Fisher information matrix of 0 when evaluated at ', and is given in Appendix 1. The test statistic X2PR has a closed representation of a residual sum of squares by a projection mapping onto the estimated regression plane. The derivation is based on the maximum likelihood method and closely related to a standard regression theory.

Estimation of Gene Frequencies
The simple Bernstein's estimator obtained by Eq. (2) has been widely used for the generalized ABO-like system. However, the sum of these estimates (i.e., I + r) is not necessarily equal to one. Also, according to the simple Bernsteins's method, the estimate of the recessive gene frequency has the negative bias, (11). Fur-thermore, the simple Bernstein's estimator is generally inefficient when m -3. Smith (7) gave an alternative estimator of the recessive gene frequency defined by rn-l r= 1 -> i in which case the sum of estimates is equal to one and V(r ) -V(r), but ro may yield negative values. Yasuda and Kimura (4) proposed the so-called adjusted Bernstein's estimator Pi = pi(l + D/2) for i = 1, 2,..., m -1, where D is defined in Eq. (3). Here I p* + r =1-D214, and r* may also yield negative values. Nam and Gart (3) suggested the so-called modified Bernstein's estimator p'i = Pi / (1 -D/2) for i = 1, 2,.. ., m-1 and r' (r + D12) / (1 -D12) and showed that the adjusted and modified estimators are not only inefficient, but also both of them have asymptotic variances larger than the simple Bernstein's estimator. They also investigated the problem that the method with a single gene-counting iteration, defined in Eq. (1) using the modified Bernstein's estimator as an initial step, leads to a nearly efficient estimator.
Haldane (11) recognized that the Bernstein's recessive gene estimator is negatively biased and suggested the corrected version: r = V(noo + 0.25) / (N + 0.25) Also, Nam (8) recently investigated the bias of the simple Bernstein's method and considered the reduction of its bias in the generalized ABO-like system. Removing the major bias term, he obtained a bias reduced Bernstein estimator as pi= 1 -V -G/(N + 0.25) fori= 1,2,...,m-1 for the codominant allele frequencies. For the recessive allele frequency, the form of the estimator is identical to Haldane's estimator. Nam (8) also suggested another estimator called the bias reduced Smith's estimator, which is given by the equation r= 1p i=l and he showed that this estimator is best used in achieving the smallest absolute bias among the simple Bernstein's, Smith's and Haldane's estimators.
Regarding the inference procedure based on the likelihood function, Fisher (12) introduced the joint maximum likelihood estimation of gene frequencies; Rao (13) also discussed the estimation. Farewell (14) gave the details of its application to HLA data and used some convenient reparametrizations to improve the rate of convergence [see also Farewell and Dahlberg (1)]. Furthermore, Gart and Nam (15) showed the detailed description for obtaining the MLEs using the scoring method discussed by Rao (16). They also showed that the MLEs of codominant allele frequencies are given by Pi = (Gi + ni)I2N fori = 1, 2,...
-1, under the assumption that r = 0. As noted in the previous section, by using the gene-counting method defined in Eq. (1), the estimates obtained that are well converged are equivalent to the MLEs, which are fully efficient estimators, but the estimates require the iterative calculations.

Simulation Study
Simulation Procedure For the test statistics X2 PR, X2LR and X2 D, the actual Type I error (the size of test) and the powers are compared in a simulation study. We also investigate the actual squared errors of the projection method, gene counting method, bias reduced method and the simple Bernstein's method. In the simulation for fixed N and m, our procedure for the data generating process under the Hardy-Weinberg equilibrium is taken as follows: a) To determine the true gene frequencies, m -1 random numbers are taken from the uniform distribution U(0,1) and are arranged in ascending of magnitude such that u1 <U2 < ... < um-1. Define r = U1, Pi = Ui+ -Ui fori = 1,2,..., m -1, and pm-1 = 1 -umi1. We regard the set (r, P1, ... , Pm-i) as the true gene frequencies. Note that we omit the frequency set if r > 0.5, since such a situation is rate in practice; we also omit the one with pi < 0.005 for some i, because the Gi is likely to be zero. b) To determine the genotype of an individual, we generate a pair of random numbers (u;, u2) from U(O,1) and define ur = 1. If u; lies in a interval between 0 and u1, namely u; e [0, u1], we regard the individual as having the recessive allele 0, and if u; E (ui, ui 1), then we consider that the individual has the codominant allele Ai for some i = 1, 2,..., m -1. Note that the length of the interval (ui, ui+1) is equivalent to the true gene frequency pi. Similarly, we determine the other phenotype of this individual using u2. Therefore, we can decide the phenotype of this subject according to the genotype. c) For fixed sample size N, step b is iterated N times and we count the observed numbers noo, ni, n,k for 1 -im -1 and 1 -i < jm -1. If Gi = 0 for some i, we omit the data set from the simulation. d) We repeat the process from a to c 5000 times for fixed N and m. For the situations where the Hardy-Weinberg law equilibrium does not hold, we change the steps a and b as follows: a) Since the possible number of phenotypes is 1 + m(m -1)12, m(m -1)/2 random numbers are taken from U(0,1) and arranged in ascending of magnitude such that v, < V2 <, ... ,< Vm(m1)12. We define -no = V1l 1T1 = V2 -V1 ... v Tm-3,m-1 = Vm(m-1)/2-1 -Vm(m1)X2 and 7rm-2,m-l = 1 -Vm(m1l)/2. We regard these sTo, wi, 1 t i < m -1, Tjk, 1 < j < k < m -1as the cell probabilities of the multinomial distribution of sample size N. b) If a random number v* following U(0,1) lies in a interval [0, v1], we assume that the subject has the phenotype 0. If v* E (vi+1 -vi), the subject has phenotype Ai. Similarly, the subject has the phenotype A1k, if v* falls in the corresponding interval of ITjk, 1 j < km -1. The remainder is the same as the step (b). Note that true gene frequencies and true cell probabilities are not fixed, and these are changed in each replication in order to examine the numerous situations.

Results of Testing the Hardy-Weinberg Equilibrium
The results on sizes, based on the percentage of times the computed test statistic exceeded the 0.05 level critical values, are presented in Table 1 for m -1 = 5 and 10. The results based on the 0.10 critical value are also included in Table 1.
For the test statistic based on the projection method, the actual Type I errors are less than nominal size 0.05 for m -1 = 5 and for m -1 = 10 with N > 500; the test statistic becomes conservative as the sample size becomes large. The likelihood ratio test seems to be somewhat liberal for the range of N between 250 and 2500, which is the usual sample size in practical fields. And this test is too conservative for m -1 = 10 and N < 75. For the one degree of freedom test statistic, the actual Type I error is larger than 0.05. Therefore this test seems to be liberal, regardless of sample size.
To investigate the situation with no double blanks, that is noo = 0, under the Hardy-Weinberg law, we  (17), Gart and Nam (18), and Nam and Gart (19,20). The results are given in Table 2 for m -1 = 5 and 10 and N = 100, 250, 500, and 1000. In this restricted situation theoretical behavior has not been examined. But the simulation shows that the actual Type I error decreases for the projection method, as compared to Table 1 with the same sample size. The similar behavior can be seen for the likelihood ratio test when m -1 = 5 and N < 500, and m -1 = 10 and N < 250; conversely, the Type I error tends to increase for a moderate large sample size. For test statistics with one degree of freedom, such behavior like the likelihood ratio test becomes too strong.  Nam and Gart (20) also showed the inefficiency of the statistic X2D in another testing problem.

Results of Estimation of Gene Frequency
To compare several estimators we use the measure of actual mean square effort (MSE) defined by s MSE = -I (ii(s) -p(8))T (j(s) _ p()) s8=1 in the sth simulation with replications S = 5000. Table 4 presents the comparison of MSE based on the several methods of estimation under the Hardy-Weinberg equilibrium. For the situation with no double blanks under the Hardy-Weinberg law, the results are shown in Table 5. Estimates based on the gene-counting method have small MSE compared to other estimates for all sample sizes and for the number of alleles in Tables 4 and 5. The MSE based on the projection method is slightly larger than that of MLE, but the difference in MSE becomes smaller as the sample size becomes larger. And Y, is the information matrix of i given by Here rO = 1 -> Pi, the Smith's estimator. The de-i=l rivation of rc is based on the maximization of p(r) under ro < 0, defined by p(r) = 2r exp[-2 (rro)1.
Here 2r can be considered as prior density of r. And the rOc have the properties as follows: (a) ?Oc > 0 and roc > oro for any AO,(b) if #O is fixed and N oo, then m-l rOc = ro when I Pi < 1, andO = O when Pi p¢ 1, and (c) #Oc and #O have the same asymptotic distribution. Therefore, the rOC is consistent since r1O is consistent. Thus, we use the consistent estimator ITOc defined by f2c.