A comparative study of two statistical models for the analysis of binary data from longitudinal studies.

This study extensively compares two statistical models for the analysis of binary data from longitudinal studies. The first model was proposed by Zeger, Liang, and Self (abbreviated as the ZLS model) and the second by Origasa. The comparison is made from both analytical and statistical viewpoints. The analytical comparison discusses the form of the models; the statistical comparison evaluates, by simulation, the effect of model misspecification, assuming that the ZLS model is true.


Introduction
Studies in the health sciences frequently use a design involving a time factor. Data are collected on multiple occasions for each subject and may be referred to as longitudinal data or repeated measures. Such data can be produced by both retrospective and prospective studies. Survival data are usually excluded from this category because they cannot involve recurrent events (1). Thus, a longitudinal study may be defined as one in which data are collected on several occasions, regardless of the direction and type of study.
The longitudinal study provides several advantages over the cross-sectional study. For example, it increases the precision of treatment contrasts by eliminating between-individual variation and enables us to examine each individual's changing response pattern over time.
Time series techniques may be a solution for the analysis of such dependent observations. However, they are effective only for studies with a large number of occasions. A study in the health sciences often involves a relatively small number of occasions, say two to six. In most of the clinical trials conducted by Japanese pharmaceutical companies, data are collected on a few occasions after randomization. Also, from the viewpoint of modeling, the model needs to incorporate covariates such as baseline risk factors.

Literature Review
Three approaches are possible for analyzing binary longitudinal data. The first is modeling for marginal probabilities; the second is modeling for transition (or conditional) probabilities; and the last is nonparametric, i.e., not a model-based approach.
With respect to the first type of modeling, the GSK (Grizzle, Starmer, and Koch) linear model (2) is fairly general. It can be applied to longitudinal data (3). Suppose there are T occasions with binary responses. Then there are $2^T$ possible response profiles, and each individual corresponds to one of them. One can express any function generated from the vector of profiles. Another approach is given by Liang and Zeger (4), which uses the generalized linear model (5). The within-individual covariance matrix is included in the model. This model allows us to deal with a mixture of discrete and continuous variables.
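As a small illustration of the profile count mentioned above, the following sketch enumerates every binary response profile for T occasions. The function name `response_profiles` is hypothetical and is not part of the GSK methodology itself.

```python
from itertools import product

def response_profiles(T):
    """Return all 2**T binary response profiles for T occasions.

    Each profile is a tuple such as (0, 1, 1), read as the responses
    at occasions 1, 2, and 3.
    """
    return list(product((0, 1), repeat=T))

profiles = response_profiles(3)
print(len(profiles))  # number of profiles for T = 3
print(profiles)
```

The number of profiles doubles with each added occasion, so a profile-based model is practical only when T is small.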
Modeling for transition probabilities has been proposed by many authors. Again, the GSK approach is applicable to them. The Markov chain model can also be applied. Muenz and Rubinstein (6) have shown a logistic expression for the transition probabilities. The last two approaches, the ZLS model (7) and the Markov logistic regression model (8), are described in separate sections.
By rearranging the data into T consecutive 2 x 2 tables (Table 1), several authors have proposed different statistics to test for a treatment effect (8)(9)(10). The underlying model is a T-fold product binomial (11) with the Markov property.

ZLS Model
The ZLS model is formulated in two stages. The first stage models the probability at the initial occasion:
$$\operatorname{logit}(p_{i1}) = \log\{p_{i1}/(1 - p_{i1})\} = Z_i'\delta,$$
where $Z_i$ is a $q \times 1$ vector of time-independent covariates and $\delta$ is a vector of associated parameters. The second stage expresses a stationary first-order autoregressive process, that is,
$$p_{it} = \Pr\{y_{it} = 1 \mid y_{i,t-1}\} = p_{i1} + \rho\,(y_{i,t-1} - p_{i1}), \quad t = 2, \ldots, T,$$
where $\rho$ is the autocorrelation coefficient.
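To make the two-stage structure concrete, here is a minimal simulation sketch. It assumes the second stage takes the form Pr(y_it = 1 | y_i,t-1) = p_i1 + rho (y_i,t-1 - p_i1), which is the form implied by the expression for rho in the Comparisons section. The names `simulate_zls`, `z`, and `delta`, and the numeric values, are hypothetical.

```python
import math
import random

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def simulate_zls(z, delta, rho, T, rng):
    """Simulate one subject's binary series under the assumed ZLS form.

    Stage 1: logit(p_i1) = z' delta gives the initial probability.
    Stage 2: Pr(y_t = 1 | y_{t-1}) = p_i1 + rho * (y_{t-1} - p_i1).
    For 0 <= rho <= 1 the stage-2 probability always lies in [0, 1].
    """
    p1 = expit(sum(zj * dj for zj, dj in zip(z, delta)))
    y = [1 if rng.random() < p1 else 0]
    for _ in range(1, T):
        p_t = p1 + rho * (y[-1] - p1)
        y.append(1 if rng.random() < p_t else 0)
    return y

rng = random.Random(0)
series = simulate_zls(z=[1.0, 0.5], delta=[-0.4, 0.8], rho=0.3, T=4, rng=rng)
print(series)
```

Note that the covariates enter only through the initial probability; every later occasion depends on them only via p_i1.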
The time course is determined simply by the most recent outcome, the autocorrelation coefficient, and the initial probability, so that no covariates are related to changes over time in the probability of having a symptom. Statistical inference can be performed using the likelihood of the full response sequence, which is called the unconditional likelihood because it summarizes the entire data.
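For orientation, one standard way to write a likelihood of this kind for a first-order Markov chain is sketched below. It is an assumption of this sketch, not a reproduction of the original display: it takes $n$ subjects with complete data on $T$ occasions, and a transition probability of the form $p_{it} = p_{i1} + \rho\,(y_{i,t-1} - p_{i1})$ for $t \ge 2$.

```latex
L(\delta, \rho)
  = \prod_{i=1}^{n}
    \Bigl[ p_{i1}^{\,y_{i1}} (1 - p_{i1})^{1 - y_{i1}} \Bigr]
    \prod_{t=2}^{T}
    \Bigl[ p_{it}^{\,y_{it}} (1 - p_{it})^{1 - y_{it}} \Bigr],
\qquad
p_{it} = p_{i1} + \rho\,(y_{i,t-1} - p_{i1}).
```

The first bracket is the contribution of the initial observation; a likelihood that omitted it would condition on $y_{i1}$ rather than summarize the entire data.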

Markov Logistic Regression Model (MLRM)
This model comes from a small modification of the ordinary logistic regression model to incorporate the previous outcome as a covariate. It approximately corresponds to the covariance structure of a first-order autoregressive process. It allows us to use data much more efficiently than the multivariate approach, such as the one proposed by Grizzle and Allen (12). The principle is that adjacent nonmissing pairs can be used. The model is expressed as
$$\operatorname{logit}(p_{it}) = \alpha + \lambda y_{i,t-1} + X_{it}'\gamma + Z_i'\delta,$$
where $p_{it}$ is the conditional probability of having a response at time $t$ ($t = 1, \ldots, T$) for the $i$th individual, given the past observation ($y_{i,t-1}$) and the covariates ($X_{it}$, $Z_i$).
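The use of adjacent nonmissing pairs can be sketched as follows. This is an illustration only: encoding a missing occasion as `None`, and the names `adjacent_pairs` and `mlrm_prob`, are assumptions of the sketch, and the linear predictor follows the form logit(p_it) = alpha + lambda y_i,t-1 + X_it' gamma + Z_i' delta.

```python
import math

def adjacent_pairs(y):
    """Return (previous, current) pairs for adjacent occasions
    at which both responses are observed (None marks a missing value)."""
    return [(prev, cur) for prev, cur in zip(y, y[1:])
            if prev is not None and cur is not None]

def mlrm_prob(alpha, lam, y_prev, x, gamma, z, delta):
    """Conditional probability of a response under the MLRM form."""
    eta = alpha + lam * y_prev
    eta += sum(a * b for a, b in zip(x, gamma))  # time-dependent covariates
    eta += sum(a * b for a, b in zip(z, delta))  # time-independent covariates
    return 1.0 / (1.0 + math.exp(-eta))

y = [1, None, 0, 0, 1]        # the second occasion is missing
print(adjacent_pairs(y))      # only pairs with both members observed remain
print(mlrm_prob(-0.5, 1.2, 1, [0.3], [0.4], [1.0], [0.2]))
```

A subject with a missing occasion still contributes every pair that does not touch the gap, whereas a multivariate approach that needs the complete response vector would drop the subject altogether.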

Comparisons
Suppose that there are no time-dependent covariates. The MLRM then reduces to
$$\operatorname{logit}(p_{it}) = \lambda y_{i,t-1} + Z_i'\delta,$$
with the intercept absorbed into $Z_i'\delta$. The ZLS model is, on the other hand, expressed as
$$p_{it} = p_{i1} + \rho\,(y_{i,t-1} - p_{i1}), \qquad \operatorname{logit}(p_{i1}) = Z_i'\delta^*,$$
where the parameter $\delta^*$ is generally different from $\delta$. Although the transition of responses is governed only by a constant autocorrelation parameter ($\rho$) in the ZLS model, the corresponding quantity for the MLRM is a complex expression:
$$\rho = \frac{\exp(\lambda y_{i,t-1} + Z_i'\delta)/\{1 + \exp(\lambda y_{i,t-1} + Z_i'\delta)\} - p_{i1}}{y_{i,t-1} - p_{i1}}.$$
A plausible expression for the relative risk from a previous outcome may differ between the two models. For the ZLS model it may be useful to express it in an additive form:
$$R_{ZLS} = \Pr\{y_{it} = 1 \mid y_{i,t-1} = 1\} - \Pr\{y_{it} = 1 \mid y_{i,t-1} = 0\} = \rho.$$
If a past observation is unrelated to the present one, this relative risk should be zero, which corresponds to $\rho = 0$. For the MLRM, the relative risk may be usefully expressed in a multiplicative form, whose null value is 1 when $\lambda = 0$, meaning that there is no effect from the previous outcome.
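The contrast between the two models can be checked numerically. The sketch below assumes the ZLS transition form p_i1 + rho (y_prev - p_i1) and an MLRM linear predictor lambda * y_prev + eta, where `eta` stands for the covariate part. It uses the odds ratio as one natural multiplicative measure for the MLRM; that choice is an assumption of this sketch, and all function names and numeric values are hypothetical.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def additive_risk_zls(p_i1, rho):
    """Pr(y=1 | prev=1) - Pr(y=1 | prev=0) under the assumed ZLS form."""
    p_given_1 = p_i1 + rho * (1 - p_i1)
    p_given_0 = p_i1 + rho * (0 - p_i1)
    return p_given_1 - p_given_0          # algebraically equal to rho

def odds_ratio_mlrm(lam, eta):
    """Odds of a response after prev=1 relative to prev=0 under the MLRM."""
    p1, p0 = expit(lam + eta), expit(eta)
    return (p1 / (1 - p1)) / (p0 / (1 - p0))  # algebraically exp(lam)

def implied_rho(lam, eta, p_i1, y_prev):
    """The ZLS-style rho implied by an MLRM transition probability."""
    return (expit(lam * y_prev + eta) - p_i1) / (y_prev - p_i1)

print(additive_risk_zls(p_i1=0.3, rho=0.25))
print(odds_ratio_mlrm(lam=0.7, eta=-0.2), math.exp(0.7))
print(implied_rho(0.7, -0.2, 0.4, y_prev=1), implied_rho(0.7, -0.2, 0.4, y_prev=0))
```

The last line shows why the MLRM expression is called complex: the implied value changes with the previous outcome and with the covariates, whereas it is a single constant under the ZLS model.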

Simulation Study
The purpose of the simulation study is to evaluate the robustness of the MLRM to model misspecification when the data are generated by the ZLS model.
Table 6. Effects of the model misspecification on the significance results (α = 0, ρ = 0.0, c = 0.0). Columns: Type I error; empirical LR power of ρ; empirical LR power of β.
Table 7. Effects of the model misspecification on the significance results (α = 0, ρ = 0.0, c = 0.3).
The results of the simulation experiments are as follows. The two models fit almost equally well even though the data are generated by the ZLS model. More accurate Type I error rates are achieved under the misspecified MLRM, especially for smaller sample sizes or smaller tail probabilities. The power of testing the autocorrelation ($\rho$ for the ZLS model, $\lambda$ for the MLRM) is similar. However, the power of testing the treatment effect is slightly lower under the misspecified model (MLRM), although the loss is negligible for moderately large sample sizes. (For details, see Tables 2-5 for goodness-of-fit results and Tables 6-9 for hypothesis testing results.)
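The logic of such a misspecification experiment can be sketched in a deliberately simplified setting. This is not the design used for the tables: it assumes no covariates and a common initial probability, with data generated from the ZLS transition form p1 + rho (y_prev - p1). In that setting an MLRM with only an intercept and the previous outcome is saturated for the 2 x 2 transition table, so the likelihood ratio (LR) test of no dependence on the previous outcome reduces to the G statistic of that table. All names and settings are hypothetical.

```python
import math
import random

CHI2_1DF_95 = 3.841458820694124  # 95th percentile of chi-square, 1 df

def simulate_transitions(n, T, p1, rho, rng):
    """Pool the 2 x 2 table of (previous, current) counts over n subjects
    whose series follow the assumed ZLS transition form."""
    counts = [[0, 0], [0, 0]]
    for _subject in range(n):
        prev = 1 if rng.random() < p1 else 0
        for _t in range(1, T):
            p_t = p1 + rho * (prev - p1)
            cur = 1 if rng.random() < p_t else 0
            counts[prev][cur] += 1
            prev = cur
    return counts

def lr_statistic(counts):
    """LR (G) statistic for no dependence on the previous outcome."""
    total = sum(map(sum, counts))
    row = [sum(r) for r in counts]
    col = [counts[0][b] + counts[1][b] for b in (0, 1)]
    g = 0.0
    for a in (0, 1):
        for b in (0, 1):
            n_ab = counts[a][b]
            if n_ab > 0:  # a zero cell contributes nothing
                g += n_ab * math.log(n_ab * total / (row[a] * col[b]))
    return 2.0 * g

def rejection_rate(n, T, p1, rho, reps, seed=0):
    """Share of replicates in which the LR test rejects at the 5% level."""
    rng = random.Random(seed)
    hits = sum(
        lr_statistic(simulate_transitions(n, T, p1, rho, rng)) > CHI2_1DF_95
        for _ in range(reps)
    )
    return hits / reps

print("empirical Type I error:", rejection_rate(50, 4, 0.4, 0.0, 400))
print("empirical power:       ", rejection_rate(50, 4, 0.4, 0.5, 400))
```

With rho = 0 the rejection rate estimates the Type I error of the logistic-type test on ZLS-generated data, and with rho > 0 it estimates the power. Extending the sketch to covariates and a treatment effect would require fitting both models by maximum likelihood.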

Conclusions
In comparing the MLRM and the ZLS model, five features are considered: generalizability, interpretability, handling of incomplete data, software availability, and computability. The MLRM is preferable in terms of generalizability and software availability. The effect of misspecifying the model when the ZLS model is true is negligible in terms of both the conservativeness and the power of the test.
There are several areas for future research. First, one must explore a more effective model that is interpretable, generalizable, and better fitting. Second, one needs to develop a methodology that allows for a study with information missing by design and to assess its relative efficiency. Third, a model that allows a variable with multiple response categories may be developed. Finally, a more efficient algorithm for statistical inference and related computer software should be developed after performing more extensive comparative studies among the previous models.