Section 4: Reasons for Making the Transition to PROC MIXED

Why should you apply ANOVA or ANCOVA with PROC MIXED instead of GLM? One of the most striking differences between PROCs GLM and MIXED is observed for repeated measures analysis of variance models. Repeated measures data imply that more than one observation has been collected from each subject (i.e., the data are clustered); thus, one of the key ANOVA assumptions, independence of observations, no longer applies. PROC MIXED works very well for this common data analysis problem, as will be demonstrated in later sections. This section describes the distinctive features PROC MIXED offers for the analysis of repeated measures data. These reasons alone should have you seriously considering whether PROC GLM is still a suitable choice for this type of data analysis problem.

A Brief Review of the GLM Repeated Measures Model in SAS and SPSS

Here are typical GLM statements in SAS and SPSS for the analysis of repeated measures with one between-subjects factor, group (e.g., male, female), and one within-subjects factor, time (e.g., 3 data values collected over the same time intervals for each subject):

* SAS repeated measures with PROC GLM ;
PROC GLM;
  CLASS group;
  MODEL y1 y2 y3 = group / NOuni;
  REPEATED time 3 contrast / printE summary;
RUN;

* SPSS repeated measures.
GLM y1 y2 y3 BY group
  / WSFACTOR = time 3 simple
  / CONTRAST (group)=simple
  / METHOD = SSTYPE(3)
  / PLOT = PROFILE( time * group)
  / CRITERIA = ALPHA(.05)
  / WSDESIGN = time
  / DESIGN = group .

Both SAS and SPSS require the data to be placed in multivariate format (that is, a wide layout) for these procedures: all data collected over time for each subject (i.e., y1, y2, and y3) are placed in one row. These procedures also assume every subject has complete data for the three repeated observations; any subject with one or more missing values will be completely omitted from the calculations.
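To make the wide layout and its complete-case requirement concrete, here is a minimal sketch in Python (used only as a neutral scripting language; it is not part of the SAS or SPSS workflow). The subject IDs and data values are hypothetical. It mimics the listwise deletion described above and compares it with a long layout (one row per observed measurement), which keeps every observed value.

```python
# Hypothetical wide-format data: one row per subject, None marks a missing value.
wide = [
    {"id": 1, "group": "male",   "y1": 10.0, "y2": 12.0, "y3": 13.0},
    {"id": 2, "group": "female", "y1": 11.0, "y2": None, "y3": 14.0},
    {"id": 3, "group": "male",   "y1": 9.0,  "y2": 10.0, "y3": None},
    {"id": 4, "group": "female", "y1": 12.0, "y2": 13.0, "y3": 15.0},
]
TIMES = ("y1", "y2", "y3")


def complete_cases(rows):
    """Keep only subjects with all three repeated measurements (listwise deletion)."""
    return [r for r in rows if all(r[t] is not None for t in TIMES)]


def to_long(rows):
    """Reshape to one record per observed measurement: id, group, time, y."""
    return [
        {"id": r["id"], "group": r["group"], "time": k + 1, "y": r[t]}
        for r in rows
        for k, t in enumerate(TIMES)
        if r[t] is not None
    ]


kept = complete_cases(wide)
long_rows = to_long(wide)
print(len(kept), "of", len(wide), "subjects survive listwise deletion")
print(len(TIMES) * len(kept), "values are used after listwise deletion")
print(len(long_rows), "observed values are available in the long layout")
```

With these made-up values, two of the four subjects are dropped under listwise deletion (6 values used), even though 10 of the 12 possible values were actually observed.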
The sphericity condition for the within-subjects factor, time, is tested when the factor has 3 or more levels (see the description of sphericity and what it implies below). The printE option on the REPEATED statement in SAS invokes this test.

Advantages of the MIXED model approach for repeated measurements

1. Analyze all the data

With the GLM approach to repeated measurements, only one value of each response variable can be analyzed for each subject at each time point. That is, for the GLM examples above y1 represents one value collected at time 1, y2 one value at time 2, and y3 one value at time 3 (note that any interval of time, with equal or unequal spacing, can be accommodated). Suppose that at each time point two or more measurements on the response variable are actually collected. Collecting multiple values over several trials for each response variable at each time point is a common feature of studies. When applying GLM to these types of repeated measurements, the natural inclination has been to compute the mean (or another suitable summary statistic) and enter that value as the dependent variable in the analysis, thus removing an important source of variability: the within-subject variance for each variable at each time point. As will be emphasized in subsequent sections, analyzing summary statistics is usually not necessary when utilizing PROC MIXED; you can actually analyze all the data.

2. Work with missing data

One of the real advantages of computing ANOVAs with PROC MIXED is that, under the underlying theory, the presence of missing values is not as severe a problem as it can be with PROC GLM. This is especially true for repeated measurements collected from each subject over time; PROC GLM automatically removes all the data for any subject that does not have complete data. In PROC MIXED, the data are assumed to be missing at random (MAR) or missing completely at random (MCAR).
The benefit of this assumption is that with a repeated measures problem (e.g., longitudinal data) you can still analyze data from all the subjects, even if some have missing data (assumed MAR or MCAR). Some potential estimation problems remain with missing data in a mixed model, but they are generally less likely to cause trouble than the complete deletion of entire observations (i.e., subjects) and the other problems found with PROC GLM. In fact, in general, a repeated measures analysis with some missing data is more robust when done with PROC MIXED than with PROC GLM. You can read how SAS works with missing data in general, along with more detailed definitions of these terms, at: http://www.uoregon.edu/~robinh/missing_data.txt

3. The sphericity assumption is not a restriction

Whatever repeated measures design you may be facing, it is important to understand what the sphericity assumption implies and how to deal with it when working with correlated data. If you only have two measurements (e.g., pre-test and post-test data collected at two time points or under two treatment conditions), the sphericity assumption does not apply. With three or more data points from each subject, the sphericity assumption must be considered under the GLM framework. The GLM approach places rather severe restrictions on the structure of the variance/covariance matrix of the within-subject effects. Violations of the sphericity assumption distort the significance tests (i.e., the p-values are smaller than they should be), and the distortion grows with the severity of the departure from sphericity. One common choice for testing sphericity is the Mauchly test from 1940 (available in PROC GLM and SPSS repeated measures). However, it has been shown to have low power for small sample sizes and to be sensitive to departures from normality and to outliers ("Statistical Methods for the Analysis of Repeated Measurements", Chapter 5, p. 110, by Charles Davis).
The sphericity assumption essentially comes from the requirement that all pairwise differences in means have the same variance:

   VAR(xim - xjm) = VAR(xim) + VAR(xjm) - 2*COV(xim,xjm)
                  = VAR(xim) + VAR(xjm) - 2*CORR(xim,xjm)*STD(xim)*STD(xjm)
                  = constant for all values of i and j, i NE j

The following 3x3 covariance matrix is "spherical"

    _            _
   |  8   3   5   |
   |  3  10   6   |
   |  5   6  14   |
    -            -

because the three comparisons of two means have the same variance:

   VAR(x1-x2) =  8 + 10 - 2*3 = 12
   VAR(x1-x3) =  8 + 14 - 2*5 = 12
   VAR(x2-x3) = 10 + 14 - 2*6 = 12

Under compound symmetry, the variances must be equal and the covariances are constrained to be equal as well, so the variance of a difference is:

   VAR(xim - xjm) = VAR(xim) + VAR(xjm) - 2*CORR(xim,xjm)*VAR(xm)
                  = 2*VAR(xm) * (1 - CORR(xim,xjm))

This formula for the variance of a difference is crucial for understanding how MIXED models work with repeated measurements. The larger the correlation between the two data values, the smaller the variance of the difference in means.

Compound symmetry (CS) also implies that all pairwise differences of the means for the within-subject treatments have identical variances. However, this definition requires that all variances have the same value and that all covariances have the same value (which will be smaller than the variance). The sphericity assumption first demonstrated is less restrictive: the variances can differ, provided the covariances get larger or smaller accordingly so that the variance of every pairwise difference remains constant. Both choices are available on the REPEATED statement in PROC MIXED with the options TYPE=cs or TYPE=hf, as demonstrated in Chapter 7b. However, even with complete data where all subjects are measured on multiple occasions on the same dependent variable, a repeated measures analysis with PROC GLM may not be appropriate.
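The arithmetic for the 3x3 spherical matrix in this section, and the compound symmetry formula 2*VAR*(1-CORR), can be checked with a short Python sketch (plain Python, no SAS required). The compound symmetry matrix built below, with variance 10 and correlation 0.4, is a made-up example and does not come from the text.

```python
from itertools import combinations


def diff_variances(cov):
    """Return VAR(xi - xj) = VAR(xi) + VAR(xj) - 2*COV(xi,xj) for every pair i < j."""
    k = len(cov)
    return {
        (i + 1, j + 1): cov[i][i] + cov[j][j] - 2 * cov[i][j]
        for i, j in combinations(range(k), 2)
    }


def is_spherical(cov, tol=1e-9):
    """True when all pairwise difference variances are equal (within tol)."""
    values = list(diff_variances(cov).values())
    return max(values) - min(values) <= tol


# The 3x3 matrix from the text: unequal variances and covariances, yet spherical.
spherical = [[8, 3, 5], [3, 10, 6], [5, 6, 14]]

# Hypothetical compound symmetry matrix: variance 10, correlation 0.4.
var, corr = 10.0, 0.4
cs = [[var if i == j else corr * var for j in range(3)] for i in range(3)]

print(diff_variances(spherical))   # every pair should give 12
print(diff_variances(cs))          # every pair should match 2*var*(1-corr)
print(is_spherical(spherical), is_spherical(cs))
```

Both matrices pass the check, which illustrates the point made above: compound symmetry is one special case of sphericity, while the matrix from the text is spherical without having equal variances or equal covariances.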
MIXED models have the capability to work with the assumptions concerning the structure of the data, in particular the variance/covariance matrix of the errors. The GLM procedures in both SAS and SPSS for the analysis of repeated measurements allow for three types of within-subject correlation structures, namely, independence (as in ordinary regression or ANOVA, and most likely not appropriate here), sphericity (repeated measures), or unstructured (MANOVA). These very restrictive choices may not be optimal for data analysis with correlated errors.

For repeated measures data (three or more observations collected over time on the same variable), the assumption of sphericity is often not met, especially as the number of repeated observations increases beyond three. Mauchly's test (1940), a likelihood ratio test (see Rencher, Chapter 7.2.2), is printed in the output from both the SAS and SPSS GLM procedures to assess the presence of significant deviations from sphericity. This test has been described as being of little practical value [Cornell (1992) JES, 17, 233-249 and Davis, "Statistical Methods for the Analysis of Repeated Measurements", Chapter 5]. It has low power for small sample sizes, and for larger numbers of subjects it may indicate a violation of sphericity that is not necessarily present. It is also overly sensitive to departures from normality. If a violation occurs, then choose one of the corrected p-values. The Greenhouse-Geisser (1958) correction assumes that the maximum violation has occurred and is the most conservative test available. Huynh & Feldt (1976) showed that estimating "epsilon" from the sample data gives a correction that is not as conservative as the Greenhouse-Geisser correction.

Another feature of the SPSS output for repeated measures is the potentially misleading presence of both multivariate tests and F-tests based on sphericity or adjusted by the estimated epsilons. The output begins with multivariate tests that assume an unstructured covariance matrix.
Also given are tests of the "within-factors" that are based on a correlation matrix that assumes compound symmetry (essentially, equal correlations for any combination of the within factors). If the EMMEANS table is requested and comparisons of means are calculated, these are made with the unstructured variance/covariance matrix, which can have very different implications than tests of means computed with a compound symmetry matrix. The advantage of PROC MIXED is that all tests, from beginning to end, are computed with your choice of variance/covariance matrix. I plan to enter an example of what this means in the near future.

The descriptions of statistics based on compound symmetry in textbooks, and their ongoing application to repeated measures data analysis, can largely be attributed to historical reasons: they are the methods many researchers have been taught, and since they have always done it this way, why consider anything else? As a result, there is a large body of published literature and research findings based on this approach to repeated measures. However, for most applications it is now possible to toss these statistics into the historical archives; today they are of only limited value and applicability for most statistical analyses of this type of data.

MIXED models now available in both SAS and SPSS have a variety of correlation structures to choose from. Section 7 covers this type of repeated measures ANOVA in more detail. Correlation structures are further described in Section 8.
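As a closing illustration of the "epsilon" behind the corrected p-values discussed earlier in this section, the following Python sketch computes Box's epsilon from a covariance matrix by double-centering it. This is a sketch of the standard textbook formula, not the code SAS or SPSS runs; in practice the Greenhouse-Geisser estimate applies the same formula to the sample covariance matrix. Epsilon equals 1 when sphericity holds and cannot fall below 1/(k-1) for k repeated measurements. The second matrix, with correlations that decay as time points get farther apart, is hypothetical.

```python
def box_epsilon(cov):
    """Box's epsilon for a symmetric k x k covariance matrix.

    epsilon = trace(S*)^2 / ((k - 1) * sum of squared elements of S*),
    where S* is the double-centered covariance matrix.
    """
    k = len(cov)
    # The matrix is symmetric, so row means double as column means.
    row_means = [sum(row) / k for row in cov]
    grand_mean = sum(row_means) / k
    centered = [
        [cov[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(k)]
        for i in range(k)
    ]
    trace = sum(centered[i][i] for i in range(k))
    sum_sq = sum(v * v for row in centered for v in row)
    return trace ** 2 / ((k - 1) * sum_sq)


# Spherical matrix from the sphericity discussion in this section.
spherical = [[8, 3, 5], [3, 10, 6], [5, 6, 14]]

# Hypothetical matrix: equal variances, correlation 0.8 at lag 1 and 0.64 at lag 2.
decaying = [[10.0, 8.0, 6.4], [8.0, 10.0, 8.0], [6.4, 8.0, 10.0]]

eps_spherical = box_epsilon(spherical)
eps_decaying = box_epsilon(decaying)
print("spherical matrix:", round(eps_spherical, 4))
print("decaying-correlation matrix:", round(eps_decaying, 4))
```

The spherical matrix gives an epsilon of 1, so no adjustment would be needed, while the decaying-correlation matrix gives a value below 1, which is the situation in which the epsilon-adjusted F-tests shrink the degrees of freedom.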