6. Regression Models for Dichotomous Data For data collected with only two possible response values such as yes/no, success/failure, alive/dead. Events measured with two distinct values are called binary or dichotomous data. Under the assumption of independence the individual outcomes are called Bernoulli trials. A flip of a coin is a familiar example. When tossed in the air, spinning end over end, it can land on a level surface only as heads or tails (landing on its edge on a flat, hard surface is considered impossible). Since a "fair" coin is balanced, heads or tails are equally likely to occur. Do not consider a recoding of measurements that are considered ordinal, interval, or continuous into two levels as binary data. If you have numerical data that meet the assumptions of another model, you should apply that model. There are situations when recoding data with several levels or even continuous data into two (or more) levels might be reasonable; however, the responses considered in this section will be considered strictly dichotomous, collected independently across all trials. This means observing a "success" on any given trial does not influence the probability of a "success" on any subsequent trial, either within or across groups. The experimental procedure to collect binary data has the same concepts as designing a stody to collect continuous data. The type of study considered here is called prospective, that is you first randomly assign subjects to groups (e.g., treatment(s) and control) or randomly sample subjects from existing groups (female or male). You then perform the experiment or observe the behavior and record which of the two levels of the dichotomous outcome were observed for each subject. How binary data are recorded on paper or in a dataset is arbitrary. For reasons that will soon become apparent during data analysis, coding the two response levels as 0/1 may be the most logical and informative choice. These two values work very nicely when making summary computations. Note that the existence of a binary outcome does not automatically mean you have data which can be analyzed with the methods explained in this section. It only means that you have a two level response variable and which can be modeled in a variety of ways. If the data are collected in clusters (i.e., not truly independent) they should be modeled with random effects or as correlated responses. Other procedures to handle these situations will be discussed in subsequent sections. When binary data are collected from a homogeneous collection of subjects randomly divided into two or more groups, each trial is the realization of a Bernoulli random variable. If n subjects are randomly assigned to Group 1 then for each subject the experimenter will observe: PROB(response = 1) = p1 PROB(response = 0) = q1 = 1-p1 where 0 < p1 < 1 and p1+q1=1. The reason for assigning the response equal to 1 with probability p is a feature of the analysis techniques to be discussed here. Under these conditions, the aggregated data (that is, y is defined as the number of responses equal to 1 in the n trials) follow the binomial distribution, y~BINOMIAL(n,p). The data can be summarized with a phrase such as "y_1 successes were observed in n_1 trials for Group 1". Similar statements for the other groups can be made. Binary data of this nature is most commonly analyzed under the name logistic regression. Another name is probit regression. I'll eventually explain the difference, although you may find that logistic regression offers you the most intuitive approach. Before looking at the analysis of binary data, review the two important assumptions of a series of n Bernoulli trials (n = 1, 2, 3, ... ): * the desired response occurs with a constant probability p (0 ChiSq Intercept 1 -0.6931 0.7071 -2.0791 0.6928 0.96 0.3270 gender F 1 1.3863 1.1180 -0.8050 3.5776 1.54 0.2150 gender M 0 0.0000 0.0000 0.0000 0.0000 . . Scale 0 1.0000 0.0000 1.0000 1.0000 NOTE: The scale parameter was held fixed. In the Parameter Estimates table, the entries under the column labeled "Estimate" are based on the choice of logit as the link on the MODEL statement. A logit is a transformation of a proportion, a number bounded between 0 and 1: logit(p) = LN(p/(1-p)) For any value of p between 0 and 1, the range for logit(p) is the entire real number line: -infinity < logit(p) < +infinity The logit is the natural log transformation of the odds=(p/(1-p)) and spans the entire real number line going to negative infinity as p -> 0 and positive infinity as p -> 1 (it is undefined for p=0 or p=1). If you choose a different link, the parameter estimates will look very different, though in many situations the conclusions will be similar. The logit is a linear function of the explanatory data (continuous or categorical): logit(p) = a + b*X How to work backwards with the computed logits to determine probabilities and odds ratios will be described below. LR Statistics For Type 3 Analysis Chi- Source DF Square Pr > ChiSq gender 1 1.63 0.2014 In this example, the LR Statistic is the same result that you observe with the "Likelihood Ratio Chi-Square" test of independence with these data entered into PROC FREQ (since it is a 2x2 table of independent counts). Contrast Estimate Results Standard Chi- Pr > Label Estimate Error Confidence Limits Square ChiSq logit for Females 0.6931 0.8660 -1.0042 2.3905 0.64 0.4235 Exp(logit for Females) 2.0000 1.7321 0.3663 10.9192 logit for Males -0.6931 0.7071 -2.0791 0.6928 0.96 0.3270 Exp(logit for Males) 0.5000 0.3536 0.1250 1.9992 Females vs Males 1.3863 1.1180 -0.8050 3.5776 1.54 0.2150 Exp(Females vs Males) 4.0000 4.4721 0.4471 35.7876 The coefficients in the Parameter Estimate Table compute the log(odds) for each level of gender. To compute the odds, exponentiate the sum of the intercept and the coefficient for the respective value of gender listed in the table. The sum of these coefficients relevant to each group is called the logit. For women the odds is EXP(-.6931 + 1.38) = 2. Rather than summarizing data with the odds for each group, you can utilize them to derive the probability of a "Yes" for women: p1=Prob(Response=Yes|Women) = EXP(logit)/(1+EXP(logit)) = odds / (1+odds) = 2/(1+2) = 0.667 For men, the logit is the exponent of the intercept, since GLM coding assigns a 0 as the reference value for males. Therefore, odds of a "yes" is EXP(-.6931)=0.5. p2= Prob(Response=Yes|Men) = odds / (1+odds) = 0.5 /(1+0.5) = 0.333 Now compute the odds ratio from the estimates of p1 and p2: OR = (p1/(1-p1)) / (p2/(1-p2)) = (0.667/(1-0.667)) / (0.333/(1-0.333)) = 2 / .5 = 4 One more table to observe: Contrast Results Chi- Contrast DF Square Pr > ChiSq Type Females vs Males 1 1.54 0.214998 Wald Females vs Males 1 1.63 0.201389 LR The default test from the CONTRAST statement is the profile likelihood ratio test (indicated with LR under Type). You may specify results from CONTRAST statements to be Wald tests by adding wald as an option on this statement as indicated on the first CONTRAST statement. The test for each individual contrast is identical to the p-value returned in the table of parameter estimates. These two tests are based on specific theoretical approaches which are asymptotically equivalent, yet may give very different results often observed with relatively small sample sizes. The Wald test will likely compute a larger p-value than the corresponding test with the LR, especially when the logit coefficient is large. Also, printed in the log window is a reminder: NOTE: PROC GENMOD is modeling the probability that response='1'. How does GENMOD compare with PROC LOGISTIC? PROC LOGISTIC is another SAS procedure whose primary purpose is to compute logistic regression analyses. Its associated statements look nearly the same as those just presented for GENMOD: PROC LOGISTIC DATA=one descending; CLASS gender / param=glm; MODEL response = gender / EXPB ; CONTRAST 'LOGIT for Females ' intercept 1 gender 1 0 / estimate ; CONTRAST 'LOGIT for Males' intercept 1 gender 0 1 / estimate; CONTRAST 'LOGIT dif: Females vs Males' gender 1 -1 / estimate; TITLE1 "Logistic Regression with LOGISTIC"; RUN; A few major differences in the syntax commands are noted. * LOGISTIC now has a CLASS statement (prior to version 8 it could only work with continuous or dummy coded explanatory data and you will find references that incorrectly state it continues to work that way); however, the default for the CLASS variables is called "effect" coding (-1,1) like that found in PROC CATMOD. To get the equivalent of GENMOD coding (i.e., GLM) enter the param=glm option. * Computation of odds ratios is requested as a MODEL statement option. * PROC LOGISTIC does not have an ESTIMATE statement; instead, the estimates of the specified logits entered with the coefficients are computed as a CONTRAST statement option. Two features of PROC LOGISTIC are not available with PROC GENMOD: exact computations and ROC analysis. An example of whe exact logistic regression is necessary is with small or zero cell counts when the MLE estimation assumptions are not met. The example dataset is an illustration of when exact tests may be more appropriate when small numbers are inevitable. In this example, gender was dummy coded to be equal to 1 for Females and 0 for Males (what GLM coding implies): PROC LOGISTIC DATA=one descending; MODEL response = gnd ; EXACT gnd / estimate=both; run; The asymptotic estimates (not shown) are also printed. However, the exact results offer added insight into the data: Exact Conditional Analysis Conditional Exact Tests ----- p-Value ----- Effect Test Statistic Exact Mid gnd Score 1.5000 0.314685 0.216783 Probability 0.1958 0.314685 0.216783 Exact Parameter Estimates 95% Confidence Parameter Estimate Limits p-Value gnd 1.2863 -1.1975 4.1589 0.461538 Exact Odds Ratios 95% Confidence Parameter Estimate Limits p-Value gnd 3.619 0.302 64.004 0.461538 ^^^^^ ^^^^^^ Since the example is based on a small sample size, the confidence interval on the odds ratio is much wider than the asymptotic result (which was 0.4471 to 35.7876). An Alternative Method of Entering Response Data Since the example in this section contains one categorical explanatory variable (gender), the events/trials notation provides an intuitively simpler and more compact way to work with binary data. It provides an easier way for both PROCs GENMOD or LOGISTIC to analyze them. Given the individual responses, summary counts of the number of successes and total trials can easily be produced with PROC MEANS: PROC SORT DATA=one; BY gender response; RUN; PROC MEANS DATA=one Noprint; BY gender response; VAR response; OUTPUT OUT=onea(DROP=_type_ _freq_ ) SUM=y N=n; RUN; In the next example the summary data produced by PROC MEANS are utilized. The response variable y is the number of "yes" responses (the level of greater interest coded as 1) as in the previous example and n is the total number of trials for each level of gender. Gender y n F 4 6 M 3 9 The events/trials summaries are appropriate if you have multiple trials and they are independent and identically distributed bernoulli events, that is each outcome (0/1) is independent and the probability of a "success" is p. (If you need to account for a covariance structure in the individual trial data, that is, they are not independent, then events/trials is not appropriate.) Also be aware that the descending option on the PROC statement does not work with data expressed as event/trial. It assumes you have coded y as the number of responses for the level of greater interest. PROC GENMOD DATA=onea ORDER=data; CLASS gender; MODEL y/n = gender / dist=binomial link=logit TYPE3 ; CONTRAST 'Females vs Males' gender 1 -1 / wald ; CONTRAST 'Females vs Males' gender 1 -1 ; ESTIMATE 'Females vs Males' gender 1 -1 / exp; TITLE1 "Logistic Regression with GENMOD: events/trials"; OPTIONS LS=78; RUN; Two tables of the output are shown here: Contrast Estimate Results Standard Chi- Pr > Label Estimate Error Confidence Limits Square ChiSq Females vs Males 1.3863 1.1180 -0.8050 3.5776 1.54 0.2150 Exp(Females vs Males) 4.0000 4.4721 0.4471 35.7876 Contrast Results Chi- Contrast DF Square Pr > ChiSq Type Females vs Males 1 1.54 0.2150 Wald Females vs Males 1 1.63 0.2014 LR The results are exactly the same as computed with the individual response data. PROBIT Regression Probit regression can be modeled as a class of generalized linear models in which the response probability function is binomial and the link function is probit. The probit function is another transformation of the computed proportion based on the normal distribution which maps p contained in the interval (0,1) into the entire real line. That is, you convert p into a Z-value (the cumulative standard normal distribution, N(0,1)). For example, * if p=.025, its probit equals -1.96 * if p=.975, its probit equals 1.96 These two numbers should be familiar to you from hypothesis testing based on the N(0,1) distribution call Z or the values of t as the sample size becomes very large. Both PROC GENMOD and LOGISTIC fit a probit model by adding the link=probit option on the MODEL statement: PROC GENMOD DATA=onea; CLASS gender; MODEL y/n=gender / dist=binomial link=probit type3; RUN; Interpretation and working backwards to compute probabilities with the coefficients have similar implications as the logit. The logit gives you the advantage of computing the odds and odds ratios directly, whereas with PROBIT it is a long process to determine then. Also, the logit transformation is more sensitive with rare or common events. If the computed p's are mostly between .3 and .7, logit and probit will probably give you similar conclusions. If the events are rare or common, then they may differ. Two Important Points 1. The MODEL Statement NOint Option The MODEL statement for both PROCs GENMOD and LOGISTIC allows the NOint option to suppress the intercept. Unless you have a specific reason for doing so, it is usually best to avoid entering it for the type3 tests will change considerably and you should know why. Suppose you take this simple design we have looked at with only gender (female/male) as the independent variable. The comparison of interest is the difference between the two levels of genders. When you fit the model: LOGIT(p) = b0 + b1*(gender=female) then the mean value of the response for males is b0 and the mean value for females is b0 + b1. The difference in logits between the two levels is the parameter b1 and the type3 tests indicate whether b1 = b0. As shown above, you can test the significance of the difference between the two levels through a LR test (CONTRAST statement) or a Wald test constructed as from the parameter estimate and standard error [i.e., Wald_chi_sq=(b1_hat/SE(b1_hat))**2. The model can also be parameterized as LOGIT(p) = mu1*(Gender=Male) + mu2*(Gender=Female) In this parameterization there is no intercept term. However, from the first model we know that mu1=b0 and mu2=b0+b1. If the mean logit for males is not zero, the tests for both mu1 and mu2 are likely to be significant. With logistic regression models it is a rare circumstance where the mean logit is expected to be zero. Here, we are not talking about a difference of means from each other, but just the difference of the estimate for each logit from 0. When you have a test where the null hypothesis is mu1=0 and mu2=0, then you will almost surely reject that null hypothesis in favor of the alternative hypothesis that mu1^=0 or mu2^=0, which in fact is what the Type3 tests would do in this case. This is what happens when you remove the intercept term from the model. The option NOint constructs tests based on null hypotheses that logit(p1)=0 and logit(p2)=0. This is equivalent to testing whether p=0.5 in each group, since LOGIT{p) = 0= LN(1) = LN(p/(1-p)) and p/(1-p) = 1 implies that p=0.5 Any reasonably large dataset will almost always reject such tests. Before running such a model, ask yourself "Am I really interested in testing whether the response probability in each group is 0.5?" One item that is the same with both parameterizations is that comparisons made with CONTRAST and ESTIMATE statements work the same in either situation, since they focus on the difference in estimates which remains constant with either approach. 2. Beware of quasi-complete separation! DATA ONEA; INPUT gender education yes no; TOTAL=YES+NO; CARDS; 1 1 23 14 1 0 0 45 0 1 16 23 0 0 0 38 ; PROC TABULATE DATA =ONEA NOseps; CLASS gender education ; VAR yes no total; TABLE education, gender* (yes no total) *SUM=' '*F=6.0 / rts=12; RUN; ----------------------------------------------------- | | gender | | |-----------------------------------------| | | 0 | 1 | | |--------------------+--------------------| | | yes | no |TOTAL | yes | no |TOTAL | |----------+------+------+------+------+------+------| |education | | | | | | | |0 | 0| 38| 38| 0| 45| 45| |1 | 16| 23| 39| 23| 14| 37| ------------------------------------------------------ PROC GENMOD DATA=onea; CLASS gender education; MODEL yes / total = gender education / dist=binomial link=logit; run; PROC LOGISTIC DATA=onea; CLASS gender education / param=glm; MODEL yes / total = gender education ; run; Model Convergence Status Quasi-complete separation of data points detected. No such message appears with GENMOD in either the LOG or OUTPUT windows. However, results like those printed below should immediately alert you to a problem: the presence of extremely large parameter estimates and standard errors is a clear sign of separation, much as multicollinearity inflates variances in regression analyses. Analysis of Parameter Estimates Standard Wald 95% Chi- Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq Intercept 1 0.4964 0.3390 -0.1679 1.1608 2.14 0.1431 gender 0 1 -0.8593 0.4700 -1.7805 0.0618 3.34 0.0675 gender 1 0 0.0000 0.0000 0.0000 0.0000 . . education 0 1 -28.0967 125991.4 -246967 246910.6 0.00 0.9998 education 1 0 0.0000 0.0000 0.0000 0.0000 . To detect separation with continuous predictors, make a scatter plot to check for overlap of the response across the values of the predictor: DATA abc; INPUT age rsp @@; CARDS; 17 0 21 0 16 0 18 1 19 0 24 0 22 0 27 1 24 1 25 1 29 1 22 1 20 1 ; PROC plot data=abc; PLOT rsp*age=rsp / vaxis=-0.5 to 1.5 by .5 vref=.5 haxis=14 to 30 by 2; options ps=25 ls=72; run; quit; Plot of rsp*age. Symbol is value of rsp. rsp | | 1.0 + 1 1 1 1 1 1 1 | 0.5 +------------------------------------------------------------------ | 0.0 + 0 0 0 0 0 0 0 | | -+-------+-------+-------+-------+-------+-------+-------+-------+- 14 16 18 20 22 24 26 28 30 age If you can draw a _vertical_ line for any value of age that separates all the 1's to one side and all the 0's to the other, the data are completely separated. For example, removing values of age > 21 for rsp=0 and value of age < 21 for rsp=1 gives completely separated data: Plot of rsp*age. Symbol is value of rsp. rsp | | 1.0 + | 1 1 1 1 1 | | 0.5 +------------------------------|---------------------------------- | | 0.0 + 0 0 0 0 0 | | | -+-------+-------+-------+-------+-------+-------+-------+-------+- 14 16 18 20 22 24 26 28 30 age Quasi-complete separation actually means that some non-trivial linear combination of the independent variables will separate the dependent variable's 1's vs. 0's, except possibly for ties at the borderline. A two-dimensional scatter plot (with PROC PLOT) with the response as the plotting symbol can show how complete separation may exist in the context of two variables: x2 | \ 1 1 | 0 \11 1 | 0 \ 1 1 1 | 0 0 \ 1 | 0 0 \ 1 | 0 0 \ | 0\ ----------------------- x1 A straight line was added to show how the responses values are separated in x1*x2 space (i.e. a linear combination of x1 and x2 exists that clearly separates the 1's and 0's into two distinct groups), so in this example the data are completely separated. Also, separation has nothing to do with the correlation of x1 and x2, since here the plot shows the correlation must be near 0. The plot also shows that checking for separation with one independent variable at a time is not enough; notice how both x1 and x2 have overlap of the response variable, yet jointly the responses are completely separated by the line.