Measures of Effect Size for Categorical Data

Risk and Risk Ratios, Odds and Odds Ratios

Categorical data analysis methods are applied to data collected from surveys, experiments, or other studies where the response is a discrete random variable that takes two or more discrete values (nominal or ordinal). The types of data are important to distinguish, since the type of categorical analysis applied depends on what the data represent.

  Binary: data with only two levels, including such responses as yes/no, success/failure, pass/fail, favor/oppose, etc.

  Nominal: discrete data which have multiple values and may fall under such diverse descriptions as race, type of birth control, religious or political affiliation, and many others.

  Ordinal: discrete data where the order of the values matters; for example, Likert scales from surveys where the response ranges from "Strongly Disagree" to "Strongly Agree", or a motion study where, for a given stimulus, the recorded response for each trial is whether the subject 'keeps feet-in-place', 'takes one or more steps', or 'falls'.

Discrete response data expressed as counts can be evaluated as a function of explanatory data which may be discrete, continuous, or a combination of both types (with similar interpretation to linear regression models for continuous data). These data comprise the independent variables of a statistical model, commonly called the design matrix (or the X matrix) in linear regression terminology. The assumptions concerning the explanatory data in linear regression also apply when analyzing categorical data: avoid multi-collinearity and outliers, make sure the distribution of each variable covers the relevant range of interest, and so forth.

This article introduces you to measures of effect size in the analysis of categorical data.
In particular, it compares interpretations of measures commonly found for this type of data:

  * risk and risk ratio (commonly computed as percents)
  * the odds, and the odds ratio

These measures will first be demonstrated with a simple example showing a two-level discrete response variable as a function of one discrete explanatory variable with two levels. In particular, the example demonstrates how to interpret a binary response (such as yes/no) as a function of a person's gender (Female/Male). These techniques can also be applied to a categorical response as a function of continuous explanatory factors; however, interpretation of the model should be easier to grasp when first evaluating categorical explanatory data with two levels.

Assume you choose a random sample of 100 women and 100 men from a well-defined population (cohorts in a prospective study) and ask them a very basic question such as "Have you been examined by a dentist at least once in the past two years?" For reasons that will be apparent later, this question was chosen so that most, but not all, women and men would respond "yes".

An intuitive way to summarize responses of this nature is through a 2x2 table where the counts of the number of 'yes' and 'no' responses are shown for both women and men. The usual convention is to have the columns of the table specify the levels of the response variable (e.g., Yes/No). The two levels of gender (women/men) label the rows.

                  Response
               Yes    No    Total
  Gender
    Women       a      b     a+b
    Men         c      d     c+d

The response 'Yes' is placed in the column to the left of the column for 'No' since it is the level of greater interest for this question. The factor assigned to the rows (Gender) is the predictor variable, which consists of two levels: Female and Male. The level assigned to the first row is assumed to signify the exposure group or the "at risk" cases; in this example it is the level of Gender for which a "Yes" response is of the greatest interest.
"Exposure group" and "at risk" are terms derived from medical studies where the purpose of the analysis is to detect the presence/absence of a disease in a particular group (e.g., women/men, smokers/non-smokers). The values of a, b, c, and d in the table are positive integers that 'count' the number of responses for each cell and for this example are assumed to all be greater than 0. Counts from this 2x2 table could easily be divided to include additional explanatory factors with two or more levels (e.g. race, age, location). How to analyze count data that contain a zero count in a 2x2 table is described in another article (lgst_zero.txt). Pearson chi-square The Pearson chi-square statistic, commonly applied to counts in "rxc" contingency tables, only tests for the independence of the levels from two nominal variables. The resulting p-value only provides a relative indication of the strength of the relationship; it does not indicate the direction. Risk and Relative Risk Risk and Relative Risk are two very simple measures of effect size that provide more information contained in 2x2 tables than just a p-value: both strength and direction. They are applicable in prospective studies only, that is, where the number of persons to be included in a survey or experiment is determined before data are collected. Considering the structure of the table above, we are most interested in evaluating the data in terms of the number of "Yes" responses for Women. Many people will naturally interpret proportion as a means to quantify the chance some event will occur. We naturally think in terms of numbers that range from 0 to 1. In this article risk will be defined as the proportion of "Yes" responses out of the total number at each level of the exposure group (Female/Male). 
  p_1 = the probability of a yes for females = a/(a+b) = (the number of women who said yes) / (all women)
  p_2 = the probability of a yes for males   = c/(c+d) = (the number of men who said yes) / (all men)

These two probabilities are point estimates of the conditional probabilities of a "Yes" response given each level of gender. The structure of the table is important since cohort relative risk assumes the level of the explanatory variable of interest is placed in the first row. Risk then provides a measure of the desired response, placed in either the first column or the second column.

Difference in risk (Dif_Risk) compares the risks (or proportions) across the two rows (groups). It is computed as the column 1 risk for row 1 minus the column 1 risk for row 2:

  Dif_Risk = [ p_1 - p_2 ]

If the two rows represent independent binomial samples, the standard error for the difference between "Yes" responses in the two groups is computed as:

  se(Dif_Risk) = sqrt[ VAR(p_1) + VAR(p_2) ]

  where VAR(p_i) = p_i*(1-p_i)/n_i and n_i is the total for row i (n_1 = a+b, n_2 = c+d)

Exact confidence limits for the column 1, column 2, and overall risks can be computed using the F distribution method described in Collett (1991). Issues involved in constructing exact confidence limits for the difference in risks (or proportions) are discussed in Agresti (1992).

The same comparison can be expressed as a ratio of the two risks. Relative Risk (RR, or the Risk Ratio) is defined as the ratio of the two probabilities of a "Yes" response for Women versus Men:

  RR = (p_1/p_2)

It can be calculated only in cohort studies, that is, where subjects are randomly selected and assigned to the groups.
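The risk difference and its standard error can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name risk_difference is hypothetical, the counts 80/20 and 60/40 are illustrative, and the z-based (Wald) interval returned alongside the standard error is a common large-sample construction that this article does not itself prescribe.

```python
import math

def risk_difference(a, b, c, d, z=1.96):
    """Risk difference p_1 - p_2 for a 2x2 table of counts.

    a, b = Yes/No counts for row 1; c, d = Yes/No counts for row 2.
    Returns (difference, standard error, large-sample interval).
    The interval is an assumption added for illustration.
    """
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    diff = p1 - p2
    # VAR(p_i) = p_i * (1 - p_i) / n_i for independent binomial samples
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, se, (diff - z * se, diff + z * se)

# Illustrative counts: 80 of 100 women and 60 of 100 men answer Yes
diff, se, ci = risk_difference(80, 20, 60, 40)
print(f"Dif_Risk = {diff:.3f}, se = {se:.4f}, "
      f"approx. 95% interval = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The default z = 1.96 corresponds to a two-sided 95% interval; pass a different critical value for other confidence levels.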
A point estimate of RR is the conditional probability of obtaining a 'Yes' for subjects from the first row (Women) divided by the conditional probability of obtaining a 'Yes' for subjects from the second row (Men):

  RR^ = p_1/p_2 = [a/(a+b)] / [c/(c+d)] = [a*(c+d)] / [c*(a+b)]

Relative Risk (RR) measures how much larger or smaller the proportion of 'Yes' responses in the first row is when compared to the proportion of 'Yes' responses in the second row, thus giving the direction of the relationship. It ranges from 0 to +infinity, where

  0.0 < RR < 1.0        indicates a 'negative' association  (p_1 = a/(a+b) < p_2 = c/(c+d))
  RR = 1.0              indicates no association            (p_1 = a/(a+b) = p_2 = c/(c+d))
  1.0 < RR < +infinity  indicates a 'positive' association  (p_1 = a/(a+b) > p_2 = c/(c+d))

The strength of the association is indicated by the distance RR lies from 1.0; the further the Risk Ratio deviates from 1.0 in either direction, the stronger the association.

  RR =  0          0.5        1.0        1.5        2.0
        +----------+----------+----------+----------+---
        <---- negative        no         positive ---->
              association  association   association

                  Response
               Yes    No    Total   Risk
  Gender
    Women       80    20     100     .8
    Men         60    40     100     .6

For the data above, the Relative Risk is RR = .8/.6 = 1.333. This statistic is interpreted as follows: the probability of observing a 'Yes' from a woman is 1.333 times the probability of observing a 'Yes' from a man. Since the value is greater than 1, it is a positive association: "yes" responses were received more frequently from women than from men.

To obtain an interval estimate of RR, assume the normal approximation to the binomial distribution is valid. The sampling distribution of ln(RR^) more closely follows a normal distribution than RR^ itself.
The approximate variance of the natural log of the estimated relative risk for a 'Yes' response is:

  VAR{ln(RR^)} = b/[a*(a+b)] + d/[c*(c+d)]

A two-sided 100*(1-alpha)% CI for ln(RR) is

  [ c1 = ln(RR^) - z_(alpha/2)*se{ln(RR^)},  c2 = ln(RR^) + z_(alpha/2)*se{ln(RR^)} ]

where se{ln(RR^)} is the square root of VAR{ln(RR^)} and z_(alpha/2) is selected from the Z (standard normal) distribution for a (1-alpha) confidence interval, e.g., 1.96 for a 95% interval. The antilog of each endpoint provides a two-sided CI for RR itself:

  [ exp(c1), exp(c2) ]

Note that this method of estimation is valid only if (a+b)*p_1*(1-p_1) >= 5 and (c+d)*p_2*(1-p_2) >= 5.

Despite its ease of interpretability (Davies, Crombie, & Tavakol, 1998; Laird & Mosteller, 1990), the relative risk (and by extension, the relative hazard) is not ideal for categorical data analysis (Fleiss, 1994) as it has at least two restrictive features.

First, the magnitude of the two individual risks is a necessary component of its interpretation. Small risks give much larger values of RR than risks of moderate size, even though the differences in the risks are nearly the same or identical. For example, the difference between 0.010 and 0.001 is 0.009, which is the same difference that is computed between 0.410 and 0.401; yet the Relative Risk of the first pair of numbers is 10.0 whereas the Relative Risk for the second pair is only 1.022.

Second, if the baseline risk for a yes response is greater than 0.5, it is not possible to double it. The size of the Relative Risk is constrained by the value of the denominator probability, p_2, with 1/p_2 as the upper bound. For example, if p_2 = .5, then the Relative Risk can be no larger than 1/.5 = 2; likewise, if p_2 = .8, then RR can be no larger than 1/.8 = 1.25.

Relative Risk is left over from a time before it was easy to make other types of computations. As some examples will soon demonstrate, RR is usually a much inferior way of summarizing an effect for categorical data.
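The point estimate and log-scale interval for the relative risk described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name relative_risk is hypothetical, and the counts are the 80/20 and 60/40 figures from the Women/Men table earlier in this section.

```python
import math

def relative_risk(a, b, c, d, z=1.96):
    """Relative risk with a log-scale normal-approximation interval.

    a, b = Yes/No counts for row 1; c, d = Yes/No counts for row 2.
    All four counts are assumed to be greater than 0.
    Returns (RR, (lower, upper), rule_of_thumb_ok).
    """
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    rr = p1 / p2
    # VAR{ln(RR^)} = b/[a*(a+b)] + d/[c*(c+d)]
    se_log = math.sqrt(b / (a * n1) + d / (c * n2))
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    # Validity condition quoted in the text: n_i * p_i * (1 - p_i) >= 5
    ok = n1 * p1 * (1 - p1) >= 5 and n2 * p2 * (1 - p2) >= 5
    return rr, (lower, upper), ok

rr, (lower, upper), ok = relative_risk(80, 20, 60, 40)
print(f"RR = {rr:.3f}, approx. 95% CI = ({lower:.3f}, {upper:.3f}), "
      f"normal approximation reasonable: {ok}")
```

Because the interval is built on the log scale and then exponentiated, its endpoints are not equidistant from RR.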
Despite its limitations, Relative Risk still has uses in biomedical applications with relatively rare events (observing the response of interest in only a small percent of the total study population); for example, a physician might prefer to know that the cure rate of a previously difficult-to-treat disease has been significantly increased with a new drug, as opposed to knowing how much larger the odds of cure are for the new drug relative to the odds for the standard drug. Another effect size measure for categorical data that does not have these restrictions is now presented.

The Odds and Odds Ratio

Two statistics for categorical data that compare the effect size of two responses (Yes/No) across two or more groups are the Odds and the Odds Ratio. With counts given for two distinct response categories (e.g., Yes/No), the Odds of a 'Yes' versus a 'No' is the ratio of the number of events ('Yes') to the number of non-events ('No') for each group. For any group, if the odds of a 'Yes' are greater than 1, a 'Yes' is more likely to be observed than a 'No' for that category (the odds of a certain event are infinite); if the odds are less than 1, then observing a 'Yes' is less likely than a 'No' (the odds of an impossible event are zero).

The Odds and Odds Ratio are mathematically convenient to work with, but they are not as easy to interpret as Relative Risk. Odds Ratios near 1.0 indicate weak or nonexistent associations between variables, whereas odds ratios greater than 3.0 (or less than 0.33, in the case of negative associations) represent strong associations between variables (Haddock et al., 1998).

Experts recommend computing the Odds Ratio rather than Relative Risk as a measure of effect size for categorical data (Fleiss, 1994; Haddock, Rindskopf, & Shadish, 1998; Laird & Mosteller, 1990). One important reason is that the Odds Ratio can be estimated from both retrospective and prospective studies.
Intuitive Explanation

The term "Odds" should be familiar from its connection with gambling or with assessing the chances of an individual winning an election, a horse winning a race, or a sports team being crowned champion. You may hear that the odds of a particular horse winning a race are "3 to 1", or that the "odds" of Tiger Woods winning the U.S. Open golf tournament are "4 to 1". These numbers actually mean that the horse or Mr. Woods is more likely to lose than to win. This shows why you need to be careful when you interpret an odds: sometimes it represents the odds in favor of winning, but more often than not, it represents the odds against winning. Usually the context clarifies whether it means winning or losing.

In this case, odds of "4 to 1" mean that if 5 identical tournaments could be played, Tiger would win 1 and lose 4. Players even less likely to win the tournament are given larger odds of, say, "8 to 1", "12 to 1", "20 to 1", or even larger. With this convention, the value of an odds is bounded below by zero but has no theoretical upper bound. When you read that the odds of winning a lottery are 1,000,000 to 1, you know that you would expect to hold a losing ticket about 999,999 times for every 1 ticket that would hit the jackpot.

In medicine and epidemiology, when an event is less likely to happen than not, the odds are represented as a value less than one. So odds of "4 to 1" against an event would be represented by the fraction 1/4. When an event is more likely to happen than not, the odds are given a value greater than 1. So odds of "3 to 1" in favor of an event would be represented simply as 3.

The Odds is the probability that something is true divided by the probability that it is not true. For example, if two-thirds of the women in a population say Yes, then for women the Odds of a Yes versus a No is 0.667/0.333 = 2:1 = 2.
If in the same population one-half of women say Yes and one-half say No, then the Odds of a Yes vs. No is 0.5/0.5 = 1, or 1:1. You also need to be careful when you read the odds interpreted as "the probability of winning is x times the probability of not winning".

Calculating the Odds and Odds Ratio

It's easy to compute odds with counts or with probabilities. It is also easy to convert odds into probabilities and vice versa. With odds of 3/1 in favor (or against), you would expect to see roughly 3 wins (losses) and only 1 loss (win) out of four attempts. In other words, with odds of 3 to 1 in favor, your probability of winning is 0.75. If you expect the probability of winning to be 20%, you would expect to see roughly 1 win and 4 losses out of 5 attempts. In other words, your odds are 4 to 1 against. The formulas for conversion are:

  odds = prob / (1 - prob)
  prob = odds / (1 + odds)

Refer to the table above with a, b, c, and d representing the frequencies of responses in each cell. For example, let a be the number of "Yes" responses for Women (row 1, column 1). For each level of the explanatory factor, the odds is merely the ratio of the number of "Yes" responses to the number of "No" responses. Two values for the odds can be obtained as the ratio of "Yes" versus "No" for each row (i.e., one for Women and one for Men).

  Odds of a 'Yes' for Women = a/b
  Odds of a 'Yes' for Men   = c/d

The counts in each cell are assumed to be non-zero for the following calculations to be feasible.
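The two conversion formulas can be written as a pair of small Python helpers. The function names are hypothetical and chosen only for this sketch; the odds here are always expressed as a single number in favor of the event, so odds of "4 to 1 against" correspond to 0.25.

```python
def odds_from_prob(prob):
    """odds = prob / (1 - prob); undefined (infinite) when prob equals 1."""
    if not 0 <= prob < 1:
        raise ValueError("prob must satisfy 0 <= prob < 1")
    return prob / (1 - prob)

def prob_from_odds(odds):
    """prob = odds / (1 + odds), with odds expressed in favor of the event."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)

# Odds of 3 to 1 in favor correspond to a probability of 0.75
print(prob_from_odds(3))
# A 20% chance of winning corresponds to odds of 1/4, i.e. 4 to 1 against
print(odds_from_prob(0.20))
```

Applying one function to the result of the other returns the starting value, which is a quick check that the two formulas are inverses.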
The Odds of a "Yes" response can also be expressed in terms of the probabilities of a success versus a failure for each level of gender:

  Odds(Yes | women) = Prob(Yes | women) / Prob(No | women)
                    = Prob(Yes | women) / [ 1 - Prob(Yes | women) ]
                    = p_w / (1-p_w)
                    = [a/(a+b)] / [b/(a+b)]
                    = a/b

  where   p_w = Prob(Yes | women) = a/(a+b)
        1-p_w = Prob(No | women)  = b/(a+b)

  Odds(Yes | men) = Prob(Yes | men) / Prob(No | men)
                  = Prob(Yes | men) / [ 1 - Prob(Yes | men) ]
                  = p_m / (1-p_m)
                  = c/d

  where   p_m = Prob(Yes | men) = c/(c+d)

The odds measures the strength of preference for a 'Yes' versus a 'No' at each level of gender.

The odds ratio (OR) compares the magnitudes of the two odds. It is computed as the ratio of the two odds and is usually expressed in the form of a cross-product ratio (because a, d and c, b each sit in diagonally opposite corners of the table):

  OR = [odds for women / odds for men] = (a/b) / (c/d) = (a*d) / (c*b)

Although it may be less intuitive, another way to compute the OR is as a ratio of two ratios of 'conditional' probabilities:

  OR = (a/b) / (c/d) = (a*d) / (c*b) = (a/c) / (b/d)

       [a/(a+b)] / [c/(c+d)]
     = ---------------------
       [b/(a+b)] / [d/(c+d)]

       P(Yes | Women) / P(Yes | Men)
  OR = -----------------------------
       P(No | Women)  / P(No | Men)

Concerning the interpretation of the data, for any 2x2 table where n_ij represents the non-zero count for the cell where the ith row and jth column intersect, the odds ratio is computed as (n_11*n_22)/(n_12*n_21), so you can have the yes-female condition be either n_11 or n_22 and the value of the computed odds ratio will be the same.

The odds ratio computed from a random sample is an estimate of the population parameter, often denoted in the statistical literature by the Greek letter theta. In hypothesis testing, the value of theta under the null hypothesis equals 1 to indicate no association.
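A short Python sketch can confirm that the cross-product form and the ratio-of-conditional-probabilities form give the same value, and that moving the yes-female cell from n_11 to n_22 leaves the odds ratio unchanged. The function names and the counts (80/20 and 60/40) are assumptions made for illustration.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio (a*d)/(c*b); all counts must be positive."""
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be greater than 0")
    return (a * d) / (c * b)

def odds_ratio_from_probs(a, b, c, d):
    """[P(Yes|row 1)/P(Yes|row 2)] / [P(No|row 1)/P(No|row 2)]."""
    n1, n2 = a + b, c + d
    yes_ratio = (a / n1) / (c / n2)
    no_ratio = (b / n1) / (d / n2)
    return yes_ratio / no_ratio

a, b, c, d = 80, 20, 60, 40
print(odds_ratio(a, b, c, d))             # cross-product form
print(odds_ratio_from_probs(a, b, c, d))  # conditional-probability form
# Swapping both rows and columns puts the yes-female count in cell n_22
print(odds_ratio(d, c, b, a))
```

All three printed values agree up to floating-point rounding.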
For example, when the odds of a Yes response in each row for the two levels of gender are identical, the computed odds ratio equals 1. This happens when the two success probabilities, that is, the probabilities of a 'Yes' from females and from males, are equal to each other. Just as was observed with the interpretation of Relative Risk, odds ratios near 1.0 indicate weak or nonexistent associations between variables; the farther from 1 the odds ratio lies in either direction, the stronger the association between gender and the response. Although what counts as a large effect depends on the subject area, odds ratios greater than 3.0 for positive associations, or less than 0.33 for negative associations, represent strong associations between variables (Haddock et al., 1998).

Confidence Interval for the Odds Ratio

Because of the asymmetry of the distribution of the Odds Ratio, a log transformation should be performed before making any statistical inference (such as computing a confidence interval). The distribution of the natural log of the Odds Ratio, ln(Odds_Ratio), is much more nearly symmetric (it is centered at 0 when there is no association) and is approximated reasonably well by the normal distribution.

For the 2x2 table with the cell counts as given above, the Odds Ratio is (a*d)/(c*b). Taking natural logs:

  beta = ln(theta) = ln[(a*d)/(c*b)]

Note that the natural log function, ln(), uses the real number "e" as its base (rather than the more familiar base 10), where "e"^0 = 1, "e"^1 = 2.7182818284..., and so forth. Using the formula for the Odds Ratio, observe that the inverse function, exp(beta), is equal to the Odds Ratio, theta. [This is an important feature to remember when the connection between the odds and the odds ratio is applied in logistic regression.]
The Odds Ratio, OR = exp(beta), is built from the 'odds' for a binary response variable (Yes/No coded as 1/0), defined as the ratio:

  Odds = Prob(yes) / [1 - Prob(yes)]

The logit transformation of the probability of a yes is the natural log of the odds:

  logit(yes) = ln[ Prob(yes) / (1 - Prob(yes)) ] = ln(Odds)

beta is the difference between the logits for the two groups, and exponentiating it gives the Odds Ratio:

  beta  = logit(yes | women) - logit(yes | men) = ln(theta)
  theta = exp(beta) = Odds(yes | women) / Odds(yes | men) = Odds Ratio

An approximate value of the variance for the estimate of the natural log of the odds ratio is:

  VAR[ln(theta)] = [ 1/a + 1/b + 1/c + 1/d ]

This formula is based on the maximum likelihood derivation for the standard error of the ln(odds ratio). The normal distribution with an appropriate value of z (depending on the level of the Type I error, alpha) is now used to create an approximate confidence interval for the ln(Odds Ratio).

A 95% confidence interval for the natural log of the odds ratio, ln(theta), can be obtained by first computing the endpoints of the interval:

  * c1 = lower = ln(theta) - 1.96 * se(ln(theta))
  * c2 = upper = ln(theta) + 1.96 * se(ln(theta))

The lower and upper bounds of the odds ratio in its original units are then computed as:

  [ exp(c1), exp(c2) ]

Because of the nature of the exponential function, the confidence interval will not be symmetric around the odds ratio; the upper bound may lie a considerably greater distance from exp(beta) than the lower bound.

[Note: When small cell sizes exist in the 2x2 table (e.g., any of a, b, c, d equal to 0, 1, or 2), one procedure to follow is to add 0.5 to all cell counts, for consistency across the cells. This is merely for computational purposes and results in biased estimates. A later section will expand the treatment for working with zero counts.]

Summary Computations for the Odds Ratio and a Confidence Interval from a 2x2 Table

Given you have counts from a 2x2 table, here is a brief description of calculating the odds ratio and a confidence interval.
   -------
  | a | b |
  |---+---|
  | c | d |
   -------

Steps in computing a confidence interval:

  Odds Ratio = (a*d) / (c*b)

  LN(Odds Ratio) +/- [1.96*SQRT(1/a + 1/b + 1/c + 1/d)]

where 0.5 is added to all cell counts if any are zero or very small. Exponentiate the logs to find the end points of the confidence interval for the odds ratio itself:

  Lower Bound = EXP{LN(odds ratio) - 1.96*SQRT(1/a + 1/b + 1/c + 1/d)}
  Upper Bound = EXP{LN(odds ratio) + 1.96*SQRT(1/a + 1/b + 1/c + 1/d)}

Some points to consider:

There are slight modifications to the formula for this interval, and your answer may differ from software output because the software may use a different set-up resulting from slightly different assumptions. There is an extensive literature on estimating a common odds ratio from stratified tables, of which a single table is a special case. Formulas for stratified tables can reduce to slightly different estimators when applied to a single stratum (e.g., using N instead of N-1). See, e.g., Hauck WW. The large sample variance of the Mantel-Haenszel estimator of a common odds ratio, Biometrics 1979;35:817-19, and Robins, Greenland, Breslow. "A General Estimator for the Variance of the Mantel-Haenszel Odds Ratio", American Journal of Epidemiology, 124:5:719-723.

When calculating the odds ratio for a 2x2 table, notice that the confidence interval is not symmetric around the estimate. The formula for calculating the confidence interval may at first look as if it should produce a symmetric confidence interval around the odds ratio. However, recall from the above formula that when constructing confidence intervals for an odds ratio, you are working with natural logs.
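The summary computation above can be sketched in Python as follows. The function name odds_ratio_ci and the add_half argument are hypothetical. Because the text does not define how small "very small" is, this sketch adds 0.5 automatically only when a cell is exactly zero and leaves other cases to the caller; that threshold is an assumption, not a rule from this article. The counts are the illustrative 80/20 and 60/40 figures.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96, add_half=None):
    """Odds ratio with a log-scale normal-approximation interval.

    add_half=None adds 0.5 to every cell only if some cell is zero
    (an assumption of this sketch); pass True or False to override.
    """
    cells = [a, b, c, d]
    if add_half is None:
        add_half = min(cells) == 0
    if add_half:
        cells = [x + 0.5 for x in cells]
    a, b, c, d = cells
    odds_ratio = (a * d) / (c * b)
    # se{ln(OR)} = SQRT(1/a + 1/b + 1/c + 1/d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log)
    upper = math.exp(log_or + z * se_log)
    return odds_ratio, (lower, upper)

est, (lower, upper) = odds_ratio_ci(80, 20, 60, 40)
print(f"OR = {est:.3f}, approx. 95% CI = ({lower:.3f}, {upper:.3f})")
# The interval is symmetric on the log scale but not around OR itself
print(f"distance below = {est - lower:.3f}, distance above = {upper - est:.3f}")
```

Other software may report slightly different limits if it applies a continuity correction or a different variance estimator.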
Here is a step-by-step approach:

  Step 1: Convert the odds ratio, OR, to a natural log, ln(OR)
  Step 2: Add and subtract 1.96*SE (or another appropriate z-value) to the ln(OR) (this is the symmetric part)
  Step 3: Exponentiate ln(OR) and the lower and upper limits (LCL, UCL), and observe that the endpoints are no longer symmetric around the OR

If you run your analysis through a computer program that computes logistic regression coefficients, you'll end up with the following in the table (among other things):

  B        SE      exp(B)   Lower Limit                 Upper Limit
  -0.148   .2075   .862     exp(-0.148 - 1.96*.2075)    exp(-0.148 + 1.96*.2075)

The column labeled exp(B) is the OR. Construct the confidence interval for the OR by using the result in column B plus or minus 1.96*SE. When you exponentiate the endpoints, you end up with a confidence interval around exp(B).

Test-based intervals are back-calculated from the value of a Mantel-Haenszel type chi-square statistic for testing the hypothesis that the Odds Ratio = 1. This approach does not require adding 0.5 to empty cells. However, its bias increases as a function of the degree to which the Odds Ratio is not equal to 1. If you have cells with zero counts, the large sample formulations may not be appropriate anyway and you should consider exact confidence intervals available from software such as StatXact from Cytel (Ref. Mehta, Patel, Gray, "Computing an exact confidence interval for the common odds ratio in several 2x2 contingency tables", J Am Stat Assoc 1985;80:969-73).

Comparison of Relative Risk and Odds Ratio

               [ p_w / (1-p_w) ]
  Odds Ratio = -----------------
               [ p_m / (1-p_m) ]

                p_w     (1-p_m)                     (1-p_m)
             = ----- * --------- = Relative Risk * ---------
                p_m     (1-p_w)                     (1-p_w)

As will become evident below, the odds ratio and relative risk are nearly the same quantity whenever the factor (1-p_m)/(1-p_w) is close to 1, as happens when the probabilities of a 'Yes' response are small for both women and men.
It's also important to note that this relationship can be expressed as:

                               1 + c/d
  Relative Risk = Odds Ratio * -------
                               1 + a/b

which indicates that the OR approximates the RR when both a and c are small relative to b and d, respectively. This is called a 'rare' outcome and will be demonstrated with an example.

Interpreting the Odds Ratio

Since the odds ratio is a ratio of finite counts that are all assumed to be greater than 0, the odds ratio itself will be a finite positive number (i.e., 0 < OR < +infinity). Independence of the response and the explanatory factor is equivalent to theta = 1. When 1 < theta < +infinity, Women (in row 1) are more likely to give the first response (Yes) than Men (in row 2). Conversely, when 0 < theta < 1, Men are more likely to give a 'Yes' response. The farther theta lies from 1.0 in either direction, the stronger the level of association.

Example. Here are sample results from the question posed in the earlier section:

  Gender    Yes   No   Total   Risk             Odds
  Female     99    1    100    99/100 = 0.99    99/1 = 99
  Male       96    4    100    96/100 = 0.96    96/4 = 24

  Risk Ratio: the ratio of the two risks for column = Yes:  0.99 / 0.96 = 1.0313
  Odds Ratio: the ratio of the two odds:                    99 / 24 = 4.125

The interpretation here is that the Odds of obtaining a response of 'Yes' from women is 4.125 times the Odds of obtaining a response of 'Yes' from men. Your intuition might lead you to state that this number means "it is 4.125 times as likely" to get a 'Yes' from Women as it is to receive a 'Yes' from Men. For this interpretation to be correct, however, a proportion of 'Yes' responses of 0.99 from women would require the proportion of 'Yes' responses from men to be 0.24. As seen in the table, the actual proportion of 'Yes' responses from men is 0.96.
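The figures in this example, and the identity linking the two measures, can be checked with a brief Python sketch. The helper names are hypothetical; the counts are the 99/1 and 96/4 figures from the table above.

```python
def risk_ratio(a, b, c, d):
    """[a/(a+b)] / [c/(c+d)]"""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """(a*d) / (c*b)"""
    return (a * d) / (c * b)

def risk_ratio_from_odds_ratio(a, b, c, d):
    """Relative Risk = Odds Ratio * (1 + c/d) / (1 + a/b)"""
    return odds_ratio(a, b, c, d) * (1 + c / d) / (1 + a / b)

a, b, c, d = 99, 1, 96, 4  # Female Yes/No, Male Yes/No
rr = risk_ratio(a, b, c, d)
or_ = odds_ratio(a, b, c, d)
print(f"Risk Ratio = {rr:.4f}, Odds Ratio = {or_:.3f}")
print(f"RR recovered from OR = {risk_ratio_from_odds_ratio(a, b, c, d):.4f}")
# Male risk that would be needed for 'Yes' to be OR times as likely for women
print(f"implied male risk = {(a / (a + b)) / or_:.2f}, "
      f"observed male risk = {c / (c + d):.2f}")
```

The last line reproduces the 0.24 versus 0.96 contrast: reading the odds ratio as "times as likely" would require a male risk far below the one observed.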
The confusion between "times as likely" and "times the odds" comes from a misunderstanding of the odds, namely the number of events divided by the number of non-events; i.e., the odds for Women equals 99/1 (= a/b). This calculation is quite different from 'Risk', which is the probability of the event of interest occurring and for Women equals 0.99 = 99/(99+1) = a/(a+b).

If 99% of females in a population are expected to give a 'Yes' response, then on average, for every 100 women polled in a random sample, 99 will respond 'Yes' and 1 will not; the odds of a positive response for women is therefore 99/1 = 99. Similarly, if there is a 96% chance that men in a population respond with a 'Yes', then on average, for every 100 men polled in a random sample, 96 will respond 'Yes' and 4 will not; the odds of a 'Yes' response from men is therefore 96/4 = 24.

If the Odds of obtaining a 'Yes' response from 100 females equals 99 and the Odds of a 'Yes' from 100 males is 24, then the odds ratio is 99/24 = 4.125. In this situation, the interpretation of the odds ratio (OR = 4.125) is very different from the interpretation of the risk ratio (RR = 99/96 = 1.0313).

This example illustrates how 'Odds' (and hence Odds Ratios) are very different from 'Risks' (and hence Risk Ratios) when one works with commonly occurring events (i.e., most Women and Men in the sample are expected to respond 'Yes' to a particular question).

In contrast, suppose we ask the question "Have you been to a foot doctor in the past year?" of the same random sample. Notice that due to the nature of the question, the counts of yes and no responses for both Women and Men are likely to shift to the other extreme. Since a relatively small proportion of the population visits a foot doctor each year, 'Yes', the response of interest, will not be given nearly as often as 'No'. When one works with rare events where the response of interest occurs with rather small frequencies (e.g., the occurrence of a rare disease), then the Odds Ratios and Risk Ratios will be similar. For example,

           Yes   No   Total   Risk     Odds
  Women      1   99    100    .0100    1/99 = .0101
  Men        2   98    100    .0200    2/98 = .0204

  Ratio of the two risks:  .0100 / .0200 = .5
  Ratio of the two odds:   (1/99) / (2/98) = .4949

Comparing a risk of 0.01 (1%) for women with a risk of 0.02 (2%) for men, the 'Risk Ratio' is 0.5. The corresponding odds are 1/99 = .0101 for Women and 2/98 = .0204 for Men, hence the 'Odds Ratio' is theta = (1*98)/(99*2) = 0.4949, nearly the same value as the Risk Ratio.

The odds ratio and risk ratio are both perfectly legitimate measures; they are just different ways of summarizing a table of counts. One purpose of these examples is to show that it is incorrect to interpret odds ratios in the same manner as risk ratios.

The statements concerning 'rare events' are purely arithmetical. Simple consideration of a 2x2 contingency table for case-control studies indicates that the odds ratio will only be a good approximation to the relative risk (risk ratio) for rare events (i.e., when the absolute level of risk is low). This is a statement of mathematical fact, regardless of the source of the data.

For further references on the proper use and interpretation of odds ratios, see the excellent discussion in David Kleinbaum's _Logistic Regression: A Self-Learning Text_. It's oriented to epidemiological subject matter, but the principles apply broadly. It does not have a long list of examples, but it does a good job of laying out the issues.

When you have more than two levels of the explanatory variable, the following example demonstrates how to compare three odds:

  Group   Yes   No   Odds of Yes vs. No
    A      99    1   99 to 1
    B      96    4   96 to 4
    C      24   76   24 to 76

The odds for a yes response in Group A is 99:1 and the odds for a yes response in Group B is 96:4, which reduce to 99/1 = 99 and 96/4 = 24. The ratio of these two odds is 99/24, or 4.125, the odds ratio for A versus B.
The odds for a negative response in Group A is 1:99 and the odds for a negative response in Group B is 4:96, which can be stated as 1/99 and 4/96 respectively. The ratio of these two odds is (1/99) / (4/96) = 0.2424 = 1/4.125, which is the multiplicative inverse of the odds ratio for a positive response. Whether you are interested in reporting 99 and 96 as the number of yes responses, or summarizing 1 and 4 as the number of no responses, the interpretation of the odds ratio remains the same.

If the true counts were A: 99/1 and C: 24/76, then a positive response would be about 4.125 times as likely in Group A as in Group C (0.99 versus 0.24), but the odds ratio is enormously higher:

  Group   Yes   No   Total   Risk   Odds
    A      99    1    100    0.99   99/1  = 99
    B      96    4    100    0.96   96/4  = 24
    C      24   76    100    0.24   24/76 = 0.32

  Ratio of the risks A:C is RR = 0.99 / 0.24 = 4.125
  Ratio of the odds  A:C is OR = 99 / (24/76) = 313.5

  Ratio of the risks B:C is RR = 0.96 / 0.24 = 4
  Ratio of the odds  B:C is OR = 24 / (24/76) = 76

For both comparisons, imagine how people could be misinformed depending on which number was reported to them: that is, if an odds ratio of 313.5 or 76 were read as though it were the relative risk, which is actually 4.125 or 4.

Properties of Risk Ratios, Odds, and Odds Ratios

The value of the odds ratio does not change if all the entries in either a row or a column of the 2x2 cross-tabulation are multiplied by a constant, or if the rows and columns are interchanged with each other (the table is transposed). For example,

  Table 1
  Group    +    -    Risk    Odds    Odds Ratio
    A      8    2    .8      4       4/5 = 0.8
    B      5    1    .833    5

Multiply row 1 of Table 1 by 2:

  Table 2
  Group    +    -    Risk    Odds    Odds Ratio
    A     16    4    .8      4       4/5 = 0.8
    B      5    1    .833    5

Multiply column 2 of Table 1 by 3:

  Table 3
  Group    +    -    Risk    Odds    Odds Ratio
    A      8    6    .571    1.33    1.33/1.67 = 0.8
    B      5    3    .625    1.67

The risk ratio, on the other hand, is only invariant to multiplication of all the entries in a row (assuming rows represent levels of the "independent" variable) by a constant.
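The invariance properties illustrated by Tables 1 to 3 can be checked numerically. The sketch below uses the counts from those tables; the function names are hypothetical and the code is meant only as an illustration.

```python
def risk_ratio(a, b, c, d):
    """[a/(a+b)] / [c/(c+d)]"""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """(a*d) / (c*b)"""
    return (a * d) / (c * b)

tables = {
    "Table 1 (original)":       (8, 2, 5, 1),
    "Table 2 (row 1 times 2)":  (16, 4, 5, 1),
    "Table 3 (col 2 times 3)":  (8, 6, 5, 3),
}

for label, cells in tables.items():
    print(f"{label}: OR = {odds_ratio(*cells):.3f}, "
          f"RR = {risk_ratio(*cells):.3f}")
```

The odds ratio is identical in all three tables, while the risk ratio matches between Tables 1 and 2 (a row was rescaled) but changes in Table 3 (a column was rescaled).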
The risk ratio's sensitivity to column scaling is not a problem if you are using a prospective sampling design (treating the marginals of the independent variable, here Gender, as fixed), but it means that the Risk Ratio cannot be used with a retrospective sampling design (where the marginals of the dependent variable are treated as fixed), because it will vary with the sample size and the proportion of entries in each category of the dependent variable. The odds ratio does not have this problem.

This is an important point to consider when comparing Risk Ratios with Odds Ratios. Risk ratios are more obvious, more 'natural', and more easily conceptualized (particularly by non-statisticians), so why use odds ratios at all? The simple answer is that risk ratios cannot be obtained directly from retrospective 'case-control' studies, whereas odds ratios can. Moreover, in studies where the response is a rare event, the odds ratio is usually very close to what the risk ratio would be, and the majority of responses of interest are 'rare' or 'very rare' in statistical terms. Even very 'common' diseases, like cancer, have quite low incidence and prevalence.

Example

Consider the situation where a large volume of mail is shipped each week. If the customer complaint rate jumps from 0.5% to 1%, what is the best procedure for determining whether the shift is significant? Note that the difference to be detected is extremely small. A contingency table analysis using Pearson's chi-square to compare proportions is difficult or even inappropriate for detecting a difference of 0.5 percentage points. Fisher's exact test could be used to analyze tables of counts for rare events, but it is on the conservative side.

Month   Yes   No   Risk      Odds
June    a     b    a/(a+b)   a/b
July    c     d    c/(c+d)   c/d

Consider using the "risk ratio" or the "odds ratio". The data indicate that the "risk" of a customer complaint has doubled. The odds ratio would be a good choice for quantifying this increase in events that occur rather infrequently ("rare events").
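To make the June/July table concrete, the sketch below fills in a, b, c, and d with hypothetical counts. The volume of 10,000 shipments per month is an assumption chosen only to match the 0.5% and 1% complaint rates; no actual counts are given here. The ratios are taken as July relative to June, so that a value above 1 means an increase.

```python
from fractions import Fraction

# Hypothetical counts (assumed): 10,000 shipments in each month.
june = {"yes": 50, "no": 9950}    # a, b: 0.5% complaint rate
july = {"yes": 100, "no": 9900}   # c, d: 1% complaint rate

def risk(row):
    """Complaints divided by total shipments: a/(a+b) or c/(c+d)."""
    return Fraction(row["yes"], row["yes"] + row["no"])

def odds(row):
    """Complaints divided by non-complaints: a/b or c/d."""
    return Fraction(row["yes"], row["no"])

rr_july_vs_june = risk(july) / risk(june)
or_july_vs_june = odds(july) / odds(june)

print(float(rr_july_vs_june))            # 2.0
print(round(float(or_july_vs_june), 4))  # 2.0101
```

Because complaints are rare in both months, the odds ratio is nearly identical to the risk ratio, so either summary conveys the doubling.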
Of concern here, though, is the correlation between adjacent months. Because June and July are adjacent, their complaint counts could easily be more alike than when either month is compared with January, when larger numbers of returned items (e.g., after Christmas) may account for the complaints. A more meaningful analysis would look at trends over time.

References

The two books by Agresti are excellent starting points.

Agresti A. Categorical data analysis, 2nd Ed. New York: Wiley, 2002.
Agresti A. An introduction to categorical data analysis. New York: Wiley, 1996.
Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, Massachusetts: MIT Press, 1975.
Fleiss JL. Statistical methods for rates and proportions, 2nd Ed. New York: John Wiley, 1981.
Fleiss. In Cooper H, Hedges LV (eds.), The handbook of research synthesis. New York: Russell Sage Foundation, 1994.
Kleinbaum D. Logistic regression: a self-learning text. New York: Springer, 1994.
Somes GW, O'Brien KF. Odds ratio estimators. In Kotz L, Johnson NL (eds.), Encyclopedia of statistical sciences, Vol. 6, pp. 407-410. New York: Wiley, 1988.