Section 10. Rate Models

Poisson regression assumes that the events of interest (the counts) are observed within groups that contain equal numbers of potential subjects, or that can be treated as drawn from populations of "infinite" size for practical purposes. In practice, group sizes are finite and vary according to some index that measures the relative exposure of subjects within the groups. When group sizes vary, the number of observed events also varies, because larger groups are likely to produce more events merely because they are larger. How can one model events that occur in this situation?

One approach is to include a denominator that indicates group size, which produces a rate:

    Rate = events / (number of individuals exposed)

Poisson regression techniques can also be applied to data modeled as rates. A rate is a count divided by a measure of exposure, such as population size, area, or length of exposure time. For example, accidents can be measured as the number of incidents per total miles driven, and defects as the number observed per unit of surface area.

To fit a rate model, you need the original counts (the numerator) and a measure of exposure (the denominator). Poisson regression then fits an equation to the counts that includes the log of the exposure (denominator) as an offset variable. (The mathematical details are summarized below.)

Rates can be examined for how they depend on explanatory factors such as cross-classified variables. As in an ANOVA, the analysis determines the extent to which the rates differ across the levels of a factor and whether there are interactions among the factors.
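The distinction between counts and rates can be sketched with a small DATA step. The data set name (rates), the variable names, and the three groups below are hypothetical and chosen only for illustration: the groups differ greatly in their event counts, yet each has the same rate once the count is divided by its exposure.

```sas
/* Hypothetical data (not from the text): three groups of different size */
DATA rates;
   INPUT group $ events exposure;
   rate        = events / exposure;   /* events per unit of exposure        */
   ln_exposure = LOG(exposure);       /* log exposure, usable as an offset  */
   DATALINES;
A  10  1000
B  50  5000
C 200 20000
;
RUN;

PROC PRINT DATA=rates NOOBS;
   VAR group events exposure rate ln_exposure;
RUN;
```

In this made-up example the counts range from 10 to 200, but every group has a rate of 0.01 events per unit of exposure, which is why the counts alone would be a misleading basis for comparison.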
Rate analysis is typically applied to relatively rare events; that is, the number of observed events is small compared to the size of the population that generated them. Examples include deaths from lung cancer and the prevalence of skin cancer among people who live in different parts of the country. Another application is evaluating the frequency of automobile accidents in cities of different sizes under similar driving conditions.

With rare events, the Poisson approximation to the binomial distribution may be applied to construct an appropriate model. That is, as the binomial sample size n becomes large and the event probability p -> 0 such that mu = np remains constant, the Poisson distribution of a random variable Y with mean mu (= np) is a good approximation to the binomial.

Assume the number of events, y, is the observed value of a random variable Y. The observed rate is

    observed rate = y/nn

where nn is a relevant measure of size. The expected value of the rate is

    E(Y/nn) = mu/nn

A log-linear model is adopted for the true rate:

    log(E(Y/nn)) = log(mu/nn) = alpha + beta*x = eta

where eta is the linear systematic component that contains the effects of the factors that explain the log rate. Equivalently, the expected value of Y, the actual count, is modeled with:

    log(E(Y)) = log(mu) = log(nn) + (alpha + beta*x)

This log-linear model is now in the form of a Poisson regression model. The adjustment term log(nn), which has an implied coefficient of 1, is called the offset. Here nn is the exposure variable, and it is entered as an offset after taking its natural log. It should not appear as a predictor in the MODEL or CLASS statements.
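The role of the offset in the model above can be illustrated numerically. The following DATA step is a minimal sketch: the values of alpha, beta, and x, and the data set name offset_demo, are hypothetical and are not estimated from any data. It evaluates the expected count mu = exp(log(nn) + eta) for several exposure sizes and shows that the implied rate mu/nn does not depend on nn.

```sas
/* Hypothetical coefficients (assumptions for illustration only) */
DATA offset_demo;
   alpha = -4.6;
   beta  =  0.5;
   x     =  1;
   eta   = alpha + beta*x;              /* linear predictor = log rate        */
   DO nn = 100, 1000, 10000;            /* three exposure sizes               */
      mu   = EXP(LOG(nn) + eta);        /* expected count with offset log(nn) */
      rate = mu / nn;                   /* equals EXP(eta) for every nn       */
      OUTPUT;
   END;
RUN;

PROC PRINT DATA=offset_demo NOOBS;
   VAR nn eta mu rate;
RUN;
```

Because the offset enters with a fixed coefficient of 1, the expected count scales with nn while the rate column stays constant at EXP(eta).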
Furthermore, with the canonical link for the Poisson distribution (LOG), the offset is added to the linear predictor on the log scale. If nn itself were entered as the offset, the expectation would be

    mu = EXP(nn + eta) = EXP(nn + (alpha + beta*x))

which is not the rate model. This is why LOG(nn) must be computed and used as the offset, so that

    mu = EXP(LOG(nn) + eta) = nn * EXP(eta) = nn * lambda

where lambda = EXP(eta) is the rate. The expectation increases in proportion to nn, the size of the population.

You must add the offset variable ln_nn = LOG(nn) to the dataset before invoking GENMOD. Thus, first compute the log of the size measure:

    DATA vs;
       SET vs;
       ln_nn = LOG(nn);   /* log exposure, used as the offset */
    RUN;

    PROC GENMOD DATA=vs;
       CLASS group;
       MODEL visit = group / dist=poisson link=log offset=ln_nn;
    RUN;

The examples that follow further demonstrate the computations when working with an offset parameter.

The first example comes from Chapter 6 of "An Introduction to Categorical Data Analysis" (1996) by Alan Agresti, pp. 86-87, which examines motor vehicle accident rates for elderly drivers.

    DATA acc;
       LABEL totk='Thousand Years' acc='Accidents' obs_rate='Observed Rate';
       INPUT gnd $ gender acc totk;
       lg_totk  = log(totk);       /* log exposure, used as the offset */
       obs_rate = acc/totk;        /* observed rate                    */
       lg_rate  = log(acc/totk);   /* observed log rate                */
       CARDS;
    M 2 320 21.4
    F 1 175 17.3
    ;

    PROC PRINT DATA=acc NOobs n Label;
       VAR gnd acc totk obs_rate lg_totk;
    run;

                         Thousand    Observed
    gnd    Accidents      Years        Rate      lg_totk
     M        320          21.4      14.9533     3.06339
     F        175          17.3      10.1156     2.85071

    The ratio of the observed counts: 320/175     = 1.83
    The ratio of the observed rates:  14.95/10.12 = 1.48

Thus, the difference in population size (here, the exposure) is an important consideration when computing a rate.

    PROC GENMOD DATA=acc order=data;
       CLASS gender;
       MODEL acc = gender / dist=poisson link=log offset=lg_totk;
       ESTIMATE 'Compare Men to Women' gender 1 -1 / exp;
       TITLE3 'GENMOD: 4.3.4: Rate Model Example';
    RUN;

                        Analysis Of Parameter Estimates

                                  Standard       Wald 95%          Chi-
    Parameter      DF  Estimate     Error    Confidence Limits    Square   Pr > ChiSq
    Intercept       1    2.3141    0.0756    2.1659    2.4622     937.12     <.0001
    gender    2     1    0.3909    0.0940    0.2066    0.5751      17.28     <.0001
    gender    1     0    0.0000    0.0000    0.0000    0.0000        .         .
    Scale           0    1.0000    0.0000    1.0000    1.0000

    NOTE: The scale parameter was held fixed.

                           Contrast Estimate Results

                                           Standard                        Chi-
    Label                       Estimate     Error    Confidence Limits   Square   Pr>ChiSq
    Compare Men to Women          0.3909    0.0940    0.2066    0.5751     17.28   <0.0001
    Exp(Compare Men to Women)     1.4782    0.1390    1.2295    1.7773
                                  ^^^^^^

The accident rate for men is 1.4782 times the accident rate for women; equivalently, the accident rate for men is 47.8% higher than the women's rate.
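As a cross-check, the estimates for this two-group model can be computed by hand. The sketch below relies on a standard result rather than on anything specific to GENMOD: with a single two-level factor the model is saturated, so the intercept equals the log of the reference group's observed rate, the gender coefficient equals the log of the observed rate ratio, and the large-sample standard error of the log rate ratio is the square root of the sum of the reciprocal counts. The data set name (check) and the variable names are hypothetical.

```sas
/* Hand check of the two-group rate model; counts and exposures from the example */
DATA check;
   acc_m = 320;  totk_m = 21.4;            /* men                               */
   acc_f = 175;  totk_f = 17.3;            /* women (reference group)           */

   rate_m = acc_m / totk_m;
   rate_f = acc_f / totk_f;

   intercept = LOG(rate_f);                /* log rate for the reference group  */
   log_rr    = LOG(rate_m / rate_f);       /* log rate ratio, men vs. women     */
   se_log_rr = SQRT(1/acc_m + 1/acc_f);    /* large-sample standard error       */

   z     = PROBIT(0.975);                  /* normal quantile, about 1.96       */
   lower = log_rr - z*se_log_rr;           /* Wald 95% limits on the log scale  */
   upper = log_rr + z*se_log_rr;

   rr       = EXP(log_rr);                 /* rate ratio                        */
   rr_se    = rr * se_log_rr;              /* delta-method standard error       */
   rr_lower = EXP(lower);
   rr_upper = EXP(upper);
RUN;

PROC PRINT DATA=check NOOBS;
   VAR intercept log_rr se_log_rr lower upper rr rr_se rr_lower rr_upper;
   FORMAT _NUMERIC_ 8.4;
RUN;
```

The printed values should agree, to the displayed precision, with the GENMOD listing: an intercept of about 2.3141, a log rate ratio of about 0.3909 with standard error 0.0940, and a rate ratio of about 1.4782 with limits of roughly 1.2295 and 1.7773.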