Section 7.3 - SAS Numerical Functions A. Collecting Summary Statistics in a DATA Step B. Finding the Median of variables across an observation C. The MOD, FLOOR, and CEIL Functions D. The ROUND, ROUNDE, and ROUNDX Functions E. Variance Stabilizing Transformations F. Random Numbers G. Miscellaneous Numerical Functions A. Collecting Summary Statistics in a Data Step Functions can be applied to data across one observation (i.e., row) to collect several types of summary statistics. These include the sum (SUM), arithmetic mean (MEAN), variance (VAR), standard deviation (STD), median (MEDIAN), minimum value (MIN), and maximum (MAX) values. One could compute these statistics with data specified in arrays and processed with a DO loop or even through brute-force programming. However, these built-in functions do this work for you. For example, to collect a sum across 4 variables within one observation enter: sum_y = SUM(y1,y2,y3,y4); The function N gives the number of variables in the current observation that have non-missing values. n_y = N(y1,y2,y3,y4); The function NMISS gives the number of specified variables in the current observation that have missing values. nmss_y = NMISS(y1,y2,y3,y4); The sum of the two counts computed here, n_y and nmss_y, must add to the number of variables listed in parenthesis, i.e., n_y + nmss_y = 4. For large variable lists, rather than list all the variable names, if they have been coded with numerical indices, here is a very convenient short-hand notation to use to compute the sum, number with values, and the number missing data: sum_y = SUM(of y1-y20); * the sum of all the non-missing data; n_y = N(of y1-y20); * the number of non-missing data; nmss_y =NMISS(of y1-y20); * the number of missing data; The single hyphen which appears between the first and last variable names requires that the names have numeric suffixes. The syntax to sum 20 variables with SUM(y1-y20) or SUM(y1--y20) for variables with the same prefix will not work. For example, in the statement sum_y = SUM(y1-y20); the portion within the () is considered to be the difference between y1 and y20 -- the other 18 variables are ignored! Likewise, SAS interprets SUM(y1--y20) as SUM(y1-(-y20)), i.e. SUM(y1+y20), the sum of y1 and y20. If first_var is the first variable and last_var is the last variable of a set of physically adjacent variables in the dataset, then you can compute the sum of all variables with a double hyphen: Sum_y = SUM(of first_var--last_var); The actual order variable names are stored can be checked with PROC CONTENTS (see section 5.7). The numbers in the column under the number sign (#) from the resulting output indicate the order they are stored within SAS and the order they will appear if you print them without a VAR statement or in the SUM function as indicated. If you need to restructure the order, one way is to make a new dataset with RETAIN and SET statements as follows: DATA cls; RETAIN sex Name Height Weight age ; SET sashelp.class; Proc print; run; * variables are in a new order; Obs sex Name Height Weight age 1 M Alfred 69.0 112.5 14 2 F Alice 56.5 84.0 13 etc. If you have missing data, the SUM function works differently than adding all the variables with addition. For example, assume the following equation sums the variables y1, y2, y3, and y4: sum_y = y1 + y2 + y3 + y4; If one or more of the individual yi's are missing, then sum_y is also set to missing. The SUM function [e.g., SUM(y1,y2,y3,y4) and SUM(of y1-y4)] give the sum of the data that are NOT missing. For that reason, it is very important to use the N or NMISS functions along with it. For example, here is an efficient method to compute a scale from 10 variables (assumed to be ordinal responses) with some missing data, collected from a survey with individual items coded as a1 - a10: n_a = N(of a1-a10); nms_a = NMISS(of a1-a10); mn_a = MEAN(of a1-a10); * the mean of the non-missing data; If the data are missing at random, you can apply mean substitution for perhaps 1 or 2 missing items out of a total of 10 (not recommended for more than 20%) to compute the scale: IF (0 LE nms_a LE 2) THEN scale_a = mn_a*10; A method to sum data from one variable that may have missing values within a DATA step itself is to combine the RETAIN statement and the SUM function: DATA new; SET old; RETAIN total 0; total = SUM(total, cost); The Language and Procedures manual implies that this code is internally implemented as: RETAIN total .; IF cost NE . THEN total = SUM(total, cost); which is turn is the equivalent to: RETAIN total .; IF cost ne . THEN IF total ne . THEN total = total + cost; ELSE total = cost; B. Finding the Median from a list of Variables across each Observation Either PROC UNIVARIATE or PROC MEANS can be applied to compute the median of a given variable across observations of the data set (vertical orientation). However, what if you want to find an order statistic, such as the median, of a set of variables across a single observation (horizontal)? The ORDINAL function returns value of the rank, i=1 to p, you specify from a collection of p variables across each observation. If the dataset variables have names such as a,b,c,d, the value of rank i is: ord_i = ORDINAL(i,a,b,c,d); * i = 1,2,3,4; For 4 variables, you first enter the rank you desire (e.g., 1,2,3, or 4) followed by the variable names: DATA one; a=4; b=. ; c=.z; d=-99; ord_1 = ORDINAL(1,a,b,c,d); ord_2 = ORDINAL(2,a,b,c,d); ord_3 = ORDINAL(3,a,b,c,d); ord_4 = ORDINAL(4,a,b,c,d); OUTPUT; a=14; b=.z; c=7; d= .b; ord_1 = ORDINAL(1,a,b,c,d); ord_2 = ORDINAL(2,a,b,c,d); ord_3 = ORDINAL(3,a,b,c,d); ord_4 = ORDINAL(4,a,b,c,d); OUTPUT; a=24; b=. ; c=2; d=99; ord_1 = ORDINAL(1,a,b,c,d); ord_2 = ORDINAL(2,a,b,c,d); ord_3 = ORDINAL(3,a,b,c,d); ord_4 = ORDINAL(4,a,b,c,d); OUTPUT; run; proc print Noobs; run; Produces: a b c d ord_1 ord_2 ord_3 ord_4 4 . Z -99 . Z -99 4 14 Z 7 B B Z 7 14 24 . 2 99 . 2 24 99 Notice that with missing data, the ordered values are listed in the according to how SAS evaluates missing data with comparison operators (read my article on 'How SAS Handles Missing Data' you will find under 'Selected SAS Topics' on my SAS web page). To find the Min or Max of a set of variables is a simple modification of this approach. If there are no missing values among the variables for which you want to compute the median, the ORDINAL function is one simple approach. For an odd number of variables, all you need to specify is the list of variables and the index of the median. For example, with p=5, the value with rank order 3 would be the median: median = ORDINAL(3,of var1-var5); For an even number of variables with no missing data, a further calculation is required to find the median. The ORDINAL function is applied, but if the number of non-missing arguments is even, the median is defined as the average of two ordinals. Thus, here is a DATA step to determine the median for either an odd or even number of variables: DATA new; SET yourdata; nonmiss = N(of var1-var5); * N counts the number of non-missing data; ord1 = FLOOR((nonmiss+1)/2); * FLOOR is the closest integer less than the value; ord2 = CEIL((nonmiss+1)/2); * CEIL is the closest integer greater than the value; IF ord1=ord2 then median = ORDINAL(ord1, of var1-var5); ELSE median = (ORDINAL(ord1, of var1-var5) + ORDINAL(ord2, of var1-var5)) / 2; RUN; A MEDIAN function to perform this calculation will be available in Version 9. C. The MOD, FLOOR, and CEIL Functions MOD(a,b) The MOD function returns the integer remainder from the ratio of two integers, a and b. For example: a = MOD(5,2); * for a=5 and b=2 for 5/2, the remainder equals 1 ; b = MOD(6,2); * for 6/2 the remainder equals 0 ; Be careful if you attempt to divide by a non-integer. For example, the following statement will not produce a value of zero for variable c. c = MOD(135.45,0.05); Why? Because of the internal 'precision problem' a value of 0.05 will not be stored precisely in a binary-based computer, so that the binary representation of 0.05 does not divide exactly into 135.45. The problem does not exist with integers (which can be stored precisely), so if you multiply the numbers by 100, then MOD(13545,5) DOES return zero, as illustrated by this program: DATA _null_ ; c = MOD(135.45,0.05); d = MOD(13545,5) ; PUT c= d= ; FORMAT c 18.16 ; run ; prints: C=0.0499999999999811 D=0 D. The ROUND, ROUNDE, and ROUNDX Functions ROUND(x,precision) To avoid internal representation of data with extended decimal problems as shown in the previous example, the ROUND function can be of assistance. For example, to have c=.05 in the MOD function, insert the statement: DATA _null_ ; c=0.0499999999999811; rounded = ROUND(c,.05); PUT c= f16.14 rounded f12.10; remainder = MOD(135.44,rounded); PUT remainder; run; prints the following to the log file: c=0.04999999999998 0.0500000000 0.04 To round a number to some desired level of precision, integer or decimal, for example, the nearest even number (i.e. 0.5--> 0 and 1.5 -->2 etc.) specify precision=2 as the rounding criteria in the ROUND() function: DATA a; ARRAY num _numeric_; i = .5; j = 1.5; k = 2.5; DO OVER num; rounded = round(num,2); put num= rounded=; END; RUN; In the log file you will find: i=0.5 rounded=0 j=1.5 rounded=2 k=2.5 rounded=2 The ROUND() function returns the value of the first argument rounded to the nearest multiple of the second argument. For integers, many consider 0.5 to be in the middle of the range. However, this isn't true. For example, the range of values whose integer portion is 3 can be expressed as [3.0,4.0). Notice that 4.0 is not included in the range. 3.5 would be the midpoint if 4.0 was included in the range, but it obviously marks the beginning of the upper half of the range when 4.0 is not included in the range. If we include 4.0 as the endpoint of the range and as the starting point of the next range then we've skewed the number line! That said, there are some who believe that 0.5 is the midpoint and that it should be rounded to the nearest even integer. So, SAS has a function called ROUNDE() that rounds the first argument to the nearest multiple of the second argument and "returns an even multiple when the first argument is halfway between the two nearest multiples." SAS also has a ROUNDX() function that behaves very much like ROUNDE() but ROUNDE() - and ROUND() - fuzz the numbers so that numbers that are very close to the midpoint are considered to be the midpoint. ROUND function description http://support.sas.com/onlinedoc/912/getDoc/lrdict.hlp/a000245942.htm E. Variance Stabilizing Transformations The following statements can create useful transformations: z = SQRT(count); /* counts (Poisson distribution) */ /* variance proportional to mean: Poisson */ z = LOG(conc); /* concentrations, weights (log normal) */ /* SD proportional to mean */ z = LOG10(conc) /* take the Log (Base 10) */ ln5_conc=LOG(conc)/LOG(5); /* From math theory, log base A of x=ln(x)/ln(A), so you can get the log to any base, say base 5 */ /* constant coefficient of variation (CV) */ z = arsin(sqrt(prop)); /* proportions (0 < prop=y/n < 1) */ /* Binomial distribution */ logit = log( (perct/100) /(1 - (perct/100) ) ); /* enter logit for percentages (0 < perct < 100) */ /* variance proportional highest in middle */ F. Random Numbers Random numbers are available for a wide variety of distributions. The functions to use begin with RAN. Computer generated random numbers are never truly random -- caution and some checking on your own are always a good idea. Random numbers can be generated in a data paragraph: DATA xyz; DO i=1 to 10; uni=RANUNI(0); /* an argument of 0 uses the clock as a seed */ /* otherwise, use a 5 to 7 digit odd number */ OUTPUT; END; A do loop, always ends with an END; statement. The OUTPUT statement writes a new record to the data set xyz for each uniform random number created. Each record in set xyz will have the variables uni and i. Here is a list of the random number generators: x = RANUNI(seed) /* uniform between 0 & 1 */ x = a+(b-a)*RANUNI(seed); /* uniform between a & b */ x = RANBIN(seed,n,p); /* binomial size n prob p */ x = RANCAU(seed); /* Cauchy with loc 0 & scale 1 */ x = a+b*rancau(seed); /* Cauchy with loc a & scale b */ x = RANEXP(seed); /* exponential with scale 1 */ x = RANEXP(seed) / a; /* exponential with scale a */ x = a-b*LOG(RANEXP(seed)); /* extreme value loc a & scale b */ x = RANGAM(seed,a); /* gamma with shape a */ x = b*RANGAM(seed,a); /* gamma with shape a & scale b */ x = 2*RANGAM(seed,a); /* chi-square with d.f. = 2*a */ x = RANNOR(seed); /* normal with mean 0 & SD 1 */ x = a+b*RANNOR(seed); /* normal with mean a & SD b */ x = RANPOI(seed,a); /* Poisson with mean a */ x = RANTRI(seed,a); /* triangular with peak at a */ x = RANTBL(seed,p1,p2,p3); /* random from (1,2,3) with probs p1,p2,p3 */ The seed above is either less than or equal to 0 (to use the value of the internal clock to randomly start the sequence); positive (used as initial seed -- it should be odd and less than 2**31-1). The performance is untested for 0 or negative seed -- use at your own risk. The seed is only examined on the first encounter with a random number generator in your program, so you cannot change the process once you begin. G. Miscellaneous Numerical Functions Suppose you want to insert the constant pi=3.1415926535897932384626433832795.. into a program and you do not really need to or want to type 37 (or even 5 or 6) digits? Here are several numerical functions that will compute it for you: pi_1 = CONSTANT ( 'PI' ) ; * the CONSTANT* function available with version 8.1; pi_2 = 2*ARSIN(1); * applies a property of the arcsin transformation; pi_3 = GAMMA(0.5)**2 ; pi_4 = 4*ATAN(1); * slightly less accurate, usually close enough ; * For optimal DATA step performance, do not enter a CONSTANT() function in a DO loop. Assign it to a temporary variable prior to a DO loop and then enter the new variable as the constant. The DATA step compiler does not recognize and optimize a call to a CONSTANT() function. FACT(n) produces the values of n!, e.g., 5! = 5*4*3*2*1 = 120 DATA fct; INPUT n; yy=fact(n); ; 1 2 3 4 5 ; n yy 0 1 1 1 2 2 3 6 4 24 5 120