Section 4.8: Executable Statements in the DATA Step Statements in the DATA step are generally defined to be of two types: declarative (the order they appear in the DATA step does not affect data processing) and executable (the order in the program does matter). The DATA step works with an automatic looping sequence for each observation in the input data file. Executable statements must be processed in a sequential manner to the final one in the loop. The DATA step then returns to the first statement when all statements for the current observation have been processed and begins processing the next observation. Unless a RETAIN statement appears, all variables are assigned missing values at the top of the DATA step. When reading an existing SAS dataset, the SET statement reads an observation from it and assigns values to variables in the new dataset from the same variables in the existing dataset. If an external file is read, the INPUT statement reads the next observation and assigns values to the variables. Executable statements are those whose inherent order must follow the logic of the program and usually are associated with a computation. Once the end of DATA step is reached, the all variables for this observation are automatically written to the new dataset (as if an OUTPUT statement were there). SAS then returns to the beginning of the loop and automatically resets all variables to have missing values before processing the next observation from the input data. The following list includes some of the most commonly used executable statements: DO DELETE FILE and PUT IF-THEN/ELSE INPUT LINK OUTPUT PUT RETURN STOP User written formulas DO - run a set of statements if a condition is 'true' The DO statement tells SAS to process all statements until it encounters the next END; statement. It is often attached directly to an IF statement which tests whether a specified condition is true. IF (UPCASE(over_time)="YES") THEN DO; tot_salary=salary + hours*rate; taxes=salarly*tax_rate ; OUTPUT wages; END; In this example, the statements between the DO; and END; are computed only if the initial condition (over_time="Yes") is true. DO - Process a collection of statements a fixed number of times You can set up a looping structure within the DATA step to repeat a collection of statements a fixed number of times that are placed between DO; and END; for example, here is an example to make a dataset called simul which contains six observations, one each for x=1 to 6 with x-squared also a variable: DATA simul; DO x = 1 to 6; x_sq = x*x; OUTPUT simul; * the OUTPUT statement must be included before the END; END; RUN; Each pass through the DO loop increases the value of x by 1 (the default increment). When x equals 6 the statements are processed and then x is incremented to 7 so the condition no longer is true, and the DO loop ends. The OUTPUT simul; statement must be included within the loop, otherwise only the last record will be written to the dataset. That is when x=6 and x_sq=36 (since there is an automatic OUTPUT; at the end of every program that does not already contain an OUTPUT; statement within it). To process only odd-numbered elements of an array, begin with 1 and increment the counter by 2: DO v1 = 1 to 7 by 2; * odd-numbered values 1 3 5 7 are created for v1; < insert statements >; OUTPUT ; END; To process specific index numbers only, list the numbers with a comma between each value: DO i = 1, 4, 5, 8, 9; * process specific values of i, not listed sequentially; ; OUTPUT ; END; You can write a DO loop with a non-numeric variable in the following way: DO lvl = 'a', 'b', 'c'; < relevant sas statements>; END; The index variable lvl is repeated for values of 'a', 'b', and 'c'; lvl is a categorical variable, which takes the character constants a, b, and c. If you want to exit the DO loop before the final iteration, you may enter enter the LEAVE statement which is usually the result of a conditional test: DATA _null_ ; DO i = 1 to 6 ; IF i = 3 THEN DO ; PUT 'Leave Loop Now' ; LEAVE; END ; PUT 'i = ' i ; END ; PUT 'After Loop: i = ' i; RUN; The following message appears in the LOG window: i = 1 i = 2 Leave Loop Now After Loop i = 3 Two more options include: DO i = 1 to 1000 until(Test='True'); ; END; OR DO i = 1 to 1000 while(Test='False'); ; END; A DO loop is a fundamental feature of processing data assigned to DATA step arrays. For more information on this application see Section 4.5. DELETE: stop processing the current observation and return to the beginning of the loop without saving the record The DELETE; statement immediately stops processing the current observation which will NOT be written to the output dataset. The loop returns to the beginning of the DATA step to process the next observation. This statement is particularly helpful if some condition is detected for which you do not want to save this particular record: IF ( gender LE 0 or gender GE 3 ) THEN DELETE; * delete miscoded gender records; FILE and PUT : write data to an external text file The FILE and PUT statements work the opposite of INFILE and INPUT. FILE identifies an external text file to write data and PUT indicates what data you want to write. These statements make it possible to write values of variables or selected texts (such as error messages) from a DATA step or to perform data checking or DATA step debugging. They are also useful when needing to write contents of a SAS data set to a text file. You can use either of: FILE log; FILE print; FILE fileref; %* Use with statements FILENAME fileref ""; * see above for proper use of FILEREF statement ; FILE ""; FILE and PUT statements can be applied in either a DATA step or DATA _null_; Note: the macro statement %PUT always writes information to the log file (see Chapter 9 on using SAS macros). IF <> THEN <> ; ELSE <>; Conditional Statements IF < > THEN < >; statements process one or more statements if the condition specified in the logical condition based on values from the current observation meets the criteria. If the condition is true, then one or more statements following THEN are executed. If multiple statements are needed, they need to be enclosed with in a DO; END; block, e.g., INPUT < variables > ; IF THEN < enter one SAS statement> ; IF THEN DO; < enter multiple SAS statements >; END; Because variables in the logical condition require they have been assigned values, IF statements must appear following the INPUT statement or the statement which computes its values for each record. Logical operators can be written with letters or with symbols. Equivalent choices are presented by column: Not greater less Equal equal than than EQ NE GT LT GE LE AND OR NOT = ^= > < >= <= & | ^ IF (a EQ b) THEN c=2*a + b; IF (a NE 0) THEN d=2/a; IF (x > 0) THEN x_lg=LOG(x); If subjects IDs are linked to the group they belong, here is a quick way to assign group numbers in the program: IF ( 1 <= id <= 15) THEN group=1; IF (16 <= id <= 30) THEN group=2; IF (31 <= id <= 45) THEN group=3; These IF statements show how to notate a range of relevant values for a numeric variable; that is, the two endpoints of an interval are entered with the variable name in the middle along with the type of inequality (can also enter "less than" < or "greater than" > symbols): ( lower <= id <= upper) The SAS notation is essentially a shortcut to specifying a range of values with its two boundary values which you would otherwise define as two separate conditions: IF (id >= 1 AND id <= 15) THEN .. ; If missing data are stored as an impossible or illogical number (e.g. -9 when possible responses are from a Likert scale=1,2,3,4,5) you need to convert it to the SAS system missing data symbol: IF age=-9 THEN age= . ; A colon modifier can be added just prior to character data within quotes to indicate "begins with". In the context of the following example, it is an instruction to perform the comparison operation by truncating the longer operand (day) to the length of the shorter ("Mon"), if necessary. Without the colon, operands of unequal length are compared by padding the shorter to the length of the longer. For all days that have a value equal to M, Mo, Mond, Monday, Money, etc., take a specific action. IF (day EQ :'Mon') THEN DO; ..; END; DATA a; LENGTH text $4 string $15 ; INPUT text ; IF text=:'AB' Then string='We Gotta Winner'; CARDS; abcd Abcd ABCD PROC PRINT NOobs; run; returns text string abcd Abcd ABCD We Gotta Winner The colon can also be entered with other operators such as >, <, and IN (see below). However, when applying these operators differing variable lengths may require special attention. The IN Operator The IN operator, which is a comparison operator, searches for either character or numeric values that are equal to one from a list of values of the same data type. The list of values must be placed in parentheses, with each character value in quotation marks, and separated by either a comma or blank. If you have a few specific levels of a variable to test for such that a convenient 'less than' or 'greater than' enumeration as shown previously is not possible, the IN operator may be more efficient. IN is a comparison operator that searches a list of character or numeric values that must be placed between parentheses, with each number or character value (in quotation marks) separated by either a comma or blank. IF id IN (113 111 123 122 133 132 144 233) THEN OUTPUT; To exclude records from the output dataset for a few specified id values, enter the NOT IN operator: IF id NOT IN (123 143 146 155) THEN OUTPUT; For character data, the values in parentheses must be placed between quotation marks and separated by either a comma or a blank. For example, suppose you want to use all locations from North Carolina or Texas. You could specify: IF (UPCASE(state) EQ 'NC' OR UPCASE(state) EQ 'TX') THEN OUTPUT; However, a more efficient way to code this statement (especially with several values of state) is the IN operator, which tells SAS to include any state specified in the list: IF UPCASE(state) IN ('NC','TX') THEN OUTPUT; In addition, you can use the NOT IN logical operator to exclude character data from a list: IF (UPCASE(state) NOT IN ('CA', 'TN', 'MA')) THEN OUTPUT; Note that you cannot enter a variable list of integers with a lower and upper bound separated by a hyphen (i.e., a range of consecutive integers such as 1,2,3,4,5 cannot be designated as 1-5). The hyphen is translated as a negative sign attached to the following number and therefore, is interpreted as a distinct number in the list. For example, an IF statement with following IN list IF xvar IN (1-5, 12-14, 30-34) then output; tests for the presence of the following numerical values: 1 -5 12 -14 30 -34 DATA a; DO xvar = -35 to 35; IF xvar IN (1-5, 12-14, 30-34) then OUTPUT; END; proc print DATA=a; var xvar; run; produces this output Obs xvar 1 -34 2 -14 3 -5 4 1 5 12 6 30 The odd and perhaps potentially confusing part is that in addition to a space, both the comma and negative sign (attached to a number) are accepted as delimiters. In the latter case it is both a delimeter and assigns a negative value to the integer! If you want to enter a list of consecutive integers, yet do want to type them all between the (), a macro can be of assistance (see Chapter 9). DATA a; DO xvar = -10 to 10; OUTPUT; END; %MACRO rng(start,end); %DO i = &start %TO &end; &i. %END; %MEND rng; DATA b; SET a ; flag = 0; If xvar IN (%rng(-6,-4), %rng(1,3), %rng(7,8)) THEN flag = 1; RUN; proc print data=b NOobs; WHERE (flag EQ 1); run; xvar flag -6 1 -5 1 -4 1 1 1 2 1 3 1 7 1 8 1 Note that the IF statement in the previous DATA step works as if you had typed: If xvar IN (-6 -5 -4 1 2 3 7 8) THEN flag = 1; The valid contents of a categorical (integer) variable can be placed into a macro variable and then entered into a statement like: IF nn IN (&tlst.) THEN DO; ; end; DATA tst; tt = 1; OUTPUT; tt = 5; OUTPUT; tt = 3; OUTPUT; tt = 2; OUTPUT; tt = 4; OUTPUT; tt = 2; OUTPUT; tt = 1; OUTPUT; tt = 3; OUTPUT; run; proc sql noprint; select distinct tt into :tlst separated by ',' from tst; quit; run; %put &tlst. ; results in &tlst == 1,2,3,4,5 The colon modifier with the IN operator. DATA ck; SET rx; flag=(UPCASE(product) IN :('ALP', 'CHL', 'CLM', 'CLZ', 'DIA')); RUN; (Note that flag is a dummy variable that indicates 1=true, 0=false) proc print; run; Obs product flag 1 ChL 1 2 CLMh 1 3 C2LZ 0 4 DIA 1 5 AaLP 0 6 CLmy 1 7 CeLZz 0 9 Alpto 1 Dichotomous Data Dichotomous data can be created based on the ranges of other numerical or categorical data. PROC FORMAT can write a format that will assign numerical values to the relevant text, e.g. 0='No' and 1='Yes'; IF (0 LE sales LT 2000) THEN bonus=0; IF ( sales GE 2000) THEN bonus=1; A special notation is available as a much shorter yet equivalent version of the previous two IF statements. bonus=(sales GE 2000); SAS automatically calculates a variable called bonus that is equal to 1 if the condition is true and equal to 0 otherwise. You should not apply this statement as it is written if any sales data are missing - why?? Two ways to get around the problem are to first check that 'sales' is in a valid range or to add the MISSING function in an IF statement: IF (MISSING(sales) NE 1) THEN bonus=(sales GE 2000); If you are not careful in your handling of observations with missing values when you assign dummy variables in this manner, results might be influenced. IFN and IFC Functions SAS functions are covered in Chapter 7. However, the IFN and IFC are directly related to the IF/THEN/ELSE statements, they are introduced here. The IFN function is similar to the IFC function, except that IFN returns a numeric value whereas IFC returns a character value. IFN: Returns a numeric value of an expression based on whether the expression is true, false, or missing The IFN function uses conditional logic that enables you to select among several different values based on the value of a logical expression. IFN evaluates the first argument, then logical-expression. If logical-expression is true (that is, not zero and not missing), then IFN returns the value in the second argument. If logical-expression is a missing value, and you have a fourth argument, then IFN returns the value in the fourth argument. If logical-expression is false, IFN returns the value in the third argument. The IFN function, an IF/THEN/ELSE construct, or a WHERE statement can produce the same results However, the IFN function is useful in DATA step expressions when it is not convenient or possible to use an IF/THEN/ELSE construct or a WHERE statement. Calculating Commission Using the IFN Function In the following example, IFN evaluates the expression TotalSales > 10000. If total sales exceed $10,000, then the sales commission is 5% of the total sales (first computation). If total sales are less than $10,000, then the sales commission is 2% of the total sales (second computation)(. DATA _null_; INPUT TotalSales; commission = IFN(TotalSales > 10000, TotalSales*.05, TotalSales*.02); PUT commission=; datalines; 25000 10000 500 10300 ; run; IFC: enables you to select among several different values based on the value of a logical expression. IFC evaluates the first argument, logical-expression. If logical-expression is true (that is, not zero and not missing), then IFC returns the value in the second argument. If logical-expression is a missing value, and you have a fourth argument, then IFC returns the value in the fourth argument. If logical-expression is false, IFC returns the value in the third argument. The IFC function is useful in DATA step expressions, and even more useful in WHERE clauses and other expressions where it is not convenient or possible to use an IF/THEN/ELSE construct. IFC evaluates the expression grade>80 to implement the logic that determines the performance of several members on a team. The results are written to the SAS log. DATA _null_; length performance $17 ; input name $ grade; performance = ifc(grade>80, 'Pass', 'Needs Improvement'); PUT name= performance=; datalines; John 74 Kareem 89 Kati 100 Maria 92 ; LINK : call a subroutine The LINK statement calls a subroutine embedded within the current DATA step: DATA ds_out; SET ds_in; ... LINK sub01; * this statement links to the subroutine name at this point; ... RETURN; * return to keep normal execution from falling into the subroutine; sub01: * subroutine statements here; RETURN; * go back to statement after the LINK which called the subroutine; As with the GOTO statement, the subroutine name must have a colon after it [and the label may have to be in column 1]????????. The RETURN statement instructs SAS to go back to whichever LINK statement called the routine. OUTPUT: writes records to a SAS dataset The OUTPUT statement writes the input and all calculated variables one observation at a time into the dataset specified on the DATA statement. A KEEP or DROP statement, if included, in the DATA step indicates which variables will be stored. If you make only one dataset and process each record once, you do not need to include the OUTPUT statement; SAS automatically puts one at the end of the DATA step. This statement is not necessary in many applications of sequential data processing. It often helps to include it so that you know how SAS works. IF more that one dataset is to be produced, you need to add the appropriate data set name following OUTPUT, e.g., DATA data1 data2; OUTPUT data1; OUTPUT data2; RUN; PUT In the DATA step this statement writes lines to the SAS log, to the SAS output window, or to an external location (such as a text file) that is specified in the most recent FILE statement (on which the “link” can be specified with a FILENAME statement outside the DATA step – see 037_filename.txt). (Note: there is also a %PUT statement (Chapter 9: SAS macros) and a PUT function (Chapter 7.2: character functions) explained in the indicated documents) RETURN : stop processing the current observation, output the results, and return to the beginning The RETURN statement is usually placed at the conclusion of set of statements invoked from a logical test with an IF statement or at the end of a subroutine. SAS will stop processing the current observation and return to the beginning of the DATA step for the next iteration, ignoring all remaining statements. The difference between the RETURN and DELETE statements is that with RETURN, data from the current observation are automatically written to the SAS data set _unless_ an OUTPUT statement appears elsewhere in the program. If an OUTPUT statement is used elsewhere, the current record will not be written to the current data set. STOP : Immediately halt the processing of the DATA step STOP; immediately halts processing in the DATA step at the current observation if it appears by itself or if it is the result of a true conditional statement you specify. For example, DATA new; INFILE 'xyz.dat'; INPUT x; IF_n_ > 25 THEN STOP; * halts execution of the DATA step on the 26th iteration; This statement is particularly useful when testing code for applications with large data sets. The output file is greatly reduced in size when you process the first 25 observations and check the status of the program and results rather than process the entire data set. USER WRITTEN FORMULAS - Equations and Functions for Data Transformations There is no need to perform mathematical transformations on your actual data outside of SAS. In fact, it is good practice to enter your data after it is checked and corrected for errors and let SAS compute the transformations. When you have a large number of variables to process, you can also perform transformations using DO loops or specially supplied function statements that contain the respective formula. All data must contain a value before they can be used in a formula. Therefore, any statement to transform data must be inserted *after* the SET or INPUT statements. The new variable will be assigned a missing value if any of its components are missing. Mathematical equations similar to those used with other programming languages are used to create additional variables. Special SAS functions can also be included on the right hand side of the equation (see Chapter 7 for more examples). They are specialized routines that return a value computed from one or more arguments enclosed within parentheses. For example, the statement DATA lgt; LABEL x='Weight of Subject' sq_x='Sq Root of Weight'; INFILE 'measure.dat'; INPUT id x; sq_x = SQRT(x); RUN; The INPUT statement reads a value of x representing the weight of a subject. The formula takes the square root of the current value of x and stores it in the new variable called sq_x. You can also take existing SAS data sets and create new variables. For example, if a variable y is contained in a data set called direct, then a log transformation is computed as: DATA logs; SET direct; log_y = LOG(y); Run; This DATA step creates a new data set called logs that contains all the data from the data set direct plus the new variable log_y is created as the natural log of the variable y. Here are the functions for a few other commonly used data transformations: z = LOG(y); natural log of y z = LOG10(y); log of y, base 10 z = SQRT(y); square root of y z = x*y; x multiplied by y; use * for multiplication, + for addition, - for subtraction, / for division z = y**2; exponent: "y squared" or y*y z = y**0.5; "y raised to the 1/2 power" (same as SQRT(y)) z = x**-2; negative exponent: "1 /(x squared)" z = SIN(x); trigonometric sin function of x (also COS(x), TAN(x),...) Combine Character Data You can also combine two character variables into one with the concatenation operator, consisting of two vertical bars (||) placed between the variable names: var3 = var1||var2; if var1 and/or var2 may have spaces at the beginning or end, try: var3 = trim(left(var1))||left(var2);