Chapter 6 - ADVANCED APPLICATIONS OF THE DATA STEP Section 6.1 -- Dataset Manipulations a. Methods to APPEND Two or More SAS Datasets into One Dataset b. MODIFY or UPDATE SAS Datasets c. Read Multiple Text files with the same format into one SAS Dataset d. Read Multiple SAS Datasets into one SAS Dataset and select subset of records a. Methods to APPEND Two or More SAS Datasets into One Dataset SAS has several methods to append two or more datasets into one. Before you apply any of them, all data must first be read into a SAS dataset. "Appending" two or more SAS datasets can be visualized as combining them vertically (sometimes called concatenation). It essentially means to stack datasets on top of each other. Typically, in the output dataset, the number of observations becomes the sum of observations from the two input datasets and the number of variables remains the same. In order to stack two or more datasets properly you should take special care to give them an identical data structure, i.e., the same variable names with identical data formats. SAS has several approaches to perform this task, including: * DATA step * PROC APPEND * PROC DATASETS * PROC SQL The simple, brute-force, pound-them-together approach to append two or more datasets into one dataset, no matter what their data structure, is to apply the DATA step and list the individual dataset names on the SET statement. The following DATA step stacks individual datasets in the order they are listed on the SET statement, no matter what their data structure: DATA combined; SET ds_1 ds_2 ds_3 .. ds_n;. < other SAS statements> RUN; If datasets ds_1, ds_2, ..., ds_n have n_1, n_2, .. n_n observations respectively and all have the same number of variables with identical names, the new file with the name 'combined' has a total of N=(n_1+n_2+..+n_n) observations and contains all the variables found in each. The brute-force nature of the DATA step approach is illustrated when variable names found in one dataset do not exist in the other datasets. For these observations, missing values for that item will be entered for all records taken from the other files. Note the length of any character data will be established from the lengths found in ds_1, the first dataset on the SET statement. Values of the same variable will be truncated if longer lengths of that variable appear in subsequent datasets. One drawback of DATA step is that it processes datasets one record at a time. However, on today's fast systems, efficiency probably isn't a major concern for reasonably sized datasets. The distinct advantages of DATA step method include the ability to compute new variables as it appends files together and to save observations with an OUTPUT statement if a specified condition is true. It will also 'force' the files together, whatever their data structure (which could give you some undesirable results). SAS can also keep track of which records which came from specific data sets and then assign values to new variables accordingly. For example, with the SET statement you can enter an option to compute an indicator variable that records this information. DATA FileA; vartext='fileA'; RUN; DATA FileB; vartext='fileB'; RUN; DATA FileC; vartext='fileC'; RUN; DATA NewFile; SET FileA(IN=a) FileB(IN=b) FileC(IN=c); Group_A=a; Group_B=b; Group_C=c; RUN; PROC PRINT DATA=NewFile NOobs; Run; The following output is produced: VARTEXT GROUP_A GROUP_B GROUP_C fileA 1 0 0 fileB 0 1 0 fileC 0 0 1 PROC APPEND also concantenates datasets with a master dataset; however, currently appends only one dataset for each invocation of the PROC step. Since it does not "read" the datasets (i.e., process them one record at a time), this approach works much more efficiently than the DATA step: PROC APPEND BASE=stat DATA=ds_1; RUN; PROC APPEND takes the dataset called ds_1 and adds it to the bottom of the master dataset called stat. PROC APPEND works well in simulation examples where output data from a SAS procedure are collected into a dataset and appended into a master dataset containing all the results of the experiment. Be aware that PROC APPEND combines files with more consistency checking than the DATA step approach. Therefore, the two datasets should be compatible regarding the number and names of the variables and also regarding data type and length of character variables. PROC APPEND takes special attention to ensure the two datasets to be combined are strictly congruent. By default, it has a safety feature that prevents combining data from different files where character variables with the same names have different lengths. It also excludes variables from the dataset to be appended that are not defined in the master dataset. This safety feature is designed avoid problems, that other methods described won't check for you. One way to relax the strict compatibility of the two datasets is to add the FORCE option: PROC APPEND BASE=stat DATA=ds_1 FORCE; RUN; If the two datasets are not completely compatible, the FORCE option allows some differences in variable attributes such as length of character data or variable names in the master dataset which do not appear in the appended file. In this example the FORCE option will append ds_1 to the master file even if some of the variables contained in it are not found in ds_1 or the lengths of character data do not match. However, be aware that appended character data will have values truncated to the length specified on the master file, if any of their values are longer. Also, if the new dataset contains extra variables, they will not be appended to the master dataset, even with the FORCE option in place. It's generally good programming practice to first append datasets without using the FORCE option and note any problems with the data SAS tells you in the LOG window. This will alert you to differences in data formats of the two files of which you might not be aware or had forgotten to consider. To append a second dataset, assuming stat now contains the contents of ds_1, enter a second PROC APPEND step with the next dataset you want to attach to the master: PROC APPEND BASE=stat DATA=ds_2 FORCE; RUN; When the master dataset does not yet exist, you can make an empty file with a DATA step containing the desired data structure: DATA stat; FORMAT name $char5. ; * the presence of name as character adds the variable to the file; * note that name is specified to have a character format of length 5; * define numerical data (can enter shorter lengths, if all integers; LENGTH aic loglik sse numresid nparms ndifs errorvar mu last 8; STOP; * no need to output any records; RUN; If the data structure of the master dataset already is present in an existing SAS dataset, you can make a copy of the structure with: DATA dat.new_stat; STOP; SET dat.stat; RUN; In both examples the STOP; statement prevents any records from being written to the new master dataset. The variable names and formats will be there ready to accept datasets to be appended. For the first example above, if you don’t know what the exact structure the master file should be, you can run the DATA or PROC step that produces the file you plan to plan to append files to. Then extract the variable names and data formats with PROC CONTENTS. Once the two datasets are compatible, you are ready to apply PROC APPEND to concatenate the files together. One important application of PROC APPEND occurs when running a loop from a SAS macro where the same procedure is applied to many datasets and you want to accumulate results, such as regression coefficients or summary statistics from each run, into one dataset. Other examples are found with simulations or bootstraps where you collect output data from many runs that produce individual datasets with the same structure. Another example is the collection of repeated measures data. Suppose that subjects are followed in an exercise program over time, with height, weight, and other measurements collected month. The master dataset consists of the initial set of measurements on the subjects. Each month you want to append the measurements you record from each subject. In this example, the master dataset is called subjects and the first month’s data month1 are appended to it with the following statements: PROC APPEND BASE=subjects DATA=month1; RUN; The same results could also be achieved with the DATA step as described above; however, the principal advantages of PROC APPEND are the safety feature for file compatibility when combining datasets and the quicker processing time. However, new variables cannot be added or computed with PROC APPEND; this must occur in a subsequent DATA step. Chapter 4.9 shows an example how PROC DATASETS will append two SAS datasets together. The same advantages and limitations found with PROC APPEND also apply with it. PROC SQL is yet another method to append a set of SAS datasets together. The following commands take three datasets and append them into one dataset called combined: PROC SQL; TITLE 'SQL Table COMBINED'; CREATE TABLE combined AS SELECT * FROM (SELECT * FROM ds_1 WHERE (1 <= a <= 1) UNION SELECT * FROM ds_2 WHERE (10 <= a <= 11) UNION SELECT * FROM ds_3 WHERE (20 <= a <= 21)); QUIT; PROC PRINT DATA=combined; run; PROC SQL can concatenate up to 32 tables in one step. Character data with the same names do not need to have the same lengths and will not experience truncation errors, even if the dataset with the character data containing the longer lengths is placed second. The WHERE options included in this example allow you to specify ranges of records based on one or more of the variables from each dataset to be written to the combined file. Note that to append files only, PROC APPEND has a real time advantage over either the DATA step or the SQL approach. Results will depend on file width and number of records. Usually the SQL union has the slowest performance. One additional safety feature you should apply with any method you apply to append datasets is to make sure that the same data are not appended to a master dataset more than once. One solution is to apply PROC DATASETS immediately following the DATA step or PROC APPEND to delete the dataset(s) SAS most recently appended to the master dataset once the task has been completed successfully. PROC DATASETS; DELETE ds_1 ds_2 ds_3; * list dataset names to delete here ; RUN; QUIT; b. UPDATE or MODIFY Datasets To update a SAS data set means to apply changes to a master dataset from a transaction dataset. You will work with two input datasets. * master: the data set containing the original data * transaction: the data set containing corrections to existing records SAS datasets can be edited by using the UPDATE statement or the MODIFY (which work very much like the MERGE statement, described in Chapter 6.2). The MODIFY statement replaces, deletes, or appends observations in an existing dataset. It modifies the dataset as it is, without making a copy of it, thus minimizing the use of disk space and input/output run time. MODIFY allows for different ways to handle combine data. Here is the simplest possible illustration to update one value from a single record: DATA master; INPUT id $ x; CARDS; a 8 b 1 c 7 ; DATA transaction; INPUT id $ x; CARDS; b 16 ; DATA master; MODIFY master transaction; BY id; RUN; proc print data = master; run; -- Master -- -- transaction -- -- new Master -- Obs id x Obs id x Obs id x 1 a 8 1 b 16 1 a 8 2 b 1 2 b 16 3 c 7 3 c 7 UPDATE uses observations from the transaction dataset to change the values of corresponding observations from the master dataset. You must use a BY statement with the UPDATE statement. All observations in the transaction data set are keyed to observations in the master data set according to the values of the BY variable. The process can also add new records. Assume the transaction file contains one new record with its respective values: DATA transaction; INPUT id $ id x; CARDS; c 12 d 17 ; DATA master; UPDATE master transaction; BY id; RUN; -- Master -- -- transaction -- -- new Master -- Obs id x Obs id x Obs id x 1 a 8 1 c 12 1 a 5 2 b 16 2 d 27 2 b 16 3 c 7 3 c 12 4 d 17 After running a DATA step with the UPDATE statement, the number of observations in the new dataset is the sum of the number of observations in the master dataset and the number of unmatched observations in the transaction dataset. The variables on the BY statement (more than one is allowed) must specify unique values for each observation in the master data set. If the master data set contains two observations with the same values of the BY variables, the first observation is updated and the second observation is ignored. SAS will then write a warning message to the log file when the DATA step is run. c. Read Multiple External Text Files with the Same Format into one SAS Dataset When you have many text files where all of them have data with identical formats and file names of the same length, the DATA step can read each one and sequentially place their data to one SAS dataset. First, enter all file names in a file with a name such as ‘names.txt’ which lists the drive, complete path, and file name, 1 file per line c:\data\nal11.dat c:\data\nal12.dat c:\data\nal13.dat c:\data\nal14.dat c:\data\nal15.dat etc.. The DATA step shown below first reads the file names in the order listed from this file. From each file it then reads the specified data and if desired, processes the data contained in it with additional SAS statements. DATA one; INFILE 'c:\data\names.txt'; LENGTH nextfile $17. ; * 17 is the number of characters in each line; INPUT @1 nextfile ; INFILE dummy FILEVAR=nextfile END=lastrec missover lrecl=500; LENGTH x_char $20 ; DO until (lastrec); INPUT id 1-5 x_char 6-25 ; < insert SAS data processing statements here > ; OUTPUT one; END; RUN; PROC CONTENTS; RUN; d. Read Multiple SAS Datasets into one SAS Dataset and select subset of records Some situations may require you to match records from two datasets where the ID variables may not necessarily be unique in one dataset. [If they are unique, then merging the files with a DATA step as described in the next section is more efficient.] A special type of DATA step processing can be applied as this example demonstrates with financial data. First you have a company file that contains various company identification codes (ticker values) which do not need to be unique. You also have a file of stock market returns that consists of financial data collected daily over a long period of time for the companies listed in your other file. Your task is to select financial data from a specified time interval from the returns file for each record in the company file. Note that this same problem can be addressed with PROC SQL without the capability for calculations. See the next section on merging datasets in this chapter for an example. * read returns data from a text file; DATA returns; INFORMAT date $8. ticker_r $5. ; FORMAT obs_date mmddyy10. ; INFILE 'c:\sas\data\returns_all.csv' firstobs=2 dlm=',' dsd missover; INPUT permno date ticker_r ret vwretd; * convert a date stored as character to a SAS date value; yr = SUBSTR(date,1,4); mnt = SUBSTR(date,5,2); day = SUBSTR(date,7,2); dtc = TRIM(mnt) ||'/'|| TRIM(day) ||'/'|| TRIM(yr); obs_date= INPUT(dtc,mmddyy10.); ticker_r=UPCASE(ticker_r); * make sure all ticker values are upper case; * apply the following statement for testing; * IF (SUBSTR(ticker_r,1,1) EQ "A" AND SUBSTR(ticker_r,2,1) LE "D") then OUTPUT; RUN; PROC SORT DATA=returns; BY ticker_r obs_date; RUN; PROC CONTENTS data=returns; RUN; PROC FREQ; TABLE ticker_r; RUN; PROC PRINT data=returns(obs=20); RUN; * need to compute the count of observations in the dataset returns; DATA _null_; SET returns NOBS=count; IF _n_ = 1 THEN CALL SYMPUT("nobs",LEFT(PUT(count,8.))); STOP; RUN; %PUT &nobs; * read in the company data from an EXCEL file; PROC IMPORT DATAFILE="c:\consult\johnc\data_sas.xls" OUT=company DBMS=excel2000 REPLACE; SHEET="final_1993"; GETNAMES=yes; run; PROC PRINT DATA=company(obs=10) NOobs; run; DATA co; SET company; KEEP transaction ticker date_t date_low date_upr; FORMAT date_t date_low date_upr mmddyy10. ; * convert excel date into the year, month, and day variable; date_t=DATEPART(date); date_low= date_t - 20; date_upr= date_t + 120; * statement for testing purposes; *IF ticker IN ('AA', 'ABT', 'ACK', 'ACR', 'ACY' ) then output; RUN; PROC SORT data=co; BY ticker; RUN; PROC CONTENTS data=co; RUN; PROC PRINT data=co(obs=20) NOobs; RUN; DATA all; * loop through each record in the company dataset; SET co; ticker=UPCASE(ticker); PUT _n_ ticker date_low date_t date_upr transaction; KEEP date_low obs_date date_upr ticker ticker_r transaction vwretd ret; * for each ticker value in the company dataset, loop through the returns dataset to match ticker values and select observations from them within the relevant range of the date; DO i = 1 to &nobs; SET returns point=i; IF (TRIM(ticker) EQ TRIM(ticker_r)) AND (date_low LE obs_date LE date_upr ) THEN OUTPUT all; END; RUN; PROC PRINT DATA=all(obs=100); VAR ticker date_low obs_date date_upr transaction vwretd ret; RUN;