Section 6.2: Methods to MERGE two or more SAS Datasets into one Dataset Merging two or more datasets is a horizontal concept. It is akin to entering a freeway while driving: your car (i.e., an incoming variable) will take its place 'safely' and retain its distinct identity (i.e., won't crash) alongside vehicles (i.e., other variables) in other lanes. The general idea is to combine the contents of two or more datasets that contain different variables and have nearly the same number of observations (though equal numbers are not required as long as you realize how missing data are produced). You want the individual variables to retain their identity in the new file and not be overwritten by a variable with the same name from a different file. In order to merge files so you have a one-to-one match with the appropriate observation, records must be connected by one or more variables common to both datasets that uniquely identify each observation. An easy way to visualize the process of merging files is to place two Excel worksheets side-by-side, where each file contains one or more identification variables (with the same name and data type) sorted in ascending order to link their rows. The other variables must have different names. For each unique combination of the identification variables, from the two spreadsheets select the variables with different column names to enter into a new worksheet. Depending on your selection criteria, the number of rows in the new worksheet may be as small as the number from the worksheet with the fewest observations or, in the extreme case, it can be as large as the sum of the distinct rows (as identified by their id values) from all worksheets. As the number of rows which do not have a unique match across the two files gets larger, this may imply the separate worksheets are not compatible for a merge. To read what SAS provides concerning merging files, go to: http://ftp.sas.com/techsup/download/technote/ts644.html Also read "How MERGE Really Works" by Bob Virgile found at: http://www.pswcrl.ars.usda.gov/Popham%5Cmerge.pdf In this section merging files will first be discussed within the context of the MERGE statement contained in a DATA step. In some applications PROC SQL is a more efficient (i.e., faster) approach if you don't need to make any calculations. It will also do types of merges not easily programmed with the DATA step (such as many-to-many illustrated at the bottom of this page). Concepts behind merging SAS datasets are best learned when limited to two datasets although it is possible to merge more than two files in one DATA step. If you have more than two datasets to merge, take the output dataset from the DATA step that merges the first two files and then merge it with additional datasets in subsequent DATA steps. Merging two SAS datasets (in these examples called file1 and file2) assumes they have n1 and n2 observations respectively, where these two numbers are typically equal (though many applications exist where they do not need to be equal). They should both have one or more identification variables with the same name and of the same data type. File1 contains p variables and file2 contains q variables which are unique to each dataset. A new dataset will be written that matches observations with unique values of the identification variables and attaches the remaining variables from both files. Merging files without Identification Variables It is possible to merge datasets without a BY statement, that is no unique identification variable exists on the datasets. For two datasets the process automatically matches data from row 1 of file1 with data from row 1 of file2, data from row2 in file1 with the data from row 2 in file2, and so forth. If you are absolutely certain these two datasets are row compatible, you can apply this method to merge two (and even more) files. DATA file1; INPUT x y @@; CARDS; 1 2 2 3 4 5 6 8 7 9 ; DATA file2; INPUT a b @@; CARDS; 11 22 12 23 14 25 ; DATA all; MERGE file1 file2; RUN; PROC PRINT DATA=all NOobs; RUN; Here are the results: x y a b 1 2 11 22 2 3 12 23 4 5 14 25 6 8 . . 7 9 . . In the output dataset "all" the first three rows are matched 1-1 with records from the two input datasets given in the same order. However, since file1 has two more records than file2, the final two rows of the new dataset have missing values for the variables taken from file2. Two or more SET statements also perform a one-to-one reading (also called one-to-one matching) of the specified datasets. [Note this merges two or more files horizontally which is a vastly different process than placing two dataset names on one SET statement, which appends or stacks the files vertically.] The new dataset contains all the variables from all the input datasets. The number of observations in the new dataset is equal to the number of observations from the smallest original dataset. DATA all2; SET file1; SET file2; RUN; PROC PRINT DATA=all2 NOobs; RUN; x y a b 1 2 11 22 2 3 12 23 4 5 14 25 Note the BIG difference in merging datasets in this manner: * Merging two files with a MERGE statement will read through the last row of the dataset which contains the most rows giving missing data for variables from the smaller dataset if the number of rows is unequal. * Merging files with multiple SET statements stops when the last row of the dataset with the fewest rows is complete. Here is a simple way to merge two datasets where the first contains the level 1 data (i.e., the actual observations from file1) and the second file contains the means or level 2 data, computed from data in file1 (produced with an OUTPUT statement from PROC MEANS): DATA all2; SET file1; IF _n_ EQ 1 THEN SET means; RUN; x y mn_x mn_y 1 2 4 5.4 2 3 4 5.4 4 5 4 5.4 6 8 4 5.4 7 9 4 5.4 Merge Datasets which Include Identification Variables It is generally good practice to merge datasets that both have one or more unique identification variables for each observation; that is, every row in both datasets should contain one or more categorical or integer variables that give each observation in one file a direct link to an observation in the other file. Merging datasets by matching on data considered to be "continuous" (i.e., they contain decimals or the results of calculations in the DATA step) is not recommended. If you need to match on continuous data, first compute some other variable in both files that will 'match' records rather than having two continuous numbers collide and assume the two datasets will merge. If two continuous values are computed with different formulas, the floating point representation in computer memory may differ in the last couple bits, leading to an inability to match exactly. So, if you need to do this type of merge, first apply the ROUND function to the data to limit the number of decimal point accuracy (See Section 7.9 on Miscellaneous SAS Functions, part C) and then merge on these new variables. The BY statement is fundamental to many data processing tasks, especially merging files. You can learn about how it works at: http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a001283274.htm Merging files in a DATA step without a BY statement may work just fine in special situations; however, what happens when you forget to include a BY statement in the DATA step when it is really needed? SAS processes the two files, but the resulting output dataset will likely be merged incorrectly. To avoid this problem, you can invoke a System Option to detect a missing BY statement in the DATA step when you merge datasets. As a safety value, you can enter this option as your default value by includiing it to your autoexec.sas file (see Chapter 1.5 for instructions how to find and edit this file). Then it will always be active before you run a DATA step that merge files. You can also enter it at any line prior to the DATA step with an OPTIONS statement: OPTIONS MergeNoBy=error; If a BY statement does not appear in a DATA step that contains a MERGE statement, this option will stop data processing as if a coding error occurred and will print this message in the LOG window. If on the rare occasion you want to or must merge files with no BY statement present (as shown in the first example), then enter either ‘NoWarn’ (proceed as if nothing happened) or ‘Warn’ (issue a warning message in the LOG window) instead of ‘error’ following the equals sign. OPTIONS MergeNoBy=Warn; Sorting Datasets by Common ID Variables Because it is very easy to unintentionally to merge variables from observations across two datasets that do not match, we recommend that you always first sort the two datasets by their unique identification variables with PROC SORT. You then enter the BY statement into the DATA step following the MERGE statement that lists the names of the two datasets. The following statements sort records by the values of the ID variables from smallest to largest. PROC SORT DATA=file1; BY id year; RUN; PROC SORT DATA=file2; BY id year; RUN; In these examples, assume the data are sorted by the ID variables listed from left to right and in ascending order on the BY statement. When merging datasets (or when applying any SAS procedure with BY statement processing), SAS will print an error message to the LOG window if the files are not sorted in this manner. DATA all_vars; MERGE file1 file2; BY id year; < other SAS statements > RUN; If the same unique id variable(s) occur in both ‘file1’ and ‘file2’, the combined data set ‘all_vars’ will contain the same number of records, and will include the id and year variables and the p + q additional variables taken from both data sets. If file1 and file2 have combinations of ID variables that do not match, the merged dataset will contain all the ID variables listed in both files. The total number of variables in the new data set is still the variables on the MERGE statement and the p + q additional variables; any variable not present in the observation from one of the datasets will be set to missing. Add variables from one dataset into a master dataset Suppose file1 is your master dataset and you want to maintain all records that are present in it. File2 contains other information you want to add to the observations in file1. Records with the same identification variables may or may not exist in file2, or it may also contain observations you do want to include in the merged dataset. Known as a Left Join, the MERGE and IF statements from this next DATA step specify that records from file1 only will be written to the output dataset with the corresponding data from file2, if the record exists. * left join merge; DATA all_vars; MERGE file1(IN=ii) file2(keep=id year var1 var2 var3); BY id year; < insert other SAS statements > IF (ii EQ 1) THEN OUTPUT; * enter this statement last ; RUN; The IN= option of the MERGE statement specifies an indicator variable ii. The value ii appears once again in the ensuing IF statement. SAS sets this indicator variable to 1 when the record comes from file1 and sets it to 0 when the record is not found in file1 but only on file 2. The variable named ii from the MERGE statement can be any valid SAS variable name; however, note two important features: * it must NOT be a variable name that already exists on the specified input data set (if the variable does exist, it will not be written to the new dataset) * the indicator variable ii will not be written to the merged dataset This feature is especially helpful when file1 is your master dataset that contains only the observations to be sent to the next PROC or DATA step. More efficient use of resources (i.e., filesize is minimized) can be achieved if you enter the KEEP= option as shown attached to the file2 dataset. This option is especially helpful if it contains many variables you don't need in the merged file. Merge Records which are Common to Two Datasets Suppose you want to merge observations from two datasets into one dataset that only contains observations common to both input files. That is, the same observations may not exist in both file1 and file2, but in order to use the data you only want to collect records and their associated variables that are common to both files. The following set of commands writes a new SAS dataset called common that consists of observations that are common to both file1 and file2. DATA common; MERGE file1(IN=ii) file2(IN=jj); BY id year; < enter other SAS statements > ; IF ((ii=1) and (jj=1)) THEN OUTPUT; * enter this IF statement at the end ; RUN; Variables ii and jj are indicator variables (i.e., dummy) where ii=1 if the record comes from file1 and jj=1 if the record is found in file2; ii and jj are both set to 1 if the observation’s ID are both found in their respective dataset. Thus the observation is written to the merged dataset only if both files contain a record with the same unique ID variables. Overlaid Values in a MERGE One unfortunate possibility exists if you merge two datasets that contain common variables names other than the ID variables that appear on the BY statement. Variables with common names most likely contain very different data – if file2 contains updated values for the same variable in file1, the UPDATE statement is more appropriate (see chapter 6.1, part b). For example, suppose a variable named growth appears in both file1 and file2 and is not considered an identification variable. In the merged dataset SAS will write the values for growth taken from the second dataset listed on the MERGE statement and values of growth from the first dataset will be omitted. In the event you merge data files with common variable names, you should not take for granted which dataset provided values for the output data set. By default, SAS gives you no warning that this situation exists. However, you can add another system option to your autoexec.sas file telling SAS to write a message to the LOG window whenever a MERGE statement causes one or more variables to be overwritten: OPTIONS msgLevel=i; If common variable names in two datasets occur and you don’t want to rename the orginal variables in one of the datasets, you may enter the RENAME= option next to the dataset name on the MERGE statement so that data from the first dataset are not overwritten by data from the second. For example, if a variable named growth exists in both file1 and file2, you can rename growth from file2 on the MERGE statement so that the contents of this variable will be given a different name, (e.g., growth_f2) in the merged dataset. DATA new_file; MERGE file1 file2(RENAME=(growth=growth_f2)); BY id year; RUN; A DROP list can be entered next to the dataset name when you want to have values for growth come from file1 and ignore the values in file2, DATA new_file; MERGE file1 file2(DROP=growth); BY id year; RUN; One potentially serious problem with merging files, is that if common variable names are present one of the files may contain missing or out-of–date data which then overwrites current values in another file. If you want to only change the current values of a few records in the master dataset, the UPDATE statement is similar to MERGE and is probably a better choice. ADVICE: Always maintain tight control of which variables you want to take from each dataset. Apply KEEP or DROP lists, either when creating the DATA step or connected to the input data sets on the MERGE statement. The following example demonstrates what happens during a merge when the id variable named 'key' that is to link the two datasets together does not have unique values for each observation. DATA one; INPUT key x y same @@; output; CARDS; 2 2 1 3 3 3 4 4 3 5 7 7 4 9 0 3 4 8 2 4 4 2 1 5 5 0 0 2 ; PROC PRINT NOobs; key x y same 2 2 1 3 3 3 4 4 3 5 7 7 4 9 0 3 4 8 2 4 4 2 1 5 5 0 0 2 DATA two; INPUT key z1 z2 same @@; OUTPUT; CARDS; 3 13 14 19 4 19 10 28 4 18 12 15 4 12 11 11 5 22 15 16 5 42 25 18 5 52 65 14 6 20 10 18 ; PROC PRINT Noobs; RUN; key z1 z2 same 3 13 14 19 4 19 10 28 4 18 12 15 4 12 11 11 5 22 15 16 5 42 25 18 5 52 65 14 6 20 10 18 DATA ALL; MERGE one two; BY key; PROC PRINT NOobs; RUN; results in this file: key x y z1 z2 same 2 2 1 . . 3 3 3 4 13 14 19 3 5 7 13 14 7 4 9 0 19 10 28 4 8 2 18 12 15 4 2 1 12 11 11 5 0 0 22 15 16 5 0 0 42 25 18 5 0 0 52 65 14 6 . . 20 10 18 Key = 2 exists in dataset 'one' and does not appear in dataset 'two'. The values z1 and z2 from 'two' are set to missing in the first record. Key = 3 appears twice in dataset 'one' and appears once in dataset 'two'. The output has the two records from 'one' with the data z1 and z2 repeated from 'two'. Key = 4 appears three times in both datasets. The output has the three records with data from each of the three rows appearing together in the order given. Key = 5 appears once in dataset 'one' and appears three times in dataset 'two'. The output has three records for key with the data from 'one' copied for all three records from 'two'. Key = 6 is not in 'one' and appears in 'two'. The values x and y from 'one' are set to missing. One source of confusion when merging two files that contain non-unique id values is when variables with the same name exist on both datasets (in this example, the variable common to both is called 'same'). The value from the variable in "the last dataset read" is the value that goes into the new dataset. The key to understand what happens is the phrase "the last data set read" which is not the same as the phrase "the last data set listed in the MERGE statement." In the example above 'same' is coded with a single digit for dataset one and two digits for dataset two so you can more easily see which file contributed the value of 'same' to the merged dataset. You will see two records for id=3 occur in the dataset one and only one record with id=3 in dataset two. In this case the value of 'same' for the first record comes from the dataset 'two' and for the value for the second record comes from the dataset 'one'. Two Warnings when Applying a Merge with Character Data 1. Does the case of the variables across the two files match? If a variable on the BY statement varies between upper and lower case, the files won’t merge properly. One way to avoid this problem is to apply the UPCASE function to all character data on the BY statement: MERGE file1 file2; BY UPCASE(char_var); DATA file1; INPUT crop $ Param1; CARDS; Wheat 3.1 Hay 3.1 Barley .2 ; PROC SORT; BY crop; run; DATA file2; INPUT crop $ Param2; CARDS; wheat 223.1 hay 33.21 barley 1.221 ; PROC SORT; BY crop; run; DATA abc; MERGE file1 file2; by crop; run; PROC PRINT DATA=abc NOobs; run; crop Param1 Param2 Barley 0.2 . Hay 3.1 . Wheat 3.1 . barley . 1.221 hay . 33.210 wheat . 223.100 One way to avoid this problem is to enter the UPCASE function on the BY statement: DATA abc; MERGE file1 file2; BY UPCASE(crop); RUN; Produces the output: crop Param2 barley 1.221 hay 33.210 wheat 223.100 2. Are the lengths of the variables on the BY statement equal? When a DATA step merges two datasets, character variables on the BY statement are assumed to have the same length in both sources of data. When a BY variable has different lengths in one of the incoming data sets, the merge can produce bizarre results. SAS determines the length of character variables placed on the BY statement from what the data it encounters in the file listed first the MERGE statement. When the SAS dataset containing the variable with the shorter length appears first on the MERGE statement, values of that variable from the second file will be truncated. This triggers the warning message in the LOG file: WARNING: Multiple lengths were specified for the BY variable [variable name] by input data sets. This may cause unexpected results. When the SAS dataset containing the variable with the longer length is placed first, no warning message is printed since no data will be truncated. When using the MERGE statement to combine two datasets, it is left up to you, the programmer, to decide whether a merge on a particular variable listed on the BY statement makes sense. Several solutions to this situation exist which you can read: http://www2.sas.com/proceedings/sugi28/098-28.pdf Many-to-Many Merges A many-to-many merge (i.e., all records from one file are crossed with all records from a second file, can be applied with a DATA step. However, the solution is rather complex (not shown here). The few statements required to do it with PROC SQL show that it is an elegant way to proceed: DATA f1; INPUT state $ farm $ Param1; CARDS; NE 1 0.111 NE 2 0.289 IA 4 11.12 ; DATA f2; INPUT state $ crop $ Param2; CARDS; NE wheat 223.1 NE hay 33.2 NE barley 1.2 IA wheat 202.9 IA hay 31.8 IA barley 3.1 ; PROC SQL; CREATE TABLE f1_x_f2 AS SELECT * FROM f1, f2 WHERE f1.state = f2.state; RUN; QUIT; PROC PRINT DATA=f1_x_f2 NOobs; TITLE1 'many to many merge'; run; writes the cross product file for each state: state farm Param1 crop Param2 NE 1 0.111 wheat 223.1 NE 2 0.289 wheat 223.1 NE 1 0.111 hay 33.2 NE 2 0.289 hay 33.2 NE 1 0.111 barley 1.2 NE 2 0.289 barley 1.2 IA 4 11.120 wheat 202.9 IA 4 11.120 hay 31.8 IA 4 11.120 barley 3.1 The WHERE statement produces a dataset with n1 x n2 records from each state where n1 is the number of records for state=i in file 1 and n2 is the number of records for state=i in file 2. Without the WHERE statement, it will produce N1xN2 records, where the Ni are total records from each file, so it should be applied with care, even with small datasets. You should at least be aware of how many records the output file will actually produce before you submit it! If you want to be even more selective about which records from the files are output, the following example demonstrates how to do so: DATA file1; INPUT id farm crop1 $ Param1 dt1 dt2; CARDS; 1 1 corn 0.91 1 2 2 1 wheat 0.11 1 3 3 2 wheat 0.10 1 4 4 1 hay 0.28 2 3 5 2 hay 0.29 1 2 ; PROC PRINT DATA=file1 NOobs; RUN; DATA file2; INPUT crop2 $ Param2 dt3; CARDS; corn 113 1 corn 523 2 corn 231 3 corn 543 4 corn 151 5 corn 366 6 wheat 123 1 wheat 213 2 wheat 323 3 wheat 443 4 hay 31 1 hay 22 2 hay 33 3 hay 24 4 ; PROC PRINT DATA=file2 NOobs; PROC SQL; CREATE TABLE f1_x_f2b AS SELECT * FROM file1, file2 WHERE (crop1 EQ crop2) AND (dt1 LE dt3 LE dt2); PROC SORT DATA=f1_x_f2b; BY id crop1 param1 dt1 dt3; RUN; PROC PRINT DATA=f1_x_f2b NOobs; VAR id farm crop1 Param1 dt1 dt3 dt2 crop2 Param2; TITLE1 'Selective many to many merge'; run; output: Selective many to many merge id farm crop1 Param1 dt1 dt3 dt2 crop2 Param2 1 1 corn 0.91 1 1 2 corn 113 1 1 corn 0.91 1 2 2 corn 523 2 1 wheat 0.11 1 1 3 wheat 123 2 1 wheat 0.11 1 2 3 wheat 213 2 1 wheat 0.11 1 3 3 wheat 323 3 2 wheat 0.10 1 1 4 wheat 123 3 2 wheat 0.10 1 2 4 wheat 213 3 2 wheat 0.10 1 3 4 wheat 323 3 2 wheat 0.10 1 4 4 wheat 443 4 1 hay 0.28 2 2 3 hay 22 4 1 hay 0.28 2 3 3 hay 33 5 2 hay 0.29 1 1 2 hay 31 5 2 hay 0.29 1 2 2 hay 22