Section 6.5: TRANSPOSE A SAS DATASET Topics in this Section a. Multivariate and Univariate Data Structures b. The DATA Step Approach to Convert from Multivariate to Univariate Structure c. Modify a Dataset Structure with PROC TRANSPOSE d. Transposing a Dataset from Univariate to Multivariate Structure It is important to consider the design of your data structure as you consider the objectives of data analysis. Some statistical procedures require data to be entered with all the data for one observation across the columns of each row known as multivariate format (e.g., PROCs GLM and CORR); other procedures require the data to be placed in one column with an associated index variable known as univariate format (PROCs PLOT, UNIVARIATE, and MIXED). a. Multivariate and Univariate Data Structures Suppose you have data collected for each observation which are arranged across two or more columns (known as data in multivariate structure or a wide dataset). For data analysis with PROC MIXED or PROC GENMOD (or some other reason), it needs to be placed in a single column (known as univariate structure or a narrow dataset). Reshaping a file in this manner means changing from wide to narrow or transposing from multivariate to univariate. Either terminology implies that variables currently placed in the columns are to be rotated into the rows of the dataset which are identified by an index value. The opposite case of transposing observations (placed in rows) into variables (columns) also exists and will be dealt with later in this article (see section d). A simple example demonstrates the difference between the two data structures. If you collect measurements of a similar type from experimental units over time, you may easily store them in the rows of an Excel spreadsheet, one for each subject identified by a column name. For example suppose you collect three consecutive values of pressure and their associated temperatures over time and store them in a multivariate format: ID Date Pr1 Pr2 Pr3 T1 T2 T3 A1 11/01/2004 223 204 195 30 28 27 B7 11/01/2004 211 192 183 31 28 26 The numbers 1, 2, and 3 attached to Pr and T as variable names indicate the order in which the two different types of variables were collected over time (e.g., each hour) for every subject (row). This is the data structure required for multivariate or repeated measures analysis with the GLM procedures (for both SAS and SPSS). However, some procedures require the same data must exist in a univariate format so the data needs to have the structure indicated below: ID Date Time Pressure Temp A1 11/01/2004 1 223 30 A1 11/01/2004 2 204 28 A1 11/01/2004 3 195 27 B7 11/01/2004 1 211 31 B7 11/01/2004 2 192 28 B7 11/01/2004 3 183 26 Time is a new variable that serves as an index for both pressure and temperature. The values of the variables Pressure and Temp are now placed two columns indexed by Time. To manipulate data in this manner you can transpose the data structure as you read data from an existing dataset or external file with the DATA step. In some cases PROC TRANSPOSE with appropriate options and statements will do it much more efficiently (i.e., require fewer statements). The following sections demonstrate the basic techniques of transposing data with these methods. b. The DATA Step Approach to Convert from Multivariate to Univariate Structure To transpose data with the DATA step you need to first read in the multivariate data (using either an existing SAS dataset or from an external file proceeds in the same manner) into a SAS dataset DATA multdat; FORMAT date mmddyy10.; INPUT id $ date mmddyy10. pr1 pr2 pr3 t1 t2 t3; CARDS; A1 11/01/2004 223 204 195 30 28 27 B7 11/01/2004 211 192 183 31 28 26 ; * PROC PRINT allows you to check if the data were read correctly; PROC PRINT DATA=multdat(obs=10) NOobs; VAR id date pr1 t1 pr2 t2 pr3 t3 ; TITLE1 'The data are in multivariate format'; RUN; If you have only a few variables to transpose the simplest method is to read the entire line with the INPUT statement, make new variables, and then enter separate OUTPUT statements for each new record. With this approach you should also enter the KEEP (or an equivalent DROP) statement showing the variable names you want to place into the new dataset along with all ID variables (otherwise the program will also write out the values of pr1 pr2 pr3 t1 t2 t3, a complete set of the same numbers for each record identified by id). DATA univdat; SET multdat; KEEP id date time pressure temp; time=1; pressure=pr1; temp=t1; OUTPUT; time=2; pressure=pr2; temp=t2; OUTPUT; time=3; pressure=pr3; temp=t3; OUTPUT; PROC PRINT DATA=univdat NOobs n; VAR id date hour p t; TITLE1 'The data now appear in univariate format'; RUN; Assuming you always have three measurements for each person, another way to convert one row of multivariate data to a univariate structure is to read in the data in its multivariate format with the trailing ampersand @ at the end of several INPUT statements to hold the pointer at a given location. Then write the data immediately with an OUTPUT statement to a data set in univariate format: DATA univdat; FORMAT date mmddyy10. ; INPUT id $ date mmddyy10. @; time=1; INPUT pressure temp @; OUTPUT; time=2; INPUT pressure temp @; OUTPUT; time=3; INPUT pressure temp; OUTPUT; CARDS; A1 11/01/2004 223 204 195 30 28 27 B7 11/01/2004 211 192 183 31 28 26 ; PROC PRINT DATA=univdat NOobs n; VAR id date hour pressure temp; TITLE1 'The data are transposed to univariate format'; RUN; The preceeding examples can be cumbersome if the dataset has many variables to transpose. It may be more efficient to write multiple INPUT and OUTPUT statements as part of a DO loop as shown below: DATA univdat; INPUT id $ date mmddyy10. @; DO time=1 TO 3; INPUT pressure temp @; * read each pair of numbers and hold the pointer; OUTPUT; * be sure to add the OUTPUT statement within the loop; END; CARDS; A1 11/01/2004 223 204 195 30 28 27 B7 11/01/2004 211 192 183 31 28 26 ; PROC PRINT DATA=univdat NOobs n; VAR id date time pressure temp; TITLE1 'DATA in is now in univariate format'; RUN; However, with SAS there are often several different approaches to perform the same process. Here's yet another method that implements ARRAYS and DO loops which will likely be a much more efficient technique to transpose data, especially from SAS datasets with many variables in multivariate format: DATA univdat; SET multdat; ARRAY pp{3} pr1 pr2 pr3; ARRAY tt{3} t1 t2 t3; KEEP id date time pressure temp; DO time = 1 to 3; temp=tt{i}; pressure=pp{i}; OUTPUT; * be sure to add the OUTPUT statement within the DO loop; END; RUN; PROC PRINT DATA=univdat NOobs; VAR id date time pressure temp; TITLE1 'DATA are now placed in univariate format'; RUN; c. Modify a Dataset Structure with PROC TRANSPOSE The previous examples demonstrate that a DATA step is often the simplest and most convenient method to transpose a dataset from a multivariate to univariate structure. Another method to transpose a dataset which is more efficient in many applications and that involves transposing many variables into one variable with an index is PROC TRANSPOSE. This procedure is designed to transpose data from two or more columns into rows (or rows to columns) such as transposing the values in a matrix. This can be done for the entire data set, or for subgroups of the data identified with BY variables. The PROC TRANSPOSE statement uses the following options: DATA= defines the input dataset OUT= defines the output dataset to contain the transposed data NAME= gives a name to a new variable that contains the name(s) of the transposed variables (the variables listed on the VAR statement). If you do not enter this option, SAS automatically includes a variable called _NAME_. You can drop it with (DROP=_NAME_) placed immediately following OUT= or enter the RENAME=(_name_ = new_name) option. PREFIX= provides the initial characters for the names of the new variables that will be appended with the value of the variable listed on the ID statement, such as PREFIX=p will list new variables as p1, p2, p3, ... Often, the _ is a convenient choice for the first character of the transposed variable. The default name in the new dataset set for the transposed value is COL indexed by the number of levels of each BY value. To change this name, enter the ID statement which takes values the specified ID variable and appends them to the prefix name listed on PREFIX=. If the ID variable is numeric, you'll see why the underscore was a good choice for the prefix. Without the ID statement SAS drops records that have missing data on this variable and indicates this in the LOG file. An example to get around this problem will be given. For the option PREFIX=p with the statement ID pstat; the new variables will become p1, p2 and p3 pstat contains the values 1, 2, or 3. If the ID statement were not entered, missing data in the variable entered on the VAR statement will cause problematic data transformation errors as a later example will show. Consider the following example: DATA rounds; x = 22/7; * a non-repeating decimal; idvar='a'; y1 = round(x,1); y_1 = round(x,0.1); y_01 = round(x,0.01); OUTPUT; x = 33/7; * a non-repeating decimal; idvar='b'; y1 = round(x,1); y_1 = round(x,0.1); y_01 = round(x,0.01); OUTPUT; RUN; PROC PRINT NOobs; RUN; idvar x y1 y_1 y_01 a 3.14286 3 3.1 3.14 b 4.71429 5 4.7 4.71 PROC TRANSPOSE DATA=rounds OUT=trnsp NAME=orig_vars PREFIX=y_; VAR x y1 y_1 y_01; ID idvar; RUN; PROC PRINT DATA=trnsp NOobs; RUN; orig_vars y_a y_b x 3.14286 4.71429 y1 3.00000 5.00000 y_1 3.10000 4.70000 y_01 3.14000 4.71000 This example shows how to transpose a data set with one id variable and four variables with two records into one dataset with two variables indexed by the id variable (orig_vars) with four records. If you have more than one variable to transpose with multiple id variables for each record to reshape the dataset becomes more tedious since you will need to run PROC TRANSPOSE multiple times, once for each specific grouping of variables, and then merge the results into a new dataset. With the dataset multdat described earlier, a two applications of PROC TRANSPOSE are necessary. (The process can be generalized using macro variables, not shown here). PROC SORT DATA=multdat; BY id date; PROC TRANSPOSE DATA=multdat OUT=p(DROP=_name_ rename=(col1=p)); BY id date; VAR pr1 pr2 pr3; RUN; PROC TRANSPOSE DATA=multdat OUT=t(DROP=_name_ rename=(col1=t)) ; BY id date; VAR t1 t2 t3; RUN; DATA univdat; MERGE p t; BY id date; RUN; PROC PRINT DATA=univdat NOobs; VAR date p t; TITLE1 'DATA are now placed in univariate format'; RUN; As shown in the section on merging datasets (6.2), it is wise to include a BY statement when merging two datasets together and to have unique id variables in each file so that records from both will match exactly. This is especially important if some of the repeated measures data have missing values. d. Transposing a Dataset from Univariate to Multivariate Structure The transpose of a univariate dataset into a multivariate data structure is generally not encountered as often as the opposite direction. DATA univ; do id=1 to 3; do a1=1 to 2; do a2 = 1 to 2; y=rannor(1234579); output; end; end; end; run; proc print; run; To convert to multivariate structure, one way is to merge the file into itself multiple times as indicated below with WHERE= and RENAME= options. The response variable is given subscripts based on values of the two between-subjects variable as shown below: /* Convert data structure to multivariate */ DATA test_glm; MERGE test(WHERE=(a1=1 and a2=1) RENAME=(y=y11)) test(WHERE=(a1=1 and a2=2) RENAME=(y=y12)) test(WHERE=(a1=2 and a2=1) RENAME=(y=y21)) test(WHERE=(a1=2 and a2=2) RENAME=(y=y22)); BY id; DROP a1 a2; RUN; proc print; run; Obs id y11 y12 y21 y22 1 1 0.24840 0.62477 -1.77141 -0.79994 2 2 -0.45636 -1.12911 -0.42210 -0.97007 3 3 -0.45795 -1.23050 -1.10047 -0.49320 In this next example, a text file is written that places a ~ between all codes, assuming the variable identifying 'repeated group' is account. DATA whatever; INPUT account code; CARDS; 1111111 372.4 1111111 372.42 1111111 372.6 2222222 402.1 3333333 372.51 3333333 372.52 3333333 372.53 3333333 372.54 ; DATA _null_; FILE 'c:\sas\florp.txt'; DO UNTIL (last.account); SET whatever; BY account; IF first.account THEN PUT account '|' @; PUT code '~' @; END; PUT; run; No allocation is made for unequal numbers of codes within each level of account. According to Richard A. DeVenezia, this approach is technically known as "florping through the data, flumping it out sideways." To have more control over how you place data sideways, PROC TRANSPOSE offers solutions that apply the BY and ID statements. The PROC step assumes you have a variable that gives the variable names that is character data (it may be numeric if you enter the prefix= option on the PROC statement). It must have distinct values of the variable 'test' for each level of the id variable (i.e., variables with the same name or duplicate values of test within levels of ID are not allowed). Suppose you have test results for students identified by their id and the test they took (only once!) listed in univariate format as follows: DATA data1; INPUT school id test $ result; CARDS; 5 1 S1 92 5 1 F0 93 5 1 S2 91 7 2 S1 94 7 2 F0 95 7 2 S2 91 ; * Sort the data by the observations' id and test variables; PROC SORT DATA=data1; BY school id test; PROC PRINT DATA=data1 NOobs; VAR school id test result; RUN; SCHOOL ID TEST RESULT 5 1 F0 93 5 1 S1 92 5 1 S2 91 7 2 F0 95 7 2 S1 94 7 2 S2 91 PROC TRANSPOSE DATA=data1 OUT=data2(DROP=_name_); BY id; ID test; VAR result; COPY school; * will produce a record for each school to the transposed records; RUN; PROC PRINT DATA=data2 NOobs; RUN; id school F0 S1 S2 1 5 93 92 91 1 5 . . . 1 5 . . . 2 7 95 94 91 2 7 . . . 2 7 . . . If COPY statement omitted, then output file looks like: id F0 S1 S2 1 93 92 91 2 95 94 91 The ID statement contains the new variable names. Since they begin with a letter, they are valid SAS variable names. If the names of the tests were entered as numbers, SAS would write names that begin with the underscore, followed by the number, such as _1, _2, _3, etc. As another example of how to apply the ID statement, PROC TRANSPOSE reorganizes quarterly records stored in univariate format into multivariate. Initially, there are up to 4 observations per year, one each for quarter, with missing data a common feature. The transposed dataset is to have one observation per year and four variables, one for each quarter. With incomplete univariate data, that is the sequence of four quarterly values within a year is not necessarily available, the presence of an ID statement will place missing values for these variables on the transposed dataset. DATA univ; INPUT id type $ qtr; CARDS; 100 a 2 100 b 4 200 a 1 200 a 2 200 a 3 200 c 4 300 a 4 300 d 1 300 c 3 ; PROC TRANSPOSE DATA=univ OUT=mlt(DROP=_name_) PREFIX=qtr; BY id; ID qtr; VAR type; RUN; PROC PRINT DATA=mlt NOobs; VAR id qtr1-qtr4; RUN; id qtr1 qtr2 qtr3 qtr4 100 a b 200 a a a c 300 d c a The PREFIX= option specifies the first three letters of a new variable name (since dataset names cannot begin with numbers). Note what would happen with the same quarterly dataset if the ID statement were not present. In fact, if all records had at least one value missing, then qtr4 would not exist on the transposed dataset. PROC TRANSPOSE DATA=univ OUT=mlt(DROP=_name_) PREFIX=qtr; BY id; VAR type; RUN; PROC PRINT NOobs; run; id qtr1 qtr2 qtr3 qtr4 100 a b 200 a a a c 300 a d c Proc Transpose on Missing ID Values Is there a way for PROC TRANSPOSE to work with missing ID values? Assume you have a dataset with a numerical variable that you want to act as the ID variable. Some of these values are set to different missing values codes (.A, .M, .). When this file is transposed with this variable as an ID, the value of any observation that contains a missing code such as .A, .M, and . is lost. SAS will even write this in the LOG file. ?????SAS can't handle this because it would then have Columns without names. * Note that fld needs to have unique values within each level of i ; DATA a; INPUT i qrt val; MISSING m a; CARDS; 1 m 5 1 3 6 1 4 16 2 1 7 2 m 6 3 2 8 4 a 9 4 . 19 5 1 1 5 a 5 5 3 7 ; /* Make a view that PUTs the numeric variables into character variables. The PUT honors the distinct missing values. */ DATA bb ( drop = qrt) / view = bb; SET a; qrtx = put(qrt,12.-l); run; proc print data=b; run; Obs i val qrtx 1 1 5 M 2 1 6 3 3 1 16 4 4 2 7 1 5 2 6 M 6 3 8 2 7 4 9 A 8 4 19 . 9 5 1 1 10 5 5 A 11 5 7 3 PROC TRANSPOSE DATA = bb OUT = cc(DROP=_NAME_) prefix = qrt_; BY i; VAR val; ID qrtx; run; proc print data=cc NOobs; VAR i qrt_1 qrt_2 qrt_3 qrt_4 qrt_A qrt_M qrt__ ; run; i qrt_1 qrt_2 qrt_3 qrt_4 qrt_A qrt_M qrt__ 1 . . 6 16 . 5 . 2 7 . . . . 6 . 3 . 8 . . . . . 4 . . . . 9 . 19 5 1 . 7 . 5 . .