Section 4.5: ARRAY - Process a Vector of Data

An ARRAY is a particularly helpful feature for processing multiple values of numerical or character data in a DATA step within each observation (i.e., variables within one row). Arrays are a feature of the DATA step that enables SAS to process many variables with relatively few statements.

Definition of an Array

Page 295 of the SAS Language Reference manual, Version 6, First Edition states:

"... arrays in the SAS System are different from those in many other languages. In the SAS System, an array is simply a convenient way of temporarily identifying a group of variables. It is not a data structure, and it exists only for the duration of the DATA step."

Based on this description, the essential concept of an array is that it defines a collection of variables of the same data type so that you can work with them as a group rather than by individual variable names. Arrays are vectors of a specified list of variables, all of one data type (i.e., all numeric or all character). Two or more arrays in the same DATA step can share common variable names.

DATA steps are completely separate components of a SAS command file, and their memory exists only for their duration (that is, until the next RUN; statement or the next DATA or PROC statement). Arrays exist only for the duration of the DATA step in which they are defined. An array in one DATA step has nothing to do with an array in any other DATA or PROC step; you need to start over with new ARRAY definitions in each new DATA step. Since arrays exist only in memory while the DATA step runs, they are removed entirely at the end of the DATA step.

One-Dimensional Arrays

An array in a DATA step is defined with an ARRAY statement. You may have one or more in each DATA step. The ARRAY statement is a declarative statement, as contrasted with executable statements. Hence it is not processed sequentially the way a series of calculations is.
An explicit numerical array is defined by the keyword ARRAY followed by a name, a number enclosed within {} that indicates the number of items contained in the array (i.e., its dimension), and finally a list of the names of all numerical variables contained in the array:

    ARRAY wk{7} day1 day2 day3 day4 day5 day6 day7;

OR

    ARRAY wk{7} day1-day7;

NOTE: The ARRAY name, wk in this example, is your choice and should not be a variable name in the dataset or the name of a SAS function. The array elements can be any collection of variable names as long as they are all of the same data type (all numeric or all character).

The following layout is a conceptual representation of a one-dimensional array given the name wk:

    Array name | Variable names in the array
    -----------+------------------------------------
    wk         | day1 day2 day3 day4 day5 day6 day7
    Index      |    1    2    3    4    5    6    7

The array wk contains seven elements, the variables day1 through day7, which are given index numbers 1-7. The index numbers are an essential part of how to refer to each element of an array. To reference an individual variable from this array, enter the array name with its index number within curly brackets {} in the form wk{i}, where i is the variable's index number. For example, wk{4} represents the fourth element in the array listed above, which is day4. Both the variable name and the array element can be entered into SAS statements to represent the value in day4. For example, in a DATA step:

    time = day4;

is equivalent to

    time = wk{4};

The one-dimensional array described thus far is called explicit because the number of elements and the specific variable names within it are known and clearly defined in the ARRAY statement. Entering explicit arrays is the safest way to work with data, though with more experience shortcut notation is available. For example, you can specify an array name and the number of elements in the array without listing variable names in the ARRAY statement.
SAS will make variable names by concatenating the array name with the numbers 1, 2, 3, up to the array size. If a variable name in the series already exists in the dataset, SAS references the existing variable instead of creating a new one. The following ARRAY statement

    ARRAY abc{5};

defines a numerical array called abc and references five variable names which by default will be called abc1, abc2, abc3, abc4, and abc5.

Explicit character arrays contain the $ designation following the {}. The array will consist of the specified variables, all expected to hold character data:

    ARRAY names{10} $ nm1-nm10;

With this statement, any value assigned to the variables in the array names should not exceed 8 characters; otherwise it will be truncated to the first 8 characters. If the length of the contents of any element in the array is greater than 8 characters, specify the maximum length following the $, e.g.

    ARRAY address{4} $25 city state zip_code country;

This ARRAY statement sets all elements of the array to an explicit length of 25 characters. Values of each variable in the array are likely to have lengths less than or equal to 25 and are padded with blanks on the right. If so, apply the TRIM function to remove trailing blanks (or the STRIP function to remove both leading and trailing blanks) in statements that refer to elements of the array, e.g. TRIM(address{j}), where j is the index of the array element.

Since the ARRAY statement is declarative, it is usually placed near the beginning of the DATA step, following the SET statement or before the INPUT statement (depending on which method reads the data). Other placements are possible depending on the context of its application.

Arrays usually group variables that hold the same kind of measurement, although any valid variable names of the same data type can be listed as components of an array. For example, the numerical array defined below called wgt references three weights collected over time.
    DATA persons;
      INPUT id $ age gender $ weight1 weight2 weight3;
      CARDS;
    s01 55 m 172 163 155
    s02 45 m 170 158 152
    s03 53 f 168 167 164
    ;

    DATA a;
      SET persons;
      LENGTH name_var $7;   * weight1 has 7 characters;
      ARRAY wgt{3} weight1 weight2 weight3;
      KEEP id gender name_var value;
      DO time = 1 to 3;
        name_var = VNAME(wgt{time});   * name of the variable behind the element;
        value = wgt{time};             * value of that variable;
        OUTPUT;
      END;
    RUN;

The ARRAY statement above defines the name wgt, which is referenced later in the DATA step. The number 3 within the braces indicates this is an explicit array: that is, it will contain 3 numerical variables. Their names are then listed in the order in which the index numbers will refer to them. They do not need to be contiguous (i.e., side by side) in the dataset or listed in the same order found on the INPUT statement or in an Excel file. Within the DATA step, SAS statements containing the array name wgt and an index interpret the values of the array as if one were using the variable names:

    entering wgt{1} represents weight at time 1
    entering wgt{2} represents weight at time 2
    entering wgt{3} represents weight at time 3

The number within the braces {} is the array index for the specific time the weight was collected. When accessing array elements, you can determine the actual variable name with the VNAME function (described below). This feature makes the ARRAY statement particularly convenient with DO loops, as subsequent examples will show.
    PROC PRINT DATA=a NOobs;
      VAR id gender name_var value;
    run;

    id     gender    name_var    value
    s01    m         weight1     172
    s01    m         weight2     163
    s01    m         weight3     155
    s02    m         weight1     170
    s02    m         weight2     158
    s02    m         weight3     152
    s03    f         weight1     168
    s03    f         weight2     167
    s03    f         weight3     164

The VNAME Function

The VNAME function with an ARRAY can help you structure a dataset to explore data for linear relationships, such as Pearson correlations between all possible pairs of continuous variables, as the DATA step below shows with the sashelp.class dataset:

    DATA cls;
      SET sashelp.class;
      LENGTH xname yname $6;   * 6 is the length of the longest variable name listed below;
      KEEP name sex x xname y yname;
      * list the variable names you want to correlate on the ARRAY statement;
      ARRAY vr{3} age height weight;
      DO i = 1 to (3-1);
        DO j = i+1 to 3;
          x = vr{i};  xname = VNAME(vr{i});
          y = vr{j};  yname = VNAME(vr{j});
          OUTPUT;
        end;
      end;
    RUN;

    PROC SORT DATA=cls;
      BY sex xname yname;
    run;

    ODS OUTPUT Pearsoncorr=pscr;
    ODS LISTING close;
    PROC CORR DATA=cls;
      BY sex xname yname;
      VAR x;
      WITH y;
    run;
    ODS LISTING;

    PROC PRINT DATA=pscr;
    run;

Statements When You Cannot Enter Explicit Array References

If you want to add labels to the variables listed in the array, the LABEL statement expects the actual variable names on the left-hand side of the equal sign (not the array elements):

    LABEL age="Current Age" height="Height in inches" weight="Weight in pounds";

Although array references such as vr{1} for age are possible in assignment statements or calculations, it is generally not possible to use array references in other declarative statements such as LABEL, ATTRIB, FORMAT, etc.
In functions one should use the actual variable names and not individual array element references in a variable list; e.g., the SUM function (and others like it) cannot handle array element references as part of an "of" variable list:

    SUM( of abc(1) - abc(5) );   * does not work;
    SUM( of abc1 - abc5 );       * works fine;

The entire array can, however, be passed as SUM( of abc{*} ). See Chapter 7, Section 9 for further information.

SAS arrays are static, in the sense that their size cannot 'shrink' or 'grow' as the DATA step proceeds. To get around unknown array sizes you can:

    1. Make an array large enough for the worst-case scenario and assign only the elements of the array that you need;
    2. Enter a DATA _null_; step to capture the number of incoming variables and place that value into a macro variable to enter into the ARRAY declaration (see Chapter 9 - an introduction to macro variables).

By default, the numbering of the index of a SAS array begins with 1, and the number within the brackets indicates how many items are in the array. This makes the array convenient to use in a DO loop, as the examples later show. However, array indices can start at some other value with explicitly subscripted arrays. If you want the numbering to begin with a different index, specify the lower and upper bounds within the brackets:

    ARRAY yy{0:2} age height weight;

In this situation SAS arrays conform to the following rules:

    1) the indices must be integers (positive, negative, or 0). Fractional indices are implicitly converted to integers using the INT (integer) function without warning; that is, positive indices are converted to their floors (1.8 -> 1), and negative indices are converted to their ceilings (-2.8 -> -2).
    2) the upper bound must be greater than or equal to the lower bound.
    3) for non-temporary arrays:
       a) all items have entries as variables in the symbol table; therefore, their number (given by the DIM function) is limited to 32767 minus the number of variables already present in the symbol table;
       b) the length of a numeric item is always 8 bytes, stored internally just like any other numeric variable;
       c) both expression and memory lengths of character items combined into an array are as declared.

DO loops are frequently combined with arrays to process data. For example, a very common application of arrays is to check that values lie within a specific range of numbers (quality control) or to convert missing data to one of the 28 valid SAS missing data symbols (usually a period). Many coding systems use impossible or illogical numbers to indicate missing data (such as -9 or 99 for a Likert scale) rather than letters or a period. Do not code missing data with a 0 because it will be treated as a legitimate, though incorrect, value! Before data are processed with SAS, missing data values stored as numbers must be converted to a missing data value (a period or one of the 27 other values you define to represent missing data) so that they are not treated as actual numbers.

The following statements convert "missing data" stored as a negative number (-9) in the input file to the special missing value .m in the SAS data set for 200 variables. They also check for data items that are out of range and write them to an error data set. The DATA step assumes that valid responses from a survey are measured on a Likert scale defined to range from 1 through 5 and that missing data have been coded with a -9.
    DATA surv(DROP=qst value) errors(KEEP=id qst value);
      ARRAY srv{200} q1-q200;
      INFILE 'survey.csv' dlm=',' dsd missover;
      INPUT id $ q1-q200;
      DO qst = 1 to 200;
        * set -9 to the special missing value .m, otherwise perform a range check;
        * (quality control) and output the id and item number of incorrect values;
        IF srv{qst} = -9 THEN srv{qst} = .m;
        ELSE IF (srv{qst} LT 1 or srv{qst} GT 5) THEN DO;
          value = srv{qst};
          OUTPUT errors;
          RETURN;
        END;
      END;
      OUTPUT surv;   * save records that pass the QC check;
    RUN;

Temporary Arrays

Temporary arrays contain numerical or character values unique to a DATA step which you do not want or need to read in from an external file. They are declared much like an ordinary ARRAY statement, with the added designation _temporary_ following the array name and dimension declaration, followed by the actual values of the array elements placed between (). The values may or may not be separated by commas. An example of a temporary numerical array is:

    ARRAY id{5} _temporary_ (11 22 33 44 55);

A temporary character array also contains the $ (to indicate character data) followed by a number to indicate the maximum length of the character data (not necessary when 8 characters or fewer, yet helpful to include) and is structured with the character values placed in quotes:

    ARRAY sym{5} $1 _temporary_ ('a' 'b' 'c' 'd' 'e');

How can a temporary array be helpful? One application is to assist in calculations where the contents of the array do not need to be saved as variables. To preserve the result of the calculation, you would need to assign it to a variable name. Performance can be greatly improved by using temporary data elements. A _temporary_ array is a stretch of consecutive memory. An array of 8-byte numerical values takes approximately 8*N bytes of memory, where N is the number of elements in the array. There is also a small one-per-array chunk of memory for array management. Some examples will demonstrate possible applications.
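As a small sketch of the calculation use case described above, the step below holds an answer key in a temporary array and saves only the resulting score. The dataset name scored, the variables r1-r5 and score, the key values, and the two data lines are all hypothetical and chosen only for illustration.

```sas
* Hypothetical quiz data: five items per student scored against an answer key;
DATA scored;
   * answer key held only in memory and never written to the output dataset;
   ARRAY key{5} $1 _temporary_ ('b' 'd' 'a' 'c' 'b');
   * student responses, created as character variables of length 1;
   ARRAY resp{5} $1 r1-r5;
   INPUT id $ r1-r5 $;
   score = 0;
   DO i = 1 TO 5;
      * count one point for each response that matches the key;
      IF resp{i} = key{i} THEN score = score + 1;
   END;
   DROP i;
   CARDS;
s01 b d a c b
s02 b a a c d
;
RUN;
```

With these made-up records, s01 matches the key on all five items and s02 on three. The key never appears in the output dataset; only id, r1-r5, and score are saved.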
Usually ID values contain more than one letter, or they all begin with the same letter or number, such as 301, 302, 303, etc. These necessary and practical features make the actual id values of little value as plotting symbols in PROC PLOT. One application of a temporary ARRAY is to assign plotting symbols (to be subsequently printed on the output with PROC PLOT) based on the value of an ID variable:

    DATA tch;
      ARRAY id{5} _temporary_ (113 213 263 311 345);
      ARRAY symb{5} $1 _temporary_ ('a' 'b' 'c' 'd' 'e');
      LABEL grd_lvl='Grade Level' n_stdnts='No. of Students';
      INPUT teacher grd_lvl n_stdnts;
      DO i = 1 to 5;
        IF id{i}=teacher THEN sym = symb{i};   * look up the symbol for this teacher;
      END;
      CARDS;
    113 1 34
    263 2 23
    213 2 28
    311 3 31
    345 3 29
    ;

    PROC PLOT DATA=tch;
      PLOT n_stdnts*grd_lvl=sym / haxis = 1 to 3 by 1 Vaxis = 20 to 35 by 5;
      options ps=30;
    RUN;
    quit;

Another application of a temporary array is similar to the type of manipulation performed by PROC TRANSPOSE (see Chapter 6). The objective is to place data from a column of a SAS dataset (lined up vertically) into elements of an array (horizontal). A procedure could very well be more efficient than coding the problem in a DATA step. However, two situations where you may want to transpose values of a variable into a temporary array through a DATA step are:

    * making complicated calculations not available in a procedure through a function you wrote
    * selecting random samples with replacement from a dataset to submit to the bootstrap.
To write a dataset for either situation, assume you have a SAS dataset with n=15 observations of variables called x and y:

    DATA raw;
      SET my_dat(keep=x y) END=lastobs;
      ARRAY _x_{15} _temporary_;   * my_dat is known to have 15 observations;
      ARRAY _y_{15} _temporary_;
      _x_{_n_} = x;
      _y_{_n_} = y;
      seed=5435865;   /* set value of seed between 1 & (2^31 - 1) */
      * write 10 random samples of x and y pairs, each of size n=15, into one dataset;
      IF lastobs THEN DO;
        DO sample=1 to 10;
          DO i=1 to _n_;
            point = CEIL(RANUNI(seed)*_n_);   * random index between 1 and n;
            x = _x_{point};
            y = _y_{point};
            OUTPUT;
          END;
        END;
      END;
      KEEP x y sample;
    RUN;

The advantage of temporary arrays is that they do not burden the SAS compiler with the need to set up the array elements as SAS variables in the symbol table. The first DATA _null_; step shown below makes 100000 'temporary' elements, whereas the second needs to make 100000 'actual' variable names in the symbol table.

    data _null_ ;
      array a [100000] _temporary_ ;
    run ;

    NOTE: DATA statement used:
          real time 0.00 seconds
          cpu time  0.00 seconds

    data _null_ ;
      array a [100000] aa1-aa100000 ;
    run ;

    NOTE: DATA statement used:
          real time 1.48 seconds
          cpu time  1.38 seconds

Transposing 10000 records with PROC TRANSPOSE versus a DATA step gave dramatically different timings. Note that, as written, each DATA step below loops over all 10000 array elements for every one of the 10000 observations it reads.

    data two;
      drop i;
      DO i=1 to 10000;
        a=i*3;
        OUTPUT;
      END;

    PROC TRANSPOSE DATA=two out=NEW PREFIX=t_;
      VAR a;
    RUN;

    NOTE: There were 10000 observations read from the data set WORK.TWO.
    NOTE: The data set WORK.NEW has 1 observations and 10001 variables.
    NOTE: PROCEDURE TRANSPOSE used:
          real time 0.07 seconds
          cpu time  0.07 seconds

    Data _null_ ;
      SET two;
      array inc( 10000 ) t_1-t_10000 ;
      do i = 1 to 10000 ;
        inc[i]=a;
      end ;
    run;

    NOTE: There were 10000 observations read from the data set WORK.TWO.
    NOTE: DATA statement used:
          real time 21.54 seconds
          cpu time  21.53 seconds

    Data _null_ ;
      SET two;
      array inc( 10000 ) _temporary_ ;
      do i = 1 to 10000 ;
        inc[i]=a;
      end ;
    run;

    NOTE: There were 10000 observations read from the data set WORK.TWO.
    NOTE: DATA statement used:
          real time 21.18 seconds
          cpu time  21.18 seconds

Implicit Arrays

Another helpful feature of arrays is that they can be made 'implicit', that is, neither the number of items in the array nor the particular variables to be included needs to be known ahead of time: SAS can take care of the details. Instead of entering a number in the brackets, place a single * in them. If you have variables named with the val_1-val_n notation, use val_: to reference all variables whose names begin with val_. The DIM() function determines how many elements the array contains. As shown below, an 'implicit' array can be used in the same manner as the 'explicit' array described earlier.

    * Set val_1-val_n to missing for each record, where the size of the array, n, is unknown;
    DATA one;
      SET a;
      ARRAY numvar(*) val_: ;
      DO i=1 to dim(numvar);
        numvar(i)=.;
      END;
    RUN;

    * Set all numerical values to missing for a given observation;
    DATA two;
      SET a;   * input dataset assumed to be the same dataset a as in the previous step;
      ARRAY nmbr(*) _numeric_ ;   * _numeric_ automatically inserts all numerical items in the dataset;
      DO i=1 to dim(nmbr);
        nmbr(i)=.;
      END;
    RUN;

    * another shortcut notation for all numeric data with the DO OVER command;
    DATA test;
      SET test;
      ARRAY numbrs _numeric_;
      DO OVER numbrs;
        numbrs=round(numbrs,.00001);
      END;
    run;

Note that for large computing projects, use of an implicit array can cause an immense slowdown. Perhaps this is why SAS discourages their use. It may be that SAS is not efficient at figuring out which variables are to be included in each array.
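One detail worth illustrating is that a name list such as _numeric_ picks up only the numeric variables already defined at the point where the ARRAY statement appears. The following is a minimal sketch using sashelp.class; the output dataset miss_cnt and the counter n_miss are hypothetical names introduced here.

```sas
* Count missing numeric values per observation, a sketch with assumed names;
DATA miss_cnt;
   SET sashelp.class;
   * the array holds the numeric variables read by SET: age, height, weight;
   ARRAY nums{*} _numeric_;
   * n_miss and i are created after the ARRAY statement so they are not array elements;
   n_miss = 0;
   DO i = 1 TO DIM(nums);
      IF MISSING(nums{i}) THEN n_miss = n_miss + 1;
   END;
   DROP i;
RUN;
```

Because DIM(nums) is resolved by SAS, the same step should work unchanged if the input dataset gains or loses numeric variables.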
Two Dimensional or Doubly Subscripted Arrays

Consider the contents of the following two-dimensional array (matrix):

    abc = | 11 12 |
          | 21 22 |
          | 31 32 |

Enter the values row by row (the columns across each row), as you would for a single-dimension array:

    DATA exmpl(keep=i j order) two(keep=i j1 j2);
      ARRAY abc{3,2} _temporary_ (11 12 21 22 31 32);
      DO i = 1 to 3;
        j=1; order=abc{i,j}; j1=order; OUTPUT exmpl;
        j=2; order=abc{i,j}; j2=order; OUTPUT exmpl;
        OUTPUT two;
      END;
    RUN;

    PROC PRINT data=exmpl NOobs;
      VAR i j order;
    run;

    i    j    order
    1    1     11
    1    2     12
    2    1     21
    2    2     22
    3    1     31
    3    2     32

    proc print data=two NOobs;
    run;

    i    j1    j2
    1    11    12
    2    21    22
    3    31    32

    DATA two_dim;
      ARRAY mt{2,3} sc1-sc6;
      ss=0;
      DO i = 1 to 2;     * row;
        DO j = 1 to 3;   * column;
          ss = ss+1;
          mt{i,j} = ss;
        end;
      end;
      output;
    run;

    PROC PRINT DATA=two_dim NOobs;
    run;

    sc1    sc2    sc3    sc4    sc5    sc6
     1      2      3      4      5      6

Since i is the row index and j is the column index of the 2x3 matrix mt, the single-dimension list sc1-sc6 is treated in two dimensions by first filling row 1 before moving on to the second:

    mt = | sc1 sc2 sc3 | = | 1 2 3 |
         | sc4 sc5 sc6 |   | 4 5 6 |

Testing means with a saved covariance matrix

Situations occasionally arise where you need to compute significance tests of a vector of means (Estimate column):

    Group    Estimate
    1        120.5
    2         82.2
    3         62.5

with a variance/covariance matrix saved in another dataset:

    row    Cov1     Cov2     Cov3
    1      370.3     26.2     15.0
    2       26.2    144.0     22.6
    3       15.0     22.6    117.7

For example, the test of equality for the Estimates from groups 1 and 2 is:

    T12 = (120.5 - 82.2) / SQRT(370.3 + 144.0 - 2*26.2) = 38.3 / 21.492 = 1.78

where the standard error of a difference is

    Stderr_diff = SQRT(VAR1 + VAR2 - 2*Cov_12)

To compute a p-value, you will also need the degrees of freedom for the error term placed in a macro variable:

    %LET df = 25;

One- and two-dimensional arrays can assist in this calculation when you have many possible pairs and a rather large covariance matrix.
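Before turning to the array-based version, the single comparison of groups 1 and 2 can be checked by hand. The step below is only a sketch that plugs the numbers quoted above into the standard error formula; the variable names diff, se, t, and pv_t are chosen here for illustration, and df is set to 25 as in the text.

```sas
/* Hand check of the group 1 versus group 2 comparison using the values in the text */
%LET df = 25;                                /* error degrees of freedom from the text */
DATA _null_;
   diff = 120.5 - 82.2;                      /* difference of the two estimates */
   se   = SQRT(370.3 + 144.0 - 2*26.2);      /* SQRT(VAR1 + VAR2 - 2*Cov_12) */
   t    = diff / se;                         /* test statistic T12 */
   pv_t = 2*(1 - PROBT(ABS(t), &df.));       /* two-sided p-value */
   PUT diff= se= t= pv_t=;                   /* write the results to the log */
RUN;
```

The PUT statement writes the four values to the log, where diff, se, and t can be compared with the 38.3, 21.492, and 1.78 shown above.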
First, transpose the 3 rows of the estimate vector (placed in a dataset called est) into 3 values in 1 row:

    PROC TRANSPOSE DATA=est OUT=tr_est(drop= _name_) prefix=grp;
      VAR estimate;
      ID group;
    run;

    Obs    grp1     grp2    grp3
    1      120.5    82.2    62.5

If the covariance matrix is stored in the dataset ascy, extract the number of rows and place it in a macro variable. The covariance matrix consists of pp = n x n entries (a square matrix with n rows and n columns).

    DATA _null_;
      SET ascy NOBS=n;
      pp=n*n;
      CALL symput("dm", LEFT(n));
      CALL symput("pp", LEFT(pp));
      STOP;
    RUN;

    * dm= 3, the number of estimates, and pp= 9, the number of elements in the square covariance matrix;
    %PUT &dm &pp ;

The %PUT statement prints 3 and 9 in the log window.

    * this next DATA step converts the covariance matrix to have all its elements in one row;
    DATA tr_ascy;
      set ascy end=eof;
      ARRAY cc{&dm.} cov1-cov&dm. ;   * the covariance estimates are named cov1 cov2 cov3;
      ARRAY ccc{&pp.} a1-a&pp. ;      * a as a vector name is chosen only for convenience;
      KEEP a1-a&pp. ;
      RETAIN a1-a&pp. ;
      DO i = 1 to &dm. ;
        j = ((_n_-1)*&dm. + i);       * position of row _n_, column i in the single row;
        ccc{ j } = cc{i};
      END;
      IF eof THEN OUTPUT;             * OUTPUT only on the final record of the dataset;
    RUN;

    PROC PRINT DATA=tr_ascy NOobs;
      var a1-a9;
    RUN;

    a1       a2      a3    a4      a5     a6      a7    a8      a9
    370.3    26.2    15    26.2    144    22.6    15    22.6    117.7

Both single-line vectors can be read into a DATA step to perform the necessary calculations:

    DATA cmp_ests;
      SET tr_est;
      SET tr_ascy;
      LENGTH vnm1 vnm2 $12 ast $2;
      ARRAY s{&dm.} grp: ;               * p=dm parameter estimates;
      ARRAY cv{&dm.,&dm.} a1-a&pp. ;     * pxp covariance matrix;
      KEEP vnm1 vnm2 est1 est2 diff stderr_dif t pv_t ast;
      DO i = 1 to (&dm. -1);
        DO j = i+1 to &dm. ;
          vnm1=vname(s{i});
          vnm2=vname(s{j});
          est1 = s{i};
          est2 = s{j};
          diff = est1 - est2;
          stderr_dif = SQRT( cv{i,i} + cv{j,j} - 2*cv{i,j} );
          t = diff / stderr_dif;
          pv_t = 2*(1-probt(abs(t),&df.));
          ast=' ';
          IF pv_t < .05 then ast='*';
          IF pv_t < .01 then ast='**';
          OUTPUT;
        END;
      END;
    run;

    proc print data=cmp_ests NOobs n;
      VAR vnm1 vnm2 est1 est2 diff stderr_dif t pv_t ast;
      FORMAT t 5.2 pv_t 6.4;
    run;

    vnm1    vnm2    est1     est2    diff    stderr_dif    t       pv_t      ast
    grp1    grp2    120.5    82.2    38.3    21.4919       1.78    0.0763
    grp1    grp3    120.5    62.5    58.0    21.4009       2.71    0.0073    **
    grp2    grp3     82.2    62.5    19.7    14.7139       1.34    0.1821

Comment

Arrays are not always the answer, even if repetitive code is involved. Sometimes, when every bit of performance is at stake, it even makes sense to write repetitive code dynamically and load the compiler somewhat more heavily rather than the run-time process. A classic example is that

    a1 = 1 ; a2 = 2 ; ... a100 = 100 ;

executes far faster (roughly 40 times faster in real time in the log below) than

    Do j = 1 to 100 ; a(j) = j ; End ;

To check, just run

    %macro a (n) ;
      %do i = 1 %to &n ;
        a&i = &i ;
      %end ;
    %mend a ;

    data _null_ ;
      do i = 1 to 1e6 ;
        %a (100) ;
      end ;
    run ;

    NOTE: DATA statement used:
          real time 0.51 seconds
          cpu time  0.20 seconds

    data _null_ ;
      array a (100) ;
      do i = 1 to 1e6 ;
        do j = 1 to 100 ;
          a(j) = j ;
        end ;
      end ;
    run ;

    NOTE: DATA statement used:
          real time 22.07 seconds
          cpu time  21.53 seconds

Computing an index to refer to array items has a cost in both real and cpu time: at times the delay will be irrelevant; under other circumstances it could be very time consuming. The latter can be especially exacerbated if there is index-dependent logic within the loop. Use common sense to distinguish between the two.