VIII. PROC TABULATE - Data Summaries in Tables I. Summarize Data in Table Format II. Percentages of Counts III. Percentages of Sums IV. Embellish a Table V. Tabulate and Missing Data VI. Writing tables with empty rows and/or columns VII. What is the difference between the WEIGHT and FREQ statements? Before running PROC TABULATE on the PC, be sure to invoke the following option: OPTIONS formchar='|----|+|---' ; This choice enters pure text for the table borders which are then easily copied from the output window into WORD and then edited, if needed. I. Summarize Data in Table Format PROC FREQ is one way SAS will produce summaries of discrete data in terms of counts. PROC FREQ DATA=sashelp.class; TABLE age / nocum nopercent; RUN; Age Frequency 11 2 12 5 13 3 14 4 15 4 16 1 PROC TABULATE can also produce the same results in the form of a table. PROC TABULATE DATA=sashelp.class NOseps; CLASS age; TABLE age, n*f=3.0 / rts=6; RUN; ---------- | | N | |----+---| |Age | | |11 | 2| |12 | 5| |13 | 3| |14 | 4| |15 | 4| |16 | 1| ---------- While PROC FREQ is helpful for understanding the distribution of 1 or 2 discrete variables in tabular form and will print many summary statistics for two-way tables, PROC TABULATE has features that help you summarize the contents of a dataset according to chosen classification factors with more control over the presentation format. It can also compute summary statistics (e.g., counts, mean, min, max, and variance) for continuous data for each combination of the classification factor levels. With a unique combination of id variables it can also place the actual data values into tabular form. Here is another basic example: OPTIONS NOcenter; TITLE1 'Tables to show data and summary statistics'; DATA one; DROP d1 d2; INPUT row column d1 d2 @@; repl=1; y=d1; OUTPUT; repl=2; y=d2; OUTPUT; CARDS; 1 1 3.28 3.09 1 2 3.52 3.48 1 3 2.88 2.80 2 1 2.46 2.44 2 2 1.87 1.92 2 3 2.19 2.19 3 1 2.77 2.66 3 2 3.74 3.44 3 3 2.55 2.55 4 1 3.78 3.87 4 2 4.07 4.12 4 3 3.31 3.31 ; PROC TABULATE DATA=one NOseps; CLASS row column repl; VAR y; TABLE row*repl, column*y=' '*SUM=' '*f=7.2 / BOX='Actual data' rts=18; RUN; ------------------------------------------ |Actual data | column | | |-----------------------| | | 1 | 2 | 3 | |----------------+-------+-------+-------| |row repl | | | | |1 1 | 3.28| 3.52| 2.88| | 2 | 3.09| 3.48| 2.80| |2 1 | 2.46| 1.87| 2.19| | 2 | 2.44| 1.92| 2.19| |3 1 | 2.77| 3.74| 2.55| | 2 | 2.66| 3.44| 2.55| |4 1 | 3.78| 4.07| 3.31| | 2 | 3.87| 4.12| 3.31| ------------------------------------------ The role of each statement can be inferred from these examples; here is a brief description of the most important statements: CLASS: indicates the categorical data that define classification factors for the rows and columns VAR: lists one or more numerical variables on which summary statistics are to be computed TABLE: define the table structure and its contents The most basic statements in the procedure include: PROC TABULATE DATA= NOseps; CLASS ; VAR y; TABLE , , *** / < options >; The NOseps option on the PROC TABULATE statement prints tables with only the borders and without the horizontal lines between each row of the table data. Variables listed on the CLASS statement include: * = tabl1*tabl2.. make a separate table for combinations of level of these variables * = cvar1*cvar2.. levels are printed across the top of the table * = rvar1*rvar2.. levels are printed along the left-hand side of the table Separate tables are printed according to the combination of levels with the variables . All possible combinations of levels are formed with the asterisk (*) separating the variables, for example rvar1*rvar2. One of PROC TABULATE's most useful functions is to produce tables of specified summary statistics defined by multiple row and column level definitions. The following procedure prints the sample size, mean, and variance for each combination of the levels of classification factors in the table. PROC TABULATE DATA=one NOseps; CLASS row column repl; VAR y; TABLE row, column*y*(N*f=2.0 MEAN*f=6.2 VAR*f=9.6) / BOX='Summary Statistics' rts=14; TABLE row, y*(N*f=2.0 MEAN*f=6.2 VAR*f=8.5) / BOX='Row Summary Statistics' rts=14; RUN; The following table of summary statistics is printed by the second TABLE statement: ---------------------------------- |Row Summary | Y | |Statistics |-------------------| | | N | MEAN | VAR | |------------+---+------+--------| |ROW | | | | |1 | 6| 3.18| 0.09159| |2 | 6| 2.18| 0.06198| |3 | 6| 2.95| 0.26014| |4 | 6| 3.74| 0.12831| ----------------------------------- PROC TABULATE from Version 8 knows the option OUT=, so that you may write the results to a dataset which may then be exported to Excel. To export a table into Excel and keep its tabular structure enter the ODS statements. ODS HTML PROC TABULATE; etc. RUN; ODS HTML close; You can export the table you see in the results viewer directly into an Excel worksheet which can then be formatted to your specifications. II. Percentages of Counts KEYWORDS: PCTN COLPCTN ROWPCTN PAGEPCTN REPPCTN The first example demonstrates how to calculate percents for the number of observations in each cell. Denominator definitions to calculate percentages with PROC TABULATE can be very confusing. Row or column percentages are most commonly encountered. To have meaning in a conditional probability sense, the percentages in each row or column will add to 100%. To calculate percentages that add to 100% across all columns within each row specify the column variable as the denominator definition. PROC TABULATE DATA=indat NOseps; CLASS row column; TABLE row, (column all)*pctn*f=6.2 / rts=11 ; RUN; To calculate percentages that add to 100% across all rows within each column specify the row variable as the denominator: PROC TABULATE DATA=indat NOseps; CLASS row column; TABLE (row all), column*pctn*f=6.2 / rts=11 ; RUN; The next example demonstrates how to compute counts and percents of count data across rows and columns. DATA hs; INPUT row column @@; CARDS; 1 2 1 1 1 2 1 1 1 2 2 2 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 1 2 1 2 1 1 2 2 2 1 2 2 1 2 2 3 1 1 1 3 2 2 2 1 1 1 1 3 1 3 2 2 1 4 2 4 1 4 1 4 1 2 1 3 2 1 2 3 1 2 2 3 2 4 2 4 2 4 1 3 ; PROC TABULATE DATA=hs NOseps ; CLASS row column ; TABLE (row all='Total'), (column all='Total')*n=' '*f=5.0 / rts=10 box='Total Counts'; TABLE row, (column all)*pctn*f=7.2 / rts=14 box='Percents across each row for the column vars'; TABLE (row all), (column all)*(n*f=3.0 pctn*f=7.2) / rts=14 box='Counts and percents across the row'; TABLE row, (column all)*rowpctN*f=7.2 / rts=18 box='Prob (column A | row B)'; RUN; * the percents can also be interpreted as prob(COLUMN | ROW) = prob(Row and Column)/Prob(column) Output from the first TABLE statement: ---------------------------------------- |Total | column | | |Counts |-----------------------| | | | 1 | 2 | 3 | 4 |Total| |--------+-----+-----+-----+-----+-----| |row | | | | | | |1 | 9| 12| 5| 3| 29| |2 | 4| 5| 3| 4| 16| |Total | 13| 17| 8| 7| 45| ---------------------------------------- Output from the second TABLE statement: ------------------------------------------------------ |Percents | column | | |across each |-------------------------------| | |row for the | 1 | 2 | 3 | 4 | All | |column vars |-------+-------+-------+-------+-------| | | PctN | PctN | PctN | PctN | PctN | |------------+-------+-------+-------+-------+-------| |row | | | | | | |1 | 31.03| 41.38| 17.24| 10.34| 100.00| |2 | 25.00| 31.25| 18.75| 25.00| 100.00| ------------------------------------------------------ The third TABLE statement shows how to place the counts and the percents side-by-side; it also produces row and column totals: -------------------------------------------------------------------------- |Counts and | column | | |percents |-----------------------------------------------| | |across the | 1 | 2 | 3 | 4 | All | |row |-----------+-----------+-----------+-----------+-----------| | | N | PctN | N | PctN | N | PctN | N | PctN | N | PctN | |------------+---+-------+---+-------+---+-------+---+-------+---+-------| |row | | | | | | | | | | | |1 | 9| 31.03| 12| 41.38| 5| 17.24| 3| 10.34| 29| 100.00| |2 | 4| 25.00| 5| 31.25| 3| 18.75| 4| 25.00| 16| 100.00| |All | 13| 28.89| 17| 37.78| 8| 17.78| 7| 15.56| 45| 100.00| -------------------------------------------------------------------------- The RowPctN option works the same as PctN with fewer entries: ---------------------------------------------------------- |Prob (A | row B)| Column | | | |-------------------------------| | | | 1 | 2 | 3 | 4 | All | | |-------+-------+-------+-------+-------| | |RowPctN|RowPctN|RowPctN|RowPctN|RowPctN| |----------------+-------+-------+-------+-------+-------| |Row | | | | | | |1 | 31.03| 41.38| 17.24| 10.34| 100.00| |2 | 25.00| 31.25| 18.75| 25.00| 100.00| ---------------------------------------------------------- The pctn and ColpctN options demonstrated belww work in an analogous manner to pctN and RowpctN as it computes percents which add to 100% down each column: PROC TABULATE DATA=hs noseps ; CLASS row column ; TABLE (row all), column*pctn*f=7.2 / rts=14 box='Conditional percents for row given col'; TABLE (row all), column*ColpctN*f=7.2 / rts=14 box='Conditional percents for row given col'; RUN; --------------------------------------------- |Conditional | column | |percents |-------------------------------| |for row | 1 | 2 | 3 | 4 | |given col |-------+-------+-------+-------| | | PctN | PctN | PctN | PctN | |------------+-------+-------+-------+-------| |row | | | | | |1 | 69.23| 70.59| 62.50| 42.86| |2 | 30.77| 29.41| 37.50| 57.14| |All | 100.00| 100.00| 100.00| 100.00| ---------------------------------------------- You cannot enter variable groupings as the nested notation found with PROCs GLM or MIXED; you must enter variable groupings in their ungrouped form. For example, a*b a*c is the equivalent way of notating a*(b c). Calculate percents down a column: DATA one; DO car=1 to 3; do age=1 to 2; rept = ceil(8*ranuni(92838)); do i=1 to rept; output; end; end; end; PROC TABULATE DATA=one NOseps ; CLASS car age ; TABLE (car * ( age all ) all), n*f=4.0 colpctn='(%)'*f=6.1; RUN; ------------------------------- | | N | (%) | |-----------------+----+------| |car age | | | |1 1 | 5| 17.9| | 2 | 1| 3.6| | All | 6| 21.4| |2 age | | | | 1 | 8| 28.6| | 2 | 1| 3.6| | All | 9| 32.1| |3 age | | | | 1 | 6| 21.4| | 2 | 7| 25.0| | All | 13| 46.4| |All | 28| 100.0| ------------------------------- III. Calculate percentage of sums * first, simulate some housing study data; * generate a random number of houses of various types across four city sectors; DATA two; DROP nn; DO sector = 1 to 4; nn = INT(200*ranuni(924)); * generate a random no. of houses in each sector; DO i=1 to nn; garden=ranuni(928) > .5; * is it a garden? 1=yes, 0=no; suburb=ranuni(928) > .8; * is it located in the suburbs? 1=yes, 0=no ; OUTPUT; END; END; PROC TABULATE DATA=two NOseps; CLASS sector; VAR suburb garden; TABLE (sector all), (n*f=4.0 pctn*f=6.2) suburb*(sum*f=3.0 pctsum*f=6.2) garden*(sum*f=3.0 pctsum*f=6.2) / rts=9; RUN; ------------------------------------------- | | | | suburb | garden | | | | |----------+----------| | | N | PctN |Sum|PctSum|Sum|PctSum| |-------+----+------+---+------+---+------| |sector | | | | | | | |1 | 38| 6.77| 8| 6.84| 19| 6.79| |2 | 188| 33.51| 35| 29.91| 89| 31.79| |3 | 175| 31.19| 42| 35.90| 97| 34.64| |4 | 160| 28.52| 32| 27.35| 75| 26.79| |All | 561|100.00|117|100.00|280|100.00| ------------------------------------------- IV. Embellish the Table To remove a label, assign a blank label to each variable in the table statement. PROC TABULATE; CLASS r c; VAR y; TABLE r, c=" "*y*mean / RTS=10; run; In this example, variable c will no longer have a header because it is supplied with a blank label (or you can supply a new one between the quotes). Variable r will continue to have the header listing the variable name or label. You can apply the same approach to remove the heading for a statistic: PROC TABULATE; CLASS r c; VAR y; TABLE r, c=" "*y*n=" " / BOX="Counts" rts=10; RUN; V. PROC TABULATE and Missing Data Missing values and formats Missing levels Print missing combination Consider this sample input dataset and its associated format: PROC FORMAT; VALUE incm 1='less than 15000' 2='15000 to 30000' 3='30000 to 45000' 4='greater than 45000' .n="no response" .r="refused to answer" .u="unemployed" .="unknown"; RUN; DATA two; LABEL income='Income'; FORMAT income incm. ; MISSING n r u; * the letters n r u refer to specific reasons for missing data from income listed in PROC FORMAT; INPUT id income @@; CARDS; 1 1 2 . 3 r 4 4 5 2 6 n 7 u 8 1 9 2 10 r 11 1 12 3 13 2 14 3 15 n 16 . ; The option 'missing' on the PROC statement summarizes how many missing values are present in the data. If you have coded with a variety of the 28 possible missing data codes as observed here, you will find summaries for each type. PROC TABULATE NOseps missing; CLASS income; TABLE income, n='Count'*f=7.0 / RTS=20; RUN; ---------------------------- | | COUNT | -------------------+-------| |Income | | |unknown | 2 | |no response | 2 | |refused to answer | 2 | |unemployed | 1 | |lt 15000 | 3 | |15000 to 30000 | 3 | |30000 to 45000 | 2 | |gt 45000 | 1 | ---------------------------- The option printmiss on the TABLE statement prints all row and column labels for each portion of the table, even if these combinations don't exist for certain levels. The 'misstext'= option tells SAS what character to print of the particular cell is empty. Typical a blank (as shown here) for numerical summaries is entered. A '0' may be inserted for cell counts equal to 0. DATA tst; lvl=1; a=.z; b=1; output; a=1; b=2; output; a=2; b=1; output; b=2; output; lvl=2; a=1; b=.t; output; b=2; output; a=2; b=1; output; b=2; output; PROC TABULATE DATA=tst NOseps missing; CLASS lvl a b; TABLE lvl*a, b*n=' '*f=4.0 / rts=16 printmiss misstext=' '; OPTIONS LS=75; run; ------------------------------- | | b | | |--------------| | | T | 1 | 2 | |--------------+----+----+----| |lvl a | | | | |1 Z | | 1| | | 1 | | | 1| | 2 | | 1| 1| |2 Z | | | | | 1 | 1| | 1| | 2 | | 1| 1| ------------------------------- proc format; value ysn 0='No' 1='Yes'; run; PROC TABULATE DATA=tbl NOseps missing; CLASS group time1 time2; TABLE time1, group*time2*n=' '*f=4.0 / rts=10 printmiss misstext='0'; FORMAT time1 time2 ysn. ; run; time2=yes is not available for group 3 ---------------------------------------- | | group | | |-----------------------------| | | 1 | 2 | 3 | | |---------+---------+---------| | | time2 | time2 | time2 | | |---------+---------+---------| | | No |Yes | No |Yes | No |Yes | |--------+----+----+----+----+----+----| |time1 | | | | | | | |No | 1| 0| 1| 0| 4| 0| |Yes | 7| 2| 4| 3| 3| 0| ---------------------------------------- VI. Writing tables with empty rows and/or columns DATA one ; input gender $1 age @@; CARDS; M 11 M 12 F 11 F 12 O 11 O 12 ; PROC PRINT; RUN; PROC TABULATE DATA=sashelp.class(RENAME=(sex=gender)) Noseps classdata=one exclusive; CLASS gender age; VAR height; TABLE gender, age*height*(n*f=6.0 mean*f=6.1) / rts=9 misstext='0'; run; prints only the levels specified in the dataset called one: ------------------------------------- | | Age | | |---------------------------| | | 11 | 12 | | |-------------+-------------| | | Height | Height | | |-------------+-------------| | | N | Mean | N | Mean | |-------+------+------+------+------| |gender | | | | | |F | 1| 51.3| 2| 58.1| |M | 1| 57.5| 3| 60.4| |O | 0| 0| 0| 0| ------------------------------------- * Assume a 4 x 4 table; * generate 16 records which include all possible combinations of row and col; DATA one; DO row =1 to 4; DO col =1 to 4; OUTPUT; * outputs 16 combinations of row and col; END; END; RUN; * Suppose your data only have a few of the cells of the table represented, yet you want to see a table which contains all possible combinations of rows and columns; DATA cnts; INPUT row col count; Cards; 1 2 3 3 2 2 3 1 1 1 4 4 ; PROC TABULATE DATA=cnts Noseps CLASSDATA=one ; CLASS row col; FREQ count; TABLE (row all='Total'), (col all='Total')*n=' '*f=5.0 / rts=11 misstext=' '; RUN; ----------------------------------------- | | col | | | |-----------------------| | | | 1 | 2 | 3 | 4 |Total| |---------+-----+-----+-----+-----+-----| |row | | | | | | |1 | | 3| | 4| 7| |2 | | | | | | |3 | 1| 2| | | 3| |4 | | | | | | |Total | 1| 5| | 4| 10| ----------------------------------------- VII. What is the difference between the WEIGHT and FREQ statements? Assume nn is an integer. The FREQ statement then replicates the observations "nn" number of times, e.g., if you have n y 3 48 entering FREQ nn; in PROC TABULATE summarizes a dataset as if you had entered this one record nn=3 times: n y 3 48 3 48 3 48 SAS computes stats as if 3 records like this one exist. Note that if n is a decimal, SAS rounds it down to the nearest integer. The statement WEIGHT n; specifies that only one record exists, but gives it a weight of nn in calculations. In this case, non-integer weights are acceptable. For means and sums, the difference in FREQ or WEIGHT with integer values isn't an issue, since the mean of means or the sum of sums will be the same; however, it is a big issue when computing variances as the following example shows. DATA one; INPUT nn y; CARDS; 8.2 10.8750 6.1 10.3333 8 7.6250 7 1.5714 5 10.0000 4 7.0000 ; * unweighted results (each observation counts as 1 record with weight 1); proc tabulate noseps; VAR y; table y*n*f=4.0 y*mean*f=9.5 y*var*f=8.5; run; ----------------------- | y | y | y | |---+--------+--------| | N | Mean | Var | |---+--------+--------| | 6| 7.90078|12.02367| ----------------------- * treat each record as if it exists the integer portion, NN times in the file and given a weight of 1; PROC TABULATE Noseps; VAR y; TABLE y*n*f=4.0 y*mean*f=9.5 y*var*f=8.5; FREQ NN; run; ----------------------- | y | y | y | |---+--------+--------| | N | Mean | Var | |---+--------+--------| | 38| 7.86841|11.14989| ----------------------- * treat each record as if it exists once in the file and given a weight of NN; proc tabulate noseps; VAR y; Table y*n*f=4.0 y*mean*f=9.5 y*var*f=8.5; WEIGHT NN; run; ------------------------- | y | y | y | |----+---------+--------| | N | Mean | Var | |----+---------+--------| | 6| 7.89055|82.98855| ------------------------- In the above table the weighted mean is corrected, though it assumes only 6 records exist in the file. To produce a count of records that reflects the weight variable, the procedure counts a variable called a where all values equal 1; proc tabulate noseps; VAR y a; Table a*sum*f=4.1 y*mean*f=9.5 y*var*f=8.5; WEIGHT NN; run; ------------------------- | a | y | y | |----+---------+--------| |Sum | Mean | Var | |----+---------+--------| |38.3| 7.89055|82.98855| -------------------------