Lesson 7

Descriptive statistics and tables

Before conducting hypothesis tests, regression analysis, or other statistical analysis, it is wise to begin by getting a basic understanding of your data and checking whether the values have been entered correctly.  Printing out a data set and examining the listing will often allow researchers to spot recording or data entry errors, particularly if the data are sorted.  However, for large data sets it is often faster to produce summary descriptive statistics, including the minimum and maximum, and plots such as histograms or bar charts, which often help identify data problems.  Researchers should also check distributional properties (e.g., normality) of the variables required for analysis.

It is often useful to construct contingency tables of categorical variables to determine sample sizes of data subsets before analysis.  The SAS procedures used to construct tables provide information on such sample sizes and also calculate chi-square statistics to test hypotheses of independence or homogeneity.

This lesson provides an introduction to the following SAS procedures:

PROC UNIVARIATE provides basic descriptive statistics (see the list on page 147 of the text), basic graphics, and a test of normality.

PROC GCHART produces vertical or horizontal bar charts for categorical variables and histograms for quantitative variables.

The following program uses PROC GCHART and PROC UNIVARIATE to describe resting pulse rates (pulse1) for the permanent pulse data set (presumed to be on floppy disk) discussed in Lesson 3.

DM 'CLEAR LOG';
DM 'CLEAR OUTPUT';
OPTIONS LINESIZE=72 NODATE NONUMBER;
LIBNAME LIBRARY 'a:\';
PROC GCHART DATA= LIBRARY.pulse;
     VBAR pulse1;
     TITLE 'Default bar chart for pulse1';
PROC UNIVARIATE DATA=LIBRARY.pulse PLOT NORMAL;
     VAR pulse1;
     TITLE 'PROC UNIVARIATE for pulse1';
RUN;
The PLOT option in PROC UNIVARIATE produces a stemplot, a boxplot, and a normal probability plot.  The NORMAL option produces a test of normality.  More than one variable can be listed in the VAR statement, and a BY statement (BY vars; after the VAR line) will produce summaries, plots, and tests for each value of another variable, e.g., gender, provided the data are first sorted on this variable.

Click desc1.txt, use EDIT, SELECT ALL, COPY, paste the program into the SAS Program window and run it.  This program generates both an Output window (PROC UNIVARIATE) and a Graph window (PROC GCHART).  Examine both windows.  Scroll or use page up-page down to view all the contents of the Output window.
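
The BY processing described above for PROC UNIVARIATE can be sketched as follows.  This is only an illustration, not one of the lesson's program files: it assumes gender is a variable in the pulse data set, and the temporary data set name pulse_sorted is made up for the example.

```sas
LIBNAME LIBRARY 'a:\';
/* Sort into a temporary data set (hypothetical name pulse_sorted) */
/* so that the permanent file on disk is left unchanged.           */
PROC SORT DATA=LIBRARY.pulse OUT=pulse_sorted;
     BY gender;
RUN;
/* One set of summaries, plots, and normality tests per value of gender */
PROC UNIVARIATE DATA=pulse_sorted PLOT NORMAL;
     VAR pulse1;
     BY gender;
     TITLE 'PROC UNIVARIATE for pulse1 by gender';
RUN;
```

If the PROC SORT step is omitted and the data are not already in gender order, SAS stops the BY step with an error message in the Log window.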
 

You have used PROC MEANS in Lesson 5 and in the homework for Lesson 6.  The general form for using this procedure is as follows:

PROC MEANS  DATA=file1 options;
     VAR variable-list;
     BY  categorical-variable-list;  /*data must be sorted first*/
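
As a concrete instance of this general form, the sketch below requests a few statistics that are handy for spotting data entry problems.  It assumes that pulse1, weight, and gender are variables in the pulse data set and that weight is numeric; pulse_sorted is a made-up name for a temporary sorted copy.

```sas
LIBNAME LIBRARY 'a:\';
/* BY processing requires sorted data; sort into a temporary copy */
PROC SORT DATA=LIBRARY.pulse OUT=pulse_sorted;
     BY gender;
RUN;
/* N, NMISS, MIN, and MAX help flag missing or out-of-range entries */
PROC MEANS DATA=pulse_sorted N NMISS MIN MAX MEAN STD MAXDEC=2;
     VAR pulse1 weight;
     BY gender;
     TITLE 'PROC MEANS for pulse1 and weight by gender';
RUN;
```
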
You have used PROC TABULATE in Lesson 3 and Lesson 5.  This procedure is useful for producing tables of frequencies or descriptive statistics in easy-to-read form.  The general form for using this procedure is as follows:
PROC TABULATE  DATA=file1 options;
     CLASS categorical-variable-list;
     VAR   variable-list;
     TABLE specifications;
The homework in this lesson will provide further instruction on PROC TABULATE.

PROC FREQ is useful for producing contingency tables showing frequencies, marginal and conditional distributions, and conducting chi-square analyses.  The general form for using this procedure is as follows:

PROC FREQ  DATA=file1;
     TABLE categorical-variable * variable /options;
For example, the following program uses the permanent SAS data set LIBRARY.pulse (presumed to be on floppy disk in drive a) to produce a contingency table of smoker by activity, together with a chi-square test of independence and the expected value and chi-square contribution of each cell:
DM 'CLEAR LOG';
DM 'CLEAR OUTPUT';
OPTIONS LINESIZE=72 NODATE NONUMBER;
LIBNAME LIBRARY 'a:\';
PROC FREQ  DATA=LIBRARY.pulse;
     TABLE smoker * activity/CHISQ EXPECTED CELLCHI2;
RUN;
Click freq.txt, EDIT, SELECT ALL, COPY, paste this program into the SAS Program window and run it.  Examine the Log and Output windows.

It is often useful to use PROC FREQ to conduct analyses of contingency tables from frequency data rather than from a raw data set.  For example, we may already have a contingency table such as the following:
 

             Gender
Diseased?    Female    Male
No               48      64
Yes              27      35

This table can be analyzed in PROC FREQ using the following commands:

DM 'CLEAR LOG';
DM 'CLEAR OUTPUT';
OPTIONS LINESIZE=72 NODATE NONUMBER;
DATA  file1;
      INPUT gender $ disease $ count;
      CARDS;
Male Yes 35
Male No  64
Female Yes 27
Female No  48
;
PROC FREQ  DATA=file1;
     TABLE disease * gender/CHISQ EXPECTED CELLCHI2;
     WEIGHT count;
RUN;
Click freq2.txt, EDIT, SELECT ALL, COPY, paste this program into the SAS Program window and run it.  Examine the Log and Output windows.
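
To see where the EXPECTED and CELLCHI2 values come from, it can help to check one cell by hand.  Under independence, the expected count for a cell is (row total x column total) / grand total, and the cell's chi-square contribution is (observed - expected) squared, divided by expected.  The DATA step below is a minimal sketch of that arithmetic for the Female, No cell of the table above; the data set name check and its variable names are invented for this illustration.

```sas
DATA check;
     /* Observed counts copied from the table above */
     observed  = 48;                    /* Female, not diseased */
     row_total = 48 + 64;               /* all not diseased     */
     col_total = 48 + 27;               /* all females          */
     n         = 48 + 64 + 27 + 35;     /* grand total          */
     /* Expected count under independence and cell chi-square */
     expected  = row_total * col_total / n;
     cellchi2  = (observed - expected)**2 / expected;
RUN;
PROC PRINT DATA=check;
     TITLE 'Hand check of one expected count and cell chi-square';
RUN;
```

The expected count printed here (about 48.28) should agree with the value PROC FREQ reports for the same cell.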

Other SAS procedures that are useful for forming or analyzing tables include PROC SUMMARY and PROC CATMOD.  PROC SUMMARY is very similar to PROC MEANS but, by default, it does not print descriptive statistics to the Output window; instead, it is typically used to write summary statistics to a new data set.  PROC CATMOD is a procedure for modeling categorical data using techniques such as linear and log-linear models and logistic regression.
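
As a brief sketch of how PROC SUMMARY is typically used, the program below writes the count, mean, and standard deviation of pulse1 for each gender to a new data set and then prints it.  The output data set name pulse_stats and the statistic variable names are made up for this example, and gender is assumed to be a variable in the pulse data set.

```sas
LIBNAME LIBRARY 'a:\';
/* NWAY keeps only the rows for each gender and drops the overall row. */
/* A CLASS statement, unlike BY, does not require sorted data.         */
PROC SUMMARY DATA=LIBRARY.pulse NWAY;
     CLASS gender;
     VAR pulse1;
     OUTPUT OUT=pulse_stats N=n_pulse1 MEAN=mean_pulse1 STD=sd_pulse1;
RUN;
/* PROC SUMMARY prints nothing by default, so list the new data set */
PROC PRINT DATA=pulse_stats;
     TITLE 'Summary statistics for pulse1 written by PROC SUMMARY';
RUN;
```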
 

Homework #7

 Read Sections 5.11 and 7.1.

For the exercises below, turn in the programs you used and the output.  Do not turn in a copy of the data for this assignment.

1.  PROC GCHART will produce horizontal histograms with frequency tables by replacing
                VBAR variables;      with            HBAR variables;
     The frequency table can be eliminated by including a NOSTAT option as in
                HBAR pulse1 / NOSTAT;
      PROC GCHART will also produce bar charts for categorical variables through use of
      the DISCRETE option as in
               HBAR activity / DISCRETE;
Do the following using the pulse data set and PROC GCHART:

a.  Produce a horizontal histogram with a frequency table for weight and
b.  Produce a horizontal bar chart without a frequency table for activity.
2.  In PROC TABULATE a comma separates dimensions of a table (rows or columns), an asterisk crosses elements within a dimension (rows within rows), and a space concatenates (stacks) elements in a dimension.  To better understand these three operators, run the following statements on the pulse data set:
            PROC TABULATE     DATA=LIBRARY.pulse;
                 CLASS gender ran;
                 VAR pulse1;
                 TABLE gender;
                 TABLE gender, ran;
                 TABLE gender*ran;
                 TABLE gender ran;
                 TABLE gender*ran, pulse1*MEAN;
                 TABLE gender*ran, pulse1*(MEAN N);
            RUN;
Next to each of the six tables produced, write the TABLE statement that produced it.

3.  Sort the data and use a BY statement in a PROC FREQ step to get tables of counts, expected values, and cell chi-squares for tables of gender by smoker for each value of activity, i.e., three tables.