It is often useful to construct contingency tables of categorical variables to determine the sample sizes of data subsets before analysis. The SAS procedures that construct such tables report these sample sizes and can also calculate chi-square statistics to test hypotheses of independence or homogeneity.
This lesson provides an introduction to the following SAS procedures:
PROC GCHART produces vertical or horizontal bar charts for categorical variables and histograms for quantitative variables.
The following program uses PROC GCHART and PROC UNIVARIATE to describe resting pulse rates (pulse1) for the permanent pulse data set (presumed to be on floppy disk) discussed in Lesson 3.
    DM 'CLEAR LOG';
    DM 'CLEAR OUTPUT';
    OPTIONS LINESIZE=72 NODATE NONUMBER;
    LIBNAME LIBRARY 'a:\';                /* permanent library on drive a */
    PROC GCHART DATA=LIBRARY.pulse;       /* chart goes to the Graph window */
      VBAR pulse1;
      TITLE 'Default bar chart for pulse1';
    PROC UNIVARIATE DATA=LIBRARY.pulse PLOT NORMAL;   /* results go to the Output window */
      VAR pulse1;
      TITLE 'PROC UNIVARIATE for pulse1';
    RUN;
The PLOT option in PROC UNIVARIATE produces a stemplot, a boxplot, and a normal probability plot. The NORMAL option produces a test of normality. More than one variable can be listed in the VAR statement, and a BY statement (BY vars; after the VAR line) will produce summaries, plots, and tests for each value of another variable, e.g., gender, provided the data are first sorted on that variable.
Click desc1.txt, use EDIT, SELECT ALL, and COPY, then
paste the program into the SAS Program window and run it. This program
generates both an Output window (PROC UNIVARIATE) and a Graph window (PROC
GCHART). Examine both windows. Scroll or use Page Up and Page Down
to view all the contents of the Output window.
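The BY-group processing described above for PROC UNIVARIATE can be sketched as follows. This is only a minimal sketch: it assumes the pulse data set contains a gender variable (as in the homework for this lesson), and the temporary data set name sorted is a hypothetical choice made for illustration.

```sas
LIBNAME LIBRARY 'a:\';

/* Sort a temporary copy so the permanent data set is left unchanged. */
/* "sorted" is a hypothetical data set name used only in this sketch. */
PROC SORT DATA=LIBRARY.pulse OUT=sorted;
  BY gender;
RUN;

/* One set of summaries, plots, and normality tests per value of gender. */
PROC UNIVARIATE DATA=sorted PLOT NORMAL;
  VAR pulse1;
  BY gender;
  TITLE 'PROC UNIVARIATE for pulse1 by gender';
RUN;
```

Sorting into a temporary copy is a design choice here; sorting LIBRARY.pulse in place would also satisfy the BY statement.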
You have used PROC MEANS in Lesson 5 and in the homework for Lesson 6. The general form for using this procedure is as follows:
    PROC MEANS DATA=file1 options;
      VAR variable-list;
      BY categorical-variable-list;   /*data must be sorted first*/
You have used PROC TABULATE in Lesson 3 and Lesson 5. This procedure is useful for producing tables of frequencies or descriptive statistics in easy-to-read form. The general form for using this procedure is as follows:
    PROC TABULATE DATA=file1 options;
      CLASS categorical-variable-list;
      VAR variable-list;
      TABLE specifications;
The homework in this lesson will provide further instruction on PROC TABULATE.
PROC FREQ is useful for producing contingency tables showing frequencies and marginal and conditional distributions, and for conducting chi-square analyses. The general form for using this procedure is as follows:
    PROC FREQ DATA=file1;
      TABLE categorical-variable * variable / options;
For example, the following program uses the permanent SAS data set LIBRARY.pulse (presumed to be on floppy disk in drive a) to produce a contingency table of smoker by activity, together with a chi-square test of independence and the expected value and chi-square contribution of each cell:
    DM 'CLEAR LOG';
    DM 'CLEAR OUTPUT';
    OPTIONS LINESIZE=72 NODATE NONUMBER;
    LIBNAME LIBRARY 'a:\';
    PROC FREQ DATA=LIBRARY.pulse;
      TABLE smoker * activity / CHISQ EXPECTED CELLCHI2;
    RUN;
Click freq.txt, use EDIT, SELECT ALL, and COPY, then paste this program into the SAS Program window and run it. Examine the Log and Output windows.
It is often useful to use PROC FREQ to conduct analyses of contingency
tables from frequency data rather than from a raw data set. For example,
we may already have a contingency table such as the following:
          |     Gender    |
Diseased? | Female | Male |
No        |   48   |  64  |
Yes       |   27   |  35  |
This table can be analyzed in PROC FREQ using the following commands:
    DM 'CLEAR LOG';
    DM 'CLEAR OUTPUT';
    OPTIONS LINESIZE=72 NODATE NONUMBER;
    DATA file1;
      INPUT gender $ disease $ count;
      CARDS;
    Male Yes 35
    Male No 64
    Female Yes 27
    Female No 48
    ;
    PROC FREQ DATA=file1;
      TABLE disease * gender / CHISQ EXPECTED CELLCHI2;
      WEIGHT count;
    RUN;
Click freq2.txt, use EDIT, SELECT ALL, and COPY, then paste this program into the SAS Program window and run it. Examine the Log and Output windows.
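The expected values and cell chi-square contributions that PROC FREQ reports for this table can be checked by hand. The worked computation below uses the usual formulas for a test of independence; the symbols O, E, R, C, and n (observed count, expected count, row total, column total, grand total) are notation introduced here for illustration and are not SAS output labels.

```latex
% Row totals:    No = 48 + 64 = 112,     Yes = 27 + 35 = 62
% Column totals: Female = 48 + 27 = 75,  Male = 64 + 35 = 99
% Grand total:   n = 174
\[
E_{ij} = \frac{R_i \, C_j}{n},
\qquad
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]
\[
E_{\text{No,Female}} = \frac{112 \times 75}{174} \approx 48.28,
\quad
E_{\text{No,Male}} = \frac{112 \times 99}{174} \approx 63.72
\]
\[
E_{\text{Yes,Female}} = \frac{62 \times 75}{174} \approx 26.72,
\quad
E_{\text{Yes,Male}} = \frac{62 \times 99}{174} \approx 35.28
\]
\[
\chi^2 \approx 0.0016 + 0.0012 + 0.0028 + 0.0022 = 0.0078
\quad (\text{df} = 1)
\]
```

This statistic is far below 3.84, the 5% critical value for a chi-square distribution with 1 degree of freedom, so these counts give no evidence of an association between disease and gender.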
Other SAS procedures that are useful for forming or analyzing tables
include PROC SUMMARY and PROC CATMOD. PROC SUMMARY is very similar
to PROC MEANS but, by default, it does not print descriptive statistics
to the Output window; instead, it writes summary statistics to new SAS data sets.
PROC CATMOD is a procedure for modeling categorical data using techniques
such as linear and log-linear models and logistic regression.
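To illustrate how PROC SUMMARY differs from PROC MEANS, the following sketch writes group statistics to a new data set and then prints that data set. It is a minimal example, not part of the original lesson programs: it assumes LIBRARY.pulse contains gender and pulse1 as above, and the data set name stats and the variable names npulse, mpulse, and sdpulse are hypothetical.

```sas
LIBNAME LIBRARY 'a:\';

/* NWAY keeps only the rows for each gender group (no overall row). */
PROC SUMMARY DATA=LIBRARY.pulse NWAY;
  CLASS gender;
  VAR pulse1;
  /* Nothing is printed; statistics go to the temporary data set stats. */
  OUTPUT OUT=stats N=npulse MEAN=mpulse STD=sdpulse;
RUN;

/* Print the new data set to see what PROC SUMMARY wrote. */
PROC PRINT DATA=stats;
  TITLE 'Summary statistics for pulse1 by gender';
RUN;
```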
For the exercises below turn in the programs you used and the output. Do not turn in a copy of the data for this assignment.
1. PROC GCHART will produce horizontal histograms with frequency
tables by replacing
VBAR variables; with
HBAR variables;
The frequency table can be eliminated by including
a NOSTAT option as in
HBAR pulse1 / NOSTAT;
PROC GCHART will also produce bar charts
for categorical variables through use of
the DISCRETE option as in
HBAR activity / DISCRETE;
Do the following using the pulse data set and PROC GCHART:
a. Produce a horizontal histogram with a frequency table for weight and
b. Produce a horizontal bar chart without a frequency table for activity.
2. In PROC TABULATE a comma separates dimensions of a table (rows or columns), an asterisk crosses elements within a dimension (rows within rows), and a space concatenates (stacks) elements in a dimension. To better understand these three operators, run the following commands on the pulse data set:
    PROC TABULATE DATA=LIBRARY.pulse;
      CLASS gender ran;
      VAR pulse1;
      TABLE gender;
      TABLE gender, ran;
      TABLE gender*ran;
      TABLE gender ran;
      TABLE gender*ran, pulse1*MEAN;
      TABLE gender*ran, pulse1*(MEAN N);
    RUN;
Next to each of the six tables produced, write the TABLE statement that produced it.
3. Sort the data and use a BY statement in a PROC FREQ step to
get tables of counts, expected values, and cell chi-squares for tables
of gender by smoker for each value of activity, i.e., three tables.