Lesson 5

Sorting data and using data subsets

This lesson addresses the following: To complete this lesson you must have a copy of the raw data file pulsedat.txt and the permanent SAS data set LIBRARY.pulse created in Lesson 3 on your floppy disk.  If you do not have the raw data set pulsedat.txt on your floppy,  copy it by clicking the link in this paragraph, use EDIT, SELECT ALL, COPY, paste the data set into your wordprocessor, and save it on your floppy as TEXT ONLY or ASCII DOS(TEXT). If you do not have the permanent SAS data set LIBRARY.pulse, please return to Lesson 3 and reconstruct it onto floppy disk.

Selecting a subset of the data for analysis

There are two common ways to select a specific subset of the data for analysis;
1.  An IF command in the DATA step.  For example, the following program (select1.txt) would print the pulse data for males only:
DM 'CLEAR LOG';
DM 'CLEAR OUTPUT';
OPTIONS LINESIZE=72 NODATE NONUMBER;
DATA file1;
     INFILE 'a:\pulsedat.txt';
     INPUT pulse1 pulse2 ran smoker gender
           height weight activity;
     IF gender = 1;  /* equivalently IF gender = 2 THEN DELETE;*/
PROC PRINT;
     TITLE 'The pulse data for males only';
RUN;
 2.  A WHERE command in the PROC step.  For example, the following program (select2.txt) would print the data for males only:
DM 'CLEAR LOG';
DM 'CLEAR OUTPUT';
OPTIONS LINESIZE=72 NODATE NONUMBER;
DATA file1;
     INFILE 'a:\pulsedat.txt';
     INPUT pulse1 pulse2 ran smoker gender
           height weight activity;
PROC PRINT;
     WHERE gender=1; /*equivalently WHERE gender NE 2; */
RUN;

Running analyses on the full data set and on a subset in the same program

The program below (full&sub.txt) creates keeps a copy of the complete pulse data set in a temporary file called file1 and a subset of the data for runners (ran = 1) only in a temporary file called subset.  One procedure is applied to each of these two temporary files.
DM 'CLEAR LOG';
DM 'CLEAR OUTPUT';
OPTIONS LINESIZE=72 NODATE NONUMBER;
DATA file1;
     INFILE 'a:\pulsedat.txt';
     INPUT pulse1 pulse2 ran smoker gender
           height weight activity;
DATA subset;
     SET file1;
     IF ran = 1;
PROC PRINT DATA=file1;      /* prints the entire data set         */
     TITLE 'The entire data set';
PROC MEANS DATA=subset;    /* For runners only prints sample size, */
     VAR pulse1;           /* the mean, standard deviation, min,   */
     TITLE 'Runners only'; /* and max for pulse1                   */
RUN;

Running the same analysis on each of several subsets of data

Many SAS procedures allow the use of a BY vars; subcommand which will run the procedure for each value of the variable(s) given in the subcommand.  For example, the commands
PROC MEANS;
     VAR pulse1;
     BY gender;
would output the mean, standard deviation, minimum, and maximum of the variable pulse1 for  each gender.

The use of a BY statement in SAS procedures, other than PROC SORT, require that the data are sorted.  That is, the values of the variables are ordered in descending (largest to smallest) or ascending (smallest to largest) order.   PROC SORT is used to sort data in SAS.  This procedure can be used to sort numeric or character data (A to Z).  The following example illustrates the use of PROC SORT on the raw data set pulsedat.txt described in Lesson 3:

DM 'CLEAR LOG';
DM 'CLEAR OUTPUT';
OPTIONS LINESIZE=72 NODATE NONUMBER;
DATA file1;
     INFILE 'a:\pulsedat.txt';
     INPUT pulse1 pulse2 ran smoker gender height weight activity;
PROC SORT DATA=file1;
          BY gender weight pulse1;
PROC PRINT;
     VAR gender weight pulse1 pulse2 height activity ran smoker;
RUN;
Click sort1.txt, use EDIT, SELECT ALL, COPY, paste this program into the SAS program window and run it.   You must have pulsedat.txt on the floppy disk in drive a for the program to run successfully.

Examine the Log window and Output window carefully to understand the sorting that occurred. Note that SAS sorts first by gender (smallest 1= male to largest 2= female), then by weight (smallest to largest), then by pulse1 (smallest to largest).

If you have the SAS permanent file LIBRARY.pulse with variable labels and value labels you created in Lesson 3 on a floppy disk, you can run the same sort as above by Clicking sort2.txt, use EDIT, SELECT ALL, COPY, paste this program into the SAS program window and run it.  Be sure to have your diskette with LIBRARY.pulse in the floppy drive.

The example program below illustrates the use of PROC SORT followed by PROC MEANS using a BY subcommand on the pulse data set:

DM 'CLEAR LOG';
DM 'CLEAR OUTPUT';
OPTIONS LINESIZE=72 NODATE NONUMBER;
LIBNAME LIBRARY 'a:\';
PROC SORT DATA=LIBRARY.pulse;/* Sorts the data by gender.             */
          BY gender;
PROC MEANS;                  /* Finds the sample size, mean, standard */
     VAR pulse1;             /* deviation, minimum, and maximum of    */
     BY gender;              /* pulse1 for each gender.               */
PROC SORT;                   /* Sorts the data by gender and activity.*/
     BY gender activity;     
PROC MEANS;                  /* Finds the sample size, mean, standard */
     VAR pulse1;             /* deviation, minimum, and maximum of    */
     BY gender activity;     /* pulse1 for each gender and activity   */
RUN;                         /* combination.                          */
If you have your copy of LIBRARY.pulse on diskette, click means1.txt, use EDIT, SELECT ALL, COPY, paste this program into the SAS Program window and run it.  Examine the Log window to see how SAS signifies the sorts and examine the Output window to view the kind of results it gives.
 

Homework # 5

Read sections 4.6, 5.1-5.4 and 5.9

1.  Answer the following questions concerning PROC SORT:

a.  Is the default to order values in ascending or descending order?

b. How do you change the default ordering?

c. If missing values occur in a data set for a variable and the data is sorted in ascending order on that variable, where do the missing values appear, at the beginning or at the end of the list?  You should try this by editing pulsedat.txt to include some missing values (.) in pulse1 then try a sort on pulse1.  A narrative answer is all that is required for the homework.

2.  Using the permanent SAS data set LIBRARY.pulse described in Lesson 3, do parts a and b below using a single SAS program:
a.  Use PROC MEANS (set linesize =110) to output the average height of males that ran in place (a data subset).  Include a title in this procedure which describes what is being done; see Lesson 1, and

b.  Using the entire data set, use PROC TABULATE to produce two tables; one for smokers and one for non smokers.    Each table should present the sample sizes and means for weight for each combination of gender and activity. Include a title in the procedure to describe what is being done. You must use a  PROC SORT with BY smoker as a start.  Then the following PROC TABULATE commands:

 PROC TABULATE DATA = your temporary file name;
            VAR  weight;
            CLASS gender activity;
            TABLE = gender* activity * weight*(N, MEAN);
            BY smoker;
3.  Without sorting the data (see section 5.9; CLASS subcommand) write a single PROC MEANS command with options which outputs the sample size, mean, and standard error of the mean for weight, height, and bulk=weight/height for each gender.  You should run your command to test it but you do not have to turn in the program in your homework; you need only include the command in your write up.