LESSON 6

Stacking and combining data sets

Often researchers have information on some collection of variables from several times (seasons or years) or locations or from different experiments or measurements taken on the same collection of experimental units (subjects, field plots).  As the data is collected and recorded it is often put in separate files; one for each experiment, time or location.  In some instances the variables recorded in several data sets are the same (or nearly so) at the different times or locations.  In other cases,  the experimental units (subjects, plots) are the same but the variables recorded in separate files differ.  Statistical analyses often require that these data sets be brought together for analysis.  This lesson provides an introduction to how data sets are brought together in SAS.
 

Stacking data sets

The SET command in SAS is useful for stacking (or concatenating) two or more data sets into a single file which can be analyzed as a whole or output to a new file. Suppose we have data recorded for four variables, w, x, y, and z for two locations as described below:
 

Location 1

Location 2

Variables Variables
w
x
y
z
w
x
y
z
1
3 m 5.6 2 4 f 10.9
1 4 m 4.5 2 6 f 12.3
1 5 f 5.7 2 8 m 13.4

The program below illustrates the use of the SET command to stack these two data sets which are in files stacdat1.txt and stacdat2.txt (on floppy disk), construct a plot, and write out some of the variables to a new file (newdat.txt; on floppy):

DM 'CLEAR LOG';
DM 'CLEAR OUTPUT';
OPTIONS LINESIZE=72 NODATE NONUMBER;
DATA   file1;                          /*read in first data set*/
       INFILE 'a:\stacdat1.txt';
       INPUT w x y $ z;
DATA   file2;                         /*read in second data set*/
       INFILE 'a:\stacdat2.txt';
       INPUT w x y $ z;
DATA   file3;                         /combine data into one*/
       SET file1 file2;
       j= w/x;                        /*calculate a new variable*/
PROC PLOT   DATA=file3;               
     PLOT j*z;                       /*construct a plot for all data*/
DATA   _NULL_;
       SET file3;               /write out a new data set to floppy */
       FILE 'a:\newdat';
       PUT X 1-5  Y $ 8 Z 10 - 15 .2 J 17-25 .2;
RUN;
Copy stacdat1.txt and stacdat2.txt onto your floppy drive by clicking the links here, EDIT, SELECT ALL, COPY, paste into your word processor and use FILE, SAVE AS with TEXT ONLY (or ASCII DOS TEXT).   To run the program on the data, click stack1.txt, use EDIT, SELECT ALL, COPY, paste this program into the SAS Program window and run it.  Examine the Log window and graph window to see what happened.

You can exercise some control in how the data are brought together using PROC SORT and a BY statement in the DATA step which creates the combined file3.  The program below illustrates what happens with and without a BY statement in the SET command.  Recall that a BY statement requires that the data are sorted.

DM 'CLEAR LOG';
DM 'CLEAR OUTPUT';
OPTIONS LINESIZE=72 NODATE NONUMBER;
DATA   file1;
       INFILE 'a:\stacdat1.txt';
       INPUT w x y $ z;
PROC SORT DATA=file1;
          BY y;
DATA   file2;
       INFILE 'a:\stacdat2.txt';
       INPUT w x y $ z;
PROC SORT DATA=file2;
          BY y;
DATA   file3;
       SET file1 file2;
       BY y;
PROC PRINT DATA=file3;
     TITLE 'Result of SET with a BY statement';
DATA   file4;
       SET file1 file2;
PROC PRINT DATA=file4;
     TITLE 'Result of SET without a BY statement';
RUN;
Click stack2.txt , EDIT, SELECT ALL, COPY, paste this program into the SAS Program window, and run it.  Examine the Log window and the output to see what happened.   Using LOCAL and RECALL TEXT, edit the program in the Program window reversing the order of file1 and file2 in the SET command to see if it makes a difference.  That is,
change           SET file1 file2;               to                SET file2 file1;
Rerun the program.  Examine the Log and Output window to determine if there was any effect of reordering the files in the SET command.
 

 Merging Data sets

When two files contain different variables about the same experimental units (subjects, plots) the two files may be merged if there is some variable which identifies the unit in each file.  Suppose we have the two files below:
 

Data set 1

x
plot id
sample
3.2 3 1
3.7 2 1
4.3 1 2
4.5 2 2
4.8 3 2
5.1 1 1
 

Data set 2

y
plot id
sample
x
0.4 3 1 6.1
0.5 2 1 5.2
0.7 1 2 4.6
0.7 3 2 6.3
0.8 1 1 4.1
0.9 2 2 5.6
 

Suppose that plot id and sample number uniquely identify plot and time of sampling so the data must be merged using these two variables.  SAS requires that the two data sets be sorted on the merging variables so each data set must be sorted.  The variable x has been measured twice on the same plot and sample, e.g., perhaps subsampling occurred, so we really have x1 in data set 1 and x2 in data set 2.  The variable x must be relabeled as the two data sets are combined.  The variable y only occurs in data set 2.

The following program reads the data from the two files (mergdat1.txt and mergdat2.txt) on floppy disk, sorts each, merges the files into a single file, writes out a new file which includes all the original data and the mean and standard deviation of the two values of x (x1 and x2) for each plot and sample.

DM 'CLEAR LOG'; 
DM 'CLEAR OUTPUT'; 
OPTIONS LINESIZE=72 NODATE NONUMBER; 
TITLE1   'Merging data sets'; 
DATA file1; 
     INFILE 'a:\mergdat1.txt'; 
     INPUT x plot sample; 
PROC PRINT DATA = file1; 
Title2  'mergdat1 data';
PROC SORT DATA=file1; 
          BY plot sample; 
DATA file2; 
     INFILE 'a:\mergdat2.txt'; 
     INPUT y plot sample x; 
PROC PRINT  DATA = file2; 
Title2 'mergdat2 data'; 
PROC SORT DATA=file2; 
          BY plot sample; 
DATA file3; 
     FILE 'a:\newfile.dat';
     MERGE file1 (RENAME=(x=x1)) file2 (RENAME=(x=x2)); 
     BY plot sample; 
      MEANX=MEAN(x1,x2); 
      STDX=STD(x1,x2); 
      PUT plot sample x1 x2 meanx stdx y;
PROC PRINT DATA =file3; 
TITLE2 'complete merged data'; 
RUN;
Copy mergdat1.txt and mergdat2.txt  onto your floppy drive by clicking the links here, EDIT, SELECT ALL, COPY, paste into your word processor and use FILE, SAVE AS with TEXT ONLY (or ASCII DOS TEXT).   To run the program on the data, click merge.txt, use EDIT, SELECT ALL, COPY, paste this program into the SAS Program window and run it.  Examine the Log window and 3 pages of Output window to see what happened.
 

Homework # 6

Read sections 5.12, 6.1 - 6.10 of the text

1.  The file sub1.txt contains the pulse data, variables gender, pulse1, pulse2, and weight  for male non smokers.  The file sub2.txt contains the pulse data variables gender, pulse1 pulse2, and weight  for female nonsmokers.  Pulse1 has been set to missing if pulse1 is less than 60 in these two data sets.  Copy these two data sets onto your floppy disk, examine them, and write a SAS program to do the following:

a.  Read in the two data sets, combine them, and order them with respect to pulse1 (see section 6.3 of your text).  Use PROC PRINT to get a listing of the combined file.
b.  Using the combined data set, construct a single plot of pulse1 (vertical axis) versus weight (horizontal axis) which is labeled by gender (see section 5.12 in your text).
c.  Write out a new file to floppy disk which contains the combined data set.
In your homework solution include a copy of the two original data sets, the documented program, the PROC print output of the combined data, and a copy of the combined file which was written out to floppy disk.

2.  The two files earlyhw.txt and latehw.txt describe homework scores for a collection of students before (3 homework scores) and after the midterm (4 homework scores) in a statistics course.  Students are identified by the last four digits of their social security number in each file.  Copy these two data sets onto your floppy disk, examine them, and write a SAS program to do the following:

a.  Read in the two data sets, merge them based on the last four digits of the social security number,
b. Determine the total homework score for each student using SUM(x1, x2, ...) as MEAN(x1,x2) was used in merge.txt to determine the average in this lesson.
c.  Use PROC PRINT to output the combined file with the total for each student
d.  Use PROC MEANS to output the average score for each of the 7 homework assignments, and
e.  Write out the combined data set, including the total score for each student, to a new file on floppy.
In your homework solution include a copy of the two original data sets, the documented program, the PROC print output of the combined data, and a copy of the combined file which was written out to floppy disk.