Merging two datasets in SAS

There are times when we need to create a new variable by using variables that reside in separate data files. Circumstances such as these require that we merge, or join, two datasets together. As an example, pretend that we want to compute the total population of cities in the United States by adding together the male and female population, but the variables for male and female population reside in two different data files. Below are a few sample records:

/root/sample_directory/maledata/malepop.sas7bdat:

city   malepop
01      100
02      150
03     190

/root/sample_directory/femdata/femalepop.sas7bdat:

city    femalepop
01       50
02       100
03       20

Sample Program

How do we merge two datasets together, and compute a new variable using our hypothetical datasets? Here is a sample program in BASE SAS that would accomplish this task:

/*COMMENT: Specify the location of the input data files */
 LIBNAME male "/root/sample_directory/maledata";
 LIBNAME fem  "/root/sample_directory/femdata";


/*COMMENT: Sort the two data files by a common ID variable */
 PROC SORT DATA = male.malepop OUT = male;
 BY city;
 RUN;


PROC SORT DATA = fem.femalepop OUT = female;
BY city;
RUN;


/*COMMENT: Join the files, compute new variable */
DATA totalpop;
  MERGE male female;
  BY city;
  totpop = malepop + femalepop ;
RUN;


/*Print the results of the output using the PRINT procedure */
PROC PRINT DATA = totalpop ; run ;

Code Review

Here is a step-by-step breakdown of the program:

We first define where the two datasets reside on your computer with a pair of LIBNAME statements. Note that if your input data all resides in one directory, you would only need one LIBNAME statement. Remember, LIBNAMES only tell SAS the location of your input data. They can also tell SAS where to store output datasets.

 

LIBNAME male "/root/sample_directory/maledata";
LIBNAME fem  "/root/sample_directory/femdata";

 

One requirement for merging in a SAS data step is that the two datasets must be sorted by a unique ID. In SAS, we sort datasets using the SORT procedure. In order to preserve the original state of the data file being sorted, we create temporary datasets using the OUT option. The OUT option instructs the SORT PROCEDURE to output a temporary dataset, which in this case is a sorted version of the original. It also ensures that the original datasets will remain unchanged. We place the unique ID that we would like to sort by in the BY statement of the SORT procedure. In this example, our unique ID is the variable "city."

 

PROC SORT DATA = male.malepop OUT = male;
BY city;
RUN;

PROC SORT DATA = fem.femalepop OUT = female;
BY city;
RUN;

 

Next, in order to combine the two dataets, we begin by creating a new dataset. As always, we create a dataset in SAS by starting with the keyword "data" followed by the name of the dataset you would like to create. Below, I name this dataset "totalpop." After naming the dataset, we merge the two data files together by writing the keyword "merge" followed by the name of the two datasets we would like to combine. We specify the unique ID to merge by with the "by" option. Recall above that our unique ID in this example is the "city" variable. We compute total population by adding male and female population together.

 

DATA totalpop;
MERGE male female;
BY city;

totpop = malepop + femalepop ;

RUN;

 

We can take a look at the final dataset using the PRINT procedure:

PROC PRINT DATA = totalpop ;  RUN;

city    malepop      femalepop   totpop
01      100                50                   150
02      150                100                 250
03      190                20                   210

Summary

While there will always be more than one way to accomplish a task in SAS, merging usually entails 4 steps:

  1. Specifying the location of the input datasets using the LIBNAME statement
  2. Identifying a unique ID variable (or variables) in both datasets that uniquely identify rows of data.
  3. Sorting the input datasets by a common ID that uniquely identifies records
  4. Merging the two datasets together in a data step using the MERGE and BY statements.

Further Reading

The UCLA ATS website has many helpful examples on how to merge in SAS. Here is one article:

http://stats.idre.ucla.edu/sas/modules/match-merging-data-files-in-sas/

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s