There are times when we need to create a new variable by using variables that reside in separate data files. Circumstances such as these require that we merge, or join, two datasets together. As an example, pretend that we want to compute the total population of cities in the United States by adding together the male and female population, but the variables for male and female population reside in two different data files. Below are a few sample records:
How do we merge two datasets together, and compute a new variable using our hypothetical datasets? Here is a sample program in BASE SAS that would accomplish this task:
/*COMMENT: Specify the location of the input data files */ LIBNAME male "/root/sample_directory/maledata"; LIBNAME fem "/root/sample_directory/femdata";
/*COMMENT: Sort the two data files by a common ID variable */ PROC SORT DATA = male.malepop OUT = male; BY city; RUN;
PROC SORT DATA = fem.femalepop OUT = female; BY city; RUN;
/*COMMENT: Join the files, compute new variable */ DATA totalpop; MERGE male female; BY city; totpop = malepop + femalepop ; RUN;
/*Print the results of the output using the PRINT procedure */ PROC PRINT DATA = totalpop ; run ;
Here is a step-by-step breakdown of the program:
We first define where the two datasets reside on your computer with a pair of LIBNAME statements. Note that if your input data all resides in one directory, you would only need one LIBNAME statement. Remember, LIBNAMES only tell SAS the location of your input data. They can also tell SAS where to store output datasets.
LIBNAME male "/root/sample_directory/maledata";
LIBNAME fem "/root/sample_directory/femdata";
One requirement for merging in a SAS data step is that the two datasets must be sorted by a unique ID. In SAS, we sort datasets using the SORT procedure. In order to preserve the original state of the data file being sorted, we create temporary datasets using the OUT option. The OUT option instructs the SORT PROCEDURE to output a temporary dataset, which in this case is a sorted version of the original. It also ensures that the original datasets will remain unchanged. We place the unique ID that we would like to sort by in the BY statement of the SORT procedure. In this example, our unique ID is the variable "city."
PROC SORT DATA = male.malepop OUT = male;
PROC SORT DATA = fem.femalepop OUT = female;
Next, in order to combine the two dataets, we begin by creating a new dataset. As always, we create a dataset in SAS by starting with the keyword "data" followed by the name of the dataset you would like to create. Below, I name this dataset "totalpop." After naming the dataset, we merge the two data files together by writing the keyword "merge" followed by the name of the two datasets we would like to combine. We specify the unique ID to merge by with the "by" option. Recall above that our unique ID in this example is the "city" variable. We compute total population by adding male and female population together.
MERGE male female;
totpop = malepop + femalepop ;
We can take a look at the final dataset using the PRINT procedure:
PROC PRINT DATA = totalpop ; RUN;
city malepop femalepop totpop
01 100 50 150
02 150 100 250
03 190 20 210
While there will always be more than one way to accomplish a task in SAS, merging usually entails 4 steps:
- Specifying the location of the input datasets using the LIBNAME statement
- Identifying a unique ID variable (or variables) in both datasets that uniquely identify rows of data.
- Sorting the input datasets by a common ID that uniquely identifies records
- Merging the two datasets together in a data step using the MERGE and BY statements.
The UCLA ATS website has many helpful examples on how to merge in SAS. Here is one article: