A Gentle Introduction to the Data Step

In many other data processing languages, such as Stata or SPSS, you begin manipulating data by opening a dataset. This means that these languages are implicitly designed to handle one dataset at a time. What sets SAS apart is that it can accommodate multiple datasets in one program. It can do this because of the basic input/output architecture built into SAS: the datastep.

Put simply, in a datastep you create a dataset using input data. The datastep is also one of the primary ways through which you manipulate data by creating new variables. If you understand all that is involved in datastepping, then you will have a good foundation for understanding everything else that is involved in processing data in SAS.

While there is much about datastepping to discuss, a good SAS datastep usually has four ingredients:

  1. A libname statement that points SAS to the location of the input data (if you have a SAS formatted dataset).
  2. A “data” statement that states the name of a SAS dataset that you would like to create.
  3. A “set” statement that specifies the location and name of the input data.
  4. Extra code to create new variables or manipulate the input dataset.

Let’s go over a few examples of datastepping using the income data. Here are a few sample records:

/root/sampledata/income.sas7bdat

id sex salary bonus
1 M 50 2
2 M 60 3
3 F 40 1
4 F 55 1.5

Creating a Copy of an input dataset

Let’s first demonstrate a basic datastep where we create a copy of the income dataset (displayed above).


/******************************************
Comment:
Basic copy program:
This program creates a copy of the income
dataset using a datastep.

*******************************************/

/* Location of the input data */
libname sst "/root/sampledata";

/* Create new dataset */
data new;
set sst.income ;
run;

/* Print out both datasets */
proc print data = sst.income; run; /* Original data */
proc print data = new;        run; /* New data      */

 


Code Review

The program begins with a libname statement, which tells SAS the location of the input dataset. The libref, or “sst”, is the “name” that we are assigning to the directory. This means that instead of writing out the full directory path, we can simply write sst.

libname sst "/root/sampledata";

Next, we initiate a datastep by writing the word “data.” This tells SAS that we are about to create a dataset. The text after the word data will be the name we assign to the dataset. Here, I am naming the dataset “new.”

data new;
set sst.income ;
run;

In order to bring a SAS formatted dataset into the datastep, we need to use the set statement. The set statement specifies the location and name of the SAS dataset we are reading into the program. We specify the location of the input dataset with a libref. We then specify the name of the actual dataset by following the libref with a period, followed by the name of the actual dataset without its file extension (SAS assumes that the file extension is sas7bdat).

data new;
set sst.income ;
run;
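
As a minimal sketch of the libref.dataset naming pattern: in a default SAS session, a one-level name such as new is stored in the temporary WORK library, so the datastep above could also be written with both names spelled out. The second datastep is a hypothetical variation that writes a permanent copy back to the sst directory. The name income_copy is made up for illustration, and the step assumes you have write access to that directory.

/* Same as "data new;" under default settings */
data work.new;
set sst.income ;
run;

/* Hypothetical: save a permanent copy in the sst directory */
data sst.income_copy;
set sst.income ;
run;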

Finally, a datastep ends with a run statement, as do most SAS procedures. This tells SAS that the datastep is over.

data new;
set sst.income ;
run;

Creating new variables with a datastep

Now that we know how to create a basic datastep, we can begin creating variables in a datastep. In this example, I will create a variable called income by adding together salary and bonus.



/******************************************
Comment:
Basic recode program.
This program creates a new variable
from the income dataset.

*******************************************/

/* Location of the input data */
libname sst "/root/sampledata";

/* Create new dataset */
data new;
set sst.income ;

/* Create new variable */
income = salary + bonus;

run;

/* Print out both datasets */
proc print data = sst.income; run; /* Original data */
proc print data = new;        run; /* New data      */


Code Review

This program is basically the same as the first. The one key difference is in the datastep. Here we create a new variable called income by adding together salary and bonus. If you are experienced with packages like Stata, then this may seem strange to you because in SAS there is no command to create a variable. You just write the name of the variable you want to create after the set statement, followed by an equal sign (=), followed by some operation. In this case, the operation is the addition of the salary and bonus variables.

/* Create new dataset */
data new;
set sst.income ;

/* Create new variable */
income = salary + bonus;

run;

We can print out the datasets to confirm that the new variable is in the “new” dataset.

proc print data = sst.income; run; /* Original data */

id sex salary bonus
1 M 50 2
2 M 60 3
3 F 40 1
4 F 55 1.5

proc print data = new; run; /* New data */

id sex salary bonus income
1 M 50 2 52
2 M 60 3 63
3 F 40 1 41
4 F 55 1.5 56.5
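
The same pattern extends to several new variables in one datastep. The sketch below is an illustration rather than part of the original program, and the variable name bonus_pct is made up. Because the statements run in order for each record, a later statement can use a variable created by an earlier one.

data new;
set sst.income ;

income = salary + bonus;

/* Hypothetical: bonus as a percentage of total income */
bonus_pct = 100 * bonus / income;

run;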

Creating Multiple Datasets

As I mentioned above, SAS can handle multiple datasets in the same program. It can do this because you can write multiple datasteps in a single program. To demonstrate this, let’s pretend that we have two income datasets that we want to process, and we want to create the income variable again for each of these datasets. The first dataset is an income dataset from 2009:

/root/sample/2009/inc2009.sas7bdat

id year salary bonus
1 2009 50 2
2 2009 60 3
3 2009 40 1
4 2009 55 1.5

The second datafile contains updated figures from 2010:

/root/sample/2010/inc2010.sas7bdat

id year salary bonus
1 2010 70 1
2 2010 60 3
3 2010 30 5
4 2010 55 1.5

Here is how we would create a dataset for each of these annual files, and create an income variable in each by adding together salary and bonus.



/* Location of the input data */
libname inc09 "/root/sample/2009";
libname inc10 "/root/sample/2010";

/* Create 2009 dataset */
data inc09;
set inc09.inc2009 ;

/* Create income variable */
income = salary + bonus;

run;

/* Create 2010 dataset */
data inc10;
set inc10.inc2010 ;

/* Create income variable */
income = salary + bonus;

run;

/* Print out both datasets */
proc print data = inc09; run;  /* 2009 data */
proc print data = inc10; run; /* 2010 data */


Code Review

This program is similar to the programs we reviewed earlier. The only difference is that there are now two libnames and two datasteps, one for each datafile that we want to process. Just as we can have multiple datasets in a single SAS session, we can also have multiple libnames pointing to multiple directories:


libname inc09 "/root/sample/2009";
libname inc10 "/root/sample/2010";

Next, we start two data steps and create two datasets – inc09 and inc10. The set statements reference different input files. Aside from that, because the files have the same variable names, the coding is the same.


data inc09;
set inc09.inc2009 ;

income = salary + bonus;

run;

data inc10;
set inc10.inc2010 ;

income = salary + bonus;

run;

Because we now have two different datasets available in the SAS session, we can reference these datasets in other datasteps and SAS procedures. In this program, I printed out each of the new datasets with the PRINT procedure:

proc print data = inc09; run; /* 2009 data */
proc print data = inc10; run; /* 2010 data */
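
As a sketch of referencing these datasets in another datastep, the code below stacks the two annual datasets into one by listing both names on a single set statement. The dataset name incall is made up for illustration, and the code assumes the inc09 and inc10 datasets created above already exist in the session.

/* Hypothetical: stack the 2009 and 2010 datasets */
data incall;
set inc09 inc10 ;
run;

proc print data = incall; run; /* Both years */

With the sample data, incall would hold eight records: the four 2009 records followed by the four 2010 records.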

How datasteps work

So far we have demonstrated how to do some basic data manipulations using the SAS datastep without any consideration for how SAS operates in the background. While it is not necessary for simple data manipulation, understanding how the datastep works will become important once you attempt more complex tasks. Consider one of the datasteps we reviewed above:

data inc09;
set inc09.inc2009 ;

income = salary + bonus;

run;

In this datastep, we process the inc2009 input dataset and create the inc09 dataset. Let’s say that the inc2009 dataset contains four rows of data (or records). The SAS datastep will process these records one at a time. To illustrate, the datastep will begin by taking the first record from the inc2009 dataset. It will then send this record through all of the datastep code. In this case, it will create one new variable by adding together salary and bonus. After this, it will reach the run statement, and since there are no other SAS statements, the datastep will output the record to the inc09 dataset. The datastep will repeat this process until there are no observations left in the input file.
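
One way to watch this iteration happen, offered as a sketch, is to write a line to the SAS log on each pass. The automatic variable _N_ counts how many times the datastep has iterated, and the put statement writes to the log. The output dataset name inc09_check is made up for illustration, and the code assumes the inc09 libname from the earlier program is still assigned.

data inc09_check;
set inc09.inc2009 ;

income = salary + bonus;

/* Write one log line per record as it passes through the datastep */
put _n_= id= income=;

run;

With the four sample records, the log should show four lines, one for each record that passed through the datastep.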

Summary

Datastepping is the primary method that most people use to process and manipulate data in SAS. Most of the more complex programming concepts in SAS will come naturally to you if you gain a firm understanding of the datastep.

A datastep begins with the word “data,” which instructs SAS to create a dataset, followed by the name you would like to give the dataset. Next, we read input data into the datastep with a SET statement. A complete SET statement has a libref (which points SAS to a directory) and the name of the dataset you would like to read into the program (without the file extension). After the SET statement, you can proceed to write code that will create new variables and manipulate the dataset. In SAS, we create a variable by first writing a variable name, followed by an equal sign, followed by some operation, such as addition or subtraction.

A datastep operates iteratively. This means that the code will iterate once for every record (or row of data) in the input dataset. It begins by reading in the first record of the input datafile, sends the record through the datastep code, and then outputs this new record to the new dataset that the datastep is creating. SAS will repeat this process until there are no more records left to process in the input datafile.
