Combining Steps with PROC SORT

Anyone who has been programming in SAS for some time has a list of DOs and DON’Ts for the programs they manage. One of the biggest rules I have is removing data steps dedicated only to sub-setting and keeping/dropping/renaming variables. Not only can these tasks be carried out in other steps, but it also leads to a lot of redundant code and longer run times.

As am example, let’s pretend that we have a dataset containing income data for individuals across the span of three years, and we want to sort the dataseet by ID and Year, but only keep records for men prior to 2012. After we sort, we want to compute the average income between the years of 2010-2012. Here are a few sample records:

/root/mydata/income_years.sas7bdat

ID state sex year ic
1 1 M 2011 30
1 1 M 2012 30
1 1 M 2013 40
2 6 F 2010 40
2 6 F 2011 45

where:

ID: Unique ID for respondent
State: Current state of residence
Sex: Sex of respondent
year: Year of interview
ic: income in current year

Here is how we would accomplish this task if we have data steps dedicated to the tasks of droping/keeping/renaming/subsetting.


libname dsn "/root/mydata";

/**************************************
Drop, keep, rename relevant variables
***************************************

data prep;
set dsn.income_years;

if year < 2013 and sex= "M";

drop state;
rename ic=income;

run;


/****************************
Sort dataset by id and year
*****************************/

proc sort data = prep ;
by id year;
run;


/****************************************
Get the average income for individuals
between the years of 2010-2012
*****************************************/

proc summary data = prep;
by id ;
var income;
output out=avg_inc mean=;
run ;

proc print data = avg_inc; run;


This is a lot of code for a relatively simple task! How can we shorten this program and increase efficiency? One property of SAS PROCEDURES is that they can all output datasets, which also enables you to keep and drop variables in the OUT options. Another property is that they almost all support a WHERE option, which allows you to subset observations out of the dataset you are creating. Here is how I combine the data step and SORT PROCEDURE from above.


libname dsn "/root/mydata";

/****************************************
sort dataset while dropping, renaming
and subsetting the data
*****************************************/

proc sort data = dsn.income_years
out = prep (drop=state rename=(ic=income));
where sex="M" and year < 2013;
by id year;
run;

/****************************************
Get the average income for individuals
between the years of 2010-2012
*****************************************/

proc summary data = prep;
by id ;
var income;
output out=avg_inc mean=;
run ;

proc print data = avg_inc; run;



Code Review

In place of the datastep, we begin by initiating a PROC SORT and reference the input dataset in the data statement and output a temporary dataset with the OUT statement.

proc sort data = dsn.income_years
out = prep (drop=state rename=(ic=income));
where sex="M" and year < 2013;
by id year;
run;

Instead of dropping, renaming and subsetting in a datastep, I take care of these tasks in the dataset options and WHERE statement:

proc sort data = dsn.income_years
out = prep (drop=state rename=(ic=income));
where sex="M" and year < 2013;
by id year;
run;

The remaining code should remain the same. If you are really adept with SAS, you can probably see a way to forgo the PROC SORT altogether:


libname dsn "/root/mydata";

proc summary data = dsn.income_years nway; 
class id ;
where sex= "M" and year < 2013;
var ic;
output out=avg_inc (keep =id ic rename=(ic=income)) mean=;
run ;

proc print data = avg_inc; run;


Summary

The power of statistical software packages like SAS is that it can handle and summarize large amounts of data. However, if you work with large amounts of data and write a lot of redundant code, your programs will execute so slowly that they will be more of a frustration than anything. Fortunately, as we have seen in this brief example, SAS provides you with an array of tools that enable you to shorten code. When you write programs in SAS, keep an eye out for ways to combine steps.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s