Subsetting IF and WHERE Statements

When we subset data in SAS (or any language, for that matter), we are removing unneeded records (or rows of data) from the dataset we are creating. For example, if we only wanted to analyze females, but our input data had records for both males and females, then we would have to subset out male records.

SAS offers two statements for sub-setting: IF and WHERE. Which statement you use should depend upon whether you need the record to be processed through the datastep, and how important efficiency is to your data processing tasks.

For this example, let’s use a hypothetical dataset called income, which has income data for both men and women. Our goal is to subset out the male records:

/root/mydata/income.sas7bdat

id sex income
1 F 50
2 F 60
3 F 40
4 M 30
5 M 70
6 M 65

The Subsetting IF Statement

Subsetting data using the IF statement in a SAS datastep is simple enough, but may be bewildering if you have never seen the syntax before. Here is an example of how to subset out observations for men using the IF statement:


libname dsn "root/mydata";


/*******************************
Subset out male observations
********************************/

data female;
set dsn.income ;

if sex = "F" ;

run;


proc print data = female; run


Code Review

If you are experienced with another programming language, you may find the “IF” statement confusing. You may be thinking to yourself: “where is the ‘then’ syntax.” To understand what is going on here, you need to think like the SAS datastep. The datastep begins by reading in the first observation from the income dataset. It then reaches the IF statement and evaluates whether the SEX variable is in fact equal to “F”.  If it is equal to “F”, the datastep outputs this observation to the dataset we titled “female.” The datastep will then repeat this process until there are no more records left in the input dataset. Because the subsetting IF statement requires that we process all of the records, it is generally less efficient than the WHERE statement.

The Subsetting WHERE Statement

The where statement resembles the general form and pattern of the if statement, but differs in how it processes the data:


libname dsn "root/mydata";


/*******************************
Subset out male observations
********************************/

data female;
set dsn.income ;

where sex = "F" ;

run;


proc print data = female; run ;


SAS also offers an additional way to write in the WHERE syntax. Instead of writing a WHERE statement into the body of the datastep code, you can write it into the options of the SET statement. This is the method that I prefer because it makes it more explicit:


libname dsn “root/mydata”;


/*******************************
Subset out male observations
********************************/

data female;
set dsn.income (where = (sex = "F" ));
run;

proc print data = female; run ;


Code Review

Now we can talk about the difference between IF and WHERE. As we discussed earlier, the subsetting IF statement requires that SAS  actually process a record through the datastep and evaluate the IF syntax. This means that if we had 1,000 records, the datastep would execute 1,000 times.

WHERE operates a bit differently. Instead of processing each record through the datastep, WHERE syntax instructs SAS to not send a record through the datastep if a condition is evaluated as true. So if we had a dataset of 1,000 records, and 500 were male and the other 500 were female, then the datastep in the example with the WHERE statement would only execute 500 times.

When do you use an IF and WHERE Statement?

Generally speaking, if you are concerned about efficiency, and if you are writing a long datastep, you should consider using WHERE syntax because it will always be more efficient. This is because WHERE will not send a record through datastep processing if the logical test evaluates as true, while IF requires that every record be processed through the datastep.

You should always use IF if you need to use the variable you are trying to subset by to create a new variable. For example, if you want to create dataset of women and men that earn more the 50,000 dollars a year, and create a new variable for individuals earning more than 50,000 dollars a year, then you would need to use an IF statement:


libname dsn "root/mydata";


/*******************************
Subset out male observations
********************************/

data female;
set dsn.income ;

if income > 50 then inc=1;
else inc=0;
(if inc=1 and sex = "M" ) or (sex = "F" );
run;


proc print data = female; run ;


 

In this example, we only keep records that are either female, or records that are male and have incomes above 50,000 dollars. Because we require records for both male and females, we use the IF statement to make this logical test, and refrain from using WHERE syntax.

Summary

Filtering (or subsetting) out records is an important part of creating analytical datasets, and for programming in general. SAS offers two methods of subsetting. The IF statement processes each record in the input dataset, conducts a logical test, and outputs the record to the dataset you are creating if the test resolves to true. The WHERE statement will not send a record though a datastep if the logical test evaluates to true, which makes it more efficient than the if statement.

There are two considerations you should go through when determining whether to use an subsetting IF or WHERE statement. The first is efficiency. If you are using a large dataset consisting of millions of records, then where statements will dramatically shorten your processing time because the data step will not have to process every record. If you need to use the variable you are trying to subset by to create a new variable, you will have to use IF.

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s