Let’s start our journey of learning SAS with a situation related to our favorite school days.
First of all lets create a library in which all our codes and datasets can be saved for future reference.
Here is the code for same.
libname A15004 "/folders/myfolders/Assignments";
There is a school in which kids are of age group 12 and 13, they have given their quiz and got grades like ‘A’, ‘B’, ‘C’, ‘D’, ‘F’ . They also completed their Mid term and final exams. Now the problem is Mid terms and finals are in numerical values (see table below) but Quiz in grades as explained earlier.
Download the code from here to create the following dataset
Now we want to put all the kids of age 12 under category 6 and kids of age 13 under category 8, we shall assign numeric equivalents to grades for example for grade A we shall assign 95, for B, C, D, F we shall assign 85, 75, 70 and 65 respectively.
Using SAS we can do this very simply.Here is the code for our problem.
data A04_school; set A15004.school; *simple if else block of two lines to categorize kids of different age groups; if (Age = 12) then Grade=6; else if (Age = 13) then Grade = 8; *Below five lines is to give numerical equivalents to grades in Quiz; if(Quiz EQ 'A') then EqMarks = 95 ; else if(Quiz EQ 'B') then EqMarks = 85; else if(Quiz EQ 'C') then EqMarks = 75; else if(Quiz EQ 'D') then EqMarks = 70; else if(Quiz EQ 'F') then EqMarks = 65; *Below code is to compute the course grade with weightages of 20% to quiz marks 30% to mid term and 50% to final marks; CourseGrade = (EqMarks*0.2)+(Midterm*0.3)+(Final*0.5); proc print data=A04_school noobs; run;
So our coding part is done, now it is time to check output. Please look into below figure for your ease I have also highlighted the new variables with yellow. Those are the new variables which were created with our code. You can compare initial dataset with the final output dataset.
Please be careful while using the code downloaded from above link, you need to change your library appropriately, I used A15004 as my library in this code. This is applicable for all the codes explained here after.
Learning: We have learned how to use if-else statements using a condition and to create new variables for the benefit of further analysis.
Lets look at another situation where we have data of 1000 patients who are in different age categories like young and old, we have captured their blood group, WBC count and RBC count. Now our objective is to find out their cholesterol level – High, Medium or Low.
Look into the head of our dataset with just 16 observations (patients).
If you are interested to create this dataset on your own, please download the code from here.
Now our aim is to check certain conditions to categorize all our patients into three cholesterol groups i.e., High, Medium and Low. We should not forget there are certain missing values where we cannot compute the exact cholesterol group, so we should display ‘Data Missing’ in such cases.
I’ve decided to use ” select, when and end clause” to do this task easily. Please look at the code screenshot below for more details.
data A04_cholgrp_blood; set A15004.blood; *We are defining length for our new variable so that it doesn't truncate our 'Data Missing'; length cholgroup $16; *Here is how we are using select, when and end clause to do our task that is to seggregate cholesterol group with the condition like cholesterol value less than or equal to 110 and cholesterol value less than or equal to 140; select; when (missing(chol)) cholgroup = "Data Missing"; when (Chol le 110) cholgroup = "Low"; when (Chol le 140) cholgroup = "Medium"; otherwise cholgroup = "High"; end; proc print; run;
Now its time for output, please check the following screenshot to have a quick view about output from our program.
Learning: The important takeaway from this problem is we learned how to use select, when and end clause perfectly and this example enhanced our understanding even significantly with the condition we implemented to assign the cholesterol group.
Lets learn how to use missing function in this section with the help of a dataset which contains variables lets say ‘A’, ‘B’, ‘C’ in these variables lets put 7 observations in each by deliberately missing some values in the variables.
Now our task is to find how many missing values each variable contains. For this we shall create three more variables to see the count value.
If we translate our problem into SAS code it looks like this:-
data A04_missing; input A $ B $ C $; if missing(A) then MissA+1; if missing(B) then MissB+1; if missing(C) then MissC+1; datalines; X Y Z X Y Y Z Z Z X X . Y Z . X . . ; proc print noobs; run;
you can see how missing function has been used to count how many missing values each variable (A, B & C) contains.
Let’s look at output for better picture of what we are trying to derive.
You can see the count values for Missing values of A is 0, Missing values of B is 1 and Missing values of C is 3.
Learning: This section taught us about using Missing function to count any missing values in the variables, which is very critical when we are dealing with more number of observations. We can take decision based on this whether to treat missing values or just ignore them.
Now, lets get something related to racing we have three speed reading methods, for each method I have 10 observations of speeds in an unstructured way, now I have to read 10 observations for each speed reading method.
Here I would like to use DO statement. Please look at the following code.
data A04_SRM; do Method= 'A','B','C'; do Subj=1 to 10; input Speed @; output; end; end; datalines; 250 255 256 300 244 268 301 322 256 333 267 275 256 320 250 340 345 290 280 300 350 350 340 290 377 401 380 310 299 399 ; proc print data=A04_SRM noobs; run;
Now the output of the above code will look as follows.
Learning: In this section we have learned how to use do statement and nested do statement. This will be immensely helpful for loading data into SAS.
In this section lets play with dates for a while I have three observations which have dates in different formats, as you might be already aware that dates can be written different formats like mmddyyyy, ddmmyyyy, ddmmmyyyy etc.,
We have these dates in three rows
01/03/1950 01/03/1960 03Jan1970 05/15/2000 05/15/2002 15May2003 10/10/1998 11/12/2000 25Dec2005
You can see there are different formats in the above three lines. Now our task is to normalize all the formats to mm/dd/yyyy using SAS and not only that I want to save these dates as date 1, date 2 and date 3 for three observations
I would like to see difference between date 2 and date and assign it to date12 variable, then see difference between date 3 and date 2 and assign it to date 23 variable, if my output is in rational number format I want to round it off to nearest integer value.
Here is the SAS code for same.
data A04_dates1; input @2 date1 mmddyy10. @13 date2 mmddyy10. @24 date3 date9. ; format date1 date2 date3 mmddyy10.; date12 = yrdif(date1,date2,'Actual'); date23 = yrdif(date2,date3,'Actual'); format date12 date23 2.1; datalines; 01/03/1950 01/03/1960 03Jan1970 05/15/2000 05/15/2002 15May2003 10/10/1998 11/12/2000 25Dec2005 ; title "Display of dates and difference between them"; proc print; run;
Now lets look at the output of the above code.
Here you can see three different dates which are formatted into single format and two created variables to find the difference between ages of different dates date12 for difference between date2 and date1 and date23 for difference between date3 and date2.
We also utilized yrdif function to calculate difference between two dates. It takes arguments of two date variables and ‘Actual’
Learning: We have learned how to load dates of various formats and take the difference between dates.
In this part of blog lets see a different situation of bicycle sales of a company which is spread across in countries like USA, France, Italy etc.,
Our task is to subset the dataset with respect to two measures one is country name and other model type
in our first subset we want to have all the observations where country is USA and Model type of bicycle is Mountain Bike and lets save it into a dataset named MountainUSA
in our second subset we want to have all the observations were country is France and Model Type of Bicycle is Road Bike.
Here is how our master dataset look like
Lets look at the code to subset based on the discussed criteria from the above dataset.
data a04_MountainUSA a04_RoadFrance; set a15004.bicycles; if Country = 'USA' and Model = 'Mountain Bike' then output a04_MountainUSA; else if Country = 'France' and Model = 'Road Bike' then output a04_RoadFrance; run; title "List of MountainUSA"; proc print data=a04_MountainUSA; run; title "List of RoadFrance"; proc print data=a04_RoadFrance; run;
The output for the following code is shown in a screenshot below.
Learning: We have learned how to subset two datasets with some conditions from a master dataset. And if you observed we have done this in single data step.
Lets learn some new things in this section mainly _NULL_, CONSTANT, ‘exactint’ and Put _ALL_
our task is to generate the largest number a computer can score with respect to byte length so we will create largest number for 3 bytes, 4 bytes, 5 bytes, 6 bytes and 7 bytes.
firstly lets discuss about _NULL_ this is a reserved keyword for telling SAS engine to not create a dataset just display the variables there in the code. You will better understand when I show you the code below. we use Constant to extract largest number for the given byte length.
Please look at the code for more better understanding.
data _NULL_; Largest3= constant('exactint',3); Largest4= constant('exactint',4); Largest5= constant('exactint',5); Largest6= constant('exactint',6); Largest7= constant('exactint',7); put _all_; run;
‘exactint’ is used to take largest integer value, put _all_ is to just display all the variables which are there in the _NULL_ step i.e., Largest3 to Largest7
The output looks like this not in the result window but in the log window.
Please concentrate on the highlighted part in the above screenshot.
Learning: In this section we have learned many new things like _NULL_, Constant, ‘exactint’, put _ALL_ etc., and we learned how to display the variables without actually creating a dataset using _NULL_
In this section we are going to learn somewhat advanced concept of SAS we have a mixed variable that is a variable which holds a combination of digits and alphabets like 123kgs
now we are interested to extract only number from such type of observation and manipulate it. Lets understand that with an example.
There is data about weights of few people but in very unstructured way that means some are having their weights in lbs and some in kgs and also some kgs are in capital letters some lbs are lower case letters.
Lets solve this problem with SAS code. We are going to use new functions such as find, input, compress for string manipulation.
Please look at the following code, you will be able to understand more clearly.
data A04_lbs; set a15004.study; if find(Weight,'lb','i') then WeightLB= input(compress(Weight,,'kd'),8.); else if find(Weight,'kg','i') then WeightLB= 2.2*input(compress(Weight,,'kd'),8.); WeightLB=round(WeightLB); proc print data=A04_lbs; run;
After executing the program this is how the output looks:-
Learning: So, in this section we have learned how to extract only numbers from a mixed character string and then manipulate using compress, input functions.
In this section we shall learn about displaying the dataset in the form which we like. Lets take an example and understand in a better way. There is a sales data set where data points are number of sales made by each person divided by region, quantity and sum of sales.
Now lets calculate the TotalSales and Quantity of East and West Region using SAS.
Please look at the following code for more information.
proc sort data=a15004.sales out=sales; by Region; run; title 'Sales Figures from the SALES Dataset'; proc print data= sales label noobs ; by Region; where (Region ="East" or Region = "West"); *where Region in("East","West"); Var Region Quantity TotalSales; sum Quantity TotalSales; format TotalSales dollar10.2 Quantity Comma7.; run;
So, here we used a new procedure to sort the dataset sales into regions i.e., Proc Sort.
Lets look at the output of the above code.
Learning: In this section we have learned how to sort the data based on certain variables in this case Region and to manipulate data after sorting and displaying it as desired.
This is the final section of this blog post, lets learn something interest about a new procedure called Proc Report. In this what we will learn is about column function, define function and compute function. All these functions come under Proc Report. This is the powerful procedure used for reporting.
Lets take the data about patients who has blood pressure and their DBP, SBP, Age values are recorded for our analysis.
Now we know that Females will become Hypertensive if their SBP is above 138 or DBP is above 88 else No
Similarly Males become Hypertensive if their SBP is above 140 or DBP is above 90 or esle No.
We will now try to accomplish this task using SAS. Please look at the code below for better understanding.
title "Listing of Hypertensive"; proc report data=a15004.bloodpressure nowd headline ; column Gender SBP DBP Hypertensive; define Gender/group; define SBP/order; define DBP/order; define Hypertensive /computed; compute Hypertensive; select; when (Gender = 'F' and SBP gt 138 or DBP gt 88 ) Hypertensive = 1; when (Gender = 'F' and SBP lt 138 or DBP lt 88 ) Hypertensive = 0; when (Gender = 'M' and SBP gt 140 or DBP gt 90 ) Hypertensive = 1; when (Gender = 'M' and SBP lt 140 or DBP lt 90 ) Hypertensive = 0; otherwise Hypertensive = 'NA'; end; endcomp; run; quit;
So this is how we can write the code for our above problem and now the result will be displayed in terms of groups os Males and Females Yes or No is represented with 1 and 0 respectively.
Please look at the output of the same.
Here is how our output looks after executing our SAS code.
Learning: In this section we learned a new concept i.e., about Proc Report and various functions used with it like computed, define, group, order etc., How to execute a computed block. We also learned how to define a new variable into the report of any dataset.
By this we shall conclude the learning session of SAS for now, if you have any comments or feedback about the codes pasted here please share the same in comments.
See you guys very soon.