Learning by doing – Simple SAS codes (Part II)

Welcome to the second post of the series Learning by Doing – Simple SAS codes.

Recap

Lets recap what all we have learned in the last post. The main topics which we covered in last post (Click here to open) are as follows:-

  1. if-else logic
  2. select-when-otherwise-end statement
  3. missing function
  4. do loop
  5. date functions like yrdif
  6. multiple datasets in single data step
  7. _NULL_, Constant, exactint, Put_ALL_
  8. find,input,compress and round functions
  9. proc sort, by function, where condition
  10. proc report- define, compute

Intro

In this post we are going to look at the following chapters:-

  • Summarizing data
  • Counting frequencies
  • Creating tabular reports
  • Output delivery systems
  • Generating high quality graphics

Chapter-16 – Summarizing your data

In this chapter we will learn how to summarize data with the help of procedure called proc means and its various functions like by, class, _TYPE_ etc.

Lets solve few problems in these chapters.

2nd question from Ron Cody.

We have data of college students, now we have to find number of students, how many students data is missing, minimum value of each variable, maximum, median for two variables mainly ClassRank and GPA using BY function and Class function.

We’ll be using the same directory A15004 which we used for previous post (Click here to open)

Code for the program is:-

Code

You can download code by clicking on this link

The first proc step of this code will sort the data according to gender and schoolsize. Second and third proc steps are two different ways to summarize data one is by ‘BY’ function and second is by ‘Class’ function. The respective outputs are pasted below.

Output1

Output for 2nd method of doing same is as follows.

Output2

Learning in this chapter

Learned various types of ways to summarize data using PROC Means, first we summarized using by function and second we have done using a class function.

Chapter 17: Counting frequencies

In this chapter we will learn how to use proc freq in detail.

Lets take a problem from Ron Cody and apply our understanding using SAS tool.

In this chapter we will use various procedures like Proc freq, Proc sort & Proc format. Lets create a three way table of college dataset.

This is the code which we will use for doing the same.

Code

You can download code from this link.

Now lets look at the output you can see the perfectly sorted frequency table – 3 way table of variables Gender, Scholarship, Schoolsize. I have deliberately removed percent and cumulative percentages in the respective columns.

17.2

Learning from this chapter

We have learned procedure proc freq in this chapter and its functions. learned how to create single table, two way table and three way tables which was our program in this chapter.

Chapter-18: Creating Tabular reports

In this chapter we are going to learn about Procedure called PROC TABULATE and its functions.

Lets take a task from Ron Cody to create a table and calculate the row percentages of the respective table. The dataset used in this example is college dataset.

Here is the code used to create tabular report.

Code

You can download the above code from here.

The output for above code is as follows.

18.2

Learning in this chapter

In this chapter we have learned how to use procedure proc tabulate and its functions and various options.

Chapter-19: Output Delivery Systems

In this chapter we will learn how to save our output in html format. We will use functions like ods html etc.

Lets take an example and understand this further.

As we are working with college data set. Lets continue with it. Now our task is to write our output into a html file.

Here is the code for the same

Code

You can download the code for the above program here.

The output from the above code will create a html file in the assigned library in our case it is A15004 and also produce following output.

19.2

Learning in this chapter

We have learned how to store our output as a html file in this chapter and most importantly we learned how to use various functions of odc html functions along with other procedures like proc means in this case.

Chapter-20: Generating High quality graphics

As we are using SAS university edition for demonstration of the codes and functions there shall be a slight change in the procedure used in the functions instead of proc gchart and ggplot we shall use proc sgplot which is almost similar to other procedures.

Lets take a problem from ron cody and see some of the functions to create graphs.

First lets create pie chart and use the blood dataset. Here is the code to create pie chart.

20.1 code

You can download the code from here.

And the output from the above code is as follows:-

20.1

So we have now created pie chart. And used blood dataset with Chol variable.

Now lets create barchart using bicycles dataset.

20.2

You can download the code from here.

The output of the above code is as follows

Barchart

Learning in this chapter

In this chapter we have learned how to create pie chart and bar charts and also to change colors of bars and to add data labels to them as well.

So this completes the series of posts that we are learning by doing with simple sas codes.

Cheers!

Advertisements

Learning by Doing – Simple SAS Codes!!

Let’s start our journey of learning SAS with a situation related to our favorite school days.

First of all lets create a library in which all our codes and datasets can be saved for future reference.

Here is the code for same.

libname A15004 "/folders/myfolders/Assignments";

1)

There is a school in which kids are of age group 12 and 13, they have given their quiz and got grades like ‘A’, ‘B’, ‘C’, ‘D’, ‘F’ . They also completed their Mid term and final exams. Now the problem is Mid terms and finals are in numerical values (see table below) but Quiz in grades as explained earlier.

Download the code from here to create the following dataset

School

Now we want to put all the kids of age 12 under category 6 and kids of age 13 under category 8, we shall assign numeric equivalents to grades for example for grade A we shall assign 95, for B, C, D, F we shall assign 85, 75, 70 and 65 respectively.

Using SAS we can do this very simply.Here is the code for our problem.

data A04_school;

set A15004.school;

*simple if else block of two lines to categorize kids of different age groups;

if (Age = 12) then Grade=6;
else if (Age = 13) then Grade = 8;

*Below five lines is to give numerical equivalents to grades in Quiz;

if(Quiz EQ 'A') then EqMarks = 95 ;
else if(Quiz EQ 'B') then EqMarks = 85;
else if(Quiz EQ 'C') then EqMarks = 75;
else if(Quiz EQ 'D') then EqMarks = 70;
else if(Quiz EQ 'F') then EqMarks = 65;

*Below code is to compute the course grade with weightages of 20% to quiz marks
30% to mid term and 50% to final marks;

CourseGrade = (EqMarks*0.2)+(Midterm*0.3)+(Final*0.5);

proc print data=A04_school noobs;
run;

So our coding part is done, now it is time to check output. Please look into below figure for your ease I have also highlighted the new variables with yellow. Those are the new variables which were created with our code. You can compare initial dataset with the final output dataset.

7.1Output

Please be careful while using the code downloaded from above link, you need to change your library appropriately, I used A15004 as my library in this code. This is applicable for all the codes explained here after.

Learning: We have learned how to use if-else statements using a condition and to create new variables for the benefit of further analysis.



————————————————————————————————————————-

2)

Lets look at another situation where we have data of 1000 patients who are in different age categories like young and old, we have captured their blood group, WBC count and RBC count. Now our objective is to find out their cholesterol level – High, Medium or Low.

Look into the head of our dataset with just 16 observations (patients).

7.5Data

If you are interested to create this dataset on your own, please download the code from here.

Now our aim is to check certain conditions to categorize all our patients into three cholesterol groups i.e., High, Medium and Low. We should not forget there are certain missing values where we cannot compute the exact cholesterol group, so we should display ‘Data Missing’ in such cases.

I’ve decided to use ” select, when and end clause” to do this task easily. Please look at the code screenshot below for more details.

data A04_cholgrp_blood;

set A15004.blood;

*We are defining length for our new variable so that it doesn't truncate our 'Data Missing';
length cholgroup $16;

*Here is how we are using select, when and end clause to do our task that is to 
seggregate cholesterol group with the condition like cholesterol value less than or 
equal to 110 and cholesterol value less than or equal to 140;
select;
when (missing(chol)) cholgroup = "Data Missing";
when (Chol le 110) cholgroup = "Low";
when (Chol le 140) cholgroup = "Medium";
otherwise cholgroup = "High";
end;

proc print; 
run;

Now its time for output, please check the following screenshot to have a quick view about output from our program.

7.5Output

Learning: The important takeaway from this problem is we learned how to use select, when and end clause perfectly and this example enhanced our understanding even significantly with the condition we implemented to assign the cholesterol group.


—————————————————————————————————————–

3)

Lets learn how to use missing function in this section with the help of a dataset which contains variables lets say ‘A’, ‘B’, ‘C’ in these variables lets put 7 observations in each by deliberately missing some values in the variables.

Now our task is to find how many missing values each variable contains. For this we shall create three more variables to see the count value.

If we translate our problem into SAS code it looks like this:-

data A04_missing;
   input A $ B $ C $;
   if missing(A) then MissA+1;
   if missing(B) then MissB+1;
   if missing(C) then MissC+1;

datalines;
X Y Z
X Y Y
Z Z Z
X X .
Y Z .
X . .
;
proc print noobs;
run;

you can see how missing function has been used to count how many missing values each variable (A, B & C) contains.

Let’s look at output for better picture of what we are trying to derive.

8.4Output

You can see the count values for Missing values of A is 0, Missing values of B is 1 and Missing values of C is 3.

Learning: This section taught us about using Missing function to count any missing values in the variables, which is very critical when we are dealing with more number of observations. We can take decision based on this whether to treat missing values or just ignore them.


——————————————————————————————————————

4)

Now, lets get something related to racing we have three speed reading methods, for each method I have 10 observations of speeds in an unstructured way, now I have to read 10 observations for each speed reading method.

Here I would like to use DO statement. Please look at the following code.


data A04_SRM;

do Method= 'A','B','C';

	do Subj=1 to 10;
	
	input Speed @;
	
	output;
	end;
end;

datalines;
 250 255 256 300 244 268 301 322 256 333
 267 275 256 320 250 340 345 290 280 300
 350 350 340 290 377 401 380 310 299 399
;
proc print data=A04_SRM noobs;
run;

Now the output of the above code will look as follows.

Output

Learning: In this section we have learned how to use do statement and nested do statement. This will be immensely helpful for loading data into SAS.

5)

In this section lets play with dates for a while I have three observations which have dates in different formats, as you might be already aware that dates can be written different formats like mmddyyyy, ddmmyyyy, ddmmmyyyy etc.,

We have these dates in three rows

01/03/1950 01/03/1960 03Jan1970
05/15/2000 05/15/2002 15May2003
10/10/1998 11/12/2000 25Dec2005

You can see there are different formats in the above three lines. Now our task is to normalize all the formats to mm/dd/yyyy using SAS and not only that I want to save these dates as date 1, date 2 and date 3 for three observations

I would like to see difference between date 2 and date and assign it to date12 variable, then see difference between date 3 and date 2 and assign it to date 23 variable, if my output is in rational number format I want to round it off to nearest integer value.

Here is the SAS code for same.



data A04_dates1;

input
@2 date1 mmddyy10.
@13 date2 mmddyy10.
@24 date3 date9.
;

format date1 date2 date3 mmddyy10.;
date12 = yrdif(date1,date2,'Actual');
date23 = yrdif(date2,date3,'Actual');

format date12 date23 2.1;

datalines;
01/03/1950 01/03/1960 03Jan1970
05/15/2000 05/15/2002 15May2003
10/10/1998 11/12/2000 25Dec2005
;

title "Display of dates and difference between them";

proc print;
run;


Now lets look at the output of the above code.

Output

Here you can see three different dates which are formatted into single format and two created variables to find the difference between ages of different dates date12 for difference between date2 and date1 and date23 for difference between date3 and date2.

We also utilized yrdif function to calculate difference between two dates. It takes arguments of two date variables and ‘Actual’

Learning: We have learned how to load dates of various formats and take the difference between dates.

6)

In this part of blog lets see a different situation of bicycle sales of a company which is spread across in countries like USA, France, Italy etc.,

Our task is to subset the dataset with respect to two measures one is country name and other model type

in our first subset we want to have all the observations where country is USA and Model type of bicycle is Mountain Bike and lets save it into a dataset named MountainUSA

in our second subset we want to have all the observations were country is France and Model Type of Bicycle is Road Bike.

Here is how our master dataset look like

Dataset

Lets look at the code to subset based on the discussed criteria from the above dataset.


data a04_MountainUSA a04_RoadFrance;

set a15004.bicycles;

if Country = 'USA' and Model = 'Mountain Bike' then output a04_MountainUSA;
else if Country = 'France' and Model = 'Road Bike' then output a04_RoadFrance;
run;

title "List of MountainUSA";
proc print data=a04_MountainUSA;
run;

title "List of RoadFrance";
proc print data=a04_RoadFrance;
run;

The output for the following code is shown in a screenshot below.

Output

Learning: We have learned how to subset two datasets with some conditions from a master dataset. And if you observed we have done this in single data step.

7)

Lets learn some new things in this section mainly _NULL_, CONSTANT, ‘exactint’ and Put _ALL_

our task is to generate the largest number a computer can score with respect to byte length so we will create largest number for 3 bytes, 4 bytes, 5 bytes, 6 bytes and 7 bytes.

firstly lets discuss about _NULL_ this is a reserved keyword for telling SAS engine to not create a dataset just display the variables there in the code. You will better understand when I show you the code below. we use Constant to extract largest number for the given byte length.

Please look at the code for more better understanding.


data _NULL_;

Largest3= constant('exactint',3);
Largest4= constant('exactint',4);
Largest5= constant('exactint',5);
Largest6= constant('exactint',6);
Largest7= constant('exactint',7);

put _all_;
run;

‘exactint’ is used to take largest integer value, put _all_ is to just display all the variables which are there in the _NULL_ step i.e., Largest3 to Largest7

The output looks like this not in the result window but in the log window.

result

Please concentrate on the highlighted part in the above screenshot.

Learning: In this section we have learned many new things like _NULL_, Constant, ‘exactint’, put _ALL_ etc., and we learned how to display the variables without actually creating a dataset using _NULL_

8)

In this section we are going to learn somewhat advanced concept of SAS we have a mixed variable that is a variable which holds a combination of digits and alphabets like 123kgs

now we are interested to extract only number from such type of observation and manipulate it. Lets understand that with an example.

There is data about weights of few people but in very unstructured way that means some are having their weights in lbs and some in kgs and also some kgs are in capital letters some lbs are lower case letters.

Lets solve this problem with SAS code. We are going to use new functions such as find, input, compress for string manipulation.

Please look at the following code, you will be able to understand more clearly.


data A04_lbs;
set a15004.study;

if find(Weight,'lb','i') then 
WeightLB= input(compress(Weight,,'kd'),8.);
else if find(Weight,'kg','i') then 
WeightLB= 2.2*input(compress(Weight,,'kd'),8.);

WeightLB=round(WeightLB);

proc print data=A04_lbs;
run;

After executing the program this is how the output looks:-

Output

Learning: So, in this section we have learned how to extract only numbers from a mixed character string and then manipulate using compress, input functions.

9)

In this section we shall learn about displaying the dataset in the form which we like. Lets take an example and understand in a better way. There is a sales data set where data points are number of sales made by each person divided by region, quantity and sum of sales.

Now lets calculate the TotalSales and Quantity of East and West Region using SAS.

Please look at the following code for more information.



proc sort data=a15004.sales out=sales;
by Region;
run;

title 'Sales Figures from the SALES Dataset';


proc print data= sales label noobs ;
by Region;
where (Region ="East" or Region = "West");
*where Region in("East","West");
Var Region Quantity TotalSales;
sum Quantity TotalSales;
format TotalSales dollar10.2 Quantity Comma7.;
run;

So, here we used a new procedure to sort the dataset sales into regions i.e., Proc Sort.

Lets look at the output of the above code.

Output

Learning: In this section we have learned how to sort the data based on certain variables in this case Region and to manipulate data after sorting and displaying it as desired.

10)

This is the final section of this blog post, lets learn something interest about a new procedure called Proc Report. In this what we will learn is about column function, define function and compute function. All these functions come under Proc Report. This is the powerful procedure used for reporting.

Lets take the data about patients who has blood pressure and their DBP, SBP, Age values are recorded for our analysis.

Now we know that Females will become Hypertensive if their SBP is above 138 or DBP is above 88 else No

Similarly Males become Hypertensive if their SBP is above 140 or DBP is above 90 or esle No.

We will now try to accomplish this task using SAS. Please look at the code below for better understanding.


title "Listing of Hypertensive";

proc report data=a15004.bloodpressure nowd headline ;
column Gender SBP DBP Hypertensive;
define Gender/group;
define SBP/order;
define DBP/order;
define Hypertensive /computed;

compute Hypertensive;
select;
when (Gender = 'F' and SBP gt 138 or DBP gt 88 ) Hypertensive = 1;
when (Gender = 'F' and SBP lt 138 or DBP lt 88 ) Hypertensive = 0;
when (Gender = 'M' and SBP gt 140 or DBP gt 90 ) Hypertensive = 1;
when (Gender = 'M' and SBP lt 140 or DBP lt 90 ) Hypertensive = 0;
otherwise Hypertensive = 'NA';
end;
endcomp;
run;
quit;

So this is how we can write the code for our above problem and now the result will be displayed in terms of groups os Males and Females Yes or No is represented with 1 and 0 respectively.

Please look at the output of the same.

Output

Here is how our output looks after executing our SAS code.

Learning: In this section we learned a new concept i.e., about Proc Report and various functions used with it like computed, define, group, order etc., How to execute a computed block. We also learned how to define a new variable into the report of any dataset.

By this we shall conclude the learning session of SAS  for now, if you have any comments or feedback about the codes pasted here please share the same in comments.

See you guys very soon.
Cheers!