Processing Data with Apache PIG using Hortonworks HDP

Pig scripting is another easy way to execute complex Hadoop MapReduce jobs. Although the underlying platform is Java, Pig scripts are simple and easy to understand.

Pig Latin is the language used for writing Pig scripts.

In this post we will mainly see how to process data by writing a Pig script.

Let’s talk about the objective of this task. We have baseball data in a CSV file of about 90,000 observations, covering runs scored by players from 1871 to 2011. We will calculate the highest runs scored in each year and the player who scored them, and we shall also extract the first and last names of the players of interest.

I suggest you download the baseball data we will be using from here.

Like our previous task, we first need to run the Hortonworks HDP single-node Hadoop cluster using Oracle VirtualBox. Once it is booted up and ready, we open the Hue environment using the URL http://127.0.0.1:8000.

It may ask for login credentials. Use the defaults:

Login ID: hue / Password: 1111

[Screenshot: Hue login screen]

Now it’s time to upload our data to Hue using the upload option in the File Browser tab.

[Screenshot: Hue file upload]

Next we navigate to the Pig icon, where we can create our own Pig scripts. There we click on New script, give it an appropriate title, and then write the following code and save it.

-- Load the CSV file, using a comma as the field delimiter
batting = LOAD 'Batting.csv' USING PigStorage(',');
-- Filter out the header row
raw_runs = FILTER batting BY $1 > 0;
-- Keep only the player ID, year and runs columns
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
-- Group the runs by year
grp_data = GROUP runs BY (year);
-- Find the maximum runs for each year
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
-- Join the yearly maxima back to the runs relation to recover the player ID
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
-- Rearrange the fields and print the result
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;
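
While writing the script it helps to check the schema Pig infers for each relation, since positional references like $1 and $8 are easy to get wrong. Pig's DESCRIBE operator prints the schema of an alias; a minimal sketch using the aliases defined above (fields loaded without a declared schema default to bytearray):

-- Print the inferred schema of the intermediate relations
DESCRIBE runs;
DESCRIBE grp_data;
DESCRIBE max_runs;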

The explanation of the script is as follows:

  1. We load the data using a comma as the delimiter.
  2. Then we filter out the first row, which is the header.
  3. We iterate over the filtered data, keeping only the player ID, year and runs fields.
  4. We group the runs by the year field.
  5. We compute the maximum runs for each year and join that result back to the runs relation to obtain the ID of the highest-scoring player, which DUMP then prints.
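
The objective also mentioned extracting the first and last names of the players of interest, and Batting.csv only carries player IDs. Below is a minimal sketch of that step, assuming the same dataset's Master.csv file has also been uploaded and that its first column is playerID, with the first-name and last-name columns at positions $13 and $14 (check the header of your copy, as column positions differ between versions of the data):

-- Load the player reference file (the column positions here are an assumption;
-- verify them against the header of your Master.csv)
names = LOAD 'Master.csv' USING PigStorage(',');
players = FOREACH names GENERATE $0 AS playerID, $13 AS nameFirst, $14 AS nameLast;
-- Attach names to the per-year results computed above
named_results = JOIN join_data BY playerID, players BY playerID;
DUMP named_results;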


Then we execute the script and wait for Pig to start the process. Once it has started, it will look like this.

[Screenshot: Pig job running in Hue]

Finally we get a success page where we can check how much time Pig took to execute the job. The success page looks like this.

[Screenshot: Pig job success page]

Output:

[Screenshots: Pig job output]

Conclusion & Learning:

With this we have completed our task of executing the Pig script and obtaining, for each year from 1871 to 2011, the highest runs scored and the player who scored them.

So this is all for today. I’d love to hear your thoughts in the comments section below.

Regards!
