HQL (Hive Query Language) Task using Hue

In this blog post we will discuss how to execute an HQL task in the Hue environment.

HQL was developed by Facebook engineers as a Hadoop support layer that lets fluent SQL developers leverage the Hadoop platform, because HQL is very similar to SQL. Those who are good with SQL can adapt to HQL easily.

This blog post is a continuation of our previous task, in which we executed a Pig script to calculate the player-wise scores sorted by year.

Just as we accessed the Pig script window, we now need to open the Hive UI (User Interface). It is located on the Hue page and marked with the Beeswax symbol.

 

The Hive editor window should look like this.

 

[Screenshot: the Hive query editor with the create-table query]

Let's execute our first query:

create table temp_batting (col_value STRING);

As you can observe, this is similar to a SQL query. It creates a table called temp_batting with a single string column.

Now we need to load the data into our Hive table temp_batting from the CSV file that we already uploaded in our last task.
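A minimal load statement looks like the following; the HDFS path is an assumption and should point to wherever Batting.csv was uploaded:

LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;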

So let's move on and create a new table called batting with three columns: player_id, year, and runs scored.

We shall use the following statement:

create table batting (player_id STRING, year INT, runs INT);

We will now transfer the data from temp_batting to our new table batting, extracting the fields we need with regular expressions.

The query we will use is as follows:

insert overwrite table batting
SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
       regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
       regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting;

The job status can be seen in the All Applications window.
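Once the insert job finishes, a quick query like the one below (not part of the original task, just a sanity check) confirms that the data landed in the new table:

SELECT * FROM batting LIMIT 10;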

Finally, let's group the data to find the maximum runs per year and which player scored them. We will use the following query:

SELECT a.year, a.player_id, a.runs
FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);

With this, we have successfully executed the HQL queries and completed the task with easily understandable code.

If you have any queries please comment on this post.

 


Processing Data with Apache Pig using Hortonworks HDP

Pig scripting is another easy way to execute complex Hadoop MapReduce jobs. Although the underlying platform is Java, Pig scripts are simple and easy to understand.

Pig Latin is the language used for writing Pig Scripts.

In this post we will mainly see how to process data by writing a Pig script.

Let's talk about the objective of this task. We have baseball data in a CSV file of about 90,000 observations, covering runs scored by players from 1871 to 2011. Let's calculate the highest run total for each year and the player who scored it. We shall also extract the first and last names of the players of interest.

I suggest you download the baseball data we will be using from here.

As in our previous task, we first need to run the Hortonworks HDP single-node Hadoop cluster using Oracle VirtualBox. Once it is booted up and ready, we should open the Hue environment at the URL http://127.0.0.1:8000

It may ask for login credentials. Use the default login ID and password.

Login ID: hue Password: 1111

[Screenshot: Hue login screen]

Now it's time to upload our data to Hue using the upload option in the File Browser tab.

[Screenshot: uploading the file in the Hue file browser]

Next we have to navigate to the Pig icon, where we can create our own Pig scripts. After navigating, click on New script, give the script an appropriate title, then write the following code and save it.

batting = load 'Batting.csv' using PigStorage(',');
raw_runs = FILTER batting BY $1>0;
runs = FOREACH raw_runs GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
DUMP join_data;

The explanation of the above code is as follows:

  1. We load the data using a comma delimiter.
  2. We filter out the header row, keeping only records whose year field is greater than 0.
  3. We project the player ID, year, and runs fields from each record.
  4. We group the runs by year and compute the maximum runs for each year.
  5. We join the yearly maximums back to the runs data to recover the player ID of the highest-scoring player, and dump the result to the console (see the note below for writing it to HDFS instead).
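If you would rather write the result into HDFS than dump it to the console, a STORE statement can replace the final DUMP; the output path below is an assumption and can be any writable HDFS directory:

STORE join_data INTO 'pig-output/max-runs-per-year' USING PigStorage(',');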


Then we have to execute the script and wait for Pig to start the job. Once it has started, it will look like this.

[Screenshot: the Pig job in progress]

Finally we will get a success page where we can check how much time Pig took to execute the job. The success page looks like this.

[Screenshot: the Pig job success page]

Output:

[Screenshots: the Pig job output]

Conclusion & Learning:

With this, we have completed our task of executing the Pig script and obtaining, for each year from 1871 to 2011, the player with the highest runs.

That's all for today. We'd love to hear your thoughts in the comments section below.

Regards!

Word Count Exercise using Hortonworks Hadoop

The stepping stone for learning Big Data techniques is to get your hands dirty. That is what we are going to do today.

Objective: To execute a Java-coded MapReduce task (we'll learn about this in the following lines) over three large text files and count the frequency of the words appearing in those files, using Hadoop under the Hortonworks Data Platform installed on Oracle VirtualBox.

Framework: The complete framework for achieving our objective is represented in pictorial form below; please do make an effort to understand it.

[Diagram: the word count task framework]

I already have Oracle VirtualBox on my system, so my first task is to load Hadoop using VirtualBox. Installation takes roughly 10 to 15 minutes, provided the system is fast enough to run this heavy application.

Every time you want to do something with Hortonworks Hadoop, you need to run it from VirtualBox. Even after installation, the application takes about five minutes to start.

Once it is ready, it will show a screen like this.

[Screenshot: Hortonworks HDP running in Oracle VirtualBox]

Don't worry if your system becomes a little slower after this step. That is quite common, as this application is RAM-hungry and easily uses more than 3 GB of RAM just to start.

Now we need to open the shell, in which we can execute Linux commands.

We need to log in to this shell with username: root and password: hadoop.

As we have the Java code ready, we need to create these Java files using the Linux vi command. After editing a file, we use the following commands to save and exit the editor: :w to write the file and :q to quit the editor and return to the shell (or :wq to do both at once).
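For example, each of the three source files mentioned below can be created and opened for editing like this:

vi SumReducer.java
vi WordMapper.java
vi WordCount.java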

Please look at the editor window opened in my shell at 127.0.0.1:4200.

[Screenshot: the shell at 127.0.0.1:4200]

The screen below is where I edited my SumReducer.java, WordMapper.java, and WordCount.java files.

[Screenshot: editing the word count Java files]

You can download the code for SumReducer, WordCount, and WordMapper.

Once your Java files are ready, we need to create a new folder to hold the class files we are going to compile from the Java source.
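A single mkdir from the shell is enough; the folder name must match the -d flag in the javac commands and the directory used in the jar command later:

mkdir WCclasses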

After creating the folder for the class files, we have to execute the following commands from the shell.

# compile the mapper
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
# compile the reducer
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
# compile the driver (the already-compiled classes must be on the classpath)
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar:WCclasses -d WCclasses WordCount.java

Using the commands above, we create the class files for SumReducer, WordMapper, and WordCount inside the WCclasses folder.

What these programs essentially do is this: we have three large text files (in fact, three big novels) in txt format containing a huge number of words. The mapper reads the text and emits a count of 1 for every word it sees, and the reducer sums those counts to produce the final frequency of each word; Hadoop takes care of distributing this work across the nodes.
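As a rough illustration of that split, here is a minimal sketch of a word-count mapper and reducer written against the standard org.apache.hadoop.mapreduce API; the class names are illustrative, and the downloadable WordMapper.java and SumReducer.java may differ in detail:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (word, 1) for every token in each input line.
class WordMapperSketch extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Sums all the 1s emitted for a word to get its total frequency.
class SumReducerSketch extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}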

Now that we have the class files for SumReducer, WordMapper, and WordCount, we should create a jar file using the following command.

jar -cvf WordCount.jar -C WCclasses/ .

The next step is to create an input folder in the HDFS file system using the following command.

hdfs dfs -mkdir -p /user/ru1/wc-input

After creating this folder, we have to upload the three text files into it using the Hue file browser at 127.0.0.1:8000 in our web browser.
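If you prefer the shell to the Hue file browser, the same upload can be done with hdfs dfs -put; the local file names below are placeholders for whatever your three text files are called:

hdfs dfs -put novel1.txt novel2.txt novel3.txt /user/ru1/wc-input/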

After uploading the files through the file browser, it looks as follows.

[Screenshots: the Hue interface and the file browser after uploading the files]

 

Now it's time to execute the Hadoop jar file. Let's use the following command.

hadoop jar WordCount.jar WordCount /user/ru1/wc-input /user/ru1/wc-out

[Screenshot: running the hadoop jar command]

After it executes without any errors, we need to track the status of the application on the All Applications page at 127.0.0.1:8088.

The screen looks as follows.

[Screenshot: the All Applications page]

On this page we should see SUCCEEDED against our application. After confirming the success status, we open the Hue file browser, where we will see a new folder called wc-out (the output path we gave in the shell command).

[Screenshot: the output folder in the Hue file browser]

In this folder there will be two files, _SUCCESS and part-r-00000. The part-r-00000 file is where we can check the program's output: which words appear and how often each one occurs.
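You can also inspect the result directly from the shell, assuming the output path used in the hadoop jar command above:

hdfs dfs -ls /user/ru1/wc-out
hdfs dfs -cat /user/ru1/wc-out/part-r-00000 | head -20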

[Screenshots: the word count output in part-r-00000]

Finally, we accomplished our objective of executing the Java word count program.

This word count program is the classic example of the MapReduce model that Google originally developed for large-scale web indexing and page ranking.

Hope you all enjoyed this blog.

cheers!!