Word Count Exercise using Hortonworks Hadoop

The stepping stone to learning Big Data techniques is getting your hands dirty. That is what we are going to do today.

Objective: To execute a Java-coded MapReduce task (we’ll learn about this in the following lines) on three large text files, counting the frequency of the words appearing in them, using Hadoop on the Hortonworks Data Platform installed in Oracle VirtualBox.

Framework: The complete framework for achieving our objective is represented in pictorial form below; please do make an effort to understand it.

[Image: the framework of the exercise in pictorial form]

I already have Oracle VirtualBox on my system, so my first task is to load the Hortonworks Hadoop image into VirtualBox. Installation takes about 10 to 15 minutes, provided the system is fast enough to run this heavy application.

Every time you want to do something with Hortonworks Hadoop, you need to start it from VirtualBox. Even after installation, starting the application takes about five minutes.

Once it is ready, it will show a screen like this.

[Image: Hortonworks Sandbox up and running]

Don’t worry if your system becomes a little slower after this step. That is quite common, as this application is RAM-hungry: it easily uses more than 3 GB of RAM just to start.

Now we need to start the shell, in which we can execute Linux commands.

We gain access to this shell with username root and password hadoop.
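As a side note, if you prefer a regular terminal to the browser shell, the Hortonworks sandbox normally also forwards SSH on port 2222 (check your VirtualBox port-forwarding settings if yours differs):

ssh root@127.0.0.1 -p 2222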

As we already have the Java code ready, we need to create the Java source files using the Linux vi command. After editing a document, we give the following commands to save it and exit the editor: :w to write the file and :q to quit the editor window and come back to the shell (or :wq to do both at once).
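For example, creating the driver file looks like this (and likewise for the other two files):

vi WordCount.java    # opens the editor; press i to start inserting text
# ...type or paste the code, then press Esc and type:
:wq                  # write the file and quit in one step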

The shell opens in the browser at 127.0.0.1:4200; please look at the editor window opened in my shell.

[Image: the browser shell at 127.0.0.1:4200]

The screen below is where I edited my SumReducer.java, WordMapper.java, and WordCount.java files.

[Image: editing the word count program in vi]

You can download the code for SumReducer, WordCount, and WordMapper.
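If you just want to follow along, here is a minimal sketch of what the three files typically contain, based on the standard Hadoop WordCount example; the downloadable versions may differ in details such as tokenization or class members.

// WordMapper.java: emits (word, 1) for every word in each input line.
// A sketch based on the standard Hadoop WordCount example.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // each occurrence counts once here
        }
    }
}

// SumReducer.java: sums all the 1s emitted for each distinct word.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        total.set(sum);
        context.write(key, total);      // (word, frequency)
    }
}

// WordCount.java: the driver that wires the mapper and reducer into a job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input folder
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output folder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}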

Once your Java files are ready, we need to create a new folder to hold the class files that we are going to compile from the Java sources.

After creating a folder for the class files (<code>mkdir WC-classes</code>, matching the -d flag below), we execute the following commands from the shell.

javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WC-classes WordMapper.java
#-----
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WC-classes SumReducer.java
#-----
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar:WC-classes -d WC-classes WordCount.java
#----

The commands above compile WordMapper, SumReducer, and WordCount into class files inside the WC-classes folder. Note that the third command also puts WC-classes on the classpath, because WordCount references the other two classes.

What these programs essentially do is this: we have three large text files (in fact, three big novels) in .txt format containing a huge number of words. The mapper reads the text and emits a (word, 1) pair for every word it sees, and Hadoop distributes this work across the nodes; the reducer then collects all the pairs for each word and sums them up. For example, for the line “to be or not to be” the mapper emits (to,1), (be,1), (or,1), (not,1), (to,1), (be,1), and the reducer receives to with [1,1] and writes (to, 2).

Now that we have the class files of SumReducer, WordMapper, and WordCount, we create a jar file using the following command.

<code>jar -cvf WordCount.jar -C WC-classes/ .</code>

The next step is to create an input folder in the HDFS file system, using the following command.

<code>hdfs dfs -mkdir -p /user/ru1/wc-input</code>

After creating this folder, we upload the three text files using the Hue file browser, available at 127.0.0.1:8000 in our web browser.
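If you would rather stay in the shell than use the Hue GUI, the same upload can be done with hdfs dfs -put; the file names below are placeholders for whatever your three novels are called:

hdfs dfs -put novel1.txt novel2.txt novel3.txt /user/ru1/wc-input
hdfs dfs -ls /user/ru1/wc-input    # verify that the three files landed in HDFS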

After uploading the files through the file browser, it looks as follows.

[Images: the Hue interface and its file browser showing the uploaded files]

Now it’s time to execute the Hadoop jar file, using the following command. Here WordCount is the driver class, /user/ru1/wc-input is the input folder, and /user/ru1/wc-out is the output folder, which must not already exist; Hadoop creates it for us.

hadoop jar WordCount.jar WordCount /user/ru1/wc-input /user/ru1/wc-out

[Image: executing the jar with the hadoop command]

After it executes without any errors, we can track the status of the application on the All Applications page at 127.0.0.1:8088.

The screen looks as follows

[Image: the All Applications page at 127.0.0.1:8088]
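The same information is also available from the YARN command line, if you prefer it over the web UI:

yarn application -list -appStates ALL    # lists applications along with their final status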

On this page we should see SUCCEEDED against our application. After confirming the success status, we open the Hue file browser, where we will see a newly created folder called wc-out (the output path we gave on the shell command line).

[Image: the wc-out folder in the Hue file browser]

In this folder there will be two files, called _SUCCESS and part-r-00000. The part-r-00000 file is where we can check the output of the program: which words occurred and with what frequency.
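You can also read the output straight from the shell, assuming the same output path we passed to the job:

hdfs dfs -cat /user/ru1/wc-out/part-r-00000 | head -20    # the first 20 word/count lines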

[Images: sample output from part-r-00000, one word and its count per line]

Finally, we accomplished our objective of executing the Java word count program.

This program is often described as the “hello world” of MapReduce, the programming model Google originally used for large-scale jobs such as building its search index and computing page rankings.

Hope you all enjoyed this blog.

Cheers!!
