A good stepping stone for learning Big Data techniques is to get your hands dirty. That is what we are going to do today.
Objective: To run a MapReduce job written in Java (we’ll learn about MapReduce in the following lines) over three large text files and count how often each word appears in them, using Hadoop on the Hortonworks Data Platform installed in Oracle VirtualBox.
Framework: The complete framework for achieving our objective is shown in pictorial form; please take a moment to understand it.
I already have Oracle VirtualBox on my system, so my first task is to load Hadoop in VirtualBox. Installation takes about 10 to 15 minutes, provided the system is fast enough to run this heavy application.
Every time you want to do something with Hortonworks Hadoop, you need to run it from VirtualBox. Even after installation, the application takes about five minutes to start.
Once it is ready, it will show a screen like this.
Don’t worry if your system becomes a little slower after this step. This is quite common: the application is RAM hungry and easily uses more than 3 GB of RAM just to start.
Now we need to start the shell box, in which we can execute Linux commands.
Log in to this shell box with username: root and password: hadoop.
As we have the Java code ready, we need to create the Java files using the Linux vi command. After editing each file, save it and exit the editor with the following commands: :w to write the file and :q to quit the editor window and come back to the shell box.
Here is the editor window opened in my shell at 127.0.0.1:4200
The screen below is where I edited my SumReducer.java, WordMapper.java and WordCount.java files.
You can download the code for SumReducer, WordCount and WordMapper.
Once your Java files are ready, we need to create a new folder to hold the class files that we are going to compile from the Java code.
After creating the folder for the class files, we have to execute the following commands from the shell.
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar:WCclasses -d WCclasses WordCount.java
The commands above create the class files for SumReducer, WordMapper and WordCount.
What these programs essentially do is this: we have three large text files (in fact, three big novels) in txt format with a lot of words. As the names suggest, the mapper (WordMapper) reads the input piece by piece and emits every word it sees with a count of 1, and the reducer (SumReducer) adds up those counts for each word. WordCount is the driver that sets up and submits the job, while Hadoop decides which task is given to which node.
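To make the map and reduce steps concrete, here is a minimal sketch in plain Java with no Hadoop involved. It is not the downloadable WordMapper or SumReducer code; the class name WordCountSketch, its method names, and the choice to lowercase words and split on non-word characters are all assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the map -> shuffle -> reduce flow.
public class WordCountSketch {

    // Map step: emit a (word, 1) pair for every token in one line of text.
    // Assumption: words are lowercased and split on non-word characters.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group all emitted values by their key (the word).
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the list of counts collected for one word.
    public static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Run the three steps in sequence over all input lines.
    public static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog");
        System.out.println(countWords(lines));
    }
}
```

In a real Hadoop job the map calls run on many nodes at once and the framework does the grouping between the two steps; the sketch runs everything in one process only to show the data flow.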
Now that we have the class files for SumReducer, WordMapper and WordCount, we should create a jar file using the following command.
jar -cvf WordCount.jar -C WCclasses/ .
The next step is to create a folder in the HDFS file system using the following command.
hdfs dfs -mkdir -p /user/ru1/wc-input
After creating this folder, we have to upload the text files to it using the Hue file browser at 127.0.0.1:8000 in our web browser.
After uploading the files through the file browser, it looks as follows.
Now it’s time to execute the Hadoop jar file. Let’s use the following command for doing the same. Note that the output folder (the last argument) must not already exist, otherwise the job will fail.
hadoop jar WordCount.jar WordCount /user/ru1/wc-input /user/ru1/wc-out
Once it has run without any errors, we need to track the status of the application on the All Applications page at 127.0.0.1:8088
The screen looks as follows.
In this step we should see SUCCEEDED against the respective application. After confirming the success status, we should open the Hue file browser, where we will see a new folder called wc-out (the output path we gave in the shell command).
In this folder there will be two files called _SUCCESS and part-r-00000. The part-r-00000 file is where we can check the output of the program: each word that was found and how many times it occurred.
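If you download the output file, a few lines of Java are enough to pull out the most frequent words. The sketch below assumes the file uses Hadoop's default text output layout, one word and its count per line separated by a tab; the class name TopWords and its methods are hypothetical and not part of the job itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading word count output lines of the form "word<TAB>count".
public class TopWords {

    // Parse the lines and return the n most frequent words, highest count first.
    public static List<Map.Entry<String, Long>> top(List<String> lines, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            entries.add(new AbstractMap.SimpleEntry<>(parts[0], Long.parseLong(parts[1].trim())));
        }
        // Sort by count descending, then alphabetically to break ties.
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of a downloaded output file, or run without arguments for a tiny sample.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("and\t7", "the\t10", "zebra\t1");
        for (Map.Entry<String, Long> entry : top(lines, 10)) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

For three novels the file is small enough to read into memory like this; for much larger outputs you would sort inside Hadoop instead.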
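Finally, we have accomplished our objective of executing the Java word count program.
Word count is the classic first MapReduce example. The MapReduce model itself was introduced by Google, which used it for large-scale jobs such as building its web search index.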
Hope you all enjoyed this blog.