Getting Started with Hadoop

This is a short tutorial on using Hadoop.

Hadoop commands and formats change between versions. If hadoop fs <command> doesn't work, try hdfs dfs <command>.
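
As a rough sketch of that fallback, the helper below reports which of the two front ends is on your PATH. The function name pick_hdfs_cli is made up for this tutorial, and finding a command on PATH does not guarantee it works on a given cluster, so treat the result as a hint only.

```shell
# Hypothetical helper: report which HDFS shell front end is on PATH.
# Prefers "hadoop fs" and falls back to "hdfs dfs".
pick_hdfs_cli() {
  if command -v hadoop >/dev/null 2>&1; then
    echo "hadoop fs"
  elif command -v hdfs >/dev/null 2>&1; then
    echo "hdfs dfs"
  else
    echo "no HDFS client found on PATH" >&2
    return 1
  fi
}

# Example (left unquoted on purpose so the two words split):
#   $(pick_hdfs_cli) -ls /user/username
```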

We'll go through the process of compiling, packaging, and running a simple Hadoop program.

This tutorial is adapted from the Hadoop Map/Reduce tutorial.

Logging In

First, make sure you can log in to the head node over SSH, currently at zoidberg.cs.ndsu.nodak.edu. Log in to this server with your CS Domain password (the one you use for the Windows clusters around campus or for campus Wi-Fi), NOT your University System password.

If you have trouble logging in:

Check whether your password works in the Linux lab.
If you cannot log in to the Linux lab, contact support@cs.ndsu.edu.

To request access to the Hadoop cluster, contact support@cs.ndsu.edu.

Setting Up Input Files

This program can use the Hadoop Distributed File System (HDFS) that is set up in the CS department. This file system spans all the Linux lab machines and provides distributed storage for use specifically with Hadoop.

You can work with HDFS using UNIX-like file commands. The full list of file commands is in the Hadoop File System Shell guide.

First, make a directory to store the input for the program (use your username).

helsene@zoidberg:~$ hadoop fs -mkdir /user/helsene/wordcount
helsene@zoidberg:~$ hadoop fs -mkdir /user/helsene/wordcount/input

To set up input for the WordCount program, create two files as follows:

file01:

Hello World Bye World

file02:

Hello Hadoop Goodbye Hadoop
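
If you would rather create the two files from the shell than from an editor, the following is a minimal sketch. The function name make_wordcount_inputs is made up for this tutorial; it takes the directory to write into as its only argument.

```shell
# Hypothetical helper: write the two sample inputs into the given directory.
make_wordcount_inputs() {
  dir="$1"
  printf 'Hello World Bye World\n' > "$dir/file01"
  printf 'Hello Hadoop Goodbye Hadoop\n' > "$dir/file02"
}

# Example: write both files into your home folder.
#   make_wordcount_inputs "$HOME"
```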

Save these to your home folder on the head node. To move them into HDFS, use the following commands:

helsene@zoidberg:~$ hadoop fs -copyFromLocal /home/helsene/file01 /user/helsene/wordcount/input/file01
helsene@zoidberg:~$ hadoop fs -copyFromLocal /home/helsene/file02 /user/helsene/wordcount/input/file02

Again, use your username where applicable.

The syntax here is “hadoop fs -copyFromLocal <LOCAL_FILE> <HDFS_FILE>”. In this case, we copy file01 from the local file system into the wordcount/input/ directory under our HDFS user directory.
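
The two copy commands can also be written as a loop. This is only a sketch: the function name upload_wordcount_inputs and the HADOOP override variable are assumptions made for this tutorial, not part of Hadoop. Setting HADOOP=echo prints the commands instead of running them, which is handy for checking the paths first.

```shell
# Hypothetical helper: upload both sample files for the given HDFS user.
# Arguments: <hdfs_user> [local_dir], where local_dir defaults to $HOME.
# HADOOP can be overridden (for example HADOOP=echo) to preview the commands.
upload_wordcount_inputs() {
  hdfs_user="$1"
  local_dir="${2:-$HOME}"
  for f in file01 file02; do
    ${HADOOP:-hadoop} fs -copyFromLocal "$local_dir/$f" \
      "/user/$hdfs_user/wordcount/input/$f" || return 1
  done
}

# Example: upload_wordcount_inputs helsene /home/helsene
```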

Running the WordCount Program

You can now run the WordCount program using the following command:

hadoop jar wc.jar WordCount /user/username/wordcount/input /user/username/wordcount/output

The command syntax is: “hadoop jar <JARFILE> <CLASS> <PARAMETERS…>”

In this case, we use the wc.jar JAR file and run the class 'WordCount' with two parameters: an input directory and an output directory. The output directory must not already exist in HDFS; the program creates it.
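
Because the job fails if the output directory already exists, it can help to check for it before submitting. The sketch below assumes wc.jar is in the current directory; the function name run_wordcount is made up for this tutorial. It relies on hadoop fs -test -e, which exits with status 0 when the path exists.

```shell
# Hypothetical helper: run WordCount only if the output directory is absent.
run_wordcount() {
  hdfs_user="$1"
  in_dir="/user/$hdfs_user/wordcount/input"
  out_dir="/user/$hdfs_user/wordcount/output"
  # "hadoop fs -test -e" exits 0 when the path exists.
  if hadoop fs -test -e "$out_dir"; then
    echo "Output directory $out_dir already exists; remove it or pick another." >&2
    return 1
  fi
  hadoop jar wc.jar WordCount "$in_dir" "$out_dir"
}

# Example: run_wordcount username
```

On Hadoop 2.x, an old output directory can be removed with hadoop fs -rm -r <HDFS_DIR> before rerunning the job.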

View Output

You can check the output directory with:

hadoop fs -ls /user/username/wordcount/output/

You should then see something similar to:

helsene@zoidberg:~$ hadoop fs -ls /user/helsene/wordcount/output
Found 2 items
-rw-r--r--   3 helsene nogroup         41 2011-11-08 11:23 /user/helsene/wordcount/output/part-r-00000

The 'part-r-00000' file contains the results of the word counting. You can look at the file using the 'cat' command.

helsene@zoidberg:~$ hadoop fs -cat /user/helsene/wordcount/output/part-r-00000
Bye	1
Goodbye	1
Hadoop	2
Hello	2
World	2
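
As a quick cross-check, the same counts can be reproduced on the head node with standard UNIX tools, without Hadoop. This is a sketch for sanity-checking small inputs only, not how the Hadoop job works internally; the function name local_wordcount is made up for this tutorial.

```shell
# Hypothetical helper: count words in local files with standard UNIX tools.
# Prints "word<TAB>count", sorted by word in byte order.
local_wordcount() {
  cat "$@" |
    tr -s '[:space:]' '\n' |
    LC_ALL=C sort |
    uniq -c |
    awk 'NF == 2 { printf "%s\t%s\n", $2, $1 }'
}

# Example: local_wordcount "$HOME/file01" "$HOME/file02"
```

Run against file01 and file02, this prints the same five lines as the part-r-00000 file above.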

Notes on Hadoop 2.8.5

You may first need to run: export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Compile: hadoop com.sun.tools.javac.Main WordCount.java

Make Jar: jar cf wc.jar WordCount*.class

Execute jar: hadoop jar wc.jar WordCount /user/ghokanso/wordcount/input /user/ghokanso/wordcount/output
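
The compile and packaging steps above can be chained so that the JAR is only built when compilation succeeds. This is a sketch, assuming WordCount.java is in the current directory and JAVA_HOME points at a JDK that still ships lib/tools.jar (JDK 8 or earlier); the function name build_wordcount_jar is made up for this tutorial.

```shell
# Hypothetical helper: compile WordCount.java and package it as wc.jar.
build_wordcount_jar() {
  # tools.jar provides com.sun.tools.javac.Main on JDK 8 and earlier.
  export HADOOP_CLASSPATH="${JAVA_HOME}/lib/tools.jar"
  # Stop here if compilation fails, so no stale classes get packaged.
  hadoop com.sun.tools.javac.Main WordCount.java || return 1
  jar cf wc.jar WordCount*.class
}

# Example: build_wordcount_jar && echo "wc.jar is ready"
```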