Build Hadoop from Source

The instructions in this write-up were tested and run successfully on Ubuntu 10.04, 10.10, 11.04, and 11.10. They should work, with minor modifications, on most flavors and variants of Linux and Unix; for example, replacing apt-get with yum should get them working on Fedora and CentOS.
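
As an illustration of that substitution, here is the sort of prerequisite install you might run before building. The exact package list is an assumption and depends on your distribution and on what you already have installed:

# Ubuntu/Debian: compiler toolchain, Ant, and the autotools used by the native builds below
sudo apt-get install build-essential ant autoconf automake libtool
# Fedora/CentOS equivalent
sudo yum install gcc gcc-c++ make ant autoconf automake libtool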

If you are starting out with Hadoop, one of the best ways to get it working on your box is to build it from source. Using stable binary distributions is an option, but a rather risky one. You are unlikely to stop at Hadoop common; you will probably go on to set up Pig and Hive for analyzing data, and may also give HBase a try. The Hadoop suite of tools suffers from a huge version mismatch and version confusion problem, so much so that many start out with Cloudera’s distribution, also known as CDH, simply because it solves this version confusion disorder.

Michael Noll’s well-written blog post, titled “Building an Hadoop 0.20.x version for HBase 0.90.2”, serves as a great starting point for building the Hadoop stack from source. I would recommend you read it and follow the steps stated in that article to build and install Hadoop common. Early on in the article you are told about a critical problem that HBase faces when run on top of a stable release version of Hadoop: HBase may lose data unless it is running on top of an HDFS with durable sync. This important feature is only available in the branch-0.20-append branch of the Hadoop source and not in any of the release versions.
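
For orientation, the fetch-and-build sequence looks roughly like the sketch below. The git mirror URL, branch name, and ant target reflect my understanding of the setup from that era and should be treated as assumptions; defer to Michael’s article for the authoritative steps.

# clone the Apache git mirror of Hadoop common and switch to the append branch
git clone http://git.apache.org/hadoop-common.git
cd hadoop-common
git checkout -t origin/branch-0.20-append
# build the jars; they end up under the 'build' folder referred to below
ant jar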

Assuming you have successfully followed Michael’s guidelines, you should have the Hadoop jars built and available in a folder named ‘build’ within the folder that contains the Hadoop source. At this stage, it’s advisable to configure Hadoop and take it for a test drive.

Configure Hadoop: Pseudo-distributed mode

Running Hadoop in pseudo-distributed mode provides a little taste of a cluster install using a single node. The Hadoop infrastructure includes a few daemon processes, namely

  1. HDFS namenode, secondary namenode, and datanode(s)
  2. MapReduce jobtracker and tasktracker(s)

When run on a single node, you can choose to run all these daemon processes within a single Java process (also known as standalone mode) or run each daemon in a separate Java process (pseudo-distributed mode).

If you go with the pseudo-distributed setup, you will need to provide some minimal custom configuration for your Hadoop install. Hadoop’s general philosophy is to bundle a default configuration with the source and allow it to be overridden using a separate configuration file. For example, hdfs-default.xml, which you can find in the ‘src/hdfs’ folder of your Hadoop root folder, contains the default configuration for HDFS properties. This file gets bundled into the compiled and packaged Hadoop jar, and Hadoop uses the configuration specified in it to set up HDFS. To override any HDFS property that takes its default value from hdfs-default.xml, re-specify the configuration for that property in a file named hdfs-site.xml, which resides in the ‘conf’ folder within the Hadoop root folder. Custom configuration in core-site.xml, hdfs-site.xml, and mapred-site.xml corresponds to default configuration in core-default.xml, hdfs-default.xml, and mapred-default.xml, respectively.

Contents of conf/core-site.xml after custom configuration:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<!-- Put site-specific property overrides in this file. -->
 
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

The specified configuration makes the HDFS namenode daemon accessible on port 9000 on localhost.

Contents of conf/hdfs-site.xml after custom configuration:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<!-- Put site-specific property overrides in this file. -->
 
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>path/to/data/dfs/name</value>
    <description>Determines where on the local filesystem the DFS name node
      should store the name table(fsimage).  If this is a comma-delimited list
      of directories then the name table is replicated in all of the
      directories, for redundancy. </description>
  </property>
</configuration>

The override specifies a replication factor of 1. On a single node, you can’t have a failover, can you? The custom configuration also sets the namenode directory to a path on your file system. The default is a folder within /tmp, which gets purged on a restart.

Contents of conf/mapred-site.xml after custom configuration:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
<!-- Put site-specific property overrides in this file. -->
 
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

After this configuration, the MapReduce jobtracker is accessible on port 9001 on localhost.

Finally, set JAVA_HOME in conf/hadoop-env.sh. On my Ubuntu 11.10 machine, I have it set as follows:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

If you don’t have passphraseless ssh set up on your machine, you may need to execute the following commands:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

This creates a DSA key pair, named id_dsa, with an empty passphrase and appends the public key to the set of authorized_keys for the SSH server running on your localhost. If you aren’t sure whether passphraseless ssh access is set up, simply run ‘ssh localhost’ in a terminal; if you are prompted for a password, then you need to complete the steps to set up passphraseless ssh.

Now start up the Hadoop daemons using:

bin/start-all.sh

Run this command from the root of your Hadoop folder.
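
To confirm that the daemons came up, jps (bundled with the JDK) should list one Java process per daemon. A sketch of what to expect; the process IDs are, of course, placeholders:

jps
# expected output, roughly:
# 12001 NameNode
# 12002 DataNode
# 12003 SecondaryNameNode
# 12004 JobTracker
# 12005 TaskTracker
# 12006 Jps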

As an additional step, set the HADOOP_HOME environment variable to point to the root of your Hadoop folder. HADOOP_HOME is used by other pieces of software built on top of Hadoop, like HBase, Pig, and Hive.
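
A minimal sketch of how that might look in your shell profile; the path below is a placeholder for wherever your Hadoop checkout lives:

# in ~/.bashrc or similar; adjust the path to your Hadoop folder
export HADOOP_HOME=/home/you/dev/hadoop-common
export PATH=$PATH:$HADOOP_HOME/bin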

Running a Simple Example
If you ran the tests after the Hadoop common install and they passed, then you should be ready to use Hadoop. However, for completeness, I would suggest running a simple Hadoop example while the daemons are up and waiting. Run the simple example illustrated in the official documentation, available online at http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html#PseudoDistributed. The example is at the end of the sub-section on Pseudo-Distributed Operation.
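
In essence, that example stages the conf folder in HDFS and runs the bundled grep job over it. A sketch is shown below; the name and location of the examples jar vary with the version you built, so treat the glob as an assumption and adjust it:

# stage some input files in HDFS
bin/hadoop fs -put conf input
# run the bundled grep example over the staged input
bin/hadoop jar build/hadoop-*examples*.jar grep input output 'dfs[a-z.]+'
# copy the results back to the local filesystem and inspect them
bin/hadoop fs -get output output
cat output/*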

Build HBase from Source

Once Hadoop is up and running, you are ready to build and run HBase. Start by getting the HBase source as follows:

git clone https://github.com/apache/hbase.git

I clone it from the Apache HBase mirror on GitHub. Alternatively, you can get the source from the HBase svn repository, which is where the official commits are checked in.
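
If you prefer svn, the checkout looked like the following at the time of writing; the repository layout may have changed since:

svn checkout http://svn.apache.org/repos/asf/hbase/trunk hbase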

HBase can make use of Snappy compression. Snappy is a fast compression/decompression library, built by Google and available as an open source software library under the ‘New BSD License’. The official site defines Snappy as follows:

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems. (Snappy has previously been referred to as “Zippy” in some presentations and the likes.)

You can learn more about Snappy at http://code.google.com/p/snappy/.

To build HBase with Snappy support, we need to do the following:

  1. Build and install the snappy library
  2. Build and install hadoop-snappy, the library that bridges snappy and Hadoop
  3. Compile HBase with snappy

Build and Install snappy
Building and installing snappy is easy and quick. Get the latest stable snappy release as follows:

wget http://snappy.googlecode.com/files/snappy-1.0.4.tar.gz

At the time of writing, the latest release version is 1.0.4; this version number will change as newer versions supersede it.
Once you have downloaded the zipped snappy tarball, extract it:

tar zxvf snappy-1.0.4.tar.gz

Next, change into the extracted snappy folder and use the familiar configure, make, make install trio to complete the build and install process.

cd snappy-1.0.4
./configure && make && sudo make install

You may need to run ‘make install’ using the privileges of a superuser, i.e. ‘sudo make install’.
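
On some systems the freshly installed library lands in /usr/local/lib, which may not yet be on the runtime linker’s path. If later builds complain that libsnappy cannot be found, refreshing the linker cache is a reasonable first step; treat this as an optional extra rather than a required part of the install:

sudo ldconfig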

Build and Install hadoop-snappy
Hadoop-snappy is a project for Hadoop that provides access to the snappy compression/decompression library. You can learn details about hadoop-snappy at http://code.google.com/p/hadoop-snappy/. Building and installing hadoop-snappy requires Maven. To get started, check out the hadoop-snappy code from its Subversion repository like so:

svn checkout http://hadoop-snappy.googlecode.com/svn/trunk/ hadoop-snappy-read-only

Then, change to the ‘hadoop-snappy-read-only’ folder and make a small modification to maven/build-compilenative.xml:

<!-- add JAVA_HOME as an env var -->
<exec dir="${native.staging.dir}" executable="sh" failonerror="true">
    <env key="OS_NAME" value="${os.name}"/>
    <env key="OS_ARCH" value="${os.arch}"/>
    <env key="JVM_DATA_MODEL" value="${sun.arch.data.model}"/>
    <env key="JAVA_HOME" value="/usr/lib/jvm/java-6-openjdk"/>
    <arg line="configure ${native.configure.options}"/>
</exec>

Also, install a few required zlib-related libraries:

sudo apt-get install zlibc zlib1g zlib1g-dev

Next, build hadoop-snappy using Maven like so:

sudo mvn package

Once hadoop-snappy is built, install the jar and tar distributions of hadoop-snappy to your local repository:

mvn install:install-file -DgroupId=org.apache.hadoop -DartifactId=hadoop-snappy -Dversion=0.0.1-SNAPSHOT -Dpackaging=jar -Dfile=./target/hadoop-snappy-0.0.1-SNAPSHOT.jar
mvn install:install-file -DgroupId=org.apache.hadoop -DartifactId=hadoop-snappy -Dversion=0.0.1-SNAPSHOT -Dclassifier=Linux-amd64-64 -Dpackaging=tar -Dfile=./target/hadoop-snappy-0.0.1-SNAPSHOT-Linux-amd64-64.tar
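
If you want to confirm that the artifacts landed in your local Maven repository, the coordinates above translate into a path like the following (assuming the default ~/.m2 location):

ls ~/.m2/repository/org/apache/hadoop/hadoop-snappy/0.0.1-SNAPSHOT/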

Compile, Configure & Run HBase
Once snappy and hadoop-snappy are compiled and installed, you are ready to compile HBase with snappy support. Change to the folder that contains the HBase repository clone and run the Maven compile command to build HBase from source.

cd hbase
mvn compile -Dsnappy

The -Dsnappy option tells Maven to compile HBase with Snappy support.

Earlier, I set up Hadoop to run in pseudo-distributed mode. Let’s configure HBase to also run in pseudo-distributed mode. As with Hadoop, the default configuration for HBase lives in hbase-default.xml, and custom configuration can be specified to override it. Custom configuration resides in conf/hbase-site.xml. To set up HBase in pseudo-distributed mode, make sure the contents of conf/hbase-site.xml are as follows:

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
        <description>The directory shared by RegionServers.
        </description>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>The replication count for HLog and HFile storage.
        Should not be greater than HDFS datanode count.
        </description>
    </property>
</configuration>

You may recall that we configured the HDFS namenode to be accessible on port 9000 on localhost, so ‘hbase.rootdir’ needs to be specified relative to that HDFS URL. If you run the HDFS daemons on a different port, adjust the configuration for ‘hbase.rootdir’ accordingly. The second custom property sets the replication factor to 1; on a single node, that’s the best, and only, option you have!

Now you can start up HBase using:

bin/start-hbase.sh
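
A quick way to sanity-check the install is to open the HBase shell and ask for the cluster status; a minimal sketch:

bin/hbase shell
# then, at the hbase shell prompt, run:
#   status
#   exit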

Build Pig
Building and installing Pig from source is a simple three-command operation:

svn checkout http://svn.apache.org/repos/asf/pig/trunk/ pig
cd pig
ant

The first command checks out the source from the Pig svn repository; it grabs the source from ‘trunk’, which is referred to as ‘master’ in git jargon. The second command changes to the folder that contains the Pig source. The third command compiles the Pig source using Apache Ant. Invoking the default target, i.e. simply calling ‘ant’ without any arguments, compiles Pig and packages it as a jar for distribution and consumption. The Pig jar files can be found at the root of the Pig folder. Pig usually generates two jar files (a minimal launch sketch follows the list):

  1. pig.jar — to be run with Hadoop.
  2. pigwithouthadoop.jar — to be run locally. Pig does not always need to use Hadoop.
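
A minimal sketch of launching Pig in each mode from the Pig folder, assuming HADOOP_HOME is exported as described earlier so Pig can find your cluster configuration:

# local mode: runs against the local filesystem, no Hadoop cluster needed
bin/pig -x local
# mapreduce mode (the default): runs jobs against the pseudo-distributed cluster
bin/pig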

Build Hive
Building and installing Hive is almost as easy as building and installing Pig. The following set of commands gets the job done:

svn co http://svn.apache.org/repos/asf/hive/trunk hive
cd hive
ant package

You should be able to understand these commands if you have come this far in the article.

There is one little catch in this Hive instruction set, though. When you run the ‘ant package’ task, you will see the build fail. With HADOOP_HOME pointing to a hadoop-0.20-append branch build, Hive’s ShimLoader does not detect the Hadoop version correctly; it’s the “-” in the name that causes the problem! Apply the simple patch available at https://issues.apache.org/jira/browse/HIVE-2294 and things should work just fine. Apply the patch as follows:

patch -p0 -i HIVE-2294.3.patch

To start using Hive, you will also need to minimally carry out these additional tasks (a quick smoke test follows the list):

  • Set the HIVE_HOME environment variable to point to the root of the Hive directory.
  • Add $HIVE_HOME/bin to $PATH.
  • Create /tmp in HDFS and set appropriate permissions:
bin/hadoop fs -mkdir /tmp
bin/hadoop fs -chmod g+w /tmp
  • Create /user/hive/warehouse and set appropriate permissions:
bin/hadoop fs -mkdir /user/hive/warehouse
bin/hadoop fs -chmod g+w /user/hive/warehouse
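
With those directories in place, a quick smoke test is to fire up the Hive CLI and run a trivial statement or two. The table definition below is just an illustrative example borrowed from Hive’s getting-started material, not something this setup requires:

# start the Hive CLI (assumes HIVE_HOME/bin is on your PATH as set up above)
hive
# at the hive> prompt, try something like:
#   SHOW TABLES;
#   CREATE TABLE pokes (foo INT, bar STRING);
#   SHOW TABLES;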

Now you have pseudo-distributed Hadoop, pseudo-distributed HBase, Pig, and Hive running on your box. This is, of course, just the beginning. You still need to learn to leverage these tools to analyze data, but that is not covered in this write-up. A follow-up post will possibly address the topic of analyzing data using MapReduce and its abstractions.
