Improving Hadoop Performance with Optimization, CDH3 Update 3, and CDH4

At Cloudera Day, Cloudera software engineer Todd Lipcon delivered a deep dive on the core of Cloudera’s Distribution including Apache Hadoop (CDH), detailing tweaks and planned improvements to the Hadoop core. Just a few days later, some of these planned improvements shipped when Cloudera released CDH3 Update 3, and more are planned for the upcoming CDH4. Many of the tweaks must be configured by the user, so if you want your cluster running optimally, be sure to check the documentation provided by Cloudera.

When measuring the efficiency of a cluster, we can look at three different metrics. Speed can mean per-job latency, measured with a stopwatch, or throughput, measured in slot-seconds. The two can be at odds, since a change that makes one job finish sooner may consume more slot time overall. The third metric, perhaps the most important to Hadoop developers, is overhead, which Lipcon defined as effort spent on jobs you don’t care about. All of these metrics can be improved with some simple adjustments. For example, by tweaking the way Linux handles IO and caching, you can decrease latency by roughly 20% while increasing disk utilization and smoothing out CPU usage.
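To make the tension between latency and throughput concrete, here is a minimal sketch. The `Task` type, the two schedules, and every timing in it are hypothetical numbers chosen for illustration, not measurements from Lipcon's talk.

```python
from dataclasses import dataclass


@dataclass
class Task:
    start: float  # seconds after job submission (hypothetical)
    end: float


def job_latency(tasks):
    """Stopwatch time: first task start to last task finish."""
    return max(t.end for t in tasks) - min(t.start for t in tasks)


def slot_seconds(tasks):
    """Cluster cost: the sum of the time each task occupied a slot."""
    return sum(t.end - t.start for t in tasks)


# Plan A: four tasks run back to back in a single slot.
serial = [Task(0, 10), Task(10, 20), Task(20, 30), Task(30, 40)]

# Plan B: the same four tasks run at once in four slots, each assumed
# to slow down to 12 s because they contend for the same disks.
parallel = [Task(0, 12), Task(0, 12), Task(0, 12), Task(0, 12)]

print(job_latency(serial), slot_seconds(serial))      # 40 40
print(job_latency(parallel), slot_seconds(parallel))  # 12 48
```

Plan B wins on the stopwatch but costs more slot-seconds, which is the sense in which the two measures of speed can pull in opposite directions.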

According to Cloudera, these and other improvements are available in CDH3u3, resulting in a 15% to 150% increase in performance depending on the workload. Additions include MapReduce TaskTracker tolerance of disk failures; read-ahead and drop-behind in HDFS and MapReduce; faster HDFS block reports through improved locking and parallel disk scanning; HDFS shortcut local DataNode reads; and HDFS re-use of client-to-DataNode connections. Apache HBase has also been updated with distributed log splitting on RegionServer crash, atomic bulk load, and HBCK offline META rebuild, and Apache Oozie and ZooKeeper have received updates as well.
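For readers unfamiliar with read-ahead and drop-behind, the sketch below shows the general idea at the operating-system level using `posix_fadvise`. It is an illustration only: it is not Hadoop code, and whether CDH issues exactly these hints is an assumption. The `advise` and `stream_with_hints` helpers and the 1 MiB chunk size are hypothetical.

```python
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB per read; a hypothetical value for illustration


def advise(fd, offset, length, advice_name):
    """Send a cache hint if the platform supports it; hints are best-effort."""
    if not hasattr(os, "posix_fadvise"):
        return
    try:
        os.posix_fadvise(fd, offset, length, getattr(os, advice_name))
    except OSError:
        pass  # the kernel may reject a hint; reading still works


def stream_with_hints(path, chunk=CHUNK):
    """Read a file sequentially with read-ahead and drop-behind hints.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            # Read-ahead: ask the kernel to start loading the chunk after this one.
            advise(fd, total + chunk, chunk, "POSIX_FADV_WILLNEED")
            data = os.read(fd, chunk)
            if not data:
                break
            # Drop-behind: the bytes just consumed will not be read again.
            advise(fd, total, len(data), "POSIX_FADV_DONTNEED")
            total += len(data)
    finally:
        os.close(fd)
    return total


# Small demonstration on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (3 * CHUNK + 123))
print(stream_with_hints(tmp.name))  # 3145851
os.remove(tmp.name)
```

The point of the pattern is that a large sequential scan keeps the disk busy ahead of the reader without crowding other data out of the page cache.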

CDH4, which will be available in beta shortly, has an even wider array of improvements and additions based on customer demands and industry trends. One of the biggest changes will be the inclusion of a High Availability NameNode, so that if the NameNode fails you won’t lose the whole cluster or your data. With Hadoop Distributed File System-RAID (HDFS-RAID), HDFS’s effective data replication factor will drop from 3 to 2.2 thanks to a Distributed RAID File System (DRFS). DRFS increases protection against corruption and hence reduces the amount of replication necessary to ensure availability, yielding significant disk, rack, and power savings when dealing with Big Data. CDH4 will also include Apache Mahout for machine learning and improved versions of Flume, Sqoop, and Hue. There will also be new HBase diagnostic and repair tools and Kerberos for security.
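The drop from 3 to 2.2 is easier to appreciate with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not anything from Cloudera: the 1,000 TB data set is hypothetical, and the parameter combination in `effective_factor` (two copies of the data, two copies of one parity block per ten-block stripe) is just one assumed configuration that happens to yield 2.2. The article does not say which settings are used.

```python
def effective_factor(data_copies, parity_copies, parity_blocks, stripe_len):
    """Raw bytes stored per logical byte when parity blocks supplement replicas."""
    return data_copies + parity_copies * parity_blocks / stripe_len


def raw_storage_tb(logical_tb, factor):
    """Raw disk needed to hold logical_tb of data at a given factor."""
    return logical_tb * factor


def savings(old_factor, new_factor):
    """Fraction of raw disk saved by moving from old_factor to new_factor."""
    return (old_factor - new_factor) / old_factor


# Assumed parameters: 2 data copies, 2 parity copies, 1 parity block per 10 data blocks.
raid_factor = effective_factor(2, 2, 1, 10)

logical_tb = 1000  # hypothetical data set size
print(f"factor with parity: {raid_factor:.1f}")                             # 2.2
print(f"raw at 3x: {raw_storage_tb(logical_tb, 3.0):.0f} TB")               # 3000 TB
print(f"raw with parity: {raw_storage_tb(logical_tb, raid_factor):.0f} TB")  # 2200 TB
print(f"disk saved: {savings(3.0, raid_factor):.1%}")                       # 26.7%
```

Under these assumptions, roughly a quarter of the raw disk (and the racks and power behind it) is freed for the same logical data.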

Lipcon also mentioned the direction of Hadoop R&D at Cloudera. Goals for future distributions include encryption, disaster recovery, metadata storage and management, resource management, and MapReduce alternatives for problems outside the framework.

More Stories By Bob Gourley

Bob Gourley writes on enterprise IT. He is a founder and partner at Cognitio Corp and publisher of CTOvision.com.
