An Example to Illustrate Hadoop Code Reuse

The developer tool to realize Hadoop code reuse

Hadoop's MapReduce is a widely used parallel computing framework. However, its code reuse mechanism is inconvenient, and passing parameters is cumbersome. Unlike the familiar experience of simply calling a library function, both the author and the caller have to keep a sizable number of precautions in mind when writing even a short piece of code for others to call.

However, we found that esProc makes code reuse on Hadoop easy. Take a simple, familiar example of grouping and summarizing, and first look at a solution that is not very reusable. Suppose we need to group the large order data set (sales.txt) on HDFS by salesman (empID) and compute the total sales amount for each salesman. The esProc code is:
Code for summary machine:



Code for node machine:



esProc divides the code for distributed computing into two parts: the code for the summary machine and the code for the node machines. The summary machine is responsible for task scheduling: it distributes the tasks to the node machines in the form of parameters, and finally integrates and summarizes the results that the node machines return. Each node machine reads the segment of the whole data file specified by its parameters, and then groups and summarizes the data in that segment.
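
The division of labor described above can be sketched in plain Python. This is only an illustration of the idea, not the esProc code from the article: it assumes sales.txt is a comma-separated file without a header whose two columns are empID and amount, and it runs the "node" tasks in the same process instead of on remote machines. The function names node_task and summary_task are hypothetical.

```python
import os
from collections import defaultdict


def node_task(path, start, end):
    """Node side: group and sum the lines whose first byte lies in [start, end)."""
    totals = defaultdict(float)
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard up to the next newline, so that a
            # line cut by the segment boundary is left to the previous segment.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            line = line.strip()
            if not line:
                continue
            # Assumed layout: empID,amount
            emp_id, amount = line.decode("utf-8").split(",")
            totals[emp_id] += float(amount)
    return dict(totals)


def summary_task(path, task_number):
    """Summary side: split the file into byte ranges, then merge the partial results."""
    size = os.path.getsize(path)
    bounds = [size * i // task_number for i in range(task_number + 1)]
    # In a real cluster each call would be sent to a node machine.
    partials = [node_task(path, bounds[i], bounds[i + 1]) for i in range(task_number)]
    totals = defaultdict(float)
    for partial in partials:
        for emp_id, amount in partial.items():
            totals[emp_id] += amount
    return dict(totals)
```

Because every line is owned by exactly one segment (the one containing its first byte), the merged result does not depend on how many tasks the file is split into.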

As can be seen, the esProc code is intuitive and straightforward, and follows a natural way of thinking. The summary machine splits the job into several segments, distributes them to the node machines for an initial summary, and then summarizes the partial results a second time on the summary machine. Also note the esProc grouping and summarizing function "groups", which groups the two-dimensional table A1 by empID and sums the values of the amount field. The result is given the more readable name totalAmount. The whole grouping and summarizing step is concise and intuitive: A1.groups(empID;sum(amount): totalAmount)
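
For readers who do not know esProc, the following is a rough Python equivalent of what the groups expression above computes. It is a sketch under assumptions: the table is modeled as a list of dictionaries, the helper name groups is borrowed for illustration only, and the result is sorted by the grouping key simply to make the output deterministic.

```python
from collections import defaultdict


def groups(rows, group_field, sum_field, result_name):
    """Group rows by group_field and sum sum_field, naming the result column result_name."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_field]] += row[sum_field]
    return [
        {group_field: key, result_name: total}
        for key, total in sorted(totals.items())
    ]


# Hypothetical sample data standing in for table A1.
A1 = [
    {"empID": 1, "amount": 100},
    {"empID": 2, "amount": 50},
    {"empID": 1, "amount": 25},
]
summary = groups(A1, "empID", "amount", "totalAmount")
```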

In addition, the groups function can be applied not only to a small 2D table but also to a 2D table that is too large to fit in memory. The code above, for example, uses cursor mode.
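
The idea behind cursor mode, aggregating while reading so that the whole table never has to be in memory, can be sketched in Python as follows. This is an assumption-laden illustration rather than esProc's implementation: the input is assumed to be comma-separated text with a header row, and the names cursor and groups_from_cursor are hypothetical.

```python
import csv
from collections import defaultdict
from itertools import islice


def cursor(text_stream, batch_size):
    """Yield the rows of a delimited text stream in batches of at most batch_size."""
    reader = csv.DictReader(text_stream)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def groups_from_cursor(batches, group_field, sum_field):
    """Group and sum batch by batch; only the running totals stay in memory."""
    totals = defaultdict(float)
    for batch in batches:
        for row in batch:
            totals[row[group_field]] += float(row[sum_field])
    return dict(totals)
```

Memory use is bounded by the batch size plus one running total per distinct group, not by the size of the file.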

But the example above has an obvious defect: the code is not very reusable. In the following steps, we rewrite it into a universal algorithm that is independent of any concrete business. Parameters control the code flow, so that it can summarize any data file, the work can be split into an arbitrary number of tasks, and the computing nodes can be specified at will. The revised code is shown below:

Code for summary machine. Four parameters are defined here: fileName, the big data file to analyze; taskNumber, the number of tasks to distribute; groupField, the field to group by; and sumField, the field to summarize. In addition, the list of node machines is obtained by reading the profile.



Code for node machine. In the revised code, four variables receive the parameters from the summary machine. Besides the starting and ending positions in the file (start and end) from the first example, there are two new ones: groupField, the field to group by, and sumField, the field to summarize.
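
As a rough illustration of this parameterization (not the esProc code itself), the Python sketch below takes the same parameters: fileName, taskNumber, groupField and sumField on the summary side, and start, end, groupField and sumField on the node side. It assumes a comma-separated file whose first line is a header with the field names, and it runs the node tasks in-process rather than reading a list of node machines from a profile.

```python
import os
from collections import defaultdict


def node_task(file_name, start, end, group_field, sum_field):
    """Node side: group by group_field and sum sum_field for lines starting in [start, end)."""
    totals = defaultdict(float)
    with open(file_name, "rb") as f:
        # Assumption: the first line names the fields.
        header = f.readline().decode("utf-8").strip().split(",")
        group_idx = header.index(group_field)
        sum_idx = header.index(sum_field)
        data_start = f.tell()
        if start > data_start:
            # Leave a line cut by the boundary to the previous segment.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            fields = line.decode("utf-8").strip().split(",")
            if len(fields) != len(header):
                continue  # skip blank or malformed lines
            totals[fields[group_idx]] += float(fields[sum_idx])
    return dict(totals)


def summary_task(file_name, task_number, group_field, sum_field):
    """Summary side: split the file into task_number byte ranges and merge the results."""
    size = os.path.getsize(file_name)
    bounds = [size * i // task_number for i in range(task_number + 1)]
    totals = defaultdict(float)
    for i in range(task_number):
        partial = node_task(file_name, bounds[i], bounds[i + 1], group_field, sum_field)
        for key, value in partial.items():
            totals[key] += value
    return dict(totals)
```

The same two functions can then summarize a different file, or the same file by a different field, just by changing the arguments.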



In esProc, passing and using parameters is much easier: users can implement generic grouping and summarizing with minimal modification, and reuse the code easily.

In Hadoop, complicated business algorithms are mainly implemented by writing MapReduce classes. By comparison, passing and using parameters in MapReduce is much less flexible. It is possible to implement a flexible algorithm independent of the concrete business, but it is really cumbersome. In typical Hadoop code, the code is tightly coupled to the business logic. Passing parameters requires a global-variable-like mechanism, which is both inconvenient and hard to understand. That is why questions about MapReduce parameter passing appear on so many Web pages, and why many people find it confusing to develop universal algorithms with MapReduce.
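
To make the "global-variable-like mechanism" concrete, here is a plain-Python imitation of the pattern. It is not Hadoop API code: the class GroupSumMapper, the function run_job and the configuration keys are all hypothetical. It only mimics the situation in which the framework, not the caller, creates the mapper, so parameters cannot be passed as ordinary arguments and must be looked up by string key in a job-wide configuration.

```python
class GroupSumMapper:
    """Imitates a mapper that the framework instantiates without arguments."""

    def setup(self, conf):
        # Parameters arrive as strings under agreed-upon keys; caller and mapper
        # must both know the key names, and a typo only surfaces at run time.
        self.sep = conf.get("groupsum.separator", ",")
        self.group_idx = int(conf["groupsum.group.index"])
        self.sum_idx = int(conf["groupsum.sum.index"])

    def map(self, line):
        fields = line.split(self.sep)
        return fields[self.group_idx], float(fields[self.sum_idx])


def run_job(conf, lines):
    """Stand-in for the framework: build the mapper, hand it the shared configuration."""
    mapper = GroupSumMapper()
    mapper.setup(conf)
    totals = {}
    for line in lines:
        key, value = mapper.map(line)
        totals[key] = totals.get(key, 0.0) + value
    return totals


# The caller sets string values under string keys instead of calling a function with arguments.
conf = {"groupsum.group.index": "1", "groupsum.sum.index": "2"}
result = run_job(conf, ["1001,7,100", "1002,8,50", "1003,7,25"])
```

Compared with an ordinary function call, the coupling between caller and mapper is hidden in the key names, which is the inconvenience the paragraph above describes.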

In addition, the default separator in the code above is the comma. Users only need to add one more variable in the same way to change it to any other commonly used symbol. In the same way, they can also easily implement the common pattern of filtering data first and then grouping and summarizing. Note the usage of the parameter groupField: it is used as a string parameter in cell A6, but as a macro in A8. In other words, ${groupField} is resolved as part of the formula itself, rather than only as a parameter value in the formula. This is what a dynamic language makes possible. As a result, esProc code can be made completely flexible, for example by using a parameter to control whether the summary algorithm computes a sum, a count, an average, or a maximum.
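
The two extensions mentioned above, a configurable separator and an aggregation chosen by parameter, can be sketched in Python as follows. This is only a loose analogy to the esProc macro: Python looks up a function by name instead of substituting text into a formula. The names AGGREGATORS and group_and_aggregate are hypothetical, and the input is assumed to be a list of text lines whose first line is a header.

```python
from collections import defaultdict

# Aggregations selectable by name at run time.
AGGREGATORS = {
    "sum": sum,
    "count": len,
    "avg": lambda values: sum(values) / len(values),
    "max": max,
}


def group_and_aggregate(lines, group_field, agg_field, agg_name, sep=","):
    """Group lines by group_field and apply the aggregation named agg_name to agg_field."""
    header = lines[0].split(sep)
    group_idx = header.index(group_field)
    agg_idx = header.index(agg_field)
    aggregate = AGGREGATORS[agg_name]
    # Values are collected per group for simplicity; this is not memory-efficient.
    values_by_key = defaultdict(list)
    for line in lines[1:]:
        fields = line.split(sep)
        values_by_key[fields[group_idx]].append(float(fields[agg_idx]))
    return {key: aggregate(values) for key, values in values_by_key.items()}
```

For example, calling group_and_aggregate(lines, "empID", "amount", "avg", sep="|") switches both the separator and the algorithm without touching the function body.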

"Macro" is a simple special case of dynamic language. esProc supports a more flexible and complete dynamic language system.

As the example above shows, esProc can implement Hadoop code reuse easily and basically achieves the goal of "Write once, run anywhere!". As a result, development efficiency can be improved considerably.

personal blog: http://datakeyword.blogspot.com/

website: http://www.raqsoft.com/

More Stories By Jessica Qiu

Jessica Qiu is the editor of Raqsoft. She provides press releases for data computation and data analytics.
