Welcome!

Related Topics: @BigDataExpo, Apache

@BigDataExpo: Blog Post

An Example to Illustrate Hadoop Code Reuse

The developer tool to realize Hadoop code reuse

The MapReduce of Hadoop is a widely-used parallel computing framework. However, its code reuse mechanism is inconvenient, and it is quite cumbersome to pass parameters. Far different from our usual experience of calling the library function easily, I found both the coder and the caller must bear a sizable amount of precautions in mind when writing even a short pieces of program for calling by others.

However, we finally find that esProc could easily realize code reuse in hadoop. Still a simple and understandable example of grouping and summarizing, let's check out a solution with not so great reusability. Suppose we need to group the big data of order (sales.txt) on HDFS by salesman (empID), and seek the corresponding sales amount of each Salesman. esProc codes are:
Code for summary machine:



Code for node machine:



esProc classifies the distributed computing into two categories: The respective codes for summary machine and node machine. The summary machine is responsible for task scheduling, distributing the task to every task in the form of parameter, and finally integrating and summarizing the computing results from node machines. The node machines are used to get a segment of the whole data piece as specified by parameters, and then group and summarize the data of this segment.

As can be seen, esProc code is intuitive and straightforward, just like the natural and common thinking patterns. The summary machine distributes a task into several segments; distributes them to the unit machine to summarize initially; and then further summarizes the summary machine for the second time. Another thing to note is the esProc grouping and summarizing function "groups", which is used to perform the grouping action over the two-dimensional table A1 by empID and sum up the values of amount fields. The result will be renamed to the understandable totalAmount. This whole procedure of grouping and summarizing is quite concise and intuitive: A1.groups(empID;sum(amount): totalAmount)

In addition, the groups function can be applied to not only the small 2D table, but also the 2D table that is too great to be held in the memory. For example, the cursor mode is adopted for the above codes.

But there are some obvious defects in the above example: The reusability of code is not great. In the steps followed, we will rewrite the above example to a universal algorithm independent of any concrete business. It will be rewritten to control the code flow with parameters, so as to summarize whatsoever data file. In which, the task granularity can be scheduled into arbitrary number of segments, and the computing nodes can be specified at will. Then, the revised codes are shown below:

Code for summary machine. There are altogether 4 parameters defined here: fileName: Big data file to analyze; taskNumber: Number of tasks to distribute; groupField: Fields to group; sumField: Fields to summarize. In addition, the node machine is obtained via reading the profiles.



Code for node machine. In the revised codes, 4 variables are used to receive the parameter from summary machine. Besides the file starting and ending positions (start and end) from the first example, there are two newly-added fields. They are groupField: Fields to group; and sumField: Fields to summarize.



In esProc, it is much easier to pass and use parameter because users can implement the common grouping and summarizing with the least modification workload, and reuse the codes easily.

In Hadoop, the complicated business algorithm is mainly implemented by writing the MapReduce class. By comparison, it is much more inflexible to pass and use parameters in MapReduce. Though it is possible to implement a flexible algorithm independent of the concrete business, it is really cumbersome. Judging the Hadoop codes, the coupling degree of code and business is great. To pass the parameters, a global-variable-like mechanism is required, which is not only inconvenient but also hard to understand. That's why so many questions about MapReduce parameter-passing are here and there on many Web pages. Lots of people feel confused about developing universal algorithms with MapReduce.

In addition, the default separator in the above codes is the comma. It is obvious that users only need to add a variable in a similar way to customize it to any more commonly-used symbol. With it, they can also implement the common action of data filtering and then grouping and summarizing easily. Please note the usage of parameter groupField. It is used as the character parameter in the cell A6, but the macro in A8. In other words, ${gruopField} can be resolved as the formula itself, instead of any parameter in the formula alone. This is the work of dynamic language. Therefore, esProc can realize the completely flexible code, for example, using the parameter to control the summary algorithm to perform sum up or just count, seek the average value or maximum.

"Macro" is a simple special case of dynamic language. esProc supports a more flexible and complete dynamic language system.

As you may find from the above example, esProc can implement Hadoop code reuse easily, and basically achieve the goal of "Write once, run anywhere!". Needless to say, the development efficiency can be boosted dramatically.

personal blog: http://datakeyword.blogspot.com/

website: http://www.raqsoft.com/

More Stories By Jessica Qiu

Jessica Qiu is the editor of Raqsoft. She provides press releases for data computation and data analytics.

Latest Stories
Due of the rise of Hadoop, many enterprises are now deploying their first small clusters of 10 to 20 servers. At this small scale, the complexity of operating the cluster looks and feels like general data center servers. It is not until the clusters scale, as they inevitably do, when the pain caused by the exponential complexity becomes apparent. We've seen this problem occur time and time again. In his session at Big Data Expo, Greg Bruno, Vice President of Engineering and co-founder of StackIQ...
The security needs of IoT environments require a strong, proven approach to maintain security, trust and privacy in their ecosystem. Assurance and protection of device identity, secure data encryption and authentication are the key security challenges organizations are trying to address when integrating IoT devices. This holds true for IoT applications in a wide range of industries, for example, healthcare, consumer devices, and manufacturing. In his session at @ThingsExpo, Lancen LaChance, vic...
"Plutora provides release and testing environment capabilities to the enterprise," explained Dalibor Siroky, Director and Co-founder of Plutora, in this SYS-CON.tv interview at @DevOpsSummit, held June 9-11, 2015, at the Javits Center in New York City.
FinTech is the sum of financial and technology, and it’s one of the fastest growing tech industries. Total global investments in FinTech almost reached $50 billion last year, but there is still a great deal of confusion over what it is and what it means – especially as it applies to retirement. Building financial startups is not simple, but with the right team, technology and an innovative approach it can be an extremely interesting domain to disrupt. FinTech heralds a financial revolution that...
SYS-CON Media announced today that @WebRTCSummit Blog, the largest WebRTC resource in the world, has been launched. @WebRTCSummit Blog offers top articles, news stories, and blog posts from the world's well-known experts and guarantees better exposure for its authors than any other publication. @WebRTCSummit Blog can be bookmarked ▸ Here @WebRTCSummit conference site can be bookmarked ▸ Here
In his session at DevOps Summit, Tapabrata Pal, Director of Enterprise Architecture at Capital One, will tell a story about how Capital One has embraced Agile and DevOps Security practices across the Enterprise – driven by Enterprise Architecture; bringing in Development, Operations and Information Security organizations together. Capital Ones DevOpsSec practice is based upon three "pillars" – Shift-Left, Automate Everything, Dashboard Everything. Within about three years, from 100% waterfall, C...
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo 2016 in New York. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place June 6-8, 2017, at the Javits Center in New York City, New York, is co-located with 20th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry p...
SYS-CON Events announced today that Addteq will exhibit at SYS-CON's @DevOpsSummit at Cloud Expo New York, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Addteq is one of the top 10 Platinum Atlassian Experts who specialize in DevOps, custom and continuous integration, automation, plugin development, and consulting for midsize and global firms. Addteq firmly believes that automation is essential for successful software releases. Addteq centers its products an...
In his keynote at @ThingsExpo, Chris Matthieu, Director of IoT Engineering at Citrix and co-founder and CTO of Octoblu, focused on building an IoT platform and company. He provided a behind-the-scenes look at Octoblu’s platform, business, and pivots along the way (including the Citrix acquisition of Octoblu).
You think you know what’s in your data. But do you? Most organizations are now aware of the business intelligence represented by their data. Data science stands to take this to a level you never thought of – literally. The techniques of data science, when used with the capabilities of Big Data technologies, can make connections you had not yet imagined, helping you discover new insights and ask new questions of your data. In his session at @ThingsExpo, Sarbjit Sarkaria, data science team lead ...
SYS-CON Events announced today that IoT Now has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. IoT Now explores the evolving opportunities and challenges facing CSPs, and it passes on some lessons learned from those who have taken the first steps in next-gen IoT services.
SYS-CON Events announced today that WineSOFT will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Based in Seoul and Irvine, WineSOFT is an innovative software house focusing on internet infrastructure solutions. The venture started as a bootstrap start-up in 2010 by focusing on making the internet faster and more powerful. WineSOFT’s knowledge is based on the expertise of TCP/IP, VPN, SSL, peer-to-peer, mob...
For organizations that have amassed large sums of software complexity, taking a microservices approach is the first step toward DevOps and continuous improvement / development. Integrating system-level analysis with microservices makes it easier to change and add functionality to applications at any time without the increase of risk. Before you start big transformation projects or a cloud migration, make sure these changes won’t take down your entire organization.
The Internet of Things can drive efficiency for airlines and airports. In their session at @ThingsExpo, Shyam Varan Nath, Principal Architect with GE, and Sudip Majumder, senior director of development at Oracle, discussed the technical details of the connected airline baggage and related social media solutions. These IoT applications will enhance travelers' journey experience and drive efficiency for the airlines and the airports.
Big Data, cloud, analytics, contextual information, wearable tech, sensors, mobility, and WebRTC: together, these advances have created a perfect storm of technologies that are disrupting and transforming classic communications models and ecosystems. In his session at @ThingsExpo, Erik Perotti, Senior Manager of New Ventures on Plantronics’ Innovation team, provided an overview of this technological shift, including associated business and consumer communications impacts, and opportunities it m...