Welcome!

Related Topics: Microservices Expo, Microsoft Cloud, Open Source Cloud

Microservices Expo: Article

Making Sense of Large and Growing Data Volumes

MapReduce won’t overtake the enterprise data warehouse industry anytime soon

Is MapReduce the Holy Grail answer to the pressing problem of processing, analyzing and making sense of large and growing data volumes? Certainly it has potential in this arena, but there is a distressing gap between the amount of hype this technology - and its spinoffs - has received and the number of professionals who actually know how to integrate and make best use of it.

Industry watchers say it's just a matter of time before MapReduce sweeps through the enterprise data warehouse (EDW) market the same way open source technologies like Linux have done. In fact, in a recent blog post, Forrester's James Kobielus proclaimed that most EDW vendors will incorporate support for MapReduce's open source cousin Hadoop into the heart of their architectures to enable open, standards-based data analytics on massive amounts of data.

So, no more databases, just MapReduce? I'm not so sure. But don't misunderstand. It's not that MapReduce isn't an effective way to analyze data in some cases. The big names in Internet business are all using it - Facebook, Google, Amazon, eBay et al - so it must be good, right? But it's worth taking a more measured view based both on the technical and the practical business merits. I believe that the two technologies are not so mutually exclusive; that they will work hand-in-hand and, in some cases, MapReduce will be integrated into the relational database (RDBMS).

Google certainly has proven that MapReduce excels at making sense out of the exabytes of unstructured data on the web, which it should, given that MapReduce was designed from the outset for manipulating very large data sets. MapReduce in this sense provides a way to put structure around unstructured data. We humans prefer structure; it's in our DNA. Without structure, we have no real way of adding value to the data. Unstructured data analytics is something of an oxymoron for a pattern-seeking hominid.

MapReduce helps us put structure around the unstructured so we can then make sense of it. It creates an environment wherein a data analyst can write two simple functions, a "mapper" and a "reducer," to perform the actual data manipulation, returning a result that is at once both an analysis of the data it has just mapped and summarized, as well as the structure for further analysis that will help provide insight into the data. Whether that further analysis is done in a MapReduce environment might be the more appropriate question.

From an infrastructure standpoint, MapReduce excels where performance and scalability are challenges. Applications written using the MapReduce framework are automatically parallelized, making it well suited to a large infrastructure of connected machines. As it scales applications across lots of servers made up of lots of nodes, the MapReduce framework also provides built-in query fault tolerance so that whatever hardware component might fail, a query would be completed by another machine. Further, MapReduce and its open source brethren can perform functions not possible in standard SQL (click-stream sessionization, nPath, graph production of potentially unbounded length in SQL).

What's not to love? At a basic level I believe the MapReduce framework is an inefficient way of analyzing data for the vast majority of businesses. The aforementioned capabilities of MapReduce are all well and good, provided you have a Google-like business replete with legions of programmers and vast amounts of server and memory capacity. Viewed from this perspective, it makes perfect sense that Google developed and used MapReduce: because it could. It had a huge and growing resource in its farms of custom-made servers, as well as armies of programmers constantly looking for new ways to take advantage of that seemingly infinite hardware (and the data collected on it), to do cool new things.

Similarly, the other high-profile adopters and advocates are also IT-savvy, IT-heavy companies and, like Google, have the means and ongoing incentive to get a MapReduce framework tailored to their particular needs and reap the benefits. Would a mid-size firm know how? It seems doubtful. While it has claimed that MapReduce is easy to use, even for programmers without experience with distributed systems, I know from field experience with customers that it does, in fact, take some pretty experienced folks to make best use of it.

Projects like Hive, Google Sawzall, Yahoo Pig and companies like Cloudera all, in essence, attempt to make the MapReduce paradigm easier for lesser experts to use and, in fact, make it behave for the end user more like a parallel database. But this raises the question: Why? It seems to be a bit of re-inventing the wheel. IT-heavy is not how most businesses operate today, especially in these economic times. The dot-com bubble is long over. Hardware budgets are limited and few companies relish the idea of hiring teams of programming experts to maintain even a valuable IT asset such as their data warehouse. They'd rather buy an off-the-shelf tool designed from the ground up to do high-speed data analytics.

Like MapReduce, commercially available massively parallel processing databases specifically built for rapid, high volume data analytics will provide immense data scale and query fault tolerance. They also have a proven track record of customer deployments and deliver equal if not better performance on Big Data problems. Perhaps as important, today's next-generation MPP analytic databases give businesses the flexibility to draw on a deep pool of IT labor skilled in established conventions such as SQL.

As mentioned earlier, unstructured data seems like a natural for MapReduce analysis. A rising tide of chatter is focused on the increasing problem - and importance - of unstructured data. There is more than a bit of truth to this. As the Internet of everything becomes more and more a reality, data is generated everywhere; but our experience to date is that businesses are most interested in data derived from the transactional systems they've wired their businesses on top of, where structure is a given.

Another difficulty faces companies even as MapReduce becomes more integrated into the overall enterprise data analysis strategy. MapReduce is a framework. As the hype and interest have grown, MapReduce solutions are being created by database vendors in entirely non-standard and incompatible ways. This will further limit the likelihood that it will become the centerpiece of an EDW. Business has demonstrated time and again that it prefers open standards and interoperability.

Finally, I believe a move toward a programmer-centric approach to data analysis is both inefficient and contrary to all other prevailing trends of technology use in the enterprise. From the mobile workforce to the rise of social enterprise computing, the momentum is away from hierarchy. I believe this trend is the only way the problem of making Big Data actionable will be effectively addressed. In his classic book on the virtues of open source programming, The Cathedral and the Bazaar, Eric S. Raymond put forth the idea that open source was an effective way to address the complexity and density of information inherent in developing good software code. His proposition, "given enough eyeballs, all bugs are shallow," could easily be restated for Big Data as, "given enough analysts, all trends are apparent." The trick is - and really always has been - to get more people looking at the data. You don't achieve that end by centering your data analytics efforts on a tool largely geared to the skills of technical wizards.

MapReduce-type solutions as they currently exist are most effective when utilized by programmer-led organizations focused on maximizing their growing IT assets. For most businesses seeking the most efficient way to quickly turn their most valuable data into revenue generating insight, MPP databases will likely continue to hold sway, even as MapReduce-based solutions find a supporting role.

More Stories By Roger Gaskell

Roger Gaskell, CTO of Kognitio, has overall responsibility for all product development. He has been instrumental in all generations of the WX and WX2 database products to date, including evolving it from a database application running on proprietary hardware, to a software-only analytical database built on industry-standard blade servers.

Prior to Kognitio, Roger was test and development manager at AB Electronics for five years. During this time his primary responsibility was for the famous BBC Micro Computer and the development and testing of the first mass production of personal computers for IBM.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


Latest Stories
With major technology companies and startups seriously embracing Cloud strategies, now is the perfect time to attend @CloudExpo | @ThingsExpo, June 6-8, 2017, at the Javits Center in New York City, NY and October 31 - November 2, 2017, Santa Clara Convention Center, CA. Learn what is going on, contribute to the discussions, and ensure that your enterprise is on the right path to Digital Transformation.
HyperConvergence came to market with the objective of being simple, flexible and to help drive down operating expenses. It reduced the footprint by bundling the compute/storage/network into one box. This brought a new set of challenges as the HyperConverged vendors are very focused on their own proprietary building blocks. If you want to scale in a certain way, let’s say you identified a need for more storage and want to add a device that is not sold by the HyperConverged vendor, forget about it...
What if you could build a web application that could support true web-scale traffic without having to ever provision or manage a single server? Sounds magical, and it is! In his session at 20th Cloud Expo, Chris Munns, Senior Developer Advocate for Serverless Applications at Amazon Web Services, will show how to build a serverless website that scales automatically using services like AWS Lambda, Amazon API Gateway, and Amazon S3. We will review several frameworks that can help you build serverle...
Most companies are adopting or evaluating container technology - Docker in particular - to speed up application deployment, drive down cost, ease management and make application delivery more flexible overall. As with most new architectures, this dream takes a lot of work to become a reality. Even when you do get your application componentized enough and packaged properly, there are still challenges for DevOps teams to making the shift to continuous delivery and achieving that reduction in cost ...
SYS-CON Events announced today that Loom Systems will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded in 2015, Loom Systems delivers an advanced AI solution to predict and prevent problems in the digital business. Loom stands alone in the industry as an AI analysis platform requiring no prior math knowledge from operators, leveraging the existing staff to succeed in the digital era. With offices in S...
FinTech is the sum of financial and technology, and it’s one of the fastest growing tech industries. Total global investments in FinTech almost reached $50 billion last year, but there is still a great deal of confusion over what it is and what it means – especially as it applies to retirement. Building financial startups is not simple, but with the right team, technology and an innovative approach it can be an extremely interesting domain to disrupt. FinTech heralds a financial revolution that...
Imagine having the ability to leverage all of your current technology and to be able to compose it into one resource pool. Now imagine, as your business grows, not having to deploy a complete new appliance to scale your infrastructure. Also imagine a true multi-cloud capability that allows live migration without any modification between cloud environments regardless of whether that cloud is your private cloud or your public AWS, Azure or Google instance. Now think of a world that is not locked i...
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on ...
SYS-CON Events announced today that Infranics will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Since 2000, Infranics has developed SysMaster Suite, which is required for the stable and efficient management of ICT infrastructure. The ICT management solution developed and provided by Infranics continues to add intelligence to the ICT infrastructure through the IMC (Infra Management Cycle) based on mathemat...
MongoDB Atlas leverages VPC peering for AWS, a service that allows multiple VPC networks to interact. This includes VPCs that belong to other AWS account holders. By performing cross account VPC peering, users ensure networks that host and communicate their data are secure. In his session at 20th Cloud Expo, Jay Gordon, a Developer Advocate at MongoDB, will explain how to properly architect your VPC using existing AWS tools and then peer with your MongoDB Atlas cluster. He'll discuss the secur...
Historically, some banking activities such as trading have been relying heavily on analytics and cutting edge algorithmic tools. The coming of age of powerful data analytics solutions combined with the development of intelligent algorithms have created new opportunities for financial institutions. In his session at 20th Cloud Expo, Sebastien Meunier, Head of Digital for North America at Chappuis Halder & Co., will discuss how these tools can be leveraged to develop a lasting competitive advanta...
SYS-CON Events announced today that Interoute, owner-operator of one of Europe's largest networks and a global cloud services platform, has been named “Bronze Sponsor” of SYS-CON's 20th Cloud Expo, which will take place on June 6-8, 2017 at the Javits Center in New York, New York. Interoute is the owner-operator of one of Europe's largest networks and a global cloud services platform which encompasses 12 data centers, 14 virtual data centers and 31 colocation centers, with connections to 195 add...
SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a ...
SYS-CON Events announced today that SD Times | BZ Media has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. BZ Media LLC is a high-tech media company that produces technical conferences and expositions, and publishes a magazine, newsletters and websites in the software development, SharePoint, mobile development and commercial UAV markets.
SYS-CON Events announced today that Juniper Networks (NYSE: JNPR), an industry leader in automated, scalable and secure networks, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Juniper Networks challenges the status quo with products, solutions and services that transform the economics of networking. The company co-innovates with customers and partners to deliver automated, scalable and secure network...