|By Roger Gaskell||
|September 7, 2011 09:30 AM EDT||
Is MapReduce the Holy Grail answer to the pressing problem of processing, analyzing and making sense of large and growing data volumes? Certainly it has potential in this arena, but there is a distressing gap between the amount of hype this technology - and its spinoffs - has received and the number of professionals who actually know how to integrate and make best use of it.
Industry watchers say it's just a matter of time before MapReduce sweeps through the enterprise data warehouse (EDW) market the same way open source technologies like Linux have done. In fact, in a recent blog post, Forrester's James Kobielus proclaimed that most EDW vendors will incorporate support for MapReduce's open source cousin Hadoop into the heart of their architectures to enable open, standards-based data analytics on massive amounts of data.
So, no more databases, just MapReduce? I'm not so sure. But don't misunderstand. It's not that MapReduce isn't an effective way to analyze data in some cases. The big names in Internet business are all using it - Facebook, Google, Amazon, eBay et al - so it must be good, right? But it's worth taking a more measured view based both on the technical and the practical business merits. I believe that the two technologies are not so mutually exclusive; that they will work hand-in-hand and, in some cases, MapReduce will be integrated into the relational database (RDBMS).
Google certainly has proven that MapReduce excels at making sense out of the exabytes of unstructured data on the web, which it should, given that MapReduce was designed from the outset for manipulating very large data sets. MapReduce in this sense provides a way to put structure around unstructured data. We humans prefer structure; it's in our DNA. Without structure, we have no real way of adding value to the data. Unstructured data analytics is something of an oxymoron for a pattern-seeking hominid.
MapReduce helps us put structure around the unstructured so we can then make sense of it. It creates an environment wherein a data analyst can write two simple functions, a "mapper" and a "reducer," to perform the actual data manipulation, returning a result that is at once both an analysis of the data it has just mapped and summarized, as well as the structure for further analysis that will help provide insight into the data. Whether that further analysis is done in a MapReduce environment might be the more appropriate question.
From an infrastructure standpoint, MapReduce excels where performance and scalability are challenges. Applications written using the MapReduce framework are automatically parallelized, making it well suited to a large infrastructure of connected machines. As it scales applications across lots of servers made up of lots of nodes, the MapReduce framework also provides built-in query fault tolerance so that whatever hardware component might fail, a query would be completed by another machine. Further, MapReduce and its open source brethren can perform functions not possible in standard SQL (click-stream sessionization, nPath, graph production of potentially unbounded length in SQL).
What's not to love? At a basic level I believe the MapReduce framework is an inefficient way of analyzing data for the vast majority of businesses. The aforementioned capabilities of MapReduce are all well and good, provided you have a Google-like business replete with legions of programmers and vast amounts of server and memory capacity. Viewed from this perspective, it makes perfect sense that Google developed and used MapReduce: because it could. It had a huge and growing resource in its farms of custom-made servers, as well as armies of programmers constantly looking for new ways to take advantage of that seemingly infinite hardware (and the data collected on it), to do cool new things.
Similarly, the other high-profile adopters and advocates are also IT-savvy, IT-heavy companies and, like Google, have the means and ongoing incentive to get a MapReduce framework tailored to their particular needs and reap the benefits. Would a mid-size firm know how? It seems doubtful. While it has claimed that MapReduce is easy to use, even for programmers without experience with distributed systems, I know from field experience with customers that it does, in fact, take some pretty experienced folks to make best use of it.
Projects like Hive, Google Sawzall, Yahoo Pig and companies like Cloudera all, in essence, attempt to make the MapReduce paradigm easier for lesser experts to use and, in fact, make it behave for the end user more like a parallel database. But this raises the question: Why? It seems to be a bit of re-inventing the wheel. IT-heavy is not how most businesses operate today, especially in these economic times. The dot-com bubble is long over. Hardware budgets are limited and few companies relish the idea of hiring teams of programming experts to maintain even a valuable IT asset such as their data warehouse. They'd rather buy an off-the-shelf tool designed from the ground up to do high-speed data analytics.
Like MapReduce, commercially available massively parallel processing databases specifically built for rapid, high volume data analytics will provide immense data scale and query fault tolerance. They also have a proven track record of customer deployments and deliver equal if not better performance on Big Data problems. Perhaps as important, today's next-generation MPP analytic databases give businesses the flexibility to draw on a deep pool of IT labor skilled in established conventions such as SQL.
As mentioned earlier, unstructured data seems like a natural for MapReduce analysis. A rising tide of chatter is focused on the increasing problem - and importance - of unstructured data. There is more than a bit of truth to this. As the Internet of everything becomes more and more a reality, data is generated everywhere; but our experience to date is that businesses are most interested in data derived from the transactional systems they've wired their businesses on top of, where structure is a given.
Another difficulty faces companies even as MapReduce becomes more integrated into the overall enterprise data analysis strategy. MapReduce is a framework. As the hype and interest have grown, MapReduce solutions are being created by database vendors in entirely non-standard and incompatible ways. This will further limit the likelihood that it will become the centerpiece of an EDW. Business has demonstrated time and again that it prefers open standards and interoperability.
Finally, I believe a move toward a programmer-centric approach to data analysis is both inefficient and contrary to all other prevailing trends of technology use in the enterprise. From the mobile workforce to the rise of social enterprise computing, the momentum is away from hierarchy. I believe this trend is the only way the problem of making Big Data actionable will be effectively addressed. In his classic book on the virtues of open source programming, The Cathedral and the Bazaar, Eric S. Raymond put forth the idea that open source was an effective way to address the complexity and density of information inherent in developing good software code. His proposition, "given enough eyeballs, all bugs are shallow," could easily be restated for Big Data as, "given enough analysts, all trends are apparent." The trick is - and really always has been - to get more people looking at the data. You don't achieve that end by centering your data analytics efforts on a tool largely geared to the skills of technical wizards.
MapReduce-type solutions as they currently exist are most effective when utilized by programmer-led organizations focused on maximizing their growing IT assets. For most businesses seeking the most efficient way to quickly turn their most valuable data into revenue generating insight, MPP databases will likely continue to hold sway, even as MapReduce-based solutions find a supporting role.
Using new techniques of information modeling, indexing, and processing, new cloud-based systems can support cloud-based workloads previously not possible for high-throughput insurance, banking, and case-based applications. In his session at 18th Cloud Expo, John Newton, CTO, Founder and Chairman of Alfresco, described how to scale cloud-based content management repositories to store, manage, and retrieve billions of documents and related information with fast and linear scalability. He addres...
Jul. 23, 2016 09:00 AM EDT Reads: 798
The cloud market growth today is largely in public clouds. While there is a lot of spend in IT departments in virtualization, these aren’t yet translating into a true “cloud” experience within the enterprise. What is stopping the growth of the “private cloud” market? In his general session at 18th Cloud Expo, Nara Rajagopalan, CEO of Accelerite, explored the challenges in deploying, managing, and getting adoption for a private cloud within an enterprise. What are the key differences between wh...
Jul. 23, 2016 09:00 AM EDT Reads: 1,905
Adding public cloud resources to an existing application can be a daunting process. The tools that you currently use to manage the software and hardware outside the cloud aren’t always the best tools to efficiently grow into the cloud. All of the major configuration management tools have cloud orchestration plugins that can be leveraged, but there are also cloud-native tools that can dramatically improve the efficiency of managing your application lifecycle. In his session at 18th Cloud Expo, ...
Jul. 23, 2016 08:30 AM EDT Reads: 629
It’s 2016: buildings are smart, connected and the IoT is fundamentally altering how control and operating systems work and speak to each other. Platforms across the enterprise are networked via inexpensive sensors to collect massive amounts of data for analytics, information management, and insights that can be used to continuously improve operations. In his session at @ThingsExpo, Brian Chemel, Co-Founder and CTO of Digital Lumens, will explore: The benefits sensor-networked systems bring to ...
Jul. 23, 2016 08:15 AM EDT Reads: 1,424
Much of IT terminology is often misused and misapplied. Modernization and transformation are two such terms. They are often used interchangeably even though they mean different things and have very different connotations. Indeed, it is somewhat safe to assume that in IT any transformative effort is likely to also have a modernizing effect, and thus, we can see these as levels of improvement efforts. However, many businesses are being led to believe if they don’t transform now they risk becoming ...
Jul. 23, 2016 08:00 AM EDT Reads: 1,027
SYS-CON Events announced today the Enterprise IoT Bootcamp, being held November 1-2, 2016, in conjunction with 19th Cloud Expo | @ThingsExpo at the Santa Clara Convention Center in Santa Clara, CA. Combined with real-world scenarios and use cases, the Enterprise IoT Bootcamp is not just based on presentations but with hands-on demos and detailed walkthroughs. We will introduce you to a variety of real world use cases prototyped using Arduino, Raspberry Pi, BeagleBone, Spark, and Intel Edison. Y...
Jul. 23, 2016 08:00 AM EDT Reads: 1,216
When building large, cloud-based applications that operate at a high scale, it’s important to maintain a high availability and resilience to failures. In order to do that, you must be tolerant of failures, even in light of failures in other areas of your application. “Fly two mistakes high” is an old adage in the radio control airplane hobby. It means, fly high enough so that if you make a mistake, you can continue flying with room to still make mistakes. In his session at 18th Cloud Expo, Lee...
Jul. 23, 2016 08:00 AM EDT Reads: 1,211
Large scale deployments present unique planning challenges, system commissioning hurdles between IT and OT and demand careful system hand-off orchestration. In his session at @ThingsExpo, Jeff Smith, Senior Director and a founding member of Incenergy, will discuss some of the key tactics to ensure delivery success based on his experience of the last two years deploying Industrial IoT systems across four continents.
Jul. 23, 2016 08:00 AM EDT Reads: 1,382
SYS-CON Events announced today that Venafi, the Immune System for the Internet™ and the leading provider of Next Generation Trust Protection, will exhibit at @DevOpsSummit at 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Venafi is the Immune System for the Internet™ that protects the foundation of all cybersecurity – cryptographic keys and digital certificates – so they can’t be misused by bad guys in attacks...
Jul. 23, 2016 07:45 AM EDT Reads: 1,038
Identity is in everything and customers are looking to their providers to ensure the security of their identities, transactions and data. With the increased reliance on cloud-based services, service providers must build security and trust into their offerings, adding value to customers and improving the user experience. Making identity, security and privacy easy for customers provides a unique advantage over the competition.
Jul. 23, 2016 07:45 AM EDT Reads: 941
Whether your IoT service is connecting cars, homes, appliances, wearable, cameras or other devices, one question hangs in the balance – how do you actually make money from this service? The ability to turn your IoT service into profit requires the ability to create a monetization strategy that is flexible, scalable and working for you in real-time. It must be a transparent, smoothly implemented strategy that all stakeholders – from customers to the board – will be able to understand and comprehe...
Jul. 23, 2016 07:30 AM EDT Reads: 2,006
"There's a growing demand from users for things to be faster. When you think about all the transactions or interactions users will have with your product and everything that is between those transactions and interactions - what drives us at Catchpoint Systems is the idea to measure that and to analyze it," explained Leo Vasiliou, Director of Web Performance Engineering at Catchpoint Systems, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York Ci...
Jul. 23, 2016 07:15 AM EDT Reads: 1,830
"Tintri was started in 2008 with the express purpose of building a storage appliance that is ideal for virtualized environments. We support a lot of different hypervisor platforms from VMware to OpenStack to Hyper-V," explained Dan Florea, Director of Product Management at Tintri, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
Jul. 23, 2016 07:15 AM EDT Reads: 1,767
"Avere Systems is a hybrid cloud solution provider. We have customers that want to use cloud storage and we have customers that want to take advantage of cloud compute," explained Rebecca Thompson, VP of Marketing at Avere Systems, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
Jul. 23, 2016 07:00 AM EDT Reads: 1,875
SaaS companies can greatly expand revenue potential by pushing beyond their own borders. The challenge is how to do this without degrading service quality. In his session at 18th Cloud Expo, Adam Rogers, Managing Director at Anexia, discussed how IaaS providers with a global presence and both virtual and dedicated infrastructure can help companies expand their service footprint with low “go-to-market” costs.
Jul. 23, 2016 07:00 AM EDT Reads: 2,237