By Thomas Krafft
August 28, 2012 08:35 PM EDT
Researchers at the University of California, Berkeley have released an excellent paper. They analyzed data from the Hadoop installation at Facebook (one of the largest in the world), looking at various metrics for Hadoop jobs running in a Facebook datacenter that has over 3,000 computers dedicated to Hadoop-based processing.
They have come up with very interesting insights. I advise everyone to read it firsthand, but I will list some of the interesting bits here.
The traditional quest for disk locality (a.k.a. affinity between a Hadoop task and the disk that contains the input data for that task) was based on two key assumptions:
- Local disk access is significantly faster than network access to a remote disk
- Hadoop tasks spend a significant amount of their processing time in disk IO reading input data
Through careful analysis of the Hadoop system at Facebook (as their prime testbed), the authors claim that both of these assumptions are rapidly losing hold:
- With new full-bisection topologies in modern data centers, local disk access is almost identical in performance to network access, even across racks (the performance difference between the two today is less than 10%).
- Greater parallelization and data compression lead to lower disk IO demand on individual tasks; in fact, Hadoop jobs at Facebook deal mostly with text-based data that can be compressed dramatically.
The authors then argue that memory locality (i.e. keeping input data in memory and maintaining affinity between a Hadoop task and its in-memory input data) produces much greater performance advantages because:
- RAM access is up to three orders of magnitude faster than local disk access
- Even though memory size is significantly smaller than disk capacity, it is large enough for most cases (see below)
Consider this: even though 75% of all HDFS blocks are accessed only once, 64% of Hadoop jobs at Facebook achieve full memory locality for all their tasks (!). In the case of Hadoop, full locality means that there is no outlier task that has to access disk and delay the entire job. And all this is achieved using a rather primitive LFU caching policy and basic pre-fetching of input data.
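To make the "all or nothing" nature of full locality concrete, here is a minimal Python sketch of a least-frequently-used (LFU) block cache and a check for whether a job is fully memory-local. It is only an illustration: the class `LFUBlockCache`, the function `job_is_fully_memory_local`, the block IDs and the cache capacity are all hypothetical names and values chosen for this example, not code from the paper, from Hadoop, or from Facebook's setup, and the eviction rule is a simplified LFU variant.

```python
from collections import Counter


class LFUBlockCache:
    """Toy LFU cache of block IDs (hypothetical; not HDFS or Hadoop code)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.blocks = set()    # block IDs currently held in memory
        self.freq = Counter()  # access count for every block seen so far

    def access(self, block_id):
        """Record one access; return True if the block was already in memory."""
        self.freq[block_id] += 1
        if block_id in self.blocks:
            return True
        if len(self.blocks) >= self.capacity:
            # Evict the cached block with the lowest access count.
            victim = min(self.blocks, key=lambda b: self.freq[b])
            self.blocks.remove(victim)
        self.blocks.add(block_id)
        return False


def job_is_fully_memory_local(cache, input_blocks):
    """A job is fully local only if every task's input block is a memory hit.

    All blocks are accessed (no short-circuit), because every task still
    runs even when one of them has to go to disk.
    """
    hits = [cache.access(block) for block in input_blocks]
    return all(hits)


# Assumed example: a two-block cache warmed up by earlier reads of "a" and "b".
cache = LFUBlockCache(capacity=2)
for block in ["a", "b", "a", "b"]:
    cache.access(block)

fully_local = job_is_fully_memory_local(cache, ["a", "b"])      # both blocks hit
has_disk_outlier = job_is_fully_memory_local(cache, ["a", "c"])  # "c" misses
```

In this sketch the second job is not fully local even though half of its input is in memory: the single task reading `"c"` has to go to disk, and that one outlier is what holds up the whole job.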
With these facts, the authors conclude that disk locality is no longer worthwhile to vie for, and that in-memory co-location is the way forward for high-performance big data processing, as it yields far greater returns.
Facebook’s case is solid proof of this technology, and GridGain’s In-Memory Data Platform is a solid platform for the rest of us.