Explicitly-defined failure domains in the datacenter

While the bulk of the networking industry’s focus is on CapEx and automation, the two major trends driving changes in these areas will have a potentially greater impact on that which matters most: availability. In fact, even though datacenter purchasing decisions skew towards CapEx, the number one requirement for most datacenters is uptime.

If availability is so important, why is it not already the number one purchasing criterion?

First, it’s not that availability doesn’t matter. It’s more that when everyone is building the same thing, it ceases to be a differentiating point. Most switch vendors have converged on Broadcom silicon and a basic set of features required to run datacenter architectures that have really been unchanged for a decade or more. But are those architectures going to continue unscathed?

SDN and network architecture

For those who believe in the transformative power of SDN, the answer is unequivocally no. If, after all of the SDN work is said and done, we emerge with the same architectures with an extra smidge of automation sprinkled on top, we will have grossly under-delivered on what should be the kind of change that happens once every couple of decades.

The rise of a central controller is more than just pushing provisioning to a single pane of glass. It is about central control over a network using a global perspective to make intelligent resourcing and pathing decisions. While this does not necessarily mean that legacy networking cannot (or should not) co-exist, the model is dramatically different than what exists today.

Switching is just a stop along the way for bare metal

Bare metal switching will also change IT infrastructure in a meaningful way. Again, the scope of public discourse is fairly narrow. The story goes something like this: if someone makes a commodity switch, pricing will come down. But there are two things that are really happening here.

First, what we are really seeing is a shift in monetization from hardware to software. This shift should not be surprising, as software investment on the vendor side has dwarfed hardware R&D for 15 years now. The real change here is that the companies that emerge have a tolerance for lower margins than the behemoths already entrenched in the space. Anyone can drop price; the question is at what margins a business is still attractive. What will play out over the next couple of years is a game of chicken over price.

Second, the objective of bare metal switching is less about the switching and more about the bare metal. Taken to its logical conclusion, the hope has to be that all infrastructure is eventually run on the same set of hardware. Whether something is a server, a storage device, an appliance, or a switch should ultimately be determined by the software that is being run on it. In this case, we see multi-purpose devices whose role depends on the context in which they are deployed. This would eventually allow for the fungibility of resources across what are currently very hard silos.

Domain, Domain, Domain

Both SDN and bare metal lead to very different architectures from those that exist today. But as architects consider how they will evolve their own instantiations of these technologies, they need to be clear about a couple of facts that get glossed over.

If availability really is the number one requirement for datacenters, then architectures need to explicitly consider how they impact overall resource availability. Consider that there are a number of sources for downtime:

  1. Human error - By far the leading source of downtime in most networks, human error is why there is such momentum around things like ITIL. Put differently, when is uptime the highest for most datacenters? Holidays, when everyone is away from the office.
  2. System issues – After human error, the next biggest cause of downtime is issues in the systems themselves. These are most likely software bugs, but they can include device and link failures as well.
  3. Maintenance – Another major contributor to downtime is overall infrastructure maintenance. When individual systems need to be upgraded or replaced, there is frequently some interruption to service. Of course, if maintenance is planned, then the impact on overall downtime should be low.
  4. Other – The Other category covers things like power outages and backhoes.

Of these, SDN promises to improve the first one. By expanding the management domain, it reduces the number of opportunities for pesky humans to make mistakes. Automated workflows that are executed from a single point of control and orchestrated across disparate elements of the infrastructure should help drive the number of provisioning mistakes in the datacenter down.
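
As a rough sketch of that idea (this is not any particular controller’s API; the controller URL, endpoint, and payload below are invented for illustration), the difference is between typing the same change into dozens of devices by hand and handing one validated request to a controller that fans it out:

```python
# Hypothetical sketch: one validated change pushed from a central point of
# control instead of N hand-typed CLI sessions. The URL and endpoint are
# placeholders, not a real product API.
import requests

CONTROLLER = "https://controller.example.net/api/v1"  # placeholder

def provision_vlan(vlan_id: int, name: str, switch_ids: list[str]) -> None:
    """Ask the controller to apply the same VLAN to every listed switch."""
    change = {"vlan": vlan_id, "name": name, "switches": switch_ids}
    # One request replaces many per-device sessions, so a typo is caught once
    # by controller-side validation rather than being repeated on every box.
    resp = requests.post(f"{CONTROLLER}/vlans", json=change, timeout=10)
    resp.raise_for_status()

if __name__ == "__main__":
    # Example invocation; with a placeholder controller this will not reach
    # a real device, which is the point of keeping it hypothetical.
    provision_vlan(42, "web-tier", ["leaf-01", "leaf-02", "leaf-03"])
```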

Additionally, a central point of control helps improve visibility (or at least it will over time). This helps operators diagnose issues more quickly, which will lower the Mean-Time-to-Repair (MTTR) for network-related issues.

But the management domain is not the only one that matters. There are at least two others that impact downtime: failure domains and maintenance domains. The impact of SDN and bare metal on each of these needs to be explicitly understood.

Failure domains

While there are tremendous operational benefits of collapsing domains under a single umbrella, one thing that becomes more difficult is managing the impact of failures when they do occur.

For instance, if a network is under the control of a single SDN controller, what happens if that controller is not reachable? If the controller is an active part of the data path, there is one set of outcomes. If the controller is not an active part of the data path, there is a different set of outcomes.
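
To make those two outcome sets concrete, here is a minimal sketch that assumes nothing about a specific controller protocol; the mode names are loosely modeled on the fail-standalone and fail-secure behaviors described for OpenFlow-style switches:

```python
# Illustrative only: what a switch might do with its forwarding state when
# the controller becomes unreachable.
from enum import Enum

class FallbackMode(Enum):
    FAIL_STANDALONE = "keep forwarding with last-known state"
    FAIL_SECURE = "drop traffic that would need a new controller decision"

def on_controller_loss(mode: FallbackMode, flow_table: dict) -> dict:
    if mode is FallbackMode.FAIL_STANDALONE:
        # Data path stays up; the failure domain is limited to new changes.
        return flow_table
    # Data path shrinks to what was already installed; traffic that needed a
    # fresh decision from the controller is now inside the failure domain.
    return {match: action for match, action in flow_table.items()
            if action != "send-to-controller"}
```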

The point here is not to advocate for one or the other, but rather to point out that architects need to be explicit in defining the failure domain so that they can adjust operations appropriately. For instance, it might be the case that you prefer to balance the control benefits with failure scenarios, opting to create several smaller management domains, each with a correspondingly smaller failure domain. This gives you some benefit over a completely distributed management environment (where management domains are defined by the devices themselves) without putting the entire network under the same failure domain.
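
One way to be explicit is to write the domain boundaries down. In the hypothetical sketch below the pod names and device lists are invented; the point is simply that the blast radius of a controller failure is exactly what you chose to assign to it:

```python
# Hypothetical partitioning: several smaller management domains, each with a
# correspondingly smaller failure domain, instead of one fabric-wide domain.
DOMAINS = {
    "pod-a": ["leaf-01", "leaf-02", "spine-01"],
    "pod-b": ["leaf-03", "leaf-04", "spine-02"],
    "pod-c": ["leaf-05", "leaf-06", "spine-03"],
}

def blast_radius(domain: str) -> list[str]:
    """Devices affected if the controller for one domain becomes unreachable."""
    return DOMAINS.get(domain, [])

# Losing pod-a's controller touches three devices, not the whole fabric.
print(blast_radius("pod-a"))
```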

The same is true with bare metal. If bare metal leads to platform convergence, it allows architects to co-host compute, storage, and networking in the same device. Whether you actually group them depends an awful lot on how you view failure domains. Again, any balance is useful so long as it is explicitly chosen. Collapsing everything to a single device creates a different failure domain, which might make sense in some environments and less so in others.

Maintenance domains

The same discussion extends to maintenance domains. Collapsing everything might create a single maintenance domain (depending on architecture, of course). Keeping things separate might enable much smaller maintenance domains. There is no right size, but whatever the architecture that is chosen, the maintenance domain needs to be an explicit requirement.
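
The same explicit grouping can drive the maintenance itself. The sketch below is hypothetical (the device names and the upgrade step are placeholders); it simply walks maintenance domains one at a time so that only one deliberately chosen slice of the fabric is ever out of service:

```python
# Hypothetical rolling maintenance across explicitly defined domains.
import time

MAINTENANCE_DOMAINS = {
    "pod-a": ["leaf-01", "leaf-02"],
    "pod-b": ["leaf-03", "leaf-04"],
}

def upgrade(device: str) -> None:
    """Placeholder for whatever the real upgrade mechanism is."""
    print(f"upgrading {device} ...")
    time.sleep(0.1)

def rolling_maintenance(domains: dict[str, list[str]]) -> None:
    # Only one maintenance domain is disturbed at a time; the rest of the
    # fabric keeps carrying traffic while this slice is drained and upgraded.
    for name, devices in domains.items():
        print(f"draining {name}")
        for device in devices:
            upgrade(device)
        print(f"restoring {name}")

rolling_maintenance(MAINTENANCE_DOMAINS)
```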

The bottom line

Architectures are changing. There is little doubt that technology advances in IT generally and networking specifically are enabling us to do things we couldn’t really even consider just a few years ago. When deciding how to do those things, though, we need to be explicitly designing for availability. Availability has always been a requirement, but it has received less and less attention as architectures matured; it should be dominating discussions again. Depending on your own specific requirements, this could lead to some unexpected architectural decisions.

[Today’s fun fact: Playing in a marching band is considered moderate exercise. But lest there be any confusion, this does not make it a sport.]

More Stories By Michael Bushong

The best marketing efforts leverage deep technology understanding with a highly approachable means of communicating. Plexxi's Vice President of Marketing Michael Bushong has acquired these skills having spent 12 years at Juniper Networks, where he led product management, product strategy and product marketing organizations for Juniper's flagship operating system, Junos. Michael spent his last several years at Juniper leading their SDN efforts across both service provider and enterprise markets. Prior to Juniper, Michael spent time at database supplier Sybase, and ASIC design tool companies Synopsys and Magma Design Automation. Michael's undergraduate work at the University of California, Berkeley in advanced fluid mechanics and heat transfer lends new meaning to the marketing phrase "This isn't rocket science."
