Is the Facebook DC Architecture right for you?

A few weeks ago Facebook announced their new datacenter architecture in a post on their network engineering blog. Facebook is one of the few large web scale companies that is fairly open about their network architecture and designs and it gives many others the opportunity to see how a network can be scaled, even though the scale is well beyond what most will need in the foreseeable future, if not forever.

In the post, Alexey walks through some of the thought process behind the architecture, which is ultimately the most important part of any architecture and design. Too often we simply build whatever seems to be popular or common, or whatever is mandated or pushed by a specific vendor. The network, however, is a product, a deliverable, and has requirements like just about anything else we produce.

Facebook’s and the other web properties’ scale is at a different order of magnitude from most everyone else, but their requirements should sound pretty familiar to many:

  • Intra DC traffic is significantly higher than inter DC or DC to Internet traffic
    • “machine to machine traffic – is several orders of magnitude larger than what goes out to the Internet”
  • Build for growth, the network is not a static entity
    • “ability to move fast and support rapid growth is at the core of our infrastructure design philosophy”
  • Simple Design, easy to operate and maintain
    • “keep our networking infrastructure simple enough that small, highly efficient teams of engineers can manage it”
    • “Our goal is to make deploying and operating our networks easier and faster over time”

Anyone with a decent-sized datacenter infrastructure should recognize these same basic requirements in their own network needs.

With the requirements in hand (and a few more, I am sure), Facebook created clusters of racks with servers and supporting networking equipment and then built a hierarchy of network equipment on top. Each rack in a cluster contains a regular ToR switch with four 40GbE uplinks to the first spine layer. While not explicitly stated, these ToRs likely support 48 to 56 server-side 10GbE ports (this could be as high as 80 when using 96-port switches). That makes a rack somewhere between 3:1 and 5:1 oversubscribed to the fabric.
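The oversubscription range follows directly from the port counts. As a minimal sketch, assuming the estimated server port counts above (they are this post's guesses, not numbers Facebook published) and a hypothetical helper named `oversubscription`:

```python
def oversubscription(server_ports: int, server_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Ratio of server-facing bandwidth to fabric-facing bandwidth on a ToR."""
    downlink = server_ports * server_gbps
    uplink = uplinks * uplink_gbps
    return downlink / uplink


# Four 40GbE uplinks per ToR, as described in the post. The server port
# counts are estimates, not published figures.
for ports in (48, 56, 80):
    ratio = oversubscription(ports, 10, 4, 40)
    print(f"{ports} x 10GbE server ports: {ratio:.1f}:1")
```

With 160 Gb/s of uplink per rack, 48 server ports give 3:1 and 80 server ports give 5:1.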

From these ToR switches, each of the 40GbE uplinks is connected to a fabric switch. With 48 ToR switches in a cluster or pod, these fabric switches support 48x40GbE towards the ToR layer. As stated, these switches have the ability to support the same amount of bandwidth up to the next spine layer (I guess Facebook differentiates them in name by calling them fabric switches vs. spine switches, even though the fabric switches act as the spine for the ToR switches).

This means that each of these pod spine switches needs to support up to 96x40GbE, which makes them mid-sized modular switches with an internal fabric. You cannot make a switch of that size without having some form of internal fabric to connect multiple Ethernet ASICs to each other. With simplicity and ease of maintenance in mind, I am sure Facebook picked systems that have an internal Clos fabric built out of the same Ethernet ASICs used for the ToR switches. This also means there is not a very large amount of buffer memory available in the fabric and spine layers, contrary to what many believe is required (we are not among them). Similarly for latency: this is not a low-latency fabric by new standards, which may be fine for Facebook’s requirements. Server to server traffic between different server pods may take up to 11 Ethernet ASIC hops, some of which are not cut-through switching. This may add up to close to 10 microseconds.
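One way to arrive at the 96-port and 11-hop figures is sketched below. It assumes that a ToR is a single ASIC and that each modular fabric or spine switch is traversed as three ASIC hops (ingress line card, internal fabric, egress line card). Those per-device counts and the average per-hop latency are assumptions chosen to match the numbers in the text, not measured values.

```python
# ASIC hops per device type. The value 3 for modular switches assumes an
# internal 3-stage Clos: ingress line card -> fabric module -> egress line card.
ASIC_HOPS = {"tor": 1, "fabric": 3, "spine": 3}

# Worst-case path between servers in different pods.
INTER_POD_PATH = ["tor", "fabric", "spine", "fabric", "tor"]

# 48 x 40GbE down to the ToRs plus the same amount up to the spine layer.
FABRIC_SWITCH_PORTS = 48 + 48


def asic_hops(path):
    """Total number of Ethernet ASICs a packet crosses along the path."""
    return sum(ASIC_HOPS[device] for device in path)


def path_latency_us(path, per_hop_us):
    """Rough latency estimate using one average per-hop figure (hypothetical)."""
    return asic_hops(path) * per_hop_us


print("ports per fabric switch:", FABRIC_SWITCH_PORTS)
print("inter-pod ASIC hops:", asic_hops(INTER_POD_PATH))
# 0.9 microseconds per hop is an illustrative average, not a vendor number.
print(f"estimated latency: {path_latency_us(INTER_POD_PATH, 0.9):.1f} us")
```

Traffic that stays inside a pod skips the spine layer, so the same model gives a path of `["tor", "fabric", "tor"]` and 5 ASIC hops.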

The spine plane that connects each of the clusters together is created using the same switch as the cluster spine. It has the ability to scale to essentially a few hundred pods. And that’s big. Bigger than 99% of the rest of the world will need.

This design is very modular and can grow both inside a pod and by attaching more pods together with the fabric switches. The challenge, however, is that the cabling is not trivial unless you get to start fresh and lay out enough fiber for the maximum configuration. Facebook has the luxury of regularly building new datacenters. Most enterprises are adding to existing infrastructure, in existing buildings where recabling is not easy or cheap. Grow-as-you-go with this design only works if the cabling is provided for the maximum configuration. So while the network is designed for easy expansion and growth, the foundational physical infrastructure has to be planned and executed at maximum size.
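To get a feel for the cabling that has to be planned up front, here is a rough count of fiber runs. It assumes one fabric switch per ToR uplink (so four per pod) and 48 spine-facing ports per fabric switch. The function name and the pod counts are hypothetical, chosen only for illustration.

```python
def pod_link_counts(tors_per_pod=48, uplinks_per_tor=4, fabric_uplinks=48):
    """Return (ToR-to-fabric links, fabric-to-spine links) for one pod."""
    # Assumption: each ToR uplink lands on a different fabric switch.
    fabric_switches = uplinks_per_tor
    tor_to_fabric = tors_per_pod * uplinks_per_tor
    fabric_to_spine = fabric_switches * fabric_uplinks
    return tor_to_fabric, fabric_to_spine


def spine_fiber_runs(pods):
    """Fabric-to-spine links for a given number of pods."""
    _, fabric_to_spine = pod_link_counts()
    return pods * fabric_to_spine


local, to_spine = pod_link_counts()
print(f"per pod: {local} ToR-to-fabric links, {to_spine} fabric-to-spine links")
# Hypothetical deployment sizes: an initial build-out versus a planned maximum.
for pods in (4, 48):
    print(f"{pods} pods: {spine_fiber_runs(pods)} fabric-to-spine fiber runs")
```

The ToR-to-fabric links stay local to a pod, but the fabric-to-spine runs cross the building, and those are the ones that need to be provisioned for the largest pod count you ever expect to reach.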

Ultimately the Facebook design is a 3-tier hierarchical network, but the top 2 tiers act as a fabric for the ToR switches. Facebook decided to implement the fabric as its own spine and leaf network. Our solution to a similar set of requirements would be to build a Plexxi fabric connecting the ToR switches. ToR switches would connect to only a few Plexxi switches (for redundancy purposes), and the Plexxi switches would connect to each other to provide a fully programmable fabric. A Plexxi fabric extends by simply adding more switches with only local cabling.

Because all the switches use the same underlying ASIC technology, there is one common set of limitations to worry about. It is known exactly how large each of the required tables is, and those can be carefully engineered. The BGP engineering portion of the Facebook design is not insignificant. The ASICs used are limited in some of their table sizes, which means that IP address schemes need to be carefully designed, again with maximum size in mind.
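A small sketch of why the address plan matters for table sizes: if every pod gets one contiguous block and every rack a subnet inside it, the spine layer can carry one aggregate route per pod instead of one route per rack. The supernet, prefix lengths, and pod count below are hypothetical and are not taken from Facebook's design.

```python
import ipaddress


def pod_address_plan(supernet, pod_prefix_len, rack_prefix_len, pods, racks_per_pod):
    """Carve one aggregate block per pod, then one subnet per rack inside it."""
    root = ipaddress.ip_network(supernet)
    if pods > 2 ** (pod_prefix_len - root.prefixlen):
        raise ValueError("supernet too small for the requested number of pods")
    if racks_per_pod > 2 ** (rack_prefix_len - pod_prefix_len):
        raise ValueError("pod block too small for the requested number of racks")

    plan = {}
    for _, pod_block in zip(range(pods), root.subnets(new_prefix=pod_prefix_len)):
        rack_subnets = pod_block.subnets(new_prefix=rack_prefix_len)
        plan[pod_block] = [rack for _, rack in zip(range(racks_per_pod), rack_subnets)]
    return plan


# Hypothetical plan: a /18 per pod leaves room for 64 rack /24s, enough for 48 racks.
plan = pod_address_plan("10.0.0.0/8", 18, 24, pods=4, racks_per_pod=48)
rack_routes = sum(len(racks) for racks in plan.values())
print("routes at the spine without aggregation:", rack_routes)
print("routes at the spine with per-pod aggregates:", len(plan))
```

Sizing the per-pod block for the maximum rack count from day one is what keeps the aggregates valid as pods fill up.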

The network is engineered as a full L3 network; there is no L2 connectivity outside of a rack. For Facebook this works, as they own every piece of their application suite. Like it or not, there are many (legacy) enterprise applications and services that either require L2 connectivity or work more simply in an L2 environment.

I have not touched on a key aspect of the Facebook design: “distributed control with centralized override”. This Facebook variation of SDN has extremely similar foundational thoughts to how we at Plexxi approach the programmability of the network. That will be a blog post in and of itself.

I am sure many will take the Facebook design as the new way to design datacenter networks. But please apply your own scaling, extensibility and physical limitation requirements. There are some rather large luxuries a company like Facebook can afford that most others cannot.

The post Is the Facebook DC Architecture right for you? appeared first on Plexxi.

More Stories By Michael Bushong

The best marketing efforts leverage deep technology understanding with a highly approachable means of communicating. Plexxi's Vice President of Marketing Michael Bushong has acquired these skills having spent 12 years at Juniper Networks, where he led product management, product strategy and product marketing organizations for Juniper's flagship operating system, Junos. Michael spent the last several years at Juniper leading their SDN efforts across both service provider and enterprise markets. Prior to Juniper, Michael spent time at database supplier Sybase, and ASIC design tool companies Synopsys and Magma Design Automation. Michael's undergraduate work at the University of California Berkeley in advanced fluid mechanics and heat transfer lends new meaning to the marketing phrase "This isn't rocket science."