Welcome!

Blog Feed Post

Is the Facebook DC Architecture right for you?

A few weeks ago Facebook announced their new datacenter architecture in a post on their network engineering blog. Facebook is one of the few large web scale companies that is fairly open about their network architecture and designs and it gives many others the opportunity to see how a network can be scaled, even though the scale is well beyond what most will need in the foreseeable future, if not forever.

In the post, Alexey walks through some of the thought process behind the architecture, which is ultimately the most important part of any architecture and design. Too often we simply build whatever seems to be popular or common, or mandated/pushed by a specific vendor. The network however is a product, a deliverable, and has requirements like just about anything else we produce.

Facebook’s and the other web properties’ scale is at a different order of magnitude from most everyone else, but their requirements should sound pretty familiar to many:

  • Intra DC traffic is significantly higher than inter DC or DC to Internet traffic
    • “machine to machine traffic – is several orders of magnitude larger than what goes out to the Internet”
  • Build for growth, the network is not a static entity
    • “ability to move fast and support rapid growth is at the core of our infrastructure design philosophy”
  • Simple Design, easy to operate and maintain
    • “keep our networking infrastructure simple enough that small, highly efficient teams of engineers can manage it”
    • “Our goal is to make deploying and operating our networks easier and faster over time”

Anyone with a decent sized datacenter infrastructure should find these same basic requirements back in their own network needs.

With the requirements in hand (and a few more I am sure), Facebook created clusters of racks with servers and supporting networking equipment and then built a hierarchy of network equipment on top. Each rack in a cluster contains a regular ToR switch with 4 40GbE uplinks to the first spine layer. While not explicitly stated, these ToRs likely support 48 to 56 server side 10GbE ports (this could be as high as 80 when using 96 port switches). That makes a rack somewhere between 3:1 to 5:1 oversubscribed to the fabric.

From these ToR switches, each of these 40GbE is connected to a fabric switch. With 48 ToR switches in a cluster or pod, these fabric switches support 48x40GbE towards the ToR layer. As stated, these switches have the ability to support the same amount of bandwidth up to the next spine layer (I guess Facebook differentiates them in name by calling them fabric switches vs spine switches even though the fabric switches act as the spine for the ToR switches).

This means that each of these pod spine switches needs to support up to 96x40GbE, which makes these mid sized modular switches that have an internal fabric. You cannot make a switch of that size without having some form of internal fabric to connect multiple ethernet ASICs to each other. With simplicity and ease of maintenance in mind, I am sure Facebook picked systems that have an internal CLOS fabric built out of the same ethernet ASICs used for the ToR switches. This also means there is not a very large amount of buffer memory available in the fabric and spine layers, contrary to what many believe is required (we are not among them). Similarly for latency, this is not a low latency fabric by new standards, which may be fine for Facebook’s requirements. Server to server traffic between different server pods may take up to 11 ethernet ASIC hops, some of which are not cut through switching. This may add up to close to 10 microseconds.

The spine plane that connects each of the clusters together is created using the same switch as the cluster spine. It has the ability to scale to essentially a few hundred pods. And that’s big. Bigger than 99% of the rest of the world will need.

This design very modular and can grow inside of a pod and by attaching more pods together with the fabric switches. The challenge however is that the cabling is not trivial unless you get to start fresh and layout enough fiber for the maximum configuration. Facebook has the luxury to regularly build new datacenters, most enterprises are adding to existing infrastructures, in existing buildings where recabling is not easy or cheap. Grow as you go with this design only works if the cabling is provided for the maximum configuration. So while the network is designed for easy expansion and growth, the foundational physical infrastructure has to be planned and executed at maximum size.

Ultimately the Facebook design is a 3 tier hierarchical network, but the top 2 tiers act as a fabric for the ToR switches. Facebook decided to implement the fabric as its own spine and leaf network. Our solution to a similar set of requirements would build a Plexxi fabric connecting ToR switches. ToR switches would connect to only a few Plexxi switches (for redundancy purposes), the Plexxi switches connect to each other to provide a fully programmable fabric. A Plexxi fabric extends by simply adding more switches with only local cabling.

By using switches that all use the same underlying ASIC technology, there is a very common set of limitations to worry about. It is exactly known how large each of the required tables are and those can be carefully engineered. The BGP engineering portion of the Facebook design is not insignificant. The ASICs used are limited in some of their table sizes, which means that IP address schemes need to be carefully designed, again with maximum size in mind.

The network is engineered as a full L3 network, there is no L2 connectivity outside of a rack. For Facebook this works as they own every piece of their application suite. Like it or not, there are many (legacy) enterprise applications and services that either require L2 connectivity, or work simpler in an L2 environment.

I have not touched on a key aspect of the Facebook design: “distributed control with centralized override”. This Facebook variation of SDN has extremely similar foundational thoughts to how we at Plexxi approach the programmability of the network. That will be blog post in and by itself.

I am sure many will take the Facebook design as the new way to design datacenter networks. But please apply your own scaling, extensibility and physical limitation requirements. There are some rather large luxuries a company like Facebook can afford which most others can not.

The post Is the Facebook DC Architecture right for you? appeared first on Plexxi.

Read the original blog entry...

More Stories By Michael Bushong

The best marketing efforts leverage deep technology understanding with a highly-approachable means of communicating. Plexxi's Vice President of Marketing Michael Bushong has acquired these skills having spent 12 years at Juniper Networks where he led product management, product strategy and product marketing organizations for Juniper's flagship operating system, Junos. Michael spent the last several years at Juniper leading their SDN efforts across both service provider and enterprise markets. Prior to Juniper, Michael spent time at database supplier Sybase, and ASIC design tool companies Synopsis and Magma Design Automation. Michael's undergraduate work at the University of California Berkeley in advanced fluid mechanics and heat transfer lend new meaning to the marketing phrase "This isn't rocket science."

Latest Stories
In this presentation, you will learn first hand what works and what doesn't while architecting and deploying OpenStack. Some of the topics will include:- best practices for creating repeatable deployments of OpenStack- multi-site considerations- how to customize OpenStack to integrate with your existing systems and security best practices.
Your homes and cars can be automated and self-serviced. Why can't your storage? From simply asking questions to analyze and troubleshoot your infrastructure, to provisioning storage with snapshots, recovery and replication, your wildest sci-fi dream has come true. In his session at @DevOpsSummit at 20th Cloud Expo, Dan Florea, Director of Product Management at Tintri, provided a ChatOps demo where you can talk to your storage and manage it from anywhere, through Slack and similar services with...
Evan Kirstel is an internationally recognized thought leader and social media influencer in IoT (#1 in 2017), Cloud, Data Security (2016), Health Tech (#9 in 2017), Digital Health (#6 in 2016), B2B Marketing (#5 in 2015), AI, Smart Home, Digital (2017), IIoT (#1 in 2017) and Telecom/Wireless/5G. His connections are a "Who's Who" in these technologies, He is in the top 10 most mentioned/re-tweeted by CMOs and CIOs (2016) and have been recently named 5th most influential B2B marketeer in the US. H...
Gemini is Yahoo’s native and search advertising platform. To ensure the quality of a complex distributed system that spans multiple products and components and across various desktop websites and mobile app and web experiences – both Yahoo owned and operated and third-party syndication (supply), with complex interaction with more than a billion users and numerous advertisers globally (demand) – it becomes imperative to automate a set of end-to-end tests 24x7 to detect bugs and regression. In th...
"With Digital Experience Monitoring what used to be a simple visit to a web page has exploded into app on phones, data from social media feeds, competitive benchmarking - these are all components that are only available because of some type of digital asset," explained Leo Vasiliou, Director of Web Performance Engineering at Catchpoint Systems, in this SYS-CON.tv interview at DevOps Summit at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
"Venafi has a platform that allows you to manage, centralize and automate the complete life cycle of keys and certificates within the organization," explained Gina Osmond, Sr. Field Marketing Manager at Venafi, in this SYS-CON.tv interview at DevOps at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
Michael Maximilien, better known as max or Dr. Max, is a computer scientist with IBM. At IBM Research Triangle Park, he was a principal engineer for the worldwide industry point-of-sale standard: JavaPOS. At IBM Research, some highlights include pioneering research on semantic Web services, mashups, and cloud computing, and platform-as-a-service. He joined the IBM Cloud Labs in 2014 and works closely with Pivotal Inc., to help make the Cloud Found the best PaaS.
Creating replica copies to tolerate a certain number of failures is easy, but very expensive at cloud-scale. Conventional RAID has lower overhead, but it is limited in the number of failures it can tolerate. And the management is like herding cats (overseeing capacity, rebuilds, migrations, and degraded performance). In his general session at 18th Cloud Expo, Scott Cleland, Senior Director of Product Marketing for the HGST Cloud Infrastructure Business Unit, discussed how a new approach is neces...
"This week we're really focusing on scalability, asset preservation and how do you back up to the cloud and in the cloud with object storage, which is really a new way of attacking dealing with your file, your blocked data, where you put it and how you access it," stated Jeff Greenwald, Senior Director of Market Development at HGST, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
Cloud-enabled transformation has evolved from cost saving measure to business innovation strategy -- one that combines the cloud with cognitive capabilities to drive market disruption. Learn how you can achieve the insight and agility you need to gain a competitive advantage. Industry-acclaimed CTO and cloud expert, Shankar Kalyana presents. Only the most exceptional IBMers are appointed with the rare distinction of IBM Fellow, the highest technical honor in the company. Shankar has also receive...
"Evatronix provides design services to companies that need to integrate the IoT technology in their products but they don't necessarily have the expertise, knowledge and design team to do so," explained Adam Morawiec, VP of Business Development at Evatronix, in this SYS-CON.tv interview at @ThingsExpo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
Business professionals no longer wonder if they'll migrate to the cloud; it's now a matter of when. The cloud environment has proved to be a major force in transitioning to an agile business model that enables quick decisions and fast implementation that solidify customer relationships. And when the cloud is combined with the power of cognitive computing, it drives innovation and transformation that achieves astounding competitive advantage.
"We work around really protecting the confidentiality of information, and by doing so we've developed implementations of encryption through a patented process that is known as superencipherment," explained Richard Blech, CEO of Secure Channels Inc., in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
"I focus on what we are calling CAST Highlight, which is our SaaS application portfolio analysis tool. It is an extremely lightweight tool that can integrate with pretty much any build process right now," explained Andrew Siegmund, Application Migration Specialist for CAST, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
The Founder of NostaLab and a member of the Google Health Advisory Board, John is a unique combination of strategic thinker, marketer and entrepreneur. His career was built on the "science of advertising" combining strategy, creativity and marketing for industry-leading results. Combined with his ability to communicate complicated scientific concepts in a way that consumers and scientists alike can appreciate, John is a sought-after speaker for conferences on the forefront of healthcare science,...