Welcome!

Blog Feed Post

3x in 3 years: Scaling an Engineering Organization

Three years ago I joined an up-and-coming startup called PagerDuty as an Engineering Manager. Having received Series A funding in 2013, the company was in hyper-growth mode and hiring aggressively across all areas. The Engineering team was less than 30 people at the time and a big draw for me was (and still is) learning about the structural challenges facing a rapid-growth organization. As we approach 100 Dutonians in Engineering, it seems appropriate to look back and reflect on the changes so far.

Evolution of organizational structures is a fascinating process — full of mistakes, dead ends, reinvented wheels, and hopefully, lessons learned. There’s plenty of literature covering engineering organizations, much of it boiling down to either abstract lists of pros and cons, or to blog posts that sing praises of whatever process the company happened to adopt at the time of writing. I want to take a different approach and examine the historical reasons that pushed our engineering team to continuously experiment with and evolve our structure, with all of the accompanying missteps and learnings.

Note: Some of the events and details have been simplified for the sake of the narrative — many of these are worthy of posts of their own. Bookmark this blog to keep an eye out for new content.

The Siloed Organization

https://www.pagerduty.com/wp-content/uploads/2017/07/siloed-org-300x200.png 300w, https://www.pagerduty.com/wp-content/uploads/2017/07/siloed-org-250x166.png 250w, https://www.pagerduty.com/wp-content/uploads/2017/07/siloed-org-180x120.png 180w" sizes="(max-width: 400px) 100vw, 400px" />

PagerDuty circa early 2014 looked something like this:

  • The Engineering organization was split between the San Francisco and Toronto offices.
  • The Operations Team was responsible for infrastructure automation, security, and persistence (SF only).
  • The Web Application Team was responsible for the customer-facing aspects of PagerDuty (SF only) and developed primarily with Ruby on Rails and MySQL.
  • The Backend Services Group was responsible for reliable data ingestion and notification delivery (split into co-located teams in SF and Toronto) and developed primarily with Scala and Cassandra.
  • There was only a partial adherence to DevOps: Operations Team went on-call for infrastructure monitoring alerts, other teams on-call for the code they deployed.

As Engineering grew from a small handful of people to several, distributed teams in a very short order, the product development process remained rather immature. Despite superficial trappings of agility with stand-ups and sprints, we were essentially using the Waterfall model. Leadership defined the projects to work on, the individual engineers to assign to those projects, and the delivery dates to target. Predictably, the dates were rarely hit and the project tracking spreadsheets were in a perpetual state of disrepair and, eventually, disuse.

Things didn’t look much better on the ground. Product managers kicked off new projects by drafting lengthy functional design specifications, doing their best to anticipate any possible questions that might arise — a consequence of Product and Engineering not actually interacting very much during development! The spec was presented to the Web Application and Backend Services teams, who then worked on the user-facing and backend aspects independently. New infrastructure requests had to be made to the Operations Team weeks in advance.

Integrating all of these separate efforts into a coherent feature release was a nightmare. We had missing or incomplete infrastructure, hot potato bugs, functionality gaps (each team thinks the other is taking care of it), lack of ownership or empowerment for both engineers and product folks, missed dates, and organizational silos. The desire to hit dates caused us to take fewer risks, and we became more conservative in our implementations as well as averse to adjusting the product specs.

This combination of departmental structure and development processes pushed the topic of siloing to the forefront of every other discussion within Engineering that year. And boy, did we have silos:

  • Web Application vs Backend Services. There was palpable tension between the two groups. Neither really understood what the other group was working on, and both were frustrated.
  • Ruby vs Scala. Similar lines of conflict to the above, with much bikeshedding and identity building around specific languages.
  • Operations vs Development Teams. Both sides were frustrated with the Ops Team being a bottleneck for all server provisioning. Operations also experienced a lot of pressure from looming deadlines.
  • San Francisco vs Toronto. With every team geographically co-located, a distinct “them vs us” vibe developed between the engineers in the two locations. Both sides found something to grumble about.

The rigidly defined sets of responsibilities did not leave teams a whole lot of room for cross-functional collaboration. We experimented briefly with a concept of “working groups” — small, temporary, diverse squads drawing folks from the existing teams with the intent of working on scoped, timeboxed, cross-cutting projects. These ended up introducing more chaos by destabilizing the primary teams, and so the experiment was abandoned.

Still, the need to collaborate and deliver with better consistency and predictability was of utmost importance. Everyone was sick of siloing, so that had to go. We had our work cut out for us.

The Matrix Organization

https://www.pagerduty.com/wp-content/uploads/2017/07/matrix-org-300x207.png 300w, https://www.pagerduty.com/wp-content/uploads/2017/07/matrix-org-250x172.png 250w, https://www.pagerduty.com/wp-content/uploads/2017/07/matrix-org-180x124.png 180w" sizes="(max-width: 414px) 100vw, 414px" />By early 2015, dissatisfaction with the state of things as well as rising interest in Agile methodologies reached critical mass and resulted in a departmental shake-up. Several new teams were formed that aligned around specific product directions. They followed the Scrum process and consisted of engineers from both Web Application and Backend Services teams, as well as product owners, Scrum Masters, and UX designers.

Given the less than ideal pace of progress made on the product in the prior year, the new teams were optimized for product delivery. To ensure that they could spend 100% of their time on cranking out new features, maintenance work was delegated to the Backend Services teams. These became known as “base teams” as they adopted Kanban methodology and took on ownership for all of the services in production, including all product on-call responsibilities. Furthermore, base teams fed into the product teams — engineers migrated from one to the other as product work ramped up.

These were obviously big changes. In an effort to minimize the impact of team shuffling on individual engineers (and to postpone having to deal with remote reports), the reporting relationships were not touched. This of course complicated everyone’s life tremendously, because now we had a dual reporting structure, aka the matrix organization! Many engineers ended up on teams that were not assigned to their direct supervisor and managers were now playing the roles of “people manager” (responsible for people, some of whom were on other teams) and “functional manager” (responsible for teams, where some engineers reported elsewhere).

The good news was that the old silos were mostly smashed, never to be seen again. Web Application Engineers working in close proximity with Backend Services Engineers were able to see past each other’s differences and move towards shared goals. Most teams were now geographically distributed as well, which worked wonders in bringing the two offices together.

The bad news was that a slew of new problems was introduced:

  • It was very difficult to rally a team around technical maintenance. Base teams struggled with forming long-term roadmaps and coming to terms with operational ownership of all production services.
  • The feeder model had a very real impact on team cohesion. Turns out that changing team composition over and over is not so good for morale.
  • Dual reporting structure created a ton of inefficiencies.There was a lack of visibility into the day-to-day activities of a direct report, additional communication overhead between people managers and functional managers, and general confusion around responsibilities.
  • We adopted Agile processes, rather than agility. Scrum certainly helped with overspecification, integration, and product owner involvement. However, our approach to software delivery did not change — feature releases were still big bang affairs that delivered questionable value.
  • More teams meant more load on the Operations folks. With frequent infrastructure requests and additional on-call responsibilities for each new service, clearly, this approach did not scale.

All things considered, we were still in a much better shape than a year ago. But there was still a long way to go.

The Agile Organization

https://www.pagerduty.com/wp-content/uploads/2017/07/agile-org-300x192.png 300w, https://www.pagerduty.com/wp-content/uploads/2017/07/agile-org-250x160.png 250w, https://www.pagerduty.com/wp-content/uploads/2017/07/agile-org-180x115.png 180w" sizes="(max-width: 469px) 100vw, 469px" />After living with the new organizational structure for a year, we accumulated quite a lot of learnings. Agile was good (but we did not do enough of it), DevOps was good (but we did not do enough of it), matrix management was not so good (and we’d had enough of it).

In early 2016 we restructured once again. “Vertical” product Scrum teams were now responsible for particular slices of the product. “Horizontal” teams were responsible for cross-cutting product or infrastructure concerns, and were tasked with setting best practices and enabling other teams to move fast. Product delivery teams owned their roadmap, requirement definition, implementation, deployment, and maintenance of their code and infrastructure (!) in production. We had adopted a true DevOps “you code it, you own it” approach.

How did this solve our previous concerns?

  • No more maintenance teams. Each team owned a portion of the product or infrastructure and was empowered to move quickly and innovate.
  • No more feeder teams. Headcount was opened for specific teams.
  • No more dual reporting! As an organization, we got a lot more comfortable with hiring and managing remote engineers. With physical distance no longer an impediment, we were able to assign managers to teams without any dotted line relationships.
  • We’re focused on getting stuff done. The mantra of GSD (Get Sh*t Done) has been permeating our collective consciousness, challenging the teams to introspect and shed the legacy of overspecification and over-engineering in order to grow into pragmatic, productive, agile delivery machines.
  • Self-service enabled growth. The Operations Team did a lot of work to empower the product delivery teams, including providing fancy ChatOps tools to self-serve for all sorts of infrastructure needs. This was key to embracing DevOps all the way and moving infrastructure alerts out of Ops and into the relevant teams owning the actual hosts (who could resolve issues faster anyways).

The subject of GSD is worth digging into a bit more. Through practice, we got more and more comfortable with the ideas of team autonomy, innovation over invention, and business bringing problems to solve rather than solutions to implement. It’s not easy to let go of the idea that we know what’s best for the customers. A laser focus on delivering the minimum viable product, and seeking immediate feedback and incorporating it in during the development cycle, was key. It allowed us to iterate quickly, maximize value, minimize gold plating, and shorten the time to get a prototype into the customer hands from months to days or weeks. In other words, we transformed from an organization that “follows agile processes” to an actual agile organization.

As the number of product delivery teams grew, an interesting phenomenon developed — we saw an emergence of the so-called “tribes” (hat tip to Spotify’s team model), i.e. groups of teams related to each other by common functionality or common mission. These arrangements have resulted in benefits such as shared ownership, shared knowledge, shared roadmap (separate from individual team backlogs), and shared vision. Tribal organization is something we are still experimenting with — stay tuned for future updates on our learnings with tribes.

We’ve been running with this structure for 16 months now and it just works. Certainly teams and associated ownership lines will evolve as the company continues to grow at a fast pace, and some details will change as we continue to invest in improving ourselves. At the same time, it’s clear that we have done enough of the wrong things to learn what the right thing should look like.

Lessons Learned

As I think back to some of the decisions that were made early on and some of the awkward intermediate states that our organization has endured, I’m tempted to smack myself in the head and ask why it didn’t occur to us to jump to the obviously superior end state. The reality, of course, is never that simple — those decisions were a function of our state, our priorities, our people, and our challenges at the time. You may have recognized some of your own challenges as well, in which case I hope you’re able to take something away from our experience.

If I had to do it all over again, these are the learnings I would bring with me:

  • Minimize dependencies between teams. Dependencies lead to blocking, bugs, misunderstandings and bad feelings. Empower teams to deliver without waiting on others and you will see major productivity improvements.
  • Minimize changes to team composition. Business realities dictate that occasionally resources need to be shifted from one place to another. Think long and hard before moving folks around, as it can seriously impact team morale and productivity.
  • Do not overly prescribe team ownership and responsibilities. Flexibility here leads to longer-term wins. Encourage teams to solve their own problems and you will have less problems with the previous two points.
  • Do not be afraid of continuous learning and experimentation. This applies to everything — code, process, organization. One does not get better by doing the same things over and over.
  • Agile processes are nice; agile culture is better. Standups and sprint reviews don’t bring a whole lot of value on their own. Focusing on minimum viable product, rapid feedback loops, and collaboration requires a cultural change, but it maximizes the customer value delivered.
  • Operational ownership of code should live with the teams. It is the best way to balance system reliability, code quality, and organizational scalability.
  • Cross-functional, full-stack teams are best for product delivery. This is very much in line with the dependency minimization point above. Each team should be able to go from requirements gathering to deployment without needing to involve outside experts or project handovers. Specialist teams do have a place, but more in the realm of the “horizontal” teams discussed earlier.
  • Hire generalists that can work in any environment. Teammates should not attach their sense of identity to specific tools, frameworks and stacks as technologies and technical directions change. The people best positioned to thrive in a high-growth environment are those that focus on learning and using the right tool for the job. Consequently, be ready to provide those learning and growth opportunities.
  • Avoid matrix management if you can help it. Consider if there are other ways to address the problems that dual reporting may appear to solve.
  • Embrace distributed teams. Building cohesive teams with remote members is not without challenges, but the benefits are many. Martin Fowler pointed out that, “by making a team remote, you can widen your range of people you can bring to the team. A remote team may be less productive than that same team if it were co-located, but may still be more productive than the best co-located team you can form.” In other words, allowing for remote members creates a much larger hiring pool, so you can build a stronger overall team.

As organizations grow and mature, there is a tendency to slow down, calcify, and become more conservative. Through practicing continuous improvement and evolving our practices, we’ve managed to buck the trend and grow more agile, pragmatic, and productive with time — and you can do the same.

Here’s to three (and many) more years of learning!

The post 3x in 3 years: Scaling an Engineering Organization appeared first on PagerDuty.

Read the original blog entry...

More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and Github.

Latest Stories
With tough new regulations coming to Europe on data privacy in May 2018, Calligo will explain why in reality the effect is global and transforms how you consider critical data. EU GDPR fundamentally rewrites the rules for cloud, Big Data and IoT. In his session at 21st Cloud Expo, Adam Ryan, Vice President and General Manager EMEA at Calligo, will examine the regulations and provide insight on how it affects technology, challenges the established rules and will usher in new levels of diligence a...
Existing Big Data solutions are mainly focused on the discovery and analysis of data. The solutions are scalable and highly available but tedious when swapping in and swapping out occurs in disarray and thrashing takes place. The resolution for thrashing through machine learning algorithms and support nomenclature is through simple techniques. Organizations that have been collecting large customer data are increasingly seeing the need to use the data for swapping in and out and thrashing occurs ...
yperConvergence came to market with the objective of being simple, flexible and to help drive down operating expenses. It reduced the footprint by bundling the compute/storage/network into one box. This brought a new set of challenges as the HyperConverged vendors are very focused on their own proprietary building blocks. If you want to scale in a certain way, let’s say you identified a need for more storage and want to add a device that is not sold by the HyperConverged vendor, forget about it....
As many know, the first generation of Cloud Management Platform (CMP) solutions were designed for managing virtual infrastructure (IaaS) and traditional applications. But that’s no longer enough to satisfy evolving and complex business requirements. In his session at 21st Cloud Expo, Scott Davis, Embotics CTO, will explore how next-generation CMPs ensure organizations can manage cloud-native and microservice-based application architectures, while also facilitating agile DevOps methodology. He wi...
When you focus on a journey from up-close, you look at your own technical and cultural history and how you changed it for the benefit of the customer. This was our starting point: too many integration issues, 13 SWP days and very long cycles. It was evident that in this fast-paced industry we could no longer afford this reality. We needed something that would take us beyond reducing the development lifecycles, CI and Agile methodologies. We made a fundamental difference, even changed our culture...
In the enterprise today, connected IoT devices are everywhere – both inside and outside corporate environments. The need to identify, manage, control and secure a quickly growing web of connections and outside devices is making the already challenging task of security even more important, and onerous. In his session at @ThingsExpo, Rich Boyer, CISO and Chief Architect for Security at NTT i3, discussed new ways of thinking and the approaches needed to address the emerging challenges of security i...
Docker containers have brought great opportunities to shorten the deployment process through continuous integration and the delivery of applications and microservices. This applies equally to enterprise data centers as well as the cloud. In his session at 20th Cloud Expo, Jari Kolehmainen, founder and CTO of Kontena, discussed solutions and benefits of a deeply integrated deployment pipeline using technologies such as container management platforms, Docker containers, and the drone.io Cl tool. H...
Cloud adoption is often driven by a desire to increase efficiency, boost agility and save money. All too often, however, the reality involves unpredictable cost spikes and lack of oversight due to resource limitations. In his session at 20th Cloud Expo, Joe Kinsella, CTO and Founder of CloudHealth Technologies, tackled the question: “How do you build a fully optimized cloud?” He will examine: Why TCO is critical to achieving cloud success – and why attendees should be thinking holistically ab...
The question before companies today is not whether to become intelligent, it’s a question of how and how fast. The key is to adopt and deploy an intelligent application strategy while simultaneously preparing to scale that intelligence. In her session at 21st Cloud Expo, Sangeeta Chakraborty, Chief Customer Officer at Ayasdi, will provide a tactical framework to become a truly intelligent enterprise, including how to identify the right applications for AI, how to build a Center of Excellence to...
Blockchain is a shared, secure record of exchange that establishes trust, accountability and transparency across business networks. Supported by the Linux Foundation's open source, open-standards based Hyperledger Project, Blockchain has the potential to improve regulatory compliance, reduce cost as well as advance trade. Are you curious about how Blockchain is built for business? In her session at 21st Cloud Expo, René Bostic, Technical VP of the IBM Cloud Unit in North America, will discuss th...
SYS-CON Events announced today that Datera, that offers a radically new data management architecture, has been named "Exhibitor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Datera is transforming the traditional datacenter model through modern cloud simplicity. The technology industry is at another major inflection point. The rise of mobile, the Internet of Things, data storage and Big...
An increasing number of companies are creating products that combine data with analytical capabilities. Running interactive queries on Big Data requires complex architectures to store and query data effectively, typically involving data streams, an choosing efficient file format/database and multiple independent systems that are tied together through custom-engineered pipelines. In his session at @BigDataExpo at @ThingsExpo, Tomer Levi, a senior software engineer at Intel’s Advanced Analytics ...
SYS-CON Events announced today that Datera will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Datera offers a radically new approach to data management, where innovative software makes data infrastructure invisible, elastic and able to perform at the highest level. It eliminates hardware lock-in and gives IT organizations the choice to source x86 server nodes, with business model option...
"Cloud computing is certainly changing how people consume storage, how they use it, and what they use it for. It's also making people rethink how they architect their environment," stated Brad Winett, Senior Technologist for DDN Storage, in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
With 10 simultaneous tracks, keynotes, general sessions and targeted breakout classes, Cloud Expo and @ThingsExpo are two of the most important technology events of the year. Since its launch over eight years ago, Cloud Expo and @ThingsExpo have presented a rock star faculty as well as showcased hundreds of sponsors and exhibitors! In this blog post, I provide 7 tips on how, as part of our world-class faculty, you can deliver one of the most popular sessions at our events. But before reading the...