Blog Feed Post

3x in 3 years: Scaling an Engineering Organization

Three years ago I joined an up-and-coming startup called PagerDuty as an Engineering Manager. Having received Series A funding in 2013, the company was in hyper-growth mode and hiring aggressively across all areas. The Engineering team was less than 30 people at the time and a big draw for me was (and still is) learning about the structural challenges facing a rapid-growth organization. As we approach 100 Dutonians in Engineering, it seems appropriate to look back and reflect on the changes so far.

Evolution of organizational structures is a fascinating process — full of mistakes, dead ends, reinvented wheels, and hopefully, lessons learned. There’s plenty of literature covering engineering organizations, much of it boiling down to either abstract lists of pros and cons, or to blog posts that sing praises of whatever process the company happened to adopt at the time of writing. I want to take a different approach and examine the historical reasons that pushed our engineering team to continuously experiment with and evolve our structure, with all of the accompanying missteps and learnings.

Note: Some of the events and details have been simplified for the sake of the narrative — many of these are worthy of posts of their own. Bookmark this blog to keep an eye out for new content.

The Siloed Organization

https://www.pagerduty.com/wp-content/uploads/2017/07/siloed-org-300x200.png 300w, https://www.pagerduty.com/wp-content/uploads/2017/07/siloed-org-250x166.png 250w, https://www.pagerduty.com/wp-content/uploads/2017/07/siloed-org-180x120.png 180w" sizes="(max-width: 400px) 100vw, 400px" />

PagerDuty circa early 2014 looked something like this:

  • The Engineering organization was split between the San Francisco and Toronto offices.
  • The Operations Team was responsible for infrastructure automation, security, and persistence (SF only).
  • The Web Application Team was responsible for the customer-facing aspects of PagerDuty (SF only) and developed primarily with Ruby on Rails and MySQL.
  • The Backend Services Group was responsible for reliable data ingestion and notification delivery (split into co-located teams in SF and Toronto) and developed primarily with Scala and Cassandra.
  • There was only a partial adherence to DevOps: Operations Team went on-call for infrastructure monitoring alerts, other teams on-call for the code they deployed.

As Engineering grew from a small handful of people to several, distributed teams in a very short order, the product development process remained rather immature. Despite superficial trappings of agility with stand-ups and sprints, we were essentially using the Waterfall model. Leadership defined the projects to work on, the individual engineers to assign to those projects, and the delivery dates to target. Predictably, the dates were rarely hit and the project tracking spreadsheets were in a perpetual state of disrepair and, eventually, disuse.

Things didn’t look much better on the ground. Product managers kicked off new projects by drafting lengthy functional design specifications, doing their best to anticipate any possible questions that might arise — a consequence of Product and Engineering not actually interacting very much during development! The spec was presented to the Web Application and Backend Services teams, who then worked on the user-facing and backend aspects independently. New infrastructure requests had to be made to the Operations Team weeks in advance.

Integrating all of these separate efforts into a coherent feature release was a nightmare. We had missing or incomplete infrastructure, hot potato bugs, functionality gaps (each team thinks the other is taking care of it), lack of ownership or empowerment for both engineers and product folks, missed dates, and organizational silos. The desire to hit dates caused us to take fewer risks, and we became more conservative in our implementations as well as averse to adjusting the product specs.

This combination of departmental structure and development processes pushed the topic of siloing to the forefront of every other discussion within Engineering that year. And boy, did we have silos:

  • Web Application vs Backend Services. There was palpable tension between the two groups. Neither really understood what the other group was working on, and both were frustrated.
  • Ruby vs Scala. Similar lines of conflict to the above, with much bikeshedding and identity building around specific languages.
  • Operations vs Development Teams. Both sides were frustrated with the Ops Team being a bottleneck for all server provisioning. Operations also experienced a lot of pressure from looming deadlines.
  • San Francisco vs Toronto. With every team geographically co-located, a distinct “them vs us” vibe developed between the engineers in the two locations. Both sides found something to grumble about.

The rigidly defined sets of responsibilities did not leave teams a whole lot of room for cross-functional collaboration. We experimented briefly with a concept of “working groups” — small, temporary, diverse squads drawing folks from the existing teams with the intent of working on scoped, timeboxed, cross-cutting projects. These ended up introducing more chaos by destabilizing the primary teams, and so the experiment was abandoned.

Still, the need to collaborate and deliver with better consistency and predictability was of utmost importance. Everyone was sick of siloing, so that had to go. We had our work cut out for us.

The Matrix Organization

https://www.pagerduty.com/wp-content/uploads/2017/07/matrix-org-300x207.png 300w, https://www.pagerduty.com/wp-content/uploads/2017/07/matrix-org-250x172.png 250w, https://www.pagerduty.com/wp-content/uploads/2017/07/matrix-org-180x124.png 180w" sizes="(max-width: 414px) 100vw, 414px" />By early 2015, dissatisfaction with the state of things as well as rising interest in Agile methodologies reached critical mass and resulted in a departmental shake-up. Several new teams were formed that aligned around specific product directions. They followed the Scrum process and consisted of engineers from both Web Application and Backend Services teams, as well as product owners, Scrum Masters, and UX designers.

Given the less than ideal pace of progress made on the product in the prior year, the new teams were optimized for product delivery. To ensure that they could spend 100% of their time on cranking out new features, maintenance work was delegated to the Backend Services teams. These became known as “base teams” as they adopted Kanban methodology and took on ownership for all of the services in production, including all product on-call responsibilities. Furthermore, base teams fed into the product teams — engineers migrated from one to the other as product work ramped up.

These were obviously big changes. In an effort to minimize the impact of team shuffling on individual engineers (and to postpone having to deal with remote reports), the reporting relationships were not touched. This of course complicated everyone’s life tremendously, because now we had a dual reporting structure, aka the matrix organization! Many engineers ended up on teams that were not assigned to their direct supervisor and managers were now playing the roles of “people manager” (responsible for people, some of whom were on other teams) and “functional manager” (responsible for teams, where some engineers reported elsewhere).

The good news was that the old silos were mostly smashed, never to be seen again. Web Application Engineers working in close proximity with Backend Services Engineers were able to see past each other’s differences and move towards shared goals. Most teams were now geographically distributed as well, which worked wonders in bringing the two offices together.

The bad news was that a slew of new problems was introduced:

  • It was very difficult to rally a team around technical maintenance. Base teams struggled with forming long-term roadmaps and coming to terms with operational ownership of all production services.
  • The feeder model had a very real impact on team cohesion. Turns out that changing team composition over and over is not so good for morale.
  • Dual reporting structure created a ton of inefficiencies.There was a lack of visibility into the day-to-day activities of a direct report, additional communication overhead between people managers and functional managers, and general confusion around responsibilities.
  • We adopted Agile processes, rather than agility. Scrum certainly helped with overspecification, integration, and product owner involvement. However, our approach to software delivery did not change — feature releases were still big bang affairs that delivered questionable value.
  • More teams meant more load on the Operations folks. With frequent infrastructure requests and additional on-call responsibilities for each new service, clearly, this approach did not scale.

All things considered, we were still in a much better shape than a year ago. But there was still a long way to go.

The Agile Organization

https://www.pagerduty.com/wp-content/uploads/2017/07/agile-org-300x192.png 300w, https://www.pagerduty.com/wp-content/uploads/2017/07/agile-org-250x160.png 250w, https://www.pagerduty.com/wp-content/uploads/2017/07/agile-org-180x115.png 180w" sizes="(max-width: 469px) 100vw, 469px" />After living with the new organizational structure for a year, we accumulated quite a lot of learnings. Agile was good (but we did not do enough of it), DevOps was good (but we did not do enough of it), matrix management was not so good (and we’d had enough of it).

In early 2016 we restructured once again. “Vertical” product Scrum teams were now responsible for particular slices of the product. “Horizontal” teams were responsible for cross-cutting product or infrastructure concerns, and were tasked with setting best practices and enabling other teams to move fast. Product delivery teams owned their roadmap, requirement definition, implementation, deployment, and maintenance of their code and infrastructure (!) in production. We had adopted a true DevOps “you code it, you own it” approach.

How did this solve our previous concerns?

  • No more maintenance teams. Each team owned a portion of the product or infrastructure and was empowered to move quickly and innovate.
  • No more feeder teams. Headcount was opened for specific teams.
  • No more dual reporting! As an organization, we got a lot more comfortable with hiring and managing remote engineers. With physical distance no longer an impediment, we were able to assign managers to teams without any dotted line relationships.
  • We’re focused on getting stuff done. The mantra of GSD (Get Sh*t Done) has been permeating our collective consciousness, challenging the teams to introspect and shed the legacy of overspecification and over-engineering in order to grow into pragmatic, productive, agile delivery machines.
  • Self-service enabled growth. The Operations Team did a lot of work to empower the product delivery teams, including providing fancy ChatOps tools to self-serve for all sorts of infrastructure needs. This was key to embracing DevOps all the way and moving infrastructure alerts out of Ops and into the relevant teams owning the actual hosts (who could resolve issues faster anyways).

The subject of GSD is worth digging into a bit more. Through practice, we got more and more comfortable with the ideas of team autonomy, innovation over invention, and business bringing problems to solve rather than solutions to implement. It’s not easy to let go of the idea that we know what’s best for the customers. A laser focus on delivering the minimum viable product, and seeking immediate feedback and incorporating it in during the development cycle, was key. It allowed us to iterate quickly, maximize value, minimize gold plating, and shorten the time to get a prototype into the customer hands from months to days or weeks. In other words, we transformed from an organization that “follows agile processes” to an actual agile organization.

As the number of product delivery teams grew, an interesting phenomenon developed — we saw an emergence of the so-called “tribes” (hat tip to Spotify’s team model), i.e. groups of teams related to each other by common functionality or common mission. These arrangements have resulted in benefits such as shared ownership, shared knowledge, shared roadmap (separate from individual team backlogs), and shared vision. Tribal organization is something we are still experimenting with — stay tuned for future updates on our learnings with tribes.

We’ve been running with this structure for 16 months now and it just works. Certainly teams and associated ownership lines will evolve as the company continues to grow at a fast pace, and some details will change as we continue to invest in improving ourselves. At the same time, it’s clear that we have done enough of the wrong things to learn what the right thing should look like.

Lessons Learned

As I think back to some of the decisions that were made early on and some of the awkward intermediate states that our organization has endured, I’m tempted to smack myself in the head and ask why it didn’t occur to us to jump to the obviously superior end state. The reality, of course, is never that simple — those decisions were a function of our state, our priorities, our people, and our challenges at the time. You may have recognized some of your own challenges as well, in which case I hope you’re able to take something away from our experience.

If I had to do it all over again, these are the learnings I would bring with me:

  • Minimize dependencies between teams. Dependencies lead to blocking, bugs, misunderstandings and bad feelings. Empower teams to deliver without waiting on others and you will see major productivity improvements.
  • Minimize changes to team composition. Business realities dictate that occasionally resources need to be shifted from one place to another. Think long and hard before moving folks around, as it can seriously impact team morale and productivity.
  • Do not overly prescribe team ownership and responsibilities. Flexibility here leads to longer-term wins. Encourage teams to solve their own problems and you will have less problems with the previous two points.
  • Do not be afraid of continuous learning and experimentation. This applies to everything — code, process, organization. One does not get better by doing the same things over and over.
  • Agile processes are nice; agile culture is better. Standups and sprint reviews don’t bring a whole lot of value on their own. Focusing on minimum viable product, rapid feedback loops, and collaboration requires a cultural change, but it maximizes the customer value delivered.
  • Operational ownership of code should live with the teams. It is the best way to balance system reliability, code quality, and organizational scalability.
  • Cross-functional, full-stack teams are best for product delivery. This is very much in line with the dependency minimization point above. Each team should be able to go from requirements gathering to deployment without needing to involve outside experts or project handovers. Specialist teams do have a place, but more in the realm of the “horizontal” teams discussed earlier.
  • Hire generalists that can work in any environment. Teammates should not attach their sense of identity to specific tools, frameworks and stacks as technologies and technical directions change. The people best positioned to thrive in a high-growth environment are those that focus on learning and using the right tool for the job. Consequently, be ready to provide those learning and growth opportunities.
  • Avoid matrix management if you can help it. Consider if there are other ways to address the problems that dual reporting may appear to solve.
  • Embrace distributed teams. Building cohesive teams with remote members is not without challenges, but the benefits are many. Martin Fowler pointed out that, “by making a team remote, you can widen your range of people you can bring to the team. A remote team may be less productive than that same team if it were co-located, but may still be more productive than the best co-located team you can form.” In other words, allowing for remote members creates a much larger hiring pool, so you can build a stronger overall team.

As organizations grow and mature, there is a tendency to slow down, calcify, and become more conservative. Through practicing continuous improvement and evolving our practices, we’ve managed to buck the trend and grow more agile, pragmatic, and productive with time — and you can do the same.

Here’s to three (and many) more years of learning!

The post 3x in 3 years: Scaling an Engineering Organization appeared first on PagerDuty.

Read the original blog entry...

More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and Github.

Latest Stories
You know you need the cloud, but you’re hesitant to simply dump everything at Amazon since you know that not all workloads are suitable for cloud. You know that you want the kind of ease of use and scalability that you get with public cloud, but your applications are architected in a way that makes the public cloud a non-starter. You’re looking at private cloud solutions based on hyperconverged infrastructure, but you’re concerned with the limits inherent in those technologies.
As Marc Andreessen says software is eating the world. Everything is rapidly moving toward being software-defined – from our phones and cars through our washing machines to the datacenter. However, there are larger challenges when implementing software defined on a larger scale - when building software defined infrastructure. In his session at 16th Cloud Expo, Boyan Ivanov, CEO of StorPool, provided some practical insights on what, how and why when implementing "software-defined" in the datacent...
ChatOps is an emerging topic that has led to the wide availability of integrations between group chat and various other tools/platforms. Currently, HipChat is an extremely powerful collaboration platform due to the various ChatOps integrations that are available. However, DevOps automation can involve orchestration and complex workflows. In his session at @DevOpsSummit at 20th Cloud Expo, Himanshu Chhetri, CTO at Addteq, will cover practical examples and use cases such as self-provisioning infra...
Blockchain. A day doesn’t seem to go by without seeing articles and discussions about the technology. According to PwC executive Seamus Cushley, approximately $1.4B has been invested in blockchain just last year. In Gartner’s recent hype cycle for emerging technologies, blockchain is approaching the peak. It is considered by Gartner as one of the ‘Key platform-enabling technologies to track.’ While there is a lot of ‘hype vs reality’ discussions going on, there is no arguing that blockchain is b...
Leading companies, from the Global Fortune 500 to the smallest companies, are adopting hybrid cloud as the path to business advantage. Hybrid cloud depends on cloud services and on-premises infrastructure working in unison. Successful implementations require new levels of data mobility, enabled by an automated and seamless flow across on-premises and cloud resources. In his general session at 21st Cloud Expo, Greg Tevis, an IBM Storage Software Technical Strategist and Customer Solution Architec...
Nordstrom is transforming the way that they do business and the cloud is the key to enabling speed and hyper personalized customer experiences. In his session at 21st Cloud Expo, Ken Schow, VP of Engineering at Nordstrom, discussed some of the key learnings and common pitfalls of large enterprises moving to the cloud. This includes strategies around choosing a cloud provider(s), architecture, and lessons learned. In addition, he covered some of the best practices for structured team migration an...
The need for greater agility and scalability necessitated the digital transformation in the form of following equation: monolithic to microservices to serverless architecture (FaaS). To keep up with the cut-throat competition, the organisations need to update their technology stack to make software development their differentiating factor. Thus microservices architecture emerged as a potential method to provide development teams with greater flexibility and other advantages, such as the abili...
Product connectivity goes hand and hand these days with increased use of personal data. New IoT devices are becoming more personalized than ever before. In his session at 22nd Cloud Expo | DXWorld Expo, Nicolas Fierro, CEO of MIMIR Blockchain Solutions, will discuss how in order to protect your data and privacy, IoT applications need to embrace Blockchain technology for a new level of product security never before seen - or needed.
The use of containers by developers -- and now increasingly IT operators -- has grown from infatuation to deep and abiding love. But as with any long-term affair, the honeymoon soon leads to needing to live well together ... and maybe even getting some relationship help along the way. And so it goes with container orchestration and automation solutions, which are rapidly emerging as the means to maintain the bliss between rapid container adoption and broad container use among multiple cloud host...
Blockchain is a shared, secure record of exchange that establishes trust, accountability and transparency across business networks. Supported by the Linux Foundation's open source, open-standards based Hyperledger Project, Blockchain has the potential to improve regulatory compliance, reduce cost as well as advance trade. Are you curious about how Blockchain is built for business? In her session at 21st Cloud Expo, René Bostic, Technical VP of the IBM Cloud Unit in North America, discussed the b...
In his general session at 21st Cloud Expo, Greg Dumas, Calligo’s Vice President and G.M. of US operations, discussed the new Global Data Protection Regulation and how Calligo can help business stay compliant in digitally globalized world. Greg Dumas is Calligo's Vice President and G.M. of US operations. Calligo is an established service provider that provides an innovative platform for trusted cloud solutions. Calligo’s customers are typically most concerned about GDPR compliance, application p...
Imagine if you will, a retail floor so densely packed with sensors that they can pick up the movements of insects scurrying across a store aisle. Or a component of a piece of factory equipment so well-instrumented that its digital twin provides resolution down to the micrometer.
In his keynote at 18th Cloud Expo, Andrew Keys, Co-Founder of ConsenSys Enterprise, provided an overview of the evolution of the Internet and the Database and the future of their combination – the Blockchain. Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life settle...
The cloud era has reached the stage where it is no longer a question of whether a company should migrate, but when. Enterprises have embraced the outsourcing of where their various applications are stored and who manages them, saving significant investment along the way. Plus, the cloud has become a defining competitive edge. Companies that fail to successfully adapt risk failure. The media, of course, continues to extol the virtues of the cloud, including how easy it is to get there. Migrating...
No hype cycles or predictions of a gazillion things here. IoT is here. You get it. You know your business and have great ideas for a business transformation strategy. What comes next? Time to make it happen. In his session at @ThingsExpo, Jay Mason, an Associate Partner of Analytics, IoT & Cybersecurity at M&S Consulting, presented a step-by-step plan to develop your technology implementation strategy. He also discussed the evaluation of communication standards and IoT messaging protocols, data...