

Our recent Service Outage


We want to apologize for the service outage on Thursday, July 31, which started at 6:30PM UTC. We caused you a lot of trouble, and we are really sorry!

After digging into our logs, we reconstructed the series of events:

It started with poor database performance around 6:30PM UTC, which resulted in a growing backlog of events in our Sidekiq queues. As a result, we hit the memory limit of our Redis instance. This caused dropped jobs, since Sidekiq wasn’t able to enqueue more jobs.
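
The failure mode described above (a growing backlog filling Redis until nothing more can be enqueued) is the kind of condition a simple headroom check can flag early. The following is only a minimal sketch, not what we run: the function names and the 80% warning level are hypothetical, and it assumes the two byte counts are obtained elsewhere, for example from the `used_memory` field of Redis `INFO` and the configured `maxmemory` limit.

```python
from typing import Optional


def redis_memory_pressure(used_bytes: int, limit_bytes: int) -> Optional[float]:
    """Return the fraction of the memory limit in use, or None if no limit is set.

    A limit of 0 is treated as "no limit configured" (assumption for this sketch).
    """
    if limit_bytes <= 0:
        return None
    return used_bytes / limit_bytes


def should_alert(used_bytes: int, limit_bytes: int, warn_at: float = 0.8) -> bool:
    """True when memory use has reached the (hypothetical) warning level."""
    pressure = redis_memory_pressure(used_bytes, limit_bytes)
    return pressure is not None and pressure >= warn_at
```

Warning well below 100% matters here because, once the limit is reached, the jobs that would report the problem may be the ones being dropped.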

After we noticed that our Redis instance was completely full, we started discarding some jobs so that Sidekiq could connect new workers. This didn’t resolve the issue: Sidekiq picked up jobs but was still unable to process them, because they hung during their database operations. The majority of the jobs in the queue were log updates, which are INSERT statements into our log table. Browsing the query list in Postgres revealed a lot of hanging INSERT statements. We had to terminate all hanging queries before Postgres would accept new ones. This resolved the issue.
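
For readers who want to see what finding and terminating hanging queries can look like, here is a hedged sketch. It is not the exact procedure we followed. It assumes rows have already been fetched from Postgres's `pg_stat_activity` view with whatever client you use; the helper name `hanging_insert_pids` and the five-minute cutoff are hypothetical. `pg_terminate_backend` is a standard Postgres function, but it needs sufficient privileges and should be used with care.

```python
from datetime import datetime, timedelta, timezone

# Queries to run with your own Postgres client (not executed in this sketch).
FIND_ACTIVE_SQL = """
SELECT pid, state, query_start, query
FROM pg_stat_activity
WHERE state = 'active'
"""
TERMINATE_SQL = "SELECT pg_terminate_backend(%s)"


def hanging_insert_pids(rows, now, max_age=timedelta(minutes=5)):
    """Pick backend pids whose active INSERT has been running longer than max_age.

    rows: iterable of (pid, state, query_start, query) tuples, as returned
    by FIND_ACTIVE_SQL. The cutoff is an illustrative assumption.
    """
    pids = []
    for pid, state, query_start, query in rows:
        is_insert = (query or "").lstrip().upper().startswith("INSERT")
        if state == "active" and is_insert and now - query_start > max_age:
            pids.append(pid)
    return pids


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        (101, "active", now - timedelta(minutes=12), "INSERT INTO logs VALUES (1)"),
        (102, "active", now - timedelta(seconds=2), "INSERT INTO logs VALUES (2)"),
    ]
    for pid in hanging_insert_pids(sample, now):
        print("would run:", TERMINATE_SQL, (pid,))
```

Filtering by statement type and age is a narrower choice than terminating every running query; during the incident we terminated all of them.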

What caused the outage?

Several failures combined to make the outage as long as it was.

  1. Our monitoring and alerting failed. We use Librato to visualize key infrastructure metrics. Librato is also responsible for watching those metrics and alerting PagerDuty when a key metric exceeds its threshold. Due to a configuration error, we did not send all metrics to Librato and therefore received no PagerDuty alerts on our phones. About 40 minutes into the incident, NewRelic began to trigger PagerDuty as more key metrics exceeded their thresholds, so we only noticed the incident about 45 minutes after it began. Once the PagerDuty alerts from NewRelic arrived, we immediately started taking action. Since then, we have added alerts to Librato that trigger PagerDuty if key metrics are missing. We have also adjusted the thresholds to include lower as well as upper boundaries, so alerts fire when a metric exceeds or undercuts its range.

  2. Postgres couldn’t process INSERT/UPDATE statements. We push Heroku’s monitoring data into NewRelic and Librato, and nothing in that data looked odd. We are currently in contact with Heroku to get more data and figure out what happened under the hood.

  3. At first we thought the problem was Sidekiq being unable to connect to Redis, and we only discovered the Postgres issue after we had resolved the Redis memory issue. This misdiagnosis was a human error during debugging, and it prolonged the outage.
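
The alerting change from point 1 (alert on missing metrics, and on values below as well as above a range) can be sketched roughly as follows. This is an illustration under stated assumptions, not our Librato configuration: the `Threshold` class, the `evaluate` function, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    lower: float                # alert if the metric undercuts this
    upper: float                # alert if the metric exceeds this
    max_silence_seconds: float  # alert if no sample arrived for this long


def evaluate(threshold: Threshold,
             value: Optional[float],
             seconds_since_last_sample: float) -> Optional[str]:
    """Return an alert reason, or None if the metric looks healthy."""
    # A metric that stopped reporting is treated as an alert in its own right.
    if value is None or seconds_since_last_sample > threshold.max_silence_seconds:
        return "missing"
    if value > threshold.upper:
        return "above_upper"
    if value < threshold.lower:
        return "below_lower"
    return None


if __name__ == "__main__":
    jobs_per_minute = Threshold(lower=10, upper=5000, max_silence_seconds=120)
    print(evaluate(jobs_per_minute, None, 600))  # metric stopped arriving
    print(evaluate(jobs_per_minute, 2, 30))      # throughput collapsed
```

The lower bound and the silence check are the point of the design: a stalled system often reports unusually low numbers or nothing at all, which an upper-bound-only alert never catches.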

Again, we are sorry for causing issues on your side.

Ben from the Codeship

Events

  • Bad database performance (writes got stuck)
  • Log update queue gets filled up
  • Redis memory full
  • Free memory in Redis
  • Terminate all currently running database queries
  • Up and running again

Status Update Timeline

Resolved – Builds continue running fine. We’ll keep monitoring and write a post-mortem! Jul 31, 2014 – 9:15PM UTC

Monitoring – The builds are running fine again. We will keep monitoring. Jul 31, 2014 – 8:49PM UTC

Update – We’ve resolved our database issue. We are currently restarting our build infrastructure to resume work on the latest builds. Jul 31, 2014 – 8:40PM UTC

Identified – We’ve traced the issue to our database and are currently looking to fix the issue. Jul 31, 2014 – 8:21PM UTC

Update – We’re having memory issues with our queuing system and are working on a fix. Jul 31, 2014 – 7:46PM UTC

Investigating – We are currently seeing problems with builds not running on our test servers. We are investigating and will keep you up-to-date. Jul 31, 2014 – 7:39PM UTC


More Stories By Manuel Weiss

I am the cofounder of Codeship – a hosted Continuous Integration and Deployment platform for web applications. On the Codeship blog we love to write about Software Testing, Continuous Integration and Deployment. Also check out our weekly screencast series 'Testing Tuesday'!
