Using Historical Incident Management Data to Plan for System Upgrades

Guest post.


For a freelance developer, inheriting projects is a necessary evil. Almost every project has legacy code that the team is afraid to touch, but when you inherit a project as a freelancer, more often than not, the entire codebase is “legacy.” While dealing with an unfamiliar codebase is tough, what can be even more difficult is getting that codebase running in a production environment.

Guessing Games

Last October, I inherited a project that drove me to near insanity. The source code itself was in shambles for sure, but what made the project such a nightmare was the lack of documentation and communication from the previous developers. As a result, I had to reverse engineer the application to get it running in the new production environment.

I was essentially playing a guessing game with the architecture. I had an idea of what type of resources I needed to provide, but without getting it in front of users, I really didn’t know what to expect. As I’m sure you can guess, this didn’t end well. Due to inefficient programming patterns, the site required four times the resources it should have in order to achieve some modicum of stability.

Luckily for me, one of the first things I did was integrate some incident management tools into the project. This let me identify specific pain points early and often, and fix them immediately, which in turn led to strategic resource and project upgrades that improved the project’s stability.

So, what exactly did I see?

While I felt like I was playing whack-a-mole with half a dozen issues at any given time, two cropped up infrequently enough that I would not have noticed their impact without incident management tools in place: database locking and memory issues. Both are relatively common, but they can be difficult to diagnose and solve.

Database Locking

After stabilizing the production site, one of the first things we noticed was that the site was crashing every hour, on the hour, for about 15 minutes each time. Thanks to the information provided by our incident management tools, I was able to narrow down the problem to an hourly cron job: a critical job was locking a primary database table every time it ran, effectively taking down the site until the process was done. With the cause identified, I refactored that script, which increased the site’s uptime and reduced user frustration.
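The post doesn’t say how the script was refactored. One common way to stop a scheduled job from holding a table lock for its whole run is to do the work in small batches and commit between them. The sketch below assumes SQLite as a stand-in database and a hypothetical `jobs` table with a `processed` flag; neither comes from the original project.

```python
import sqlite3


def process_in_batches(conn, batch_size=500):
    """Mark pending rows as processed in small transactions.

    Committing after each batch releases the write lock, so other
    queries can run between batches instead of waiting for the whole
    job to finish. Returns the number of rows updated.

    The `jobs` table and `processed` column are hypothetical.
    """
    total = 0
    while True:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "UPDATE jobs SET processed = 1 WHERE id IN ("
                "SELECT id FROM jobs WHERE processed = 0 LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:
            break  # nothing left to process
        total += cur.rowcount
    return total
```

The batch size is a trade-off: smaller batches hold locks for less time but add per-transaction overhead, so it is worth tuning against real data.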

Memory Issues

Memory leaks suck. In a complicated application, they can be incredibly difficult to track down, especially when they occur in a production environment. Unfortunately for me, this project was filled to the brim with them. Some are easy to spot and fix, such as when the logs show the Redis server running out of memory (insert more memory here), but others can be pretty elusive.

One common and seemingly random memory issue was timeouts. Occasionally, the site would start timing out for users after attempting to load for five minutes. While I knew from experience that this was likely caused by yet more inefficient database queries, narrowing down the exact queries was a bit of a challenge. Again, thanks to the incident management framework I’d put in place, I was able to identify a specific set of profile pages that were taking almost half an hour to retrieve data from the database. Because this process took too long, users kept reloading the page and restarting the whole process.
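As a rough illustration of that kind of narrowing down, the sketch below ranks pages by median response time. It assumes the request data can be exported as `(route, duration_seconds)` pairs; that shape, the function name, and the sample routes are all made up for the example.

```python
from collections import defaultdict
from statistics import median


def slowest_routes(records, top=3):
    """Return (route, median_seconds, samples) tuples, slowest first.

    `records` is an iterable of (route, duration_seconds) pairs, a
    hypothetical shape for exported request or incident data.
    """
    by_route = defaultdict(list)
    for route, seconds in records:
        by_route[route].append(seconds)
    ranked = sorted(
        ((route, median(times), len(times)) for route, times in by_route.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:top]


# Made-up sample data, for illustration only.
sample = [
    ("/profile/42", 1700.0),
    ("/profile/42", 1500.0),
    ("/home", 0.4),
    ("/home", 0.6),
    ("/search", 2.0),
]
worst = slowest_routes(sample, top=2)
```

Using the median rather than the mean keeps one freak request from pushing an otherwise healthy page to the top of the list.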

The first thing I was able to do was identify exactly how long users were waiting before they reloaded the page or gave up (about 1 minute). Then, I changed both the web and database server configurations to terminate any request or query still running after 1 minute. This gave me some breathing room, so those pages didn’t crash the rest of the site.
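A cutoff like that can be taken from data rather than guessed. The sketch below is one way to estimate it, assuming each logged request records how long the user waited and whether a response was ever delivered; the record shape and function name are hypothetical.

```python
from statistics import median


def typical_abandon_wait(requests):
    """Median seconds waited on requests the user abandoned.

    `requests` is an iterable of (seconds_waited, completed) pairs;
    `completed` is False when the client reloaded or disconnected
    before a response was sent. Returns None if nothing was abandoned.
    """
    waits = [seconds for seconds, completed in requests if not completed]
    return median(waits) if waits else None
```

The result is only a starting point for a server-side timeout; the timeout itself is still set in the web and database server configuration.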

Then, I had to identify the exact queries that were causing the problems. Unfortunately, these particular pages were pretty query-heavy, but after referencing the logs I was able to narrow it down to one particular line that was querying over 1GB of data from the database server without caching the result. From here, the next steps were to refactor the query, cache the result for an appropriate time, and get the fix out to users as soon as possible.
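The post doesn’t describe the caching layer that was used. As a minimal sketch of caching a result for an appropriate time, the class below keeps a value in memory until a time-to-live expires. `TTLCache`, `get_or_compute`, and the injectable `clock` parameter are names invented for this example, and a shared cache such as Redis would be the more likely choice on a multi-server site.

```python
import time


class TTLCache:
    """Cache one expensive result per key for `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for `key`, calling `compute()` on a miss."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value
```

A call might look like `cache.get_or_compute("profile:42", load_profile)`, where `load_profile` is whatever function runs the refactored query. The right TTL depends on how stale the page is allowed to be.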

These are just a few examples of the problems I was able to solve thanks to my historical incident management data. If I hadn’t implemented the toolset early on, I would probably still be playing guess-and-check with various solutions. Don’t get me wrong, though: the same incident management tools can also be used to plan upgrades for a well-architected application. Identifying the circumstances where your servers overload or things start slowing down is crucial to scaling your project to accommodate growth.
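One simple way to start looking for those circumstances is to bucket past incidents by hour of day and see where they cluster. The sketch below assumes incident start times are available as ISO 8601 strings; the function name and that input shape are assumptions, not part of any particular tool.

```python
from collections import Counter
from datetime import datetime


def incidents_by_hour(timestamps):
    """Count incidents per hour of day (0-23) from ISO 8601 timestamps."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
```

Calling `incidents_by_hour(timestamps).most_common(3)` lists the three busiest hours, which is a reasonable first place to look when deciding where extra capacity is needed.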

Learn more about how you can visualize patterns across all your systems data for improved incident management by checking out the PagerDuty Operations Command Console.

 

The post Using Historical Incident Management Data to Plan for System Upgrades appeared first on PagerDuty.

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and GitHub.
