Blog Feed Post

From Scala Unified Logging to Full System ObservabilityPart 3 of 3: Metrics and the Bigger Picture

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is Part 3 in a series on system visibility, the Detection and Analysis part of the Incident Management Lifecycle. If you missed them, read Part 1 and Part 2 first.

Ok, logging improvements: great for our team and codebase(!!) but there’s a bigger world out there for monitoring and instrumentation. We’re talking about observability here. It’s a world where we not only have the necessary information for troubleshooting and debugging, but also a detailed understanding of how the running system is performing–with alerts for when things go wrong.

Enter metrics. Metrics help form a well-rounded approach to the monitoring and instrumentation of your systems. They focus on quantitative measurements of success, responsiveness, demand, saturation, etc. If log statements are for debugging your running system, then metrics are for picturing how the system is running and indicating when the system needs attention. Metrics are the system’s heartbeat.

An analogy

When I think of instrumentation, I think of a person (the “system”) with advanced hypertension, undergoing angioplasty to open several clogged blood vessels. After a stressful, harrowing experience in critical care, the team performs the surgery. Fortunately it’s successful, but the patient is weak and has a few weeks’ recovery ahead.

But what if preventative health visits and monitoring (system metrics) had detected his high blood pressure earlier? And what if those earlier results were accompanied by a new diet, exercise, and medication plan? Maybe this patient wouldn’t have needed risky, reactive surgery at all.

Obviously this isn’t a direct correlation but it’s a useful analogy to thinking about the visibility that we have into our own running systems. Do we know the request latency and error rate? The demand for and saturation of given components?

The case for metrics

Now that we’re completing the big picture of monitoring and instrumentation using multiple methods, let’s examine how metrics complement a good log portfolio.

Logs are strings. Strings need to be interpreted. Interpretation typically requires humans. Humans are slow to respond.

Metrics are numbers. Numbers are immediately actionable. Actions can even be taken automatically.

Metrics have metadata. Metadata describes multiple dimensions. Dimensionality exposes additional trends, outliers, and correlations.

How does VictorOps approach metrics?

We are still iterating on our metrics infrastructure and portfolio and through that process we are identifying more KPIs (Key Performance Indicators) for our systems. For quite a while we’ve employed black-box style metrics, a.k.a show me what the customer is experiencing.

Now we’re trying to answer deeper questions, like: “Now that I know that error rate is too high, which Service Level Indicators (SLIs) would help paint a more complete picture of my problem?” Our next step is to explore the white-box SLIs that answer these types of internal system questions. Overall, we want a mixture of black and white-box metrics where the former alert off of customer experience issues and the latter point us to a subset of the system for troubleshooting and auto-remediation efforts.

How do we use metrics?

The goal is to auto-remediate issues that can be safely addressed without human involvement and alert on all remaining issues. So, when instrumentation points to a known problem, that problem is auto-remediated by some script or application and people don’t need to get involved in resolving the problem. This gives us response times well within seconds, and no human could both respond to and remediate the issue in a comparable timespan. For the remainder of scenarios, where human intervention is truly necessary, we want to alert on the problem as early as possible and, ideally, even before it becomes a problem.

Where do dashboards fit?

Alerts and auto-remediation are great, but it’s also helpful to have a visualization of your running system. Well built and intuitive dashboards are great tools for a first responder to troubleshoot and pinpoint a problem by scanning for anomalies. Specifically, we’re talking about dashboards that make anomalies easy to identify without needing a full set of tribal knowledge or a PhD. So, please, be careful what you do with all of that valuable insight into your system. Not all of your metrics must be associated with an alert (or you’ll soon end up with alert fatigue from too much noise). You should aim for intelligent alert thresholds and include them in your dashboards.

Be careful however, as dashboards require a human’s attention in order to be useful…just like logs (at least ones without alerts). It’s both a waste of your time and highly error prone to assume that each person or shift staring at that dashboard will know how to accurately interpret each panel and every time series displayed. This means that care must be taken to not only measure but clearly visualize and indicate the value of your metrics – both via dashboards and alerts.

Don’t just measure ALL the things

Ideally, we want to build out preventative health care that’s not just packed full of instrumentation, but is full of useful indicators of system health–the KPIs. If a metric isn’t used by the team through alerts or dashboards, it’s likely not useful. If a metric doesn’t make it to a dashboard or alert for visibility sake, it will likely become a metric that you have to clean up later. These metrics should be removed, and we should be cognizant (but not fearful) of new metrics falling into the same bucket.

Craving more?

Want to kick it up a few notches? Look into systems that can perform anomaly detection for your teams. This allows you to focus on more complex problems, build auto-remediation, fix newly identified issues, etc.

Want to move into uncharted territory? Give sonification a try, or at least check out this cool podcast from Science Friday: Listening in on scientific data. “When it comes to analyzing scientific data, there are the old standbys, plots and graphs. But what if instead of poring over visuals, scientists could listen to their data—and make new discoveries with their ears?”

Although we won’t dive into the topics of distributed tracing and eventing, it’s important to mention how they can add additional viewports into your system. Tracing breaks down the request path through your system by describing both the path and the timing at each stage. Eventing allows you to perform offline analysis of events that occurred in your system, with full detail. This is similar to logs but intended for computational processing and not human processing.

Wrap it up!

I’ve covered how logs and metrics play huge parts in adding observability to your systems–with tracing and eventing also fitting in to the big picture. Logs provide you with detailed diagnostic information and sometimes a clear enough indication of a problem that you can create alerts off them. Metrics provide you with quantitative insights into the running system and are also an ideal source of alerts. Remember: don’t take on the hard work of adding observability, logs, dashboards,  alerts, and anomaly detection to your platform without making these things truly useful or actionable.

Finally, don’t forget to pay attention to those annoying and often below-the-radar gaps in your observability portfolio; like the logging issue we had in our backend systems at VictorOps. You might find that alleviating these annoyances leads to happier, more enthusiastic teams with a renewed stake in creating better systems.

The post From Scala Unified Logging to Full System Observability
Part 3 of 3: Metrics and the Bigger Picture
appeared first on VictorOps.

Read the original blog entry...

More Stories By VictorOps Blog

VictorOps is making on-call suck less with the only collaborative alert management platform on the market.

With easy on-call scheduling management, a real-time incident timeline that gives you contextual relevance around your alerts and powerful reporting features that make post-mortems more effective, VictorOps helps your IT/DevOps team solve problems faster.

Latest Stories
This session will provide an introduction to Cloud driven quality and transformation and highlight the key features that comprise it. A perspective on the cloud transformation lifecycle, transformation levers, and transformation framework will be shared. At Cognizant, we have developed a transformation strategy to enable the migration of business critical workloads to cloud environments. The strategy encompasses a set of transformation levers across the cloud transformation lifecycle to enhance ...
Your job is mostly boring. Many of the IT operations tasks you perform on a day-to-day basis are repetitive and dull. Utilizing automation can improve your work life, automating away the drudgery and embracing the passion for technology that got you started in the first place. In this presentation, I'll talk about what automation is, and how to approach implementing it in the context of IT Operations. Ned will discuss keys to success in the long term and include practical real-world examples. Ge...
The challenges of aggregating data from consumer-oriented devices, such as wearable technologies and smart thermostats, are fairly well-understood. However, there are a new set of challenges for IoT devices that generate megabytes or gigabytes of data per second. Certainly, the infrastructure will have to change, as those volumes of data will likely overwhelm the available bandwidth for aggregating the data into a central repository. Ochandarena discusses a whole new way to think about your next...
So the dumpster is on fire. Again. The site's down. Your boss's face is an ever-deepening purple. And you begin debating whether you should join the #incident channel or call an ambulance to deal with his impending stroke. Yes, we know this is a developer's fault. There's plenty of time for blame later. Postmortems have a macabre name because they were once intended to be Viking-like funerals for someone's job. But we're civilized now. Sort of. So we call them post-incident reviews. Fires are ne...
Whenever a new technology hits the high points of hype, everyone starts talking about it like it will solve all their business problems. Blockchain is one of those technologies. According to Gartner's latest report on the hype cycle of emerging technologies, blockchain has just passed the peak of their hype cycle curve. If you read the news articles about it, one would think it has taken over the technology world. No disruptive technology is without its challenges and potential impediments t...
Hackers took three days to identify and exploit a known vulnerability in Equifax’s web applications. I will share new data that reveals why three days (at most) is the new normal for DevSecOps teams to move new business /security requirements from design into production. This session aims to enlighten DevOps teams, security and development professionals by sharing results from the 4th annual State of the Software Supply Chain Report -- a blend of public and proprietary data with expert researc...
CloudEXPO New York 2018, colocated with DevOpsSUMMIT and DXWorldEXPO New York 2018 will be held November 12-13, 2018, in New York City and will bring together Cloud Computing, FinTech and Blockchain, Digital Transformation, Big Data, Internet of Things, DevOps, AI and Machine Learning to one location.
CloudEXPO | DevOpsSUMMIT | DXWorldEXPO are the world's most influential, independent events where Cloud Computing was coined and where technology buyers and vendors meet to experience and discuss the big picture of Digital Transformation and all of the strategies, tactics, and tools they need to realize their goals. Sponsors of DXWorldEXPO | CloudEXPO benefit from unmatched branding, profile building and lead generation opportunities.
DXWorldEXPO LLC announced today that Nutanix has been named "Platinum Sponsor" of CloudEXPO | DevOpsSUMMIT | DXWorldEXPO New York, which will take place November 12-13, 2018 in New York City. Nutanix makes infrastructure invisible, elevating IT to focus on the applications and services that power their business. The Nutanix Enterprise Cloud Platform blends web-scale engineering and consumer-grade design to natively converge server, storage, virtualization and networking into a resilient, softwar...
The digital transformation is real! To adapt, IT professionals need to transform their own skillset to become more multi-dimensional by gaining both depth and breadth of a wide variety of knowledge and competencies. Historically, while IT has been built on a foundation of specialty (or "I" shaped) silos, the DevOps principle of "shifting left" is opening up opportunities for developers, operational staff, security and others to grow their skills portfolio, advance their careers and become "T"-sh...
Lori MacVittie is a subject matter expert on emerging technology responsible for outbound evangelism across F5's entire product suite. MacVittie has extensive development and technical architecture experience in both high-tech and enterprise organizations, in addition to network and systems administration expertise. Prior to joining F5, MacVittie was an award-winning technology editor at Network Computing Magazine where she evaluated and tested application-focused technologies including app secu...
DXWorldEXPO LLC announced today that Big Data Federation to Exhibit at the 22nd International CloudEXPO, colocated with DevOpsSUMMIT and DXWorldEXPO, November 12-13, 2018 in New York City. Big Data Federation, Inc. develops and applies artificial intelligence to predict financial and economic events that matter. The company uncovers patterns and precise drivers of performance and outcomes with the aid of machine-learning algorithms, big data, and fundamental analysis. Their products are deployed...
ICC is a computer systems integrator and server manufacturing company focused on developing products and product appliances to meet a wide range of computational needs for many industries. Their solutions provide benefits across many environments, such as datacenter deployment, HPC, workstations, storage networks and standalone server installations. ICC has been in business for over 23 years and their phenomenal range of clients include multinational corporations, universities, and small busines...
This sixteen (16) hour course provides an introduction to DevOps, the cultural and professional movement that stresses communication, collaboration, integration and automation in order to improve the flow of work between software developers and IT operations professionals. Improved workflows will result in an improved ability to design, develop, deploy and operate software and services faster.
Headquartered in Plainsboro, NJ, Synametrics Technologies has provided IT professionals and computer systems developers since 1997. Based on the success of their initial product offerings (WinSQL and DeltaCopy), the company continues to create and hone innovative products that help its customers get more from their computer applications, databases and infrastructure. To date, over one million users around the world have chosen Synametrics solutions to help power their accelerated business or per...