The Devil’s in the Details: Understanding Event Correlation

Correlation and root-cause diagnosis have always been the holy grail of IT performance monitoring. Instead of managing a flood of alerts, correlation and root-cause diagnosis help IT administrators determine where the real cause of a problem lies and resolve it quickly, minimizing user impact and business loss.

However, all this is not as simple as it sounds! There has always been confusion around event correlation. Terms like event storms, false alerts, root-cause, correlation, analytics, and others are used by vendors with reckless abandon. The result for customers can be a lot of confusion.

As a result, I’ve always liked some of the best practice guidance around monitoring and event management. Event management’s objectives – detect events, make sense of them and determine the appropriate control action – can be a good way to understand these concepts, and breaking things down along these lines can help us understand what can be a complex subject.

Data Collection and Detecting Events

This is exactly what it sounds like. Monitors need to collect all sorts of data: from devices, applications, services, users and more. This data can be collected in many ways, by many tools and/or components.

So, the first questions you’ll need to ask include “what are we trying to achieve?” and “what data do we need to collect?” Most customers ask these questions based on their individual perspective – a device, a technology silo, a supporting IT service, a customer-facing IT service or perhaps even a single user.

This perspective can result in a desire for a LOT of monitoring data, where the customer is attempting to “cover all bases” by collecting all the data they can get their hands on – “just in case.” Freeware or log monitoring can be simple, cost-effective ways of collecting large amounts of raw data.

But beware of the perception that data collection is where most of your monitoring costs are. In truth, more data is not necessarily a good thing and data collection is not where the real costs to your organization lie, in any case. To determine real causes of performance problems, you’ll have to balance your desire for fast and inexpensive data collection with your needs for making sense of events.

Making Sense of Events with Event Correlation

Root-Cause Analysis

This is where processing and analyzing the data that you’ve collected occurs, and where you ask questions like “how frequently should we collect data?” “what format does the data need to be in?” and “what analysis do we need to perform?”

For example, do you just require information about what’s happening right now, or will you require history (e.g. to identify trends, etc.)? How granular does the collected data need to be, meaning, what are you going to do with it (e.g. identify a remediation action, etc.)? Putting some thought into your monitoring objectives is an important element of determining a data collection strategy.

Most monitoring products today provide some level of formatting and reporting of monitoring data in the form of charts and graphs. This is where terms such as “root-cause analysis” or “correlation” are often used. The key question here is who is doing the analysis and how. Relying on highly skilled experts to interpret and analyze monitoring data is where the real costs of monitoring come from, and this is where the confusion really begins – event correlation, or, making sense of events.

Approaches to Event Correlation

Most customers assume that when they hear “root-cause analysis” or “correlation” there is some level of automation occurring, but relying on your IT staff to interpret log files or graphs is, clearly, manual analysis or correlation. Manual correlation is time-consuming, labor-intensive, requires expertise, and is not scalable as your infrastructure grows. Herein lies the need for monitoring tools that automate this process.

There are many common approaches to correlation:

Rule-Based Correlation
A common and traditional approach to event correlation is rule-based, circuit-based or network-based. These forms of correlation involve defining how the events themselves should be analyzed, and a rule-base is built for each combination of events. The early days of network management made use of many of these solutions. As IT infrastructures have evolved, the amount of data collected and the effort required to build rules that account for every possible event combination have made this approach very cumbersome. The challenge is that you must maintain the rule-base, and with the dynamic nature of today’s environments this is becoming increasingly difficult.
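To make the maintenance burden concrete, here is a minimal sketch of rule-based correlation in Python. The event names, the diagnoses and the `correlate_by_rules` helper are all hypothetical, invented for illustration rather than taken from any particular product. Each rule ties one combination of events to one diagnosis, so every new combination you care about means another rule to write and keep current.

```python
# Hypothetical rule base: each rule maps a combination of event types
# to a diagnosis. Event names and diagnoses are illustrative assumptions.
RULES = {
    frozenset({"link_down", "router_unreachable"}): "Upstream link failure",
    frozenset({"disk_full", "db_write_errors"}): "Database volume out of space",
}


def correlate_by_rules(active_events, rules=RULES):
    """Return the diagnoses of every rule whose events are all active."""
    active = set(active_events)
    # A rule fires only when its whole event combination is present.
    return sorted(
        diagnosis for combo, diagnosis in rules.items() if combo <= active
    )


if __name__ == "__main__":
    print(correlate_by_rules(["link_down", "router_unreachable", "cpu_high"]))
    print(correlate_by_rules(["cpu_high"]))  # no rule covers this event
```

Note that the second call returns nothing at all: an event combination with no matching rule is simply not correlated, which is why the rule-base has to grow with the environment.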

History-Based Correlation
Another approach is to learn from past events. Terms like “machine learning” and “analytics” have been used for this approach. What’s common is learning behavior from past events, and if these patterns re-occur you can quickly isolate where the issue was the last time it was observed.

These approaches are independent of the technology domain, so no domain knowledge is needed. This may limit the level of detail that can be obtained, and if the patterns have not been learned from experience, then no correlation will take place. The drawback, of course, is that when problems occur in the software layers, many of the event patterns are new. Furthermore, the dynamic nature of today’s environments makes it less likely that these problem patterns will reoccur.
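The idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not how any specific analytics product works: the `IncidentHistory` class, the event names and the use of Jaccard similarity with a 0.75 threshold are all choices made for the example. Past incidents are stored as sets of event types, and a new set of events is matched against the closest pattern seen before.

```python
class IncidentHistory:
    """Hypothetical store of past incidents, keyed by their event pattern."""

    def __init__(self):
        self._known = {}  # frozenset of event types -> root cause seen last time

    def learn(self, event_types, root_cause):
        """Remember which root cause was found for this pattern of events."""
        self._known[frozenset(event_types)] = root_cause

    def diagnose(self, event_types, min_similarity=0.75):
        """Return the root cause of the most similar past pattern, or None."""
        observed = frozenset(event_types)
        best_cause, best_score = None, 0.0
        for pattern, cause in self._known.items():
            union = observed | pattern
            # Jaccard similarity: shared events divided by all events involved.
            score = len(observed & pattern) / len(union) if union else 0.0
            if score > best_score:
                best_cause, best_score = cause, score
        return best_cause if best_score >= min_similarity else None


if __name__ == "__main__":
    history = IncidentHistory()
    history.learn({"slow_logon", "high_disk_latency", "queue_depth_high"},
                  "Storage array saturation")
    print(history.diagnose({"slow_logon", "high_disk_latency", "queue_depth_high"}))
    print(history.diagnose({"jvm_gc_pause"}))  # never seen before
```

The last call illustrates the limitation described above: a pattern that has never been observed yields no diagnosis, however serious the underlying problem.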

Domain-Based Correlation
These approaches go by terms like “embedded correlation.” They do not use rules per se, but organize the measurement data using layered and topology-based dependencies in a time-based model. This enables the monitored data to be analyzed based on dependencies and timing of the events, so the accuracy of the correlation improves as new data is obtained.

The advantage of this approach is that users can get very specific, granular, actionable information from the correlated data without having to maintain rule bases or rely on history. And since virtual and cloud infrastructures are dynamic, the dependencies (e.g. between VMs and physical machines) are dynamic. So the auto-correlating tool must be able to detect these dynamic dependencies to use them for actionable root-cause diagnosis.
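As a rough illustration of dependency-driven correlation, the Python sketch below walks a topology map and reports only those alerting components that have no alerting dependency beneath them. Everything in it is an assumption made for the example: the component names, the `DEPENDS_ON` map and the `probable_root_causes` function are hypothetical, and the sketch ignores the timing of events, which a real domain-based tool would also take into account.

```python
# Hypothetical topology: each component lists what it directly depends on.
# In a virtual or cloud environment this map would have to be rediscovered
# continuously, e.g. whenever a VM moves to another physical host.
DEPENDS_ON = {
    "web_app": ["app_vm"],
    "app_vm": ["hypervisor"],
    "hypervisor": [],
}


def probable_root_causes(alerting, depends_on):
    """Return alerting components with no alerting dependency below them."""
    alerting = set(alerting)

    def has_alerting_dependency(component, seen):
        for dep in depends_on.get(component, []):
            if dep in seen:  # guard against cycles in the topology
                continue
            seen.add(dep)
            if dep in alerting or has_alerting_dependency(dep, seen):
                return True
        return False

    return sorted(
        c for c in alerting if not has_alerting_dependency(c, {c})
    )


if __name__ == "__main__":
    # Three alerts collapse into one probable cause.
    print(probable_root_causes({"web_app", "app_vm", "hypervisor"}, DEPENDS_ON))
    # With no alert below it, the application itself is the suspect.
    print(probable_root_causes({"web_app"}, DEPENDS_ON))
```

No rule-base or incident history is consulted here; the only thing that has to stay current is the dependency map.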

The Devil’s Often in the Details
How far should root-cause diagnosis go? This depends on the individual seeking to use the monitoring tool. For some, knowing that the cause of slowness is the high CPU usage of a Java application may be sufficient; they can simply pass the problem on to a Java expert to investigate. On the other hand, the Java expert may want to know which thread and which line of code within the application is causing the issue. This level of diagnosis is desired in real-time, but often, the experts may not be at hand when a problem surfaces. Therefore, having the ability for the monitoring tool to go back in time and present the same level of detail for root-cause diagnosis is equally important.

The level of detail can be the difference between an actionable event and one that requires a skilled IT person to further investigate.

Determine the Appropriate Control Action (Automated IT Operations)

This is where some organizations are focusing, sometimes with limited-to-no evaluation of the monitoring environment. Many IT operations tasks can be automated with limited concern for monitoring, such as automating the provisioning process or request fulfillment.

But if your goal is to automate remediation actions when issues arise, this will involve monitoring. The event management process tends to trigger many processes such as Incident, Problem, Change, Capacity and others. But before you automate remediation tasks, you’ll need to have a high degree of certainty that you’ve correctly identified the root-cause.

The level of detailed diagnostics is relevant here, since without specific detail you may only be able to automate very simple remediation actions (re-start a server, etc.).

As you begin to populate operational remediation policies (also sometimes called rules), you will need to ensure that you can effectively maintain these policies. Therefore, rule-based correlation approaches can come with risk. Failure to maintain the correlation rules can render the policy rules obsolete. Solutions that can correlate to a high degree of accuracy, as well as eliminate or simplify correlation maintenance, can be an advantage here.
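One way to picture the dependency between diagnosis and automation is a small policy lookup that only acts when the diagnosis is both specific and trusted. The Python sketch below is purely illustrative: the `Diagnosis` fields, the `POLICIES` table, the action names and the 0.9 confidence threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    component: str     # where the problem was isolated
    cause: str         # what was found to be wrong
    confidence: float  # 0.0 to 1.0, as reported by the correlation step


# Hypothetical remediation policies, keyed by (component, cause).
POLICIES = {
    ("app_server", "process_hung"): "restart_service",
    ("db_volume", "disk_full"): "purge_old_logs",
}


def choose_action(diagnosis, policies=POLICIES, min_confidence=0.9):
    """Automate only when a policy exists and the diagnosis is trusted."""
    action = policies.get((diagnosis.component, diagnosis.cause))
    if action is not None and diagnosis.confidence >= min_confidence:
        return action
    # Vague or uncertain diagnoses go to a person instead.
    return "escalate_to_operator"


if __name__ == "__main__":
    print(choose_action(Diagnosis("app_server", "process_hung", 0.97)))
    print(choose_action(Diagnosis("app_server", "process_hung", 0.55)))
    print(choose_action(Diagnosis("app_server", "slow_response", 0.97)))
```

The third call shows why detail matters: a confident but non-specific diagnosis matches no policy, so nothing can be safely automated.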

Automated remediation can be a more significant driver of cost savings than simple data collection, but requires us to make sense of events before effectively achieving this goal.

The Future of Root-Cause Analysis & Event Correlation

There’s no question that with the emergence of new technologies such as containers, microservices, IoT and big data, the monitoring world will need to keep pace with growing complexity.

Advances in artificial intelligence and analytics will surely drive continued improvements in monitoring, and we hear a lot about advancements in these areas. But remember, we’ve been down this road before. If you do not understand how the monitor will work, or if it seems too good to be true, be sure to test it in your environment.

The increasing reliance of the business on IT services indicates a likelihood that the need for correlation intelligence that can pinpoint the cause of an issue will increase in importance over time.

So, if a solution touts the benefits of root-cause analysis and also provides you with a “war room” at the same time, or promises autonomic IT operations without explaining how it will get to an actionable diagnosis, don’t forget…

…the devil’s in the details.

The post The Devil’s in the Details: Understanding Event Correlation appeared first on eG Innovations.

More Stories By John Worthington

John Worthington is the Director of Product Marketing for eG Innovations, a global provider of unified performance monitoring and root-cause diagnosis solutions for virtual, physical and cloud IT infrastructures. He is an IT veteran with more than 30 years of executive experience in delivering positive user experiences through innovative practices in information technology such as ITSM and ITIL.

As CEO and Principal of MyServiceMonitor, LLC, he assisted clients in effectively adapting service lifecycle processes by leveraging eG Enterprise. He then went on to assignments with ThirdSky and VMware, utilizing industry certifications including ITIL Expert and DevOps.

John has more than a decade of experience helping customers transform IT operations to IT-as-a-Service operating models. He participates in industry forums, client and analyst briefings and provides thought leadership to eG Innovations, customers and partners.
