|By Michael Kopp||
|December 12, 2012 07:30 AM EST||
Anyone who ever monitored or analyzed an application uses or has used averages. They are simple to understand and calculate. We tend to ignore just how wrong the picture is that averages paint of the world. To emphasis the point let me give you a real-world example outside of the performance space that I read recently in a newspaper.
The article was explaining that the average salary in a certain region in Europe was 1900 Euro's (to be clear this would be quite good in that region!). However when looking closer they found out that the majority, namely 9 out of 10 people, only earned around 1000 Euros and one would earn 10.000 (I over simplified this of course, but you get the idea). If you do the math you will see that the average of this is indeed 1900, but we can all agree that this does not represent the "average" salary as we would use the word in day to day live. So now let's apply this thinking to application performance.
The Average Response Time
The average response time is by far the most commonly used metric in application performance management. We assume that this represents a "normal" transaction, however this would only be true if the response time is always the same (all transaction run at equal speed) or the response time distribution is roughly bell curved.
A Bell curve represents the "normal" distribution of response times in which the average and the median are the same. It rarely ever occurs in real applications
In a Bell Curve the average (mean) and median are the same. In other words observed performance would represent the majority (half or more than half) of the transactions.
In reality most applications have few very heavy outliers; a statistician would say that the curve has a long tail. A long tail does not imply many slow transactions, but few that are magnitudes slower than the norm.
This is a typical Response Time Distribution with few but heavy outliers - it has a long tail. The average here is dragged to the right by the long tail.
We recognize that the average no longer represents the bulk of the transactions but can be a lot higher than the median.
You can now argue that this is not a problem as long as the average doesn't look better than the median. I would disagree, but let's look at another real-world scenario experienced by many of our customers:
This is another typical Response Time Distribution. Here we have quite a few very fast transactions that drag the average to the left of the actual median
In this case a considerable percentage of transactions are very, very fast (10-20 percent), while the bulk of transactions are several times slower. The median would still tell us the true story, but the average all of a sudden looks a lot faster than most of our transactions actually are. This is very typical in search engines or when caches are involved - some transactions are very fast, but the bulk are normal. Another reason for this scenario are failed transactions, more specifically transactions that failed fast. Many real-world applications have a failure rate of 1-10 percent (due to user errors or validation errors). These failed transactions are often magnitudes faster than the real ones and consequently distorted an average.
Of course performance analysts are not stupid and regularly try to compensate with higher frequency charts (compensating by looking at smaller aggregates visually) and by taking in minimum and maximum observed response times. However we can often only do this if we know the application very well, those unfamiliar with the application might easily misinterpret the charts. Because of the depth and type of knowledge required for this, it's difficult to communicate your analysis to other people - think how many arguments between IT teams have been caused by this. And that's before we even begin to think about communicating with business stakeholders!
A better metric by far are percentiles, because they allow us to understand the distribution. But before we look at percentiles, let's take a look a key feature in every production monitoring solution: Automatic Baselining and Alerting.
Automatic Baselining and Alerting
In real-world environments, performance gets attention when it is poor and has a negative impact on the business and users. But how can we identify performance issues quickly to prevent negative effects? We cannot alert on every slow transaction, since there are always some. In addition, most operations teams have to maintain a large number of applications and are not familiar with all of them, so manually setting thresholds can be inaccurate, quite painful and time-consuming.
The industry has come up with a solution called Automatic Baselining. Baselining calculates out the "normal" performance and only alerts us when an application slows down or produces more errors than usual. Most approaches rely on averages and standard deviations.
Without going into statistical details, this approach again assumes that the response times are distributed over a bell curve:
The Standard Deviation represents 33% of all transactions with the mean as the middle. 2xStandard Deviation represents 66% and thus the majority, everything outside could be considered an outlier. However most real world scenarios are not bell curved...
Typically, transactions that are outside two times standard deviation are treated as slow and captured for analysis. An alert is raised if the average moves significantly. In a bell curve this would account for the slowest 16.5 percent (and you can of course adjust that); however; if the response time distribution does not represent a bell curve, it becomes inaccurate. We either end up with a lot of false positives (transactions that are a lot slower than the average but when looking at the curve lie within the norm) or we miss a lot of problems (false negatives). In addition if the curve is not a bell curve, then the average can differ a lot from the median; applying a standard deviation to such an average can lead to quite a different result than you would expect. To work around this problem these algorithms have many tunable variables and a lot of "hacks" for specific use cases.
Why I Love Percentiles
A percentile tells me which part of the curve I am looking at and how many transactions are represented by that metric. To visualize this look at the following chart:
This chart shows the 50th and 90th percentile along with the average of the same transaction. It shows that the average is influenced far mor heavily by the 90th, thus by outliers and not by the bulk of the transactions
The green line represents the average. As you can see it is very volatile. The other two lines represent the 50th and 90th percentile. As we can see the 50th percentile (or median) is rather stable but has a couple of jumps. These jumps represent real performance degradation for the majority (50%) of the transactions. The 90th percentile (this is the start of the "tail") is a lot more volatile, which means that the outliers slowness depends on data or user behavior. What's important here is that the average is heavily influenced (dragged) by the 90th percentile, the tail, rather than the bulk of the transactions.
If the 50th percentile (median) of a response time is 500ms that means that 50% of my transactions are either as fast or faster than 500ms. If the 90th percentile of the same transaction is at 1000ms it means that 90% are as fast or faster and only 10% are slower. The average in this case could either be lower than 500ms (on a heavy front curve), a lot higher (long tail) or somewhere in between. A percentile gives me a much better sense of my real world performance, because it shows me a slice of my response time curve.
For exactly that reason percentiles are perfect for automatic baselining. If the 50th percentile moves from 500ms to 600ms I know that 50% of my transactions suffered a 20% performance degradation. You need to react to that.
In many cases we see that the 75th or 90th percentile does not change at all in such a scenario. This means the slow transactions didn't get any slower, only the normal ones did. Depending on how long your tail is the average might not have moved at all in such a scenario.!
In other cases we see the 98th percentile degrading from 1s to 1.5 seconds while the 95th is stable at 900ms. This means that your application as a whole is stable, but a few outliers got worse, nothing to worry about immediately. Percentile-based alerts do not suffer from false positives, are a lot less volatile and don't miss any important performance degradations! Consequently a baselining approach that uses percentiles does not require a lot of tuning variables to work effectively.
The screenshot below shows the Median (50th Percentile) for a particular transaction jumping from about 50ms to about 500ms and triggering an alert as it is significantly above the calculated baseline (green line). The chart labeled "Slow Response Time" on the other hand shows the 90th percentile for the same transaction. These "outliers" also show an increase in response time but not significant enough to trigger an alert.
Here we see an automatic baselining dashboard with a violation at the 50th percentile. The violation is quite clear, at the same time the 90th percentile (right upper chart) does not violate. Because the outliers are so much slower than the bulk of the transaction an average would have been influenced by them and would not have have reacted quite as dramatically as the 50th percentile. We might have missed this clear violation!
How Can We Use Percentiles for Tuning?
Percentiles are also great for tuning, and giving your optimizations a particular goal. Let's say that something within my application is too slow in general and I need to make it faster. In this case I want to focus on bringing down the 90th percentile. This would ensure sure that the overall response time of the application goes down. In other cases I have unacceptably long outliers I want to focus on bringing down response time for transactions beyond the 98th or 99th percentile (only outliers). We see a lot of applications that have perfectly acceptable performance for the 90th percentile, with the 98th percentile being magnitudes worse.
In throughput oriented applications on the other hand I would want to make the majority of my transactions very fast, while accepting that an optimization makes a few outliers slower. I might therefore make sure that the 75th percentile goes down while trying to keep the 90th percentile stable or not getting a lot worse.
I could not make the same kind of observations with averages, minimum and maximum, but with percentiles they are very easy indeed.
Averages are ineffective because they are too simplistic and one-dimensional. Percentiles are a really great and easy way of understanding the real performance characteristics of your application. They also provide a great basis for automatic baselining, behavioral learning and optimizing your application with a proper focus. In short, percentiles are great!
|rtalexander 11/21/12 12:58:00 AM EST|
Hey, could you post a reference or two that covers the theory and/or practicalities of the approach you describe?
Effectively SMBs and government programs must address compounded regulatory compliance requirements. The most recent are Controlled Unclassified Information and the EU’s GDPR have Board Level implications. Managing sensitive data protection will likely result in acquisition criteria, demonstration requests and new requirements. Developers, as part of the pre-planning process and the associated supply chain, could benefit from updating their code libraries and design by incorporating changes.
Oct. 26, 2016 11:30 AM EDT Reads: 1,754
Whether they’re located in a public, private, or hybrid cloud environment, cloud technologies are constantly evolving. While the innovation is exciting, the end mission of delivering business value and rapidly producing incremental product features is paramount. In his session at @DevOpsSummit at 19th Cloud Expo, Kiran Chitturi, CTO Architect at Sungard AS, will discuss DevOps culture, its evolution of frameworks and technologies, and how it is achieving maturity. He will also cover various st...
Oct. 26, 2016 11:30 AM EDT Reads: 2,517
Enterprises have been using both Big Data and virtualization for years. Until recently, however, most enterprises have not combined the two. Big Data's demands for higher levels of performance, the ability to control quality-of-service (QoS), and the ability to adhere to SLAs have kept it on bare metal, apart from the modern data center cloud. With recent technology innovations, we've seen the advantages of bare metal erode to such a degree that the enhanced flexibility and reduced costs that ...
Oct. 26, 2016 11:29 AM EDT
Qosmos, the market leader for IP traffic classification and network intelligence technology, has announced that it will launch the Launch L7 Viewer at CloudExpo | @ThingsExpo Silicon Valley, being held November 1 – 3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. The L7 Viewer is a traffic analysis tool that provides complete visibility of all network traffic that crosses a virtualized infrastructure, up to Layer 7. It facilitates and accelerates common IT tasks such as VM migra...
Oct. 26, 2016 11:00 AM EDT Reads: 196
Smart Cities are here to stay, but for their promise to be delivered, the data they produce must not be put in new siloes. In his session at @ThingsExpo, Mathias Herberts, Co-founder and CTO of Cityzen Data, will deep dive into best practices that will ensure a successful smart city journey.
Oct. 26, 2016 11:00 AM EDT Reads: 3,170
More and more brands have jumped on the IoT bandwagon. We have an excess of wearables – activity trackers, smartwatches, smart glasses and sneakers, and more that track seemingly endless datapoints. However, most consumers have no idea what “IoT” means. Creating more wearables that track data shouldn't be the aim of brands; delivering meaningful, tangible relevance to their users should be. We're in a period in which the IoT pendulum is still swinging. Initially, it swung toward "smart for smar...
Oct. 26, 2016 10:30 AM EDT Reads: 1,120
Two weeks ago (November 3-5), I attended the Cloud Expo Silicon Valley as a speaker, where I presented on the security and privacy due diligence requirements for cloud solutions. Cloud security is a topical issue for every CIO, CISO, and technology buyer. Decision-makers are always looking for insights on how to mitigate the security risks of implementing and using cloud solutions. Based on the presentation topics covered at the conference, as well as the general discussions heard between sessi...
Oct. 26, 2016 10:15 AM EDT Reads: 5,166
19th Cloud Expo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Meanwhile, 94% of enterpri...
Oct. 26, 2016 10:15 AM EDT Reads: 4,954
operations aren’t merging to become one discipline. Nor is operations simply going away. Rather, DevOps is leading software development and operations – together with other practices such as security – to collaborate and coexist with less overhead and conflict than in the past. In his session at @DevOpsSummit at 19th Cloud Expo, Gordon Haff, Red Hat Technology Evangelist, will discuss what modern operational practices look like in a world in which applications are more loosely coupled, are deve...
Oct. 26, 2016 10:15 AM EDT Reads: 1,886
Regulatory requirements exist to promote the controlled sharing of information, while protecting the privacy and/or security of the information. Regulations for each type of information have their own set of rules, policies, and guidelines. Cloud Service Providers (CSP) are faced with increasing demand for services at decreasing prices. Demonstrating and maintaining compliance with regulations is a nontrivial task and doing so against numerous sets of regulatory requirements can be daunting task...
Oct. 26, 2016 10:00 AM EDT Reads: 1,807
Join Impiger for their featured webinar: ‘Cloud Computing: A Roadmap to Modern Software Delivery’ on November 10, 2016, at 12:00 pm CST. Very few companies have not experienced some impact to their IT delivery due to the evolution of cloud computing. This webinar is not about deciding whether you should entertain moving some or all of your IT to the cloud, but rather, a detailed look under the hood to help IT professionals understand how cloud adoption has evolved and what trends will impact th...
Oct. 26, 2016 10:00 AM EDT Reads: 478
The Internet of Things can drive efficiency for airlines and airports. In their session at @ThingsExpo, Shyam Varan Nath, Principal Architect with GE, and Sudip Majumder, senior director of development at Oracle, will discuss the technical details of the connected airline baggage and related social media solutions. These IoT applications will enhance travelers' journey experience and drive efficiency for the airlines and the airports. The session will include a working demo and a technical d...
Oct. 26, 2016 10:00 AM EDT Reads: 2,619
Security, data privacy, reliability, and regulatory compliance are critical factors when evaluating whether to move business applications from in-house, client-hosted environments to a cloud platform. Quality assurance plays a vital role in ensuring that the appropriate level of risk assessment, verification, and validation takes place to ensure business continuity during the migration to a new cloud platform.
Oct. 26, 2016 10:00 AM EDT Reads: 2,020
In his session at 19th Cloud Expo, Nick Son, Vice President of Cyber Risk & Public Sector at Coalfire, will discuss the latest information on the FedRAMP Program. Topics will cover: FedRAMP Readiness Assessment Report (RAR). This new process is designed to streamline and accelerate the FedRAMP process from the traditional timeline by initially focusing on technical capability instead of documentation preparedness. FedRAMP for High-impact level systems. Early in 2016 FedRAMP officially publishe...
Oct. 26, 2016 10:00 AM EDT Reads: 459
We are always online. We access our data, our finances, work, and various services on the Internet. But we live in a congested world of information in which the roads were built two decades ago. The quest for better, faster Internet routing has been around for a decade, but nobody solved this problem. We’ve seen band aid approaches like CDNs that attack a niche's slice of static content part of the Internet, but that’s it. It does not address the dynamic services-based Internet of today. It doe...
Oct. 26, 2016 09:45 AM EDT Reads: 1,837