A Bird’s Eye View Via Boxplot

The impact of website/app performance on the bottom line of an Internet firm is an undisputed fact (refer to our earlier blog for further discussion on the subject). Over the years, the industry has moved from treating performance as an afterthought to making it a top priority. That said, performance analysis is easier said than done. To see why, let’s carry out a comparative performance analysis (measured via, say, webpage response time) of some of the leading airlines. The plot below shows a week-long snapshot where the aforementioned metric was sampled every 5 minutes (the data was extracted via the Catchpoint portal).

[Figure: webpage response time of the leading airlines over one week, sampled every 5 minutes]

With the increasing maturity of tooling, data collection has become a commodity today. However, any meaningful analysis, even visual analysis, of the plot above is not practical. One may wonder what would happen if one were to relax the sampling rate to contain the “too much data” problem.
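Relaxing the sampling rate amounts to keeping only every k-th observation. The following is a minimal sketch of that idea, not the way the Catchpoint portal does it; the function name `downsample` and the synthetic series are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def downsample(samples, factor):
    """Keep every `factor`-th sample (factor=3 turns 5-minute data into 15-minute data)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return samples[::factor]


# Synthetic one-hour series of (timestamp, response time in ms) pairs, one every 5 minutes.
# The values are made up; index 4 carries a single 4000 ms spike.
start = datetime(2017, 2, 1)
five_min = [
    (start + timedelta(minutes=5 * i), 4000 if i == 4 else 1000)
    for i in range(12)
]

fifteen_min = downsample(five_min, 3)  # keeps indices 0, 3, 6, 9
thirty_min = downsample(five_min, 6)   # keeps indices 0, 6

worst_5 = max(value for _, value in five_min)
worst_15 = max(value for _, value in fifteen_min)
```

In this synthetic series the spike falls between the retained samples, so it is absent from the 15-minute view. A sparser plot is easier to read, but it can hide exactly the hiccups one is looking for.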

[Figure: webpage response time of the leading airlines over the same week, sampled every 15 minutes]

The plot above corresponds to the same time period, but with a sampling period of 15 minutes. The overlap between the time series is still too heavy, making it very hard to derive any material insights. How about relaxing the sampling further?

[Figure: webpage response time of the leading airlines over the same week, sampled every 30 minutes]

The plot above corresponds to the same time period, but with a sampling period of 30 minutes. From it we note that, on average, Alaska Airlines has the best performance and Virgin the worst. Having said that, it is difficult to assess from this plot how often each airline experiences a performance hiccup. Concretely, diving deeper to figure out how often one’s website experiences a webpage response time of, say, >3 seconds might lead to a useful discovery regarding user churn. To this end, a common method is to analyze the probability density distribution of the metric of interest, as exemplified by the plot below (note that it corresponds to the data set sampled every 5 minutes).

[Figure: probability density distribution of webpage response time per airline, data sampled every 5 minutes]

A lot of valuable insight can be extracted from the plot above based on the following:

  • Relative location of the peaks of each distribution
  • The spread (an indicator of variance) of the distribution
  • The fatness of the tails: this sheds light on the extent to which the user base is affected by, in the current context, outlier webpage response times
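The three properties listed above can be read off numerically as well as visually. The sketch below uses a plain histogram as a crude density estimate and counts the share of samples above a threshold such as the 3 seconds mentioned earlier. The helper names and the response times are made up for illustration and are not taken from the plots.

```python
from collections import Counter


def histogram_density(values, bin_width):
    """Map each bin start to a density so that sum(density * bin_width) equals 1."""
    counts = Counter(int(v // bin_width) for v in values)
    n = len(values)
    return {k * bin_width: c / (n * bin_width) for k, c in sorted(counts.items())}


def fraction_above(values, threshold):
    """Share of samples strictly greater than `threshold` (a simple tail measure)."""
    if not values:
        raise ValueError("values must be non-empty")
    return sum(1 for v in values if v > threshold) / len(values)


# Hypothetical webpage response times in seconds.
response_times_s = [1.2, 0.9, 3.4, 1.1, 2.8, 5.0, 1.0, 1.3, 0.8, 3.1]

density = histogram_density(response_times_s, bin_width=0.5)
peak_bin_start = max(density, key=density.get)         # location of the peak
share_slow = fraction_above(response_times_s, 3.0)     # weight of the right tail
```

For real data a kernel density estimate would give a smoother curve; the histogram is used here only to keep the sketch free of dependencies.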

Still, the probability density distribution is not conducive to comparing key statistics such as, but not limited to, the median, the first and third quartiles, and the density of outliers. The boxplot, proposed over four decades ago (see [1] and [7]), is tailor-made for this. An example illustration of a boxplot is shown below.

[Figure: example illustration of a boxplot]

A boxplot is made up of five components that are carefully chosen to give a robust summary of the distribution of a dataset:

  1. The median
  2. The upper and lower fourths (quartiles), commonly referred to as “hinges”
  3. The data values adjacent to the upper and lower fences, which lie 1.5 times the inter-quartile range (IQR) from the hinges
  4. Two whiskers that connect the hinges to the fences
  5. Anomalies, which are data points that lie beyond the fences
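The five components above can be computed in a few lines. This is a minimal sketch using Python's standard library; the function name `boxplot_stats` and the sample data are assumptions, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which can differ slightly from Tukey's hinges on small samples.

```python
import statistics


def boxplot_stats(values, k=1.5):
    """Return the median, hinges, fences, adjacent values, and points beyond the fences."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Adjacent values: the most extreme data points that still lie within the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Hypothetical sample with one obvious extreme value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

Here the hinges are 3.25 and 7.75, so the IQR is 4.5 and the fences sit at -3.5 and 14.5. The whiskers end at the adjacent values 1 and 9, and the value 30 is flagged as lying beyond the upper fence.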

Boxplot for the data set sampled every 5 minutes is shown below:

[Figure: boxplot of webpage response time per airline, data sampled every 5 minutes]

From the plot above, it is straightforward to compare the various descriptive statistics of webpage response time across different airlines. For instance, although both Southwest and United have a lower median than Delta, the latter has a lower spread (= IQR = height of the box) than the former two. In a similar vein, we note that not only does Virgin have the highest median webpage response time, it also has the highest IQR. This clearly does not speak well of the experience of Virgin’s (potential) customers.

One of the common use cases of boxplots is to detect anomalies. Although robust anomaly detection is subject to a multitude of factors, boxplots serve as a first-cut means to filter out potential anomalies. In the case of a standard normal distribution, 0.35% of the data points along each tail are deemed anomalous (see below).

[Figure: boxplot fences overlaid on a standard normal distribution, with 0.35% of the data points beyond each fence]
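The 0.35% figure follows directly from the 1.5 × IQR rule applied to a normal distribution. The short check below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = std_normal.inv_cdf(0.25)   # lower hinge
q3 = std_normal.inv_cdf(0.75)   # upper hinge
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr    # roughly 2.698 standard deviations
tail_share = 1.0 - std_normal.cdf(upper_fence)  # roughly 0.0035, i.e. about 0.35%
```

By symmetry the lower tail has the same share, so about 0.7% of normally distributed points fall beyond the fences in total.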

The limitations of boxplot are that it is primarily suited to:

  • Almost symmetric data
  • Approximately mesokurtic distributions, i.e., distributions with close to zero excess kurtosis

The above two assumptions do not hold in general for real-world data. This is exemplified by the plot of the probability density distribution above.

One way to address the former, i.e., asymmetry, is to use medcouple – a robust metric to measure skewness of a univariate distribution. Using medcouple (MC), the whiskers of the boxplot are redefined as follows:

[Figure: definition of the boxplot whiskers adjusted using the medcouple (MC)]
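The original formula image is not reproduced here. The sketch below uses the fence definition of the adjusted boxplot as it is commonly stated from [6]: the 1.5 × IQR multiplier is scaled by exp(-4·MC) and exp(3·MC) when MC ≥ 0, and by exp(-3·MC) and exp(4·MC) when MC < 0. It should be checked against the paper before use. The medcouple here is a naive O(n²) version that simply skips tied pairs at the median, which is a simplification of the exact definition, and all function names are assumptions.

```python
import math
import statistics


def medcouple_naive(values):
    """Naive O(n^2) medcouple; pairs tied at the median are skipped (a simplification)."""
    data = sorted(values)
    med = statistics.median(data)
    lower = [x for x in data if x <= med]
    upper = [x for x in data if x >= med]
    kernel = [
        ((xj - med) - (med - xi)) / (xj - xi)
        for xi in lower
        for xj in upper
        if xj != xi
    ]
    return statistics.median(kernel) if kernel else 0.0


def adjusted_fences(values, k=1.5):
    """Skewness-adjusted fences; they reduce to the standard fences when MC is 0."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mc = medcouple_naive(data)
    if mc >= 0:
        lower = q1 - k * math.exp(-4 * mc) * iqr
        upper = q3 + k * math.exp(3 * mc) * iqr
    else:
        lower = q1 - k * math.exp(-3 * mc) * iqr
        upper = q3 + k * math.exp(4 * mc) * iqr
    return lower, upper, mc


# Hypothetical data: one symmetric sample and one right-skewed sample.
sym_lower, sym_upper, sym_mc = adjusted_fences([1, 2, 3, 4, 5, 6, 7, 8, 9])
skew_lower, skew_upper, skew_mc = adjusted_fences([1, 2, 3, 4, 5, 10, 20, 40, 80])
```

For the symmetric sample the medcouple is 0 and the fences match the standard ones. For the right-skewed sample the medcouple is positive, which pushes the upper fence outward and pulls the lower fence inward, so fewer points in the long right tail are flagged.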

A number of techniques have been proposed, see [6, 8, 9], to adapt the boxplot to different characteristics of the underlying distribution. Likewise, several variations of boxplots have been proposed, see [2]; for instance, varying the width of the box based on the sample size. In a similar vein, the addition of other graphical elements to display distributional features like kurtosis [3], skewness and multimodality [4], and mean and standard error [5] has been proposed. A user of Catchpoint can plug in a boxplot plotting library of their choice in a straightforward fashion (refer to our earlier blog for this).
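Varying the box width with the sample size, mentioned above, can be sketched as follows. Square-root scaling is used here, which is the scaling usually attributed to [2]; the function name, the maximum width, and the sample counts are assumptions for illustration. The resulting widths could be handed to whichever plotting library is in use.

```python
import math


def box_widths(sample_sizes, max_width=0.8):
    """Scale each box width by the square root of its sample size relative to the largest group."""
    largest = max(sample_sizes.values())
    return {
        name: max_width * math.sqrt(n / largest)
        for name, n in sample_sizes.items()
    }


# Hypothetical number of samples collected per series.
widths = box_widths({"series_a": 2000, "series_b": 500, "series_c": 2000})
```

A series with a quarter of the samples gets a box half as wide, which signals at a glance that its quartiles rest on less data.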

By: Arun Kejariwal, Ryan Pellette, and Mehdi Daoudi



[1] “Exploratory Data Analysis”, by J. W. Tukey, Addison–Wesley, 1977.

[2] “Variations of Box Plots”, by R. McGill, J. W. Tukey and W. A. Larsen, 1978.

[3] “Shape-finder box plots”, by M. Aslam and A. Khurshid, 1991.

[4] “Can the box plot be improved?”, by C. Choonpradub and D. McNeil, 2005.

[5] “The shifting boxplot”, by F. Marmolejo-Ramos and T. Tian, 2010.

[6] “An adjusted boxplot for skewed distributions”, by M. Hubert and E. Vandervieren, 2008.

[7] “40 years of Boxplots”, by H. Wickham and L. Stryjewski, 2011. http://vita.had.co.nz/papers/boxplots.pdf

[8] “A generalized boxplot for skewed and heavy-tailed distributions”, by C. Bruffaerts, V. Verardi and C. Vermandele, 2014.

[9] “A Generalized Boxplot for Skewed and Heavy-tailed Distributions implemented in Stata”, by V. Verardi. http://www.stata.com/meeting/uk14/abstracts/materials/uk14_verardi.pdf

The post A Bird’s Eye View Via Boxplot appeared first on Catchpoint's Blog.
