
A Bird’s Eye View Via Boxplot

The impact of website/app performance on the bottom line of an Internet firm is an undisputed fact (refer to our earlier blog for further discussion on the subject). Over the years, the industry has stopped treating performance as an afterthought and made it a top priority. Now, performance analysis is easier said than done; for instance, let’s compare the performance of some of the leading airlines, measured via, say, webpage response time. The plot below shows a week-long snapshot where the aforementioned metric was sampled every 5 minutes (the data was extracted via the Catchpoint portal).

[Figure: webpage response time of the airlines over one week, sampled every 5 minutes]

With the increasing maturity of tooling, data collection has become a commodity today. However, any meaningful analysis, even visual analysis, of the plot is not practical. One may wonder what would happen if one were to relax the sampling rate to contain the “too much data” problem.

[Figure: the same week of webpage response times, sampled every 15 minutes]

The plot above corresponds to the same time period, but with a sampling period of 15 minutes. The overlap between the time series is still too heavy, thereby making it very hard to derive any material insights. How about relaxing the sampling further?

[Figure: the same week of webpage response times, sampled every 30 minutes]

The plot above corresponds to the same time period, but with a sampling period of 30 minutes. From it we note that, on average, Alaska Airlines has the best performance and Virgin the worst. Having said that, it is difficult to assess from the plot how often each airline experiences a performance hiccup. Concretely, digging deeper to figure out how often one’s website experiences a webpage response time of, say, >3 seconds might lead to a useful discovery regarding user churn. To this end, a common method is to analyze the probability density distribution of the metric of interest, as exemplified by the plot below (note that it corresponds to the data set sampled every 5 minutes).

[Figure: probability density distribution of webpage response time per airline, 5-minute samples]

A lot of valuable insight can be extracted from the plot above based on the following:

  • Relative location of the peaks of each distribution
  • The spread (an indicator of variance) of the distribution
  • The fatness of the tails: this sheds light on the extent to which the user base is impacted by, in the current context, outlier webpage response times
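As a rough numerical counterpart to the three features above, the sketch below derives them from raw samples. It is only an illustration: the `samples` values are made up (they are not Catchpoint data), the function names are hypothetical, and a fixed-width histogram stands in for whatever density estimator produced the plot.

```python
import statistics

def histogram_density(samples, bin_width):
    """Return {bin_start: density}; the densities integrate to 1."""
    counts = {}
    for s in samples:
        start = int(s // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    n = len(samples)
    return {start: c / (n * bin_width) for start, c in sorted(counts.items())}

def fraction_above(samples, threshold):
    """Share of samples strictly above the threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)

# Hypothetical webpage response times, in seconds.
samples = [1.1, 1.3, 1.2, 1.4, 1.2, 1.6, 1.3, 3.4, 1.2, 4.1]

density = histogram_density(samples, bin_width=0.5)
peak_bin = max(density, key=density.get)        # location of the peak
spread = statistics.pstdev(samples)             # spread of the distribution
slow = fraction_above(samples, threshold=3.0)   # weight of the slow tail
```

Here `slow` answers the ">3 seconds" question directly, while `peak_bin` and `spread` summarize where the bulk of the response times sits and how widely they vary.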

Still, the probability density distribution is not conducive to comparing key statistics such as, but not limited to, the median, the first and third quartiles, and the density of outliers. The boxplot, proposed over four decades ago (see [1] and [7]), is tailor-made for this. An example illustration of a boxplot is shown below.

[Figure: example illustration of a boxplot]

A boxplot is made up of five components that are carefully chosen to give a robust summary of the distribution of a dataset:

  1. The median
  2. The upper and lower fourths (quartiles), commonly referred to as “hinges”
  3. The data values adjacent to the upper and lower fences, which lie 1.5 times the inter-quartile range (IQR) from the hinges
  4. Two whiskers that connect the hinges to the fences
  5. Anomalies, which are data points that lie beyond the fences
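The five components can be computed directly from a list of samples. The following is a minimal sketch using only the Python standard library; the function name `boxplot_stats` and the sample values are hypothetical, and the quartiles come from `statistics.quantiles`, which may differ slightly from Tukey's original hinge definition for small samples.

```python
import statistics

def boxplot_stats(values, k=1.5):
    """Compute the boxplot components for a list of at least two numbers."""
    xs = sorted(values)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Data values adjacent to the fences: the most extreme points still inside.
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    return {
        "median": median,
        "hinges": (q1, q3),
        "fences": (lower_fence, upper_fence),
        "whisker_ends": (inside[0], inside[-1]),
        "anomalies": [x for x in xs if x < lower_fence or x > upper_fence],
    }

# Hypothetical response times in seconds, with one slow page load.
stats = boxplot_stats([1.0, 1.2, 1.3, 1.5, 1.6, 1.8, 2.0, 2.1, 6.5])
```

With this input, the single 6.5-second sample falls beyond the upper fence and is reported as an anomaly, while the whiskers end at the smallest and largest remaining values.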

The boxplot for the data set sampled every 5 minutes is shown below:

[Figure: boxplots of webpage response time per airline, 5-minute samples]

From the plot above, it is straightforward to compare the various descriptive statistics of webpage response time across different airlines. For instance, although both Southwest and United have a lower median than Delta, the latter has a smaller spread (the IQR, i.e., the height of the box) than the former two. In a similar vein, we note that not only does Virgin have the highest median webpage response time, it also has the highest IQR. This clearly does not speak well of the experience of Virgin’s (potential) customers.

One of the common use cases of boxplots is to detect anomalies. Although robust anomaly detection is subject to a multitude of factors, boxplots serve as a first-cut means to filter out potential anomalies. In the case of a standard normal distribution, 0.35% of the data points along each tail are deemed anomalous (see below).

[Figure: boxplot fences on a standard normal distribution, with 0.35% of the data points beyond each fence]
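The 0.35% figure can be checked with a few lines of standard-library Python. This is a small sanity check rather than part of any anomaly detection pipeline; it simply places the fences at 1.5 times the IQR beyond the quartiles of a standard normal distribution and measures the probability mass beyond the upper fence.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Probability mass beyond each fence; the two are equal by symmetry.
upper_tail = 1.0 - std_normal.cdf(upper_fence)
lower_tail = std_normal.cdf(lower_fence)
```

Both tails come out at roughly 0.0035, so about 0.7% of normally distributed data would be flagged in total even when nothing is actually wrong.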

A limitation of the boxplot is that it is primarily suited to:

  • Almost symmetric data
  • Approximately mesokurtic distributions, i.e., distributions with zero excess kurtosis

The above two assumptions do not hold in general for real-world data. This is exemplified by the plot of the probability density distribution above.
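One quick way to see whether a data set departs from these two assumptions is to compute its sample skewness and excess kurtosis. The sketch below uses the plain moment-based estimators (no bias correction); the function name and the sample values are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (assumes non-constant data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis

# Symmetric toy data versus hypothetical right-skewed response times.
sym_skew, sym_kurt = skewness_and_excess_kurtosis([1, 2, 3, 4, 5])
rt_skew, rt_kurt = skewness_and_excess_kurtosis(
    [1.1, 1.2, 1.2, 1.3, 1.4, 1.6, 3.4, 4.1]
)
```

A skewness far from zero signals asymmetry, and an excess kurtosis far from zero signals tails that are heavier or lighter than those of a normal distribution; in either case the standard fences may flag too many or too few points.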

One way to address the former, i.e., asymmetry, is to use the medcouple, a robust measure of the skewness of a univariate distribution. Using the medcouple (MC), the whiskers of the boxplot are redefined as follows:

[Figure: whisker definitions of the medcouple-adjusted boxplot]
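For illustration, here is a minimal sketch of a medcouple-adjusted boxplot in standard-library Python. Two caveats apply. First, the medcouple is computed with a naive quadratic-time loop, which is fine for small samples only. Second, the fence formulas, including the exponents 3 and 4, are the ones commonly stated for the adjusted boxplot of [6]; treat them as an assumption to check against that paper. The function names and test values are hypothetical.

```python
import math
import statistics

def medcouple(values):
    """Naive O(n^2) medcouple of a list of numbers."""
    xs = sorted(values)
    med = statistics.median(xs)
    lower = [x for x in xs if x <= med]
    upper = [x for x in xs if x >= med]
    kernel = []
    for xi in lower:
        for xj in upper:
            if xi == xj:  # both equal the median; handled separately below
                continue
            kernel.append(((xj - med) - (med - xi)) / (xj - xi))
    # k points tied with the median contribute k zeros plus
    # k(k-1)/2 values each of -1 and +1 to the kernel.
    k = sum(1 for x in xs if x == med)
    kernel += [0.0] * k + [-1.0, 1.0] * (k * (k - 1) // 2)
    return statistics.median(kernel)

def adjusted_fences(values):
    """Fences of the skewness-adjusted boxplot (formula assumed from [6])."""
    xs = sorted(values)
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    mc = medcouple(xs)
    if mc >= 0:
        lower = q1 - 1.5 * math.exp(-4 * mc) * iqr
        upper = q3 + 1.5 * math.exp(3 * mc) * iqr
    else:
        lower = q1 - 1.5 * math.exp(-3 * mc) * iqr
        upper = q3 + 1.5 * math.exp(4 * mc) * iqr
    return lower, upper
```

For symmetric data the medcouple is zero and the fences reduce to the standard ones; for right-skewed data the upper fence moves outward and the lower fence moves inward, so fewer points in the long tail are flagged as anomalies.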

A number of techniques have been proposed, see [6, 8, 9], to adapt the boxplot to different characteristics of the underlying distribution. Likewise, several variations of boxplots have been proposed, see [2]; for instance, varying the width of the box based on the sample size. In a similar vein, the addition of other graphical elements to display distributional features like kurtosis [3], skewness and multimodality [4], and mean and standard error [5] has been proposed. A user of Catchpoint can plug in a boxplot plotting library of their choice in a straightforward fashion (refer to our earlier blog for this).

By: Arun Kejariwal, Ryan Pellette, and Mehdi Daoudi



[1] “Exploratory Data Analysis”, by J. W. Tukey, Addison–Wesley, 1977.

[2] “Variations of Box Plots”, by R. McGill, J. W. Tukey and W. A. Larsen, 1978.

[3] “Shape-finder box plots”, by M. Aslam and A. Khurshid, 1991.

[4] “Can the box plot be improved?”, by C. Choonpradub and D. McNeil, 2005.

[5] “The shifting boxplot”, by F. Marmolejo-Ramos and T. Tian, 2010.

[6] “An adjusted boxplot for skewed distributions”, by M. Hubert and E. Vandervieren, 2008.

[7] “40 years of Boxplots”, by H. Wickham and L. Stryjewski, 2011. http://vita.had.co.nz/papers/boxplots.pdf

[8] “A generalized boxplot for skewed and heavy-tailed distributions”, by C. Bruffaerts, V. Verardi and C. Vermandele, 2014.

[9] “A Generalized Boxplot for Skewed and Heavy-tailed Distributions implemented in Stata”, by V. Verardi. http://www.stata.com/meeting/uk14/abstracts/materials/uk14_verardi.pdf

The post A Bird’s Eye View Via Boxplot appeared first on Catchpoint's Blog.


