Management Challenge: P-hacking or “The Danger of Alternative Facts”

Mark Twain famously popularized the line “There are three kinds of lies: lies, damned lies, and statistics.” Maybe if Mark Twain were alive today, he’d add “alternative facts” to that list. The BusinessWeek article “Lies, Damn Lies, and Financial Statistics” reminds us of the management challenges posed by statisticians or “data scientists” who manipulate the data to create “pseudo-facts” that can lead to sub-optimal or even dangerously wrong decisions.

My University of San Francisco class recently did a hackathon with a local data science company. It was insightful for all involved (and I learned of a new machine learning tool – BigML – which I will discuss in a future blog). One team was trying to prove that housing prices in Sacramento were on the same pricing trajectory (accelerating upward) as those in the Brooklyn and Oakland housing markets, cities next to major technology hubs. One team member had just bought a house in Sacramento and was eager to prove that her investment was a sound one. Unfortunately, her tainted objective caused her to ignore some critical metrics that indicated that Sacramento was not on the same pricing trajectory. She got the model results that she was seeking, but she did not necessarily get the truth.

Statisticians know that by selectively munging, binning, constraining, cleansing and sub-segmenting a data set, they can get the data to tell almost any story or validate almost any “fact.” Plus, with the advent of seductive data visualization tools, one can easily distract the reader from the real facts and paint a visually compelling but totally erroneous story from the data (see Figure 1).

Figure 1:  Distorting The “Truth” with Data Visualization

The Weapon of Alternative Facts: p-Hacking

This flawed statistical behavior of staging and presenting the data in a way that supports one’s preconceived answer has a name: p-hacking. As described in the BusinessWeek article, the term p-hacking refers to the p-value, which is a measure of statistical significance. To quote Andrew Lo, director of MIT’s Laboratory of Financial Engineering: “The more you search over the past, the more likely it is you are going to find exotic patterns that you happen to like or focus on. Those patterns are least likely to repeat.”

What is the ‘P-Value’[1]?  The p-value is the probability, assuming the null hypothesis (no real effect) is true, of observing a result at least as extreme as the one actually observed in a statistical hypothesis test. A small p-value (conventionally less than .05) is taken as stronger evidence against the null hypothesis, and therefore in favor of the alternative hypothesis. P-hacking can sometimes happen inadvertently through a statistical problem known as overfitting.
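To make the p-value concrete, here is a minimal sketch of a permutation test in plain Python. It is an illustration only, not the method used in the BusinessWeek article: the function name `permutation_p_value` and the sample numbers are invented for this example. The idea is to shuffle the group labels many times and count how often chance alone produces a difference in means at least as large as the one observed.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    The estimate is the share of random relabelings whose mean difference
    is at least as large as the observed one, i.e. how often chance alone
    produces a result this extreme when the labels carry no information.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / n_permutations

# Hypothetical measurements, invented purely for illustration.
group_one = [10, 11, 12, 13, 14, 15, 16, 17]
group_two = [20, 21, 22, 23, 24, 25, 26, 27]
print(permutation_p_value(group_one, group_two))
```

With two clearly separated groups like these, very few shuffles match the observed gap, so the estimated p-value comes out small. With two identical groups, every shuffle matches it and the estimate is 1.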

Overfitting Challenges

Understanding the concept of overfitting is important to make sure that one is not taking inappropriate liberties with the data. Ensuring that one isn’t inadvertently p-hacking the data requires 1) an understanding of overfitting and 2) a touch of common sense. From the BusinessWeek article:

“An abundance of computing power makes it possible to test thousands, even millions, of trading strategies. [For example], the standard method is to see how the [trading] strategy would have done if it had been used during the ups and downs of the market over, say, the past 20 years. This is called backtesting. As a quality check, the technique is then tested on a separate set of “out-of-sample” data—i.e., market history that wasn’t used to create the technique. In the wrong hands, though, backtesting can go horribly wrong. It once found that the best predictor of the S&P 500, out of all the series in a batch of United Nations data, was butter production in Bangladesh.”
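The butter-production result in the quote above can be reproduced in miniature. The sketch below is a toy simulation, not a real backtest: every series is pure random noise, and the names `best_spurious_predictor`, `n_in` and `n_out` are made up for the example. It searches many candidate series for the one that best matches a target on a short in-sample window, then checks that same candidate on out-of-sample data it was not selected on.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def best_spurious_predictor(n_candidates=1000, n_in=20, n_out=200, seed=42):
    """Pick the noise series that best 'predicts' a noise target in-sample.

    Returns (in_sample_r, out_of_sample_r) for the winning candidate.
    Every series is independent random noise, so any relationship found
    in-sample is a product of the search, not of the data.
    """
    rng = random.Random(seed)
    total = n_in + n_out
    target = [rng.gauss(0, 1) for _ in range(total)]
    best_series, best_r = None, 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(total)]
        r = pearson(candidate[:n_in], target[:n_in])
        if abs(r) > abs(best_r):
            best_series, best_r = candidate, r
    out_r = pearson(best_series[n_in:], target[n_in:])
    return best_r, out_r

in_r, out_r = best_spurious_predictor()
print(f"in-sample r = {in_r:.2f}, out-of-sample r = {out_r:.2f}")
```

The more candidates you search, the stronger the best in-sample correlation looks, even though nothing real is there. The out-of-sample check is what exposes it.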

So what is this evil thing called overfitting? Overfitting occurs when the analytic model is excessively complex, such as having too many parameters or variables in the model relative to the number of observation points.

I love the overfitting example shown in Figure 2. In Figure 2, one is trying to fit the different shapes on the left side of the figure into one of the containers that minimizes the unused space in the container. The shapes fit into the right-most container with the lowest amount of unused space[2].

Figure 2:  Fitting Shapes Into Containers

However, when a new shape is added in Figure 3, a shape significantly different from the shapes used to minimize the container space, the originally selected container does not work because the container was over-fitted for just those shapes that were originally available.

Figure 3:  Overfitting

Once new data gets added, especially data that differs from the original data in some significant way (e.g., different time periods, different geographies, different products, different customers), there is a real risk that the model created with the original data set simply will not work on the new data.
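The container example has a simple numerical analogue. The sketch below uses six invented data points that roughly follow the line y = 2x + 1 (the data, the noise values and the function names are all assumptions made for illustration). It compares a two-parameter straight line with a polynomial that has as many parameters as there are observations, built here with Lagrange interpolation, and then asks both to predict a new point.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def lagrange_predict(xs, ys, x_new):
    """Evaluate the polynomial that passes exactly through every (x, y) pair.

    With n points this polynomial has n coefficients, i.e. as many
    parameters as observations, which is the overfitting setup.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x_new - xj) / (xi - xj)
        total += yi * weight
    return total

# Hypothetical observations: y = 2x + 1 plus small hand-picked noise.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1.3, 2.6, 5.5, 6.7, 9.4, 10.5]
new_x, new_y_true = 6, 13.0  # a point the models have not seen

slope, intercept = fit_line(train_x, train_y)
line_prediction = slope * new_x + intercept
poly_prediction = lagrange_predict(train_x, train_y, new_x)

print(f"line error on new point:       {abs(line_prediction - new_y_true):.2f}")
print(f"polynomial error on new point: {abs(poly_prediction - new_y_true):.2f}")
```

The polynomial reproduces all six original points exactly, which looks like a perfect fit, yet it misses the new point by a wide margin. The simple line does not fit the original points perfectly, but it lands close on the new one.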

P-hacking Summary

Okay, this is the most nerdy of jokes, but Randall Munroe’s web comic xkcd captures the overfitting management challenge creatively in the comic of a woman claiming jelly beans cause acne (see Figure 4).

Figure 4: The Most Nerdy Overfitting Example

When a statistical test shows no evidence of an effect, the woman revises her claim: acne must depend on the flavor of the jelly bean. So the statistician tests 20 flavors. Nineteen show nothing. But by chance there’s a high correlation between jelly bean consumption and acne breakouts for green jelly beans. The final panel of the cartoon is the front page of a newspaper: “Green Jelly Beans Linked to Acne! 95% Confidence. Only 5% Chance of Coincidence!”
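The arithmetic behind the joke is worth spelling out. The sketch below assumes the 20 flavor tests are independent and that no flavor has any real effect. The function names are invented for this example, and the Bonferroni adjustment shown is a standard correction I am adding for illustration; it does not appear in the comic or the article.

```python
import random

def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive rate at or below alpha."""
    return alpha / n_tests

def simulate_studies(n_studies=10_000, n_tests=20, alpha=0.05, seed=1):
    """Share of simulated studies with at least one 'significant' flavor.

    Under the null hypothesis a p-value is uniform on [0, 1], so each
    flavor's p-value is drawn with rng.random().
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies

print(f"chance of a false alarm in 20 tests: {familywise_error_rate(20):.2f}")
print(f"simulated share of studies with one: {simulate_studies():.2f}")
print(f"Bonferroni per-test threshold:       {bonferroni_alpha(20):.4f}")
```

With 20 tests at the .05 level, the chance that at least one comes up “significant” by luck alone is roughly 64%, not 5%. That is why a single green-jelly-bean result out of 20 tries should not make the front page.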

Yeah, it’s nerdy, but it also makes this final critical point: use your common sense to contemplate whether the correlation really exists or not. There are plenty of examples of distorting the data and applying statistics to show the most absurd correlations (see Figure 5).

I mean, look at the correlation for that relationship (0.993)!  The relationship between United States spending on the space program and suicides must be true!
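One common way a sky-high correlation like this can arise is that both series simply trend in the same direction over time. That is an assumption offered for illustration; I am not claiming it explains this particular chart. The sketch below builds two unrelated, made-up series that both drift upward, measures their correlation, and then measures it again on the period-to-period changes, which removes the shared trend.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def trending_series(slope, noise_sd, n, rng):
    """A straight upward trend plus independent random noise."""
    return [slope * t + rng.gauss(0, noise_sd) for t in range(n)]

def differences(series):
    """Period-to-period changes, which strip out a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]

rng = random.Random(7)
# Two hypothetical, unrelated series that both happen to rise over time.
series_a = trending_series(slope=3.0, noise_sd=2.0, n=30, rng=rng)
series_b = trending_series(slope=1.0, noise_sd=1.0, n=30, rng=rng)

raw_r = pearson(series_a, series_b)
diff_r = pearson(differences(series_a), differences(series_b))
print(f"correlation of levels:  {raw_r:.2f}")
print(f"correlation of changes: {diff_r:.2f}")
```

If the correlation collapses once the shared trend is removed, the impressive number was telling you about the calendar, not about any link between the two things being measured.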

[1] http://www.investopedia.com/terms/p/p-value.asp

[2] Special thanks to John Cardente, from Dell EMC’s Office of the CTO, for the overfitting slides.

The post Management Challenge: P-hacking or “The Danger of Alternative Facts” appeared first on InFocus Blog | Dell EMC Services.

More Stories By William Schmarzo

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business”, is responsible for setting the strategy and defining the Big Data service line offerings and capabilities for the EMC Global Services organization. As part of Bill’s CTO charter, he is responsible for working with organizations to help them identify where and how to start their big data journeys. He has written several white papers, is an avid blogger, and is a frequent speaker on the use of Big Data and advanced analytics to power organizations’ key business initiatives. He also teaches the “Big Data MBA” at the University of San Francisco School of Management.

Bill has nearly three decades of experience in data warehousing, BI and analytics. Bill authored EMC’s Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements, and co-authored with Ralph Kimball a series of articles on analytic applications. Bill has served on The Data Warehouse Institute’s faculty as the head of the analytic applications curriculum.

Previously, Bill was the Vice President of Advertiser Analytics at Yahoo and the Vice President of Analytic Applications at Business Objects.
