Management Challenge: P-hacking or “The Danger of Alternative Facts”

Mark Twain famously said “There are three kinds of lies: lies, damned lies, and statistics.” Maybe if Mark Twain were alive today, he’d add “alternative facts” to that list. The BusinessWeek article “Lies, Damn Lies, and Financial Statistics” reminds us of the management challenge posed by statisticians or “data scientists” who manipulate data to create “pseudo-facts” that can lead to sub-optimal or even dangerously wrong decisions.

My University of San Francisco class recently did a hackathon with a local data science company. It was insightful for all involved (and I learned of a new machine learning tool – BigML – which I will discuss in a future blog). One team was trying to prove that housing prices in Sacramento were on the same pricing trajectory (accelerating upward) as the Brooklyn and Oakland housing markets, cities next to major technology hubs. One team member had just bought a house in Sacramento and was eager to prove that her investment was a sound one. Unfortunately, her tainted objective caused her to ignore some critical metrics that indicated that Sacramento was not on the same pricing trajectory. She got the model results that she was seeking, but she did not necessarily get the truth.

Statisticians know that by selectively munging, binning, constraining, cleansing and sub-segmenting one’s data set, they can get the data to tell almost any story or validate almost any “fact.” Plus with the advent of seductive data visualization tools, one can easily distract the reader from the real facts and paint a visually compelling but totally erroneous story from the data (see Figure 1).

Figure 1:  Distorting The “Truth” with Data Visualization

The Weapon of Alternative Facts: p-Hacking

This flawed statistical behavior of staging and presenting the data in a way that supports one’s preconceived answer has a name: p-hacking. As described in the BusinessWeek article, the name p-hacking refers to the p-value, which is a measure of statistical significance. To quote Andrew Lo, director of MIT’s Laboratory of Financial Engineering: “The more you search over the past, the more likely it is you are going to find exotic patterns that you happen to like or focus on. Those patterns are least likely to repeat.”

What is the ‘P-Value’[1]?  The p-value is the probability, computed within a statistical hypothesis test, of observing a result at least as extreme as the one actually observed if the null hypothesis (no real effect) were true. A small p-value (conventionally less than .05) is taken as stronger evidence in favor of the alternative hypothesis. P-hacking can sometimes happen inadvertently through a statistical practice known as overfitting.
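To make the p-value a bit more concrete, here is a minimal sketch of one way to estimate it: a permutation test on the difference between two group means. The function name `permutation_p_value` and the sample numbers are my own illustrative assumptions, not anything from the BusinessWeek article or the hackathon.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffling them shows what chance alone can produce.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical data: two made-up groups of measurements.
clearly_different = permutation_p_value([10, 11, 12, 13, 14, 15, 16, 17],
                                        [1, 2, 3, 4, 5, 6, 7, 8])
identical = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
print("clearly different groups:", clearly_different)
print("identical groups:", identical)
```

The p-value only answers “how often would chance alone look this extreme?” It says nothing about how many other comparisons were quietly tried first, which is exactly where p-hacking creeps in.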

Overfitting Challenges

Understanding the concept of overfitting is important to make sure that one is not taking inappropriate liberties with the data. Ensuring that one isn’t inadvertently p-hacking the data requires 1) an understanding of overfitting and 2) a touch of common sense. From the BusinessWeek article:

“An abundance of computing power makes it possible to test thousands, even millions, of trading strategies. [For example], the standard method is to see how the [trading] strategy would have done if it had been used during the ups and downs of the market over, say, the past 20 years. This is called backtesting. As a quality check, the technique is then tested on a separate set of “out-of-sample” data—i.e., market history that wasn’t used to create the technique. In the wrong hands, though, backtesting can go horribly wrong. It once found that the best predictor of the S&P 500, out of all the series in a batch of United Nations data, was butter production in Bangladesh.”
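The butter-production effect is easy to reproduce. The sketch below is a toy setup of my own (not the study the article describes): it generates 1,000 purely random “indicator” series, picks the one that best correlates with a random “market” series over the first half of the data, and then checks that same indicator on the held-out second half. All names and parameters are illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_of_many(n_candidates=1000, n_obs=40, seed=42):
    """Pick the best in-sample predictor among pure-noise candidates."""
    rng = random.Random(seed)
    half = n_obs // 2
    # The "market" series is random noise, so nothing can truly predict it.
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best_in, best_out = 0.0, 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        r_in = pearson(candidate[:half], target[:half])
        if abs(r_in) > abs(best_in):
            best_in = r_in
            # Score the same candidate on data it was not selected on.
            best_out = pearson(candidate[half:], target[half:])
    return best_in, best_out


in_sample_r, out_of_sample_r = best_of_many()
print("best in-sample correlation:", round(in_sample_r, 3))
print("same series out-of-sample: ", round(out_of_sample_r, 3))
```

Search enough candidates and one of them will look like a strong predictor in-sample purely by luck; the out-of-sample check is what tells you whether the pattern repeats.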

So what is this evil thing called overfitting? Overfitting occurs when the analytic model is excessively complex, such as having too many parameters or variables in the model relative to the number of observation points.

I love the overfitting example shown in Figure 2. In Figure 2, one is trying to fit the different shapes on the left side of the figure into one of the containers that minimizes the unused space in the container. The shapes fit into the right-most container with the lowest amount of unused space[2].

Figure 2:  Fitting Shapes Into Containers

However, when a new shape is added in Figure 3, a shape significantly different from the shapes used to minimize the container space, the originally selected container does not work because the container was over-fitted for just those shapes that were originally available.

Figure 3:  Overfitting

Once new data gets added, especially data that differs from the original data in some significant way (e.g., different time periods, different geographies, different products, different customers), there is a risk that the model created with the original data set simply won’t work with a new data set that is not exactly like the original.
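The container analogy can also be sketched in code. In this toy example (entirely my own assumption: a made-up straight-line relationship plus random noise), a simple two-parameter line competes against a polynomial with one parameter per observation, which passes through every training point exactly. Both are then scored on new points that sit between the original ones.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (2 parameters)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial through every point: one parameter per observation."""
    def predict(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            weight = 1.0
            for m, xm in enumerate(xs):
                if m != j:
                    weight *= (x - xm) / (xj - xm)
            total += yj * weight
        return total
    return predict


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def true_relationship(x):
    return 2.0 * x + 1.0  # hypothetical "real" signal


def run_trial(seed):
    rng = random.Random(seed)
    train_x = [float(i) for i in range(10)]
    train_y = [true_relationship(x) + rng.gauss(0, 1) for x in train_x]
    test_x = [i + 0.5 for i in range(9)]  # new points between the old ones
    test_y = [true_relationship(x) for x in test_x]
    results = {}
    for name, fit in (("line", fit_line),
                      ("polynomial", fit_interpolating_polynomial)):
        model = fit(train_x, train_y)
        results[name] = (mse(model, train_x, train_y),
                         mse(model, test_x, test_y))
    return results


trials = [run_trial(seed) for seed in range(30)]
averages = {}
for name in ("line", "polynomial"):
    avg_train = sum(t[name][0] for t in trials) / len(trials)
    avg_test = sum(t[name][1] for t in trials) / len(trials)
    averages[name] = (avg_train, avg_test)
    print(name, "train error:", round(avg_train, 3),
          "new-data error:", round(avg_test, 3))
```

The polynomial is the perfectly sized container: zero error on the original shapes, and a much worse fit than the plain line as soon as new shapes arrive.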

P-hacking Summary

Okay, this is the nerdiest of jokes, but Randall Munroe’s web comic xkcd captures the over-fitting management challenge creatively in the comic of a woman claiming jelly beans cause acne (see Figure 4).

Figure 4: The Most Nerdy Overfitting Example

When a statistical test shows no evidence of an effect, the woman revises her claim: acne must depend on the flavor of the jelly bean. So the statistician tests 20 flavors. Nineteen show nothing. But by chance, the test for green jelly beans shows a statistically significant link between jelly bean consumption and acne breakouts. The final panel of the cartoon is the front page of a newspaper: “Green Jelly Beans Linked to Acne! 95% Confidence. Only 5% Chance of Coincidence!”
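The arithmetic behind the joke is worth spelling out. If each of 20 independent tests has a 5% chance of a false positive, the chance that at least one of them “finds” something is 1 − 0.95^20, or roughly 64%. The sketch below computes that number and checks it with a quick simulation; the function names are hypothetical, and the simulation assumes well-behaved tests whose p-values are uniformly distributed when there is no real effect.

```python
import random


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests is 'significant'."""
    return 1 - (1 - alpha) ** n_tests


def simulate(n_tests=20, alpha=0.05, n_experiments=20000, seed=1):
    """Fraction of simulated studies with at least one false positive."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # With no real effect, each p-value is modeled as uniform on [0, 1).
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_experiments


analytic = chance_of_any_false_positive(20)  # about 0.64
simulated = simulate()
# A simple (conservative) guard: the Bonferroni correction divides the
# significance threshold by the number of tests performed.
bonferroni_alpha = 0.05 / 20

print("analytic: ", round(analytic, 4))
print("simulated:", round(simulated, 4))
print("per-test threshold after Bonferroni correction:", bonferroni_alpha)
```

So “only 5% chance of coincidence” is badly misleading once you know that 20 flavors were tested to find the one headline.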

Yes, it’s nerdy, but it also makes this final critical point: use your common sense to contemplate whether the correlation really exists or not. There are plenty of examples of distorting the data and applying statistics to show the most absurd correlations (see Figure 5).

I mean, look at the correlation for that relationship (0.993)!  The relationship between United States spending on the space program and suicides must be true!
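One common way such absurdly high correlations arise is that both series simply trend in the same direction over time. The sketch below uses two made-up, independently generated series (not the actual spending or suicide data) that share nothing except an upward trend; it compares the correlation of the raw levels with the correlation of the year-over-year changes.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def differences(series):
    """Period-over-period changes, which strip out a steady trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(7)
years = range(30)
# Two hypothetical series with independent noise and a shared upward drift.
series_a = [1.0 * t + rng.gauss(0, 2) for t in years]
series_b = [3.0 * t + 50 + rng.gauss(0, 6) for t in years]

r_levels = pearson(series_a, series_b)
r_changes = pearson(differences(series_a), differences(series_b))

print("correlation of levels: ", round(r_levels, 3))
print("correlation of changes:", round(r_changes, 3))
```

If a striking correlation disappears once you look at changes rather than levels, common sense was right to be suspicious.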

[1] http://www.investopedia.com/terms/p/p-value.asp

[2] Special thanks to John Cardente, from Dell EMC’s Office of the CTO, for the overfitting slides.

The post Management Challenge: P-hacking or “The Danger of Alternative Facts” appeared first on InFocus Blog | Dell EMC Services.

More Stories By William Schmarzo

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business”, is responsible for setting the strategy and defining the Big Data service line offerings and capabilities for the EMC Global Services organization. As part of Bill’s CTO charter, he is responsible for working with organizations to help them identify where and how to start their big data journeys. He has written several white papers, is an avid blogger, and is a frequent speaker on the use of Big Data and advanced analytics to power an organization’s key business initiatives. He also teaches the “Big Data MBA” at the University of San Francisco School of Management.

Bill has nearly three decades of experience in data warehousing, BI and analytics. Bill authored EMC’s Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements, and co-authored with Ralph Kimball a series of articles on analytic applications. Bill has served on The Data Warehouse Institute’s faculty as the head of the analytic applications curriculum.

Previously, Bill was the Vice President of Advertiser Analytics at Yahoo and the Vice President of Analytic Applications at Business Objects.
