
Lessons in Becoming an Effective Data Scientist

I was recently a guest lecturer at the University of California Berkeley Extension in San Francisco. On a lovely Saturday afternoon, the classroom was crowded with students of all ages learning the tools of the modern economy. The craftspeople of the “Analytics Revolution” were busy learning new skills and tools that will prepare them for this Brave New World of analytics. I was blown away by their dedication!

As we teach the next generation, it’s important that we focus more on capabilities and less so on skills. What I mean is “learning TensorFlow” isn’t nearly as important as “learning how to learn TensorFlow.”

We need to make sure that we teach concepts and methodologies along with the tools. We should teach the “What” and “Why” as well as the “How” so we don’t put our students in a situation where they “can’t see the forest for the trees.”

This brings me to a recent article, “What IBM Looks for in a Data Scientist,” which lists the skills IBM seeks in its data scientists. The list is very useful, especially for someone pursuing such a career:

  1. Training as a scientist with an MS or PhD.
  2. Expertise in machine learning and statistics with an emphasis on decision optimization.
  3. Expertise in R, Python or Scala.
  4. Ability to transform and manage large data sets.
  5. Proven ability to apply the skills above to real-world business problems.
  6. Ability to evaluate model performance and tune it accordingly.

Unfortunately, this is a tactical list, not a strategic list. In fact, some of the points are too granular and too focused on “how” versus “why.” For example, on point #3, it’s more important to know how to program than it is to know a specific language. It’s more important to learn the concepts and approaches of effective programming than it is to learn the tools themselves. The minute you think you’re an expert at R or Python or Scala, along comes Julia. It’s important to develop transferable skills rather than having to re-educate yourself each time a new tool arrives.

In a world driven by the rapid introduction and adoption of open source tools and frameworks (like TensorFlow for machine learning), expertise in a tool is fleeting. However, mastery of the concepts and approaches for which those tools are used is critical, because being a data scientist is about more than a bag of skills. The best data scientists focus on outcomes and results.

Data Science DEPP Engagement Process

Our data science team at Dell EMC uses a methodology called DEPP that guides the collaboration with the business stakeholders through the following stages:

  • Descriptive Analytics to clearly understand what happened and how the business is measuring success.
  • Exploratory Analytics to understand the financial, business and operational drivers behind what happened.
  • Predictive Analytics to transition the business stakeholder mindset to focus on predicting what is likely to happen.
  • Prescriptive Analytics to identify actions or recommendations based upon the measures of business success and the Predictive Analytics.

The DEPP Methodology is an agile and iterative process that continues to evolve in scope and complexity as our clients mature in their advanced analytics capabilities (see Figure 1).

Figure 1: Dell EMC DEPP Data Science Collaborative Methodology

Importance of Humility

The first skill that I look for when engaging with or hiring a data scientist is humility. I look for the ability to listen and engage with others who may not seem as smart as they are. And as you can see from our DEPP methodology, humility is the key to driving collaboration between the business stakeholders (who will never understand data science to the level that a data scientist does) and the data scientist (who will never understand the business to the level that the business stakeholders do).

Humility is critical to our DEPP methodology because you can’t learn what’s important for the business if you aren’t willing to acknowledge that you might not know everything.

Humility is one of the secrets to effective collaboration. Nowhere does the importance of the business/data science collaboration play a more important role than in hypothesis development.

A hypothesis is a formal statement that presents the expected relationship between an independent and dependent variable. (Creswell, 1994)

If you get the hypothesis, and the metrics against which you are going to measure success, wrong, then everything the data scientist does to support that hypothesis doesn’t matter. Not only are you likely to achieve suboptimal results, you could actually achieve the wrong results altogether.
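To make this concrete, here is a minimal sketch, in Python, of formally testing a hypothesis: a permutation test of whether a hypothetical change to a business process moves a business metric. The scenario, data, and numbers are entirely invented for illustration; they are not from the article.

```python
import random
import statistics

# Hypothetical example: test whether a new checkout flow (B) changes
# average order value relative to the current flow (A).
random.seed(42)
group_a = [52.0, 48.5, 55.0, 47.0, 50.5, 49.0, 53.5, 51.0]
group_b = [56.0, 54.5, 58.0, 52.0, 57.5, 55.0, 59.0, 53.5]

observed = statistics.mean(group_b) - statistics.mean(group_a)

# Permutation test: repeatedly shuffle the pooled values and recompute
# the group difference to estimate how often chance alone would produce
# a gap at least as large as the observed one.
pooled = group_a + group_b
n_a = len(group_a)
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[n_a:]) - statistics.mean(pooled[:n_a])
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / trials
print(f"observed difference: {observed:.3f}, p-value: {p_value:.4f}")
```

The point of the sketch is the framing, not the statistics: the hypothesis and the success metric (here, average order value) have to be agreed with the business stakeholders before any of the computation matters.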

For example, in the healthcare industry, we are seeing the disastrous effects of the wrong metrics (see the blog “Unintended Consequences of the Wrong Measures” for more details). Instead of using “Patient Satisfaction” as the metric against which to measure doctor and hospital effectiveness (which is leading to unintended consequences), the healthcare industry may benefit from a more holistic metric against which to measure success; for example, a “Quality and Effectiveness of Care” score combined with “Readmissions” and “Hospital-Acquired Infections” scores.

Being off in your hypothesis by just one degree can be disastrous. For example, if you were flying from San Francisco to Washington, D.C. and were off by a mere one degree upon takeoff, you’d end up on the other side of Baltimore, 42.6 miles away (“Impact of A Mere One-Degree Difference”).

Figure 2: Ramifications of being off 1 degree


Get the hypothesis wrong, even by one degree, and the results could be wrong or even disastrous (if you have tickets to watch the Washington Redskins play football and not the Baltimore Ravens).
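The arithmetic behind the one-degree example can be checked in a few lines of Python. The ~2,440-mile San Francisco-to-Washington distance is an approximation, and treating the flight path as a flat triangle is a simplification of the real great-circle geometry:

```python
import math

# A flight from San Francisco to Washington, D.C. is roughly 2,440 miles.
# Approximating the path as flat, a 1-degree heading error at takeoff
# produces a lateral offset of distance * tan(1 degree) at the destination.
distance_miles = 2440
error_degrees = 1
offset = distance_miles * math.tan(math.radians(error_degrees))
print(f"lateral offset after {distance_miles} miles: {offset:.1f} miles")
```

That works out to about 42.6 miles, which matches the figure quoted above.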

Type I / Type II Errors

Being humble also means conceding when you may be wrong, particularly with analytic models that may not always deliver the right predictions or outcomes. In that case, a solid understanding of the business or organizational costs of Type I (False Positive) and Type II (False Negative) errors is important. Understanding the business and organizational ramifications of such errors requires close collaboration with the business stakeholders (see Figure 3).

Figure 3: Understanding Type I Errors and Type II Errors

See the blog “Understanding Type I and Type II Errors” for more details.
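As a rough sketch of why those error costs matter to the business, the snippet below compares two hypothetical operating thresholds for a fraud-detection model. All cost figures and error counts are invented for illustration; in practice they come from the collaboration with business stakeholders described above.

```python
# Hypothetical sketch: weighing Type I (false positive) and Type II
# (false negative) errors for a fraud-detection model. Costs are
# illustrative assumptions, not real business figures.

def total_error_cost(false_positives, false_negatives,
                     cost_fp=5.0, cost_fn=250.0):
    """A false positive wastes an analyst review (~$5 here); a false
    negative lets a fraudulent charge through (~$250 here)."""
    return false_positives * cost_fp + false_negatives * cost_fn

# Two candidate thresholds for the same model:
conservative = total_error_cost(false_positives=400, false_negatives=10)
aggressive = total_error_cost(false_positives=50, false_negatives=60)

print(f"conservative threshold (flags more) cost: ${conservative:,.0f}")
print(f"aggressive threshold (flags less) cost:   ${aggressive:,.0f}")
```

With these assumed costs, the threshold that makes fewer raw errors overall is actually far more expensive, because its errors are concentrated in the costly Type II category; that is exactly the kind of trade-off only the business stakeholders can price.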


In my classes, I focus on the “What” and “Why” versus spending too much time on the “How”. I want my students to have a framework that enables them to understand how the different technologies, techniques and tools can be more effectively used.

I’m not teaching my students data science; I’m teaching them how to learn data science. It is an important distinction that can be humbling, but it results in a more detail-oriented student who wishes not only to become a data scientist, but to become an effective data scientist. As teachers, it is important that we know the difference.

The post Lessons in Becoming an Effective Data Scientist appeared first on InFocus Blog | Dell EMC Services.


More Stories By William Schmarzo

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”, is responsible for setting strategy and defining the Big Data service offerings for Dell EMC’s Big Data Practice.

As a CTO within Dell EMC’s 2,000+ person consulting organization, he works with organizations to identify where and how to start their big data journeys. He’s written white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power an organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow where he teaches the “Big Data MBA” course. Bill also just completed a research paper on “Determining The Economic Value of Data”. Onalytica recently ranked Bill as #4 Big Data Influencer worldwide.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored the Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements. Bill serves on the City of San Jose’s Technology Innovation Board, and on the faculties of The Data Warehouse Institute and Strata.

Previously, Bill was vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a Master of Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.
