Lessons in Becoming an Effective Data Scientist

I was recently a guest lecturer at the University of California Berkeley Extension in San Francisco. On a lovely Saturday afternoon, the classroom was crowded with students of all ages learning the tools of the modern economy. The craftspeople of the “Analytics Revolution” were busy learning new skills and tools that will prepare them for this Brave New World of analytics. I was blown away by their dedication!

As we teach the next generation, it’s important that we focus more on capabilities and less so on skills. What I mean is “learning TensorFlow” isn’t nearly as important as “learning how to learn TensorFlow.”

We need to make sure that we teach concepts and methodologies along with the tools. We should teach the “What” and “Why” as well as the “How” so we don’t put our students in a situation where they “can’t see the forest for the trees.”

This brings me to a recent article, “What IBM Looks for in a Data Scientist.” Its list of skills is very useful, especially for someone pursuing such a career:

  1. Training as a scientist with an MS or PhD.
  2. Expertise in machine learning and statistics with an emphasis on decision optimization.
  3. Expertise in R, Python or Scala.
  4. Ability to transform and manage large data sets.
  5. Proven ability to apply the skills above to real-world business problems.
  6. Ability to evaluate model performance and tune it accordingly.

Unfortunately, this is a tactical list, not a strategic list. In fact, some of the points are too granular and too focused on “how” versus “why.” For example, on point #3, it’s more important to know how to program than it is to know a specific language. It’s more important to learn the concepts and approach behind effective programming than it is to learn the tools themselves. The minute you think you’re an expert in R or Python or Scala, along comes Julia. It’s important to develop transferable skills rather than having to re-educate yourself each time a new tool arrives.

In a world driven by the rapid introduction and adoption of open source tools and frameworks (like TensorFlow for machine learning), expertise in a tool is fleeting.  However, mastery of the concepts and approaches for which those tools are used is critical because being a data scientist is more than just a bag of skills. The best data scientists are about outcomes and results.

Data Science DEPP Engagement Process

Our data science team at Dell EMC uses a methodology called DEPP that guides the collaboration with the business stakeholders through the following stages:

  • Descriptive Analytics to clearly understand what happened and how the business is measuring success.
  • Exploratory Analytics to understand the financial, business and operational drivers behind what happened.
  • Predictive Analytics to transition the business stakeholder mindset to focus on predicting what is likely to happen.
  • Prescriptive Analytics to identify actions or recommendations based upon the measures of business success and the Predictive Analytics.

The DEPP Methodology is an agile and iterative process that continues to evolve in scope and complexity as our clients mature in their advanced analytics capabilities (see Figure 1).

Figure 1: Dell EMC DEPP Data Science Collaborative Methodology

Importance of Humility

The first skill that I look for when engaging with or hiring a data scientist is humility. I look for the ability to listen and engage with others who may not seem as smart as they are. And as you can see from our DEPP methodology, humility is the key to driving collaboration between the business stakeholders (who will never understand data science to the level that a data scientist does) and the data scientist (who will never understand the business to the level that the business stakeholders do).

Humility is critical to our DEPP methodology because you can’t learn what’s important for the business if you aren’t willing to acknowledge that you might not know everything.

Humility is one of the secrets to effective collaboration. Nowhere is the business/data science collaboration more important than in hypothesis development.

A hypothesis is a formal statement that presents the expected relationship between an independent and dependent variable. (Creswell, 1994)

If you get the hypothesis wrong, or the metrics against which you are going to measure success, everything the data scientist does to support that hypothesis doesn’t matter. In fact, not only are you likely to achieve suboptimal results, but you could actually achieve the wrong results altogether.

For example, in the healthcare industry, we are seeing the disastrous effects of the wrong metrics (see the blog “Unintended Consequences of the Wrong Measures” for more details). Instead of using “Patient Satisfaction” as the metric against which to measure doctor and hospital effectiveness (which is leading to unintended consequences), the healthcare industry may benefit from a more holistic metric against which to measure success. One example is a “Quality and Effectiveness of Care” score combined with a “Readmissions” score and a “Hospital Acquired Infections” score.

Being off in your hypothesis by just one degree can be disastrous. For example, if you are flying from San Francisco to Washington, D.C. and are off by a mere one degree upon takeoff, you’d end up on the other side of Baltimore, 42.6 miles away (“Impact of A Mere One-Degree Difference”).

Figure 2: Ramifications of being off 1 degree


Get the hypothesis wrong, even by one degree, and the results could be wrong or even disastrous (if you have tickets to watch the Washington Redskins play football and not the Baltimore Ravens).
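The one-degree arithmetic can be checked with a little trigonometry. The sketch below is a rough approximation, not a navigation model: it assumes a straight-line flight on a flat plane (real flights follow great-circle routes), so the miss distance is the distance flown times the tangent of the heading error. The function names are hypothetical. Working backward from the 42.6 miles quoted above implies a flight of about 2,440 miles.

```python
import math


def miss_distance(flight_miles: float, heading_error_deg: float) -> float:
    """Sideways miss after flying `flight_miles` on a heading that is off
    by `heading_error_deg` (flat-plane approximation, an assumption)."""
    return flight_miles * math.tan(math.radians(heading_error_deg))


def implied_flight_distance(miss_miles: float, heading_error_deg: float) -> float:
    """Flight distance that would produce `miss_miles` of sideways error."""
    return miss_miles / math.tan(math.radians(heading_error_deg))


if __name__ == "__main__":
    # Back out the flight distance implied by the 42.6-mile figure.
    distance = implied_flight_distance(42.6, 1.0)
    print(f"Implied flight distance: {distance:.0f} miles")

    # The same one-degree error costs more the farther you travel.
    for miles in (100, 1000, distance):
        print(f"{miles:>7.0f} miles flown -> {miss_distance(miles, 1.0):.1f} miles off")
```

The point of the loop is that a small error in direction compounds with distance, which is why a slightly wrong hypothesis gets more expensive the longer a project runs on it.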

Type I / Type II Errors

Being humble also means conceding when you may be wrong, particularly with analytic models that may not always deliver the right predictions or outcomes. In that case, a solid understanding of the business or organizational costs of Type I (False Positive) and Type II (False Negative) errors is important. Understanding the business and organizational ramifications of such errors requires close collaboration with the business stakeholders (see Figure 3).

Figure 3: Understanding Type I Errors and Type II Errors

See the blog “Understanding Type I and Type II Errors” for more details.
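To make the cost trade-off concrete, here is a minimal sketch of how the business costs of the two error types can change which model cutoff is best. Everything in it is illustrative: the scores, labels, candidate thresholds and cost figures are made-up toy values, and the class and function names are hypothetical. In practice the costs would come from the business stakeholders.

```python
from dataclasses import dataclass


@dataclass
class ErrorCosts:
    false_positive: float  # Type I: cost of acting on a case that was not real
    false_negative: float  # Type II: cost of missing a case that was real


def error_counts(scores, labels, threshold):
    """Count (false positives, false negatives) when predicting positive
    for every score at or above `threshold`."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


def total_cost(scores, labels, threshold, costs):
    """Business cost of the errors made at a given threshold."""
    fp, fn = error_counts(scores, labels, threshold)
    return fp * costs.false_positive + fn * costs.false_negative


def cheapest_threshold(scores, labels, costs, candidates):
    """Pick the candidate threshold with the lowest total error cost."""
    return min(candidates, key=lambda t: total_cost(scores, labels, t, costs))


# Toy data (assumed): model scores and the true outcomes (1 = positive).
scores = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]
labels = [0, 0, 1, 1, 0, 1]
candidates = [0.3, 0.5, 0.7]

# Same model, two different businesses.
misses_are_expensive = ErrorCosts(false_positive=1, false_negative=10)
false_alarms_are_expensive = ErrorCosts(false_positive=10, false_negative=1)

print(cheapest_threshold(scores, labels, misses_are_expensive, candidates))
print(cheapest_threshold(scores, labels, false_alarms_are_expensive, candidates))
```

On this toy data the two cost profiles select different thresholds for the same model. The data scientist can compute the counts, but only the business can say what each kind of error costs.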


In my classes, I focus on the “What” and “Why” versus spending too much time on the “How”. I want my students to have a framework that enables them to understand how the different technologies, techniques and tools can be more effectively used.

I’m not teaching my students data science; I’m teaching them how to learn data science. It is an important distinction that can be humbling, but it results in a more detail-oriented student who wishes not only to become a data scientist, but to become an effective data scientist. As teachers, it is important that we know the difference.

The post Lessons in Becoming an Effective Data Scientist appeared first on InFocus Blog | Dell EMC Services.

More Stories By William Schmarzo

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”, is responsible for setting strategy and defining the Big Data service offerings for Hitachi Vantara as CTO, IoT and Analytics.

Previously, as a CTO within Dell EMC’s 2,000+ person consulting organization, he worked with organizations to identify where and how to start their big data journeys. He’s written white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power an organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow where he teaches the “Big Data MBA” course. Bill also just completed a research paper on “Determining The Economic Value of Data”. Onalytica recently ranked Bill as #4 Big Data Influencer worldwide.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored the Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements. Bill serves on the City of San Jose’s Technology Innovation Board, and on the faculties of The Data Warehouse Institute and Strata.

Previously, Bill was vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a Master of Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.
