Going Pro in Data Science
What It Takes to Succeed as a Professional Data Scientist
Jerry Overton
Going Pro in Data Science
by Jerry Overton
Copyright © 2016 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: 800-998-9938.
Editor: Shannon Cutt
Production Editor: Kristen Brown
Proofreader: O’Reilly Production Services
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

March 2016: First Edition
Revision History for the First Edition
2016-03-03: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Going Pro in Data
Science, the cover image, and related trade dress are trademarks of O’Reilly Media,
Inc.
While the publisher and the author have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the author disclaim all responsibility for errors or omissions, including without limi‐
tation responsibility for damages resulting from the use of or reliance on this work.
Use of the information and instructions contained in this work is at your own risk. If
any code samples or other technology this work contains or describes is subject to
open source licenses or the intellectual property rights of others, it is your responsi‐
bility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-95608-3
[LSI]
Table of Contents
1. Introduction
   Finding Signals in the Noise
   Data Science that Works

2. How to Get a Competitive Advantage Using Data Science
   The Standard Story Line for Getting Value from Data Science
   An Alternative Story Line for Getting Value from Data Science
   The Importance of the Scientific Method

3. What to Look for in a Data Scientist
   A Realistic Skill Set
   Realistic Expectations

4. How to Think Like a Data Scientist
   Practical Induction
   The Logic of Data Science
   Treating Data as Evidence

5. How to Write Code
   The Professional Data Science Programmer
   Think Like a Pro
   Design Like a Pro
   Build Like a Pro
   Learn Like a Pro

6. How to Be Agile
   An Example Using the StackOverflow Data Explorer
   Putting the Results into Action
   Lessons Learned from a Minimum Viable Experiment
   Don’t Worry, Be Crappy

7. How to Survive in Your Organization
   You Need a Network
   You Need a Patron
   You Need Partners
   It’s a Jungle Out There

8. The Road Ahead
   Data Science Today
   Data Science Tomorrow

Index
CHAPTER 1
Introduction
Finding Signals in the Noise
Popular data science publications tend to creep me out. I’ll read case
studies where I’m led by deduction from the data collected to a very
cool insight. Each step is fully justified, the interpretation is clear—
and yet the whole thing feels weird. My problem with these stories is
that everything you need to know is known, or at least present in
some form. The challenge is finding the analytical approach that will
get you safely to a prediction. This works when all transactions hap‐
pen digitally, like ecommerce, or when the world is simple enough
to fully quantify, like some sports. But the world I know is a lot dif‐
ferent. In my world, I spend a lot of time dealing with real people
and the problems they are trying to solve. Missing information is
common. The things I really want to know are outside my observable universe and, many times, the best I can hope for is weak signals.
CSC (Computer Sciences Corporation) is a global IT leader, and every day we’re faced with the challenge of using IT to solve our customers’ business problems. I’m asked questions like: what are our clients’ biggest problems, what solutions should we build, and what
skills do we need? These questions are complicated and messy, but
often there are answers. Getting to answers requires a strategy and,
so far, I’ve done quite well with basic, simple heuristics. It’s natural
to think that complex environments require complex strategies, but
often they don’t. Simple heuristics tend to be most resilient when
trying to generate plausible scenarios about something as uncertain
as the real world. And simple scales. As the volume and variety of
data increases, the number of possible correlations grows a lot faster
than the number of meaningful or useful ones. As data gets bigger,
noise grows faster than signal (Figure 1-1).
Figure 1-1. As data gets bigger, noise grows faster than signal
Finding signals buried in the noise is tough, and not every data sci‐
ence technique is useful for finding the types of insights I need to
discover. But there is a subset of practices that I’ve found fantasti‐
cally useful. I call them “data science that works.” It’s the set of data
science practices that I’ve found to be consistently useful in extract‐
ing simple heuristics for making good decisions in a messy and
complicated world. Getting to a data science that works is a difficult
process of trial and error.
But essentially it comes down to two factors:
• First, it’s important to value the right set of data science skills.
• Second, it’s critical to find practical methods of induction where
I can infer general principles from observations and then reason
about the credibility of those principles.
Data Science that Works
The common ask from a data scientist is the combination of subject matter expertise, mathematics, and computer science. However, I’ve found that the skill set that tends to be most effective in practice is agile experimentation, hypothesis testing, and professional data science programming. This more pragmatic view of data science skills shifts the focus from searching for a unicorn to relying on real flesh-and-blood humans. After you have data science skills that work, what remains for consistently finding actionable insights is a practical method of induction.
Induction is the go-to method of reasoning when you don’t have all
the information. It takes you from observations to hypotheses to the
credibility of each hypothesis. You start with a question and collect
data you think can give answers. Take a guess at a hypothesis and
use it to build a model that explains the data. Evaluate the credibility
of the hypothesis based on how well the model explains the data
observed so far. Ultimately the goal is to arrive at insights we can
rely on to make high-quality decisions in the real world. The biggest
challenge in judging a hypothesis is figuring out what available evi‐
dence is useful for the task. In practice, finding useful evidence and
interpreting its significance is the key skill of the practicing data sci‐
entist—even more so than mastering the details of a machine learn‐
ing algorithm.
The goal of this book is to communicate what I’ve learned, so far,
about data science that works:
1. Start with a question.
2. Guess at a pattern.
3. Gather observations and use them to generate a hypothesis.
4. Use real-world evidence to judge the hypothesis.
5. Collaborate early and often with customers and subject matter experts along the way.
At any point in time, a hypothesis and our confidence in it are simply
the best that we can know so far. Real-world data science results are
abstractions—simple heuristic representations of the reality they
come from. Going pro in data science is a matter of making a small
upgrade to basic human judgment and common sense. This book is
built from the kinds of thinking we’ve always relied on to make
smart decisions in a complicated world.
CHAPTER 2
How to Get a Competitive
Advantage Using Data Science
The Standard Story Line for Getting Value
from Data Science
Data science already plays a significant role in specialized areas.
Being able to predict machine failure is a big deal in transportation
and manufacturing. Predicting user engagement is huge in advertis‐
ing. And properly classifying potential voters can mean the differ‐
ence between winning and losing an election.
But the thing that excites me most is the promise that, in general,
data science can give a competitive advantage to almost any business
that is able to secure the right data and the right talent. I believe that
data science can live up to this promise, but only if we can fix some
common misconceptions about its value.
For instance, here’s the standard story line when it comes to data sci‐
ence: data-driven companies outperform their peers—just look at
Google, Netflix, and Amazon. You need high-quality data with the
right velocity, variety, and volume, the story goes, as well as skilled
data scientists who can find hidden patterns and tell compelling sto‐
ries about what those patterns really mean. The resulting insights
will drive businesses to optimal performance and greater competi‐
tive advantage. Right?
Well…not quite.
The standard story line sounds really good. But a few problems
occur when you try to put it into practice.
The first problem, I think, is that the story makes the wrong
assumption about what to look for in a data scientist. If you do a
web search on the skills required to be a data scientist (seriously, try
it), you’ll find a heavy focus on algorithms. It seems that we tend to
assume that data science is mostly about creating and running
advanced analytics algorithms.
I think the second problem is that the story ignores the subtle, yet
very persistent tendency of human beings to reject things we don’t
like. Often we assume that getting someone to accept an insight
from a pattern found in the data is a matter of telling a good story.
It’s the “last mile” assumption. Many times what happens instead is
that the requester questions the assumptions, the data, the methods,
or the interpretation. You end up chasing follow-up research tasks
until you either tell your requesters what they already believed or
just give up and find a new project.
An Alternative Story Line for Getting Value
from Data Science
The first step in building a competitive advantage through data sci‐
ence is having a good definition of what a data scientist really is. I
believe that data scientists are, foremost, scientists. They use the sci‐
entific method. They guess at hypotheses. They gather evidence.
They draw conclusions. Like all other scientists, their job is to create
and test hypotheses. Instead of specializing in a particular domain of
the world, such as living organisms or volcanoes, data scientists spe‐
cialize in the study of data. This means that, ultimately, data scien‐
tists must have a falsifiable hypothesis to do their job. Which puts
them on a much different trajectory than what is described in the
standard story line.
If you want to build a competitive advantage through data science,
you need a falsifiable hypothesis about what will create that advan‐
tage. Guess at the hypothesis, then turn the data scientist loose on
trying to confirm or refute it. There are countless specific hypothe‐
ses you can explore, but they will all have the same general form:
It’s more effective to do X than to do Y
For example:
• Our company will sell more widgets if we increase delivery
capabilities in Asia Pacific.
• The sales force will increase their overall sales if we introduce
mandatory training.
• We will increase customer satisfaction if we hire more user-experience designers.
You have to describe what you mean by effective. That is, you need
some kind of key performance indicator, like sales or customer satis‐
faction, that defines your desired outcome. You have to specify some
action that you believe connects to the outcome you care about. You
need a potential leading indicator that you’ve tracked over time.
Assembling this data is a very difficult step, and one of the main rea‐
sons you hire a data scientist. The specifics will vary, but the data
you need will have the same general form shown in Figure 2-1.
Figure 2-1. The data you need to build a competitive advantage using
data science
Let’s take, for example, our hypothesis that hiring more user-experience designers will increase customer satisfaction. We already
control whom we hire. We want greater control over customer satis‐
faction—the key performance indicator. We assume that the number
of user experience designers is a leading indicator of customer satis‐
faction. User experience design is a skill of our employees, employ‐
ees work on client projects, and their performance influences
customer satisfaction.
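To make this concrete, here is a tiny sketch of what such a history might look like in code. The column names and numbers are invented for illustration; substitute whatever action, leading indicator, and KPI you actually track.

import pandas as pd

# Hypothetical history: the action we control, the assumed leading
# indicator, and the key performance indicator, tracked over time.
history = pd.DataFrame({
    "quarter":               ["2015Q1", "2015Q2", "2015Q3", "2015Q4"],
    "ux_designers_hired":    [0, 1, 2, 2],         # the action we control
    "ux_designer_count":     [3, 4, 6, 8],         # the assumed leading indicator
    "customer_satisfaction": [3.4, 3.5, 3.9, 4.1], # the KPI we want to move
})
print(history)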
Once you’ve assembled the data you need (Figure 2-2), let your data
scientists go nuts. Run algorithms, collect evidence, and decide on
the credibility of the hypothesis. The end result will be something
along the lines of “yes, hiring more user experience designers should
increase customer satisfaction by 10% on average” or “the number of
user experience designers has no detectable influence on customer
satisfaction.”
Figure 2-2. An example of the data you need to explore the hypothesis
that hiring more user experience designers will improve customer sat‐
isfaction
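The analysis itself can start small. The following sketch is one minimal way to weigh the evidence, assuming the tracked history lives in a CSV; the file name, column names, and the 0.05 threshold are illustrative choices, not a prescription.

import pandas as pd
from scipy import stats

# Hypothetical file: one row per quarter with the leading indicator and the KPI.
history = pd.read_csv("quarterly_metrics.csv")

# How strongly does the leading indicator track the KPI?
r, p_value = stats.pearsonr(history["ux_designer_count"],
                            history["customer_satisfaction"])
print(f"correlation = {r:.2f}, p-value = {p_value:.3f}")

if p_value < 0.05:
    print("The evidence supports the hypothesis; estimate the effect size next.")
else:
    print("No detectable influence of designer count on satisfaction in this data.")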
The Importance of the Scientific Method
Notice, now, that we’ve pushed well past the “last mile.” At this point,
progress is not a matter of telling a compelling story and convincing
someone of a particular worldview. Progress is a matter of choosing
whether or not the evidence is strong enough to justify taking
action. The whole process is simply a business adaptation of the sci‐
entific method (Figure 2-3).
This brand of data science may not be as exciting as the idea of tak‐
ing unexplored data and discovering unexpected connections that
change everything. But it works. The progress you make is steady
and depends entirely on the hypotheses you choose to investigate.
Figure 2-3. The process of accumulating competitive advantages using
data science; it’s a simple adaptation of the scientific method
Which brings us to the main point: there are many factors that con‐
tribute to the success of a data science team. But achieving a com‐
petitive advantage from the work of your data scientists depends on
the quality and format of the questions you ask.
How to Partner with the C-Suite
If you are an executive, people are constantly trying to impress you.
No one wants to be the tattletale with lots of problems; they want to
be the hero with lots of solutions. For us mere mortals, finding peo‐
ple who will list the ways we’re screwing up is no problem. For an
executive, that source of information is a rare and valuable thing.
Most executives follow a straightforward process for making deci‐
sions: define success, gather options, make a call. For most, spend‐
ing a few hours on the Web researching options or meeting with
subject-matter experts is no problem. But for an executive, spend‐
ing that kind of time is an extravagance they can’t afford.
All of this is good news for the data scientist. It means that the bar
for being valuable to the C-Suite isn’t as high as you might think.
Groundbreaking discoveries are great, but being a credible source
of looming problems and viable solutions is probably enough to
reserve you a seat at the table.
CHAPTER 3
What to Look for in a
Data Scientist
A Realistic Skill Set
What’s commonly expected from a data scientist is a combination of
subject matter expertise, mathematics, and computer science. This is
a tall order and it makes sense that there would be a shortage of peo‐
ple who fit the description. The more knowledge you have, the bet‐
ter. However, I’ve found that the skill set you need to be effective, in
practice, tends to be more specific and much more attainable
(Figure 3-1). This more pragmatic approach changes both what you look
for from data science and what you look for in a data scientist.
A background in computer science helps with understanding soft‐
ware engineering, but writing working data products requires spe‐
cific techniques for writing solid data science code. Subject matter
expertise is needed to pose interesting questions and interpret
results, but this is often done in collaboration between the data sci‐
entist and subject matter experts (SMEs). In practice, it is much
more important for data scientists to be skilled at engaging SMEs in
agile experimentation. A background in mathematics and statistics
is necessary to understand the details of most machine learning
algorithms, but to be effective at applying those algorithms requires
a more specific understanding of how to evaluate hypotheses.
Figure 3-1. A more pragmatic view of the required data science skills
Realistic Expectations
In practice, data scientists usually start with a question, and then
collect data they think could provide insight. A data scientist has to
be able to take a guess at a hypothesis and use it to explain the data.
For example, I collaborated with HR in an effort to find the factors
that contributed best to employee satisfaction at our company (I
describe this in more detail in Chapter 4). After a few short sessions
with the SMEs, it was clear that you could probably spot an unhappy
employee with just a handful of simple warning signs—which made
decision trees (or association rules) a natural choice. We selected a
decision-tree algorithm and used it to produce a tree and error esti‐
mates based on employee survey responses.
Once we have a hypothesis, we need to figure out if it’s something
we can trust. The challenge in judging a hypothesis is figuring out
what available evidence would be useful for that task.
The Most Important Quality of a Data Scientist
I believe that the most important quality to look
for in a data scientist is the ability to find useful
evidence and interpret its significance.
In data science today, we spend way too much time celebrating the
details of machine learning algorithms. A machine learning algo‐
rithm is to a data scientist what a compound microscope is to a biol‐
ogist. The microscope is a source of evidence. The biologist should
understand that evidence and how it was produced, but we should
expect our biologists to make contributions well beyond custom
grinding lenses or calculating refraction indices.
A data scientist needs to be able to understand an algorithm. But
confusion about what that means causes would-be great data scien‐
tists to shy away from the field, and practicing data scientists to
focus on the wrong thing. Interestingly, in this matter we can bor‐
row a lesson from the Turing Test. The Turing Test gives us a way to
recognize when a machine is intelligent—talk to the machine. If you
can’t tell if it’s a machine or a person, then the machine is intelligent.
We can do the same thing in data science. If you can converse intel‐
ligently about the results of an algorithm, then you probably under‐
stand it. In general, here’s what it looks like:
Q: Why are the results of the algorithm X and not Y?
A: The algorithm operates on principle A. Because the circumstan‐
ces are B, the algorithm produces X. We would have to change
things to C to get result Y.
Here’s a more specific example:
Q: Why does your adjacency matrix show a relationship of 1
(instead of 3) between the term “cat” and the term “hat”?
A: The algorithm defines distance as the number of characters
needed to turn one term into another. Since the only difference
between “cat” and “hat” is the first letter, the distance between them
is 1. If we changed “cat” to, say, “dog”, we would get a distance of 3.
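To make that answer concrete, here is one way such a distance can be computed. This is a standard edit-distance routine written for illustration; it is not code taken from the algorithm under discussion.

def edit_distance(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute one character
        prev = curr
    return prev[-1]

print(edit_distance("cat", "hat"))  # 1 -- only the first letter differs
print(edit_distance("cat", "dog"))  # 3 -- all three letters differ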
The point is to focus on engaging a machine learning algorithm as a
scientific apparatus. Get familiar with its interface and its output.
Form mental models that will allow you to anticipate the relation‐
ship between the two. Thoroughly test that mental model. If you can
understand the algorithm, you can understand the hypotheses it
produces and you can begin the search for evidence that will con‐
firm or refute the hypothesis.
We tend to judge data scientists by how much they’ve stored in their
heads. We look for detailed knowledge of machine learning algo‐
rithms, a history of experiences in a particular domain, and an all-around understanding of computers. I believe it’s better, however, to
judge the skill of a data scientist based on their track record of shep‐
herding ideas through funnels of evidence and arriving at insights
that are useful in the real world.
CHAPTER 4
How to Think Like a Data Scientist
Practical Induction
Data science is about finding signals buried in the noise. It’s tough to
do, but there is a certain way of thinking about it that I’ve found use‐
ful. Essentially, it comes down to finding practical methods of
induction, where I can infer general principles from observations,
and then reason about the credibility of those principles.
Induction is the go-to method of reasoning when you don’t have all
of the information. It takes you from observations to hypotheses to
the credibility of each hypothesis. In practice, you start with a
hypothesis and collect data you think can give you answers. Then,
you generate a model and use it to explain the data. Next, you evalu‐
ate the credibility of the model based on how well it explains the
data observed so far. This method works ridiculously well.
To illustrate this concept with an example, let’s consider a recent
project, wherein I worked to uncover factors that contribute most to
employee satisfaction at our company. Our team guessed that pat‐
terns of employee satisfaction could be expressed as a decision tree.
We selected a decision-tree algorithm and used it to produce a
model (an actual tree), and error estimates based on observations of
employee survey responses (Figure 4-1).
Figure 4-1. A decision-tree model that predicts employee happiness
Each employee responded to questions on a scale from 0 to 5, with 0
being negative and 5 being positive. The leaf nodes of the tree pro‐
vide a prediction of how many employees were likely to be happy
under different circumstances. We arrived at a model that predicted
—as long as employees felt they were paid even moderately well, had
management that cared, and had options to advance—they were very
likely to be happy.
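For readers who want to see what that step looks like in code, here is a minimal sketch using scikit-learn. It assumes the survey responses sit in a CSV of 0-to-5 answers plus a happy/unhappy label; the file and column names are hypothetical stand-ins for the real survey fields.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

survey = pd.read_csv("employee_survey.csv")                      # hypothetical file
features = ["fair_pay", "management_cares", "room_to_advance"]   # 0-5 answers
X, y = survey[features], survey["is_happy"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# Error estimate: one minus the mean accuracy over five cross-validation folds.
accuracy = cross_val_score(tree, X, y, cv=5).mean()
print(f"estimated error = {1 - accuracy:.3f}")

# The model itself: a readable set of if/then rules over the survey answers.
tree.fit(X, y)
print(export_text(tree, feature_names=features))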
The Logic of Data Science
The logic that takes us from employee responses to a conclusion we
can trust involves a combination of observation, model, error and
significance. These concepts are often presented in isolation—how‐
ever, we can illustrate them as a single, coherent framework using
concepts borrowed from David J. Saville and Graham R. Wood’s
statistical triangle. Figure 4-2 shows the observation space: a sche‐
matic representation that makes it easier to see how the logic of data
science works.
Figure 4-2. The observation space: using the statistical triangle to illus‐
trate the logic of data science
Each axis represents a set of observations; for example, a set of
employee satisfaction responses. In a two-dimensional space, a point
in the space represents a collection of two independent sets of obser‐
vations. We call the vector from the origin to a point an observation
vector (the blue arrow). In the case of our employee surveys, an
observation vector represents two independent sets of employee sat‐
isfaction responses, perhaps taken at different times. We can gener‐
alize to an arbitrary number of independent observations, but we’ll
stick with two because a two-dimensional space is easier to draw.
The dotted line shows the places in the space where the independent
observations are consistent—we observe the same patterns in both
sets of observations. For example, observation vectors near the dot‐
ted line are where we find that two independent sets of employees
answered satisfaction questions in similar ways. The dotted line rep‐
resents the assumption that our observations are ruled by some
underlying principle.
The decision tree of employee happiness is an example of a model.
The model summarizes observations made of individual employee
survey responses. When you think like a data scientist, you want a
model that you can apply consistently across all observations (ones
that lie along the dotted line in observation space). In the employee
satisfaction analysis, the decision-tree model can accurately classify
a great majority of the employee responses we observed.
The green line is the model that fits the criteria of Ockham’s Razor
(Figure 4-3): among the models that fit the observations, it has the
smallest error and, therefore, is most likely to accurately predict
future observations. If the model were any more or less complicated,
it would increase error and decrease predictive power.
Figure 4-3. The thinking behind finding the best model
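One practical way to act on Ockham’s Razor, sketched here against the same hypothetical survey data as before, is to sweep over model complexity (tree depth) and keep the simplest depth at which the held-out error stops improving.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

survey = pd.read_csv("employee_survey.csv")   # hypothetical file, as before
X = survey[["fair_pay", "management_cares", "room_to_advance"]]
y = survey["is_happy"]

for depth in range(1, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    print(f"depth={depth}  estimated error={1 - accuracy:.3f}")

# Trees that are too shallow underfit, so their error stays high; trees that are
# too deep overfit, so their error on held-out folds creeps back up. The depth
# worth keeping is the simplest one at which the estimated error bottoms out.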
Ultimately, the goal is to arrive at insights we can rely on to make
high-quality decisions in the real world. We can tell if we have a
model we can trust by following a simple rule of Bayesian reasoning:
look for a level of fit between model and observation that is unlikely
to occur just by chance. For example, the low P values for our
employee satisfaction model tell us that the patterns in the decision
tree are unlikely to occur by chance and, therefore, are significant.
In observation space, this corresponds to small angles (which are
less likely than larger ones) between the observation vector and the
model. See Figure 4-4.
Figure 4-4. A small angle indicates a significant model because it’s
unlikely to happen by chance
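The geometry can be made tangible with a toy computation: project a small observation vector onto the model line and measure the angle between them. The numbers below are invented purely for illustration.

import numpy as np

y = np.array([4.1, 4.4])                          # two independent sets of observations
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)   # the "consistent" dotted line

y_hat = (y @ direction) * direction               # the model: y projected onto the line
error = y - y_hat                                 # what the model fails to explain

angle = np.degrees(np.arccos(np.linalg.norm(y_hat) / np.linalg.norm(y)))
print(f"angle = {angle:.1f} degrees, error length = {np.linalg.norm(error):.2f}")
# A small angle means the observations lie close to the model line, a fit that
# is unlikely to arise by chance from purely random observations.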
When you think like a data scientist, you start by collecting observa‐
tions. You assume that there is some kind of underlying order to
what you are observing and you search for a model that can repre‐
sent that order. Errors are the differences between the model you
build and the actual observations. The best models are the ones that
describe the observations with a minimum of error. It’s unlikely that
random observations will have a model that fits with a relatively
small error. Models like these are significant to someone who thinks
like a data scientist. It means that we’ve likely found the underlying
order we were looking for. We’ve found the signal buried in the
noise.