Notes from Book: Data Science for Business
My reading notes from the book.
Chapter 1. Introduction: Data-Analytic Thinking
introduction
past 15 years
vast amount of data
increasing interest for extracting useful information
widest applications
marketing: targeted marketing, online advertising, recommendations
crm: analyze customer behavior
maximize customer value
finance
credit scoring
trading
fraud detection
workforce management
goals
view business problems from data perspective
understand principles of extracting useful knowledge
fundamental structure to data-analytic thinking
basic principles
data perspective provides
structure and principles
framework to systematically analyze problems
terms: data science and data mining
data science
a set of fundamental principles
in extracting knowledge
data mining
extraction of knowledge
via technologies
broader of the two: data science
why understand data science
to spot unrealistic assumptions, missing pieces for data mining projects
book describes
fundamental data science principles
show each with one data mining technique
two case studies
example: hurricane frances
Walmart: forecast demand based on what happened in a previous hurricane
why is prediction useful?
people would buy more bottled water
local stores properly stocked
how to discover patterns that are not obvious?
identify unusual local demand for products
what happened?
strawberry Pop-Tarts: sales increased to seven times the normal rate
top-selling item: beer
example: predicting customer churn
MegaTelco: telco firm
20% of customers leave when contracts expire
difficult to acquire new customers
churn
customers switching from one company to another
since attracting new customers is expensive
a lot of marketing budget is allocated to prevent churn
customer retention
major use of data mining
Data Science, Engineering, and Data-Driven Decision Making
data science
what is
principles, processes, techniques
to understand events
via analysis of data
ultimate goal
improve decision making
data-driven decision-making (ddd)
basing decisions on analysis of data
ex
selecting advertisements based on
experience
analysis of data of how consumers react to different ads
proof for benefits
erik brynjolfsson from mit
more data-driven a firm is
more productive it is
one standard deviation higher on ddd scale
4-6 % increase in productivity
relationship appears to be causal
2 decision types
type 1 and 2
type 1: decisions where discoveries need to be made within data
type 2: decisions that repeat at massive scale
ex: Walmart and MegaTelco
Walmart: type 1
discover knowledge to prepare for the hurricane
Target (the retailer): type 1 (ref: Duhigg, 2012)
consumers
inertia in their habits
new baby -> change in shopping habits
"when they buy diapers, they buy everything else too"
birth records public =>
retailers send special offers to new parents
how to predict that people expect a baby?
analyzed historical data
customers who were later revealed to have been pregnant
ex
pregnant mothers change their
diets, wardrobes, vitamin regimens
predictive models in general: type 1
focus on a particular indicator that correlates with a variable
who will churn
who will purchase
who is pregnant
not testing a simple hypothesis
data explored
to discover something useful
churn example of MegaTelco: type 2
improve our ability to estimate
large benefits by applying it to millions of customers
applications
fields
direct marketing
online advertising
credit scoring
financial trading
help-desk management
fraud detection
search ranking
product recommendation
90s
banking and consumer credit industries
data-driven fraud control
telecom
retail systems
merchandising decisions
Harrah's casinos' reward programs
recommendations of Amazon and Netflix
now
advertising
Data Processing and "Big Data"
data processing
relation to data science
not a subset of data science
support data science
more general than data science
does not involve
extracting knowledge
data-driven decision-making
big data technologies
such as
hadoop, hbase, mongodb
means
datasets too large for traditional data processing systems
study by Prasanna Tambe (Tambe 2012)
big data technologies correlated with productivity growth
one standard deviation of higher utilization -> 1-3 % higher productivity
From Big Data 1.0 to Big Data 2.0
web 1.0
goal
establish a web presence
build ecommerce capability
improve efficiency in operations
general
build capability to process large data
to improve efficiency
after web 1.0
rise of voice of individual consumer
big data 2.0
what can i do now better?
Data and Data Science Capability as a Strategic Asset
key strategic assets
data
capability to extract useful knowledge
for most companies
data analytics
value from existing data
without regard to appropriate analytical talent
viewing as assets
one should invest in them
we don't have
right data
right talent
not trivial
case: Signet Bank 90s
in the 80s: transformation in consumer credit
modeling the probability of default
credit cards had uniform pricing
around 90
do predictive modeling
offer different terms
pricing
credit limits
low rate transfers
cash back
loyalty points
problem
no appropriate data to model profitability
solution
acquire necessary data at a cost
learning cost
different terms offered at random
charge-off rate went from 2.9% to 6% (losses)
next
customer retention
customer calls for a better offer
data driven models predict potential profitability of different offers
Capital One
2000: 45,000 scientific tests were carried out
study: Martens and Provost 2011
does data of bank's consumers improve models for deciding product offers?
detailed data on customers' transactions improve performance
more data better performance =>
banks with bigger data assets =>
increased adoption of bank's products
decreased cost of customer acquisition
Amazon
value in rankings and recommendations
Facebook
data about individuals and their likes
structure of social network => (Hill, Provost, Volinsky 2006)
who will buy certain products
Data-Analytic Thinking
digital 100 companies (Business Insider 2012): high valuations
due to primarily data assets
need for business guys
managers: oversee analytics teams
marketers: organize data-driven campaigns
venture capitalists: invest wisely in businesses with data assets
strategists: devise plans that exploit data
ex
assess whether a data mining project makes sense
competitor announces a new data partnership
when does it put you at a strategic disadvantage
mckinsey estimates
talent with data-analytic skills
2018:
shortage of 140-190 K people with deep analytical skills
1.5 M managers+analysts with data skills (Manyika, 2011)
Data Mining and Data Science, Revisited
ex: churn-prediction example
take data on prior churn
extract patterns of behavior that are useful
to predict customers that are more likely to leave
to design better services
fundamental concept: a process with well defined stages
CRISP-DM: Cross Industry Standard Process for Data Mining
judgment:
following a process systematically
to solve business problems
by extracting useful knowledge from data
fundamental concept: finding informative attributes of entities
judgment:
finding informative attributes of entities
by using information technology
from a large mass of data
ex: churn
customer: entity of interest
described by a number of attributes
usage, customer service history, other factors
which one gives information on likelihood of leaving?
notion: finding variables that correlate with churn
fundamental concept: overfitting a dataset
judgment
you can find something
but it might not generalize beyond your data
fundamental concept: context is part of data mining
judgment
thinking about the context
where the results will be used
is part of data mining
Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist
discussions of data science mention
analytical skills and techniques
random forests, support vector machines
application areas
recommendation, ad placement optimization
tools used
hadoop, spark
young discipline
good experts are
good technicians
Summary
Chapter 2. Business Problems and Data Science Solutions
fundamental concepts
set of canonical data mining tasks
data mining process
supervised versus unsupervised data mining
data mining is a process
with well-understood stages
involve
IT (information technology)
automated discovery
evaluation of patterns
creativity
business knowledge
From Business Problems to Data Mining Tasks
decompose a problem into pieces
each piece matches a known task
algorithms and types of tasks
large number of data mining algorithms
small number of types of tasks algorithms address
term: individual
entity about which we have data
ex
customer
business
project type:
finding correlation
between
variable describing individual
other variables
ex
leaving customers
which other variables correlate with it
example of classification and regression tasks
tasks of data mining
classification task
goal
estimate which of a (small) set of classes
an individual belongs to
ex
which customers will respond to a given offer
will respond
will not respond
related task
scoring or class probability estimation
probability that an individual belongs to a class
regression (value estimation)
goal
estimate the numerical value
of some variable for an individual
ex
how much will a customer use the service
predicted: service usage
comparison with classification
classification: whether something will happen
regression: how much something will happen
similarity matching
goal
identify similar individuals
ex
find companies similar to the best customers
based on "firmographic" data
applications
product recommendation
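A minimal sketch of similarity matching, assuming each company is described by a small numeric vector of firmographic attributes. Euclidean distance is just one possible similarity measure and is my choice here, not something these notes prescribe; the company names, numbers, and function names are made up for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(target, candidates, k=1):
    """Return the names of the k candidates closest to the target vector."""
    return sorted(candidates, key=lambda name: euclidean(target, candidates[name]))[:k]

# hypothetical firmographic data: (employees, yearly revenue in millions)
best_customer = (100.0, 5.0)
prospects = {
    "acme": (110.0, 6.0),
    "tiny": (3.0, 0.1),
    "giant": (5000.0, 900.0),
}
closest = most_similar(best_customer, prospects, k=1)
```

In practice the attributes would usually be scaled first, otherwise the attribute with the largest numeric range dominates the distance.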
clustering
goal
group individuals by their similarity
not driven by any specific purpose
ex
do customers form natural segments?
applications
preliminary domain exploration
input to decision making questions
what products should we offer?
how should customer care teams be structured?
co-occurrence grouping
names
association rule discovery
frequent itemset mining
market-basket analysis
goal
find associations between entities
based on transactions
ex
what items are commonly purchased together?
comparison
clustering: similarity based on objects' attributes
co-occurrence: similarity based on their appearing together in transactions
ex
supermarket
ground meat is purchased together with hot sauce
recommendation systems
pairs of books purchased by same people
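A rough sketch of co-occurrence grouping, assuming each transaction is simply a list of item names. It only counts how often pairs appear together; real association rule mining adds measures such as support and lift. The baskets and the function name `pair_counts` are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how often each unordered pair of items appears in the same transaction."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) drops duplicates and gives every pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# hypothetical supermarket baskets
baskets = [
    ["ground meat", "hot sauce", "buns"],
    ["ground meat", "hot sauce"],
    ["buns", "milk"],
]
counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
```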
profiling
names
behavior description
goal
characterize typical behavior of an individual
ex
what is typical cell phone usage in this segment?
application
anomaly detection
fraud detection
monitoring intrusion to computer systems
ex
determine whether a new card transaction fits that profile
suspicion score -> issue an alarm
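One possible way to turn a profile into a score, sketched under the assumption that the profile is just the mean and standard deviation of past transaction amounts. The z-score style measure, the threshold of 3, and all the amounts are my own illustrative choices, not the book's method.

```python
from statistics import mean, stdev

def suspicion_score(history, amount):
    """Distance of a new amount from the profile mean, in standard deviations."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two past transactions
    if sigma == 0:
        return 0.0 if amount == mu else float("inf")
    return abs(amount - mu) / sigma

# hypothetical card history and alarm threshold
history = [20.0, 25.0, 22.0, 30.0, 18.0]
THRESHOLD = 3.0

score = suspicion_score(history, 400.0)
alarm = score > THRESHOLD  # issue an alarm when the transaction does not fit the profile
```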
link prediction
goal
predict connection between data items
a link should exist
strength of link
ex
you and karen share 10 friends
would you like to be karen's friend?
recommending movies
graph between customers and movies they rated
predict links that should exist and be strong
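The "you and karen share 10 friends" idea can be sketched as a common-neighbors score. This is one simple heuristic among many, assuming the social graph is stored as a dict from person to a set of friends; the names and the graph are hypothetical.

```python
def common_friends(graph, a, b):
    """Friends that a and b share."""
    return graph[a] & graph[b]

def link_score(graph, a, b):
    """Strength of a predicted link: the number of shared friends."""
    return len(common_friends(graph, a, b))

# hypothetical friendship graph
friends = {
    "you": {"ann", "bob", "cem"},
    "karen": {"ann", "bob", "dan"},
    "eve": {"dan"},
}
karen_score = link_score(friends, "you", "karen")
eve_score = link_score(friends, "you", "eve")
```

A higher score suggests a link that should exist; here karen would be suggested before eve.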
data reduction
goal
compress data
input: large data
output: small data that contains much of the important information
ex
massive dataset on consumer movie preferences
reduced to small dataset
to reveal consumer tastes
causal modeling
goal
what events influence others
ex
targeting advertisements
observation: targeted consumers purchase more
question:
is this because of advertisement?
or predictive model identified the right customers?
how
randomized controlled experiments
called: A/B tests
counterfactual analysis
what would be the difference between situations
where the treatment event
were to happen
and were not to happen
involves assumptions
ex: placebo effect
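A small sketch of the randomized controlled experiment (A/B test) idea: assign customers to treatment and control at random, then compare purchase rates. The function names, the 50/50 split, and the outcome lists are assumptions for illustration; a real analysis would also check whether the difference is statistically significant.

```python
import random

def assign_groups(customers, treat_fraction=0.5, seed=0):
    """Randomly split customers into (treated, control) groups."""
    rng = random.Random(seed)
    shuffled = customers[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * treat_fraction)
    return shuffled[:cut], shuffled[cut:]

def purchase_rate(outcomes):
    """Share of customers who purchased (outcomes are 1 or 0)."""
    return sum(outcomes) / len(outcomes)

def estimated_effect(treated_outcomes, control_outcomes):
    """Difference in purchase rate between the ad group and the no-ad group."""
    return purchase_rate(treated_outcomes) - purchase_rate(control_outcomes)

treated, control = assign_groups(list(range(10)))
# hypothetical observed outcomes for each group
effect = estimated_effect([1, 1, 0, 0], [1, 0, 0, 0])
```

Because assignment is random, the control group stands in for the counterfactual "what if the treated customers had not seen the ad".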
Supervised Versus Unsupervised Methods
ex: supervised vs. unsupervised classes
customer population
do customers fall into different groups?
no specific target
=> unsupervised
find groups with high likelihood of canceling the service
specific target
=> supervised
condition for supervised
specific target
there is data on target
value for target: label
often: before data mining
actively labelling data is required
methods: supervised or not
supervised methods
classification
regression
causal modeling
either
similarity matching
link prediction
data reduction
unsupervised
clustering
co-occurrence
profiling
type of target in classification and regression
regression: numerical
classification: categorical (often binary)
ex
will customer purchase s1 if given incentive I?
classification with binary target
which service will customer purchase if given incentive I?
classification with multi-valued target
how much will customer use the service?
regression
subtleties
for business applications: numerical prediction better
ex: churn
probability that the customer will continue
still considered as classification
or: class probability estimation
in early stages:
i) decide supervised or unsupervised
ii) if supervised, define target variable
process
model building
historical data
x y z class
14 T R rejected
...
data mining -> model
model using
new data
x y z class
30 T R ?
apply model
result
class: accepted
probability: 0.88
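The two phases above (model building on historical data, model use on new data) can be sketched with a deliberately trivial model: a frequency table over one attribute. This is only an assumption-laden toy, not the technique behind the 0.88 in the notes; the rows and function names are invented.

```python
from collections import Counter, defaultdict

def build_model(rows, feature, target):
    """Model building: per feature value, count the target classes seen in history."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[feature]][row[target]] += 1
    return table

def apply_model(model, value):
    """Model use: return (most frequent class, its relative frequency).

    Assumes the feature value was seen in the historical data.
    """
    counts = model[value]
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())

# hypothetical historical data with known class labels
history = [
    {"y": "T", "class": "rejected"},
    {"y": "T", "class": "accepted"},
    {"y": "T", "class": "accepted"},
    {"y": "F", "class": "rejected"},
]
model = build_model(history, "y", "class")
label, probability = apply_model(model, "T")  # new instance whose class is unknown
```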
Data Mining and its Results
difference
between
mining data
using results
results should influence data mining process
Data Mining Process
craft
CRISP-DM
business understanding -> data understanding -> data preparation -> modeling -> evaluation -> deployment
business understanding
data understanding
strengths and limits of data
costs of data
ex
fraud detection problems
credit card
transactions have reliable labels
supervised method
Medicare
fraud perpetrators are
legitimate users and service providers
subset of legitimate users
data has no reliable target variable
unsupervised methods
both: fraud, but very distinct problems
data preparation
separate book: Pyle 1999
beware of leaks
Kaufman et al. 2012
what is leak
information appears in historical data
but is not available at decision time
ex
predicting whether a web visitor will end the session
variable: total number of webpages visited
predicting if a customer will be a big spender
known in history:
categories of items purchased
amount of tax paid
but not known at decision time
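A minimal guard against leaks, assuming you can list which attributes are actually known at decision time. Everything else except the target is dropped before modeling. The column names and the helper `remove_leaks` are hypothetical.

```python
def remove_leaks(rows, known_at_decision_time, target):
    """Keep only the target and the attributes that are available at decision time."""
    keep = set(known_at_decision_time) | {target}
    return [{k: v for k, v in row.items() if k in keep} for row in rows]

# hypothetical web-session history: total_pages_visited is only known after the session ends
sessions = [
    {"referrer": "search", "total_pages_visited": 12, "ended_session": True},
    {"referrer": "ad", "total_pages_visited": 3, "ended_session": False},
]
clean = remove_leaks(sessions, known_at_decision_time={"referrer"}, target="ended_session")
```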
modeling
evaluation
common flaw with detection solutions
such as
fraud, spam, intrusion monitoring
too many false alarms
testing in lab and in business may be different
in vivo evaluation
randomly apply model to some customers
keep a control group
deployment
Implications for Managing the Data Science Team
mistake:
viewing data mining process as software development cycle
software development
milestones are clear
success is clear
data mining
exploratory
closer to research
crisp cycle iterates on
approaches and strategy
not on software designs
outcomes less certain
results can change understanding of the problem
analytics skills vs. software skills
software
writing efficient code from requirements
analytics
formulating problems well
prototyping solutions quickly
making reasonable assumptions in ill-structured problems
designing experiments
analyzing results
Other Analytics Techniques and Technologies
main difference
data mining: focus on automated search for
knowledge, patterns, regularities
important: what analytic technique is appropriate for a particular problem
statistics
numeric values
summary statistics
wrt. distribution of data
field of study
contrast
dm: hypothesis generation
database querying
sql
query by example
OLAP
done in realtime
unlike ad hoc querying with SQL
dimensions must be pre-programmed
data warehousing
collect data from enterprise
multiple systems
integrates records from sales, billing, hr etc.
regression analysis
dm: less interested in explaining a particular dataset, more in patterns that generalize to other data
machine learning and data mining
methods for extracting (predictive) models
developed in several fields
machine learning
subfield of artificial intelligence
concerned: improving knowledge of an agent in response to its experience
applied statistics
pattern recognition
data mining (KDD: knowledge discovery and data mining)
started from machine learning
both: try to find useful patterns
techniques are shared
kdd a subfield of ml
more concerned with entire process:
data preparation, evaluation
Answering Business Questions with These Techniques
Who are the most profitable customers?
if profitable is in existing data
just a database query
Is there really a difference between the profitable customers and the average customer?
about a conjecture or hypothesis
there is a difference
method: statistical hypothesis testing
But who really are these customers? Can I characterize them?
common features of them
from database: using database querying
deeper analysis
what features differentiate profitable customers from others
Will some particular new customer be profitable? How much revenue should I expect this customer to generate?
examine historical data
produce predictive model of profitability
Chapter 3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Fundamental concepts:
Identifying informative attributes;
Segmenting data by progressive attribute selection.
Exemplary techniques:
Finding correlations;
Attribute/variable selection;
Tree induction.
predictive modeling
as supervised segmentation
how to segment the population with respect to something we want to predict
target in predictions
something we want to avoid
ex
which customers are likely to leave
which accounts have been defrauded
which customers are likely not to pay off
which web pages contain objectionable content
positive target
ex
which consumers are likely to respond to an ad or offer
which web pages are appropriate for a search query
fundamental idea of dm
finding informative variables or attributes of entities described by data
meaning of "informative"
information: quantity that reduces uncertainty about something
supervised dm:
specific target exists
the target quantity is unknown
customer will churn?
account has been defrauded?
finding informative attributes
are there other variables that reduce uncertainty about the value of the target?
find knowable attributes
that correlate with target of interest
basis for tree induction
terminology
ex
name,balance,age,employed,write-off
Ali,115,40,no,no
feature vector: <Ali,115,40,no>
class label (value of target attribute): no
attributes: name,balance,age,employed,write-off
target attribute: write-off
Models, Induction, and Prediction
model
simplified representation of reality to serve a purpose
simplified
on assumptions
what is important
predictive model
formula to estimate unknown value of interest: target
formula can be
mathematical
logical rule
terminology: prediction
data science: to estimate an unknown value
contrast to descriptive modeling
purpose: gain insight into process
ex: churn
what do customers typically look like
criterion: intelligibility
less accurate model better if easier to understand
pm: predictive performance
terminology:
supervised learning
model creation
model describes a relationship between
set of selected variables (attributes or features)
predefined variable called target
model estimates value of target as a function of features
possibly a probabilistic function
instance or example
a fact or a data point
ex: a historical customer given credit
usually a row in database
described by a set of attributes
fields, columns, variables, features
also called: feature vector
fixed length ordered collection of feature values
many names for same things
principles studied in different fields
dataset
also:
table of database
worksheet of spreadsheet
contains
a set of examples or instances
instance
also:
row of database table
case in statistics
features
also:
table columns
independent variables (stats)
predictors: input attributes (stats)
explanatory variable (operations research)
target variable
also
dependent variable (stats)
selec
model induction
creation of models from data
term: from philosophy
contrast: deduction
starts with general rules and specific facts
creates other specific facts
input data for induction algorithm
used for inducing model
called: training data
also: labeled data
because value of target is known
ex: churn problem
build a supervised segmentation model
that divides sample into segments
Supervised Segmentation
human understandable set of segmentation patterns
ex:
middle aged professionals who reside in NYC have a churn rate of 5%
predicted target value: 5%
fundamental concept:
how to judge whether a variable contains important information about target?
how much?
Selecting informative attributes
ex: stick people
attributes:
head shape: square, circular
body shape: rectangular, oval
body color: gray, white
target variable:
write-off: yes, no
goal:
resulting groups to be as pure as possible
homogeneous wrt target variable
every member of group has same value for target
complications
solution
formula based on purity measure
splitting criteria
information gain
most common
based on a purity measure: entropy
invented by Claude Shannon 1948
entropy
measure of disorder
consider
a set of properties of members of the set
each member has one property
in supervised segmentation:
member properties = values of target variable
disorder = how mixed (impure) the segment is wrt properties
_ref: dscp20150626.1
entropy = - p_1 log (p_1) - p_2 log (p_2) - ...
p_i: probability of property i within set
p_i = 1: all members have property i
entropy function of two class set
_fig: 3.3
if pure => 0
if evenly mixed (50/50) => 1 (with log base 2)
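The entropy formula above, written out as a small function. Log base 2 is assumed, which is what makes the maximum for a two-class set equal to 1; terms with p = 0 are skipped since they contribute nothing.

```python
import math

def entropy(probabilities):
    """entropy = - p_1 log2(p_1) - p_2 log2(p_2) - ... over the class proportions."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

pure = entropy([1.0, 0.0])      # every member has the same property
mixed = entropy([0.5, 0.5])     # two classes, evenly mixed
skewed = entropy([0.9, 0.1])    # mostly one class
```

A pure set scores 0, an even two-class mix scores 1, and anything in between falls between the two.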
how informative is an attribute wrt target
how much gain in information it gives us about value of target
an attribute
segments a set of instances
into several subsets
contrast: entropy
how impure one individual subset is
define: information gain (IG)
using entropy
to measure
how much an attribute improves (reduces) entropy
measures
change in entropy
due to new information
IG(parent, children) = entropy(parent) -
(p(c_1) x entropy(c_1) + ...)
entropy(c_i)
weighted by proportion of instances belonging to that child
ex
attribute has k different values
original set: parent set
result of splitting on k values: children sets
_fig: 3.4
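The IG(parent, children) formula as a sketch, assuming each set is given as a plain list of target labels and log base 2 is used for entropy. The toy label lists are made up to show the two extremes.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """entropy(parent) minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy_of(child) for child in children)
    return entropy_of(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # children are pure
useless_split = [["yes", "no"], ["yes", "no"]]   # children as mixed as the parent

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
```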
what if attribute is numeric
discretize by choosing split points
regression problems
information gain is not right measure
because ig is based on properties in segments
measure of impurity: variance
set pure when variance is zero
all values in set are same
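For a numeric target, the same parent/children comparison can be done with variance instead of entropy. Weighting the child variances by segment size mirrors the IG formula and is my assumption here; the notes only state that variance is the impurity measure. The values are invented.

```python
from statistics import pvariance

def variance_reduction(parent, children):
    """variance(parent) minus the size-weighted variance of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * pvariance(child) for child in children)
    return pvariance(parent) - weighted

# hypothetical numeric target values, e.g. service usage
parent = [1.0, 1.0, 5.0, 5.0]
good_split = [[1.0, 1.0], [5.0, 5.0]]   # both children have zero variance (pure)
bad_split = [[1.0, 5.0], [1.0, 5.0]]    # children are as spread out as the parent

reduction_good = variance_reduction(parent, good_split)
reduction_bad = variance_reduction(parent, bad_split)
```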
Example: Attribute Selection with Information Gain
goal
which attribute is most informative wrt estimating value of target
rank a set of attributes by their informativeness
problem
which attribute is most useful for distinguishing edible mushrooms from poisonous ones?
_fig: 3.7
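Ranking attributes by information gain could look roughly like the sketch below. The rows are a tiny made-up stand-in, not the mushroom dataset from the book, and the attribute names `odor` and `cap` are only illustrative.

```python
import math
from collections import Counter, defaultdict

def entropy_of(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_for_attribute(rows, attribute, target):
    """Information gain obtained by splitting the rows on one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    parent = [row[target] for row in rows]
    weighted = sum(len(g) / len(rows) * entropy_of(g) for g in groups.values())
    return entropy_of(parent) - weighted

def rank_attributes(rows, attributes, target):
    """Attributes sorted from most to least informative."""
    gains = {a: gain_for_attribute(rows, a, target) for a in attributes}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)

# hypothetical toy rows
rows = [
    {"odor": "foul", "cap": "flat", "edible": "no"},
    {"odor": "foul", "cap": "bell", "edible": "no"},
    {"odor": "none", "cap": "flat", "edible": "yes"},
    {"odor": "none", "cap": "bell", "edible": "yes"},
]
ranking = rank_attributes(rows, ["cap", "odor"], "edible")
```

In this toy data `odor` separates the classes perfectly while `cap` tells us nothing, so `odor` ranks first.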
Supervised Segmentation with Tree-Structured Models
goal
select multiple attributes
how to put them together?
multivariate (multiple attribute)
classification tree
nodes
interior
contains a test of an attribute
terminal or leaf
= segment
attributes and values along the path = characteristics of the segment
branches
distinct value of attribute
how to build it?
divide-and-conquer approach
start with whole dataset
apply variable selection
choose the split with most information gain
probability estimation tree
to predict the probability of membership in the class
ex: probability of churn or write-off
not the class itself
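The divide-and-conquer construction described above, as a bare-bones sketch for categorical attributes. It assumes rows are dicts, stops when a segment is pure or no attributes remain, and has no pruning or handling of unseen values. The stick-people style rows are made up, and `build_tree` and `classify` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def entropy_of(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split(rows, attribute):
    """Group rows by their value for the attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row)
    return groups

def gain(rows, attribute, target):
    parent = [r[target] for r in rows]
    children = split(rows, attribute).values()
    weighted = sum(len(c) / len(rows) * entropy_of([r[target] for r in c]) for c in children)
    return entropy_of(parent) - weighted

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    # leaf: the segment is pure or nothing is left to split on, so use the majority class
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {
        "attribute": best,
        "branches": {value: build_tree(subset, remaining, target)
                     for value, subset in split(rows, best).items()},
    }

def classify(tree, instance):
    """Follow the tests from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

# hypothetical training rows
people = [
    {"head": "square", "body": "rectangular", "write_off": "yes"},
    {"head": "square", "body": "oval", "write_off": "yes"},
    {"head": "circular", "body": "rectangular", "write_off": "yes"},
    {"head": "circular", "body": "oval", "write_off": "no"},
]
tree = build_tree(people, ["head", "body"], "write_off")
prediction = classify(tree, {"head": "circular", "body": "oval"})
```

Storing class counts in the leaves instead of only the majority class would turn this into a probability estimation tree.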
Visualizing Segmentations
decision lines and hyperplanes
decision lines
also:
decision surfaces
decision boundaries
lines separating the regions
each node:
an (n-1) dimensional hyperplane decision boundary on instance space
_fig 3.15
Trees as Sets of Rules
rule set
IF (Balance < 50K) AND (Age < 50) THEN Class=Write-off
...
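Reading a tree as a rule set means collecting the tests along each root-to-leaf path. A sketch, assuming the tree is a nested dict whose branch keys are the test outcomes. Only the first rule matches the one in these notes; the other branches and class names are invented to complete the example.

```python
def tree_to_rules(tree, conditions=()):
    """Return one IF ... THEN rule per leaf, built from the tests on its path."""
    if not isinstance(tree, dict):
        clause = " AND ".join(f"({c})" for c in conditions) or "(TRUE)"
        return [f"IF {clause} THEN Class={tree}"]
    rules = []
    for test, subtree in tree["branches"].items():
        condition = f"{tree['attribute']} {test}"
        rules.extend(tree_to_rules(subtree, conditions + (condition,)))
    return rules

# hypothetical tree; only the Balance < 50K, Age < 50 path comes from the notes
tree = {
    "attribute": "Balance",
    "branches": {
        "< 50K": {
            "attribute": "Age",
            "branches": {"< 50": "Write-off", ">= 50": "No Write-off"},
        },
        ">= 50K": "No Write-off",
    },
}
rules = tree_to_rules(tree)
```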
Probability Estimation
ex: churn prediction
rank prospects by probability of leaving
high budget to instances with high expected loss
ex: credit default
most instances will "not write-off"
most leaves in tree: not write-off
_fig 3.15
how
frequency based estimate of class membership probability
we have frequencies of each property in each segment
use them as class probability estimate
overfitting in small samples
if a leaf has a single instance => 100%
smoothed version
known as: Laplace correction
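A sketch comparing the raw frequency estimate with the Laplace-corrected one for a binary class, where n is the number of instances in the leaf that belong to the class and m the number that do not: p = (n + 1) / (n + m + 2). The function names are my own.

```python
def frequency_estimate(n, m):
    """Raw frequency-based estimate: n / (n + m)."""
    return n / (n + m)

def laplace_estimate(n, m):
    """Laplace-corrected estimate for a binary class: (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

# a leaf with a single instance: the raw estimate is an overconfident 100%
single_raw = frequency_estimate(1, 0)
single_smoothed = laplace_estimate(1, 0)

# a leaf with many instances: the correction barely changes the estimate
large_raw = frequency_estimate(20, 10)
large_smoothed = laplace_estimate(20, 10)
```

The correction pulls small-sample estimates toward 0.5 and fades as the leaf gets more instances.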
Example: Addressing the Churn Problem with Tree Induction
how good is each variable individually?
this is different from multivariate classification tree
depends on previous nodes