My reading notes from the book

Table of Contents

ch01 Defining Analytics

Challenge of Definition
  dictionaries
    logical analysis method
    logic with analysis
    turning data into insights into action
  definition 1: business value perspective
    three variations of value
      present
      past
      future
    data created
      in operational system
  definition 2: technical implementation perspective
    four data analysis methods
      forecasting
      descriptive (clustering, association rules ...)
      predictive analytics (classification, regression, text mining)
      decision optimization
Analytics Techniques
  4 techniques
    forecasting
    descriptive analytics (primarily clustering)
    predictive analytics (primarily classification, also regression and text mining)
    decision optimization
  algorithm vs. analytics model
    each technique implemented
      by an algorithm
    algorithm: like an instrument
      tune you produce: analytics model
  forecasting
    related terms
      time series analysis
      sequence analysis
  descriptive analytics
    clustering
      grouping data points
  predictive analytics
    based on
      how similar observations were classified in past
    also known
      directed analytics
        training data shows what to predict
    prediction
      what class a new observation belongs to
    ex
      probability that a coupon offer will be redeemed leading to a sale?
      probability a credit card will default
      probability that a product has a defect leading to a warranty claim
    how it works
      training data:
        with known outcomes
      algorithm learns
        from pattern of known outcomes
        what variables influence the likelihood of a certain outcome
          discriminatory power of a variable
    prediction vs. forecasting
      forecasting
        time series
        new value may not have occurred in the past
        no likelihood value
      regression
        prediction
        historical value must have occurred
      ex: weather
        forecasting:
          what is high temperature tomorrow?
        prediction
          likelihood that it will be 72?
    prediction methods
      regression
        classic
      data mining
        decision trees
        neural networks
        naive bayes
      decision trees
        simplest
        ex: fig 1.6 training data set
    text mining
      predict categories of text
  decision optimization

ch02 Information Continuum

ch03 Using Analytics

Healthcare
  Emergency Room Visit
    probability that a patient will be back in ER in next 3 months?
      analytics solution
        predictive model
        predictive variable:
          ER_Return
        historical ER data
          3-5 years
          patient visit
          age
          gender
          profession
          marital
          diagnosis
          procedure 1
          procedure 2
          date
          vitals 1-5
          previous diagnosis
          previous procedure
          current medication
          last visit to ER
          insurance coverage
  Patients with the Same Disease
    identify common patterns among patients with same disease
    analytics solution
      goal: clustering
      variables
        age, gender, economic status, zip code, children?, pets?, medical conditions, vitals?
CRM
  customer segmentation
    build customer segments based on the similarity of their profile
    analytics solution
      one record per customer
      variables
        age, gender, children, income, luxury car, zip code, number of products purchased, avg products, distance from store, online shopper
  propensity to buy
    track how many coupons are utilized? overall benefit of the campaign?
    problem: probability that a customer buys the product upon receiving coupon
    analytics solution
      predictive problem
      predictive variable:
        Uses_Coupon
      age, demographics, purchases, departments and products purchased, coupon date, type, coupon delivery method
Human Resources
  not as widespread
  ex
    predict employees potentially leaving
  employee attrition
    probability a new employee will leave in first 3 months
    analytics solution
      predictive variable
        Left_Company
      one record per employee
      variables
        employee personal profile, education, interview process, referring firm, last job details, reasons for leaving last job, interest level, hiring manager, hiring department, new compensation, old compensation
      tricky:
        interest level in new position
  resume matching
    resume run through text mining
      prediction as to the class of resume
    problem: probability that this resume is from a good java programmer?
    analytics solution
      existing known resumes
        should be labeled
        business analyst, database developer etc.
Consumer Risk
  mature adoption in banking industry
  products:
    auto loans, credit cards, personal loans, mortgages
  borrower default
    probability that customer defaults on loan in next 12 months
    analytics solution
      predictive variable
        Defaulted
      one record per loan account
      variables
        customer personal profile, demographic, type of account, loan disbursed amount, term of loan, missed payments, installment amount, credit history from credit bureau
Insurance
  customer fills an application, model calculates rate
  Probability of a claim
    probability that policy will incur a claim within 3 years
    analytics solution
      predictive variable
        Incurs_Claim
      one record per insurance policy
      variables
        policyholder personal profile, policy type, open date, maturity date, premium amount, payout amount, sales representative, commission percentage
Telecommunication
  Call Usage Patterns
    build clusters of customers
    analytics solution
      one record per customer
      variables
        calls, sms messages, data utilization, bill payment, demographics, length of relationship
Higher Education
  problem: will a student who is offered admission accept and join the school?
  admission and acceptance
    probability that a student will accept admission
    analytics solution
      predictive variable
        Accepts
      one record per student application
      variables
        sat, essay score, interview score, economic background, ethnicity, personal profile, high school, school distance, siblings, financial aid requested, faculty applied
Manufacturing
  to manage supply chain
  driving forces
    3d printing => new designs
    global supply chains => large choice
    volatility in pricing => procurement and storage very complex
    customization => adjustments on short notice
  warranty claims
  Predicting Warranty Claims
    probability that next product will lead to warranty claim
    analytics solution
      predictive variable
        Warranty_Claim_Paid
      data grain: product level
        related to
          product specs
          production schedule
          employees
          customer details
  Analyzing Warranty Claims
    clustering similar claims to break down the defect problem into chunks
    analytics solution
      one record per claim
      variables
        product details
        product specs
        customer details
        production schedules
        line workers' details
        sales channel details
        defect details
Energy and Utilities
  after deregulation: power generation is separate from distribution
    customer care and billing done by energy services
  new challenge
    analytics solution
      weather forecasting
      correlate weather to load
        predictive model
      generation company: probability of firing up power generation plants
      services company: predict additional customers coming online when peak load hits
        clustering models to group their customers
      decision optimization and pricing algorithms to max services companies
Fraud Detection
  classification and clustering
  benefits fraud
    government benefits
    frauds:
      payment transactions 
        has two children but receives for four
      citizen is not eligible
        misrepresenting his income
    probability that an application is fraudulent
    analytics solution
      predictive variable
        Is_fraudulent
      one record per application
      variables
        similar to loan application
  Credit Card Fraud
    transaction that violates expected behavior
    analytics solution
      clustering problem
      clusters of customers based on their behavior
      grain of data
        customer level
        all purchases on card aggregated and related to customer record
      variables
        average spending on gas, shopping, dining out, travel
        full payment, min payment
        percentage of limit utilized
        geography of spending
      clusters
        their values are thresholds
        breaking threshold => fraud alert
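A minimal sketch of the "cluster values as thresholds" idea above. The cluster names, categories, and limits are hypothetical, and treating a category without a limit as "no alert" is an assumption, not something the notes specify.

```python
# Hypothetical per-cluster spending limits, as they might come out of
# a clustering model built on customer-level aggregates.
CLUSTER_THRESHOLDS = {
    "low_spender": {"gas": 80.0, "dining": 60.0, "travel": 300.0},
    "frequent_traveler": {"gas": 150.0, "dining": 250.0, "travel": 4000.0},
}

def breaks_threshold(cluster, category, amount):
    """Return True when a transaction exceeds the customer's cluster limit."""
    limit = CLUSTER_THRESHOLDS[cluster].get(category)
    # Assumption: a category with no recorded limit never raises an alert.
    return limit is not None and amount > limit

# The same purchase is normal for one cluster and an alert for another.
alert_low = breaks_threshold("low_spender", "travel", 1200.0)
alert_traveler = breaks_threshold("frequent_traveler", "travel", 1200.0)
```

The point of the sketch is that the alert depends on which cluster the customer belongs to, not on a single global limit.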
Patterns of Problems
  prediction problem
    break down problem into 1 or 0
    event occurred or not
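A rough sketch of breaking a problem down into 1 or 0, using the ER return question from the Healthcare section. The function name `label_er_return`, the sample visits, and the use of 90 days for "3 months" are illustrative assumptions.

```python
from datetime import date, timedelta

def label_er_return(visits, window_days=90):
    """Return one (patient_id, visit_date, ER_Return) row per visit.

    ER_Return is 1 if the same patient has a later visit within
    window_days of this visit, else 0.
    """
    by_patient = {}
    for patient_id, visit_date in visits:
        by_patient.setdefault(patient_id, []).append(visit_date)

    rows = []
    for patient_id, dates in by_patient.items():
        dates.sort()
        for i, d in enumerate(dates):
            returned = any(
                d < later <= d + timedelta(days=window_days)
                for later in dates[i + 1:]
            )
            rows.append((patient_id, d, int(returned)))
    return rows

# Hypothetical visit history
visits = [
    ("p1", date(2020, 1, 10)),
    ("p1", date(2020, 2, 20)),  # 41 days after the first visit
    ("p2", date(2020, 3, 1)),
]
labeled = label_er_return(visits)
```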

ch04 Performance Variables and Model Development

Performance Variables
  Terminology
    columns or data fields
    variables
    performance variables
      aggregate or summary variables
    input variables
      input to analytics algorithm
    characteristics
      input variables with their relative weights and probabilities in model
  What are performance variables?
    Reasons for creating performance variables
      ex:
        ABC sells computers online
        data fields
          entity | data fields
          customer | name, address, credit score ...
          product | product id, type, class, ...
          sales channel | channel id, category, partner ...
          sales transaction | transaction id, code, timestamp, channel ...
        issues
          most data fields empty
          if filled out, with quality issues
          lots of fields have same values
        assume
          # fields: 70
          mostly with null: 23
          good data: 24
          input variables: 24
        second test
          are these fields filled only for customers who bought?
            if so, not useful
        how much data is enough?
          40% of all transactions should be available
            90% for training
            10% for testing
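The 90% / 10% split above can be sketched as follows. This is only one way to do it; the function name, the fixed seed, and the shuffle-then-slice approach are assumptions.

```python
import random

def split_train_test(records, test_fraction=0.10, seed=42):
    """Shuffle the records and hold out test_fraction of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of 1000 transaction ids
train, test = split_train_test(range(1000))
```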
    what if no pattern is available?
      use performance variables
        aggregate variables built using historical data
        same level of grain as input variables
      ex: performance variables
        purchased_in_last_12_mo
        no_of_visits_online
        no_of_customer_service_contacts
        1st_ever_purchase_date
        last_purchase_date
      define new performance variables
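A small sketch of building performance variables by aggregating history to the customer grain. The variable names come from the list above; treating purchased_in_last_12_mo as a 1/0 flag, using 365 days for "12 months", and the sample transactions are assumptions.

```python
from datetime import date, timedelta

def build_performance_variables(transactions, as_of):
    """Aggregate (customer_id, purchase_date) rows to one record per customer."""
    cutoff = as_of - timedelta(days=365)
    result = {}
    for customer_id, purchase_date in transactions:
        rec = result.setdefault(customer_id, {
            "1st_ever_purchase_date": purchase_date,
            "last_purchase_date": purchase_date,
            "purchased_in_last_12_mo": 0,
        })
        rec["1st_ever_purchase_date"] = min(rec["1st_ever_purchase_date"], purchase_date)
        rec["last_purchase_date"] = max(rec["last_purchase_date"], purchase_date)
        if cutoff < purchase_date <= as_of:
            rec["purchased_in_last_12_mo"] = 1
    return result

# Hypothetical sales transactions
transactions = [
    ("c1", date(2019, 3, 5)),
    ("c1", date(2021, 11, 2)),
    ("c2", date(2018, 7, 19)),
]
perf = build_performance_variables(transactions, as_of=date(2022, 1, 1))
```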
    benefit of using performance variables
      as business evolves
        remove some input variables
        add newer performance variables
    creating performance variables
      grain
        is the level of detail at which analytics algorithm will work
      range
        continuous variables are not good
      spread
        how the population is broken out
        ex
          predict likelihood of an insurance policy to redeem a claim
          performance variable
            customer_profit_lifetodate_amount
            value range: -1200-7300
        frequency might be 1-2 for most values
          might be skewed
  Designing Performance Variables
    discrete vs. continuous
      how to convert continuous into discrete?
        ex
          how many buckets to build?
          their ranges?
        best:
          uniform frequencies in each bucket
    nominal vs. ordinal
    atomic vs. aggregate
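One way to get roughly uniform frequencies in each bucket when converting a continuous variable into a discrete one is equal-frequency (quantile) bucketing. The sketch below is an assumption about how that could look; the sample values are made up to span the -1200 to 7300 range mentioned earlier.

```python
from bisect import bisect_right

def equal_frequency_edges(values, n_buckets):
    """Return n_buckets - 1 cut points so each bucket holds about the same count."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_buckets] for i in range(1, n_buckets)]

def bucket_of(value, edges):
    """0-based bucket index; a value equal to an edge goes to the upper bucket."""
    return bisect_right(edges, value)

# Hypothetical, skewed customer_profit_lifetodate_amount values
amounts = [-1200, -300, -50, 0, 10, 20, 35, 60, 90, 400, 2500, 7300]
edges = equal_frequency_edges(amounts, 4)

counts = [0, 0, 0, 0]
for amount in amounts:
    counts[bucket_of(amount, edges)] += 1
```

With many tied values the buckets can still come out uneven, so the bucket counts are worth checking after the split.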
  Working Example
    performance variables:
      popular_products_purchased
      multiple per customer record
        not at same grain level
      thus divide it:
        popular_products_purchased_1
        popular_products_purchased_2
        popular_products_purchased_3
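A sketch of the split above: a multi-valued field is flattened into fixed columns so every customer stays at one record. It assumes the product list is already ranked by popularity; the helper name and sample data are hypothetical.

```python
def flatten_popular_products(customer_products, n_slots=3):
    """Turn a per-customer product list into n_slots fixed columns."""
    rows = {}
    for customer_id, products in customer_products.items():
        # Pad short lists with None and drop anything beyond n_slots.
        padded = (list(products) + [None] * n_slots)[:n_slots]
        rows[customer_id] = {
            f"popular_products_purchased_{i + 1}": product
            for i, product in enumerate(padded)
        }
    return rows

# Hypothetical input, most popular product first
flat = flatten_popular_products({
    "c1": ["laptop", "monitor"],
    "c2": ["tablet", "laptop", "printer", "mouse"],
})
```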
Model Development
  What is a Model
    difference: model vs. algorithm
      algorithm
        takes input data
        to produce model
      model
        takes input data
        produces an output (prediction, forecast etc.)
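To make the algorithm vs. model distinction concrete, here is a toy sketch. The "algorithm" is just a per-value frequency table, which is not one of the book's algorithms; the variable `delivery` and the training rows are hypothetical.

```python
from collections import defaultdict

def train_rate_model(training_rows, variable, outcome):
    """Algorithm: takes training data and produces a model."""
    counts = defaultdict(lambda: [0, 0])  # value -> [positives, total]
    for row in training_rows:
        stats = counts[row[variable]]
        stats[0] += row[outcome]
        stats[1] += 1
    rates = {value: pos / total for value, (pos, total) in counts.items()}

    def model(row):
        """Model: takes a new observation and produces a probability.

        Returns None for a value never seen in training.
        """
        return rates.get(row[variable])

    return model

# Hypothetical coupon history with known outcomes
training = [
    {"delivery": "email", "Uses_Coupon": 1},
    {"delivery": "email", "Uses_Coupon": 0},
    {"delivery": "mail", "Uses_Coupon": 0},
    {"delivery": "email", "Uses_Coupon": 1},
]
coupon_model = train_rate_model(training, "delivery", "Uses_Coupon")
p_email = coupon_model({"delivery": "email"})
```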
  in Predictive Modeling
    ex: loan application. produce risk score
      ex: characteristic: age
        input variable: age_range_code
        max. weight: 200
        ex:
          value | weight
          < 17  | rejected
          18-24 | 100
          25-45 | 150
      ex: characteristic: previous loan status
        input variable: 12_month_previous_loan_history_status
        ex:
          value | weight
          one or more loans and up-to-date payments | 200
          no loans before | -50
          one loan, two payments late | -100
      total risk max: 1000
      two variables: 450 / 1000. 45%
    input variables converted into characteristics to form the model
      relative weights/importance of variables emerge
    ex: Risk Scorecard
      characteristics | % weight | max score
      age | 20% | 200
      previous loans | 25% | 250
      income | 15% | 150
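A minimal sketch of scoring one applicant with the two characteristics tabulated above. Only the bands listed in these notes are encoded, so any other age raises an error; the short keys for the loan history values and the function name are assumptions.

```python
# Weights copied from the notes; bands not listed there are not invented.
AGE_WEIGHTS = [((18, 24), 100), ((25, 45), 150)]
LOAN_HISTORY_WEIGHTS = {
    "loans_up_to_date": 200,    # one or more loans and up-to-date payments
    "no_loans_before": -50,
    "one_loan_two_late": -100,  # one loan, two payments late
}
TOTAL_MAX = 1000

def score_applicant(age, loan_history):
    """Return (points, share of total max), or None if rejected on age."""
    if age < 17:
        return None  # the notes list "< 17" as rejected
    for (low, high), weight in AGE_WEIGHTS:
        if low <= age <= high:
            age_points = weight
            break
    else:
        raise ValueError(f"no age band recorded in the notes for age {age}")
    points = age_points + LOAN_HISTORY_WEIGHTS[loan_history]
    return points, points / TOTAL_MAX

example = score_applicant(30, "loans_up_to_date")
```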
  in Descriptive Modeling
    model built. you have:
      number of clusters
      cluster affinity (closeness)
      cluster characteristics:
        cluster name/id
        input variables and their range of values
        probabilities and correlations of variables in each cluster
  Model Validation and Tuning
    essential bc of blackbox nature of ml algorithms
    predictive model validation
      low false-positives
        error rate of model's predictions
      if probabilistic output
        ex
          shipping company
          predict which shipments will get delayed
          historic data
            pred variable: delayed or not
          confusion matrix
              actual shipments on time | actual shipments delayed
            predicted on time | 7670 | 170
            predicted delayed | 940 | 1220
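The usual validation measures can be read off the confusion matrix above. The sketch assumes "delayed" is the positive class; that choice and the metric names are not from the notes.

```python
# Counts from the confusion matrix, with "delayed" as the positive class.
tn = 7670  # predicted on time, actually on time
fn = 170   # predicted on time, actually delayed
fp = 940   # predicted delayed, actually on time
tp = 1220  # predicted delayed, actually delayed

total = tn + fn + fp + tp
accuracy = (tp + tn) / total          # share of all predictions that were right
precision = tp / (tp + fp)            # share of predicted delays that were real
recall = tp / (tp + fn)               # share of real delays that were caught
false_positive_rate = fp / (fp + tn)  # share of on-time shipments flagged as delayed
```

Rounded to three decimals this gives accuracy 0.889, precision 0.565, recall 0.878, and false-positive rate 0.109.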

Challenger: A Culture of Constant Innovation

ch05 Automated Decisions and Business Innovation

Automated Decisions
  school of thought
    empower data scientists
      give them all models
      let them find out 
    democratization of analytics
      not limited to problems where data scientists are available
      required skills
        data and analytics
        very rare
      automated decision approach
        converting subjectivity 
          into objective transparent set of rules
Decision Strategy
  decision strategy
    most important part of the book
    nothing matters
      if decision strategy is not properly designed
    analytics has gray area
      not all scenarios 
        can be accommodated
        have enough training data
      scenarios can have conflicting outputs
      change by conditions
        assumes business as usual
    Nassim Taleb
      Fooled By Randomness
      The Black Swan
      weakness of models 
        when surprising event occurs
      surprise events
        not addressed by historical data
      trick
        find balance in decision strategies
        limiting damage from automated decisions
    mortgage crisis of 2008
      risk models dubbed risky customers as subprime
      decision strategy was aggressive
        mortgage-backed securities risk models
          called them AAA
          didn't factor performance variables
    understanding models
      far more difficult for managers
        vs. managing the decision strategy
  Business Rules in Business Operations
    business rules
      have decision variables (DVs)
        to complete a transaction
        ex
          offer a discount to a good customer
            what is good?
          don't lend to a customer with poor credit
            what is poor?
          reroute a package using a premium service if can delay
            what is "can"?
          stop suspicious money laundering transfer
            what is suspicious?
          hire most suitable candidate
            what is suitable?
      two categories of rules
        expert
        quantitative
    Expert Business Rules
      devised by expert people
      they are
        policy designers and mentors: employees go to them for advice, and they know how to handle things
      business as usual rules (80% of normal business) are 
        either baked into operational systems
        or employees are well trained to address them
    Quantitative Business Rules
      based on hard numbers and absolute values
      ex
        good customer
          has been a customer >1 year
          purchased >300 $
        suspicious money transfer
          > 7500 $
          if customer has account < 200 $
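The two quantitative rules above written as code. Combining the conditions with AND, and reading "customer has account < 200 $" as the account balance, are assumptions about what the notes mean.

```python
def is_good_customer(years_as_customer, total_purchases):
    """Good customer: more than 1 year as a customer and more than $300 purchased."""
    return years_as_customer > 1 and total_purchases > 300

def is_suspicious_transfer(transfer_amount, account_balance):
    """Suspicious: transfer above $7500 on an account holding less than $200."""
    return transfer_amount > 7500 and account_balance < 200
```

Because these rules use hard numbers, the cutoffs can be tuned later without touching the surrounding logic.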
  Decision Automation and Business Rules
    analytics models -> new business rules
    document the process
      describe model
        its output
      context of decision
  Joint Business and Analytics Sessions
    business input
      reasons
        1. they understand their business
        2. their buy-in is needed
  Examples of Decision Strategy
    Retail Bank
      car loan product
      probability that loan will not be paid
        bank does:
          1. historical loan data. identify loans paid and defaulted
          2. predictive loan default model
          3. identify patterns that differentiate
          4. test model
          5. tune model
        decision variables
          probability of default
          loan amount 
          down payment
        decision strategy
          IF any of the Policy Variables are TRUE
          THEN Reject
          ELSE
            IF the Probability of Default > 0.62 (62%)
            THEN Reject
            ELSE
              IF Probability of Default is between 0.28 and 0.62
              THEN IF Loan Amount Requested <= 10,000
                THEN Approve with 25% down payment 
                ELSE (i.e. loan amount requested is > 10,000) Request for a Co-Signer
              ELSE Approve the loan (i.e. probability of default is < 28%)
        performance of the model
          measured using results of decision strategies
        decision variables and cutoffs
          DVs
            probability of default
              from risk predictive model
            loan amount
              part of business transaction
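The Retail Bank decision strategy above translates almost line by line into a function. The function name and the list-of-flags representation of the policy variables are assumptions, as is treating the 0.28 and 0.62 boundaries as inclusive for the middle band.

```python
def car_loan_decision(policy_flags, probability_of_default, loan_amount):
    """Apply the car loan decision strategy to one application."""
    if any(policy_flags):  # any policy variable TRUE => reject
        return "Reject"
    if probability_of_default > 0.62:
        return "Reject"
    if probability_of_default >= 0.28:  # between 0.28 and 0.62
        if loan_amount <= 10_000:
            return "Approve with 25% down payment"
        return "Request for a Co-Signer"
    return "Approve"  # probability of default below 28%

decision = car_loan_decision([False, False], 0.40, 8_000)
```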
    Insurance Claims
      decision strategy
        fig 5.2
    Decision Strategy in Descriptive Models
Decision Automation and Intelligent Systems
  Learning versus Applying
    purpose of analytics systems
      analyze data
      learn from analysis
      use that knowledge
      insight to optimize business operations
    optimize business operations
      means
        operational system changes
    two dimensions of decision automation
      learning
        data warehouse, analysis, analytics models
      application of that learning on business operations
        scope: decision strategies 
    school of thought: active data warehousing
      do decision making in data warehouse
        since all relevant data is available
      requires
        dw connected to operational system
  Strategy Integration Methods
    problem
      how to add decision strategy into operational system?
    ETL to the Rescue
      encompasses
        all aspects of data in motion
          single record, large data set, messaging, files, real time, batch
      ETL: glue that holds entire analytics solution
      integration
        ETL layer
          receives real time event from operational system
          prepares input record around event
          invokes analytics model
          get output from analytics model
          pass it to decision strategy
          reach a decision
          return decision back into operational system
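The ETL-layer steps above can be sketched as one function that chains the pieces. Everything here is a placeholder: the event fields, the stub model (an arbitrary formula, not a trained model), and the stub strategy are assumptions for illustration only.

```python
def handle_event(event, prepare_input, model, decision_strategy):
    """Receive an event, score it, decide, and return the decision."""
    record = prepare_input(event)                # prepare input record around event
    score = model(record)                        # invoke analytics model
    decision = decision_strategy(record, score)  # pass output to decision strategy
    return {"event_id": event["id"], "score": score, "decision": decision}

def prepare_input(event):
    return {
        "loan_amount": event["loan_amount"],
        "missed_payments": event.get("missed_payments", 0),
    }

def stub_model(record):
    # Placeholder scoring formula standing in for a real predictive model.
    return min(1.0, 0.1 + 0.2 * record["missed_payments"])

def stub_strategy(record, score):
    return "Reject" if score > 0.62 else "Approve"

result = handle_event(
    {"id": "evt-1", "loan_amount": 8000, "missed_payments": 3},
    prepare_input, stub_model, stub_strategy,
)
```

Passing the model and strategy in as arguments keeps the integration layer unchanged when either one is replaced.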
Strategy Evaluation
  evaluation of decision strategies
    validating automated decision making
    testing alternate approach
  Retrospective Processing
    use historical data
      they have known outcome
    compare them with strategy
  Reprocessing
    run event through existing decision process
    randomly select 10% of transactions
      run through new strategy
      save results
    after some time
      when benefit of decision is evident and measurable
      compare them
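A sketch of the reprocessing idea: a random 10% of transactions goes through the new strategy and the results are compared later. The function names, the fixed seed, and using a per-transaction "benefit" number as the comparison measure are assumptions.

```python
import random

def assign_strategy(transaction_ids, new_share=0.10, seed=7):
    """Randomly route new_share of the transactions to the new strategy."""
    ids = list(transaction_ids)
    rng = random.Random(seed)
    new_ids = set(rng.sample(ids, round(len(ids) * new_share)))
    return {tid: ("new" if tid in new_ids else "existing") for tid in ids}

def average_benefit(assignments, benefits):
    """Average measured benefit per strategy, once outcomes are known."""
    totals = {}
    for tid, strategy in assignments.items():
        total, count = totals.get(strategy, (0.0, 0))
        totals[strategy] = (total + benefits[tid], count + 1)
    return {s: total / count for s, (total, count) in totals.items()}

assignments = assign_strategy(range(200))
```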

Challenger Strategies

ch06 Governance Monitoring and Tuning of Analytics Solutions

Analytics and Automated Decisions
  The Risk of Automated Decisions 
    incorrect decisions carried out automatically
    inaction leading to lost opportunity
    wasted resources
  Monitoring Layer
Audit and Control Framework
  automated decisions
    2 parts
      analytics
      strategy
    information
      analytics model name / id, type, version
      decision strategy name / id, version
      input values
      output
      decision path
      input date/time, source system, id of input, output date
    real data
      still in operational system
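The audit information listed above could be captured in one record per automated decision. The field names and types below are an assumed layout, not the book's schema, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class DecisionAuditRecord:
    # analytics part
    model_id: str
    model_type: str
    model_version: str
    # strategy part
    strategy_id: str
    strategy_version: str
    # what went in and what came out
    input_id: str
    source_system: str
    input_values: dict
    output: float
    decision_path: tuple
    input_at: datetime
    output_at: datetime

record = DecisionAuditRecord(
    model_id="loan_default", model_type="predictive", model_version="1.2",
    strategy_id="car_loan", strategy_version="3",
    input_id="app-001", source_system="loan_origination",
    input_values={"loan_amount": 8000},
    output=0.40,
    decision_path=("policy check passed", "0.28 <= PD <= 0.62", "amount <= 10,000"),
    input_at=datetime(2022, 1, 5, 9, 30, 0),
    output_at=datetime(2022, 1, 5, 9, 30, 1),
)
```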
  Organization and Process
    audit and control organization
    data moved into audit datamart
  Audit Datamart
    fig 6.1
    tables
      decision summary fact table
        records decisions
        grain: decision level
      decision code
        decision that strategy eventually takes
        3 typical decision codes
          accept
          reject
          review
        predictive model
          assigns probability of default
        strategy
          looks at overall situation of applicant, loan offer, product specifics, liquidity, lender's overall strategy
      threshold table
        wrt. automated decisions
        threshold on certain decision code
        allow certain values of decisions
        breaches get recorded in alerts table
      data warehousing best practices
        use of ETL 
          to load data
        regularly scheduled data feeds
        data quality controls
        etc.
    Unique Features of Audit Datamart
      schedule shared
      data discrepancies go to investigation
      changes to datamart signed off by auditors
      separate logs for access and running queries
      tight security controls
      application for threshold table
      alerts table runs nightly
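A rough sketch of a nightly threshold check feeding the alerts table. Interpreting a threshold as an allowed share range per decision code is an assumption, and the ranges themselves are hypothetical.

```python
from collections import Counter

# Hypothetical threshold table: allowed share of each decision code
THRESHOLDS = {
    "accept": (0.40, 0.80),
    "reject": (0.10, 0.40),
    "review": (0.00, 0.20),
}

def check_thresholds(decision_codes, thresholds=THRESHOLDS):
    """Return one alert row per decision code whose share breaches its range."""
    counts = Counter(decision_codes)
    total = len(decision_codes)
    alerts = []
    for code, (low, high) in thresholds.items():
        share = counts[code] / total if total else 0.0
        if not low <= share <= high:
            alerts.append({"decision_code": code, "share": share, "allowed": (low, high)})
    return alerts

# A day with an unusually high share of "review" decisions
alerts = check_thresholds(["accept"] * 50 + ["reject"] * 20 + ["review"] * 30)
```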
  Control Definition
    Best Practice Controls
    Expert Controls
    Analytical Controls
  Reporting and Action

ch07 Analytics Adoption Roadmap

Learning from Success of DW
  main
    how to convert into an analytics driven enterprise
      penetrate all aspects of business
      top-down
        from top executives
      bottom up
        our proposal
  Simplification
    original definition of DW (Bill Inmon 1992)
      reporting strains operational systems
      there is no single version of truth integrated
      solution
        take all reportable data
        move into dw
  Quick Results
    adoption:
      first top-down approach
      large projects
    problem: analysis paralysis
      years to integrate with no value
    then: datamart based approach
      functional, user-facing databases
        integrating all relevant data
        for that function's reporting needs
      dw was built datamart after datamart
  Evangelize
    no need to convince management this is important
    the more they see, the more they ask for
  Efficient Data Acquisition
    capability in handling 
      various forms of data
      at different schedules
      called: ETL
    vast field: ETL
      involves all aspects
      small industry: data integration
      integration 
        with all operational systems
        mechanism for receiving and sending data
  Holistic View
    business requirements
      come from users of datamarts
    dw team has business analysts
      who have a complete functional and data perspective
    business users trust dw business specialists
  Data Management
    in addition to infrastructure
      data dictionary
      data model
      data lineage
        how did data reach reporting display
      data quality controls
      data dependencies
      transformation business rules
      other metadata
  dw is prerequisite for analytics
  information continuum
    first levels: dw
    analytics on top of dw
The Pilot
  Business Problem
    constraint of analytics projects
      business does not understand
        what they want
      IT doesn't know
        what to build
  Management Attention and Champion
    don't pick a problem
      for which relevant data is not in dw
  The Project
    deliverables:
      problem statement
      candidate variables as input to model
      mapping of input variables to dw data
      design of specialized analytics datamart
      design of ETL: dw > transform > analytics datamart
      integrate analytics datamart with analytics software
      identify large data sample for pilot
      load analytics datamart
      separate 90% for training, 10% for testing
      build models on the training data
      validate results, add more variables, repeat to improve
      feed new data to model, return output to business for actions
    tendency to get expensive software
      resist by using R and SQL
  Results, Roadshow, and Case for Wider Adoption

Analytics Organization and Architecture

Organizational structure
  bicc: business intelligence competency center
    responsible for
      implementation (incl dw)
      analytics
      decision strategies
      across Information Continuum
  BICC Organization Chart
    ETL
      design and architecture
      scheduling and maintenance
      ETL development
    Data Architecture
      data modeling
      data integration
      metadata, tool, repository
    Business analysis
      data analysis
      analytics analysis
      requirements analysis
    Analytics
      analytics modeling
      analytics implementation
      decision strategies
    Information delivery
      reporting development
      analytical application development
      operational integration development
    Additional support teams
      Database administration
      technology infrastructure
      project management
      quality assurance and data governance
      information security
      compliance, audit, control
  Roles and Responsibilities
    ETL
    Data Architecture
    Business Analysts
    Analytics 
    Information Delivery
  Skills Summary
    Analytics Analyst
    Analytics Architect
    Analytics Specialist
Technical Components in Analytics Solutions 
  Analytics Datamart
    Base Analytics Data
    Performance Variables
    Model and Characteristics
    Model Execution, Audit, and Control

Referred Books

Kimball
Inmon
Han