Date: 2022-09-11
FOUNDATION OF ANALYTICS 2
ANGELIQUE ZERINGUE & SHANTI GREENE
DATABASE SYSTEMS AND SQL
RELATIONAL DATABASES
Data is related
Table structure
Logical and physical structure are separate
DATA MODEL / ER DIAGRAM
JOINS
SQL
Structured Query Language
Send commands to a relational database
Some keywords: SELECT, FROM, JOIN, WHERE
EXAMPLE SQL – OVERDUE CUSTOMERS
SELECT c.email
FROM customer c
JOIN rental r ON c.customer_id = r.customer_id
JOIN inventory i ON r.inventory_id = i.inventory_id
JOIN film f ON i.film_id = f.film_id
WHERE r.return_date IS NULL
  AND EXTRACT(DAY FROM CURRENT_DATE - r.rental_date) > f.rental_duration
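The overdue-customers query can be tried without a database server. The following is a minimal sketch using Python's built-in sqlite3 module. The cut-down schema and every row are hypothetical. SQLite has no EXTRACT, so the sketch substitutes julianday() for the day difference and passes the reference date as a parameter instead of CURRENT_DATE to keep the result reproducible.

```python
import sqlite3

# Toy schema: only the columns the query needs. All rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer  (customer_id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE film      (film_id INTEGER PRIMARY KEY, rental_duration INTEGER);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental    (rental_id INTEGER PRIMARY KEY, customer_id INTEGER,
                        inventory_id INTEGER, rental_date TEXT, return_date TEXT);

INSERT INTO customer  VALUES (1, 'ann@example.com'), (2, 'bob@example.com'),
                             (3, 'cy@example.com');
INSERT INTO film      VALUES (10, 3), (11, 7);
INSERT INTO inventory VALUES (100, 10), (101, 11);
INSERT INTO rental VALUES
  (1000, 1, 100, '2022-09-01', NULL),          -- still out, 3-day film
  (1001, 2, 101, '2022-09-08', NULL),          -- still out, 7-day film
  (1002, 3, 100, '2022-08-01', '2022-08-02');  -- already returned
""")

def overdue_emails(conn, today):
    """Emails of customers holding a film longer than its rental_duration."""
    query = """
        SELECT c.email
        FROM customer c
        JOIN rental r    ON c.customer_id = r.customer_id
        JOIN inventory i ON r.inventory_id = i.inventory_id
        JOIN film f      ON i.film_id = f.film_id
        WHERE r.return_date IS NULL
          AND julianday(?) - julianday(r.rental_date) > f.rental_duration
        ORDER BY c.email
    """
    return [row[0] for row in conn.execute(query, (today,))]

print(overdue_emails(conn, "2022-09-11"))
```

Only the first rental is overdue on 2022-09-11: it has been out 10 days against a 3-day limit, while the second has been out 3 days against a 7-day limit.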
NOSQL DATA STORES
• Not Only SQL
• Non-relational
• Schemaless
• Elastic
• Key-value stores (often in-memory)
• Document
• Columnar (wide column)
• Graph
CAP THEOREM
A distributed data store cannot simultaneously offer
more than two of three established guarantees:
Consistency: The data within the database remains consistent, even after an operation has been executed. For instance, after an update, all clients see the same data.
Availability: The system is always on, with no downtime.
Partition Tolerance: The system continues to function even if communication among the servers becomes unreliable and the servers are split into groups that can’t communicate with each other.
STRUCTURED DATA FORMATS
• JSON
• Delimited
  • CSV
  • Tab
• XLSX
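To make the difference between these formats concrete, here is a small sketch that parses the same two hypothetical records from JSON, CSV, and tab-delimited text using only the Python standard library. The helper name read_delimited is illustrative. XLSX is left out because reading it needs a third-party package.

```python
import csv
import io
import json

# The same two hypothetical records in three text formats.
json_text = '[{"id": 1, "city": "Raleigh"}, {"id": 2, "city": "Cary, NC"}]'
csv_text = 'id,city\n1,Raleigh\n2,"Cary, NC"\n'
tsv_text = 'id\tcity\n1\tRaleigh\n2\tCary, NC\n'

def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

from_json = json.loads(json_text)
from_csv = read_delimited(csv_text, ",")
from_tsv = read_delimited(tsv_text, "\t")

# JSON keeps the number type; delimited files yield strings only.
print(type(from_json[0]["id"]).__name__, type(from_csv[0]["id"]).__name__)
```

Note that the CSV version has to quote the value containing a comma, while the tab-delimited version does not.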
OTHER DATA FORMATS AND SOURCES
APIs
Serialization
Streams
Lakes
DATA MANIPULATION IN PYTHON WITH PANDAS
DATAFRAME
• Data organized into rows and columns
• Similar to a table
• Lives in memory
• A 1D array is a Series
• Easy to read in data from files (read_csv)
• Can access elements directly with indices
• Does not copy by default: assigning a DataFrame to a new name refers to the same data; use .copy() for an independent copy
• Supports grouping, pivoting
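The DataFrame features listed on this slide can be sketched in a few lines of pandas. The data and column names below are hypothetical, and the CSV is read from an in-memory string so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical rental data; read_csv accepts any file-like object.
csv_text = """category,store,amount
Action,1,4.99
Action,2,2.99
Comedy,1,0.99
Comedy,2,4.99
"""
df = pd.read_csv(io.StringIO(csv_text))

amounts = df["amount"]               # a single column is a Series
first_amount = df.loc[0, "amount"]   # direct access by row label and column

alias = df                # plain assignment: both names refer to one object
independent = df.copy()   # .copy() gives an independent DataFrame
alias.loc[0, "amount"] = 5.99        # also visible through df

totals = df.groupby("category")["amount"].sum()                        # grouping
wide = df.pivot(index="category", columns="store", values="amount")    # pivoting
print(totals)
print(wide)
```

After the assignment through alias, df shows the new value while independent still holds the original one.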
DATA TYPES
• Categorical (Qualitative)
  • Nominal
  • Ordinal
• Numerical (Quantitative)
  • Discrete
  • Continuous
  • Interval
  • Ratio
TRANSACTIONAL
INFORMATION
Data that is collected from one or more exchanges
• When did it happen
Recency
• How many times has it happened
Frequency
• What was it worth
Value
• What category(ies) were involved
Variety
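The four questions above map directly onto per-customer summaries. A minimal sketch, using hypothetical transactions and an illustrative helper named summarize, with a fixed reference date so the recency figure is reproducible:

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer, date, amount, category).
transactions = [
    ("A", date(2022, 8, 1), 20.0, "books"),
    ("A", date(2022, 9, 1), 35.0, "music"),
    ("A", date(2022, 9, 5), 5.0, "books"),
    ("B", date(2022, 6, 15), 100.0, "games"),
]

def summarize(transactions, as_of):
    """Recency, frequency, value and variety per customer as of a date."""
    grouped = defaultdict(list)
    for customer, when, amount, category in transactions:
        grouped[customer].append((when, amount, category))

    summary = {}
    for customer, rows in grouped.items():
        summary[customer] = {
            "recency_days": (as_of - max(w for w, _, _ in rows)).days,
            "frequency": len(rows),
            "value": sum(a for _, a, _ in rows),
            "variety": len({c for _, _, c in rows}),
        }
    return summary

summary = summarize(transactions, as_of=date(2022, 9, 11))
print(summary["A"])
print(summary["B"])
```

Customer A is recent, frequent, and varied but low in value; customer B is the opposite on every dimension.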
DESCRIBING WITH DATA
Demographics
• Examples: Age, income, race, sex, number of
children
• Can describe a person, group, or area
(geodemographics)
• People with similar demographics purchase
more similarly than people with dissimilar
demographics
Psychographics
• Examples: golfer, audiophile, listens to classic
rock, watches Luther, donates to AMA
• Describes a person's hobbies, attitudes, or
values
• People with similar psychographics purchase
more similarly than people with dissimilar
psychographics
DESCRIBING WITH DATA (PART 2)
Exographics
• Examples: avg. commute time in MSA,
no. golf courses in County, avg. annual
snowfall
• Describes factors outside the level of
the person
• External factors may be a strong
influence in purchase decisions
Fluxographics
• Examples: recently moved, recently
had a baby, recently graduated college
• Describes life changes
• Life changes often precipitate
purchases
DESCRIPTIVE STATISTICS
Range (min, max)
Averages (mean, median, mode)
Variances
Counts and Sums
Histograms
DESCRIPTIVE EXAMPLE 1: TRANSACTION AMOUNT
Metric                         Value
Min transaction amount         $1
Max transaction amount         $820
Sum transaction amount         $2,731,061
Count of transactions          100,000
Median transaction amount      $20
Mean transaction amount        $27.31
Mode transaction amount        $9
Variance transaction amount    716.26 (dollars squared)
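The same set of metrics can be computed with Python's statistics module. This sketch uses a small hypothetical sample, not the 100,000 transactions behind the table, and it assumes the population variance (pvariance) is wanted; use statistics.variance for the sample version.

```python
import statistics

# Hypothetical transaction amounts in dollars (not the data behind the table).
amounts = [1, 9, 9, 9, 20, 20, 25, 45, 60, 102]

summary = {
    "min": min(amounts),
    "max": max(amounts),
    "sum": sum(amounts),
    "count": len(amounts),
    "median": statistics.median(amounts),
    "mean": statistics.mean(amounts),
    "mode": statistics.mode(amounts),
    # Population variance; its unit is dollars squared, not dollars.
    "variance": statistics.pvariance(amounts),
}

for metric, value in summary.items():
    print(f"{metric:>8}: {value}")
```

As in the table, the mean sits above the median, which is the usual sign of a right-skewed distribution with a few large transactions.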
VISUALIZING DATA IN PYTHON
GRAPHICAL EXCELLENCE
Is the well-designed presentation of interesting data - a matter of substance, of statistics, and of design.
Consists of complex ideas communicated with clarity, precision and efficiency.
Gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
Is nearly always multivariate.
Requires telling the truth about the data.
GRAPHICAL INTEGRITY PRINCIPLES
Representations of numbers should match their true proportions.
Labeling should be clear, detailed and thorough.
Designs should show only data variations, not design variation.
Standardized units are best when representing money.
The number of dimensions visualized should equal the number of dimensions in the data.
Representations should not imply a subjective or false context.
SEABORN EXAMPLE
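import seaborn as sns

# Histogram of rental duration as proportions, one bin per day, with a KDE overlay.
sns.histplot(data=filmCategoryDf, x='rental_duration', stat='proportion',
             binwidth=1, kde=True, color='blue')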
THE GOAL DICTATES THE APPROACH
THE PROCESS OF MODEL BUILDING
STEPS IN MODEL BUILDING
1. Identify & clarify the problem you are solving
2. Learn background information about the problem
3. Select variables
4. Acquire data
5. Choose a statistical modeling approach
6. Conduct exploratory data analysis / assumptions testing
7. Fit the model
8. Conduct model diagnostics
9. Address any deficiencies in the model
10. Interpret and communicate how the results address the problem
ITERATIVE PROCESS
Inputs
• Subject matter theories
• Model
• Data
• Statistical techniques
• Auxiliary assumptions
Calculate / Compute
Outputs
• Parameter estimates
• Confidence regions
• Test statistics
• Graphical displays
Diagnose, validate, & criticize
STEP 1. IDENTIFY THE PROBLEM
• Most people are terrible at specifying the problem. Before you build a model, ask them the important questions:
  • What do you want to know?
  • What is the goal? What do you want to do with that information?
  • What background information is needed to understand the problem?
  • Is the problem one you can realistically address?
WHAT IS THE GOAL?
01 Description
• Describe key features of data
• Centrality, spread, distribution, etc.
• Describe relationships in data
02 Inference
• Test hypotheses / causal theories
• Explain relationships in the data
03 Prediction
• Make informed guesses about future data
IS THE GOAL REALISTIC?
Is what the client requested consistent with their stated goals? Does it address the problem they described?
What are the available resources to work on this problem?
Can the problem be solved in the time frame allotted?
Does the data already exist or will it have to be collected?
THIS REALLY HAPPENS...
A real conversation:
• The client: “I need some operational analytics.”
• Data scientist: “What do you mean?”
• The client: “I just need some metrics.”
• Data scientist: “What kind of metrics?”
• The client: “You’re the data scientist. You’re supposed to tell me.”
What to do:
• Ask lots of questions
• Explain why you are asking those questions
• Listen carefully
• Take lots of notes
STEP 2. GET THE BACKGROUND INFO
The client knows a lot about the problem
• You probably don’t
Ask them for resources to learn more
Do your own background research to:
• Understand the main clinical issues
• Identify known relationships between data elements
• Identify variables you need to include in the model that the client forgot to tell you about
• Learn about the quirks and limitations of using certain measures
• Find out the standard approaches people use with that type of data
This can save you a lot of pain
STEP 3. SELECT VARIABLES
If you have several different variables that measure approximately the same thing, which do you choose?
Do you keep the variables as-is or will you have to modify them?
As they explain the problem & the background information, listen for clues about:
• What is the outcome?
• What is the distribution of the outcome?
• What is the main relationship they are interested in examining?
• What other things will impact the outcome and have to be taken into account?
EXAMPLE
“We .. um .. started a program to better detect sepsis. We want to know if the program worked and we have reduced instances of septic shock. We think we think patients with sepsis will also have reduced length of stay in the hospital, if we catch this earlier. Sepsis is an infection in the blood. Patients with compromised immunity are at greater risk, as are the elderly, IV drug users, and those on dialysis. As part of the program, we track people who come in with sepsis.”
From this, can we identify the problem, some clinical background info, outcomes, other variables, & potential data sources?
• Two outcomes: one continuous (length of stay) & one binary (septic shock)
• Main predictor: people in the program vs people not in the program (or before vs after)
• Other risk factors: compromised immunity, age, IV drug use, dialysis
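Once the outcomes and main predictor are identified, they translate into columns. The sketch below shows one possible encoding on six hypothetical patient records; the field names and the helper outcomes_by_group are illustrative, and the numbers say nothing about any real program.

```python
from statistics import mean

# Hypothetical records. in_program is the main predictor, septic_shock the
# binary outcome, length_of_stay the continuous outcome, age a risk factor.
patients = [
    {"in_program": True,  "septic_shock": False, "length_of_stay": 4.0,  "age": 71},
    {"in_program": True,  "septic_shock": False, "length_of_stay": 5.0,  "age": 58},
    {"in_program": True,  "septic_shock": True,  "length_of_stay": 9.0,  "age": 80},
    {"in_program": False, "septic_shock": True,  "length_of_stay": 8.0,  "age": 66},
    {"in_program": False, "septic_shock": True,  "length_of_stay": 10.0, "age": 77},
    {"in_program": False, "septic_shock": False, "length_of_stay": 6.0,  "age": 49},
]

def outcomes_by_group(patients):
    """Summarize both outcomes separately for program and non-program patients."""
    result = {}
    for flag in (True, False):
        group = [p for p in patients if p["in_program"] is flag]
        result[flag] = {
            "n": len(group),
            # A binary outcome is summarized as a rate.
            "shock_rate": mean(int(p["septic_shock"]) for p in group),
            # A continuous outcome is summarized as a mean.
            "mean_los": mean(p["length_of_stay"] for p in group),
        }
    return result

groups = outcomes_by_group(patients)
print(groups[True])
print(groups[False])
```

A raw comparison like this ignores the other risk factors, which is exactly why they have to be carried into the model.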
STEP 4. ACQUIRE THE DATA
• Are there existing sources of the data or will it have to be collected?
• If it has to be collected, work with people who have a research background to do this properly
• If it exists already, is the data in a single source or do data need to come from multiple sources?
80-90% of time is spent assembling and wrangling data; 10-20% doing analyses
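When data come from multiple sources, much of the wrangling time goes into matching records and finding the ones that do not match. A minimal sketch with two hypothetical extracts keyed by patient_id; the helper combine is illustrative and assumes the key is unique within each source.

```python
# Hypothetical extracts from two systems, keyed by patient_id.
admissions = [
    {"patient_id": 1, "length_of_stay": 4.0},
    {"patient_id": 2, "length_of_stay": 9.0},
    {"patient_id": 3, "length_of_stay": 6.0},
]
program_log = [
    {"patient_id": 1, "in_program": True},
    {"patient_id": 3, "in_program": False},
    {"patient_id": 4, "in_program": True},
]

def combine(left, right, key):
    """Inner-join two lists of records on key and report unmatched keys."""
    right_by_key = {row[key]: row for row in right}
    left_keys = {row[key] for row in left}
    right_keys = set(right_by_key)

    matched = [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]
    only_left = sorted(left_keys - right_keys)
    only_right = sorted(right_keys - left_keys)
    return matched, only_left, only_right

matched, only_admissions, only_program = combine(admissions, program_log, "patient_id")
print(matched)
print(only_admissions, only_program)
```

Reporting the unmatched keys on both sides makes it visible how many records would be silently dropped by the join.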
NEXT STEPS
Model fitting is iterative, so expect to jump back and forth between different steps
Step 5, choosing a model; Step 7, fitting the model; and Step 10, interpreting results, will be covered throughout the course
Step 6, exploratory data analysis, and Step 8, model diagnostics, will be covered in a future module
Step 9, addressing model deficiencies, will be in a future module
• As part of this, we will loop back to step 3 and talk more about variable selection

