U1 -无代写|学霸联盟

Date: 2025-09-16

U1 Data analysis and data sampling
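Data integrity includes the accuracy, completeness, consistency, and validity of data (typical problems: missing, duplicated, or invalid values).
Integrity can be compromised whenever data are replicated, transferred, or manipulated. Validation requirements: Data type, Data range, Mandatory (the value must not be missing), Unique, Regular expression (regex) patterns, Cross-field validation (e.g. percentages must sum to 1), Accuracy, Completeness, Consistency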
No data or too little data: 1. Perform the analysis using proxy data. 2. If data are missing only for certain groups of units, adjust your analysis to align with the data you already have. 3. Gather the data on a small scale to perform a preliminary analysis, then request additional time to complete the analysis after you have collected more data.
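Missing data: 1. Remove missing objects 2. Create a new, related attribute 3. Replace with estimates 4. Ignore missing values
Duplicate data: remove the redundant data, if any
Inconsistent/noisy data: correct the values one by one, or treat them as missing data
Outliers (extreme values): outliers can be legitimate data
How to detect outliers: 1. Sorting 2. Visualization 3. Statistical tests (z-score, three-sigma rule) 4. Interquartile range (IQR): upper limit Q3 + 1.5 IQR, lower limit Q1 - 1.5 IQR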
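A minimal sketch of the z-score and IQR checks listed above, using only the Python standard library. The function names, the toy data, and the choice of the population standard deviation are assumptions made for illustration; quartile conventions differ between tools, so the fences may shift slightly elsewhere.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose |z-score| exceeds the threshold (three-sigma rule)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD: an assumption of this sketch
    if sd == 0:
        return []
    return [v for v in values if abs((v - mean) / sd) > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical toy data with one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]
print(iqr_outliers(data))
print(zscore_outliers(data))
```

A flagged value is only a candidate: as noted above, an outlier can be legitimate data.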
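Data preprocessing
Data transformation: the scale of the data matters
1. Standardization: centers the data on the mean with a standard deviation of 1; suited to normally distributed data; sensitive to outliers
2. Scaling normalization (min-max rescaling): restricts the data to the range 0-1; suited to unknown or non-normal distributions; sensitive to outliers. Both methods can be applied and the results compared.
3. Log transformation (suited to skewed data)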
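The three transformations can be sketched in plain Python as follows. The helper names and the sample values are hypothetical, and the sample standard deviation is used for standardization as an assumption.

```python
import math
import statistics

def standardize(values):
    """Center on the mean and scale to unit (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """Natural log; only defined for strictly positive values."""
    return [math.log(v) for v in values]

# Hypothetical right-skewed data: one very large value.
incomes = [20, 30, 40, 50, 1000]
print(standardize(incomes))
print(min_max(incomes))
print(log_transform(incomes))
```

With the large value present, min-max rescaling squeezes the other four values close to 0, which illustrates the sensitivity to outliers mentioned above.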
3.Measure of dispersion (variation)
Strategic decisions: involve higher-level issues concerned with the overall direction of the organization
Tactical decisions: how the organization should achieve the goals and objectives set by its strategy
Operational decisions: affect how the firm is run from day to day
Reasons for the sharply increased attention to data analytics: availability of massive amounts of data; improvements in analytic methodologies; substantial increases in computing power
Why decision making is hard: enormous number of alternatives, uncertainty
How decisions are made: tradition, intuition, rules of thumb, data analysis
Data sampling
Census: time-consuming, expensive, misleading, unnecessary, impractical if the observations are destructive
Sample: subset of the population | Population: the entire set
Unit/object/instance: the individual item on which data are collected
Margin of error of a sample:
A successful sample: representative, large enough (depends on the diversity of the population), appropriate method
Sample size of at least 30; at least 50 if there are outliers
Large sample: low margin of error, high confidence level
GPT: Generative Pre-trained Transformer
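The notes do not show their margin-of-error formula. As a hedged illustration, the sketch below assumes the common formula for a sample mean with known population standard deviation, z * sigma / sqrt(n); the function name and the numbers are hypothetical.

```python
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Margin of error for a mean, assuming a known population SD (sigma)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    return z * sigma / n ** 0.5

# Quadrupling the sample size halves the margin of error.
print(margin_of_error(sigma=10, n=100))
print(margin_of_error(sigma=10, n=400))
```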
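Correlation vs. causation
Correlation: the variables are related, but one does not necessarily cause the other. Causation: one variable does cause the other
Only a randomized experiment (with controlled variables) can establish causation; an observational study can only establish correlation
Data collection
Sample survey: no intervention or manipulation, simply asking some questions; watch for bias, confidentiality, and anonymity
Randomized experiments: controlled-variable method; independent/explanatory variable, dependent/outcome variable
Observational study: observe only, without intervening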
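Longitudinal data: multiple entities, observed over some extended time period
Trend study: in each period a new random sample is drawn; the same units are not required to participate more than once
Cohort study: every period draws on a fixed group of participants (the cohort); a given unit need not take part in every period, but the participants in each period are selected at random from the fixed cohort
Data sampling methods
Non-probability (biased) sample: Convenience sampling, Voluntary response sampling/self-selected sample, Purposive sampling/judgment sample, Snowball sampling (participants recruit further participants)
Probability (unbiased) sample:
Simple random sampling: number every unit, then select with generated random numbers
Stratified sampling: divide the population into strata (singular: stratum) whose members are similar, then take a simple random sample within each stratum (compare counting US election votes state by state). Suitable when the population has large natural variability and values within each stratum are more consistent, when the units are geographically separated, or when different interviewers are used
Weighted stratified sampling: the strata contribute different sample sizes, so the results are weighted
Cluster sampling: select a few whole clusters from the population; each cluster is internally diverse and representative of the whole
Systematic sampling: number the units and take every k-th one; there must be no hidden pattern that affects the result
Multi-stage sampling: use different sampling methods at different stages
Infinite population: requires that each element selected comes from the same population and is selected independently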
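Three of the probability sampling methods can be sketched with the standard `random` module. The function names, the numbered population of 100 units, and the two strata are hypothetical.

```python
import random

def simple_random_sample(units, n, rng):
    """Draw n distinct units, each equally likely."""
    return rng.sample(units, n)

def systematic_sample(units, step, rng):
    """Pick a random start, then take every step-th unit."""
    start = rng.randrange(step)
    return units[start::step]

def stratified_sample(strata, n_per_stratum, rng):
    """Simple random sample inside each stratum (dict: name -> units)."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(0)                      # fixed seed for repeatability
population = list(range(1, 101))            # units numbered 1..100
strata = {"north": population[:50], "south": population[50:]}

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
picks = stratified_sample(strata, 5, rng)
print(srs, sys_sample, picks, sep="\n")
```

Systematic sampling is only safe here because the numbering carries no hidden periodic pattern.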
U2 Descriptive Analytics
Data analysis: categories of analytics
Descriptive analytics: describes what has happened in the past
Diagnostic analytics: determines the causes of trends and correlations between variables; hypothesis testing, diagnostic regression analysis, correlation/causation (some treat this as part of descriptive analytics)
Predictive analytics: predicts the future or ascertains the impact of one variable on another; linear regression, time series analysis, some data-mining techniques, and simulation, often referred to as risk analysis
Prescriptive analytics: indicates a course of action to take; gives a decision
Data are facts and figures collected, analyzed, and summarized
for presentation and interpretation, including numbers, texts,
images, audios, videos, and so on.
Data types (time series, etc.)
Cross-sectional: multiple entities, same point in time or within the same time interval.
Time series: single entity, over multiple points in time or a time period.
Panel data (or time-series cross-section): multiple entities, multiple points in time; the same entities are observed every time.

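Types of measurement scales
Qualitative variables: distinct categories
Nominal scale: mutually exclusive and collectively exhaustive categories that cannot be ordered (gender, color, region, postcode)
Ordinal scale: categories can be ordered, but the exact difference between ranks cannot be determined (excellent/good/fair/poor, large/medium/small, first place/second place)
Quantitative variables: numeric, can be ordered
Interval (relative) scale: no meaningful zero (year, temperature, IQ)
Ratio (absolute) scale: a true/absolute zero exists, so true ratios exist (height, weight, age, speed)

Data statistics
Frequency distribution: shows peaks and outliers; a frequency histogram can also be built for nominal or ordinal scales; cumulative frequency table (for the sample)
For the population: probability mass function (p.m.f.) for a discrete attribute; probability density function (p.d.f.) for a continuous attribute.
Right (positively) skewed distribution: the peak sits on the left, the left side is taller and the right tail is longer, so the mean is pulled to the right of the median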
Summary statistics
1. Measures of central tendency
Mean, mode, median, midrange
Arithmetic mean, weighted mean, geometric mean (GM: the average of growth rates, the n-th root of the product of n values), harmonic mean (used when the averaged quantity sits in the denominator, e.g. averaging speeds or resistances)
2. Measures of location
1st quartile (25%), 2nd quartile (median), 3rd quartile (75%), decile (10%, 20%, 30%), percentile (5th percentile, 95th percentile, 99th percentile)
Box plot: max, 3rd quartile, median, 1st quartile, min; a median closer to the top of the box indicates left skew
Chebyshev's theorem: at least 1 - 1/k^2 of the data lie within k standard deviations of the mean (k > 1; k need not be an integer)
Plots: choosing a chart
Line chart: tracks changes over short and long periods of time. Compared with a bar chart it can show smaller changes, and it can carry several lines
Bar chart: contrasts and compares two or more values using heights or lengths; can also show change over time
Histogram: for quantitative data
Heatmap: uses color to compare categories in a data set
Pie chart: divided into segments representing proportions corresponding to the quantity each represents; can show relationships
Scatter plot: shows relationships between different variables (two)
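For the geometric mean, harmonic mean, and Chebyshev's theorem mentioned in these notes, a short standard-library sketch may help. The growth factors, speeds, and the helper name `chebyshev_bound` are hypothetical.

```python
import statistics

# Geometric mean: average growth factor over three periods (+10%, +20%, -10%).
growth_factors = [1.10, 1.20, 0.90]
gm = statistics.geometric_mean(growth_factors)

# Harmonic mean: average speed over two equal distances driven at 40 and 60.
speeds = [40, 60]
hm = statistics.harmonic_mean(speeds)

def chebyshev_bound(k):
    """Minimum share of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

print(gm, hm)
print(chebyshev_bound(2), chebyshev_bound(1.5))
```

The harmonic mean of 40 and 60 is 48, lower than the arithmetic mean of 50, because more time is spent at the slower speed.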
U3 Art of Visualization and Storytelling
Data visualization puts information together into a graphic representation to make it easier for people to understand. Engage your audience. Help the audience have a conversation with the data. A good visualization has to convey the information in the data accurately. A good data viz should also be aesthetically pleasing.
Headlines, subtitles, labels, and annotations
Pre-attentive attributes: attributes that people recognize automatically, without conscious effort
Marks: points, lines, and shapes; their attributes: position, size, shape, color
Channels: accuracy, popout (visual salience), grouping (proximity, similarity, enclosure, connectedness, and continuity of the channel)
The Elements of Art:
Line: curved or straight, thick or thin, vertical/horizontal or diagonal, solid or dashed
Shape: 2-D, symmetrical or asymmetrical
Color: hue (red, yellow, blue, etc.), intensity (saturation: brightness or dullness; colors lose intensity when mixed with their complement), value (how light or dark a color is; adding white gives a tint, adding black a shade)
Space: the area between, around, and in the objects
Movement
Data Storytelling
3 steps: 1. Engage your audience 2. Create compelling visuals 3. Tell the story in an interesting narrative
An effective narrative considers: Characters (the people affected by the story), Setting (the current situation), Plot (the conflict), Big reveal (the solution)
Plot: this could be a challenge from a competitor, an inefficient process that needs to be fixed, or a new opportunity that the company just can't pass up. This complication of the current situation should reveal the problem your analysis is solving and compel the characters to act
Chart choice by data type: pie chart (qualitative), bar chart (qualitative; also quantitative when there are few values), line chart (quantitative), histogram (quantitative)
9 Principles of Figure Design:balance, emphasis, movement,
pattern, repetition, proportion, rhythm, variety, unity
U4 Statistical Inference and Data Resampling
Point estimation
Statistical inference: using sample data to estimate population parameters
The sample mean $\bar{x}$ is a point estimator of the population mean $\mu$. The sample standard deviation $s$ is a point estimator of the population standard deviation $\sigma$. The sample proportion $\bar{p}$ is a point estimator of the population proportion $p$.
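Sampling distribution of $\bar{x}$:
Expected value: $E(\bar{x}) = \mu$
Standard deviation: finite population $\sigma_{\bar{x}} = \sqrt{\frac{N-n}{N-1}}\,\frac{\sigma}{\sqrt{n}}$; infinite population $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Rule of thumb: the finite population correction factor $\sqrt{(N-n)/(N-1)}$ is used when the sample size is more than 5% of the population ($n/N > 0.05$)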
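Central limit theorem: if the population is normally distributed, the sample mean is normally distributed; if the population is not normally distributed, the sample mean is approximately normally distributed when the sample size is large enough (at least 30).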
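Hypothesis test
Null hypothesis H0, alternative hypothesis Ha; the aim is to reject H0 in favor of Ha
Test statistic
The more extreme the test statistic in the direction of Ha (for a lower-tail test, the smaller the t-statistic), the higher the chance of rejecting H0
P-value: assuming that H0 is true, the probability of obtaining a result at least as extreme as the one observed; the smaller the p-value, the stronger the grounds for rejecting H0
α: level of significance, usually 0.05 or 0.01
Lower tail: cumulative probability of values below the computed t
Upper tail: cumulative probability of values above the computed t
Two tail: twice the smaller of the two tail probabilities above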
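The three p-value definitions can be sketched as below. One loud assumption: the standard normal distribution stands in for the t distribution (reasonable for large samples), because the Python standard library has no t distribution. The function name and the statistic value are hypothetical.

```python
from statistics import NormalDist

def p_values(stat):
    """Lower-tail, upper-tail, and two-tailed p-values for a z-type statistic."""
    lower = NormalDist().cdf(stat)       # P(Z <= stat)
    upper = 1 - lower                    # P(Z >= stat)
    two_tail = 2 * min(lower, upper)     # twice the smaller tail
    return lower, upper, two_tail

alpha = 0.05
lower, upper, two_tail = p_values(-2.0)
print(lower, upper, two_tail)
print("reject H0 (two-tailed)" if two_tail < alpha else "fail to reject H0")
```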
Confusion matrix
α (level of significance) = probability of making a Type I error (rejecting H0) when the null hypothesis is true.
β = probability of making a Type II error (failing to reject H0) when the alternative hypothesis is true.
Data resampling refers to methods for economically using a collected dataset to improve the estimate of the population parameter and to help quantify the uncertainty of the estimate.
Training set & testing set (validation set/hold-out set)
The cross-validation methods differ mainly in how they split the data into training and validation sets, and therefore in the training sets they use.
Cross Validation
1. The validation set approach: simply partition the data set into two parts, of size K and (n-K), by selecting K observations at random
Merits: it is simple to understand and easy to implement. It is less computationally expensive.
Drawbacks: the validation error varies with the particular validation set chosen. A significant number of observations are held out for validation, so the estimated error tends to be higher than the actual error: much of the data is used only for validation and not for training the model.
2. Leave-one-out cross-validation (LOOCV): hold out one observation at a time and train n times
Merits: LOOCV tends not to overestimate the testing error as much as the validation set approach. LOOCV always returns the same result no matter who applies it and how many times it is repeated.
Drawbacks: potentially expensive in computation
3. K-fold cross-validation: split the data into k folds; each fold is used once for validation
Merits: more computationally affordable than LOOCV. Less biased in overestimating the actual testing error than the validation set approach.
Drawbacks: k-fold cross-validation has higher variation in the testing error than the validation set approach (nevertheless, the variation is less than that of LOOCV).
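How the folds are formed can be sketched without any library. The generator name `k_fold_indices` and the shuffling seed are assumptions of this sketch; setting k equal to n gives LOOCV.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # random assignment to folds
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_indices(n=10, k=5))
for train, val in splits:
    print("train:", sorted(train), "validate:", sorted(val))

loocv_splits = list(k_fold_indices(n=10, k=10))   # leave-one-out as a special case
```

Each observation lands in exactly one validation fold, so every data point is used for both training and validation across the k rounds.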
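Bootstrapping
The process of creating bootstrapped data sets, then calculating and recording some desired statistic for each
Sample the data set with replacement: if the data set has, say, only eight observations, each experiment draws 8 times and yields one sample estimate (e.g. the mean); repeat the experiment n times to obtain a frequency histogram of the means.
This makes it possible to learn about the behavior of the estimate from a fairly small data set, without spending money on collecting large amounts of data.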
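A minimal bootstrap sketch for the mean of an eight-observation data set. The data values, the number of resamples, and the percentile interval at the end are illustrative assumptions, not taken from the notes.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Resample with replacement (same size as data) and record each mean."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)]

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.6]   # hypothetical observations
means = sorted(bootstrap_means(data))

# Rough 95% percentile interval from the sorted bootstrap means.
ci_low, ci_high = means[25], means[974]
print(statistics.mean(data), ci_low, ci_high)
```

Plotting `means` as a histogram gives the frequency histogram of the bootstrap estimates described above.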
U5 Data Integrity, Data Wrangling, and Data Pre-processing
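Assessing Model Accuracy
Quality of fit
Loss function: measures the discrepancy between the observed value $y_i$ and the predicted value $\hat{y}_i$ (the quantity with the hat is the prediction)
Mean squared error (MSE): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$
Root mean squared error (RMSE): $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$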
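The mean squared error and its square root can be computed directly from observed and predicted values; the function names and the numbers below are hypothetical.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: same units as the response."""
    return math.sqrt(mse(y_true, y_pred))

observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 10.0]
print(mse(observed, predicted), rmse(observed, predicted))
```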
Feature Engineering
Creation of new features from existing ones so that we can improve performance and obtain additional insight into relationships between features, for example by encoding text as binary values.
Machine learning is the concept that a machine can 'learn' knowledge from data without human intervention.
x: features, input variables, independent variables, predictor variables.
y: target, output variables, dependent variables, response variables
Train/fit a machine learning model using a sample of observations (data points)
Residual/random error/irreducible error
If the goal is to estimate y, the task is prediction; if the goal is to understand f(x), the task is inference
Statistical Learning
Supervised learning: the data come in (input, output) pairs, i.e. they are labeled
Quantitative response: regression problem; qualitative response: classification problem
Unsupervised learning: no labels
Clustering, anomaly detection, dimension reduction (e.g. reducing a three-dimensional map to two dimensions)
Semi-supervised learning, self-supervised learning (e.g. GPT), reinforcement learning (e.g. AlphaGo)
U6 Statistical Learning, Model Selection and Regularization
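Training MSE is generally smaller than testing MSE
A model for which the two are close is preferable. As flexibility increases, training MSE decreases while testing MSE follows a U shape
If training MSE is very small but testing MSE is large, the model is overfitting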
Bias-Variance Tradeoff
Expected test mean squared error (MSE): $E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\varepsilon)$
Bias: refers to the inability of a machine learning model to capture the true relationship.
The linear model is inflexible, so it has larger bias
Variance: the difference in fits on different datasets, or the amount by which the predicted values would change on new datasets
The linear model still fits the testing set moderately well: lower variance, high generalization, high interpretability
In general, complex models have low bias and high variance
Simple Linear Regression
Least Square Estimation (LSE)
Multiple Linear Regression (const: the constant, i.e. the intercept term)
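Linear Model Selection (also called variable selection, feature selection, or model selection)
1. Best Subset Selection: compare all models containing 0, 1, ..., p variables, $2^p$ models in total.
2. Backward Selection: start with all p variables, drop one and keep the best of the resulting models, then drop one of the remaining p-1 and keep the best, and so on until 0 remain; $1 + p(p+1)/2$ models in total.
3. Forward Selection: build up from 0 to p variables; same total as backward selection.
A model that looks like the forward-selection result may also be the best-subset result; compare the two to check whether they agree.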
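Regularization (Shrinkage Methods)
Ridge regression: results on the training set may differ considerably from those on the testing set, so ridge regression accepts a little bias (a slightly higher training MSE) in exchange for lower variance. The coefficients are fitted by minimizing
$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$
The second term is called a regularization or shrinkage penalty.
When $\lambda$ gets larger, the slope of the line gets smaller, as the optimal slope value gets closer to 0.
When $\lambda$ is large enough, the slope (beta coefficient) gets asymptotically closer to zero, but not exactly zero.
A smaller slope makes the fitted model, and hence the predicted response, less sensitive to the training data.
We can try a range of $\lambda$ values and use cross-validation to determine which one results in the best prediction performance.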
LASSO (least absolute shrinkage and selection operator)
In LASSO, the coefficients are fitted by minimizing the following quantity:
$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$
This kind of shrinkage penalty can force some coefficients to be exactly zero when the tuning parameter $\lambda$ is large enough.
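To see the difference between the two penalties, the sketch below works out the one-predictor case in closed form. It assumes a single centered predictor and no intercept, with ridge minimizing sum((y - b*x)^2) + lam*b^2 and LASSO minimizing sum((y - b*x)^2) + lam*|b|; the function names and data are hypothetical.

```python
def ridge_slope(x, y, lam):
    """Closed-form ridge slope: Sxy / (Sxx + lam)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def lasso_slope(x, y, lam):
    """Closed-form LASSO slope: soft-threshold Sxy by lam/2, then divide by Sxx."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return (1 if sxy >= 0 else -1) * shrunk / sxx

x = [-2.0, -1.0, 0.0, 1.0, 2.0]      # centered predictor
y = [-4.1, -1.9, 0.1, 2.0, 3.9]      # roughly y = 2x

for lam in (0, 10, 40, 1000):
    print(lam, ridge_slope(x, y, lam), lasso_slope(x, y, lam))
```

The ridge slope keeps shrinking toward zero without reaching it, while the LASSO slope becomes exactly zero once lam/2 exceeds |Sxy|.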
Why shrinkage methods?
Prediction accuracy. LSE performs competitively in prediction accuracy when the data set is low-dimensional, that is, when the number of predictors p is much smaller than the number of observations n. When n is not much larger than p, however, overfitting can become an issue and the least squares estimate has high variance. Worse still, if p > n, there is no unique solution to the least squares estimation and it cannot be used at all. In contrast, ridge regression and LASSO can overcome this issue by reducing the variance without too much compromise in the increase of bias. This leads to an overall improvement in prediction accuracy.
Variable selection and interpretability. It is extremely unlikely that LSE outputs an exactly zero value for a coefficient. We need to repeatedly fit different subset models that each leave out a subset of predictors and test whether the combination of coefficients generates a significant model. LASSO can constrain and set a series of coefficients to exactly zero while fitting the model only once.
Ridge or LASSO?
LASSO tends to do well if there are a small number of significant predictors and the coefficients of the others are close to zero (i.e. when only a few predictors substantially influence the response).
Ridge works well if there are many predictors with coefficients of roughly the same size (i.e. when most of the predictors are useful in predicting the response).
However, in practice, we don't know the true relationship, so the previous two points are
somewhat theoretical. A safer way is to try both and run cross-validation to select the more
suited model for a specific case.
U7 Descriptive Data Mining
Clustering
K-means clustering (illustrated with points grouped into three colors)
Partitions the data into K distinct and nonoverlapping clusters
Step 1: choose an integer value for K.
Step 2: initialize a random cluster assignment, i.e. assign every data point at random to one of the K clusters.
Step 3: identify the cluster centroid for each of the K clusters.
Step 4: compute the distances from each observation to the K cluster centroids. Re-assign each observation in the data set to the cluster whose centroid is nearest.
Step 5: compute the new cluster centroids for each of the K clusters, then repeat steps 4 and 5 until the centroids no longer change.
Clustering is considered good when the dissimilarity within the same
cluster is small.
The dissimilarity is quantified by the within-cluster-variation, which
measures how much the observation differs from each other within the
same cluster.
Within-cluster variation = the sum of squared distances between all pairs of observations within a cluster.
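The K-means steps can be sketched in plain Python. This is an illustration under assumptions, not a reference implementation: the function names are hypothetical, squared Euclidean distance is assumed, and an empty cluster is re-seeded with a random point (a choice the notes do not cover).

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]           # step 2: random assignment
    centroids = []
    for _ in range(max_iter):
        centroids = []
        for c in range(k):                                # steps 3 and 5: centroids
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:                               # assumption: re-seed empty cluster
                members = [rng.choice(points)]
            centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]                    # step 4: nearest centroid
        if new_labels == labels:                          # assignments stable: stop
            break
        labels = new_labels
    return labels, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

Because the start is random, different seeds can lead to different label numbering, and on less clearly separated data to different clusterings.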
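Hierarchical Clustering (builds a tree diagram that can be cut at different heights)
Construct a dendrogram
1. Treat each data point as a cluster of its own.
2. Compute the dissimilarity between every pair of clusters and merge the two clusters with the smallest dissimilarity. The dissimilarity between two clusters is derived from the distances between every pair of points across them, using one of the following linkages:
Complete linkage: the maximum pairwise distance
Single linkage: the minimum pairwise distance
Average linkage: the mean pairwise distance
3. Repeat until everything has merged into a single cluster.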
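The three linkage rules differ only in how the pairwise distances between two clusters are summarized. A small sketch, assuming Euclidean distance; the function name and the two example clusters are hypothetical.

```python
import math

def linkage_distance(cluster_a, cluster_b, method="complete"):
    """Dissimilarity between two clusters from all cross-cluster point pairs."""
    distances = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "complete":
        return max(distances)
    if method == "single":
        return min(distances)
    if method == "average":
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage: {method}")

a = [(0, 0), (1, 0)]
b = [(3, 0), (5, 0)]
for method in ("complete", "single", "average"):
    print(method, linkage_distance(a, b, method))
```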
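Limitations of K-means and hierarchical clustering:
1. Outliers are not handled automatically; they are assigned to some cluster and blur the cluster boundaries.
2. The hierarchical clustering approach is only appropriate when the clusters have some embedded hierarchical relationship: a three-cluster solution must be nested inside the two-cluster solution. For example, people can be split into male/female or into young/middle-aged/old; these groupings are not nested, so hierarchical clustering does not suit them.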
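Frequent Pattern Mining: finding associations and underlying relationships
A set of I items has $2^I - 1$ non-empty subsets
Support of an itemset: the proportion of transactions that contain the itemset; sometimes the absolute frequency (a count) is used instead of the ratio
Frequent itemset mining: find the itemsets whose support is at least min_sup; itemsets below the threshold are infrequent
Naive algorithm: enumerate every itemset and count its support. Problems: 1. Exponential explosion. 2. At every step, we need to scan the whole transaction data to count the support of each itemset.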
min-sup:
A hyper-parameter set by users to manage their expectations of results.
The minimum threshold value for the support min_sup is to avoid inexplicable rules
capturing random noise in the data.
Setting it too low will lead to a large number of item sets being qualified as frequent. Those
rare item sets may apply in too few cases to be useful.
Setting it too high will give a small number of item sets. The results would be too generic to
be useful and do not provide new knowledge for users.
A general rule of thumb is 20%.
However, if an item set is particularly valuable and represents a lucrative opportunity,
the minimum support threshold can be lowered.
1. Solving the exponential explosion: monotonicity theorems
Theorem 1: If an itemset is frequent, then each of its subsets is frequent too (in the itemset lattice, all the itemsets connected to it are frequent too).
Theorem 2: If an itemset is infrequent, then none of its supersets will be frequent (all the itemsets connected from it are infrequent too).
Apriori algorithm: uses the monotonicity rules to discard candidate itemsets, because only the combination of two frequent itemsets may lead to a frequent itemset
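A compact sketch of the Apriori idea: candidates of size k are built only from frequent itemsets of size k-1, and any candidate with an infrequent subset is pruned before its support is counted. The function name, the toy transactions, and the threshold are hypothetical, and support is taken as a proportion of transactions.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frequent itemset: support}, with support as a proportion."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)

    def support(itemset):
        return sum(itemset <= t for t in sets) / n

    level = [frozenset([i]) for i in sorted({i for t in sets for i in t})]
    frequent, k = {}, 1
    while level:
        current = {}
        for c in level:
            s = support(c)
            if s >= min_sup:
                current[c] = s
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-itemsets, then prune by Theorem 2.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = apriori(transactions, min_sup=0.6)
for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

With this data no three-item candidate is ever counted, because only one two-item set is frequent.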
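2. Solving the need to re-scan and count at every step:
FP-Growth algorithm is short for "frequent pattern growth"
Steps: 1. Find all frequent base itemsets (itemsets with a single item)
2. Sort the items within each transaction in descending order of frequency
3. Draw the branches of the tree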
Maximal frequent itemset:a frequent itemset is maximal if none of its supersets is
frequent.
Closed frequent itemset: a frequent itemset is closed if none of its supersets has
the same support.
If an itemset i is maximal, then it must be also closed
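An association rule is an implication of the form A → B, where A and B are itemsets that do not share common items
A: antecedent of the rule
B: consequent of the rule
One possible measure is the support: $\mathrm{support}(A \rightarrow B) = \mathrm{support}(A \cup B)$
Another is the confidence: $\mathrm{confidence}(A \rightarrow B) = \mathrm{support}(A \cup B) / \mathrm{support}(A)$
A rule is retained when its support in T is at least min_sup and its confidence in T is at least min_conf.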
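Support and confidence of a single rule can be computed by counting transactions. The sketch assumes the usual definitions (support of the rule is the share of transactions containing both A and B; confidence is that count divided by the count of transactions containing A). Function names, data, and thresholds are hypothetical.

```python
def rule_support(antecedent, consequent, transactions):
    """Share of transactions that contain every item of A and of B."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)

def rule_confidence(antecedent, consequent, transactions):
    """Among transactions containing A, the share that also contain B."""
    both = antecedent | consequent
    count_a = sum(antecedent <= t for t in transactions)
    count_both = sum(both <= t for t in transactions)
    return count_both / count_a

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
A, B = {"bread"}, {"milk"}
sup = rule_support(A, B, baskets)
conf = rule_confidence(A, B, baskets)
min_sup, min_conf = 0.5, 0.7          # hypothetical thresholds
print(sup, conf, sup >= min_sup and conf >= min_conf)
```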
U8 Classification Methods
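K-Nearest Neighbors (KNN)
Choose a value of k, find the k training points nearest to the query point, compute the share of those neighbors in each class, and assign the point to the class with the largest share. Use an odd k, because an even k easily produces ties; the value of k strongly affects the result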
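A minimal KNN classifier by majority vote, assuming Euclidean distance. The function name and the two-class training data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label). Return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]
print(knn_predict(train, (2, 2), k=3))
print(knn_predict(train, (7, 7), k=3))
```

With two classes, an odd k guarantees that the vote cannot end in a tie.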
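Logistic Regression
Compared with linear regression, logistic regression keeps the predicted value within [0,1]
Simple: $p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$
Multiple: $p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$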
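The reason the output stays between 0 and 1 is the logistic function itself. A sketch of the simple (one-predictor) case, assuming the standard form p(x) = 1 / (1 + exp(-(b0 + b1*x))); the coefficient values are hypothetical.

```python
import math

def logistic(x, b0, b1):
    """Predicted probability for one predictor x with coefficients b0, b1."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

b0, b1 = -3.0, 0.5                     # hypothetical fitted coefficients
for x in (-50, 0, 6, 50):
    print(x, logistic(x, b0, b1))
```

However large or small x becomes, the value never leaves the interval [0, 1], and it equals 0.5 exactly where b0 + b1*x = 0 (here at x = 6).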
Tree-based Methods
Regression tree
Root node, leaf node, internal node; the root node is at the top
The predicted response for an observation is given by the mean response of the training observations that belong to the same terminal node.
Splits are chosen to minimize the RSS
A deep tree: overfitting, lower bias, poor generalization
Tree pruning
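Classification tree
The predicted response for an observation is the most commonly occurring class of training observations in the region to which it belongs.
Notation: $\hat{p}_{mk}$ is the proportion of training observations in region m that belong to class k.
Classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class: $E = 1 - \max_k \hat{p}_{mk}$
Gini index: $G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$; the closer each $\hat{p}_{mk}$ is to 0 or 1, the smaller the Gini index
As such, the Gini index is also referred to as a measure of node purity, because a small value indicates that a node contains predominantly observations from a single class
Cross-entropy: $D = -\sum_{k=1}^{K} \hat{p}_{mk}\log\hat{p}_{mk}$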
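The three node-impurity measures can be compared on a pure node and a mixed node. The sketch assumes the standard definitions: error rate 1 - max(p), Gini index sum of p*(1-p), cross-entropy -sum of p*log(p), where p holds the class proportions in a node. Function names and proportions are hypothetical.

```python
import math

def error_rate(p):
    """Share of observations not in the most common class."""
    return 1 - max(p)

def gini(p):
    """Gini index: small when every proportion is near 0 or 1."""
    return sum(q * (1 - q) for q in p)

def cross_entropy(p):
    """Cross-entropy (natural log); terms with proportion 0 contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0)

pure, mixed = [1.0, 0.0], [0.5, 0.5]
for name, p in (("pure", pure), ("mixed", mixed)):
    print(name, error_rate(p), gini(p), cross_entropy(p))
```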
U9 Introduction to GenAI and LLMs
Generative AI can learn from existing artifacts to generate new, realistic
artifacts (at scale) that reflect the characteristics of the training data but don’t
repeat it. It can produce a variety of novel content, such as images, video, music,
speech, text, software code and product designs
Rather than AI replacing humans, the two should collaborate - humans
identifying problems, AI suggesting solutions, humans selecting
Risk: the potential for hallucination, the black-box logic systems, opportunities
for cyberattacks, data breaches, copyright concerns, and on and on.
LLMs (Large Language Models) are AI programs that understand and generate text, much as humans do. Examples include GPT.
They use a brain-like system called the 'Transformer' to capture the relationships between words, which makes them good at grasping the meaning of a whole text
Natural Language Generation (NLG)
Natural Language Understanding (NLU)
Tokenization = how the model "sees" the prompt.
Base LLMs = the foundation model "processes" a prompt.
Instruction-Tuned LLMs: an instruction-tuned LLM starts with the foundation model and fine-tunes it with examples or input/output pairs, making it more specialized
What LLMs cannot do well: 1. Knowledge cutoffs 2. Hallucinations 3. Input (and output) length is limited 4. Bias and toxicity.
Issues: Transparency, Accountability, Information hazards, Misinformation spread, Malicious use, Accuracy, Intellectual property (IP) and copyright, Environmental impact, AI Regulations
Prompt Engineering: Prompts can be a single or a series of instructions
Prompt engineering is the process of designing and optimizing text inputs
(prompts) to deliver consistent and quality responses (completions) for a
given application objective and model.
instructions, context, input data, output indicator
question format. statement format. instruction format.
Prompt instruction refers to specific guidelines or directives given to an AI
model to shape the nature, style, or format of the response
U10 Monte Carlo simulation
Monte Carlo simulation consists of independent trials in which the results for one trial do not affect what happens in subsequent trials.
Discrete-event simulation involves trials that represent how a system evolves over time. One common application of discrete-event simulation is the analysis of waiting lines (software: Arena®, ProModel®, and Simio®).
Agent-based modeling studies the emergent behaviors (i.e. the collective behavior of the system) that arise from the interactions and behaviors of individual agents.
System dynamics modeling, used for chaotic systems, relies on discrete-event simulation and numeric methods to determine the behavior of components within that system.
uniform probability distribution.
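As a small illustration of independent trials drawn from a uniform probability distribution, the sketch below estimates pi by sampling points in the unit square. The example problem, the function name, and the trial count are assumptions chosen for illustration; they are not from the notes.

```python
import random

def estimate_pi(n_trials, seed=0):
    """Monte Carlo estimate of pi from independent uniform draws."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_trials):
        x, y = rng.random(), rng.random()      # uniform on [0, 1), independent
        if x * x + y * y <= 1.0:               # point falls inside the quarter circle
            inside += 1
    return 4 * inside / n_trials

for n in (100, 10_000, 100_000):
    print(n, estimate_pi(n))
```

Each run gives only an estimate; more trials narrow the spread around the true value, which mirrors the point about simulation output being a sample.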
Advantages of Simulation
It is conceptually easy to understand and that the methods can be used to
model and learn about the behavior of complex systems that would be
difficult, if not impossible, to deal with analytically.
Simulation models are flexible. They can be used to describe systems
without requiring the assumptions that are often required by other
mathematical models.
A simulation model provides a convenient experimental laboratory for the
real system.
Simulation models frequently warn against poor decision strategies by
projecting disastrous outcomes such as system failures, large financial losses,
and so on
Limitations of Simulation
For complex systems, the process of developing, verifying, and validating a
simulation model can be time consuming and expensive.
Like all mathematical models, the analyst must be conscious of the
assumptions of the model in order to understand its limitations.
In addition, each simulation run provides only a sample of output data. As
such, the summary of the simulation data provides only estimates or
approximations about the real system. Nonetheless, the danger of obtaining
poor solutions is greatly mitigated if the analyst exercises good judgment in
developing the simulation model and follows proper verification and
validation steps. Furthermore, if a sufficiently large set of
simulation trials is run under a wide variety of conditions, the analyst will
likely have sufficient data to predict how the real system will operate
