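U1 Data analysis and data sampling
Data integrity includes the accuracy, completeness, consistency, and validity of data (typical violations: missing, duplicated, or invalid values). Integrity problems arise when data are replicated, transferred, or manipulated.
Validation requirements: data type, data range, mandatory (the value must not be missing), unique, regular expression (regex) patterns, cross-field validation (e.g. percentages must sum to 1), accuracy, completeness, consistency.
No data or too little data:
1. Perform the analysis using proxy data.
2. If data are missing only for a certain group of units, adjust your analysis to align with the data you already have.
3. Gather the data on a small scale to perform a preliminary analysis, then request additional time to complete the analysis after you have collected more data.
Missing data:
1. Remove the objects with missing values
2. Create a new, related attribute
3. Replace with estimates
4. Ignore missing values
Duplicate data: remove the redundant data, if any.
Inconsistent or noisy data: correct the values one by one, or handle them like missing data.
Outliers: outliers can be legitimate data. How to detect them:
1. Sorting
2. Visualization
3. Statistical tests (z-score, the three-sigma rule)
4. Interquartile range (IQR): upper limit Q3 + 1.5 IQR, lower limit Q1 - 1.5 IQR
Data pre-processing
Data transformation: the scale of the data matters.
1. Standardization: centers the data on the mean with a standard deviation of 1; suited to normally distributed data; sensitive to outliers.
2. Scaling normalization (min-max rescaling): restricts the data to the range 0 to 1; suited to unknown or non-normal distributions; sensitive to outliers.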
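Both transformations can be applied and their results compared.
3. Log transformation (suited to skewed data)
Summary statistics: 3. Measures of dispersion (variation)
Strategic decisions: involve higher-level issues concerned with the overall direction of the organization.
Tactical decisions: concern how the organization should achieve the goals and objectives set by its strategy.
Operational decisions: affect how the firm is run from day to day.
Reasons for the sharply increased interest in data analytics: availability of massive amounts of data; improvements in analytic methodologies; substantial increases in computing power.
Why decision analysis is hard: enormous number of alternatives, uncertainty.
How decisions are made: tradition, intuition, rules of thumb, data analysis.
Data sampling
Census: time-consuming, expensive, misleading, unnecessary, impractical if the observations are destructive.
Sample: a subset of the population. Population: the entire set of interest. Unit/object/instance: the individual entity on which data are collected.
A successful sample is representative, large enough (how large depends on the diversity of the population), and drawn with an appropriate method. The sample size should be at least 30, or at least 50 if there are outliers.
Margin of error of sampling: a large sample gives a low margin of error and a high confidence level.
GPT: Generative Pre-trained Transformer
Correlation vs. causation
Correlation: the variables are related, but one does not necessarily cause the other. Causation: one variable causes the other. Only a randomized experiment, which controls the other variables, can establish causation; an observational study can only establish correlation.
Data collection
Sample survey: no intervention or manipulation, simply asking some questions; watch for bias, confidentiality, and anonymity.
Randomized experiments: the controlled-variable method; independent/explanatory variable, dependent/outcome variable.
Observational study: observe only, without intervening.
Longitudinal data: multiple entities observed over some extended time period.
Trend study: in each period a new random sample is drawn and measured; the same units are not required to participate more than once.
Cohort study: every period uses the same fixed group of participants (the cohort); a given unit need not take part in every period, but each period's sample must be drawn at random from that fixed cohort.
Data sampling methods
Non-probability (biased) sample: convenience sampling, voluntary response sampling (self-selected sample), purposive sampling (judgment sample), snowball sampling (participants recruit further participants).
Probability (unbiased) sample:
Simple random sampling: number every unit and select units with random numbers.
Stratified sampling: divide the population into strata (singular: stratum) whose members are similar, then take a simple random sample within each stratum (compare US presidential elections, which are tallied state by state). Useful when the population has large natural variability and sampling within each of the strata gives more consistent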
values, when the strata are geographically separated, or when different interviewers are used.
Weighted stratified sampling: the strata contribute different numbers of units, so the results are weighted.
Cluster sampling: select a few whole clusters from the population; each cluster is internally heterogeneous and therefore representative of the whole.
Systematic sampling: number the units and take every k-th one, provided no hidden pattern in the ordering affects the result.
Multi-stage sampling: use different sampling methods at different stages.
Infinite population (a census is impossible): select some elements to participate, requiring that each element selected comes from the same population and is selected independently.
U2 Descriptive Analytics
Types of data analysis
Descriptive analytics: describes what has happened in the past.
Diagnostic analytics: determines the causes of trends and correlations between variables; hypothesis testing, diagnostic regression analysis, correlation/causation. Some consider it part of descriptive analytics.
Predictive analytics: predicts the future or ascertains the impact of one variable on another; linear regression, time series analysis, some data-mining techniques, and simulation (often referred to as risk analysis).
Prescriptive analytics: recommends the course of action to take, i.e. gives a decision.
Data are facts and figures collected, analyzed, and summarized for presentation and interpretation, including numbers, texts, images, audios, videos, and so on.
Data types (time series, etc.)
Cross-sectional: multiple entities at the same point in time or within the same time interval.
Time series: a single entity over multiple points in time or a time period.
Panel data (or time-series cross-section): multiple entities at multiple points in time; the same entities are observed every time.
Types of measurement scales
Qualitative variables: distinct categories.
Nominal scale: categories are mutually exclusive and collectively exhaustive and cannot be ranked (gender, color, region, postal code).
Ordinal scale: categories can be ranked, but the exact differences between ranks cannot be determined (e.g. excellent/good/fair/poor).
Data statistics
Frequency distribution: shows peaks and outliers; a frequency histogram can also be produced for nominal or ordinal scales. Cumulative frequency table.
The above describe a sample. For the population: probability mass function (p.m.f.) for a discrete attribute; probability density function (p.d.f.) for a continuous attribute.
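The frequency distribution and cumulative frequency table mentioned above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `frequency_table` and the sample data are hypothetical and do not come from the course material.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(values):
    """Return rows of (category, frequency, relative frequency, cumulative frequency)."""
    counts = Counter(values)
    categories = sorted(counts)            # ordered categories (ordinal or numeric data)
    freqs = [counts[c] for c in categories]
    cumulative = list(accumulate(freqs))   # running total of the frequencies
    n = len(values)
    return [(c, f, f / n, cf) for c, f, cf in zip(categories, freqs, cumulative)]


# Hypothetical ordinal ratings: 1 = poor, 2 = fair, 3 = good
ratings = [3, 1, 2, 3, 3, 2, 1, 3]
table = frequency_table(ratings)
for row in table:
    print(row)
```

The last cumulative frequency always equals the sample size, which is a quick sanity check on the table.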
Right (positively) skewed distribution: the peak lies on the left and the longer tail on the right, so the mean is pulled to the right of the median.
Further ordinal-scale examples: large/medium/small; first place, second place.
Quantitative variables: numeric and can be ordered.
Interval (relative) scale: no meaningful zero (year, temperature, IQ).
Ratio (absolute) scale: a true/absolute zero exists, so true ratios exist (height, weight, age, speed).
Summary statistics
1. Measures of central tendency: mean, mode, median, midrange
2. Measures of location: 1st quartile (25%), 2nd quartile (median), 3rd quartile (75%), decile (10%, 20%, 30%), percentile (5th percentile, 95th percentile, 99th percentile)
Box plot: max, 3rd quartile, median, 1st quartile, min; a median sitting toward the upper quartile indicates left skew.
Arithmetic mean; weighted mean; geometric mean (GM): the average of growth rates, the n-th root of the product of n values; harmonic mean: used when the quantity being averaged sits in a denominator, e.g. average speed or resistance.
Choosing plots
Line chart: tracks changes over short and long periods of time. Compared with a bar chart it can show smaller changes, and it can carry several lines.
Bar chart: contrasts and compares two or more values using heights or lengths; it can also show change over time.
Heatmap: uses color to compare categories in a data set.
Pie chart: divided into segments representing proportions corresponding to the quantity each represents; it can show the relationship of parts to the whole.
Scatter plot: shows relationships between (two) different variables.
Chebyshev's theorem: at least $1 - 1/k^2$ of the data lie within k standard deviations of the mean (k > 1, and k need not be an integer).
U3 Art of Visualization and Storytelling
Data visualization is to put information together into graphic representation to make it easier for people to understand.
Engage your audience. Help the audience have a conversation with the data.
A good visualization has to accurately convey the information of data. A good data viz should be aesthetically pleasing.
Headlines, subtitles, labels, and annotations (notes).
Pre-attentive attributes: attributes that people recognize automatically, without conscious effort.
Marks: points, lines, and shapes; position, size, shape, color.
Channels: accuracy, popout (salience), grouping (proximity, similarity, enclosure, connectedness, and continuity of the channel).
The Elements of Art:
Line: curved or straight, thick or thin, vertical/horizontal or diagonal, solid or dashed.
Shape: 2-D, symmetrical or asymmetrical.
Color: hue (red, yellow, blue, etc.); intensity or saturation (brightness or dullness; colors lose intensity when mixed with their complement); value (how light or dark a color is: adding white gives a tint, adding black gives a shade).
Space: the area between, around, and in the objects.
Movement.
Data storytelling, 3 steps: 1. Engage your audience 2. Create compelling visuals 3. Tell the story in an interesting narrative
An effective narrative considers: characters (the people affected by the story), setting (the current situation), plot (the conflict), big reveal (the solution).
Plot: this could be a challenge from a competitor, an inefficient process that needs to be fixed, or a new opportunity that the company just can't pass up. This complication of the current situation should reveal the problem your analysis is solving and compel the characters to act.
Chart choice by data type: pie chart (qualitative), bar chart (qualitative; also quantitative when there are few distinct values), line chart (quantitative), histogram (quantitative).
9 Principles of Figure Design: balance, emphasis, movement, pattern, repetition, proportion, rhythm, variety, unity
U4 Statistical Inference and Data Resampling
Statistical inference: using sample data to estimate characteristics of the population.
Point estimation: the sample mean $\bar{x}$ is a point estimator of the population mean $\mu$. The sample standard deviation $s$ is a point estimator of the population standard deviation $\sigma$. The sample proportion $\bar{p}$ is a point estimator of the population proportion $p$.
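The three point estimators above can be computed directly from a sample. The following is a minimal sketch using Python's standard `statistics` module; the function name `point_estimates`, the sample values, and the "success" criterion are hypothetical choices made only for illustration.

```python
import statistics


def point_estimates(sample, is_success):
    """Return (x_bar, s, p_bar) for a sample and a success criterion."""
    x_bar = statistics.mean(sample)        # estimates the population mean
    s = statistics.stdev(sample)           # sample SD (n - 1 denominator), estimates sigma
    p_bar = sum(1 for x in sample if is_success(x)) / len(sample)  # estimates p
    return x_bar, s, p_bar


# Hypothetical sample; "success" is arbitrarily defined as a value of at least 50
sample = [52, 48, 50, 55, 45]
x_bar, s, p_bar = point_estimates(sample, lambda x: x >= 50)
print(x_bar, round(s, 3), p_bar)
```

Note that `statistics.stdev` divides by n - 1, which is the usual choice when the sample standard deviation is used to estimate the population value.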
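Sampling distribution of $\bar{x}$:
Expected value: $E(\bar{x}) = \mu$
Standard deviation: finite population, $\sigma_{\bar{x}} = \sqrt{\frac{N-n}{N-1}}\,\frac{\sigma}{\sqrt{n}}$; infinite population, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Rule of thumb: the finite population correction factor $\sqrt{(N-n)/(N-1)}$ is used when the sample size is more than 5% of the population.
Central limit theorem: if the population is normally distributed, the sample mean is normally distributed; if the population is not normally distributed, the sample mean is approximately normally distributed when the sample is large enough (sample size of at least 30).
Hypothesis test
Null hypothesis $H_0$, alternative hypothesis $H_a$; the aim is to reject $H_0$ and accept $H_a$.
Test statistic: the more extreme the t-statistic in the direction of $H_a$ (very small for a lower-tail test, very large for an upper-tail test, large in absolute value for a two-tailed test), the higher the chance of rejecting $H_0$.
p-value: assuming that $H_0$ is true, the probability of obtaining a result at least as extreme as the one observed. The smaller the p-value, the stronger the case for rejecting $H_0$.
$\alpha$: level of significance, usually 0.05 or 0.01.
Lower tail: the cumulative probability below the computed t. Upper tail: the cumulative probability above the computed t. Two tail: twice the smaller of those two values.
Confusion matrix of test outcomes
$\alpha$ (level of significance) = probability of making a Type I error when the null hypothesis is true.
$\beta$ = probability of making a Type II error when the alternative hypothesis is true.
Data resampling refers to methods for economically using a collected dataset to improve the estimate of the population parameter and help to quantify the uncertainty of the estimate.
Training set and testing set (validation set / hold-out set). Cross-validation methods differ mainly in how they split the data into training and validation sets.
Cross Validation
1. The validation set approach: simply partition the data set into two parts, with sizes K and (n-K); the K observations are chosen at random.
Merits: It is simple to understand and easy to implement. It is less computationally expensive.
Drawbacks: The validation error will vary with the particular validation set chosen. A significant number of observations are held out for validation, so the error tends to be higher than the actual testing error, because much of the data is used only for validation and not for training the model.
2. Leave-one-out cross-validation (LOOCV): the model is trained n times, each time leaving out one observation for validation.
Merits: LOOCV tends not to overestimate the testing error as much as the validation set approach. LOOCV always returns the same result no matter who applies it and how many times it is repeated.
Drawbacks: potentially expensive in computation.
3. K-fold cross-validation: the data are split into k folds, and each fold is used once for validation.
Merits: More computationally affordable than LOOCV. Less biased in overestimating the actual testing error than the validation set approach.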
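Drawbacks: k-fold cross-validation has higher variation in the testing error than the validation set approach (nevertheless, the variation is less than that of LOOCV).
Bootstrapping
The process of creating bootstrapped data sets and calculating and recording some desired statistic for each of them. The data set is sampled with replacement: if the data set has, say, eight observations, one experiment draws eight times and yields one estimate (for example the mean); the experiment is repeated n times to obtain a frequency histogram of the means. This makes it possible to learn about the behavior of the data from a small data set, without spending money on collecting large amounts of data.
U5 Data Integrity, Data Wrangling, and Data Pre-processing
Assessing model accuracy (quality of fit)
Loss function: mean squared error, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, where the hatted quantity $\hat{y}_i$ is the predicted value.
Feature engineering: the creation of new features from existing ones so that we can improve performance and obtain additional insight into relationships between features; for example, encoding text as binary values.
Machine learning is the concept that a machine can 'learn' knowledge from data without human intervention.
x: features, input variables, independent variables, predictor variables.
y: target, output variables, dependent variables, response variables.
Train/fit a machine learning model using a sample of observations (data points).
$y = f(x) + \varepsilon$, where $\varepsilon$ is the residual / random error / irreducible error. Estimating y is prediction; estimating f(x) is inference.
Statistical learning
Supervised learning: the data come in labeled pairs. Quantitative response: regression problem. Qualitative response: classification problem.
Unsupervised learning: no labels. Clustering, anomaly detection, dimension reduction (e.g. flattening a three-dimensional map to two dimensions).
Semi-supervised learning, self-supervised learning (GPT), reinforcement learning (AlphaGo).
U6 Statistical Learning, Model Selection and Regularization
The training MSE is generally smaller than the testing MSE; the best model is the one where the two are close. As flexibility increases, the training MSE decreases while the testing MSE follows a U shape. If the training MSE is very small but the testing MSE is large, the model is overfitting.
Bias-Variance Tradeoff
Expected test mean squared error (MSE): $E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\varepsilon)$
Bias: refers to the inability of a machine learning model to capture a true relationship.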
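The linear model is inflexible and so has larger bias.
Variance: the difference in fits across different data sets, i.e. the amount by which the predictions would change on new data sets. The linear model still fits the testing set moderately well: lower variance, high generalization, high interpretability. In general, complex models have low bias and high variance.
Simple Linear Regression: $y = \beta_0 + \beta_1 x + \varepsilon$
Least Square Estimation (LSE): choose the coefficients that minimize the residual sum of squares, $\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.
Multiple Linear Regression: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$, where $\beta_0$ is the constant.
Linear Model Selection (also called variable selection, feature selection, or model selection)
1. Best Subset Selection: compare all models containing 0, 1, ..., p variables and choose the best, $2^p$ models in total.
2. Backward Selection: start with all p variables, drop one and keep the best resulting model, then drop one of the remaining p-1 and keep the best, and so on until none is left; $1 + p(p+1)/2$ models in total.
3. Forward Selection: go from 0 to p variables; the total is the same as for backward selection.
The forward-selection result may happen to coincide with the best-subset result, but this is not guaranteed; compare whether the two results are the same.
Regularization (Shrinkage Methods)
Ridge regression: the fit on the training set may differ considerably from the fit on the testing set, so ridge regression sacrifices some training-set MSE (accepts some bias) in order to reduce the variance. The coefficients are fitted by minimizing $\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$; the second term is called a regularised or shrinkage penalty.
When $\lambda$ gets larger, the slope of the line gets smaller, as the optimal slope value gets closer to 0. When $\lambda$ is large enough, the slope (beta coefficient) gets asymptotically closer to zero, but not exactly zero. A smaller slope makes the fitted model's predicted response less sensitive to the training data. We can try a range of $\lambda$ values and use cross-validation to determine which one results in the best prediction performance.
LASSO (least absolute shrinkage and selection operator)
In LASSO, the coefficients are fitted by minimizing $\mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$. This kind of shrinkage penalty can force some coefficients to be exactly zero when the tuning parameter is large enough.
Prediction accuracy. LSE will have a competitive performance in prediction accuracy when the data set is in a low dimension. That is to say, $p \ll n$: the number of predictors is much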
smaller than the number of observations available. Nevertheless, when n is not much larger than p, there might be an issue of overfitting and there will be a high variance of the least square estimation. Worse still, if p > n, there is no unique solution to the least square estimation and therefore it cannot be used at all. In contrast, ridge regression and LASSO can overcome this issue by reducing the variance without too much compromise in the increase of bias. This will lead to an overall improvement in prediction accuracy.
Why shrinkage methods? Besides prediction accuracy: variable selection and interpretability. It is extremely unlikely for LSE to output an exactly zero value for a coefficient. We would need to repeatedly fit different subset models that leave out a subset of predictors every time and test whether the combination of coefficients generates a significant model. LASSO can constrain and set a series of coefficients to be exactly zero while fitting the model only once.
Ridge or LASSO?
LASSO tends to do well if there are a small number of significant predictors and the others are close to zero (that is, when only a few useful predictors significantly influence the response).
Ridge works well if there are many predictors with large coefficients of about the same size (that is, when most of the predictors are useful in predicting the response).
However, in practice, we don't know the true relationship, so the previous two points are somewhat theoretical. A safer way is to try both and run cross-validation to select the more suited model for a specific case.
U7 Descriptive Data Mining
Clustering
K-means clustering: partitions the observations into K distinct and nonoverlapping clusters (running example: points grouped into three colored clusters).
Step 1: choose an integer value for K.
Step 2: initialize a random cluster assignment, i.e. randomly assign every data point to one of the K clusters.
Step 3: identify the cluster centroid for each of the three clusters.
Step 4: compute the distances from each observation to the three cluster centroids respectively.
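Re-assign each observation in the data set to the cluster whose centroid is at the shortest distance.
Step 5: compute the new cluster centroids for each of the three clusters. Repeat steps 4 and 5 until the centroids no longer change.
Clustering is considered good when the dissimilarity within the same cluster is small. The dissimilarity is quantified by the within-cluster variation, which measures how much the observations differ from each other within the same cluster.
Within-cluster variation = the sum of squared distances over all pairs of observations within a cluster (in the usual definition this sum is divided by the number of observations in the cluster).
Hierarchical Clustering: construct a dendrogram, a tree diagram that can be cut at different heights.
1. Each of the data points is considered as a cluster of its own. Compute the distance between every pair of points drawn from two clusters and summarize it by one of the following linkages:
Complete linkage: the maximum distance is taken as the dissimilarity between the two clusters.
Single linkage: the minimum distance.
Average linkage: the mean distance.
2. Compute the dissimilarity between all clusters and merge the two that are least dissimilar.
3. Repeat until everything is fused into a single cluster.
Limitations of K-means and hierarchical clustering:
1. Outliers are not handled automatically; they are forced into some cluster and blur the cluster boundaries.
2. The hierarchical clustering approach is only appropriate when the clusters have some embedded hierarchical relationship: a split into three clusters must be nested within the split into two. For example, people can be grouped by gender (male/female) or by age (old/middle-aged/young); these groupings are not nested, so hierarchical clustering does not suit them.
Frequent Pattern Mining: finding associations and intrinsic relationships.
A set with I items has $2^I - 1$ non-empty subsets.
Support of an itemset: $\mathrm{support}(X) = \frac{\text{number of transactions containing } X}{\text{total number of transactions}}$. Sometimes the absolute frequency is used instead of the ratio.
Frequent itemset mining: find the itemsets whose support is at least min_sup; itemsets below the threshold are infrequent.
Naive algorithm: enumerate every itemset and count its support. Problems: 1. Exponential explosion. 2. At every step, we need to scan the whole transaction data to count the support of each itemset.
min_sup: a hyper-parameter set by users to manage their expectations of results. The minimum threshold value for the support, min_sup, is to avoid inexplicable rules capturing random noise in the data. Setting it too low will lead to a large number of itemsets being qualified as frequent. Those rare itemsets may apply in too few cases to be useful. Setting it too high will give a small number of itemsets. The results would be too generic to be useful and do not provide new knowledge for users.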
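To make the support measure and the naive algorithm concrete, here is a minimal sketch in Python. It is an illustration only: the function names `support` and `naive_frequent_itemsets`, the toy transactions, and the threshold of 0.5 are assumptions, not part of the course material. It deliberately shows both weaknesses listed above: all $2^I - 1$ candidate itemsets are enumerated, and every candidate triggers a full scan of the transactions.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def naive_frequent_itemsets(transactions, min_sup):
    """Enumerate all non-empty itemsets and keep those with support >= min_sup."""
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    for size in range(1, len(items) + 1):             # 2^I - 1 candidates in total
        for candidate in combinations(items, size):
            s = support(candidate, transactions)      # full scan for every candidate
            if s >= min_sup:
                frequent[candidate] = s
    return frequent


# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = naive_frequent_itemsets(transactions, min_sup=0.5)
for itemset, s in result.items():
    print(itemset, s)
```

With only three distinct items the enumeration is cheap, but the candidate count doubles with every additional item, which is why the naive approach does not scale.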
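A general rule of thumb for min_sup is 20%. However, if an itemset is particularly valuable and represents a lucrative opportunity, the minimum support threshold can be lowered.
1. Solving the exponential explosion problem: monotonicity theorems
Theorem 1: If an itemset is frequent, then each of its subsets is frequent too (in the itemset lattice, all the itemsets connected to it are frequent too).
Theorem 2: If an itemset is infrequent, then none of its supersets will be frequent (in the itemset lattice, all the itemsets connected from it are infrequent too).
Apriori algorithm: uses the monotonicity rules to discard some itemsets, because only the combination of 2 frequent itemsets may lead to a frequent itemset.
2. Solving the problem of re-counting at every step: the FP-Growth algorithm, short for "frequent pattern growth".
Steps: 1. Find all frequent base sets (sets with a single item). 2. Sort the items within each transaction in descending order of frequency. 3. Draw the branches of the tree.
Maximal frequent itemset: a frequent itemset is maximal if none of its supersets is frequent.
Closed frequent itemset: a frequent itemset is closed if none of its supersets has the same support.
If an itemset is maximal, then it must also be closed.
An association rule is an implication of the form $A \rightarrow B$, where A and B are itemsets that do not share common items. A: antecedent of the rule. B: consequent of the rule.
One possible measure of a rule is the support, $\mathrm{support}(A \rightarrow B) = \mathrm{support}(A \cup B)$; another is the confidence, $\mathrm{confidence}(A \rightarrow B) = \mathrm{support}(A \cup B) / \mathrm{support}(A)$. Association rule mining looks for rules whose support in the transaction set T is at least min_sup and whose confidence in T is at least min_conf.
U8 Classification Methods
K-Nearest Neighbors (KNN)
Choose a value of k; take the k points nearest to the query point, compute the fraction belonging to each class, and assign the point to the class with the largest fraction. Use an odd k, since an even k easily produces ties. The value of k strongly affects the result.
Logistic Regression
Compared with linear regression, logistic regression keeps the output within the range [0,1].
Simple: $p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$
Multiple: $p(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$
Tree-based Methods
Regression tree: root node, leaf node, internal node; the root node is at the top. The predicted response for an observation is given by the mean response of the training observations that belong to the same terminal node. Each split is chosen to minimize the residual sum of squares (RSS). A tree with many levels: overfitting, lower bias, poor generalization; the remedy is tree pruning.
Classification tree: the predicted response for an observation is given by the most commonly occurring class of training observations in the region to which it belongs.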
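The two leaf-prediction rules just described (mean response for a regression tree, most common class for a classification tree) can be illustrated with a short sketch. It assumes the tree has already been grown and only shows what a single terminal node predicts; the function names and the toy training values are hypothetical.

```python
from collections import Counter
from statistics import mean


def regression_leaf_prediction(training_responses):
    """Regression tree: predict the mean response of the training observations in the leaf."""
    return mean(training_responses)


def classification_leaf_prediction(training_labels):
    """Classification tree: predict the most commonly occurring class in the leaf."""
    return Counter(training_labels).most_common(1)[0][0]


# Hypothetical training observations that fell into one terminal node
leaf_responses = [10.0, 12.0, 14.0]
leaf_labels = ["yes", "no", "yes"]

predicted_value = regression_leaf_prediction(leaf_responses)
predicted_class = classification_leaf_prediction(leaf_labels)
print(predicted_value, predicted_class)
```

Every new observation routed to the same terminal node receives the same prediction, which is why deep trees with very small leaves tend to overfit.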
Classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class: $E = 1 - \max_k \hat{p}_{mk}$, where $\hat{p}_{mk}$ is the proportion of training observations in region m that are from class k.
Gini index: $G = \sum_{k} \hat{p}_{mk}(1 - \hat{p}_{mk})$. The Gini index is small when every $\hat{p}_{mk}$ is close to 0 or 1. As such, the Gini index is also referred to as a measure of node purity, because a small value indicates that a node contains predominantly observations from a single class.
Cross-entropy: $D = -\sum_{k} \hat{p}_{mk} \log \hat{p}_{mk}$
U9 Introduction to GenAI and LLMs
Generative AI can learn from existing artifacts to generate new, realistic artifacts (at scale) that reflect the characteristics of the training data but don't repeat it. It can produce a variety of novel content, such as images, video, music, speech, text, software code and product designs.
Rather than AI replacing humans, the two should collaborate: humans identifying problems, AI suggesting solutions, humans selecting.
Risks: the potential for hallucination, the black-box logic systems, opportunities for cyberattacks, data breaches, copyright concerns, and so on.
LLMs (Large Language Models) are AI programs that understand and generate text, a lot like how humans do. Examples are the GPT models. They use a brain-like system called 'Transformer' to understand the relationship between words, making them really good at grasping the meaning of a whole text.
Natural Language Generation (NLG), Natural Language Understanding (NLU)
Tokenization = how the model "sees" the prompt.
Base LLMs = how the foundation model "processes" a prompt.
Instruction-tuned LLMs: an instruction-tuned LLM starts with the foundation model and fine-tunes it with examples or input/output pairs, making it more specialized.
What LLMs cannot do well: 1. Knowledge cutoffs 2. Hallucinations 3. Input (and output) length is limited 4. Bias and toxicity.
Issues: transparency, accountability, information hazards, misinformation spread, malicious use, accuracy, intellectual property (IP) and copyright, environmental impact, AI regulations.
Prompt Engineering: prompts can be a single instruction or a series of instructions. Prompt engineering is the process of designing and optimizing text inputs (prompts) to deliver consistent and quality responses (completions) for a given application objective and model.
Prompt components: instructions, context, input data, output indicator.
Prompt formats: question format, statement format, instruction format.
Prompt instruction refers to specific guidelines or directives given to an AI model to shape the nature, style, or format of the response.
U10 Monte Carlo simulation
Monte Carlo simulation consists of independent trials in which the results for one trial do not affect what happens in subsequent trials.
Discrete-event simulation involves trials that represent how a system evolves over time. One common application of discrete-event simulation is the analysis of waiting lines (software: Arena®, ProModel®, and Simio®).
Agent-based modeling studies the emergent behaviors (i.e. the collective behavior of the system) due to the interactions and behaviors of individual agents.
System dynamic modelling, for chaotic systems: it relies on discrete-event simulation and numeric methods to determine the behavior of components within that system.
Uniform probability distribution.
Advantages of Simulation
It is conceptually easy to understand, and the methods can be used to model and learn about the behavior of complex systems that would be difficult, if not impossible, to deal with analytically.
Simulation models are flexible. They can be used to describe systems without requiring the assumptions that are often required by other mathematical models.
A simulation model provides a convenient experimental laboratory for the real system.
Simulation models frequently warn against poor decision strategies by projecting disastrous outcomes such as system failures, large financial losses, and so on.
Limitations of Simulation
For complex systems, the process of developing, verifying, and validating a simulation model can be time-consuming and expensive.
As with all mathematical models, the analyst must be conscious of the assumptions of the model in order to understand its limitations.
In addition, each simulation run provides only a sample of output data. As such, the summary of the simulation data provides only estimates or approximations about the real system.
Nonetheless, the danger of obtaining poor solutions is greatly mitigated if the analyst exercises good judgment in developing the simulation model and follows proper verification and validation steps. Furthermore, if a sufficiently large set of simulation trials is run under a wide variety of conditions, the analyst will likely have sufficient data to predict how the real system will operate.
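As a concrete illustration of a Monte Carlo simulation built from independent trials, the sketch below simulates the profit of a hypothetical product whose demand follows a uniform probability distribution. The model (unit margin, fixed cost, demand range) and the function name `simulate_profit` are invented for this example; only Python's standard library is used.

```python
import random
import statistics


def simulate_profit(n_trials, seed=0):
    """Run independent trials of a simple profit model with uniformly distributed demand."""
    rng = random.Random(seed)      # fixed seed so the run is reproducible
    unit_margin = 5.0              # hypothetical profit per unit sold
    fixed_cost = 400.0             # hypothetical fixed cost per period
    profits = []
    for _ in range(n_trials):
        demand = rng.uniform(50, 150)          # one trial does not affect the next
        profits.append(unit_margin * demand - fixed_cost)
    return profits


profits = simulate_profit(10_000)
mean_profit = statistics.mean(profits)
loss_probability = sum(p < 0 for p in profits) / len(profits)
print(round(mean_profit, 2), round(loss_probability, 3))
```

For this toy model the exact answers are known (expected profit 100, probability of a loss 0.3), so the simulated values can be compared against them. They will be close but not identical, which reflects the point above that a simulation run provides only estimates of the real system.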