DSCI550: Data Science at Scale
Seon Ho Kim, Ph.D.
seonkim@usc.edu

Classification and Prediction

Outline
• Classification vs. prediction
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian classification
• Prediction
• Summary
• Reference

Classification vs. Prediction
• Classification
  – predicts categorical class labels (discrete or nominal)
  – classifies data (constructs a model) based on the training dataset and uses it to classify new data
  – e.g., face recognition, medical diagnosis
• Prediction
  – models continuous-valued functions, i.e., predicts unknown or missing values
  – e.g., housing price prediction, temperature prediction

Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
  – Each sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  – Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • The accuracy rate is the percentage of test set samples that are correctly classified by the model
    • The test set must be independent of the training set, otherwise overfitting will occur
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction
• Training data:

  NAME     RANK             YEARS   TENURED
  Mike     Assistant Prof   3       no
  Mary     Assistant Prof   7       yes
  Bill     Professor        2       yes
  Jim      Associate Prof   7       yes
  Dave     Assistant Prof   6       no
  Anne     Associate Prof   3       no

• A classification algorithm produces the classifier (model), e.g.:
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
• Testing data:

  NAME     RANK             YEARS   TENURED
  Tom      Assistant Prof   2       no
  Merlisa  Associate Prof   7       no
  George   Professor        5       yes
  Joseph   Assistant Prof   7       yes

• Unseen data: (Jeff, Professor, 4). Tenured?
• (A small code sketch of this two-step process appears at the end of this section.)

Issues Regarding Classification and Prediction (1): Data Preparation
• Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
  – Remove irrelevant or redundant attributes
• Data transformation
  – Generalize and/or normalize data

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
• Accuracy: classifier and predictor accuracy
• Speed
  – time to construct the model (training time)
  – time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency as data size grows
• Interpretability: understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Decision Tree

Decision Trees
• A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences.
• Sort instances (data) according to feature values (e.g., age, income): a hierarchy of tests
  – data are classified/sorted according to feature values that become increasingly specific
• Nodes: features
  – Root node: the feature that best divides the data
    • Algorithms exist for determining the best root node
• Branches: the values the node can assume

Classification Tree
[Figure: two candidate classification trees over the same data.] Which set of classification rules is more efficient?
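To make the two-step process concrete, here is a minimal Python sketch (our own illustration, not part of the slides) that hard-codes the rule quoted above as the "constructed model", estimates its accuracy on the test set, and then classifies the unseen tuple (Jeff, Professor, 4). The function name and data layout are assumptions made for this sketch.

```python
# Minimal sketch of the two-step classification process (illustrative only).
# Step 1 normally learns the model from the training set; here we hard-code
# the rule shown above: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.

def classify(rank, years):
    """Model produced in step 1 (model construction)."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2 (model usage): estimate accuracy on an independent test set.
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(classify(rank, years) == label for _, rank, years, label in test_set)
print(f"accuracy = {correct / len(test_set):.2f}")   # 3 of 4 correct -> 0.75

# If the accuracy is acceptable, classify unseen data.
print(classify("Professor", 4))                      # (Jeff, Professor, 4) -> 'yes'
```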
Training Dataset (Example of "Buying a Computer")

  age      income   student   credit_rating   buys_computer
  <=30     high     no        fair            no
  <=30     high     no        excellent       no
  31…40    high     no        fair            yes
  >40      medium   no        fair            yes
  >40      low      yes       fair            yes
  >40      low      yes       excellent       no
  31…40    low      yes       excellent       yes
  <=30     medium   no        fair            no
  <=30     low      yes       fair            yes
  >40      medium   yes       fair            yes
  <=30     medium   yes       excellent       yes
  31…40    medium   no        excellent       yes
  31…40    high     yes       fair            yes
  >40      medium   no        excellent       no

Output: A Decision Tree for "buys_computer"
[Figure: the induced decision tree]
  age?
    <=30   -> student?        (no -> no,  yes -> yes)
    31…40  -> yes
    >40    -> credit_rating?  (excellent -> no, fair -> yes)

Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (if continuous-valued, they are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is employed to label the leaf)
  – There are no samples left

Attribute Selection Using Information
• Quantifying information is the foundation of the field of information theory.
• The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Events that are rare (low probability) are more surprising and therefore carry more information than events that are common (high probability).
  – Low-probability event: high information (surprising).
  – High-probability event: low information (unsurprising).
• The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.
• We can quantify how much information there is in a random variable. For example, the information for a random variable X with probability distribution p might be written as a function H(X).
• In effect, calculating the information for a random variable is the same as calculating the information for the probability distribution of the events for that random variable.
• Calculating the information for a random variable is called "information entropy," "Shannon entropy," or simply "entropy."
• Information gain measures the reduction in entropy (surprise) obtained by splitting a dataset according to a given value of a random variable.
• A larger information gain suggests lower-entropy groups of samples, and hence less surprise.
• It is commonly used in the construction of decision trees from a training dataset: evaluate the information gain for each variable and select the variable that maximizes the information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification.

Attribute Selection Measure: Information Gain
• Select the attribute with the highest information gain.
• Suppose S contains si tuples of class Ci, for i = 1, …, m, and s = |S|.
• Expected information (entropy) required to classify an arbitrary tuple:
  I(s1, s2, …, sm) = - Σ_{i=1..m} (si/s) log2(si/s)
• Entropy of attribute A with values {a1, a2, …, av}, where sij is the number of class-Ci tuples in the subset with A = aj:
  E(A) = Σ_{j=1..v} [ (s1j + … + smj) / s ] * I(s1j, …, smj)
• Information gained by branching on attribute A:
  Gain(A) = I(s1, s2, …, sm) - E(A)
• (These formulas are implemented in the short sketch below.)
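A short Python sketch (ours, for illustration; the function names are not from the slides) that implements the three formulas directly. The printed values match the worked example in the next section, up to rounding of intermediate results.

```python
import math

def info(counts):
    """I(s1,...,sm): expected information (entropy) for a list of class counts."""
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)

def entropy_of_attribute(partitions):
    """E(A): weighted entropy after splitting on attribute A.
    `partitions` is one class-count list per attribute value, e.g. [[2, 3], [4, 0], [3, 2]]."""
    s = sum(sum(p) for p in partitions)
    return sum((sum(p) / s) * info(p) for p in partitions)

def gain(class_counts, partitions):
    """Gain(A) = I(s1,...,sm) - E(A)."""
    return info(class_counts) - entropy_of_attribute(partitions)

# The 'buys_computer' data has 9 'yes' and 5 'no' tuples.
print(f"{info([9, 5]):.3f}")                            # 0.940
# Splitting on age: <=30 has 2 yes / 3 no, 31..40 has 4/0, >40 has 3/2.
print(f"{gain([9, 5], [[2, 3], [4, 0], [3, 2]]):.3f}")  # 0.247 (the slides' 0.246 uses rounded intermediates)
```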
Attribute Selection by Information Gain Computation
• Class P: buys_computer = "yes"; class N: buys_computer = "no" (from the training table above: 9 "yes" and 5 "no" tuples)
• I(p, n) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.4097 + 0.5307 = 0.940
• Compute the entropy for age:

  age     pi   ni   I(pi, ni)
  <=30    2    3    0.971
  31…40   4    0    0
  >40     3    2    0.971

  E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  (The term (5/14) I(2,3) means that "age <= 30" covers 5 of the 14 samples, with 2 yes's and 3 no's.)
• Hence Gain(age) = I(p, n) - E(age) = 0.246
• Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age, having the highest information gain, is selected as the splitting attribute.

Selecting Attributes
[Figure: after splitting on age, the 31…40 branch is already pure ("yes"); for the <=30 and >40 partitions of the training table, which attribute should be tested next?]

Computing Information Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• We must determine the best split point for A
  – Sort the values of A in increasing order
  – Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    • (ai + ai+1)/2 is the midpoint between the values ai and ai+1
  – The point with the minimum expected information requirement for A is selected as the split point for A
• Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point

Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example (one rule per leaf of the tree above; see the sketch after this list):
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31…40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
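As a rough illustration (our own code, not from the slides), the extracted rules can be written directly as a small function, one branch per root-to-leaf path of the tree above:

```python
def buys_computer(age, student, credit_rating):
    """One IF-THEN rule per root-to-leaf path of the induced tree (illustrative)."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # age == ">40": the remaining paths test credit_rating
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))      # yes
print(buys_computer(">40", "no", "excellent"))   # no
```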
Avoid Overfitting in Classification
• Overfitting: an induced tree may overfit the training data
  – Too many branches, some of which may reflect anomalies (noise or outliers)
  – Poor accuracy for unseen samples
• Two approaches to avoid overfitting
  – Prepruning: halt tree construction early; do not split a node if the split would cause the goodness measure to fall below a threshold
    • It is difficult to choose an appropriate threshold
  – Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross-validation
• Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node improves the entire distribution

Decision Trees: Assessment
• Advantages
  – Classification of data based on limiting features is intuitive
  – Handles discrete/categorical features best
• Limitations
  – Danger of "overfitting" the data
  – Not the best choice for accuracy

Bayesian Classification

Bayesian Classification: Why?
• Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
• Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
• Let X be a data sample whose class label is unknown
• Let H be the hypothesis that X belongs to class C
• For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
• P(H): prior probability of hypothesis H (the initial probability before we observe any data; it reflects background knowledge)
• P(X): probability that the sample data is observed
• P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayesian Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H|X) = P(X|H) P(H) / P(X)
• Informally: posterior = likelihood x prior / evidence
• Practical difficulty: it requires initial knowledge of many probabilities, at significant computational cost

Naive Bayes Classifier
• A simplifying assumption: attributes are conditionally independent given the class:
  P(X|Ci) = Π_{k=1..n} P(xk|Ci)
• For example, the probability of observing two attribute values y1 and y2 together, given class C, is the product of their individual probabilities given that class:
  P([y1, y2] | C) = P(y1 | C) * P(y2 | C)
• No dependence relation between attributes is assumed
• This greatly reduces the computation cost: only the class distributions need to be counted
• Once P(X|Ci) is known, assign X to the class with the maximum P(X|Ci) * P(Ci)

Steps
1. Convert the data set into a frequency table.
2. Create a likelihood table by finding the probabilities.
3. Use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
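A tiny illustrative sketch (ours, not from the slides) of these two relationships in Python; the worked examples on the following slides plug concrete frequencies into exactly these expressions. The function names are assumptions.

```python
import math

def naive_bayes_score(conditionals, prior):
    """P(X|Ci) * P(Ci) under the conditional-independence assumption,
    where `conditionals` holds the per-attribute probabilities P(xk|Ci)."""
    return math.prod(conditionals) * prior

def bayes_posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood * prior / evidence
```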
Question: players will play if the weather is sunny. Is this statement correct?
• We can solve it using the posterior-probability method discussed above.
• P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
• Here P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
• So P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher of the two posteriors, so the prediction is that players will play.

Training Dataset
• Use the "buying a computer" training data shown earlier (9 "yes" and 5 "no" tuples).
• Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"
• Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naive Bayesian Classifier: An Example
• Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
• For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = "yes") = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 * 0.4 * 0.2 * 0.4 = 0.019
• Multiplying by the priors P(Ci):
  P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
• Therefore, X belongs to class buys_computer = "yes".
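The same result can be reproduced by counting the training table directly. The following is a hedged sketch (the table is hard-coded and the helper names are ours); it prints approximately 0.028 for "yes" and 0.007 for "no", matching the hand calculation above.

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer): the 14 training tuples.
data = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),     (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),    (">40", "medium", "no", "excellent", "no"),
]

def likelihood(attr_index, value, cls):
    """P(attribute = value | class), estimated from frequencies."""
    in_class = [row for row in data if row[-1] == cls]
    return sum(row[attr_index] == value for row in in_class) / len(in_class)

x = ("<=30", "medium", "yes", "fair")             # the sample to classify
class_counts = Counter(row[-1] for row in data)   # {'yes': 9, 'no': 5}

for cls in ("yes", "no"):
    p = class_counts[cls] / len(data)             # prior P(Ci)
    for i, value in enumerate(x):
        p *= likelihood(i, value, cls)            # multiply in P(xk | Ci)
    print(cls, round(p, 3))                       # yes ~0.028, no ~0.007
```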
The Bayesian Doctor Example
• A person doesn't feel well and goes to the doctor.
• Assume two states of nature:
  – w1: the person has a common flu
  – w2: the person is really sick (a vicious bacterial infection)
• The doctor's prior is p(w1) = 0.9 and p(w2) = 0.1.
• The doctor has two possible actions: prescribe hot tea or prescribe antibiotics. Using the prior alone, the doctor predicts optimally: always flu. Therefore the doctor will always prescribe hot tea.
• But this carries very high risk: although the doctor can diagnose with a very high rate of success using the prior, he or she can lose a patient once in a while.
• Denote the two possible actions:
  a1 = prescribe hot tea
  a2 = prescribe antibiotics
• Now assume the following cost (loss) matrix λ(i, j), with one row per action and one column per state:

            w1    w2
  a1         0    10
  a2         1     0

The Bayesian Doctor - Cont.
• Choosing a1 results in an expected risk of
  R(a1) = p(w1) * λ(1,1) + p(w2) * λ(1,2) = 0 + 0.1 * 10 = 1
• Choosing a2 results in an expected risk of
  R(a2) = p(w1) * λ(2,1) + p(w2) * λ(2,2) = 0.9 * 1 + 0 = 0.9
• So, considering the costs, it is much better (and optimal!) to always give antibiotics.

The Bayesian Doctor - Cont.
• But doctors can do more. For example, they can take observations.
• A reasonable observation is to perform a blood test. Suppose the possible results of the blood test are:
  x1 = negative (no bacterial infection)
  x2 = positive (infection)
• But blood tests can also fail. Suppose the class-conditional probabilities are:
  p(x1 | w1) = 0.8,  p(x2 | w1) = 0.2
  p(x1 | w2) = 0.3,  p(x2 | w2) = 0.7
• Define the conditional risk given the observation:
  R(ai | xj) = Σ_k p(wk | xj) * λ(i, k)
• We would like to compute the conditional risk for each action and observation so that the doctor can choose an optimal action that minimizes the risk.
• How can we compute p(wk | xj)? We use the class-conditional probabilities and Bayes' inversion rule.

The Bayesian Doctor - Cont.
• First calculate p(x1) and p(x2):
  p(x1) = p(x1 | w1) p(w1) + p(x1 | w2) p(w2) = 0.8 * 0.9 + 0.3 * 0.1 = 0.75
  p(x2) is complementary to p(x1), so p(x2) = 0.25
• Conditional risks for a negative test (x1):
  R(a1 | x1) = 0 * p(w1 | x1) + 10 * p(w2 | x1) = 10 * p(x1 | w2) p(w2) / p(x1) = 10 * 0.3 * 0.1 / 0.75 = 0.4
  R(a2 | x1) = 1 * p(w1 | x1) + 0 * p(w2 | x1) = p(x1 | w1) p(w1) / p(x1) = 0.8 * 0.9 / 0.75 = 0.96
• Conditional risks for a positive test (x2):
  R(a1 | x2) = 0 * p(w1 | x2) + 10 * p(w2 | x2) = 10 * p(x2 | w2) p(w2) / p(x2) = 10 * 0.7 * 0.1 / 0.25 = 2.8
  R(a2 | x2) = 1 * p(w1 | x2) + 0 * p(w2 | x2) = p(x2 | w1) p(w1) / p(x2) = 0.2 * 0.9 / 0.25 = 0.72

The Bayesian Doctor - Cont.
• To summarize:
  R(a1 | x1) = 0.4    R(a2 | x1) = 0.96
  R(a1 | x2) = 2.8    R(a2 | x2) = 0.72
• Whenever we encounter an observation x, we can minimize the expected loss by minimizing the conditional risk.
• This makes sense: the doctor chooses hot tea if the blood test is negative, and antibiotics otherwise.

Optimal Bayes Decision Strategies
• A strategy or decision function a(x) is a mapping from observations to actions.
• The total risk of a decision function is
  E[R(a(x) | x)] = Σ_x p(x) * R(a(x) | x)
• A decision function is optimal if it minimizes the total risk. This optimal total risk is called the Bayes risk.
• In the Bayesian doctor example:
  – Total risk if the doctor always gives antibiotics: 0.9
  – Bayes risk: 0.75 * 0.4 + 0.25 * 0.72 = 0.48

Prediction and Linear Regression

What Is Prediction?
• (Numerical) prediction is similar to classification:
  – construct a model
  – use the model to predict a continuous or ordered value for a given input
• Prediction is different from classification:
  – classification predicts a categorical class label
  – prediction models continuous-valued functions
• The major method for prediction is regression:
  – model the relationship between one or more independent (predictor) variables and a dependent (response) variable
• Regression analysis includes
  – linear and multiple regression
  – non-linear regression
  – other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

Linear Regression
• In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory (independent) variables denoted x. The case of one explanatory variable is called simple linear regression.
• [Figure: scatter plot of Y versus X.] How would you draw a line through the points? How do you determine which line "fits best"?
• [Figure: several candidate lines through an Advertising vs. Sales scatter plot.] Which is more logical?

Types of Regression Models
• Regression models are categorized by the number of explanatory variables:
  – 1 explanatory variable: simple regression
  – 2+ explanatory variables: multiple regression
• Each can be either linear or non-linear.

Least Squares
• "Best fit" means that the differences between the actual Y values and the predicted Y values are a minimum. But positive differences offset negative ones, so we square the errors.
• Least squares (LS) minimizes the sum of squared differences (errors), SSE:
  SSE = Σ_{i=1..n} (Yi - Ŷi)^2 = Σ_{i=1..n} εi^2
• Graphically, for a fitted line Ŷ = b0 + b1 X, each residual εi = Yi - Ŷi is the vertical distance from point i to the line, and LS minimizes ε1^2 + ε2^2 + … + εn^2.

Coefficient Equations
• Prediction equation: ŷi = b0 + b1 xi
• Sample slope: b1 = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)^2
• Sample y-intercept: b0 = ȳ - b1 x̄
• (A short code sketch of these formulas follows below.)
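The coefficient equations translate directly into a few lines of Python (our own sketch; the function name is an assumption). The worked example that follows applies the same formulas by hand.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) minimizing the sum of squared errors for y ~ b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar
    return b0, b1
```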
Linear Regression Example
• Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions:
  – What linear regression equation best predicts statistics performance, based on math aptitude scores?
  – If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
  – How well does the regression equation fit the data?
• The xi column of the data table shows scores on the aptitude test, and the yi column shows statistics grades; the sums and means used below (x̄ = 78, ȳ = 77) are computed from it. [Table not reproduced here.]

Linear Regression Example (Cont…)
• The regression equation is a linear equation of the form ŷ = b0 + b1 x. To conduct a regression analysis, we need to solve for b0 and b1 (the values giving the minimum sum of squared residuals):
  b1 = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)^2 = 470 / 730 = 0.644
  b0 = ȳ - b1 * x̄ = 77 - (0.644)(78) = 26.768
• Therefore, the regression equation is ŷ = 26.768 + 0.644 x.

Linear Regression Example (Cont…)
• Using the regression equation:
  – Choose a value for the independent variable x, perform the computation, and you have an estimated value ŷ for the dependent variable.
  – In our example, the independent variable is the student's score on the aptitude test and the dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade would be:
    ŷ = 26.768 + 0.644 x = 26.768 + 0.644 * 80 = 26.768 + 51.52 = 78.288
• Warning: when you use a regression equation, do not use values of the independent variable that are outside the range of values used to create the equation. That is called extrapolation, and it can produce unreasonable estimates.
  – In this example, the aptitude test scores used to create the regression equation ranged from 60 to 95, so only use values inside that range to estimate statistics grades. Using values outside that range (less than 60 or greater than 95) is problematic.

Linear Regression Example (Cont…)
• Whenever you use a regression equation, you should ask how well the equation fits the data. One way to assess fit is the coefficient of determination, which can be computed from the following formula:
  R^2 = { (1/N) * Σ [ (xi - x̄)(yi - ȳ) ] / (σx * σy) }^2
  where N is the number of observations used to fit the model, Σ is the summation symbol, xi is the x value for observation i, x̄ is the mean x value, yi is the y value for observation i, ȳ is the mean y value, σx is the standard deviation of x, and σy is the standard deviation of y.
• Applying this formula to the sample problem gives a coefficient of determination of about 0.48, which indicates that about 48% of the variation in statistics grades (the dependent variable) can be explained by the relationship to math aptitude scores (the independent variable). (A numerical check follows below.)
• R^2 = 1 would indicate that the regression line perfectly fits the data.
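A quick numerical check of the example (our own sketch). The five (x, y) pairs below are not from the slides: the original data table was lost in extraction, so these values are a reconstruction chosen to be consistent with the quoted sums, means, and score range; treat them as illustrative.

```python
from statistics import mean, pstdev

# Reconstructed illustrative data: aptitude scores x, statistics grades y (an assumption).
x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]

x_bar, y_bar = mean(x), mean(y)                                   # 78, 77
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)                         # 470/730 ~ 0.644
b0 = y_bar - b1 * x_bar                                           # ~26.78 (26.768 on the slide uses the rounded slope)
print(round(b0, 2), round(b1, 3))
print(round(b0 + b1 * 80, 3))                                     # predicted grade ~78.288

# Coefficient of determination (squared correlation).
n = len(x)
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n * pstdev(x) * pstdev(y))
print(round(r ** 2, 2))                                           # ~0.48
```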
Nonlinear Regression
• Some nonlinear models can be modeled by a polynomial function.
• A polynomial regression model can be transformed into a linear regression model. For example,
  y = w0 + w1 x + w2 x^2 + w3 x^3
  is convertible to linear form with the new variables x2 = x^2 and x3 = x^3:
  y = w0 + w1 x + w2 x2 + w3 x3
• Other functions, such as the power function, can also be transformed into a linear model.
• Some models are intractably nonlinear (e.g., a sum of exponential terms); it is still possible to obtain least-squares estimates through extensive calculation on more complex formulae.

Other Regression-Based Models
• Generalized linear model:
  – the foundation on which linear regression can be applied to modeling categorical response variables
  – the variance of y is a function of the mean value of y, not a constant
  – logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
  – Poisson regression: models data that exhibit a Poisson distribution
• Log-linear models (for categorical data):
  – approximate discrete multidimensional probability distributions
  – also useful for data compression and smoothing
• Regression trees and model trees:
  – trees that predict continuous values rather than class labels

Summary
• Classification and prediction can be used to extract models describing important data classes or to predict future data trends.
• Effective and scalable methods have been developed for decision tree induction, Naive Bayes classification, Bayesian belief networks, rule-based classifiers, Support Vector Machines (SVM), associative classification, nearest-neighbor classifiers, case-based reasoning, and other classification methods such as genetic algorithms and rough set and fuzzy set approaches.
• Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.

Enhancements to Basic Decision Tree Induction
• Allow for continuous-valued attributes
  – Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values
  – Assign the most common value of the attribute
  – Assign a probability to each of the possible values
• Attribute construction
  – Create new attributes based on existing ones that are sparsely represented
  – This reduces fragmentation, repetition, and replication

Classification in Large Databases
• Classification is a classical problem extensively studied by statisticians and machine learning researchers.
• Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
• Use distributed training, for example MapReduce-style methods that train on data partitioned across different machines, as sketched below.
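As a very rough sketch of that idea (our own illustration, not tied to any particular framework), the sufficient statistics for a Naive Bayes classifier can be computed as per-partition counts (the map step) and then merged (the reduce step):

```python
from collections import Counter
from functools import reduce

def map_counts(partition):
    """Map: count (class, attribute_index, value) occurrences in one data partition."""
    counts = Counter()
    for *attrs, label in partition:
        counts[("class", label)] += 1
        for i, v in enumerate(attrs):
            counts[(label, i, v)] += 1
    return counts

def reduce_counts(c1, c2):
    """Reduce: merge the partial counts from two partitions."""
    return c1 + c2

# Each partition could live on a different machine; here they are just lists.
partitions = [
    [("<=30", "high", "no", "fair", "no"), (">40", "low", "yes", "fair", "yes")],
    [("31..40", "high", "no", "fair", "yes"), ("<=30", "medium", "yes", "excellent", "yes")],
]
total = reduce(reduce_counts, map(map_counts, partitions))
print(total[("class", "yes")])   # 3: the global count recovered from the local counts
```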