Date: 2023-10-23
BANA 560 — Homework 2:
Classification and Prediction
Assignment
This assignment contains two tasks. Detailed instructions and the related datasets are
available on the Canvas course site.
Task 1: Boston Housing
This dataset contains information collected by the US Census Bureau concerning
housing prices in the area of Boston, Massachusetts. The dataset has 506 cases. Each
case contains summary data for a housing tract (neighborhood). It was obtained from
the StatLib archive (http://lib.stat.cmu.edu/datasets/boston). There are 14 attributes
in each case of the dataset. They are:
CRIM Per capita crime rate by town
ZN Proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS Proportion of non-retail business acres per town.
CHAS Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX Nitric oxides concentration (parts per 10 million)
RM Average number of rooms per dwelling
AGE Proportion of owner-occupied units built prior to 1940
DIS Weighted distances to five Boston employment centers
RAD Index of accessibility to radial highways
TAX Full-value property-tax rate per $10,000
PTRATIO Pupil-teacher ratio by town
B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT % lower status of the population
MEDV Median value of owner-occupied homes (in $1000s)
The median value of owner-occupied homes (MEDV) is believed to be dependent
upon all or some of the other thirteen variables.
a. Why should the data be partitioned into training and validation sets? For what will
the training set be used?
b. Fit a multiple linear regression model to the median house price (MEDV) as a
function of the other thirteen variables. Please write down the equation for predicting
the median house price from the thirteen variables.
c. Fit a multiple linear regression model to the median house price (MEDV) as a
function of CRIM, CHAS, and RM. Please write down the equation for predicting the
median house price from these three variables.
d. Use the model created in part c to answer the following question: What median house
price is predicted for a tract in the Boston area that does not bound the Charles River,
has a crime rate of 0.1, and where the average number of rooms per house is 6?
The deliverable for this task is a Word document with answers to the above
questions.
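For parts b through d, the fitting and prediction steps can be sketched in plain Python as below. This is only an illustration under stated assumptions: the helper names fit_linear_regression, format_equation, and predict are hypothetical, and the six toy rows are invented from a made-up equation, so the printed coefficients are not the Boston Housing answers. With the real data, the same calls would be run on the training partition.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: list of predictor lists; targets: list of response values.
    Returns [intercept, coef_1, ..., coef_k].
    """
    # Prepend a constant 1.0 so the first coefficient is the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot solve")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def format_equation(response, names, coefs):
    """Write the fitted model as 'response = b0 + b1*x1 ...'."""
    terms = [f"{coefs[0]:.3f}"]
    for name, c in zip(names, coefs[1:]):
        sign = "+" if c >= 0 else "-"
        terms.append(f"{sign} {abs(c):.3f}*{name}")
    return f"{response} = " + " ".join(terms)


def predict(coefs, values):
    """Evaluate the fitted equation for one case."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], values))


# Toy rows (CRIM, CHAS, RM): invented values, NOT the Boston data.
rows = [(0.1, 0, 6.0), (0.5, 1, 5.5), (1.2, 0, 7.0),
        (0.3, 1, 6.5), (2.0, 0, 5.0), (0.8, 1, 7.5)]
# Toy targets generated from a made-up equation, purely for illustration.
targets = [2.0 - 0.5 * c + 3.0 * h + 4.0 * r for c, h, r in rows]

coefs = fit_linear_regression(rows, targets)
equation = format_equation("MEDV", ["CRIM", "CHAS", "RM"], coefs)
print(equation)
# Part d style query: CRIM = 0.1, CHAS = 0 (not on the river), RM = 6.
predicted = predict(coefs, [0.1, 0, 6])
print(predicted)
```

Solving the normal equations directly keeps the sketch dependency-free; a statistics package would give the same coefficients along with standard errors and fit statistics.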
Task 2: Charles Book Club (CBC)
A new title, "The Art History of Florence," is ready for release. CBC has sent a test
mailing to a random sample of 4,000 customers from its customer base. The customer
responses have been collated with past purchase data. The data can be randomly
partitioned into three parts:
Training Data: initial data used to fit response models.
Validation Data: hold-out data used to compare the performance of different
response models.
Test Data: data used only after a final model has been selected, to estimate the
likely accuracy of the model when it is deployed.
The sample data are in the file CharlesBookClub.xls. Each row (or case) in the spreadsheet (other
than the header) corresponds to one market test customer. Each column is a variable
with the header row giving the name of the variable. The variable names and
descriptions are given in Table 1, below.
Table 1: List of Variables in CharlesI.xls
Variable Name   Description
Seq#            Sequence number in the partition
ID#             Identification number in the full (unpartitioned) market test data set
Gender          0 = Male, 1 = Female
M               Monetary: Total money spent on books
R               Recency: Months since last purchase
F               Frequency: Total number of purchases
FirstPurch      Months since first purchase
ChildBks        Number of purchases from the category: Child books
Youth           Number of purchases from the category: Youth books
CookBks         Number of purchases from the category: Cookbooks
DoItYBks        Number of purchases from the category: Do It Yourself books
RefBks          Number of purchases from the category: Reference books (Atlases, Encyclopedias, Dictionaries)
ArtBks          Number of purchases from the category: Art books
GeoBks          Number of purchases from the category: Geography books
ItalCook        Number of purchases of book title: "Secrets of Italian Cooking"
ItalAtlas       Number of purchases of book title: "Historical Atlas of Italy"
ItalArt         Number of purchases of book title: "Italian Art"
Florence        = 1 if "The Art History of Florence" was bought, = 0 if not
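As a rough illustration of the random three-way partition described in the task introduction, the sketch below shuffles row indices and cuts them into training, validation, and test sets. The 50/30/20 proportions, the seed, and the function name partition_rows are assumptions made for this example; the assignment does not specify them.

```python
import random


def partition_rows(n_rows, fractions=(0.5, 0.3, 0.2), seed=1):
    """Randomly split row indices 0..n_rows-1 into training, validation, test.

    fractions are (training, validation, test) shares and must sum to 1.
    The defaults are illustrative assumptions, not values from the assignment.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_rows))
    # A seeded generator makes the partition reproducible.
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * fractions[0])
    n_valid = round(n_rows * fractions[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains goes to test
    return train, valid, test


# 4,000 customers received the test mailing.
train_idx, valid_idx, test_idx = partition_rows(4000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Every row lands in exactly one part, so the test rows stay untouched until a final model has been chosen.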
Identify the target segment of customers to whom you will send the mailers, using
the logistic regression data mining technique. Conduct four trials by selecting
variables as follows. For each trial, report the performance information.
1. The full set of 15 predictors as independent variables and "Florence" as the
dependent variable.
2. Best subset selection of six variables. Choose six variables that you think are
important as independent variables and "Florence" as the dependent variable.
3. Change the training data set so that it contains the same number of rows with
Florence = 1 as with Florence = 0. Run the full set of 15 predictors as
independent variables and "Florence" as the dependent variable.
4. For the newly generated training data set, choose six variables that you think
are important as independent variables and "Florence" as the dependent
variable.
5. Select one model from above generated models that you believe to be the best.
Apply the model to the test data for prediction, identifying customers who will
respond to your mail.
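The rebalancing in trials 3 and 4 can be sketched as random undersampling of the larger class. This is one possible reading of "same number of rows", chosen because the follow-up questions speak of cutting the training set; the function name balance_by_undersampling, the seed, and the toy rows are assumptions for illustration, with rows represented as dictionaries keyed by variable name.

```python
import random


def balance_by_undersampling(rows, target="Florence", seed=1):
    """Return a shuffled subset with equal counts of target == 1 and target == 0.

    rows: list of dicts, one per customer, each with a 0/1 value under `target`.
    """
    rng = random.Random(seed)
    buyers = [r for r in rows if r[target] == 1]
    non_buyers = [r for r in rows if r[target] == 0]
    # Keep as many rows per class as the smaller class can supply.
    n = min(len(buyers), len(non_buyers))
    balanced = rng.sample(buyers, n) + rng.sample(non_buyers, n)
    rng.shuffle(balanced)  # avoid having all buyers listed first
    return balanced


# Toy training rows: 2 buyers out of 10, invented for illustration.
toy_rows = [{"ID#": i, "Florence": 1 if i < 2 else 0} for i in range(10)]
balanced_rows = balance_by_undersampling(toy_rows)
print(len(balanced_rows), sum(r["Florence"] for r in balanced_rows))
```

Only the training partition should be rebalanced this way; the validation and test partitions keep their original class proportions so that measured performance reflects the real customer base.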
The deliverables of this task include the classification matrix for each of the four
models, the worksheet with the test data and the predictions from the model, and
answers to the following questions:
1. When we cut the size of the training data set so that it contains the same
number of rows with Florence = 1 as with Florence = 0, what is the
impact of this change on model accuracy? Please explain why.
2. Please explain why you selected the model you chose as the best for the task
of identifying customers who will respond.
3. After using the model for prediction, please sort the "test" worksheet in
descending order based on a person's probability of responding. Please report
the number of responding customers the model recognizes if we set the cutoff
probability at 95%, 75%, and 50%.
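The classification matrix deliverable and the cutoff counts in question 3 can be sketched as follows. The function names, the toy probabilities, and the choice to treat a probability equal to the cutoff as a predicted responder are assumptions for illustration; the numbers are invented and are not results from the CBC data.

```python
def count_at_cutoffs(probabilities, cutoffs=(0.95, 0.75, 0.50)):
    """Count customers whose predicted probability meets each cutoff."""
    # Sorting in descending order mirrors the worksheet step in question 3.
    ranked = sorted(probabilities, reverse=True)
    return {c: sum(1 for p in ranked if p >= c) for c in cutoffs}


def classification_matrix(actual, probabilities, cutoff=0.5):
    """Tabulate actual vs. predicted class at a given cutoff probability."""
    matrix = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for y, p in zip(actual, probabilities):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            matrix["TP"] += 1  # responder correctly flagged
        elif predicted == 1 and y == 0:
            matrix["FP"] += 1  # non-responder flagged as responder
        elif predicted == 0 and y == 1:
            matrix["FN"] += 1  # responder missed
        else:
            matrix["TN"] += 1  # non-responder correctly left out
    return matrix


# Toy values, invented for illustration only.
toy_probs = [0.97, 0.10, 0.80, 0.55, 0.30, 0.96, 0.76, 0.49]
toy_actual = [1, 0, 1, 0, 0, 1, 0, 1]

cutoff_counts = count_at_cutoffs(toy_probs)
toy_matrix = classification_matrix(toy_actual, toy_probs, cutoff=0.5)
print(cutoff_counts)
print(toy_matrix)
```

Raising the cutoff shrinks the mailing list and usually trades missed responders for fewer wasted mailers, which is the comparison question 3 asks for.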