Final Exam / Fall Session 2, 2021
MDA 503 Data Mining and Predictive Analytics
▪ Please email your answer sheet when you complete the exam to jhoffman3@laverne.edu
▪ Due Wednesday, December 22, 2021 ▪ Four 25-point questions = 100 possible points
▪ Submit one file (Word or PDF) with screen captures of ALL model diagrams + answers to the exam questions
1. Filtering Data
The Census2000 dataset is a postal code-level summary of the 2000 US Census.
ID postal code of the region
LOCX region longitude
LOCY region latitude
MEANHHSZ average household size in the region
MEDHHINC median household income in the region
REGDENS region population density percentile (1=low density, 100=high density)
REGPOP number of people in the region
a. Open the ‘Census2000’ dataset in SAS Enterprise Miner with the ‘default’ settings.
b. Update the histogram for ‘Average Household Size’ to show 100 bins and explain your
findings. Right-click the CENSUS2000 data source and select Edit Variables. Select all listed inputs
and click Explore. Maximize the MeanHHSz histogram. Right-click in the histogram window and
select Graph Properties. Enter 100 in the Number of X Bins field and click OK.
What issue do you observe with the histogram/data?
What should you do as a data scientist to fix this problem?
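For reference, the same exploration can be scripted outside Enterprise Miner. A minimal sketch, assuming the table is visible to Base SAS as WORK.CENSUS2000:

proc sgplot data=work.census2000;
   histogram MeanHHSz / nbins=100;   /* 100 bins, matching the demo */
run;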
c. Remove any unwanted records from the Census2000 dataset. Drag the CENSUS2000 node &
a Filter node to the diagram workspace and connect the nodes. In Filter ‘properties’ change the
Default Filtering Method property to User-Specified Limits. Select the Interval Variables ellipsis.
Enter 0.1 as the Filter Lower Limit value for the input variable MeanHHSz. Press Enter & click OK.
What exactly did you filter out of this data / analysis?
d. Run the Filter node and view the Results.
How many cases did the Filter node remove (with a household size of zero)?
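For reference, the Filter node step can be approximated in code. A minimal sketch, again assuming WORK.CENSUS2000; the WHERE condition mirrors the user-specified lower limit of 0.1:

/* Count the rows the filter would remove */
proc sql;
   select count(*) as n_removed
   from work.census2000
   where MeanHHSz < 0.1;
quit;

/* Keep only the rows that pass the filter */
data work.census2000_filtered;
   set work.census2000;
   where MeanHHSz >= 0.1;
run;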

2. Clustering
Parts a-b build upon the Census2000 dataset and the diagram from Q1. For parts c-d, you will use the DUNGAREE dataset and set up a new diagram.
a. Ignore irrelevant inputs, standardize inputs, and use the Cluster tool. Drag a Cluster tool into
the diagram workspace. Connect the Filter node to the Cluster node. Select the Variables
property for the Cluster node. Set Use to No for LocX, LocY, and RegPop. The Cluster node
standardizes inputs automatically: the default Internal Standardization property is already set to
Standardization, so no change is required. Run the Cluster node & select Results.
What inputs/variables are included in the cluster analysis?
Why do these inputs need to be standardized?
How many clusters did the Cluster node find in the CENSUS2000 data?
b. Manually specify the number of clusters. In the Properties panel for the Cluster node, select
Specification Method → User Specify. Leave the Maximum Number of Clusters property at 10
(the default). Run the Cluster node and select Results.
Do you have exactly 10 clusters now?
Which cluster is the smallest? Which cluster is the largest?
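The point-and-click steps above correspond roughly to standardizing the retained inputs and then clustering them. A minimal sketch using PROC STDIZE and PROC FASTCLUS (a k-means-style stand-in for the Cluster node; Enterprise Miner's default Ward/CCC cluster-number selection differs in detail), assuming the filtered table from Q1:

proc stdize data=work.census2000_filtered method=std
            out=work.census_std;
   var MeanHHSz MedHHInc RegDens;   /* LocX, LocY, RegPop excluded */
run;

proc fastclus data=work.census_std maxclusters=10
              out=work.census_clus;
   var MeanHHSz MedHHInc RegDens;
run;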
c. Open the DUNGAREE dataset. The DUNGAREE dataset gives the number of pairs of 4 different
types of dungarees that were sold at stores over a specific time period. Each row represents an
individual store. One column is the store identification number, and the remaining columns
contain the number of pairs of each type of jeans that were sold. Select Advanced to use the
Advanced Metadata Advisor, and then click Next. The variable STOREID should have the ID model
role & the variable SALESTOT should have the Rejected model role. Click Next several times & then Finish.
d. Run default Cluster on the DUNGAREE dataset & then manually change the # of clusters. Drag
the DUNGAREE data source and a Cluster node to the diagram workspace and connect them.
Run the Cluster node and view the results. Then select Specification Method → User Specify.
Select Maximum Number of Clusters → 6. Run the Cluster node and view the results again.
How many clusters were originally generated by running the Cluster tool?
With 6 clusters, one cluster is much smaller than the others. Which cluster # is it?
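As a sketch, the same analysis in code (assuming the four sales columns are named FASHION, LEISURE, ORIGINAL, and STRETCH; check the names in your copy of DUNGAREE):

proc stdize data=work.dungaree method=std out=work.dungaree_std;
   var fashion leisure original stretch;   /* SALESTOT is rejected */
run;

proc fastclus data=work.dungaree_std maxclusters=6
              out=work.dungaree_clus;
   id storeid;                             /* ID role, not an input */
   var fashion leisure original stretch;
run;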

3. Support Vector Machines
The task is to apply several support vector machines with different settings to the data set SIGNAL. This data set contains two different signals, which unfortunately have a high degree of noise. The aim of the constructed model is to separate the two signals.
a. Open the SIGNAL dataset in SAS Enterprise Miner. Navigate to the DMNN41 library and select
the data set SIGNAL. Click Next. The data set consists of three variables and 600 observations.
Continue to the Metadata Advisor Options step and select Advanced to use the Advanced
Metadata Advisor. Click Next. In the Column Metadata window, set the role of the variable
OUTPUT_SIGNAL to Target. Make sure that the target variable has the level Binary. The variables
INPUT_SIGNAL1 and INPUT_SIGNAL2 are used as input variables. Click Next. There is no decision
processing. Click Next until you come to Step 9 of the Data Source Wizard. Click Finish.
b. Partition the data. Drag the SIGNAL data source and a Data Partition node onto the diagram
workspace and connect them. Modify the Data Set Allocations properties as follows:
• Set the Training property to 70.0
• Set the Test property to 0.0
• Leave the Validation property at 30.0
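A minimal code sketch of this split (assuming the table is visible as WORK.SIGNAL; note that the Data Partition node stratifies on a class target by default, which this simple random split does not):

data work.train work.valid;
   set work.signal;
   if ranuni(12345) < 0.70 then output work.train;   /* 70% training   */
   else output work.valid;                           /* 30% validation */
run;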
c. Set up HP SVM Linear, Polynomial, and RBF nodes. Drag three HP SVM nodes and connect all
three to the Data Partition node. Click the HP SVM node and make sure that this node is using
the Interior Point optimization method with a linear kernel. Rename this node HP SVM / Linear.
Click the HP SVM (2) node and change the optimization method to Active Set. Verify that a
polynomial kernel with degree 2 is being used. Rename this node HP SVM / Polynomial. Click
the HP SVM (3) node and change the optimization method to Active Set. Change the kernel to
Radial Basis Function. Rename this node HP SVM / Radial Basis Fx.
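For reference, the three configurations correspond roughly to the PROC HPSVM calls sketched below. The statement and option names (METHOD=, KERNEL, DEGREE=, K_PAR=) are given as recalled from the HPSVM documentation; verify them against your SAS release:

proc hpsvm data=work.train method=ipoint;          /* HP SVM / Linear */
   input input_signal1 input_signal2 / level=interval;
   target output_signal;
   kernel linear;
run;

proc hpsvm data=work.train method=activeset;   /* HP SVM / Polynomial */
   input input_signal1 input_signal2 / level=interval;
   target output_signal;
   kernel polynom / degree=2;
run;

proc hpsvm data=work.train method=activeset;          /* HP SVM / RBF */
   input input_signal1 input_signal2 / level=interval;
   target output_signal;
   kernel rbf / k_par=1;            /* k_par: RBF scale parameter */
run;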
d. Compare the results of the models. Drag a Model Comparison node onto the diagram workspace
and connect all three HP SVM nodes to it. Click the Model Comparison node in the diagram to
select it. In the Model Selection section, change Selection Statistic to ROC and Selection Table to
Validation. Run the Model Comparison node and view the results.
List the Average Squared Error (Training) for each HP SVM Model.
List the Average Squared Error (Validation) for each HP SVM Model.
Describe the concept of Support Vector Machines in a few sentences.
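(Reference for the two ASE questions above: for a binary target, average squared error is ASE = (1/N) * sum of (y_i - p_i)^2, where y_i is the observed 0/1 outcome and p_i is the model's predicted probability, computed separately on the training and validation partitions.)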

4. Transforming Variables and SAS Code
This demonstration uses the HOUSING dataset.
age proportion of owner-occupied units built prior to 1940
crim per capita crime rate by town
dis weighted distances to five Boston employment centers
indus proportion of non-retail business acres per town
lstat percent of lower socio-economic status of the population
nox nitric oxide concentration (parts per 10 million)
ptratio pupil-teacher ratio by town
rad index of accessibility to highways
rm average number of rooms per dwelling
tax full-value property tax-rate per $10,000
zn proportion of residential land zoned for lots more than 25,000 square feet
*The target is medv, which is the median value of owner-occupied homes in $1000s.
a. Open the HOUSING dataset in SAS Enterprise Miner. Select Advanced in the Metadata Advisor
Options window. Change the level of the variable RAD to Interval and the role of the variable
MEDV to Target. Change the roles of RM and PTRATIO to Rejected. Then click Next…Finish.
b. Decision Tree: MEDV. Drag the HOUSING data source and a Decision Tree node onto your
diagram workspace and connect the nodes. Rename this decision tree Decision Tree: MEDV. Select
this decision tree and set the Observation Based Importance property to Yes. Run this node.
c. SAS Code. Drag a SAS Code node to the diagram and connect it to the Decision Tree: MEDV
node. Use the SAS Code node to explore residual patterns and test for homoscedasticity. Select
the SAS Code node and then the Code Editor ellipsis under the Train properties. Select the top
icon to activate the training Code Editor. When you are inside the training Code Editor, right-
click, select Open, and select the diagnose.sas file. Run the SAS Code node and view the results.
Maximize the scatter plot. Right-click the plot, select Graph Properties, clear the Autosize
Markers check box, and reduce the marker size to 5.
Do the residuals appear to have a constant variance?
If not, what do you observe with the residuals, and what statistical term do we call this?
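The exact contents of diagnose.sas are course-specific, but a sketch of the kind of residual plot it produces is below. It assumes the node's incoming scored table is referenced by the Enterprise Miner macro variable &EM_IMPORT_DATA, and that the tree's prediction column follows the usual P_<target> naming (here P_MEDV); check the exported variable list in your flow:

data work.resid;
   set &em_import_data;
   residual = medv - p_medv;        /* residual = actual - predicted */
run;

proc sgplot data=work.resid;
   scatter x=p_medv y=residual / markerattrs=(size=5px);
   refline 0 / axis=y;
run;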

d. Perform a LOG transformation. Drag a Transform Variables node to the diagram and connect
it to the HOUSING node. Drag a Decision Tree Node to the diagram and connect it to the
Transform Variables node. Rename this Decision Tree Decision Tree: LOGMEDV. Select the
Transform Variables node. Under Default Methods in the Properties panel, change the Interval
Targets property to Log. Select the Decision Tree: LOGMEDV node and set the Observation Based
Importance property to Yes. Run this Decision Tree node. Copy and paste the SAS Code node to
the same diagram. Connect the new node to the Decision Tree: LOGMEDV node. Run this SAS
Code node and view the results.
Do the residuals appear to have a constant variance now (after the log transformation)?
What statistical term do we use to describe this (desirable) constant variance?
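As a sketch, the Log method applied here amounts to modeling log(MEDV) instead of MEDV (Enterprise Miner may add an offset when a variable has values <= 0, which is not the case for MEDV):

data work.housing_log;
   set work.housing;
   log_medv = log(medv);   /* new target for Decision Tree: LOGMEDV */
run;

Residuals from the tree fit to LOG_MEDV can then be plotted exactly as in part c, with predictions and residuals on the log scale.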


