Date: 2021-04-28

2021/4/28 HW4AnomalyDetectionBaysianNet.ipynb - Colaboratory

https://colab.research.google.com/drive/1ogA8xZHVZ9BVRP-ZxlPlXjYyXk4IjFH1#scrollTo=5bvgIil0dMZW&printMode=true 1/4

Question 1 (40 points)

In this question, you will model traffic counts in Pittsburgh using Gaussian process (GP) regression. The included dataset, "PittsburghTrafficCounts.csv", represents the average daily traffic counts computed by traffic sensors at over 1,100 locations in Allegheny County, PA. The data was collected from years 2012-2014 and compiled by Carnegie Mellon University’s Traffic21 Institute; we have the longitude, latitude, and average daily count for each sensor.

Given this dataset, your goal is to learn a model of traffic count as a function of spatial location. To do so, fit a Gaussian Process regression model to the observed data. While you can decide on the precise kernel specification, you should try to achieve a good model fit, as quantified by a log marginal likelihood value greater than (i.e., less negative than) -1400. Here are some hints for getting a good model fit:

We recommend that you take the logarithm of the traffic counts, and then subtract the mean of this vector, before fitting the model.

Since the data is noisy, don't forget to include a noise term (WhiteKernel) in your model.

When fitting a GP with RBF kernel on multidimensional data, you can learn a separate length scale for each dimension, e.g., length_scale=(length_scale_x, length_scale_y).
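The three hints above can be combined as in the following minimal sketch. It assumes scikit-learn's GaussianProcessRegressor and runs on synthetic coordinates and counts, not on the real CSV. The helper name fit_gp, the ConstantKernel amplitude term, the starting hyperparameters, and the number of optimizer restarts are illustrative choices, not part of the assignment.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel


def fit_gp(coords, counts, random_state=0):
    """Fit a GP to mean-centered log counts; return the model and its targets."""
    y = np.log(counts)
    y = y - y.mean()  # hint 1: log-transform, then subtract the mean
    kernel = (
        ConstantKernel(1.0) * RBF(length_scale=(1.0, 1.0))  # hint 3: one length scale per axis
        + WhiteKernel(noise_level=1.0)                       # hint 2: explicit noise term
    )
    gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=2,
                                  random_state=random_state)
    gp.fit(coords, y)
    return gp, y


# Synthetic stand-in for (Longitude, Latitude, AvgDailyTrafficCount); NOT the real data.
rng = np.random.default_rng(0)
coords = rng.uniform([-80.3, 40.3], [-79.8, 40.7], size=(80, 2))
counts = np.exp(8.0 + np.sin(10.0 * coords[:, 0]) + 0.3 * rng.normal(size=80))

gp, y = fit_gp(coords, counts)
print(gp.kernel_)                         # optimized kernel
print(gp.log_marginal_likelihood_value_)  # log marginal likelihood at the optimum
```

On the real data, coords would be Data1[['Longitude', 'Latitude']].values and counts would be Data1['AvgDailyTrafficCount'].values. Whether this particular kernel clears the -1400 threshold is not guaranteed and has to be checked by running it.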

Your Python code should provide the following five outputs:

1) The kernel after parameter optimization and fitting to the observed data. (10 pts)

2) The log marginal likelihood of the training data. (5 pts)

3) Show a 2-D plot of the model's predictions over a mesh grid of longitude/latitude (with color corresponding to the model's predictions) and overlay a 2-D scatter plot of sensor locations (with color corresponding to the observed values). (10 pts)

4) What percentage of sensors have average traffic counts more than two standard deviations higher or lower than the model predicts given their spatial location? (5 pts)

5) Show a 2-D scatter plot of the sensor locations, with three colors corresponding to observed values a) more than two standard deviations higher than predicted, b) more than two standard deviations lower than predicted, and c) within two standard deviations of the predicted values. (10 pts)
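Outputs 4 and 5 both reduce to comparing each residual with the model's predictive standard deviation. The sketch below shows one way to do that with NumPy alone. The function name flag_outliers and the toy arrays are hypothetical, and it assumes the standard deviation is the per-point value a GP returns alongside its mean prediction.

```python
import numpy as np


def flag_outliers(y_obs, y_pred, y_std, n_sigma=2.0):
    """Label each observation 'high', 'low', or 'within'; also return the % flagged."""
    z = (np.asarray(y_obs) - np.asarray(y_pred)) / np.asarray(y_std)
    labels = np.where(z > n_sigma, "high", np.where(z < -n_sigma, "low", "within"))
    pct_flagged = 100.0 * np.mean(labels != "within")
    return labels, pct_flagged


# Toy numbers for illustration only (not the traffic data).
y_obs = np.array([0.0, 3.0, -3.0, 1.0])
y_pred = np.zeros(4)
y_std = np.ones(4)

labels, pct_flagged = flag_outliers(y_obs, y_pred, y_std)
print(labels, pct_flagged)
```

With a fitted scikit-learn GP, y_pred and y_std would come from gp.predict(X, return_std=True), and the three labels can be mapped to three colors in the scatter plot for output 5.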

MLC HW 4

import pandas as pd

import numpy as np

from google.colab import drive

drive.mount('/content/gdrive')

Data1=pd.read_csv('gdrive/My Drive/PittsburghTrafficCounts.csv')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/con

Data1['log']=np.log(Data1['AvgDailyTrafficCount'])

Data1['log'].mean()

8.408342585887237

Data1

      Longitude   Latitude  AvgDailyTrafficCount       log
0    -80.278366  40.468606                  84.0  4.430817
1    -80.162117  40.384598                  95.0  4.553877
2    -80.221205  40.366778                  97.0  4.574711
3    -80.142455  40.622084                 111.0  4.709530
4    -80.131975  40.544915                 125.0  4.828314
...         ...        ...                   ...       ...
1110 -79.843684  40.498619               13428.0  9.505097
1111 -79.926842  40.425383               13713.0  9.526100
1112 -80.065730  40.397582               13822.0  9.534017
1113 -79.863848  40.429878               14172.0  9.559023
1114 -79.848609  40.479233               14891.0  9.608512

1115 rows × 4 columns

Question 2: Cluster-based anomaly detection (10 points)

Given an unlabeled dataset with two real-valued attributes, we perform cluster-based anomaly detection by running k-means, choosing the number of clusters k automatically using the Schwarz criterion. Four clusters are formed:

A: 100 points, center (0, 0), standard deviation 0.1

B: 150 points, center (35, 5), standard deviation 5

C: 2 points, center (15, 20), standard deviation 1

D: 200 points, center (10, 10), standard deviation 1

Given the four points below, which of these points are, and are not, likely to be anomalies? Choose

“Anomaly” or “Not Anomaly”, and provide a brief explanation, for each point. (Hint: your answers

should take into account the size and standard deviation of each cluster as well as the distances to

cluster centers.)

(1, 0) Anomaly / Not Anomaly

(35, 2) Anomaly / Not Anomaly

(15, 19) Anomaly / Not Anomaly

(10, 11) Anomaly / Not Anomaly
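One way to act on the hint is to express each point's distance to its nearest center in units of that cluster's standard deviation, and to treat very small clusters as suspicious in themselves. The sketch below does this for the four points. The cutoffs Z_MAX and MIN_SIZE are assumed values chosen for illustration; they are not given in the problem.

```python
import math

# name: (size, center, standard deviation), copied from the problem statement
clusters = {
    "A": (100, (0.0, 0.0), 0.1),
    "B": (150, (35.0, 5.0), 5.0),
    "C": (2, (15.0, 20.0), 1.0),
    "D": (200, (10.0, 10.0), 1.0),
}
points = [(1, 0), (35, 2), (15, 19), (10, 11)]

Z_MAX = 3.0   # assumed cutoff on distance / standard deviation
MIN_SIZE = 5  # assumed minimum size for a cluster to count as "normal"


def assess(point):
    """Return (nearest cluster, standardized distance, anomaly flag) for a point."""
    name, (size, center, std) = min(
        clusters.items(), key=lambda kv: math.dist(point, kv[1][1])
    )
    z = math.dist(point, center) / std
    return name, z, (z > Z_MAX or size < MIN_SIZE)


results = {p: assess(p) for p in points}
for p, (name, z, is_anomaly) in results.items():
    print(p, "nearest:", name, "distance/std: %.1f" % z,
          "Anomaly" if is_anomaly else "Not Anomaly")
```

Under these assumed cutoffs, (1, 0) is flagged because it lies 10 standard deviations from the center of A, and (15, 19) is flagged because its nearest cluster C has only 2 points; (35, 2) and (10, 11) sit within about one standard deviation of large clusters.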

Your answer here

Question 3: Anomaly detection (50 points)

For this question, use the "County Health Indicators" dataset provided to identify the most anomalous counties. Please list the top 5 most anomalous counties computed using each of the following models. (We recommend that, as a pre-processing step, you drop NA values and make sure all numeric values are treated as floats, not strings.)

Part 1: Learn a Bayesian network structure using only the six features ["'% Smokers'","'% Obese'","'Violent Crime Rate'","'80/20 Income Ratio'","'% Children in Poverty'","'Average Daily PM2.5'"]. Use pd.cut() to discretize each feature into 5 categories: 0,1,2,3,4.

(a) Use HillClimbSearch and BicScore to learn the Bayesian network structure. (5 pts)

(b) Which 5 counties have the lowest (most negative) log-likelihood values? Please show a ranked list of the top counties' names and log-likelihood values. (10 pts)

Part 2: Cluster-based anomaly detection. Use all numeric features for this part, and do not discretize.

(a) Clustering with k-means. Please use k=3 clusters. Compute each record's distance to the nearest cluster center and report the five counties which have the longest distances. (10 pts)

(b) Cluster with Gaussian Mixture. Please repeat Part 2(a), but use the log-likelihood of each record (rather than distance) as the measure of anomalousness. (10 pts)

Part 3: Choose one more anomaly detection model you prefer and report the top 5 most anomalous counties by the model you chose. (10 pts)

Part 4: Compare and contrast the results from the different models. Were there some counties that

were found to be anomalous in some models and not in others? Please provide some intuitions on

why each county was found to be anomalous. (5 pts)
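For Part 1, the two mechanical steps are discretizing with pd.cut() and scoring each record by its log-likelihood under a network. The sketch below illustrates both with pandas and NumPy only. It does not call HillClimbSearch or BicScore: the structure passed in is a hand-written, hypothetical parent map standing in for whatever structure the search returns, and the Laplace smoothing (alpha) is an assumption added so that no record gets a zero probability. The column names feat_a and feat_b are placeholders, not dataset columns.

```python
import numpy as np
import pandas as pd


def discretize(df, columns, bins=5):
    """Equal-width binning with pd.cut; labels=False yields integer codes 0..bins-1."""
    out = df.copy()
    for col in columns:
        out[col] = pd.cut(out[col].astype(float), bins=bins, labels=False)
    return out


def record_log_likelihoods(df, parents, n_states=5, alpha=1.0):
    """Per-row sum of log P(node | parents), estimated from smoothed counts."""
    ones = pd.Series(1.0, index=df.index)
    total = pd.Series(0.0, index=df.index)
    for node, pa in parents.items():
        # Number of rows sharing this row's (parents, node) configuration.
        joint = ones.groupby([df[c] for c in pa + [node]]).transform("sum")
        if pa:
            # Number of rows sharing this row's parent configuration.
            marginal = ones.groupby([df[c] for c in pa]).transform("sum")
        else:
            marginal = float(len(df))
        total += np.log((joint + alpha) / (marginal + alpha * n_states))
    return total


# Synthetic two-column example; the real task uses the six county features.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
raw = pd.DataFrame({"feat_a": x, "feat_b": x + 0.3 * rng.normal(size=200)})

codes = discretize(raw, ["feat_a", "feat_b"])
structure = {"feat_a": [], "feat_b": ["feat_a"]}  # hypothetical DAG: feat_a -> feat_b
log_lik = record_log_likelihoods(codes, structure)
print(log_lik.nsmallest(5))  # the five least likely records
```

On the county data, the index of the result can be joined back to the county-name column to produce the ranked list asked for in Part 1(b).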

Data2=pd.read_csv("2016CountyHealthIndicators.csv")

Data2.head()

# your code here
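For Part 2, both scores can be computed with scikit-learn, as in the rough sketch below. It uses synthetic two-dimensional data with one planted outlier rather than the county file. Standardizing the features first is an assumption made here because the county indicators are on different scales; the assignment does not prescribe it. The helper names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def kmeans_distances(X, k=3, random_state=0):
    """Distance from each row to its nearest k-means center (larger = more anomalous)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    return km.transform(X).min(axis=1)  # transform() gives distances to all centers


def gmm_log_likelihoods(X, k=3, random_state=0):
    """Log-likelihood of each row under a fitted mixture (lower = more anomalous)."""
    gmm = GaussianMixture(n_components=k, random_state=random_state).fit(X)
    return gmm.score_samples(X)


# Synthetic data: three tight blobs plus one planted outlier in the last row (index 90).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([c + 0.5 * rng.normal(size=(30, 2)) for c in centers])
X_raw = np.vstack([blobs, [[-6.0, -6.0]]])
X = StandardScaler().fit_transform(X_raw)

dist = kmeans_distances(X)
loglik = gmm_log_likelihoods(X)

top_kmeans = np.argsort(dist)[::-1][:5]  # five largest distances
top_gmm = np.argsort(loglik)[:5]         # five lowest log-likelihoods
print(top_kmeans, top_gmm)
```

On the county data, X would be built from the numeric columns of Data2 after dropping NA values, and the row positions in top_kmeans and top_gmm would be mapped back to county names.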
