Programming assignment sample: COMP7710 | 学霸联盟

Date: 2022-03-31

COMP7710 Homework 3 – Classification
Weighting: 10%
Due date: 4th April 2022, 4 pm
Overview
In this problem set we will be working with Bayesian classification.
Marks
This assignment is worth 10% of your total grade.
Submission Instructions
• Your solutions to the questions should be submitted via BlackBoard.
• You should only submit your completed Jupyter notebook in .ipynb format, including written answers in
markdown and results from executed code cells.
• Name your file as sxxxxxxx-homework#.ipynb where sxxxxxxx is your student number and # is the
homework number.
• No marks will be awarded for non-compiling submissions.
Late Submissions and Extensions
Late Penalties: Where an assessment item is submitted after the deadline, without an approved extension, a late
penalty will apply. The late penalty shall be 10% of the maximum possible mark for the assessment item,
deducted per calendar day (or part thereof), up to a maximum of seven (7) days. After seven days, no marks will
be awarded for the item. A day is considered to be a 24-hour block from the assessment item due time. Negative
marks will not be awarded.
Academic Misconduct
This assignment is an individual assignment. Posting questions or copying answers from the internet is considered
cheating, as is sharing your answers with classmates. All your work (including code) will be analysed by
sophisticated plagiarism detection software. Students are reminded of the University’s policy on student misconduct,
including plagiarism. See the course profile and the School web page:
http://www.itee.uq.edu.au/itee-student-misconduct-including-plagiarism.
Questions
In this homework, you will analyse the 'phishing.csv' dataset and use Logistic Regression and two Bayesian classifiers
to distinguish phishing websites from benign ones. In addition to Gaussian Naive Bayes, you should research and
familiarise yourself with Categorical Naive Bayes (see the scikit-learn link for the full list of Naive Bayes methods).
The description of the dataset features is provided below.
• Domain: The URL itself.
• Ranking: Page Ranking.
• isIp: Is there an IP address in the weblink.
• valid: Fetched from Google’s "whois" API; describes the current status of the URL’s registration.
• activeDuration: Also from the "whois" API. Gives the time elapsed since registration.
• urlLen: The length of the URL.
• is@: If the link has a ’@’ character.
• isredirect: If the link has double dashes, there is a chance that it is a redirect.
• haveDash: If there are any dashes in the domain name.
• domainLen: The length of just the domain name.
• noOfSubdomain: The number of subdomains present in the URL.
• Labels: 0: Legitimate website; 1: Phishing Link/ Spam Link.
Data pre-processing
(2 marks)
1. Preprocess the data. Check whether there are any missing data, and remove all rows containing '-' entries.
Drop the "Domain" feature. Transform textual data into numerical data where necessary.
2. Split the dataset into training and test sets in a 70:30 ratio. Consider the target to be the column 'label',
which describes whether the domain was phishing or not.
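The two pre-processing steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the helper name preprocess, the toy rows, the "yes"/"no" encoding of valid, and the reduced set of columns are all assumptions made up for the example. For the real data you would load the file with pd.read_csv("phishing.csv") and inspect df.isna().sum() first.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows containing '-' or NaN, drop 'Domain', and encode text columns."""
    has_dash = df.isin(["-"]).any(axis=1)       # rows with a '-' placeholder entry
    df = df[~has_dash].dropna().drop(columns=["Domain"]).copy()
    for col in df.columns:
        as_number = pd.to_numeric(df[col], errors="coerce")
        if as_number.notna().all():
            df[col] = as_number                 # numeric values that were stored as text
        else:
            df[col] = df[col].astype("category").cat.codes  # text -> integer codes
    return df.reset_index(drop=True)


# Toy stand-in for phishing.csv (values are invented for illustration only).
toy = pd.DataFrame({
    "Domain": [f"site{i}.example" for i in range(12)],
    "Ranking": ["10", "-", "250", "31", "7", "-", "980", "45", "12", "600", "88", "3"],
    "isIp": [0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    "valid": ["yes", "no", "no", "yes", "yes", "no", "no", "yes", "yes", "no", "yes", "yes"],
    "urlLen": [22, 40, 75, 18, 25, 90, 66, 30, 21, 80, 27, 19],
    "label": [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
})

print(toy.isna().sum())                         # check for missing values per column
clean = preprocess(toy)

# 70:30 split with 'label' as the target; random_state fixed for repeatability.
X = clean.drop(columns=["label"])
y = clean["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape)
```

On the full dataset it may also be worth passing stratify=y to train_test_split so that both sets keep the same class balance; whether that is needed depends on how imbalanced the labels are.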
Model Training and Evaluation
(2 marks)
1. Train 3 models: Logistic Regression, Gaussian Naive Bayes and Categorical Naive Bayes.
2. Calculate test accuracy, test precision and test recall for each model.
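A possible shape for the training and evaluation steps above is sketched below on synthetic data, so that it runs without phishing.csv. The data generator, the evaluate helper, and the choice to bin continuous features with KBinsDiscretizer before CategoricalNB are assumptions for this example only; CategoricalNB expects non-negative integer category codes, so some discretisation or encoding of features such as Ranking or urlLen is likely needed, but the assignment does not prescribe how.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in data: two continuous and two binary features.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(loc=2.0 * y, scale=1.0),             # continuous, shifted by class
    rng.normal(loc=-1.5 * y, scale=1.0),            # continuous, shifted by class
    (rng.random(n) < 0.2 + 0.6 * y).astype(int),    # binary flag, informative
    (rng.random(n) < 0.5).astype(int),              # binary flag, uninformative
])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

N_BINS = 5  # assumed bin count for discretising features before CategoricalNB
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian NB": GaussianNB(),
    "Categorical NB": make_pipeline(
        KBinsDiscretizer(n_bins=N_BINS, encode="ordinal", strategy="uniform"),
        CategoricalNB(min_categories=N_BINS),       # reserve a slot for every bin
    ),
}


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit on the training set and return test accuracy, precision and recall."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
    }


results = {
    name: evaluate(model, X_train, y_train, X_test, y_test)
    for name, model in models.items()
}
for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Setting min_categories to the bin count keeps CategoricalNB from failing when a bin is empty in the training set but appears at prediction time. Precision and recall here treat label 1 (phishing) as the positive class, which is the scikit-learn default for binary targets.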
Discussion
(6 marks)
1. Describe the difference between Gaussian Naive Bayes and Categorical Naive Bayes.
2. Compare the test performance of Gaussian Naive Bayes and Categorical Naive Bayes on the phishing.csv dataset.
Use the dataset feature description, provided above, to explain why one method performs better than the
other.
3. Describe the similarities and differences between Logistic Regression and Naive Bayes.
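As background for question 1, the standard textbook formulations of the two classifiers are given below. The notation is an assumption of this sketch and does not come from the assignment: n is the number of features, \mu_{y,i} and \sigma_{y,i}^2 are the per-class mean and variance of feature i, N_{tic} is the number of training samples of class c in which feature i takes value t, N_c is the number of samples in class c, n_i is the number of categories of feature i, and \alpha is a smoothing parameter. Both methods share the same decision rule and differ only in how the per-feature likelihood is modelled.

```latex
% Shared Naive Bayes decision rule
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)

% Gaussian Naive Bayes: each feature is modelled as normal within a class
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}
  \exp\!\left( -\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}} \right)

% Categorical Naive Bayes: each feature takes one of n_i discrete values
P(x_i = t \mid y = c) = \frac{N_{tic} + \alpha}{N_{c} + \alpha\, n_i}
```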