Date: 2024-05-22
Machine Learning for Analysis – 2024 S1 – Assignment
ML Solutions for Misinformation Detection in Social Media
Introduction
Social media, particularly X (formerly known as Twitter), has revolutionized the way information
spreads, but it is also an incubator for fake news and misinformation. Misinformation on X can
take diverse forms and may stem from various sources, whether intentional or not, taking
advantage of the platform's viral nature to widen its dissemination. As we approach major events like
elections, the urgency of addressing this challenge becomes increasingly apparent. Because
misinformation has no single form, there is a growing need to develop innovative approaches to
addressing it.
Machine learning and natural language processing (NLP) offer promising solutions to identify trends
and detect misinformation. However, free-text data is challenging to incorporate into classification
models due to its lack of structure. To overcome this challenge, latent variable models such as topic
models or feature generation can be used to infer intermediary representations that can be used as
structured data for classification tasks.
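One way to picture the step from free text to structured data is the sketch below. It uses TF-IDF weighting, a simple feature-generation technique rather than a latent variable model, written with the standard library only so it runs anywhere. The function names and the example posts are illustrative assumptions, not part of the assignment; in practice a library implementation or a topic model could take its place.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return (vocabulary, vectors); each vector holds one TF-IDF weight per term."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical example posts, not taken from the assignment dataset.
posts = [
    "breaking news about the election",
    "election results announced today",
    "cute cat video goes viral",
]
vocabulary, vectors = tfidf_vectors(posts)
```

Each post becomes a fixed-length numeric vector, so it can sit alongside the numeric metadata columns as input to a classifier.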
In this project, you will showcase the significance of integrating data sourced from X alongside newly
engineered features to classify the authenticity of news-related tweets. A dataset obtained from X has
been web-scraped, and the various sections of this assignment will establish one kind of exploratory
strategy for addressing a classification challenge.
Dataset
The Assignment dataset consists of an assortment of news headlines, along with associated X posts
relating to the headline. The dataset consists of 134,198 rows and 15 columns. There are 3 types of
feature variables and only 1 target variable:
Feature Variables
➢ Textual Data:
1. news_author (str) – author of a news headline.
2. news_headline (str) – headline of a news article.
3. related_tweet (str) – X post relating to the news headline posted by a user.
➢ Post Metadata:
4. post_replies (int) – number of replies on the post.
5. post_retweets (int) – number of retweets on the post.
6. post_favourites (int) – number of favourites on the post.
7. post_quotes (int) – number of times the post has been quote tweeted.
➢ User Metadata:
8. user_followers (int) – number of followers.
9. user_following (int) – number of following users.
10. user_friends (int) – number of friends (mutual following).
11. user_tweet_count (int) – total number of tweets the user has made.
12. user_favourites_count (int) – total number of favourites the user has across all tweets.
13. user_mentions (int) – total number of users mentioned (@) in related_tweet.
14. user_tweet_count_lists (int) – total number of tweets the user has in their lists.
Target Variable
➢ Misinformation (bool) – a True/False value indicating whether a tweet is false.
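The column list above can be written down as a schema and used as a quick sanity check on loaded rows. The sketch below is only an illustration: the helper `validate_row` and the sample row are hypothetical, and it assumes each row has already been parsed into a dictionary with Python types matching those listed.

```python
# Expected columns and types, copied from the dataset description above.
EXPECTED_SCHEMA = {
    "news_author": str,
    "news_headline": str,
    "related_tweet": str,
    "post_replies": int,
    "post_retweets": int,
    "post_favourites": int,
    "post_quotes": int,
    "user_followers": int,
    "user_following": int,
    "user_friends": int,
    "user_tweet_count": int,
    "user_favourites_count": int,
    "user_mentions": int,
    "user_tweet_count_lists": int,
    "Misinformation": bool,
}


def validate_row(row):
    """Return a list of human-readable problems found in one row (empty if none)."""
    problems = []
    for column, expected in EXPECTED_SCHEMA.items():
        if column not in row:
            problems.append(f"missing column: {column}")
            continue
        value = row[column]
        # bool is a subclass of int in Python, so reject it explicitly for int columns.
        if expected is int and isinstance(value, bool):
            ok = False
        else:
            ok = isinstance(value, expected)
        if not ok:
            problems.append(
                f"{column}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Hypothetical row for illustration only; the values are made up.
sample_row = {
    "news_author": "A. Reporter",
    "news_headline": "Example headline",
    "related_tweet": "Example post about the headline",
    "post_replies": 3, "post_retweets": 10, "post_favourites": 25, "post_quotes": 1,
    "user_followers": 120, "user_following": 80, "user_friends": 40,
    "user_tweet_count": 900, "user_favourites_count": 1500,
    "user_mentions": 0, "user_tweet_count_lists": 2,
    "Misinformation": False,
}
problems = validate_row(sample_row)
```

A check like this catches counts that were read in as strings, or a target column that was read in as 0/1 instead of a boolean, before any modelling starts.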
Specification
Summary
• Type: Project report, individual assignment
• Deliverable: Report in the form of a Jupyter notebook only (.ipynb)
The aim of this assignment is to provide you with experience in the steps involved in text preparation,
feature generation, and creating, evaluating, and improving classification models. You will need to
research NLP and Python functionality if you aim to achieve excellent marks and discover innovative
techniques/methods.
Exploration, Preparation & Feature Generation
This section requires you to explore various aspects of your dataset and prepare the data for future
sections. It is important you take time to carefully explore your data and make decisions on preparation
or generation that make sense.
Preprocessing steps are essential to clean and standardize data before feature generation and to
enhance the quality of the extracted features. Classification models that use generated features may
learn patterns and relationships in the data better than models trained on the original columns
alone.
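A minimal sketch of what such text preprocessing might look like for a single post is given below. The specific choices (lowercasing, removing URLs and @mentions, dropping punctuation) are assumptions made for illustration, not requirements of the assignment, and the function name `clean_tweet` is hypothetical.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")
NON_WORD_PATTERN = re.compile(r"[^a-z0-9\s]")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_tweet(text):
    """Normalize one post: lowercase, strip URLs, mentions and punctuation."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    # Removes '#', punctuation and any other character outside a-z, 0-9 and whitespace.
    text = NON_WORD_PATTERN.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions above.
    return WHITESPACE_PATTERN.sub(" ", text).strip()


# Hypothetical example post.
cleaned = clean_tweet("BREAKING: @user says #Election was RIGGED!!! https://t.co/abc")
```

Cleaning of this kind discards signals such as capitalisation, exclamation marks and mentions, so any features based on them should be computed from the raw text first.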
Further, X (formerly Twitter) recently open-sourced its algorithms, and many articles provide insights
into which features of a tweet are important. Knowing this may help you to better understand how to
classify a tweet as misinformation.
Your task is to
➢ Explore and prepare your data.
o In this task, you could perform the necessary cleaning and pre-processing tasks, and explore
or try to understand and profile your data through various techniques (e.g. clustering,
topic modelling).
➢ Generate new features from your data.
o You should have a good understanding of your data from above and can now
experiment with feature generation. In this task you should consider what can be
generated to improve your classification model.
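As a starting point for feature generation, the sketch below derives a few candidate features from one row of the dataset. The feature names and definitions are illustrative assumptions only; whether any of them helps the classifier is something to test, not something the assignment states.

```python
def engineer_features(row):
    """Derive simple text and engagement features from one dataset row (a dict)."""
    tweet = row["related_tweet"]
    words = tweet.split()
    engagement = (
        row["post_replies"]
        + row["post_retweets"]
        + row["post_favourites"]
        + row["post_quotes"]
    )
    return {
        "tweet_char_count": len(tweet),
        "tweet_word_count": len(words),
        "hashtag_count": sum(1 for w in words if w.startswith("#")),
        "exclamation_count": tweet.count("!"),
        # Share of characters that are uppercase; 0.0 for an empty post.
        "uppercase_ratio": (sum(1 for c in tweet if c.isupper()) / len(tweet)) if tweet else 0.0,
        "total_engagement": engagement,
        # The +1 avoids division by zero for accounts that follow nobody.
        "follower_following_ratio": row["user_followers"] / (row["user_following"] + 1),
    }


# Hypothetical row containing only the columns this sketch reads.
example_row = {
    "related_tweet": "WOW #fake news!!",
    "post_replies": 1, "post_retweets": 2, "post_favourites": 3, "post_quotes": 4,
    "user_followers": 100, "user_following": 9,
}
features = engineer_features(example_row)
```

The function reads the raw post text, so stylistic cues such as capitalisation and exclamation marks are still available when the features are computed.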
Classification (Model Building and Evaluation)
It is important to try multiple variations of features/parameters in model building to achieve the best
performance. Additionally, you should elaborate on the performance metrics you have used to evaluate
your model and explain why they suit the available data.
Your task is to
➢ Experiment with developing and evaluating classification models to find a model that has the best
overall performance.
o Once you find the best performing model, you should only show how you built and
evaluated that specific one.
➢ Elaborate on the major tasks you have undertaken to improve the best-performing model and
explain why the performance metrics suit the available data.
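To illustrate why the choice of metric depends on the data, the sketch below computes accuracy, precision, recall and F1 by hand for two sets of predictions. The labels are made up, and the class imbalance they show (2 misinformation posts out of 10) is an assumption for the example, not a statement about the assignment dataset.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: True means misinformation.
y_true = [True, True] + [False] * 8
y_always_negative = [False] * 10               # never flags anything
y_model = [True, False, True] + [False] * 7    # one hit, one miss, one false alarm

baseline = classification_metrics(y_true, y_always_negative)
model = classification_metrics(y_true, y_model)
```

Both sets of predictions reach the same accuracy here, but the always-negative baseline has zero recall and F1 while the second set does not. When one class is rare, accuracy on its own can therefore hide a model that never detects misinformation.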
Submission
Your report should be delivered in an .ipynb file. A notebook template is provided to show how to
structure your work. You need to use the template (Assignment_Template.ipynb) and strictly follow
its format which is designed based on the provided Assignment rubric.
It can be useful to add some in-line comments (using #) next to your code to explain it briefly.
You will get a better mark if your approach is innovative. This means no other student has applied it,
or a few others have applied a similar approach with some differences. Therefore, it is highly advised
that you do not share your creative work with anyone else. You can still discuss preliminary ideas
and help each other, just remember your submission must be your own work.
Submission is to be made through Blackboard Assignment Submission, as indicated in Learn.UQ. The only
acceptable submission format is an .ipynb file. The file should be named in the format YourStudentID.ipynb.
You will only need to submit one .ipynb file and should use the provided Python template file.
Before submission:
➢ Ensure that your code can run without errors. If your code returns an error at any point, your
assignment will only be marked up until the error, and the remainder of your code won't earn
any marks. Example errors include syntax errors and NameErrors.
➢ Make sure that all the important outputs are shown in your notebook. However, avoid showing
trivial outputs. For example, you should remove code that displays the whole DataFrame for no
particular reason.
➢ Your marker will first look at your generated output as a reference without running your
notebook (unless deemed necessary). Therefore, your significant outputs need to be generated,
and the elaboration should be provided in the notebook, as shown in the template.
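As a rough pre-submission check, the sketch below scans a saved notebook for code cells whose stored outputs contain an error. It relies on the standard .ipynb JSON layout (a "cells" list in which code cells carry an "outputs" list, and failed cells have an output with "output_type" set to "error"). The helper name and the tiny example notebook are hypothetical, and the check only inspects saved outputs, so it does not replace restarting the kernel and running all cells.

```python
import json


def cells_with_errors(notebook):
    """Return the indices of code cells whose saved outputs include an error."""
    failing = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        if any(out.get("output_type") == "error" for out in cell.get("outputs", [])):
            failing.append(index)
    return failing


# Minimal hand-written notebook structure for illustration. A real file would be
# loaded with json.load(open("YourStudentID.ipynb", encoding="utf-8")).
example_notebook = json.loads("""
{
  "cells": [
    {"cell_type": "markdown", "source": ["# Report"]},
    {"cell_type": "code", "source": ["print(1)"],
     "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1"]}]},
    {"cell_type": "code", "source": ["undefined_name"],
     "outputs": [{"output_type": "error", "ename": "NameError",
                  "evalue": "name 'undefined_name' is not defined", "traceback": []}]}
  ]
}
""")
failing_cells = cells_with_errors(example_notebook)
```

An empty result means no saved cell output recorded an error; it says nothing about cells that were never executed.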