POLI3148-Data science and policy study assignment writing service | 学霸联盟

Date: 2023-05-11

POLI3148 Homework 2+3: Text Mining and Spatial Analysis
Weight: 30% of Total Grade. Due: 23:59, May 19, 2023
2023-03-23
Motivation
Press conferences offer valuable insights into a leader’s style, priorities, and strategies. They also reveal how
politicians use the media to influence and persuade. Despite being highly choreographed and staged affairs,
press conferences should not be dismissed as they provide an important platform for leaders to connect with
audiences while helping us comprehend the dynamics of politics.
In this assignment, you will analyze a corpus of the Chinese Ministry of Foreign Affairs Press Conferences. You
will apply text mining and network analysis methods to pairs of questions and answers from press conferences
held from 2020 to 2022.
Logistics
Coding: You should use RMarkdown to do this assignment. You can do it on either Posit Cloud or your
local environment. To do it on Posit Cloud, please open the Assignment Project named Homework 2+3 on
Posit Cloud. If you prefer to work in your local environment, download the file and install the relevant packages. If
you encounter any issues while compiling your RMD, refer to the relevant questions on Campuswire. If the
existing questions do not resolve your issue, feel free to post a new question and we will assist you.
Submission: Submit both the RMarkdown file and the PDF to Moodle. All summary statistics and data
visualizations you include in the report should be replicable with your code.
Friendly Reminders
• This homework can be challenging. Although all techniques you will apply have been covered in class,
you will need to use your creativity in “putting things together” and interpreting the results. Reserve a
few long working sessions for it and start early.
• Most questions are open-ended. That means there may not be a single “correct answer” for many parts
of your work: you and your classmates may use different approaches or reach different conclusions, and
both can be considered great answers.
• If you encounter difficulties, feel free to ask questions. You may work with your classmates (but please
acknowledge whom you worked with), post on Campuswire, or email Dr. Chen or the TA. Campuswire
will be the best place to get a quick response from us.
Data
The dataset contains an original corpus of the Chinese Ministry of Foreign Affairs Press Conferences, which
maps out China’s diplomatic discourse and its foreign policy priorities. The dataset is organized around
a question/response structure extracted from the official transcripts of press conferences. To begin, please
run the following code:
library(tidyverse)
url_main <- "https://raw.githubusercontent.com/sscihz/DaSPPA/main/"
url_file <- "MoFA/Data/CMoFA.csv"
MoFA <- read_csv(paste0(url_main, url_file))
Question 1 Describe Raw Documents and Meta Data (2%)
Describe your pre-tokenized corpus, including both the questions and answers. Provide appropriate summary
statistics and data visualization, and briefly discuss the metadata of the documents and the patterns you find in
the text.
Question 2 Tokenization and Wrangling (2%)
Tokenize and clean the text data, separately for both the questions and answers. Provide appropriate summary
statistics and data visualization and briefly discuss the patterns you find. At the end of this step, save the
tokenized and cleaned documents and dictionary.
Question 3 Exploratory Data Analysis (2%)
Using the text data tokenized in the previous step, draw word clouds for both questions and answers.
Question 4 Sentiment Analysis (6%)
Conduct sentiment analysis on both questions and answers. Then, answer the following questions:
• For questions: Are there any emotional differences between Chinese and non-Chinese journalists
when asking questions?
• For answers: Are there any emotional differences between spokespersons? Are spokespersons
friendlier (a more positive sentiment score) when answering questions from Chinese journalists?
Provide appropriate summary statistics and data visualization to support your answers.
Question 5 Topic Modeling (6%)
Conduct topic modeling on both questions and answers. For both questions and answers, complete the following
tasks:
• First, visualize all topics.
• Second, pick the topics that you think are most important and draw word clouds. Describe how the patterns
in these word clouds differ from those in the word clouds you drew in Q3.
• Third, show and interpret the variations of topics among different spokespersons/news outlets (at
most three news outlets of your choice) and across time. Provide appropriate visualization and discuss
patterns you find.
Question 6 Co-occurrence Network (6%)
Construct networks and conduct relevant social network analysis. For questions, the nodes are news outlets
and the locations mentioned in questions: if a news outlet mentions a location, there is a link between the
two nodes. For answers, the nodes are spokespersons and the locations mentioned in answers: if a spokesperson
mentions a location, there is a link between the two nodes. Provide appropriate summary statistics
and data visualization and briefly discuss the patterns you find.
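The question-side edge list described above could be built roughly as follows. This is a minimal sketch on toy data: it assumes the entries in q_loc are comma-separated, which should be checked against the actual file. Only the column names who_asked and q_loc come from the variable description table; the toy values and the object names toy_q and edges_q are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the MoFA data (values are invented for illustration).
toy_q <- tibble(
  who_asked = c("Outlet A", "Outlet A", "Outlet B"),
  q_loc     = c("US, Japan", "Japan", NA)
)

# One row per (news outlet, location) mention, then count repeated mentions
# so that each edge carries a weight.
edges_q <- toy_q %>%
  filter(!is.na(q_loc)) %>%                       # drop questions with no location
  separate_rows(q_loc, sep = ",\\s*") %>%         # assumed delimiter: comma
  count(who_asked, q_loc, name = "weight")

print(edges_q)
```

An edge list in this shape can then be passed to a network package, for example igraph::graph_from_data_frame(edges_q, directed = FALSE). The same steps would apply to answers with spokesperson and a_loc.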
Question 7 Map Based on Sentiment Scores (6%)
Create two world maps and color countries that are mentioned in questions or answers. The colors indicate
the level of average sentiment scores countries receive from questions or answers. Briefly discuss your findings.
Please note that the named entities for locations from questions or answers are not always countries. You should
specify which named entities are countries. To assist with this, we have provided an additional dataset
containing information on 173 countries around the world and their corresponding geographical data. A
description of this dataset can be found on the last page. Use the following code to load the dataset:
# You can also find full information here:
#https://raw.githubusercontent.com/chenkuangkuang/
#world-countries-geojson/main/allCountriesGeojson.json
library(tidyverse)
url_main <- "https://raw.githubusercontent.com/sscihz/DaSPPA/main/"
url_file <- "MoFA/Data/geo_info.csv"
geo_info <- read_csv(paste0(url_main, url_file))
Hint:
• To calculate the average sentiment score referred to above, assign the sentiment score of a question or answer
to each country mentioned in that question or answer. Then, for each country, add up all of its sentiment
scores and divide that sum by the number of times the country is mentioned in questions
or answers.
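The averaging rule in the hint can be sketched on toy data as below. The sentiment column, the countries column, its comma delimiter, and all values are assumptions for illustration; in the real data the scores would come from your own sentiment analysis and the countries from the named-entity columns.

```r
library(dplyr)
library(tidyr)

# Toy data: one row per question (or answer), with a hypothetical sentiment
# score and a comma-separated list of countries mentioned in it.
toy <- tibble(
  id        = c(1, 2, 3),
  sentiment = c(0.5, -0.2, 0.1),
  countries = c("France, Japan", "Japan", "France, Brazil")
)

avg_sentiment <- toy %>%
  separate_rows(countries, sep = ",\\s*") %>%     # one row per (document, country)
  group_by(countries) %>%
  summarise(
    n_mentions    = n(),                          # times the country is mentioned
    avg_sentiment = sum(sentiment) / n_mentions,  # equivalent to mean(sentiment)
    .groups = "drop"
  )

print(avg_sentiment)
```

The resulting table has one row per country and can be joined to the geographical data by country name before mapping.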
Question 8 (Bonus) (+6% max)
Construct another type of co-occurrence network. For questions, the nodes are countries mentioned in
questions: if two countries are mentioned in the same question, there is a link between the two countries
(nodes). For answers, the nodes are countries mentioned in answers: if two countries are mentioned in the
same answer, there is a link between the two countries (nodes). Plot the networks onto the maps you
drew. Use the latitude and longitude of a country in the dataset to plot the position of a node. Briefly discuss
your findings.
Hint:
• The key challenge might be how to construct the network. In your original data, the dataset looks like:
Doc_id    Question         Countries
001       news outlet_1    country_1, country_2, country_3
and in Q7, you want to transform your data into something like:
Doc_id    Question         Country
001       news outlet_1    country_1
001       news outlet_1    country_2
001       news outlet_1    country_3
then in Q8, you should transform your data into something like:
Country_i    Country_j
country_1    country_2
country_1    country_3
country_2    country_3
• To transform the data structure from Q7 to Q8, you can use the left_join method to join the data to
itself (Q7_data %>% left_join(Q7_data, by = figure it out, relationship = figure it out)). For
more information, please check this website: https://dplyr.tidyverse.org/reference/mutate-joins.html
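The two reshaping steps in the hint can be illustrated on toy data as follows. This is only a sketch of one possible approach, not the expected solution: the object names raw, long, and edges are hypothetical, the comma delimiter is an assumption, and the join arguments shown are one choice that happens to work for this toy example (the relationship argument requires dplyr 1.1.0 or later).

```r
library(dplyr)
library(tidyr)

# Toy data in the "original" shape: one row per document.
raw <- tibble(
  doc_id    = c("001", "002"),
  countries = c("country_1, country_2, country_3", "country_1, country_3")
)

# Q7 shape: one row per (document, country).
long <- raw %>%
  separate_rows(countries, sep = ",\\s*") %>%     # assumed delimiter: comma
  rename(country = countries)

# Q8 shape: join the table to itself within each document, keep each
# unordered pair once (dropping self-pairs), and count co-occurrences.
edges <- long %>%
  left_join(long, by = "doc_id", suffix = c("_i", "_j"),
            relationship = "many-to-many") %>%
  filter(country_i < country_j) %>%
  count(country_i, country_j, name = "weight")

print(edges)
```

Filtering on country_i < country_j removes both the self-pairs and the mirrored duplicates that a self-join produces, so each link appears once with its co-occurrence count.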
Variable Description Table: MoFA
Variable         Description
id_orig          Original ID assigned to collected question-response dyad; ordered by date.
id               ID assigned to question-answer dyad after the multi-topical split; ordered by date and multi-topical split (in a sequence).
link             Link originally used for accessing the transcript of a press conference (some of them do not work anymore).
spokesperson     Spokesperson holding the press conference.
day              Day of the press conference.
month            Month of the press conference.
year             Year of the press conference.
date             Full date of the press conference.
question         Question(s) asked by a member of the press corps.
answer           Answer(s) provided by the spokesperson.
who_asked        Author of the question (news outlet).
who_asked_loc    Country of the news outlet.
q_loc            Named entities for locations extracted from a transcribed question.
q_per            Named entities for persons extracted from a transcribed question.
q_org            Named entities for organizations extracted from a transcribed question.
q_misc           Named entities for miscellaneous references extracted from a transcribed question.
a_loc            Named entities for locations extracted from a transcribed answer.
a_per            Named entities for persons extracted from a transcribed answer.
a_org            Named entities for organizations extracted from a transcribed answer.
a_misc           Named entities for miscellaneous references extracted from a transcribed answer.
Variable Description Table: Geographical Information
Variable         Description
latitude         The north-south position of a location on the Earth’s surface, measured from the equator.
longitude        The east-west position of a location on the Earth’s surface, measured from the Prime Meridian.
region           A categorical variable that identifies the region or continent to which a country belongs.
subregion        A categorical variable that identifies the subregion or subcontinent to which a country belongs.
alpha_3          A categorical variable that represents the ISO 3166-1 alpha-3 code for a country, a three-letter country code assigned by the International Organization for Standardization (ISO).
name             A categorical variable that represents the common or colloquial name of a country.
common_name      A categorical variable that represents other common or colloquial names of a country.
official_name    A categorical variable that represents the official name of a country.