程序代写案例-CSE 482
时间:2021-10-31
CSE 482: Big Data Analysis (Fall 2021) Homework 2
Due date: October 31, 2021
Please make sure you submit the homework via the D2L system. Submit the
written part of your solution in PDF format and the ipython notebooks of your
source code. Try to zip up your solutions into a single file.
1. Download the bestbuy.csv file from the class website. The data set con-
tains a sample of search queries submitted to bestbuy.com web site. The
first line in the file contains the names of each field (column). Write an
ipython notebook (named q1.ipynb) that performs the following:
(a) Read the data and store it into the CSE MySQL database server.
Use the following schema (you do not need to specify a primary key):
user: varchar(50), sku: int, category: varchar(50), query: varchar(100),
click time: varchar(50), query time: varchar(50).
You can also use MySQL datetime format to store click time and
query time if you like.
(b) Execute a query that counts the number of records in the database.
(c) Execute a query that finds the top-5 product sku (i.e., field #2 in
each line) that was most frequently searched by the users.
Submit the ipython notebook, which must contain the answers to your
last queries.
2. Repeat question 1 using mongoDB instead of MySQL as your database
file. To do this, you must convert each line into a JSON file (see Exercise
3). Write an ipython notebook (named q2.ipynb) that performs the three
tasks stated in question 1 above. Submit the ipython notebook, which
must contains the answers to your queries.
3. Consider the following schema for a movie database (the primary keys of
the tables are underlined). Foreign key constraints are also listed under
the schema of each table.
Person (ID, FirstName, LastName, Birthdate)
Movie (ID, Title, ReleaseYear, Director)
Movie(Director) references Person(ID)
Acts (ActorID, MovieID)
Acts(ActorID) references Person(ID)
Acts(MovieID) references Movie(ID)
Express each of the following query in SQL.
(a) Find the year in which the movie entitled “Jurassic Park” was re-
leased.
(b) Find the name of the actors and actresses who have acted in the
movie “Jurassic Park”.
1
(c) Find the titles of movies directed by “Steven Spielberg”.
(d) For each actor or actress, count the number of movies they have
acted.
4. For this question, you will use the same database schema as in the previous
question. Assume that the Director attribute in the Movie table is not
null (i.e., every movie must have a director). Assume the number of rows
in each table are as follows: Person (200 rows), Movie (150 rows), and
Acts (1000 rows). For each of the SQL query given below:
• Explain (in plain English) what the query tries to return.
• How many columns are there in the query result?
• What is the minimum and maximum number of possible rows that
may appear in the query result?
(a) SELECT Title
FROM Movie
WHERE ReleaseYear = 2017;
(b) SELECT P.FirstName, P.LastName
FROM Person P, Movie M;
WHERE P.ID = M.Director AND M.Title = ‘Star Wars’;
(c) SELECT P.LastName, P.FirstName
FROM Person P, Movie M, Acts A
WHERE P.ID = A.ActorID and M.ID = A.MovieID
and M.Director = P.ID;
(d) SELECT DISTINCT P.FirstName, P.LastName
FROM Person P, Acts A1, Acts A2
WHERE P.ID = A1.ActorID and P.ID = A2.ActorID
and A1.MovieID <> A2.MovieID;
5. Consider a continuous attribute X of a data set that takes the values
{x1, x2, · · · , x9} (sorted in increasing order of magnitude). We apply two
methods (equal interval width and equal frequency) to discretize the at-
tribute into 3 bins. The bins obtained are shown below:
Equal Width: {x1, x2, x3}, {x4, x5, x6, x7, x8}, {x9}
Equal Frequency: {x1, x2, x3}, {x4, x5, x6}, {x7, x8, x9}
Explain what will be the effect of applying the following transformations
on each discretization method, i.e., whether the elements assigned to the
bins may change if you discretize the attribute after applying the trans-
formation functions given below. Note that X¯ denotes the mean (average)
value and σx denotes standard deviation of attribute X.
2
(a) X → X − X¯ (i.e., if the attribute values are centered).
(b) X → X−X¯σx (i.e., if the attribute values are standardized).
(c) X → exp
[
X−X¯
σx
]
(i.e., if the values are standardized and exponenti-
ated).
6. Consider the following dataset containing the gender and weight of pa-
tients who visited a clinic.
Patient ID Gender Weight
1 Male 150
2 Male 200
3 Male 280
4 Male 130
5 Male 240
6 Male ?
7 Female 110
8 Female 130
9 Female 120
10 Female 180
11 Female 160
12 Female ?
The data set contains missing values for the weight attribute (denoted as
? in the table). Compare the following three approaches for imputing
missing values in the data:
Approach 1: Discard the missing values.
Approach 2: Replace the missing values by the average weight for all
patients (without missing values) in the table.
Approach 3: Replace the missing values by the average weight for all
patients (without missing values) with the same gender. For example,
you will replace the missing weight value for males by the average
weight of all the males.
(a) What are the weights for patients 6 and 12 using approaches 2 and
3?
(b) Suppose we need to calculate the average weight for all patients.
Which approach, 2 or 3, will have the same average weight as ap-
proach 1?
(c) Which of the three approaches is most reasonable for dealing with
the data set shown above. State your reasons clearly.
(d) Give a scenario in which approach 1 is an acceptable way for dealing
with missing values in data.
3

学霸联盟


























































































































学霸联盟


essay、essay代写