xuebaunion@vip.163.com
3551 Trousdale Rkwy, University Park, Los Angeles, CA
留学生论文指导和课程辅导
无忧GPA:https://www.essaygpa.com
工作时间:全年无休-早上8点到凌晨3点
微信客服:xiaoxionga100
微信客服:ITCS521
DATA MANAGEMENT AND VISUALIZATION ASSIGNMENT 4
UNIVERSITY OF VICTORIA
Due: Thursday, March 30th, 2017 in class. Late assignments will not be accepted.
Submit your answers to the questions below on paper by the deadline
above. Addition- ally, you must upload a copy of all code you use (SQL,
Python, R, Excel, LibreOffice, etc.) to conneX by 11:59pm on March 30th.
All code submissions must be your own work. However, you are permitted
to use the code discussed in lectures (or code posted on conneX) as the
basis for your submissions if proper attribution is given.
Question 1: BC Ferries Data Visualizations [7 marks]
The questions below use the a3_28days.sql dataset from assignment 3. For
all of the visualizations in this section, you may use any software you
want (Python, R, Excel, LibreOffice, etc.), but you must hand in all of
the source files and intermediate data you use. Additionally, you will
lose marks if your visualizations are not well-formatted, with axis
labels, titles, a legend, good choices for axis ranges and histogram
bins (where applicable).
(a) The query below generates the duration of every sailing on route 1.
select (strftime(‘%s’,arrival)-strftime(‘%s’,scheduled_departure))/60 as duration_minutes from sailings where route_number = 1;
Create a histogram of the sailing times for route 1. Choose a set of bins which makes the distribution clear.
(b) Create a histogram of the sailing times for route 30 in the same format as the histogram above.
(c) The query below computes the average sailing time per day in February for each route.
select
route_number,
strftime(‘%d’,scheduled_departure) as day,
avg((strftime(‘%s’,arrival)-strftime(‘%s’,scheduled_departure))/60) as
avg_duration_minutes from sailings
where strftime(‘%m’,scheduled_departure) = ’02’
group by route_number, day
order by route_number, day;
Construct a line plot with the average sailing time per day for each of routes 1, 2 and 30 (on
the same plot). You will receive a zero if more than one plot is used. To separate the data by
route, you will need to either modify the query above (and make three
variations) to produce separate tables for the three routes, or use
np.where to filter the result of the query above.
Question 2: Regular Expressions [6 marks]
Design regular expressions to match the classes of text described below. You will lose marks if
your expressions matches anything not within the described class or if it does not match anything
1
which is. You can use regular parentheses () to delineate groups if
necessary (there is no need to indicate non-capturing groups).
(a) Street addresses, which consist of a number, possibly followed by
exactly one letter (as in ‘1234a Main St.’), followed by a street name,
which must be one word long and capitalized. The street name is then
followed by a capitalized extension, which must end in a period (e.g.
‘St.’, ‘Rd.’, ‘Ave.’, etc.). Any number of spaces (but at least one) may
appear between the number and the street name or the street name and
the extension.
(b) Sequences of lowercase letters (not necessarily forming an English word) which contain the letter ‘A’ at most 3 times.
(c) Sequences of lowercase letters (not necessarily forming an English
word) with an even number of characters (0, 2, 4, 6, etc.) and where
every second character is a vowel, starting with the first character.
The other characters may any letter (including vowels). For example,
sequences like ‘at’, ‘away’, ‘economic’, ‘avocados’ and ‘onomatopoeia’
should match, but sequences like ‘canoe’, ‘bore’, ‘omega’ and ‘avocado’
should not.
Question 3: Technological Literary Analysis and Deconstruction [11 marks]
Project Gutenberg (https://www.gutenberg.org) maintains a large database
of public domain eBooks. Most of the books in their collection are in
the public domain due to their age (since the copyright on literary
works expires after a certain number of years since the death of the
author). These books are useful for analysing the patterns found in
English text.
A sign-up sheet has been posted to conneX, containing a selection of
works from the Project Gutenberg repository. For this question, you must
choose one of the books in the list and reserve it on conneX. Only one
student is allowed to choose each book, and you will receive zero marks
on this question if you analyse a book other than the one you have
reserved on the online system. Since the analysis does not require
actually reading the book1, the length or genre of the book should not
affect the difficulty of this question.
Once you have chosen a book, follow the link from the signup sheet and
download the UTF-8 text version. This will be your input data. You
should not modify this file at all (leave all of the Project Gutenberg
preamble in place).
The purpose of this question is to analyse the length and quantity of
English words used in your chosen book. An “english word” is defined to
be any sequence of characters that meets either of the following
criteria.
• A sequence of one or more letters, which may be uppercase or lowercase in any combination.
• A sequence of one or more letters followed by exactly one hyphen (the
‘-’ character), followed by a sequence of one or more letters. The
letters may be uppercase or lowercase in any
combination.
You can use regular expressions to extract all of the words from the
downloaded text file. If you print all of the extracted words to a text
file (one per line), you can treat the resulting file as a CSV
spreadsheet (with one column) and import it into SQLite (as a table with
one column). You can then perform analysis on the lengths or
frequencies of words and export the resulting data for visualization.
1. Although some of the books are certainly worth reading…
2
In your submission, answer the questions below for your chosen book (and be sure to clearly indicate what book you chose).
(a) Generate a list of the 25 words that appear most frequently, along
with their counts, ranked in descending order. This can be done with a
simple aggregation-based SQL query. The result should be case-sensitive
(so ‘The’ should be considered to be a different word than ‘the’). For
Jane Austen’s Pride and Prejudice (which is not available as a choice
for this question), the 25 most frequently occurring words are as
follows.
word occurrences
———- ———–
the 4218
to 4163
of 3707
and 3493
her 2138
I 2070
a 1955
was 1844
in 1836
that 1540
not 1419
she 1384
it 1301
be 1255
you 1201
his 1190
had 1152
as 1134
he 1104
with 1066
for 1060
is 861
have 835
Mr 786
him 764
(b) Find the longest word in the book. In SQLite, you can use the
length() function to get the length of a text field (for example, select
word, length(word) from word_list; will list each word and its length).
The length function can also be used for aggregation or WHERE clauses.
For Jane Austen’s Pride and Prejudice, the longest word is
‘respectable-looking’
(c) Generate a table consisting of the number of words of each length. For Pride and Prejudice, the table is
word_length words_with_that_length
———– ———————-
1 4809
2 23900
3
3 28961
4 22163
5 12013
6 9284
7 8670
8 5126
9 5163
10 2490
11 1428
12 917
13 432
14 123
15 41
16 17
17 8 19 1
(d) Plot a histogram of the number of words of each length (that is, a
bar chart of the data from part (c). You may use any visualization
software, and your result should look superficially similar to the chart
below.
30000
25000
20000
15000
10000
5000 0
Words in Jane Austen’s “Pride and Prejudice” Length vs. Frequency
1 2 3 4 5 6 7 8 9 10111213141516171819 Word Length (# characters)
Submission Instructions
Your primary submission should be a hard copy document containing the
answers to all of the above questions. In addition to the hard copy, you
must submit all code (and intermediate data such as CSV files or
spreadsheets) to conneX.
The usual rules for electronic submissions apply. You are permitted to
delete and resubmit your assignment as many times as you want before the
due date, but no submissions or resubmissions will be accepted after
the due date has passed.
Ensure that each code file contains a comment with your name and student
number. If you do not submit the code files needed for a part of the
assignment, it will not be possible to mark that part
4
Number of Occurrences
of your submission and you will receive a mark of zero.
After submitting your assignment, conneX will automatically send you a
confirmation email. If you do not receive such an email, your submission
was not received. If you have problems with the submission process,
send an email to the instructor before the due date.