R代写|Sql代写 - IE 332 - Homework
时间:2020-10-24
Read Carefully. Important!
As outlined in the course syllabus this homework is worth 6% of your final grade. The maximum attainable
mark on this homework is 75. As was also outlined in the syllabus, there is a zero tolerance policy for
any form of academic misconduct. The assignment can be done individually or in pairs.
By electronically uploading this assignment to Gradescope/Brightspace you acknowledge these statements
and accept any repercussions if in any violation of ANY Purdue Academic Misconduct policies. You must
upload your homework on time for it to be graded. No late assignments will be accepted. Only the last
uploaded version of your assignment will be graded.
NOTE: You should aim to submit no later than 30 minutes before the deadline, as there could be last
minute network traffic that would cause your assignment to be late, resulting in a grade of zero.
You must use the provided LATEXtemplate on Brightspace to submit your assignment.
Page i of i
IE 332 Homework #2 Due: Oct 26 2020
1. (0 points) These are style points meant to enforce the skill of communicating technical information in
a precise, concise and easily interpretable way. You are penalized for (a) using poor grammar/spelling,
(b) disorganized presentation of solutions (including organizing your code into functions, as appropriate), (c) not commenting well your source code, (d) not using meaningful variable names in your
code. At the discretion of the TA (who should be grading this “hard”). The presumption is that you
do not do any of these things, and so doing them will cost addition points (up to -10). Your goal is
to get 0/0 on this question.
The assignment should have margins between 0.5 and 0.75 inches wide, with font size no larger than
12pt (11pt for code) and no smaller than 10pt - sub/superscripts, figure and plot captions excluded,
but should be clearly legible. Clearly label each question.
If a question requires more than 40% of the page to answer, then that is the only answer on that
page. If multiple pages are required, this rule applies to the last of those pages.
2. The given jobs.csv file contains thousands of historical job postings, where each row (job posting) has
multiple features (columns). The overall goal of this question is to use this historical data to build a
probabilistic model that can in turn be used to create a dataframe of 10,000 realistic but synthesized
job postings, each having the features Level, Salary.Range.From, Salary.Range.To, where:
• Level refers to the seniority of the job
• Salary.Range.From refers to the lower bound salary of this job
• Salary.Range.To refers to the upper bound salary of this job
Each of the questions below must be solved using as few lines of R code as possible.
(a) (5 points) Data Cleaning. Read the csv file into R. Notice that the salary is not reported
consistently across each job posting. Conduct data-cleaning so that the units of time and salary
are standardized to a yearly salary as a full time employee for all postings. There may be
more than one way to handle this data-cleaning problem; justify your approach as to which is
best considering the overall goal of this question.Hint: Check columns Salary.Frequency and
Full.Time.Part.Time.indicator.
(b) (6 points) To gain some insight into the relationship between job seniority level and the upper
limit salary, use at least one statistical method to interpret/analyze and an associated plot of the
relationship. Justify your approach and interpretation of the utility of the results with respect
to the overall goal of the question.
(c) (10 points) Assume both Salary.Range.To and Salary.Range.From follow a normal distribution
conditional on a certain job level. Then, the following code prepares us to simulate job postings
considering the distribution over Level, Salary.Range.From and Salary.Range.To.
##Level $4A,4B,M6$ only have a couple of data points which are not enough to fit a distribution
##combine level 4A, 4B into 4
jobs2$Level[which(jobs2$Level %in% c("4A","4B"))] = "4"
##combine level M6 into M7
jobs2$Level[which(jobs2$Level == "M6")] = "M7"
##calculate marginal distribution on column Level
v.nu.level2 = unclass(as.factor(data.matrix(jobs2$Level)))
level.tbl2 = as.matrix(table(v.nu.level2))
m.level = as.numeric(level.tbl2/sum(level.tbl2))
##Calculate the group mean and sd of lower bound salary on different job levels
agg.from.mean = aggregate(Salary.Range.From ~ Level,jobs2,mean)
agg.from.sd = aggregate(Salary.Range.From ~ Level,jobs2,sd)
##Calculate the group mean and sd of upper bound salary on different job levels
agg.to.mean = aggregate(Salary.Range.To ~ Level,jobs2,mean)
agg.to.sd = aggregate(Salary.Range.To ~ Level,jobs2,sd)
Assuming the above code already exists, write a new function called my.sim that first simulates
a numerical vector of job levels based on its marginal distribution m.level. Then, by iterating
over each of the simulated job levels, simulate the corresponding salary lower and upper bound
using the calculated group mean and standard deviation (MAXIMUM 12 LINES OF R CODE).
(d) (8 points) Use your function from above to simulate 10,000 job postings with features of Level,
Salary.Range.From and Salary.Range.To. To compare how well it can create realistic job postings you will evaluate the result obtained by your simulation function by the following criteria:
• Is the relationship between the simulated job seniority and upper limit salary similar to
what you obtain in part b?
• Do the order between salary lower bound and salary upper bound and the range of each of
them make sense?
Discuss potential concerns you may have with either criteria, and propose good remedy to
alleviate them. Record the number of times the second criteria was violated (before corrections)
as the proportion of rejected samples from the total number of records generated.
(e) (6 points) Notice that the previous function for synthesizing data did not consider the joint
distribution between columns Salary.Range.From and Salary.Range.To. We will now correct this
using the method of inverse sampling, which is one way to sample from the joint distribution.
Inverse sampling uses the fact that if we apply the cumulative distribution function FX of X on
X itself, then the result follows a uniform distribution FX(X) ∼ U(0, 1). Then, if we map the
randomly sampled uniform values U to the inverse of the cumulative distribution function FX1,
we will obtain the random variables which follow FX: FX1(U) ∼ F.
The following code uses inverse sampling to sample from empirical joint distribution of columns
Salary.Range.From and Salary.Range.To. Using this code, simulate 10,000 job records with
features of Level, Salary.Range.From and Salary.Range.To.(Maximum 10 lines of R code).
Hint: You may need to use the marginal distribution on Level: m.level.
Compare this approach to your algorithm in part c by:
1. the time needed to execute them.(Maximum 5 lines of R code) Note, the R package
stats has a function system.time that can be used to record the amount of time a function
takes to execute.
2. the proportion of rejected samples(Maximum 4 lines of R code).
v.char.level = agg.from.mean[,1] #get a unique character set of the job seniority level
##empirical CDF function where we count number of instances fall below or on the upper bound
my_cdf = function(i,df,ubound.grid){
##i: row index of ubound.grid ##df: empirical data
##ubound.grid: grid of upper bounds
count = length(which(df[,1]<=as.numeric(ubound.grid[i,1])
& df[,2]<=as.numeric(ubound.grid[i,2])))
result.cdf = count/nrow(df)
}
IE 332 Homework #2 Page 2 of 5
##function to find the first level of probability which reaches a uniform random variable
bi_find_inverse<-function(u,my.ecdf,ubound.grid){
#u: vector of uniform random variables #my.ecdf: corresponding cdf of simulating grids
#ubound.grid:simulating grids
inds = which(my.ecdf>=u)
if(length(inds)>=1){
ind = inds[1]
result = ubound.grid[ind,]
}else{result = NA}
return(result)
}
##function to conduct inverse sampling
my_inverse_sim <-function(nu.level,n,input.df,step.size = 1000){
#nu.level is the numerical job seniority level #n is the number of simulated records in total
#input.df: the empirical dataframe after data cleaning
#step.size is the increment between two grid points
###first, construct grid of upper bounds
level = v.char.level[nu.level] #get level in character
n1 = round(m.level[nu.level]*n)#get number of records to be simulated based on level
salary.to = input.df$Salary.Range.To[input.df$Level==level]
salary.from = input.df$Salary.Range.From[input.df$Level==level]
from.min = min(salary.from)
from.max = max(salary.from)
from.seq = seq(from.min-step.size,from.max + step.size,by = step.size)
to.min = min(salary.to)
to.max = max(salary.to)
to.seq= seq(to.min-step.size,to.max + step.size,by = step.size)
ubound.grid = expand.grid(from.seq,to.seq)
n.ubound = nrow(ubound.grid)
###then, calculate the empirical cdf based on upper bounds
my.df <- data.frame(x = salary.from, y = salary.to)
my.ecdf = sapply(1:n.ubound,my_cdf,my.df,ubound.grid)
##simulate uniform random variables and find inverse of them based on empirical cdfs
u1 = runif(round(n1))
bi.sim = lapply(u1,bi_find_inverse,my.ecdf,ubound.grid)
##remove any NA values for caution, shouldn’t be any
bi.sim=Filter(Negate(anyNA), bi.sim)
df.bi.sim = do.call(’rbind’, bi.sim)
return(df.bi.sim)
}
IE 332 Homework #2 Page 3 of 5
3. (10 points) The ER diagram below describes the relationships between doctors who prescribe drugs
to their patients and the production of such drugs by pharmaceutical companies and their commercialization by pharmacies. Provide all the CREATE TABLE statements needed to create the database.
IE 332 Homework #2 Page 4 of 5
4. Using the database in the Question 3, provide SQL code to answer for the following questions:
(a) (5 points) What are the drugs prescribed by doctors with more than 15 years of experience?
List the drug’s trade name and the name of the pharmaceutical company that produces it.
(b) (5 points) What are the drugs that have the highest price in the market? List both the trade
name of the drug and the company that produces it and its price. Hint: Take into account that
there might be more than one drug with highest price. You may want to use embedded queries
to answer this item.
(c) (6 points) What are the drugs prescribed to patients (strictly) over 60 years old in quantities
greater or equal than 100mg that have the suffix “in” in their trade name? For each drug, display
its trade name, the company that produces it and the number of patients with the characteristics
above that were prescribe such medication. Order the list by the latter in descending order.
Display only the first 20 results. Note: quantities for all prescriptions are measured in miligrams.
(d) (5 points) What is the name and address of pharmacies that signed at least two contracts with
pharmaceutical companies that produce any drugs except “atorvastatin”? For each pharmacy,
list also the number of pharmaceutical companies it has contracts with the condition above. Sort
the results by the pharmacies with more contracts to the ones with less, breaking ties by the
pharmacy’s name in alphabetical order.
5. Using again the database in Question 3, for each English statement below indicate whether the
associated query is correct. If it is incorrect provide a fix to the query. If it is correct, simplify the
query to achieve the same result but with less complicated SQL code.
(a) (3 points) For any specialty, except for “cardiology”, with more than 10 doctors, list the specialty
name and the total number of prescriptions issued by doctors in such specialty between March
1
st, 2020 and July 31st, 2020.
Hint: You can use BETWEEN to obtain dates between a certain period. Dates in SQL are
displayed in the format ”YYYY-MM-DD”.
SELECT specialty, COUNT(date)
FROM doctor DOC RIGHT JOIN prescribes PB ON DOC.SSN = PB.SSN
WHERE DOC.specialty NOT IN (SELECT * FROM doctor WHERE specialty = "cardiology")
AND DOC.specialty >= 10
GROUP BY specialty
(b) (6 points) Display the name of each pharmaceutical company with 50 contracts or more together
with the number of those contracts that will expire in 2021.
Hint: YEAR() can be used to obtain the year of a date. Ex: YEAR(“2017-06-15”) returns 2017.
SELECT C.company_name, C.COUNT(end_date)
FROM (SELECT company_name, end_date
FROM contracts_with
GROUP BY company_name
HAVING COUNT(pharmacy_name) >= 50
) AS company_date
GROUP BY C.company_name
WHERE YEAR(C.end_date) = 2021