【BUSS5221 - Quiz - Amy】
1. What data is; what the data life cycle is
Introduction to Data
1.1 What is Data
Data is information or facts in a form that can be stored and used.
Data can take many forms (categorical, numerical, ordinal, continuous, interval, ratio) and can be used in a variety of settings, including business, scientific research, production, governance and government, among others (more on these later).
1.2 What is data life cycle
Eight stages (exam point):
1st stage: Generation
We generate data every time we click on a link, comment, like, read, message, review, or answer a survey, for example.
2nd stage: Collection
Not all data that is generated is collected. There may also be a cost-versus-benefit consideration when it comes to collecting the data generated.
3rd stage: Processing (exam point)
• Data cleaning: "fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or
incomplete data within a dataset"
• Data wrangling: the process of transforming and mapping data from one raw form into another so that the data is better suited for various downstream tasks (i.e. tidying it into a standard row-and-column structure).
• Data formatting: depends on what you are formatting the data for.
• Data compression: the process of "modifying, encoding or converting the bits structure of data in such a way that it consumes less space on disk" (see the short sketch after this list).
• Data encryption: converts data from a readable, plaintext format into an unreadable, encoded format. Data encryption is often undertaken to protect the information that is being transmitted.
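As a small illustration of the compression idea from the list above, here is a minimal sketch in Python using the standard-library gzip module (the sample data is made up):

```python
import gzip

# Hypothetical, highly repetitive text data -- compression works well on redundancy.
raw = ("customer_id,purchase_amount\n" + "1042,19.99\n" * 1000).encode("utf-8")

compressed = gzip.compress(raw)
print(len(raw), "bytes raw ->", len(compressed), "bytes compressed")

# Lossless compression: decompressing restores the original bytes exactly.
assert gzip.decompress(compressed) == raw
```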
4th stage: Storage
Common data storage devices include hard disks, USB drives and magnetic tapes. These are now evolving towards optical and DNA-based storage.
5th stage: Management
We would like to store the data in a manner that allows us to access and retrieve it easily.
We may also need to create and use different kinds of metadata (i.e. data about data) to enable
subsequent analysis of these data.
6th stage: Analysis
This stage involves modelling data so that we can scrutinise what we assert (e.g. hypotheses), allowing us to discover useful information, inform conclusions and support decision-making. Data analysis includes a range of computational and statistical techniques for analysing quantitative and qualitative data.
Artificial intelligence (AI), data mining and machine learning are some techniques used to analyse and understand the data at hand. Software such as Excel, SPSS, MATLAB and SAS, as well as programming languages (R, Python, etc.), may also be used in this analysis phase.
7th stage: Visualisation
Visualisations help present the information in such a way that your target audience can easily understand it.
Pie charts, scatter plots, line graphs and timelines are a few examples of graphs that can be used to
visualise data.
8th stage: Interpretation
Bring all aspects of data together and explain to the target audience what we have analysed and
visualised. This is however not the end of the data life cycle. The data life cycle is a continuous loop.
2. Where Data comes from
2.1 Sources of data: primary data & secondary data (exam point)
Primary data
Primary data is new information collected specifically for the task at hand.
Primary data is often collected directly from people of significance to your project, through interviews and focus groups.
The methods of primary data collection will depend on the goals of your research as well as the type and
depth of information required.
Secondary data
Secondary data is often available publicly and has been collected and possibly organised by others.
Secondary data is typically free or inexpensive to obtain.
Secondary data can be a useful source of data provided that you are aware of its reliability and relevance
to your task at hand.
3. The difference between qualitative and quantitative types of datasets
3.1 Types of data
Qualitative data: data that is difficult to reduce to just numbers. Qualitative data tends to answer questions about the 'what', 'how' and 'why' of a phenomenon (non-numerical data).
Quantitative data: any information that can be reduced to a set of numbers. Quantitative data is the information from which averages, differences and so on can be calculated (numerical data).
• Nominal - data that is simply for labelling variables. The name of a category. They have no
meaningful order and no hierarchy. For instance gender or occupation are nominal level values
• Ordinal - data that is placed into some kind of order by its position on a scale. This refers to a sequence or data that has some kind of order. For instance, you came first, second or third in a unit of study, or your economic status is low, medium or high.
• Discrete - data that is a count (that involves integers, the values cannot be subdivided). For example
the number of students at the university is discrete data. We can only count whole individuals and
cannot divide you into parts.
• Continuous - data is considered the opposite of discrete data. It is data that can be meaningfully
divided into smaller parts and can be measured on a scale or continuum. For example it might be
your height which we can measure very precisely - metre, centimeter, millimeter. Common sense
does prevail and a good rule of thumb is to ensure that the continuous data still makes sense.
o Interval - classifies and orders the measurements, where the distances between values on the scale (for example temperature or time) are meaningful and equal.
▪ There is no true zero point or fixed beginning (values can go below zero). You cannot calculate ratios.
▪ Time of day, in the sense of a 12-hour clock.
▪ Temperature, in degrees Fahrenheit or Celsius (but not Kelvin, which has an absolute zero).
▪ IQ test (intelligence scale).
▪ Test scores such as the SAT and ACT test scores.
▪ Dates (1015, 1442, 1726, etc.)
o Ratio - shows us the order and exact value between the units. Ratios have an absolute zero
that enables us to do some advanced statistical analysis. For example height and weight.
▪ Ratio scales have an absolute zero (values cannot go below zero), which allows us to multiply and to calculate ratios.
▪ Age, Money, Weight
▪ The Kelvin scale: 50 K is twice as hot as 25 K.
▪ Income earned in a month.
▪ The number of children
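To make the interval-versus-ratio distinction concrete, here is a minimal Python sketch (the temperatures are made up) showing why ratios are meaningful on the Kelvin scale but not on the Celsius scale:

```python
# Celsius is an interval scale: equal differences are meaningful, but there is no true zero.
celsius_a, celsius_b = 25.0, 50.0
print(celsius_b / celsius_a)   # 2.0, but 50 degrees C is NOT "twice as hot" as 25 degrees C

# Kelvin is a ratio scale: it has an absolute zero, so ratios are meaningful.
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15
print(kelvin_b / kelvin_a)     # about 1.08 -- the physically meaningful ratio
```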
3.2 Application of data to different scenarios
Possible exam format: given a scenario, identify the data type.
Scenario 1 How many cars do each household have?
Data type Quantitative data.
We can have 1, 2, 3, 4 etc cars in a household but not 1.6, 2.8.
So, the best data type is Discrete.
Scenario 2 A survey of a hospital to understand patient care in that hospital.
The survey will have a series of questions, each with responses on a scale of 1 to 5:
1 = Unsatisfied; 2 = Quite Unsatisfied; 3 = Neither Satisfied nor Dissatisfied;
4 = Quite Satisfied; 5 = Satisfied.
Data type Qualitative data.
There are meanings attached to the numbers with 1 being unsatisfied and 5 being Satisfied.
So, the best data type is Ordinal.
Scenario 3 the demographics of a group of respondents.
1) What is your gender? 1=Male 2=Female 3=Prefer not to say
2) How do you identify yourself? 1=Asian 2 = Caucasian 3=Indian 4=African 5=Other.
Data type Qualitative data.
There are no meanings attached to the numbers, with either 1 or 2 able to be used to represent a male respondent. That is, 1=Male 2=Female 3=Prefer not to say OR 1=Female 2=Male 3=Prefer not to say are both acceptable codings for the survey question.
Therefore the best data type is Nominal.
Scenario 4 How the average temperature in Australia has changed over the years 2000 to 2020.
Data type Quantitative data.
As we will need to manipulate the data by taking differences between, say, the average
temperature in 2000 and 2001 to see the change. We also know that the interval in duration
between 2000 and 2001 is the same as the interval in duration between 2019 and 2020.
There can also be a negative change between one year and the next.
So, the best data type is Interval.
Scenario 5 Income of people in different suburbs in Sydney and the socio-economic make-up of Sydney.
for example, the average income in Sydney City is $90,000 whereas the average income
in Mosman is $180,000.
Data type Quantitative data.
As we need to manipulate the data by taking differences between, say, the average salary of someone in Mosman versus someone in Sydney City. We also know the average salary in Mosman is twice that of someone in Sydney City. That is, we can use multiplication and division in addition to manipulating the data using addition and subtraction.
So, the best data type is Ratio.
Practice:
The possible responses to the question “How long have you been living at your current unit/house?” are values
from a continuous variable.
The possible responses to the question “How many times in the past three months have you visited a gym?”
are values from a discrete variable.
The amount of coffee/tea consumed by an individual in a day is an example of a discrete numerical variable. False (the amount consumed is continuous).
What level of measurement allows for the ranking of data, a precise difference between units of measure, and
also includes a true zero? Ratio
If "1"= code for Male, "2"= code for Female, what type of data are 1 and 2? Nominal
If John is 30 years old and Paul is twice as old as John, how old will Paul be in 10 years? The most appropriate data type to represent the result of this question: Ratio
4. Identification of data sources and their quality
4.1 Quality of Data
Quality measurement for statistical outputs is concerned with "providing the user with sufficient information to judge whether or not the data are of sufficient quality for their intended use(s)". It covers six dimensions: accuracy, relevance, accessibility and clarity, timeliness and punctuality, coherence, and comparability.
• Relevance: Statistics are relevant when they meet users' needs. Relevance requires the identification of users and their expectations.
• Accuracy: defined as the closeness between the estimated value and the (unknown) true value.
• Timeliness and punctuality: Statistics are only useful when the figures are up-to-date and published on time at pre-established dates.
• Accessibility and clarity: Data have most value when they are easily accessible by users, are available in the form users desire, and are adequately documented ("metadata" according to the type of user). Assistance in using and interpreting figures should be part of the providers' tasks.
• Comparability: Data are most useful when they enable reliable comparisons across space (such as countries or regions) and over time.
• Coherence: When originating from a single source, statistics are coherent in as much as elementary concepts can be combined reliably in more complex ways. When originating from different sources, e.g. from different surveys with differing frequencies, statistics are coherent insofar as they are based on common definitions, classifications and methodological standards.
4.2 A population and a sample
Population: A population includes all members from a group that is of interest in a study. The
characteristics and size of a population will depend on the scope of the study.
For example, if you wish to study if a particular cancer treatment drug has side effects,
then you will administer the drug to cancer patients. The cancer patients will be your
population.
However, it may not be possible to test the cancer treatment drug on all patients having
cancer treatments as you may be constrained by geographical distance, financial
resources, and restrictions posed about storing the cancer treatment drug.
Therefore you take a representative sample of patients undergoing cancer treatments.
Sample: A sample is thus a subset of the population. The sample is the group of participants who were actually included in the study.

5. Limitations of data
• The data could be incomplete
• Data may not always be accurate
• Data may be in different formats and of varying quality
How does data contribute to analysis & uncertainty
6. Data interoperability
6.1 Uncertainty
Data uncertainty is the degree to which data is inaccurate, imprecise, untrusted and unknown.
Uncertainty may result due to the following reasons:
• Source
• Data lineage (i.e. the data has been changed at some point in its processing history)
• Noise
• Abnormalities (anomalies in the data)
• Inherent uncertainty
• Precision
• Ambiguity (unclear meaning)
6.2 Data interoperability
Data interoperability is a way of managing data uncertainty.
Data interoperability is a characteristic of a product or system whose interfaces are completely understood, allowing it to work with other products or systems, present or future, in either implementation or access, without any restrictions.
Data interoperability thus enables systems and services that create, exchange and consume data to have clear,
shared expectations for the contents, context and meaning of that data. Interoperability thus facilitates the
sharing of data in varying formats easily.
Data interoperability includes many data management activities: data cleaning, data coupling, data fusion, data mapping, and information extraction.
7. Data cleaning, coupling, fusion, mapping, and information extraction
7.1 Data cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
When combining multiple data sources, there are many opportunities for data to be duplicated or mislabelled. If the data is incorrect, results and algorithms will be inaccurate, even though they may look correct.
➢ A process of cleaning data (a pandas sketch of these stages follows the list):
Stage 1: Remove duplicate or inappropriate instances.
Stage 2: Fix structural errors. (Structural errors are when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For example, you may find "N/A" and "Not Applicable" both appear, but they should be analyzed as the same category.)
Stage 3: Filter unwanted outliers.
Stage 4: Handle missing data.
Stage 5: Validate and quality-assure. (Does the data make sense? Does the data follow the appropriate rules for its field? Does it prove or disprove your working theory, or bring any insight to light?)
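The sketch below (Python, assuming pandas is available) walks through the five stages on a small made-up dataset; the column names and the outlier threshold are purely illustrative:

```python
import pandas as pd

# Stage 0: a small, deliberately messy dataset (made up for illustration).
df = pd.DataFrame({
    "company": ["A", "A", "b ", "C", "C", "D"],
    "status": ["N/A", "N/A", "OK", "OK", "OK", None],
    "delivery_minutes": [22, 22, 18, 70, 25, None],
})

# Stage 1: remove duplicate or inappropriate instances.
df = df.drop_duplicates()

# Stage 2: fix structural errors (whitespace, capitalisation, inconsistent labels).
df["company"] = df["company"].str.strip().str.upper()
df["status"] = df["status"].replace({"N/A": "Not Applicable"})

# Stage 3: filter unwanted outliers (the 60-minute threshold is illustrative only).
df = df[df["delivery_minutes"].isna() | (df["delivery_minutes"] < 60)]

# Stage 4: handle missing data (here we simply drop incomplete rows).
df = df.dropna()

# Stage 5: validate -- does the remaining data make sense and follow the rules for its field?
assert df["delivery_minutes"].between(0, 60).all()
print(df)
```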
7.2 Data coupling
Data coupling is when two modules interact with each other through passing data and data structure. It is best
to keep two modules loosely coupled so that the fault in one module does not affect another.
7.3 Data fusion
Data fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful
information than that provided by any individual data source.
Data fusion refers to gathering different kinds of information together into a procedure yielding a single model.
7.4 Data mapping
Data mapping is an essential part of many data management processes. If not properly mapped, data may
become corrupted as it moves to its destination. Data mapping is the process of matching fields (e.g. First name,
Last name, Gender etc) from one database to another. It’s the first step to facilitate data migration, data
integration, and other data management tasks.
For example, the state field in a source system may show Queensland as “Queensland,” but the destination
may store it as “QLD.” Data mapping bridges the differences between two systems, or data models, so that
when data is moved from a source, it is accurate and usable at the destination.
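A minimal sketch of this idea in Python/pandas (the field names, source records and destination conventions are illustrative, echoing the Queensland example above):

```python
import pandas as pd

# Records as the hypothetical source system stores them.
source = pd.DataFrame({
    "First name": ["Jane", "Tom"],
    "Last name": ["Lee", "Nguyen"],
    "State": ["Queensland", "New South Wales"],
})

# Field mapping: source column name -> destination column name.
field_map = {"First name": "first_name", "Last name": "last_name", "State": "state_code"}

# Value mapping: how the destination system stores each state.
state_map = {"Queensland": "QLD", "New South Wales": "NSW"}

destination = source.rename(columns=field_map)
destination["state_code"] = destination["state_code"].map(state_map)
print(destination)   # data arrives accurate and usable at the destination
```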
7.5 Information extraction
Information extraction (IE) is the automated retrieval of specific information related to a selected topic from a body or bodies of text.
Information extraction tools make it possible to pull information from text documents, databases, websites, or
multiple sources.
Practice:
When you combine multiple datasets, is it true or false that a common human error is that data can be duplicated? True
In the following list of processes, eliminate the one that is not useful for cleaning data: Rewriting a research question
Which of the following definitions matches data fusion? Producing a dataset that is more consolidated, consistent and accurate than having multiple datasets

8. Accuracy of data, scalability of data, type of data, configuring data and integrating data
8.1 Accuracy of data
Data accuracy is one of the components of data quality.
It refers to whether the data values stored for an object are the correct values.
To be correct, data values must be the right value and must be represented in a consistent and unambiguous
form.
(exam point)
Practice:
Data accuracy means error-free records that can be used as a reliable source of information. True
From the following statements, identify the one that is not a cause of data inaccuracy: Good data entry practices
8.2 Scalability of data
Managing the scalability of data helps manage data uncertainty: it means ensuring that a database tool (a tool that stores and manages data, such as MySQL) can handle large datasets without creating data uncertainties.
A scalable data platform allows for rapid changes in the growth of data, either in traffic or volume. A scalable
data platform prepares a company for the potential of growth in its data needs.
8.4 Coding data
Data uncertainty can also be managed through coding data correctly based on its type. Code is a set of
instructions to extract meaningful information from data.
8.5 Integrating data
Data integration can be described as the process of combining data from different sources into a single
cohesive view. We begin with the intake process and this can include steps such as data cleaning, mapping
and transformation.
There tends not to be a universal approach to data integration but typically it may involve common factors
that include a network of data sources, a master server and staff/clients accessing data from the master
server. What may occur is a request is sent to the master server for data, the master server collates the
data from internal and external sources, the data is obtained from the sources and then consolidated into
a single comprehensive data set which can then be acted upon.
This can be achieved through a variety of data integration techniques (a small sketch of the first approach follows the list), including:
• Extract, Transform and Load: copies of datasets from several sources are gathered together,
harmonized, and loaded into a data warehouse or database
• Extract, Load and Transform: data is loaded as is into a big data system and transformed at a later
time depending on user needs
• Change Data Capture: identifies data changes in databases in real time and applies them to repositories of data
• Data Replication: data in one database is replicated to other databases to keep the information synchronized for operational uses and for backup
• Data Virtualization: data from different systems are virtually combined to create a unified view rather
than loading data into a new repository
• Streaming Data Integration: a real-time data integration method in which different streams of data
are continuously integrated and fed into analytics systems and data stores
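As a hedged sketch of the first technique, here is a tiny Extract-Transform-Load style example in Python/pandas; the two "sources" are made-up in-memory tables standing in for real systems, and the "warehouse" is just a combined DataFrame:

```python
import pandas as pd

# Extract: copies of datasets from two hypothetical sources.
crm_sales = pd.DataFrame({"customer": ["Ann", "Bo"], "amount": [120.0, 80.0], "state": ["QLD", "NSW"]})
store_sales = pd.DataFrame({"customer": ["Cy"], "amount": [55.0], "state": ["Queensland"]})

# Transform: harmonise differing conventions before combining.
store_sales["state"] = store_sales["state"].replace({"Queensland": "QLD", "New South Wales": "NSW"})

# Load: consolidate into a single comprehensive dataset (our stand-in "warehouse").
warehouse = pd.concat([crm_sales, store_sales], ignore_index=True)
print(warehouse)
```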
9. The use of probability distributions to manage uncertainty
9.1 Probability Distributions
A probability distribution is a statistical function that describes all the possible values and likelihoods
that a (random) variable can take within a given range. This range will usually be between the minimum
and maximum possible values the variable can take (Hayes, 2020).
Two types of probability distributions: discrete distributions and continuous distributions.
9.1.1 Discrete distributions
With a discrete probability distribution, each possible value of the discrete random variable can be
associated with a non-zero probability. Thus, a discrete probability distribution is often presented in
tabular form.
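A discrete distribution really is just such a table. The short Python sketch below stores a hypothetical distribution (number of late deliveries in a day) as value-probability pairs, checks that the probabilities sum to one, and computes the expected value:

```python
# Hypothetical discrete distribution: number of late deliveries in a day -> probability.
distribution = {0: 0.50, 1: 0.30, 2: 0.15, 3: 0.05}

# The probabilities of all possible values must sum to 1.
assert abs(sum(distribution.values()) - 1.0) < 1e-9

# Expected value: probability-weighted average of the possible values.
expected_value = sum(value * prob for value, prob in distribution.items())
print(expected_value)   # 0.75 late deliveries per day on average
```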
9.1.2 Continuous distributions
A continuous probability distribution describes the probabilities of the possible values of a continuous random
variable. A continuous random variable is a random variable with a set of possible values (known as the range)
that is infinite and uncountable
There are many kinds of continuous distributions. Of these, the normal distribution is a continuous probability
distribution that is symmetrical on both sides of the mean, so the right side of the centre is a mirror image of the
left side. The normal distribution is often called the bell curve because the graph of its probability density looks
like a bell. Many things are normally distributed, or very close to it. For example, height and intelligence are
approximately normally distributed. If you have a normal distribution, that means that you have likely captured
representative data that is not skewed.
The normal distribution is the most used statistical distribution. Normality arises naturally in many physical,
biological, and social measurement situations. A normal distribution suggests that your data is not biased. For example, if you obtain the weights of a group of men and discover they are normally distributed, it means your sample is not dominated by men who are too underweight or too overweight.
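A small sketch of working with the normal distribution in Python (assuming SciPy is available; the mean and standard deviation for height are made-up illustrative values):

```python
from scipy.stats import norm

# Hypothetical: adult heights approximately normal with mean 170 cm and sd 10 cm.
height = norm(loc=170, scale=10)

# Probability of a height between 160 cm and 180 cm (within one standard deviation of the mean).
p_within_one_sd = height.cdf(180) - height.cdf(160)
print(round(p_within_one_sd, 3))   # about 0.683 -- the familiar symmetry of the bell curve
```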
9.2 Managing uncertainty through probability distributions
Some examples:
• A HR manager who uses probabilities might examine the data about where a company’s best
people come from and how they perform throughout their career to identify new sources of talent
that may have not been approached before
• A sales professional who uses probabilities might not just look at increasing his/her sales but
rather see where his/her customers are coming from and who is providing these leads.
• A banker lending money to people may look at the data more deeply to highlight low-risk customers and offer them flexible credit policies, rather than simply providing inflexible credit policies to higher-risk customers.
10. The use of decision trees
10.1 What is a decision tree (what it is)
Decision trees are an efficient method to make decisions or solve problems under uncertainty in order to
evaluate each choice based on the outcome.
The values of each choice are determined based on the probability value assigned to each expected outcome.
We use decision trees to help explain and find an answer to a complex problem.
10.2 The basics of a decision tree (why it is used)
A decision tree is a graphical depiction of a decision and every potential outcome or result of making that
decision. We use decision trees in a variety of circumstances, from something simple and personal ("Should I
go out to see a movie?") to more complex industrial, scientific, or microeconomic actions.
Decision trees give us an effective and easy way to imagine and understand the potential possibilities of a
decision and its range of possible outcomes by utilising a series of steps. The decision tree also helps people
identify every potential option and consider each course of action against the risks and rewards each option
can produce.
Decision trees can be used as a decision support system by organisations. The prepared model allows the reader of the chart to see how and why one choice may lead to the next, with the branches indicating mutually exclusive options. The structure allows users to take a problem with multiple possible solutions and to
display those solutions in a simple, easy-to-understand format that also shows the relationship between different
events or decisions.
Each end result of the decision tree has a designated risk and reward weight or number. If we use a decision
tree to decide, we look at each final outcome and assess its benefits and shortcomings. The tree itself can extend as far as required in order to reach an appropriate conclusion.
10.3 How to construct a decision tree (how to do it)
To construct a decision tree, you must start with a specific decision that needs to be made. You can draw a
small rectangle at the far left of the subsequent tree to represent the initial decision. You can then draw lines
outward from the rectangle; each line moves from left to right, and each represents a potential option.
Alternatively, you can start with a rectangle at the top of a page or screen and draw the lines downward. There
are also free online sites that you can use to prepare decision trees.
At the end of each decision point, you evaluate the results. If the result of an option is a new decision, draw a
rectangle at the end of that line, and then draw new lines out of that decision, representing the new options, and
labelling them accordingly.
If the result of an option is unclear, draw a circle at the end of the line, which denotes potential risk. If an option
results in a decision, leave that line blank. You continue to expand until every line reaches an end point, meaning
you've covered every choice or outcome. Draw a triangle to indicate the end point.
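As a minimal sketch of evaluating a decision tree by expected value in Python (the options, probabilities and payoffs are entirely hypothetical):

```python
# Each option leads to a chance node: a list of (probability, payoff) outcomes.
decision_tree = {
    "launch new product": [(0.4, 120_000), (0.6, -30_000)],
    "improve existing product": [(0.7, 40_000), (0.3, 5_000)],
}

# Expected value of each option = sum of probability-weighted payoffs.
expected_values = {
    option: sum(p * payoff for p, payoff in outcomes)
    for option, outcomes in decision_tree.items()
}

best_option = max(expected_values, key=expected_values.get)
print(expected_values)   # {'launch new product': 30000.0, 'improve existing product': 29500.0}
print(best_option)       # the option with the highest expected value
```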
10.4 Advantages & disadvantages of decision tree (Critical thinking)
➢ Advantages
• Decision trees are easy to construct.
• Decision trees perform classification without requiring much calculation.
• Decision trees are capable of handling both continuous and categorical variables.
➢ Disadvantages
• Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a
continuous variable.
• Decision trees are often very simplistic and may not capture the full scope of decisions
11. Forecasting as a tool to manage uncertainty
Forecasting is yet another way of managing uncertainty in both data and in the real world. Unlike a prediction, a
forecast must have logic to it. That’s what lifts forecasting from superstition. The person who is forecasting must
be able to state and defend that logic.
How are data presented and consumed
1.1 Storytelling
1.2 Exploratory Data Analysis (EDA)
1.3 Describing different data types correctly
There is no one right answer in data analysis.
1.1.1 Consider your audience
Elements of all great stories
When thinking about data analysis, we should think in terms of telling a story.
➢ The following are the elements of all great stories:
• Consider your audience (who are you presenting your analysis to)
• Establish a setting (clearly identify the problem you are trying to solve)
• Define the characters (describe the data that you are analysing)
• Establish the conflict (conduct the required analysis to provide information about the
problem)
• Resolve the conflict (interpret the results of your analysis, tell your audience why your
analysis helps solve the problem)
Things you need to consider
➢ In terms of the audience things you need to consider are:
• Are the audience technical or non-technical people?
• What is the audience wanting to use analysis for?
• How much time does the audience have?
• What might the audience already know versus what they need to know?
Exploratory Data Analysis (EDA):
• starts with the question: can I describe how my data is behaving?
• is not a formal process with a rule book for you to follow; rather, it is just thinking broadly about the data to build an understanding of it in the early stage of data analysis.
• is a creative process and, like most creative processes, the key is to start by asking questions.
Importantly, descriptive statistics do not allow us to make conclusions beyond the data we have analysed or
reach conclusions regarding any hypotheses we might have made (to do that we would need to use inferential
statistics). Descriptive statistics are simply a way to describe the data that we currently have.
1.3.1 Describing categorical data
• Mode (Excel: =MODE): what ranking is most common for each company.
• Frequency (Excel: =COUNTIF): how often each company is ranked best, second best, third best, fourth best, and worst.
• Percentage: the number of times one particular observation (each different rank) occurs, divided by the total number of observations in the data (total rankings per company).
• Chart: see the distribution (a distribution shows the possible values for a variable and how often they occur).
• Using the mode alone to reach a conclusion is a mistake (a pandas sketch of these measures follows these notes).
• Firstly, the mode is only one source of information. It only tells us which category occurs the most and gives us no information on what else is occurring in the data.
• Secondly, different software can calculate statistics in different ways. Excel can only show one value inside a cell (so if there are several modes, only one is displayed).
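A pandas sketch of these measures (the rankings below are made up; Excel's =MODE and =COUNTIF are mirrored by .mode() and .value_counts()):

```python
import pandas as pd

# Hypothetical rankings given to one company by survey respondents (1 = best, 5 = worst).
rankings = pd.Series([1, 2, 2, 3, 2, 1, 4, 2, 5, 3])

print(rankings.mode())                         # mode: the most common ranking (2)
print(rankings.value_counts())                 # frequency: how often each ranking occurs
print(rankings.value_counts(normalize=True))   # percentage: each frequency divided by the total

# Chart: a bar chart of the frequencies shows the distribution (requires matplotlib).
# rankings.value_counts().sort_index().plot(kind="bar")
```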
Example:
Which company is best? C
True or false, Company A is better than Company B? F
Using the frequency distributions for the rankings of each company, which would you say is best? B
1.3.2 Describing continuous data
Example: USYD graduates each provide their annual salary one year after graduating from the university.
• Average (Excel: =AVERAGE): x̄ = Σxᵢ / n
• Variance (Excel: =VAR): s² = Σ(xᵢ − x̄)² / (n − 1)
• Standard Deviation (Excel: =STDEV): s = √s²
(A Python equivalent is sketched below.)
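An equivalent sketch in Python/pandas (the salaries are hypothetical; pandas uses the sample formulas with n - 1 in the denominator, matching Excel's =VAR and =STDEV):

```python
import pandas as pd

# Hypothetical annual salaries (AUD) reported one year after graduation.
salaries = pd.Series([62_000, 58_000, 71_000, 65_000, 90_000, 60_000])

print(salaries.mean())   # average                    (Excel: =AVERAGE)
print(salaries.var())    # sample variance            (Excel: =VAR)
print(salaries.std())    # sample standard deviation  (Excel: =STDEV)
```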
Example:
Which company delivers their pizza the fastest, on average? C
Using the standard deviation, which company delivers pizza closest to the average (i.e. most
consistently)? E
Using both the average and the standard deviation information shown, which pizza company do you
think is best? B
Which two companies have the most inconsistent delivery times? Select all that apply. CD
1.3.3 Outliers and boxplots
• Outliers: data points that deviate so far from the other observations that they become suspicious or can create noise in the data (unusual values are often encountered during data analysis).
• Boxplots: the easiest way to spot outliers is to visualise the data with a boxplot.
The circled "dot" in the boxplot represents an outlier. There is a pizza delivery time for Company C which is just over 70 minutes; this is very unusual given that the bulk of the other data is roughly between 10 and 30 minutes.
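A sketch of spotting such an outlier in Python (the delivery times are made up to echo the Company C example; the 1.5 x IQR rule used here is the common convention behind boxplot whiskers, and the plotting line assumes matplotlib is available):

```python
import pandas as pd

# Hypothetical delivery times (minutes) for Company C, including one unusual value.
times = pd.Series([18, 22, 25, 19, 27, 30, 24, 21, 71])

# Flag outliers with the 1.5 * IQR rule (the same rule boxplot whiskers are based on).
q1, q3 = times.quantile(0.25), times.quantile(0.75)
iqr = q3 - q1
outliers = times[(times < q1 - 1.5 * iqr) | (times > q3 + 1.5 * iqr)]
print(outliers)   # the 71-minute delivery stands out from the 10-30 minute bulk

# Visualise the same thing with a boxplot (requires matplotlib):
# times.plot(kind="box")
```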
1.3.4 Data distribution
Any continuous data can be divided into categories, from which a frequency distribution can be created.
We can examine data by looking at how it is distributed.
1.3.5 Flaw of Averages
• Company D has a 35% chance of delivering very quickly and a 65% chance of delivering very slowly. It is therefore difficult to assess Company D's delivery performance using the average alone; the distribution needs to be analysed instead.
➢ Median【=MEDIAN() in Excel】
The "middle" value in the list of numbers, or the exact halfway point between 50% of the smallest numbers
in the range and the 50% of the biggest numbers in the range
If your data is skewed you might want to think about reporting the median, as it will be a better
representation of where the middle of the distribution is.
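A quick Python illustration of why the median can be the better "middle" for skewed data (the incomes are hypothetical, echoing the positively skewed income example in the next subsection):

```python
import pandas as pd

# Hypothetical incomes: mostly modest values plus one very large one (positive skew).
incomes = pd.Series([45_000, 50_000, 52_000, 55_000, 60_000, 1_000_000])

print(round(incomes.mean()))   # 210333 -- dragged up by the single large value
print(incomes.median())        # 53500.0 -- a better picture of a "typical" income (Excel: =MEDIAN)
```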
1.3.6 Positive, negative and symmetric distribution
Symmetrical
• Lots of data follow a symmetrical distribution. In such a distribution the average, the mode, and the
median are all approximately the same value. If the data is symmetrical, then the average is the correct
way of describing the middle point.
Asymmetrical
• A positive distribution is one where there are more observations (data points) to the right-hand (or
positive) side of the distribution. As a result, the average is dragged up towards those larger numbers,
moving away from the bulk of the data. In other words, the average becomes biased. Income is almost
always positively skewed: those billionaires really drag the average income away from the rest of us!
• A negative distribution is one where there are more observations (data points) to the left-hand (or
negative) side of the distribution. As a result, the average is dragged down towards those smaller
numbers, moving away from the bulk of the data. In other words, the average becomes biased. The
number of legs a human has is negatively skewed (if you think carefully about it, technically the average human has slightly fewer than two legs!)
1.4 Population vs sample
In data analysis, it is very rare to study the entire population; rather, we collect a sample of data. Given that
you are never studying the full population, you also need to think critically about the sample you are
analysing and ask if it represents the population that you are trying to understand. Important questions
include:
• who was sampled?
• how were they sampled?
• for what purpose were they sampled?
1.5 Exploratory analysis with descriptive statistics
1.6 Data visualization overview
2.1 How data contributes to analysis
Models, relationships and hypothesis testing
Data analysis is the process of collecting and organising data in such a manner, so as to draw useful conclusions
from it. Data analysis uses a mixture of analytical and logical reasoning to gain information from the data.
There are 4 types of data analytics (exam point: given a scenario, identify the type):
Descriptive Analytics:
• What happened?
• Details about events that occurred in the past
• Summarize data into a format that humans can understand
• Examine past events or historical data so that better strategies
can be framed for the future
Diagnostic Analytics:
• Why did it happen?
• Sometimes performed under the category of descriptive analytics
• Deeper investigation of an issue at hand to arrive at the source of the
problem.
• Discover associations and patterns that were not previously obvious
Predictive Analytics:
• What will happen?
• Predictive analytics is the use of data, statistical algorithms and machine-learning techniques to identify the likelihood of future outcomes based on historical data.
Recall lessons: Importance of context and the big picture
Prescriptive Analytics:
• What should we do?
• Focuses on finding the best course of action in a scenario given the available data
Example:
What type of analytics is being discussed? Descriptive analytics

2.2 Statistical variables and models
Statistical modelling can be defined as:
"A simplified, mathematically-formalized way to approximate reality (i.e. what generates your data) and make predictions from this approximation. The statistical model is the mathematical equation that is used."
A statistical model almost always has explanatory and dependent variables.
Dependent variable: the variable that we seek to describe/explain/predict.
Explanatory variables (independent variables): the independent variables describe or predict the dependent variable(s).
Correlation does not 'necessarily' imply causation.
Quality of insights matters more than quantity.
2.3 Different statistical relationships
There may be one or many dependent and explanatory variables. These variables may also be quantitative or qualitative. Thus, a statistical model can be adapted easily to different situations.
Example
You look up the literature and notice that there is no study that has examined the colour red and cognitive test
performance.
You hypothesise that if you use a red font for an exam, students will perform more poorly compared to using another colour. Your study sample includes students of different genders, age groups and races, with one group undertaking a postgraduate unit (Group A) and another group undertaking an undergraduate unit (Group B).
You divide the sample into two groups for each unit: the postgraduate unit into Groups A1 and A2, and the undergraduate unit into Groups B1 and B2. For Groups A1 and B1, you provide an end-of-semester exam paper in black ink. For
Groups A2 and B2, you provide an end of semester exam paper in red ink.
The papers are then graded. After running a statistical test, you conclude that the colour red and poor grades are correlated (that is, students receive a lower mark when they have an exam paper in red ink).
From that example then, this is what we can list as the research objective, the research question, the sample,
description of how the sample was collected, each variable and the test used:
You do not need to know all the different tests but it is important that you make the right
choices to test any hypotheses that you have developed for your project and/or choose
the most appropriate analysis for the project that you have proposed. It is also ok to focus
on parametric tests but you are expected to understand when it is appropriate to use parametric
and non-parametric tests.
Research objective: To study whether the colour of text used in an exam influences the exam score.
Research question: Does the colour of the text used in an exam influence grades?
Sample: As described in the text above.
Sample collection: Undergraduate students in Unit 10XX and Postgraduate students in Unit 50XX at the university were sampled.
Independent variables: Colour of text (1=Black; 2=Red), a qualitative categorical nominal variable; gender (...; 3=Not Specified); age group (1=18-22; 2=22-26; 3=26-30; 4=>30); race (...; 2=Asian; 3=African; 4=Other); and level of study, i.e. undertaking a postgraduate unit (Group A) or an undergraduate unit (Group B) (1=Postgraduate; 2=Undergraduate).
Dependent variable: Grades obtained for the course.
Test: Simple regression analysis.
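As a hedged sketch of what the "simple regression analysis" step might look like in Python (assuming SciPy is available; the grades are invented, and the text-colour coding 1=Black, 2=Red follows the table above):

```python
from scipy.stats import linregress

# Hypothetical data: text colour code (1 = Black, 2 = Red) and the grade obtained.
colour = [1, 1, 1, 1, 2, 2, 2, 2]
grade = [72, 68, 75, 70, 61, 64, 58, 66]

result = linregress(colour, grade)
print(result.slope)    # estimated change in grade when moving from black (1) to red (2) text
print(result.pvalue)   # p-value for the null hypothesis that colour and grade are unrelated
```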
2.4 Causation vs. correlation
A t-test can compare at most two samples; for more than two samples, analysis of variance is needed. Analysis of Variance (ANOVA) is used to test the significance of differences among the means of two or more samples.
2.4.1 Correlation
Correlation is a statistical measure, so it is expressed as a number that describes the size and direction of the relationship between two (or more) variables. However, this does not mean that the change in one variable is the cause of the change in the values of the other variable. This is why we commonly say that correlation does not imply causation.
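A minimal sketch of computing a correlation coefficient in Python (the numbers are made up; the point is that the coefficient describes the size and direction of the relationship, not cause and effect):

```python
import numpy as np

# Hypothetical observations of two variables that tend to move together.
ice_cream_sales = [12, 18, 25, 30, 36, 41]
sunburn_cases = [3, 5, 8, 9, 12, 14]

r = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]
print(round(r, 2))   # close to +1: a strong positive correlation...
# ...but neither variable causes the other (a hot sunny day plausibly drives both).
```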
2.4.2 Causation
Causation indicates that one event is the result of the occurrence of the other event (often called cause and
effect). For example: rain clouds cause rain, overeating causes weight gain!
Seeing two variables moving together does not necessarily mean we know whether one variable causes the
other to occur.
Therefore, correlation does not imply causation.
2.5 Hypothesis testing
2.5.1 Understanding hypothesis testing
Hypothesis: A premise or claim that we want to test
Testing a claim: doing some sampling, obtaining that information, and then testing the hypothesis.
Example
For example, from past records we know that the average salary of 2016 university graduates was 4,765 yuan with a standard deviation of 300 yuan. We now randomly sample 10,000 graduates from 2017 and find that their average salary is 4,912 yuan. We want to know whether the average salary of 2017 graduates is significantly different from that of 2016.
From the sample survey we know that the 2017 average salary is 4,912 yuan, 147 yuan higher than in 2016. The issue is that this 147-yuan difference could arise in two ways: first, the 2016 and 2017 average salaries may in fact not differ much, and the 147-yuan gap is simply caused by sampling error; or second, the 2016 and 2017 average salaries may genuinely differ, i.e. because of economic growth the 2017 average salary really did increase.
In fact, the core of hypothesis testing is judging whether this difference can be explained by the randomness of sampling alone.
2.5.2 Distinguishing the null hypothesis from the alternative hypothesis
The null hypothesis, H0 is the commonly accepted fact.
Here "commonly accepted" means accepted without needing further thought; for example, when we test whether men's and women's heights differ worldwide, the common (null) position is that there is no difference.
H1 is the opposite of the null hypothesis and is called alternative hypothesis.
Researchers work to reject, nullify or disprove the null hypothesis. Researchers come up with an alternate hypothesis, one that they think explains a phenomenon, and then work to reject the null hypothesis.
Example:
• Null hypothesis, H0: No more than 50% of the population voted in the elections (i.e. p <= 50%)
  Alternative hypothesis, H1: More than 50% of the population voted in the elections (i.e. p > 50%)
• Null hypothesis, H0: The drug reduces cholesterol by 30% (i.e. p = 30%)
  Alternative hypothesis, H1: The drug does not reduce cholesterol by 30% (i.e. p ≠ 30%)
1. "The average grade of BCom students is 65%". Given this, which of the following is the alternative
hypothesis? The average grade of BCom students is not 65%
2. "The average height of a 8-year old is 120cm". What is the null hypothesis to match this statement? The
average height of a 8-year old is 120cm
3. "The average salary of a MCom graduate in Australia is 90,000 AUD". Please select the alternative
hypothesis. The average salary of a MCom graduate in Australia is not 90,000 AUD
4. "For a company to disclose environmental information, it must have made a profit". Select the statement
that is the null hypothesis. For a company to disclose environmental information, it must have made a
profit
One-tailed test vs two-tailed test:
Only when we are very confident about the direction of the effect should we use a one-tailed test; this occurs relatively rarely in practice.
2.5.3 Interpreting results from hypothesis testing
We can construct a test statistic related to the claim. If the statistic is very large (i.e. it exceeds a critical value), we conclude that the difference is not merely due to sampling error, so we can reject the null hypothesis and conclude that the two populations differ significantly.
The threshold for what counts as a small-probability event is called the significance level; usually we take 0.05, i.e. events with a probability of occurring below 0.05 are treated as small-probability events. Conversely, if the test does not reject the null hypothesis (we cannot reject it), that does not mean we fully accept the null hypothesis; it only means the evidence from the sample data is insufficient, so for now we do not reject it.
2.5.4 Steps in Hypothesis Testing
Step 1: State the assumption to be tested.
H0: Null Hypothesis
H1: Alternative Hypothesis
Step 2: Specify what level of inconsistency will lead to rejection of the hypothesis. This is
called a decision rule
Step 3: Collect data and calculate necessary statistics to test the hypothesis.
Test statistic for sample mean, σ known: z = (x̄ − μ0) / (σ / √n)
Test statistic for sample mean, σ unknown: t = (x̄ − μ0) / (s / √n), with n − 1 degrees of freedom
Step 4: Make a decision. Should the hypothesis be rejected or not?
For example, if the test statistic falls in the right-hand rejection region, we reject the null hypothesis H0: μ ≤ XX and conclude the alternative hypothesis H1: μ > XX.
Step 5: Take action based on the decision.
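Applying these steps to the graduate-salary example from section 2.5.1, a Python sketch (σ is treated as known, so the z statistic from Step 3 is used; 1.96 is the usual two-tailed critical value at the 0.05 significance level):

```python
import math

mu_0 = 4765     # 2016 average salary: the null-hypothesis value (H0: mu = 4765)
sigma = 300     # known population standard deviation
n = 10_000      # size of the 2017 sample
x_bar = 4912    # 2017 sample average

# Step 3: test statistic for the sample mean with sigma known.
z = (x_bar - mu_0) / (sigma / math.sqrt(n))
print(round(z, 1))   # 49.0 -- far beyond the critical value of 1.96

# Step 4: |z| > 1.96, so we reject H0 and conclude that the 2017 average
# salary differs significantly from the 2016 average.
```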