Date: 2024-01-04
FIT1043 Lecture 2
Introduction to Data Science
Mahsa Salehi
Faculty of Information Technology, Monash University
Semester 2, 2023
Discussion: Data Science Jobs
Data Science Job Market in Australia
►smaller (per capita) market compared to USA & UK, where giant industry players are making better use of Data Science
Job Advertisements:
►communication skills and domain expertise are rated highly
►different jobs require different toolset skills
►Week 3 pre-class activity
►see Adzuna’s CV upload page for an interesting application!
Our Standard Value Chain
Collection: getting the data
Engineering: storage and computational resources
Governance: overall management of data
Wrangling: data preprocessing, cleaning
Analysis: discovery (learning, visualisation, etc.)
Presentation: arguing that results are significant and useful
Operationalisation: putting the results to work
We will refer to this throughout the semester!
Tools for data science
Our Standard Value Chain
[Figure: the value chain stages listed above, annotated with the weeks that cover them. Week labels on the slide: Week 1 (Overview of data science), Weeks 2 & 8, Week 3, Week 4, Weeks 5-7, Weeks 9-10, Week 11, Week 12.]
Unit Schedule
Week  Activities                                        Assignments
1     Overview of data science
2     Introduction to Python for data science
3     Data visualisation and descriptive statistics
4     Data sources and data wrangling
5     Data analysis theory                              Assignment 1
6     Regression analysis
7     Classification and clustering
8     Introduction to R for data science
9     Characterising data and "big" data                Assignment 2
10    Big data processing
11    Issues in data management
12    Industry guest lecture                            Assignment 3
Outline
§ Introduction to Python for Data Science
§ Motivation for studying Python
§ Python data types
§ Essential libraries
Learning Outcomes (Week 2)
By the end of this week you should be able to:
►Comprehend the importance of Python as a data science tool
►Comprehend the essentials of coding in Python for data science
►Explain and interpret given Python code
►Comprehend the concept of a dataframe
►Work with data using data pre-processing commands such as aggregating
Introduction to Python for Data Science
From Python Data Science Handbook by J. VanderPlas
The 2021 Top Programming Languages
[Figure: ranking of top programming languages for 2021; a second copy of the slide carries the label 2018. image src: IEEE]
Most in-demand programming languages of 2022
Data Science Preferred Tools
►Python’s Role in Data Science
►There are many tools for data science.
►Python has gained popularity over the last few years.
►easy to learn
►flexible and multi-purpose
►great libraries
►well-designed computer language
►good visualization for basic analysis
image src: kdnuggets.com
Setting Up Python Environment
►Python 2.x vs 3.x
►IPython vs Jupyter Project
►IPython (Interactive Python) is a useful interactive interface to Python, and provides a number of useful syntactic additions to the language.
►Jupyter provides a browser-based notebook useful for development, collaboration and publication of results.
Anaconda Project
►All the Best Tools in One Platform
►Anaconda is a package manager, an environment manager, a Python/R data science distribution, and a collection of more than 1,500 open source packages. Anaconda is free and easy to install.
[Figure: a desktop graphical user interface (GUI) to use Anaconda]
FLUX Question
What is .ipynb?
A. An illegal file extension.
B. Interactive Python NoteBook.
C. Intelligent Python Nota Bene.
D. Typo, it should be ‘pinyin’
Python Basic Types
►Integers
►Floating-Point Numbers
►Boolean
►True/False
►Strings
Integers (int)
►Python interprets a sequence of decimal (base 10) digits without any prefix (0b, 0o or 0x) as a decimal number.
►With the prefix 0b, the digits are interpreted as a binary (base 2) number
>>> print(0b10)
2
►With the prefix 0o, the digits are interpreted as an octal (base 8) number (rarely used)
>>> print(0o10)
8
►With the prefix 0x, the digits are interpreted as a hexadecimal (base 16) number
>>> print(0x10)
16
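As a quick check of the prefixes above, the sketch below compares prefixed literals with their decimal values and converts in both directions using the built-in int(), bin(), oct() and hex(). The variable names are illustrative only.

```python
# Prefixed literals are different spellings of ordinary int values.
binary, octal, hexa = 0b10, 0o10, 0x10
print(binary, octal, hexa)        # 2 8 16

# int() parses a string in a given base.
parsed = int("ff", 16)
print(parsed)                     # 255

# bin(), oct() and hex() go the other way and return strings.
print(bin(2), oct(8), hex(16))    # 0b10 0o10 0x10
```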
Floating Point (float)
►The values are specified with a decimal point.
>>> 4.2
4.2
>>> type(4.2)
<class 'float'>
>>> 4.
4.0
►For scientific notation style, the character e followed by a positive or negative integer may be used.
>>> .4e7
4000000.0
>>> type(.4e7)
<class 'float'>
>>> 4.2e-4
0.00042
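One practical point about floats, shown here as a small sketch: they are binary approximations of decimal values, so exact equality tests can surprise you. The example uses math.isclose from the standard library; the variable names are illustrative.

```python
import math

small = 4.2e-4      # scientific notation: 4.2 * 10**-4
large = .4e7        # 0.4 * 10**7
print(small, large)               # 0.00042 4000000.0

# 0.1 and 0.2 have no exact binary representation, so the sum is slightly off.
total = 0.1 + 0.2
print(total == 0.3)               # False
print(math.isclose(total, 0.3))   # True: compare with a tolerance instead
```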
Boolean (bool)
►The bool type has been part of Python since version 2.3, so it exists in both Python 2 and Python 3. In Python 3, True and False are also reserved keywords.
►A Boolean type (in any language) has one of two values, True or False.
>>> type(True)
<class 'bool'>
>>> type(False)
<class 'bool'>
>>> print(True | False)
True
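The slide uses | on two bool values, which works, but for single True/False values the keywords and, or and not are the usual choice. A minimal sketch follows; the names is_adult and survived are made up for illustration.

```python
# Logical operators for single Boolean values.
print(True or False)     # True
print(True and False)    # False
print(not True)          # False

# Comparisons produce bool values.
is_adult = 19 >= 18
print(is_adult, type(is_adult))   # True <class 'bool'>

# bool is a subclass of int, so True counts as 1 when summed.
survived = [True, False, True]
print(sum(survived))     # 2
```

The symbols & and | come back later with pandas, where they combine whole columns of True/False values element by element.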
Strings (str)
►Strings are delimited using either single or double quotes.
►Only the characters between the opening delimiter and the matching closing delimiter are part of the string.
>>> print("I am a string.")
I am a string.
>>> type("I am a string.")
<class 'str'>
Strings (str)
►Handling strings can be a bit more complicated than we initially think.
►For example, if we want to include quotes, as in: you aren't simple
►Inside single quotes, the apostrophe closes the string early, so Python raises a SyntaxError.
>>> print('you aren't simple')
SyntaxError (the exact message depends on the Python version)
>>> print("you aren't simple")
you aren't simple
Strings (str)
►The earlier example covers only the basics of placing a sequence of characters between delimiters to form a string.
►Special characters in strings need further handling.
►Use \ (backslash) as the escape character.
>>> print('you aren\'t simple')
you aren't simple
►There are a few reserved special escape sequences:
►\t Tab
►\n New line
►\uxxxx 16-bit Unicode character (xxxx is four hexadecimal digits)
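The escape sequences listed above can be tried out as follows. This is only a sketch: the raw-string prefix r"" is not on the slide but is standard Python, and the Windows-style path is a made-up example.

```python
print('you aren\'t simple')        # escaped single quote inside single quotes
print("col1\tcol2\nrow1\trow2")    # \t inserts a tab, \n starts a new line
accented = "\u00e9"                # Unicode code point U+00E9
print(accented)                    # é

# In a raw string, backslashes are kept literally (no escapes are processed).
raw = r"C:\new\table"
print(raw)                         # C:\new\table
print(len("\n"), len(r"\n"))       # 1 2
```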
Dynamically Typed Language
In statically typed languages such as C, you must declare a variable and its type before using it:
int x;
In Python, there is no declaration. The type belongs to the value currently assigned to the variable and is only known at run-time.
>>> x = 10
>>> print(type(x))
<class 'int'>
>>> x = ' Hello, world '
>>> print(type(x))
<class 'str'>
Built-in Functions
►There are more than 65 built-in functions in the current
Python version. These functions cover
►Maths
►Type Conversions
►Iterators
►Composite Data Types
►Classes, Attributes, and Inheritance
►Input/Output
►Variables, References, and Scope
►Others
►You can refer to them here
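A few of the built-in functions from the maths, type conversion, iterator and composite data type groups are sketched below. The ages list holds made-up values.

```python
ages = [22, 38, 26, 34]    # hypothetical values

# Maths and composite data types
print(len(ages), min(ages), max(ages), sum(ages))   # 4 22 38 120
print(sum(ages) / len(ages))                        # 30.0
print(abs(-7.25))                                   # 7.25

# Type conversions
print(int("42") + 1, float("7.925"), str(19) + " years")   # 43 7.925 19 years

# Iterators
print(sorted(ages, reverse=True))                   # [38, 34, 26, 22]
for position, age in enumerate(ages):
    print(position, age)
```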
Operators and String Manipulation
►Arithmetic operators
+, -, *, /, % etc.
►Comparison operators
>, <, <=, >=, !=, ==
►String operators
+, *, in
>>> s = 'foobar'
>>> s[0]
'f'
>>> s[3]
'b'
>>> len(s)
6
>>> s[len(s)-1]
'r'
>>> s[-1]
'r'
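The three operator groups above can be exercised in a few lines. This is a sketch; the operators // and ** stand in for the "etc." in the arithmetic list.

```python
# String operators
s = 'foo' + 'bar'          # concatenation
print(s)                   # foobar
print('-' * 10)            # repetition: ----------
print('oba' in s)          # membership: True
print('xyz' in s)          # False

# Arithmetic operators
print(7 / 2, 7 // 2, 7 % 2, 2 ** 3)    # 3.5 3 1 8

# Comparison operators (they can be chained)
print(1 < 2 <= 2, 'a' != 'b')          # True True
```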
Strings (useful for Data Science)
►String subset
>>> s = 'foobar'
>>> s[2:5]
'oba'
>>> s[0:4]
'foob'
>>> s[2:]
'obar'
>>> s[:4] + s[4:]
'foobar'
>>> s[:4] + s[4:] == s
True
►Striding
>>> s = 'foobar'
>>> s[0:6:2]
'foa'
>>> s[1:6:2]
'obr'
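Two properties of slices are worth checking by hand: the stop index is excluded, so s[start:stop] has stop - start characters, and negative indices and steps count from the end. The sketch below also applies slicing to one of the passenger names used later in these slides; the variable names are illustrative.

```python
s = 'foobar'
start, stop = 2, 5
piece = s[start:stop]
print(piece, len(piece) == stop - start)   # oba True

print(s[-3:])      # last three characters: bar
print(s[::-1])     # step of -1 reverses the string: raboof

# Slicing is handy for pulling fields out of text data.
name = 'Braund, Mr. Owen Harris'
comma = name.index(',')      # position of the first comma
surname = name[:comma]
print(surname)               # Braund
```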
More Python Data Types
Lists and tuples are useful Python data types.
►A Python list is a collection of objects (not necessarily of the same type).
►Lists are defined by square brackets ([]) that enclose a comma-separated sequence of objects.
>>> a = ['foo', 'bar', 'baz', 'qux']
>>> print(a)
['foo', 'bar', 'baz', 'qux']
►Lists are ordered.
►Lists can contain any arbitrary objects.
►List elements can be accessed by index.
►Lists can be nested to arbitrary depth.
►Lists are mutable.
►Lists are dynamic.
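Each of the list properties above can be seen in a short sketch. The list mixed is a made-up example.

```python
a = ['foo', 'bar', 'baz', 'qux']

# Ordered and indexable
print(a[0], a[-1])                              # foo qux
print(a == ['bar', 'foo', 'baz', 'qux'])        # False: same items, different order

# Arbitrary objects, nested to any depth
mixed = [1, 'two', 3.0, [4, 5]]
print(mixed[3][1])                              # 5

# Mutable (items can be replaced) and dynamic (the list can grow and shrink)
a[1] = 'BAR'
a.append('quux')
del a[0]
print(a)                                        # ['BAR', 'baz', 'qux', 'quux']
```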
More Python Data Types
Tuple
►Tuples behave like lists in most respects, except that their contents are immutable (fixed).
►Tuples are defined by round brackets (parentheses) that enclose a comma-separated sequence of objects.
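The sketch below shows what immutability means in practice: reading from a tuple works like a list, but assigning to an element raises a TypeError. The unpacking line and the one-element tuple are extra details that are not on the slide.

```python
t = ('foo', 'bar', 'baz')
print(t[0], len(t))          # foo 3

# Tuples cannot be changed after creation.
try:
    t[0] = 'changed'
except TypeError as err:
    print('TypeError:', err)

# Unpacking assigns each element to its own variable.
fname, lname, age = ('Ian', 'Tan', 19)
print(fname, age)            # Ian 19

# A one-element tuple needs a trailing comma; ('foo') is just a string.
single = ('foo',)
print(type(single), type(('foo')))   # <class 'tuple'> <class 'str'>
```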
Dictionary
►A dictionary is similar to a list in that it is a collection of objects.
►The main difference is that list items are indexed by their position, whereas dictionary items are indexed by a key.
►Think of each item as a key-value pair.
►This maps nicely to Data Science when there is access to NoSQL databases that store items in key-value pairs.
Dictionary
d = dict([
(<key>, <value>),
(<key>, <value>),
.
.
.
(<key>, <value>)
])
>>> person = {}
>>> person['fname'] = 'Ian'
>>> person['lname'] = 'Tan'
>>> person['age'] = 19
>>> person['pets'] = {'dog': 'Barney', 'cat': 'Dino'}
>>> person
{'fname': 'Ian', 'lname': 'Tan', 'age': 19, 'pets': {'dog': 'Barney', 'cat': 'Dino'}}
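Once a dictionary like person exists, values are read and updated by key. The sketch below rebuilds the same dictionary as a literal; the 'email' key is a hypothetical missing key used to show get().

```python
person = {'fname': 'Ian', 'lname': 'Tan', 'age': 19,
          'pets': {'dog': 'Barney', 'cat': 'Dino'}}

# Look up by key, including inside a nested dictionary.
print(person['fname'])               # Ian
print(person['pets']['dog'])         # Barney

# get() returns a default instead of raising KeyError for a missing key.
print(person.get('email', 'n/a'))    # n/a

# Update a value and test for a key.
person['age'] += 1
print(person['age'], 'fname' in person)   # 20 True

# Iterate over key-value pairs.
for key, value in person.items():
    print(key, '->', value)
```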
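Controls
Conditions
if <condition>:
    <statement(s)>
elif <condition>:
    <statement(s)>
elif <condition>:
    <statement(s)>
else:
    <statement(s)>
Iterations
while <condition>:
    <statement(s)>
Python for loops link
Note: Python uses indentation!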
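A concrete instance of the templates above, as a sketch only: the function age_group and its thresholds are invented for illustration. Note how indentation alone marks which statements belong to each branch or loop.

```python
def age_group(age):
    """Return a coarse label for an age (thresholds are arbitrary)."""
    if age < 13:
        return 'child'
    elif age < 18:
        return 'teen'
    else:
        return 'adult'

# for loop: iterate directly over the items of a list
labels = []
for age in [4, 15, 38]:
    labels.append(age_group(age))
print(labels)          # ['child', 'teen', 'adult']

# while loop: repeat until the condition becomes False
count = 3
while count > 0:
    print(count)
    count -= 1
```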
Essential Python and Data Science
Specific libraries that are considered the “starter pack” for Data Science:
►NumPy: scientific computing, support for multi-dimensional arrays
►pandas: data structures as well as operations for manipulating numerical tables
►Matplotlib: library for visualization
►scikit-learn: Python machine learning library that provides the tools for data mining and data analysis
For some tasks, you may also want to look at
►NLTK: Natural Language ToolKit to work with human language data
Loading Libraries
The general syntax to include a library:
>>> import numpy as np
>>> import pandas as pd
>>> from matplotlib import pyplot as plt
>>> import matplotlib.pyplot as plt  # equivalent to the previous line
Let’s Start!
►Data Science needs DATA
►Reading data
►Writing data
►We can read data from different sources
►Flat files
►CSV files
►Excel files
►Image files
►Relational databases
►NoSQL databases
►Web
Reading from CSV
►Python has a built-in CSV reader, but for Data Science purposes we will use the pandas library.
►Assuming your file name is filename.csv
>>> import pandas as pd
>>> data = pd.read_csv("filename.csv")
>>> data.head()
>>> X = data[["Age"]]
>>> print(X)
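To try the commands above without a file on disk, the sketch below feeds pd.read_csv a small in-memory CSV through io.StringIO. The three rows are made up; only the column names follow the slides.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for filename.csv
csv_text = """Name,Age,Fare
Passenger A,22,7.25
Passenger B,38,71.28
Passenger C,26,7.925
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.head())        # first rows (up to five)
print(data.shape)         # (3, 3): rows, columns

# Double brackets give a one-column DataFrame, single brackets a Series.
X = data[["Age"]]
ages = data["Age"]
print(type(X).__name__, type(ages).__name__)   # DataFrame Series
```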
Usual 1st Step upon Obtaining Data
►A description or a summary of it.
►Sometimes referred to as a five-number summary if the data is numeric.
►Minimum, maximum, median, 1st quartile, 3rd quartile
►Work with pandas DataFrames.
>>> df = pd.DataFrame(data)
>>> print(df)
>>> df.describe()
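The five-number summary can be read straight out of describe(). In this sketch the ages are made-up values chosen so the quartiles are easy to verify by hand.

```python
import pandas as pd

ages = pd.Series([2, 22, 26, 35, 38], name="Age")   # hypothetical values

summary = ages.describe()   # count, mean, std, min, 25%, 50%, 75%, max
five_number = summary.loc[["min", "25%", "50%", "75%", "max"]]
print(five_number)

# The 50% row is the median.
print(ages.median())        # 26.0
```

describe() also reports count, mean and standard deviation, so it gives slightly more than the five numbers.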
Working with DataFrames (Basic)
►Select a column by using its column name:
>>> df['Name']
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
►Select multiple columns using a list of column names:
>>> df[['Name', 'Survived']]
Name Survived
0 Braund, Mr. Owen Harris 0
1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 1
►Select a value using the column name and row index:
>>> df['Name'][3]
'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
Working with DataFrames (Basic)
►Select a particular row from the table:
>>> df.loc[2]
PassengerId 3
Survived 1
Pclass 3
Name Heikkinen, Miss. Laina
Sex female
Age 26
SibSp 0
Parch 0
Ticket STON/O2. 3101282
Fare 7.925
Cabin NaN
Embarked S
Name: 2, dtype: object
►Select all rows with a particular value in one of the columns:
>>> df.loc[df['Age'] <= 6]
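The row selection above relies on a Boolean mask: the comparison produces one True/False per row, and loc keeps the rows marked True. A minimal sketch on a made-up four-row table (column names follow the slides, values do not):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "Age": [22, 4, 26, 35],
    "Survived": [0, 1, 1, 1],
})

row = df.loc[2]                 # one row, returned as a Series
print(row["Name"], row["Age"])  # Passenger C 26

value = df.loc[3, "Name"]       # row label and column name in one step
print(value)                    # Passenger D

mask = df["Age"] <= 6           # Boolean Series, one entry per row
print(mask.tolist())            # [False, True, False, False]
young = df.loc[mask]
print(young["Name"].tolist())   # ['Passenger B']
```

df.loc[3, "Name"] reaches the same cell as df['Name'][3] but in a single indexing step.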
Save the Data
►Assuming you want to analyse only part of the data and save the resulting data frame to a CSV file:
>>> df2 = df.loc[df['Age'] >= 12]
>>> df2.to_csv('output.csv', index=False, header=True)
►We have now read, described, explored and saved the data.
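To see what ends up in the saved file, the sketch below calls to_csv without a path, which returns the CSV text as a string, and then reads it back. The three-row table is made up.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Passenger A", "Passenger B", "Passenger C"],
                   "Age": [22, 4, 35]})        # hypothetical values

df2 = df.loc[df["Age"] >= 12]

# With no file path, to_csv returns the CSV content as a string.
csv_text = df2.to_csv(index=False)   # index=False leaves out the row labels
print(csv_text)

# Reading the text back gives the filtered rows again.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Without index=False, the row labels (0 and 2 here) would be written as an extra unnamed first column.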
Working With Data
►Some basic data pre-processing steps are usually done, or at least taken into consideration.
►Categorical data
►Subsetting data
►Slicing
►Aggregating
►More will be explored in coming weeks
►Removing duplicates
►Dealing with dates
►Missing data
►Concatenating
►Transforming
Categorical Data
►A categorical variable takes its value from a limited, fixed set of options.
►A ticket class is generally categorical, i.e. 1st class, 2nd class & 3rd class.
>>> df.loc[df['Pclass'] == 1]
►We can create our own categories, e.g.
>>> import pandas as pd
>>> tix_class = pd.Series(['1st','2nd','3rd'], dtype='category')
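A short sketch of what the category dtype provides: the set of allowed values is stored once, and counts per category are easy to obtain. The ticket classes below are made up, and the mapping from numeric Pclass codes to labels is only an illustration.

```python
import pandas as pd

tix_class = pd.Series(['1st', '3rd', '3rd', '2nd'], dtype='category')

# The distinct categories are stored separately from the data.
print(tix_class.cat.categories.tolist())    # ['1st', '2nd', '3rd']

# Count how many rows fall into each category.
counts = tix_class.value_counts()
print(counts['3rd'])                        # 2

# Turning numeric codes (as in a Pclass column) into labelled categories.
pclass = pd.Series([1, 3, 3, 2])
labels = pclass.map({1: '1st', 2: '2nd', 3: '3rd'}).astype('category')
print(labels.tolist())                      # ['1st', '3rd', '3rd', '2nd']
```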
Subsetting Data
►We have actually already done this a few slides earlier :)
►Extract only those that survived
>>> df.loc[df['Survived'] == 1]
►What does the code below return?
>>> df.loc[(df['Sex'] == 'female') & (df['Survived'] == 1)]
Slicing Data
►Slice rows by row index.
>>> df[:5]
>>> df[3:10]
►If we only want certain columns, e.g. Age, Name, Sex, Survived
>>> df.loc[:, ['Age', 'Name', 'Sex', 'Survived']]
Aggregating
►Like our five-number summary, we can also obtain aggregated values for columns. The total fare can be easily obtained by
>>> df['Fare'].sum()
4385.095600000001
►Or we can get the average age of the passengers by
>>> df['Age'].mean()
28.141507936507935
►Check the answers against the df.describe() earlier
Aggregating
►As in SQL, we often want the aggregated values of one column for each distinct value of another column. We can use the groupby function:
>>> df.groupby('Sex')['Age'].mean()
Sex
female 24.468085
male 30.326962
Name: Age, dtype: float64
Aggregating
►What does the following mean?
>>> df.loc[df['Survived']==1].groupby('Sex')['Age'].mean()
Sex
female 26.265625
male 23.314444
Name: Age, dtype: float64
►Compare it with the previous statement. What can you tell from it?
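To work through the two groupby statements by hand, here is a sketch on a made-up four-row table where the group means are easy to verify. The numbers are not Titanic results.

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Age": [20, 30, 40, 50],
    "Survived": [1, 0, 1, 1],
})   # hypothetical values

# Mean age per sex over all rows
mean_age = df.groupby("Sex")["Age"].mean()
print(mean_age)     # female 30.0, male 40.0

# Filter first, then group: mean age per sex among survivors only
survivor_age = df.loc[df["Survived"] == 1].groupby("Sex")["Age"].mean()
print(survivor_age) # female 30.0, male 50.0

# Several aggregates at once
stats = df.groupby("Sex")["Age"].agg(["mean", "count"])
print(stats)
```

In this toy table the male mean rises from 40 to 50 after filtering because the 30-year-old did not survive, which is the kind of comparison the question above asks for.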
FLUX Question
What is a dataframe?
A. An array.
B. A list.
C. A theory about data.
D. A structure that stores tabular data
Learning Outcomes (Recap)
This week we learnt the following:
►Importance of Python as a data science tool
►Essentials of coding in Python for data science
►Explaining and interpreting given Python code
►The concept of a dataframe
►Working with data using data pre-processing commands such as aggregating
Next few weeks
►We will be using Python for the next few weeks
►Matplotlib
►scikit-learn
►You can easily find Python resources online, specifically on Python for Data Science.
►DataCamp offers an excellent online course.
Applied Session - Week 2
§ Introductory Python for data science
§ Make sure you participate in the applied session activities; they are very important for Assignments 1 & 2
§ Use the forum if you wish to swap tutorials
Suggested Reading
From Data Analytics Handbook read the interviews of
Abraham Cabangbang (2 pp)
Ben Bregman (2 pp)
Leon Rudyak (3 pp)