Date: 2023-09-08

QBUS5010 Week 2
Pandas 1
Discipline of Business Analytics
The University of Sydney Business School
Required Reading
Pandas for everyone: Python data analysis - Chen, Daniel Y
https://sydney.primo.exlibrisgroup.com/permalink/61USYD_INST/2rsddf/cdi_askewsholts_vlebooks_9780134547060
- Section 1
- Section 2
Data Wrangling - Python and Pandas
pandas is a fast, powerful, flexible and easy to use open
source data analysis and manipulation tool, built on top
of the Python programming language.
Data Wrangling - Why not Excel?
The design of Excel and other spreadsheet software is problematic.
- no error control and difficult to test/verify
- poor support for version control
- limited support for use in a data pipeline
- almost impossible to integrate into production
- numerical issues
- cannot scale to large problems
- poor support for operations such as filtering
- proprietary
Using Pandas
Documentation
Pandas has excellent documentation.
The documentation should be your first port of call when
searching for a particular function or for examples of how to use
it.
Loading Data
Pandas can read from many data sources, including plain-text and
binary files such as CSV and Excel files [1].
For example, to read the PYTHON-USD.csv file:
import pandas as pd
data = pd.read_csv("PYTHON-USD.csv")
[1] https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
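As a minimal sketch of the call above, the snippet below reads CSV text from an in-memory buffer so that it runs without the PYTHON-USD.csv file. The column names (Date, Open, Close) and the prices are made-up sample values, assumed for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the contents of PYTHON-USD.csv.
csv_text = """Date,Open,Close
2023-01-02,10.0,10.5
2023-01-03,10.5,10.2
2023-01-04,10.2,11.0
"""

# pd.read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)          # (number of rows, number of columns)
print(list(data.columns))  # column names taken from the header line
```

With a real file on disk, the buffer would simply be replaced by the file path, as in the line above.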
DataFrame
The pandas library uses the DataFrame class to represent tabular
data.
For example, the pd.read_csv function returns a DataFrame
object.
A DataFrame consists of:
- a set of named columns, with potentially different types
- an index, which assigns a label to each row (unique by default)
- values, the elements in the table
This format conforms to the tidy principle!
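The three parts of a DataFrame can be inspected directly. The sketch below builds a small table by hand; the column names and values are hypothetical and only serve to show where the columns, index and values live.

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
data = pd.DataFrame(
    {
        "Date": ["2023-01-02", "2023-01-03", "2023-01-04"],
        "Open": [10.0, 10.5, 10.2],
        "Volume": [100, 150, 120],
    }
)

print(data.columns)  # the named columns
print(data.dtypes)   # each column carries its own type
print(data.index)    # the default index numbers the rows 0, 1, 2
print(data.values)   # the elements of the table as a NumPy array
```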
DataFrame - Summaries
Quick summaries of the DataFrame contents can be displayed
using the info and describe methods.
The describe method returns a DataFrame!
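A short sketch of both summaries, using a small hand-made table (the prices are assumed sample values). Because describe returns a DataFrame, individual statistics can be looked up like any other table entry.

```python
import pandas as pd

# Hypothetical sample prices, made up for illustration.
data = pd.DataFrame({"Open": [10.0, 10.5, 10.2], "Close": [10.5, 10.2, 11.0]})

# info prints the column names, non-null counts and types.
data.info()

# describe computes count, mean, std, min, quartiles and max
# for each numeric column and returns them as a DataFrame.
summary = data.describe()
print(type(summary).__name__)
print(summary.loc["mean", "Open"])  # a single statistic from the summary
```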
DataFrame - Selecting Data
Often we are only interested in a subset of columns.
To extract a single column by name:
column = data["Open"]
To extract multiple columns by name:
columns = data[ ["Open", "Close"] ]
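The difference between the two forms shows up in the type of the result. In this sketch the table and its values are hypothetical; a single name gives back one column, while a list of names gives back a smaller table.

```python
import pandas as pd

# Hypothetical sample prices, made up for illustration.
data = pd.DataFrame(
    {
        "Open": [10.0, 10.5, 10.2],
        "High": [10.8, 10.6, 11.2],
        "Close": [10.5, 10.2, 11.0],
    }
)

column = data["Open"]              # a single name selects one column
columns = data[["Open", "Close"]]  # a list of names selects several

print(type(column).__name__)
print(type(columns).__name__)
print(list(columns.columns))
```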
DataFrame - Series
Pandas uses another class to represent columnar data: Series.
Therefore the type of variable column is a Series.
column = data["Open"]
The methods available on a Series differ from those on a DataFrame,
although there is a large amount of overlap.
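To illustrate the overlap, the sketch below calls mean on both classes and then uses to_frame, which is one example of a method that exists on a Series but not on a DataFrame. The data is a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample prices, made up for illustration.
data = pd.DataFrame({"Open": [10.0, 10.5, 10.2], "Close": [10.5, 10.2, 11.0]})
column = data["Open"]

print(isinstance(column, pd.Series))

# Shared method: mean gives one number for a Series,
# and one number per column for a DataFrame.
series_mean = column.mean()
frame_means = data.mean()
print(series_mean)
print(frame_means)

# Series-only method: to_frame turns the column back into a one-column table.
as_frame = column.to_frame()
print(as_frame.shape)
```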
DataFrame - Selecting Data
Columns and rows can be selected by their integer position in the
table using .iloc.
The syntax of .iloc is:
DataFrame.iloc[ROWS, COLS]
where ROWS are the row indices, and COLS are the column
indices.
The .iloc indexer behaves much like indexing a list: positions
start at 0 and the end of a slice is excluded!
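The similarity to list indexing can be checked side by side. This sketch puts the same hypothetical values in a plain list and in a one-column DataFrame, then applies the same positions to both.

```python
import pandas as pd

# Hypothetical sample prices, made up for illustration.
values = [10.0, 10.5, 10.2, 11.0]
data = pd.DataFrame({"Open": values})

# Positions start at 0 in both cases.
print(values[0], data.iloc[0, 0])

# Slices exclude the end position in both cases.
list_slice = values[1:3]
iloc_slice = data.iloc[1:3, 0].tolist()
print(list_slice, iloc_slice)

# Negative positions count from the end in both cases.
print(values[-1], data.iloc[-1, 0])
```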
DataFrame - Selecting Data
To select the first two columns:
columns = data.iloc[:, 0:2]
To select the third and fourth rows:
rows = data.iloc[2:4, :]
Here a : by itself means “all”. The result of both operations is
another DataFrame!
DataFrame - Selecting Data
You can select both rows and columns at the same time.
To select the third and fourth rows and only the first two columns:
subset = data.iloc[2:4, 0:2]
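The .iloc selections from these slides can be tried on a small table. The five rows and four columns below are hypothetical sample values; printing the shapes confirms how many rows and columns each selection keeps.

```python
import pandas as pd

# Hypothetical sample prices: five rows, four columns.
data = pd.DataFrame(
    {
        "Open": [10.0, 10.5, 10.2, 11.0, 10.8],
        "High": [10.8, 10.6, 11.2, 11.3, 11.0],
        "Low": [9.9, 10.1, 10.0, 10.7, 10.5],
        "Close": [10.5, 10.2, 11.0, 10.8, 10.9],
    }
)

first_two_columns = data.iloc[:, 0:2]  # all rows, columns at positions 0 and 1
rows = data.iloc[2:4, :]               # rows at positions 2 and 3, all columns
subset = data.iloc[2:4, 0:2]           # both restrictions at once

print(first_two_columns.shape)
print(rows.shape)
print(subset.shape)
print(subset)
```

Each result is itself a DataFrame and keeps the original row labels, so the subset still shows index values 2 and 3.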
DataFrame - Create or Modify Columns
To insert a new column or modify an existing one, refer to it by
name and assign a new Series.
For example, to calculate the difference between the Open and Close
prices:
data['Difference'] = data['Close'] - data['Open']
For example, to multiply the closing price by 2 and add a
constant:
data['CloseScaled'] = data['Close']*2 + 1
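A minimal sketch of both cases on hypothetical sample prices: the first two assignments insert new columns, and the last one overwrites an existing column by assigning to a name that is already present.

```python
import pandas as pd

# Hypothetical sample prices, made up for illustration.
data = pd.DataFrame({"Open": [10.0, 10.5, 10.2], "Close": [10.5, 10.2, 11.0]})

# New columns: the arithmetic is applied row by row.
data["Difference"] = data["Close"] - data["Open"]
data["CloseScaled"] = data["Close"] * 2 + 1

# Existing column: assigning to the same name replaces its contents.
data["Difference"] = data["Difference"].abs()

print(list(data.columns))
print(data)
```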
DataFrame - Missing Data
When data is missing, pandas represents it as a NaN
value.
The dropna method provides various ways to remove records or
columns containing missing data.
DataFrame - Documentation
DROPNA DOCUMENTATION DEMO
DataFrame - Missing Data
For example, to remove all records containing at least one missing
value:
data_cleaned = data.dropna()
For example, to remove all columns in which every value is missing:
data_cleaned = data.dropna(axis='columns', how='all')
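The two dropna calls can be compared on a small table with deliberate gaps. The data below is hypothetical: one price is missing and the Notes column is entirely empty, so the order of the two operations matters.

```python
import pandas as pd

nan = float("nan")

# Hypothetical sample data with one missing price and one empty column.
data = pd.DataFrame(
    {
        "Open": [10.0, nan, 10.2],
        "Close": [10.5, 10.2, 11.0],
        "Notes": [nan, nan, nan],
    }
)

print(data.isna().sum())  # number of missing values per column

# Dropping rows first removes every row, because Notes is missing everywhere.
all_rows_dropped = data.dropna()
print(all_rows_dropped.shape)

# Dropping the all-missing column first keeps the rows that are otherwise complete.
no_empty_columns = data.dropna(axis="columns", how="all")
complete_rows = no_empty_columns.dropna()
print(complete_rows)
```

dropna returns a new DataFrame and leaves the original unchanged, which is why the results are assigned to new names here.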