CIS 325: Programming for Business Analytics
Donghyuk Shin
Dept. of Information Systems
Pandas Summary and Descriptive Statistics
• It is important that you read Chapter 5, Section 5.3 of the textbook for this lecture
• The official Pandas documentation is your source of reference:
https://pandas.pydata.org/docs/
What will we cover?
Pandas Concepts                                              Series   DataFrame
Statistics (mean, standard deviation, correlation, etc.)     Yes      Yes
Unique values & value counts                                 Yes      Yes
Membership (is in)                                           Yes      Yes
Group by (and group-wise statistics)                         Yes      Yes
Apply (a function to rows, columns, or all values)           Yes      Yes
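As a quick orientation, the first three topics in the table can be sketched as follows. The sample values and variable names here are made up for illustration and are not taken from the lecture.

```python
import pandas as pd

# Hypothetical sample data for illustration only
s = pd.Series([3, 1, 3, 2, 3, 1])
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})

# Statistics: available on both Series and DataFrame
s_mean = s.mean()                  # mean of the Series
s_std = s.std()                    # sample standard deviation
col_means = df.mean()              # one mean per column
corr_xy = df["x"].corr(df["y"])    # correlation between two columns

# Unique values and value counts
uniq = s.unique()                  # distinct values, in order of appearance
counts = s.value_counts()          # frequency of each distinct value
n_unique = df.nunique()            # number of distinct values per column

# Membership: True where the value is in the given list
s_mask = s.isin([1, 2])
df_mask = df.isin([2, 4])
```

Group by and apply each get their own slides and exercises later in the deck.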
Pandas Series and DataFrame
[Figure: a Series has one axis (axis 0); a DataFrame has axis 0 (rows) and axis 1 (columns)]
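The axis labels matter most when calling a method that reduces along one direction. A minimal sketch, using a small made-up DataFrame:

```python
import pandas as pd

# Hypothetical 3x2 DataFrame for illustration only
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 runs down the rows, giving one result per column
col_sums = df.sum(axis=0)

# axis=1 runs across the columns, giving one result per row
row_sums = df.sum(axis=1)

print(col_sums)
print(row_sums)
```

A Series has only axis 0, so its methods do not need the argument.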
Group By
Group by “key”
Group By (built-in methods)
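The slide figures did not survive extraction, so here is a minimal sketch of grouping by a key column and applying a few built-in methods. The column names "key" and "value" and the data are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"key": ["A", "B", "A", "B", "C"],
                   "value": [1, 2, 3, 4, 5]})

# Split the rows into groups that share the same "key"
grouped = df.groupby("key")["value"]

# Built-in methods compute one result per group
sums = grouped.sum()
means = grouped.mean()
sizes = grouped.size()

# Several statistics at once
summary = grouped.agg(["min", "max"])

print(sums)
print(summary)
```

Each result is indexed by the distinct values of the key column.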
In Class Exercise 1
import numpy as np
import pandas as pd
s1 = pd.Series(['banana', 'orange', 'kiwi', 'apple', 'orange',
                'banana', 'orange', 'apple', 'banana', 'orange',
                'orange', 'kiwi', 'kiwi', 'kiwi'])
1. Calculate the frequency count of each unique value in s1 (that is, how many times each unique value appears in s1). Assign the result to a variable named val_cnt.
2. Get the most frequent value in s1 using val_cnt and assign it to a variable named top_val.
3. Get the least frequent value in s1 using val_cnt and assign it to a variable named bottom_val.
4. In s1, replace every value equal to top_val or bottom_val with NaN (np.nan).
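One possible approach to In Class Exercise 1 is sketched below. It is not the official solution; idxmax(), idxmin(), and mask() are one choice among several that would work.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['banana', 'orange', 'kiwi', 'apple', 'orange',
                'banana', 'orange', 'apple', 'banana', 'orange',
                'orange', 'kiwi', 'kiwi', 'kiwi'])

# 1. Frequency count of each unique value
val_cnt = s1.value_counts()

# 2. Label with the highest count
top_val = val_cnt.idxmax()

# 3. Label with the lowest count
bottom_val = val_cnt.idxmin()

# 4. Replace matching values with NaN; mask() swaps in np.nan where the condition is True
s1 = s1.mask(s1.isin([top_val, bottom_val]), np.nan)

print(val_cnt)
print(top_val, bottom_val)
print(s1)
```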
In Class Exercise 2
import pandas as pd
d = [['hamster', 'alligator', 'hamster', 'cat', 'snake', 'cat', 'hamster',
      'cat', 'cat', 'snake', 'hamster', 'hamster', 'cat', 'alligator'],
     [1, 9, 4, 13, 14, 10, 2, 4, 14, 7, 14, 2, 1, 7],
     [7, 13, 8, 12, 11, 8, 10, 14, 9, 11, 10, 10, 9, 14],
     [8, 6, 9, 1, 8, 9, 5, 6, 6, 6, 5, 3, 4, 5],
     ['AZ', 'AZ', 'NY', 'WA', 'AZ', 'CA', 'NY', 'AZ', 'WA', 'WA', 'NY', 'AZ', 'AZ', 'WA']]
1. Create a Pandas DataFrame named df using d.
2. Change the column names of df to integers that range from 1 to the number of columns of df.
3. Change the index of df to names = ['animal','age','weight','length','state'].
4. Transpose df (= swap rows and columns) and assign it to a new variable named dft.
5. Check the dtypes of the columns of dft.
6. Change the type of the “age”, “weight” and “length” columns to integer type.
7. Compute the average “age”, “weight” and “length” by “animal” group for dft.
8. Compute the average “age”, “weight” and “length” by “animal” and “state” group for dft.
9. Compute the count of animals by “state” group for dft.
10. Generate descriptive statistics of “age” by “animal” group for dft.
11. Generate descriptive statistics of “state” by “animal” group for dft.
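One possible approach to In Class Exercise 2 is sketched below. It is not the official solution, and the result variable names (mean_by_animal, count_by_state, and so on) are made up for illustration.

```python
import pandas as pd

d = [['hamster', 'alligator', 'hamster', 'cat', 'snake', 'cat', 'hamster',
      'cat', 'cat', 'snake', 'hamster', 'hamster', 'cat', 'alligator'],
     [1, 9, 4, 13, 14, 10, 2, 4, 14, 7, 14, 2, 1, 7],
     [7, 13, 8, 12, 11, 8, 10, 14, 9, 11, 10, 10, 9, 14],
     [8, 6, 9, 1, 8, 9, 5, 6, 6, 6, 5, 3, 4, 5],
     ['AZ', 'AZ', 'NY', 'WA', 'AZ', 'CA', 'NY', 'AZ', 'WA', 'WA', 'NY', 'AZ', 'AZ', 'WA']]

# 1-3. Build the DataFrame, then relabel columns and index
df = pd.DataFrame(d)
df.columns = range(1, df.shape[1] + 1)
names = ['animal', 'age', 'weight', 'length', 'state']
df.index = names

# 4-5. Transpose and inspect the column types
dft = df.T
print(dft.dtypes)

# 6. Convert the numeric columns to integers
num_cols = ['age', 'weight', 'length']
dft = dft.astype({col: int for col in num_cols})

# 7-8. Group-wise averages
mean_by_animal = dft.groupby('animal')[num_cols].mean()
mean_by_animal_state = dft.groupby(['animal', 'state'])[num_cols].mean()

# 9. Number of animals per state
count_by_state = dft.groupby('state')['animal'].count()

# 10-11. Descriptive statistics per animal group
age_desc = dft.groupby('animal')['age'].describe()
state_desc = dft.groupby('animal')['state'].describe()

print(mean_by_animal)
print(count_by_state)
```

Note that describe() reports different statistics for a numeric column ("age") than for a text column ("state").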
In Class Exercise 3
import pandas as pd
import numpy as np
data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
        'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
        'score2': [16, 12, 14.5, np.nan, np.nan, 13, 15.5, 11, 12, 17],
        'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
1. Create a Pandas DataFrame named dfa using data as values and labels as index.
2. Using the apply() method, change the values of the “qualify” column of dfa to Boolean values. Specifically, change “yes” and “no” to True and False, respectively.
3. Create a new DataFrame named dfb with the rows of dfa where “qualify” is True. Also, drop the “name” and “qualify” columns from dfb.
4. Using the apply() method, compute the mean of each column of dfb. Then compute the mean of each row of dfb.
5. Compute the remainder (modulus operator) when dividing each number by 2 for all values in dfb.
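One possible approach to In Class Exercise 3 is sketched below. It is not the official solution, and the result variable names (col_means, row_means, remainders) are made up for illustration.

```python
import numpy as np
import pandas as pd

data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
        'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
        'score2': [16, 12, 14.5, np.nan, np.nan, 13, 15.5, 11, 12, 17],
        'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

# 1. Build the DataFrame with labels as the index
dfa = pd.DataFrame(data, index=labels)

# 2. Map 'yes' to True and 'no' to False, one value at a time
dfa['qualify'] = dfa['qualify'].apply(lambda v: v == 'yes')

# 3. Keep qualifying rows, then drop the non-numeric columns
dfb = dfa[dfa['qualify']].drop(columns=['name', 'qualify'])

# 4. apply() passes each column (axis=0) or each row (axis=1) to the function
col_means = dfb.apply(lambda col: col.mean(), axis=0)
row_means = dfb.apply(lambda row: row.mean(), axis=1)

# 5. The % operator works element-wise on the whole DataFrame
remainders = dfb % 2

print(col_means)
print(row_means)
print(remainders)
```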