MAST30034 Applied Data Science Open Lecture
Self-introduction
Today
1. Course introduction
2. Environment setup
3. First look at PySpark
4. Project 1 walkthrough
Overview of the course
Project 1: Individual project in the form of report, 30%
Project 2: Group Industry project in the form of presentation, 60%
Quiz: 10% [5% + 5%]
Material
1. Tutorial Sheet
2. Documents/Stackoverflow
3. Google
4. Lecture
Skills and knowledge that you should know
Python, R
pandas, seaborn, sklearn
Github
Latex
Machine Learning Models, Statistical Models
New skills you will learn
Apache Spark 3.0 Framework (PySpark)
Geospatial Plotting
bash, git
Good News: NO EXAM
Course content summary
Course schedule
PySpark (wk1 - 2)
Project 1 Pipeline (wk2 - 4)
Report Writing (wk4)
Project 2 Pipeline (wk7 - 9)
Project 2 Presentation (tbc)
PySpark
PySpark supports all of Spark's features such as Spark SQL, DataFrames, Structured Streaming, Machine
Learning (MLlib) and Spark Core. Using PySpark we can run applications in parallel on a distributed
cluster (multiple nodes).
PySpark is widely used in the Data Science and Machine Learning community, since many popular data
science libraries, including NumPy and TensorFlow, are written in Python. It is also valued for its
efficient processing of large datasets. PySpark is used by organizations such as Walmart, Trivago,
Sanofi, Runtastic, and many more.
Advantages
fast processing speed
in-memory computation on large data
combines local and distributed data transformations
lazy evaluation
Basics
import warnings
warnings.filterwarnings("ignore")
An important property of Spark is that its data structures (RDDs and DataFrames) are immutable: transformations return new DataFrames rather than modifying existing ones
A new data format: Parquet
stored in columns (columnar format)
single data type per column
much better compression than CSV
faster to read and process
Read in data
from pathlib import Path
import pandas as pd

data_dir = Path('data/')
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)
Overview of Data
Project 1 Overview
1. Important information
- Due Date:
- Dataset:
- Final submission:
2. Assumptions:
Any tools, languages, or software may be used; Python and PySpark are recommended
Reports should follow the template provided by the teaching staff; if you use a different template, you must keep the required page margins and font size
The GitHub repository should contain a clear file structure and a README file
You may choose any timeline from 2016 onwards and any taxi type; the time span should be 6 months [PySpark] / 3 months [Pandas]
Any relevant external dataset may be used; strictly speaking, one must be used, otherwise marks are capped [22.5/30]
The more data and the longer the time span, the better; sampling is required when plotting, while modelling must use the full dataset
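The sampling point above (sample before plotting, model on the full data) can be sketched as follows; `full_df` and the `fare` column stand in for the real concatenated dataset:

```python
import numpy as np
import pandas as pd

# Stand-in for the full concatenated dataset (names are illustrative).
rng = np.random.default_rng(0)
full_df = pd.DataFrame({"fare": rng.uniform(2.5, 80.0, size=100_000)})

# Plotting: draw a reproducible 1% sample so figures render quickly.
plot_df = full_df.sample(frac=0.01, random_state=0)

# Modelling: fit on full_df, not plot_df.
print(len(plot_df))  # 1000
```

Fixing `random_state` keeps the sampled figure reproducible between runs of the notebook.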
3. Report format
At most 8 pages
No code in the report
Justify every subjective choice made in the project
Concisely summarise all data cleaning, modelling methods and plots; keep the code clear and readable
Provide a preliminary analysis and interpretation of the original dataset
Compare and analyse at least two models
Make practical, actionable recommendations based on your results
Place and reference all figures and tables appropriately
The report should be logically tight, concise, and focused on a clear thesis
4. External datasets
Commonly used external datasets: weather, sporting events, cultural events, high-traffic locations such as hospitals, airports and casinos, census data, per-capita income, housing prices, and so on; in principle, anything works as long as you can justify it
Quantity: 1-2 is recommended; 3 or more if you are confident in processing and analysing datasets
As the spec states, the deeper the analysis of external datasets and the more of them you draw on, the higher the mark ["the highest marks available for students who perform exceptional analysis by drawing upon several external resources"]
However: do not over-pursue the uniqueness or number of external datasets; even with a single ordinary external dataset, as long as the routine steps (data cleaning, modelling, etc.) are done rigorously and sensibly, your mark will not be low
5. General Tips