BIA B450F
Unit 3
Data staging and
data mining
Dr Franklin Lam
1
Data staging
• Data extracted from disparate sources needs to be changed, converted, and made ready in a
format that is suitable to be stored for querying and analysis.
• Three major functions need to be performed to get the data ready:
– Extract: get the data from source system as efficiently as possible
– Transform: changing the source data into the exact formats and structures appropriate for storage, and
performing calculation on data
– Load: load the data into the data warehouse
• Extract, transform and load (ETL) functions take place in the staging area of ’The Back Room’
of DW.
• Data staging area provides a place and an area with a set of functions to clean, change,
combine, convert, de-duplicate, and prepare source data for storage and use in the data
warehouse.
2
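To make the three functions concrete, here is a minimal, illustrative Python sketch of one ETL pass. The file name sales.csv, the column names, the table name sales_fact, and the use of SQLite as a stand-in warehouse are all assumptions for illustration, not part of the course material.

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a source extract file as dictionaries.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: convert types, standardize codes, and derive a new column.
    out = []
    for r in rows:
        out.append({
            "store_id": r["store_id"].strip().upper(),
            "product_id": r["product_id"].strip(),
            "units": int(r["units"]),
            "dollars": round(float(r["unit_price"]) * int(r["units"]), 2),
        })
    return out

def load(rows, db_path="warehouse.db"):
    # Load: insert the prepared rows into the warehouse fact table.
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                   (store_id TEXT, product_id TEXT, units INTEGER, dollars REAL)""")
    con.executemany(
        "INSERT INTO sales_fact VALUES (:store_id, :product_id, :units, :dollars)",
        rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales.csv")))
```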
Why is ETL important?
• The information contained in a warehouse flows from the same operational systems
that could not be directly used to provide strategic information.
• ETL functions reshape the relevant data from the source systems into useful
information to be stored in the data warehouse. Without these functions, there
would be no strategic information in the data warehouse.
• If the source data is not extracted correctly, cleansed, and integrated in the proper
formats, query processing and delivery of business intelligence, the backbone of the
data warehouse, cannot happen.
• Transformed data is still not useful to the end-users until it is moved to the data
warehouse repository.
• Overall, each of the ETL functions fulfills a significant purpose that must be
successfully performed in data warehouse development.
3
An ETL example
• A manager of a company wants to compare and analyze sales by store, by product, and by
month. The sales figures are available in the several sales applications and product details are
captured in a product master file. Further, each sales transaction refers to a specific store. All
these are pieces of data in the source operational systems.
• For doing the analysis, information about the sales must be provided in the data warehouse
database with the sales units and dollars in a fact table, the products in a product dimension
table, the stores in a store dimension table, and months in a time dimension table. How do
you do this?
• Specifically, you have to extract the data from each of the operational systems, reconcile the
variations in data representations among the source systems, and transform all the sales of
all the products. Then load the sales into the fact and dimension tables.
• Now, after completion of these three functions, the extracted data is sitting in the data
warehouse, transformed into strategic information, ready for delivery as business intelligence
for analysis. Notice that it is important for each function to be performed, and performed in
sequence.
4
Why is ETL challenging?
• ETL functions are challenging primarily because of the nature of the source systems. Most of the
challenges in ETL arise from the disparities among the source operational systems.
• Reasons for the types of difficulties in ETL functions:
– Source systems are very diverse and disparate running on multiple platforms and different operating systems.
– Many source systems are older legacy applications running on obsolete database technologies.
– Historical data on changes in values are not preserved in source operational systems.
– Quality of data is dubious in many old source systems that have evolved over time.
– Source system structures keep changing over time because of new business conditions.
– Gross lack of consistency among source systems is prevalent. Same data is likely to be represented differently
in the various source systems. For example, data on salary may be represented as monthly salary, weekly
salary, and bimonthly salary in different source payroll systems.
– Even when inconsistent data is detected among disparate source systems, lack of a means for resolving
mismatches escalates the problem of inconsistency.
– Most source systems do not represent data in types or formats that are meaningful to the users. Many
representations are cryptic and ambiguous.
5
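For example, the salary inconsistency above can only be resolved by agreeing on one conformed representation and converting each source to it. A minimal sketch, assuming annual salary is the target and that "bimonthly" means twice a month; the frequency labels and amounts are hypothetical.

```python
# Conversion factors from each (assumed) source pay frequency to an annual figure.
PERIODS_PER_YEAR = {"monthly": 12, "bimonthly": 24, "weekly": 52}  # bimonthly assumed = twice a month

def to_annual_salary(amount, frequency):
    """Convert a salary amount from a source-specific pay frequency to annual."""
    try:
        return amount * PERIODS_PER_YEAR[frequency]
    except KeyError:
        raise ValueError(f"Unknown pay frequency from source system: {frequency!r}")

print(to_annual_salary(5000, "monthly"))    # 60000
print(to_annual_salary(2500, "bimonthly"))  # 60000
```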
Time consuming and arduous
• ETL functions design is time consuming and arduous.
• Data extraction itself can be quite involved depending on the nature and complexity
of the source systems. The metadata on the source systems must contain information
on every database and every data structure that are needed from the source
systems.
• Activities within the data transformation function can run the gamut of
transformation methods, including reformatting internal data structures, re-sequencing
data, applying various forms of conversion techniques, supplying default values wherever
values are missing, and designing the whole set of aggregates needed for
performance improvement.
• The initial load can populate millions of rows in the data warehouse database.
Creating and managing load images for such large volumes are
not easy tasks. Sometimes, it may take two or more weeks to complete the initial
physical loading.
6
ETL requirements
• Business needs
– Business needs are the DW/BI system users’ information requirements that drive the choice of
data sources and their subsequent transformation in the ETL system; the ETL team must
understand and carefully examine these needs.
– List the key performance indicators (KPIs) uncovered during the business requirements definition
that the project intends to support.
• Compliance
– DW/BI systems in regulated businesses, such as telecommunications, have complied with
regulatory reporting requirements for years
– List those data inputs and data transformation steps for which the company must maintain the
“chain of custody” showing and proving that final reports were derived from the original data
delivered from the data sources.
7
ETL requirements
• Data quality
– List those data elements whose quality is known to be unacceptable.
– List whether an agreement has been reached with the source systems to correct the data before
extraction.
– List those data elements discovered during data profiling, which will be continuously monitored
and flagged as part of the ETL process.
• Security
– Security awareness has increased significantly in the last few years across IT but often remains an
afterthought and an unwelcome burden to most DW/BI teams.
– Expand the compliance checklist to encompass known security and privacy requirements.
8
ETL requirements
• Data integration
– Data integration aims to make all systems seamlessly work together.
– Data integration usually takes the form of conforming dimensions and conforming facts in the data
warehouse.
– Use the bus matrix of business processes to generate a priority list for conforming dimensions
(columns of the bus matrix).
• Data latency
– Data latency describes how quickly source system data must be delivered to the business users via
the DW/BI system.
– List all legitimate and well-vetted business demands for data that must be provided on a daily
basis, on a many times per day basis, within a few seconds, or instantaneously.
9
ETL requirements
• Archiving and lineage
– Archiving and lineage requirements are driven by compliance and security needs.
– Even without the legal requirements for saving data, every data warehouse needs various copies
of old data, either for comparisons with new data to generate change capture records or
reprocessing.
– List the data sources and intermediate data steps that will be archived, together with retention
policies, and compliance, security, and privacy constraints.
• BI delivery interfaces
– List all fact and dimension tables that will be directly exposed to BI tools.
– List all OLAP cubes and special database structures required by BI tools.
– List all known indexes and aggregations to build to support BI performance.
10
ETL requirements
• Available skills
– Some ETL system design decisions must be made on the basis of available resources to build and
manage the system
– Inventory the department’s operating system, ETL tool, scripting language, programming language,
SQL, DBMS, and OLAP skills.
– List those skills required to support current systems and likely future systems.
• Legacy licenses
– List legacy operating system, ETL tool, scripting language, programming language, SQL, DBMS, and
OLAP licenses and whether their exclusive use is mandated or merely recommended.
11
Source systems
• In any ETL process there are one or more source systems and a number of target
systems. The source systems can be as diverse as the range of operational data stores
that typically exist in an organization. Source system data models cover the gamut of
alternatives – hierarchical, fully denormalized flat file, and third normal form as
dictated by the source system.
• For example, in a retail operation, there may be any or all of the following:
– A proprietary Point of Sale (PoS) system that generates large flat text files in CSV format over the
course of a day. The PoS will be the major source of sales facts for the retail sales fact table.
– A product database — a fully normalized relational structure in Oracle DBMS containing most of
the data required for the product dimension table.
– A supplier file — built in Excel and containing all the details required for the supplier field in the
product dimension.
– A major customer database — stored in part in an Access database at headquarters and in more
detail on various sales representatives’ computers.
– A promotions database — another Access database, normalized in parts, designed and built years
ago by someone in marketing to record promotional activities.
12
The data staging area
• Data extracted from source systems needs to be copied to a location where it can be
manipulated without negatively affecting any operational processes.
• There is no set structure to the data in the staging area. It very much depends on the original
format of the data, volume, the types of transformations required and the tools chosen to
perform these tasks.
• For the previous retail operation, the staging area will then contain a copy of the Excel file
that stores supplier data. It will also contain normalized relational structures for the storage
of product data.
• The data models can be designed for performance and ease of development. This means
they are likely to match a combination of source structures on the incoming side and
dimensional warehouse structures on the “finished goods” side, with the transformation
process managing the conversion.
• Third normal form or ER models often appear in the data staging area because the source
systems are duplicated, and become the starting point for the process, or they are built in
the staging area to help tease apart the real nature of the sources.
13
Data extraction
• Pulling the data from the source systems is probably the largest single effort in the
data warehouse project, especially if the source systems are decades-old,
mainframe-based, mystery-house-style systems.
• Most often, the challenge is determining what data to extract and what kinds of
filters to apply.
• Most extracts are based on the host platform and are designed to get the data out of
the source and into a simplified and easily accessible form.
14
Extraction services
• Multiple sources
– Need to deal with multiple platforms and versions.
• Code generation
– Generate actual code that runs in the source system to create the extract files or internal code
that is executed by the extract engine, which in turn will issue commands against the source
system.
• Multiple extract types
– Incremental loads
• Most data warehouses use incremental loads. This is typically based on a transaction date or some kind
of indicator flag in the source system.
– Transaction events
• The extract must be able to identify several different kinds of transactions because the transformation
and load processes will have to treat them differently.
– Full refresh
• Sometimes it is necessary to pull certain tables in their entirety rather than worry about figuring out
what has changed e.g. data sources where history could be changed by the transaction system without
leaving any identifying marks.
15
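A common way to implement the incremental-load style described above is a "watermark" based on a transaction date: each run extracts only rows changed since the last successful run. The sketch below is illustrative only; it assumes a SQLite source with a table named orders and an ISO-formatted last_updated column, both of which are assumptions.

```python
import sqlite3

def incremental_extract(source_db, watermark_file="last_extract.txt"):
    # Read the high-water mark left by the previous run (epoch start on first run).
    try:
        with open(watermark_file) as f:
            last_run = f.read().strip()
    except FileNotFoundError:
        last_run = "1970-01-01 00:00:00"

    con = sqlite3.connect(source_db)
    # ISO timestamps compare correctly as strings, so this pulls only new/changed rows.
    rows = con.execute(
        "SELECT order_id, store_id, amount, last_updated "
        "FROM orders WHERE last_updated > ? ORDER BY last_updated",
        (last_run,),
    ).fetchall()
    con.close()

    if rows:
        # Persist the new watermark only after the extract succeeds.
        with open(watermark_file, "w") as f:
            f.write(rows[-1][-1])
    return rows
```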
Extraction services
• Replication
– Continuously update a table in the staging database during the day, so it is up-to-date and
available to support the data staging process overnight.
– Replication is especially valuable in enterprise warehouse situations where multiple load
processes depend on access to updated versions of the conformed dimension tables.
– Bulk replication can mirror or copy the dimensions to multiple data staging areas.
• Compression/decompression
– Data compression/decompression can be an important capability in the extract stream if the
extraction process needs to transfer large amounts of data over a significant distance.
16
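As a simple illustration of the compression point, an extract file can be gzipped before transfer and decompressed once it reaches the staging area. The file names below are assumptions.

```python
import gzip
import shutil

def compress_extract(src="orders_extract.csv", dst="orders_extract.csv.gz"):
    # Compress a flat extract file before sending it over the network.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

def decompress_extract(src="orders_extract.csv.gz", dst="orders_extract.csv"):
    # Decompress the file once it arrives in the staging area.
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
```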
Data transformation
• In every system implementation, data conversion is an important function. If data
extraction for a data warehouse poses great challenges, data transformation presents
even greater challenges.
• Once the data is extracted from the source system, a range of unnatural acts are
performed on it to convert it into something presentable to the users and valuable to
the business.
• The data feed is not just an initial load. Any transformation tasks set up for the initial
load will be adapted for the ongoing revisions as well.
17
Transformation services
• Integration
– Integration involves generating surrogate keys, mapping keys from one system to another,
mapping codes into full descriptions, and maintaining a master key lookup table for transformation.
• Slowly changing dimension maintenance
– Identifying changed values and creating surrogate keys is a tricky process.
• Referential integrity checking
– The data in one table must match the corresponding data in another table.
• Denormalization and renormalization
– Denormalizing a hierarchy of separate tables into a dimension is a standard warehouse
transformation process. Some of the data staging tools offer a star schema feature that
automatically performs this function.
– Some denormalization takes place in the fact table process. For example, a financial schema may
have a dimension that is the amount type, with the values Actual, Budget, or Forecast.
– When a source comes in as a fully denormalized flat file, it may be necessary to normalize parts of
the record before loading.
18
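The integration tasks above (surrogate keys, key mapping, code-to-description mapping) can be pictured with a small lookup-table sketch. The source system names, codes, and column names below are hypothetical.

```python
# Master key lookup: (source system, natural key) -> warehouse surrogate key.
key_map = {}
next_surrogate = 1

def get_surrogate_key(source_system, natural_key):
    """Return the surrogate key for a natural key, assigning a new one if unseen."""
    global next_surrogate
    if (source_system, natural_key) not in key_map:
        key_map[(source_system, natural_key)] = next_surrogate
        next_surrogate += 1
    return key_map[(source_system, natural_key)]

# Mapping cryptic source codes into full descriptions for the dimension table.
CHANNEL_DESCRIPTIONS = {"W": "Web", "S": "Store", "C": "Call centre"}

row = {"cust_no": "A1001", "channel": "W"}
dim_row = {
    "customer_key": get_surrogate_key("pos_system", row["cust_no"]),
    "channel_desc": CHANNEL_DESCRIPTIONS.get(row["channel"], "Unknown"),
}
print(dim_row)  # {'customer_key': 1, 'channel_desc': 'Web'}
```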
Transformation services
• Cleansing, deduping, merge/purge
– This is a big problem for many data warehouses, especially those concerned with external entities
like customers, businesses, doctors, and patients. It is a complex process, but several vendors offer
tools and services specifically for this problem.
• Data type conversion
– This involves lower-level transformations converting one data type or format to another.
• Calculation, derivation, allocation
– These are transformations to apply the business rules you identified during the requirements
process.
• Aggregation
– Aggregation can be handled in any part of the load process, depending on which resources are
available at which stage.
19
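A minimal sketch of deduplication and data type conversion, assuming customer records keyed by an email address; the field names and sample values are assumptions.

```python
from datetime import datetime

raw = [
    {"email": " Ann@Example.com ", "joined": "2021-03-05", "spend": "120.50"},
    {"email": "ann@example.com",   "joined": "2021-03-05", "spend": "120.50"},
    {"email": "bob@example.com",   "joined": "2020-11-30", "spend": "75"},
]

seen, cleaned = set(), []
for r in raw:
    email = r["email"].strip().lower()   # standardize before comparing
    if email in seen:                    # dedupe on the cleaned key
        continue
    seen.add(email)
    cleaned.append({
        "email": email,
        "joined": datetime.strptime(r["joined"], "%Y-%m-%d").date(),  # type conversion
        "spend": float(r["spend"]),
    })

print(cleaned)
```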
Transformation services
• Data content audit
– The transformation process should spin off check sums, row counts, and tests of reasonableness
as a matter of course. These are then compared to figures generated by the source system and
alerts generated when they don’t match.
• Data lineage audit
– The extract and transformation process can create logs of the specific runs that created specific
dimension records and fact records.
• Tool- or analysis-specific transformation
– Some tools require specific data elements to be marked or updated as part of every load process.
For example, a tool may rely on a “current date” flag in the calendar table. It is common to have to
flag abnormal facts or changed status, or to normalize facts so they range between 0 and 1. This
prepares the data for the statistical analysis tools that are the heart of data mining systems.
• Null values
– Nulls can be a problem because many legacy systems didn’t have a means to represent nulls. To
handle nulls, programmers chose a particularly unlikely value (like 9/9/99 for the date, or –1 for
the product number).
20
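The null-value and content-audit points can be illustrated together: map legacy sentinel values back to real nulls during transformation, and emit row counts and control totals to compare against the source system. The sentinel values (9/9/99 and -1) come from the slide; the column names are assumptions.

```python
SENTINEL_DATES = {"9/9/99", "09/09/1999"}   # legacy "no date" markers
SENTINEL_PRODUCT = -1                       # legacy "no product" marker

def clean_row(row):
    # Replace legacy sentinel values with proper nulls (None).
    if row["ship_date"] in SENTINEL_DATES:
        row["ship_date"] = None
    if row["product_number"] == SENTINEL_PRODUCT:
        row["product_number"] = None
    return row

def content_audit(rows):
    # Simple data content audit: row count and a control total to compare
    # against figures produced by the source system.
    return {"row_count": len(rows),
            "amount_total": round(sum(r["amount"] for r in rows), 2)}

rows = [clean_row(r) for r in [
    {"ship_date": "9/9/99", "product_number": -1, "amount": 10.0},
    {"ship_date": "2022-11-01", "product_number": 42, "amount": 25.5},
]]
print(rows)
print(content_audit(rows))
```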
Data loading
• Loading is essentially the process of moving the completed dimension and fact tables
from the staging area to their active locations.
• This may involve deleting and then replacing the previous versions of the tables or
simply adding new rows to the tables currently in use.
• The initial load moves large volumes of data using up substantial amounts of time.
• As the data warehouse starts functioning, the system continues to extract the
changes to the source data, transform the data revisions, and feed the incremental
data revisions on an ongoing basis.
21
Loading services
• Load optimization
– Most DBMSs have a bulk loading capability that includes a range of features and can be scripted or
invoked by your data staging tool through an API. Every database product has a set of techniques
and tricks that optimize its load performance. These include steps like avoiding logging during
loads and taking advantage of bulk loader capabilities like creating indexes and aggregates during
the load.
• Entire load process support
– The loading services also need to support requirements before and after the actual load, like
dropping and re-creating indexes and physical partitioning of table and indexes.
22
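As a small illustration of these loading ideas, the sketch below inserts staged rows in a single transaction and rebuilds an index only after the load, so rows are not indexed one by one. SQLite stands in for the warehouse DBMS; the table and index names are assumptions, and real bulk loaders offer far more options.

```python
import sqlite3

def bulk_load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                "(date_key INTEGER, store_key INTEGER, product_key INTEGER, dollars REAL)")
    # Drop the index before the load and re-create it afterwards,
    # so the insert is not slowed down by index maintenance.
    con.execute("DROP INDEX IF EXISTS idx_sales_store")
    with con:  # one transaction for the whole batch
        con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    con.execute("CREATE INDEX idx_sales_store ON sales_fact (store_key)")
    con.close()

bulk_load([(20221101, 1, 7, 19.99), (20221101, 2, 7, 39.98)])
```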
Data staging planning and implementation
• Develop the ETL plan
– ETL development starts out with the high-level plan, which is independent of any specific
technology or approach.
– However, it’s a good idea to decide on an ETL tool before doing any detailed planning; this can
avoid redesign and rework later in the process.
• Develop One-Time Historic Load Processing
– The historic load for dimensions and facts consists largely or entirely of inserts.
– Occasionally, the same ETL code can perform both the initial historic load and ongoing incremental
loads, but more often you build separate ETL processes for the historic and ongoing loads.
• Develop Incremental ETL Processing
– One of the biggest challenges with the incremental ETL process is identifying new, changed, and
deleted rows.
– The incremental processing primarily performs inserts, but updates for dimensions and some kinds
of fact tables are inevitable.
23
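When the source system cannot report its own changes, one simple way to identify new, changed, and deleted rows is to compare a snapshot of today's natural keys and row hashes against yesterday's. This sketch is illustrative and assumes tables small enough to compare in memory.

```python
import hashlib

def row_hash(row):
    # Hash the non-key attributes so changed rows can be detected cheaply.
    return hashlib.md5("|".join(str(v) for v in row[1:]).encode()).hexdigest()

def diff_snapshots(previous, current):
    """Each snapshot is a list of tuples whose first element is the natural key."""
    prev = {r[0]: row_hash(r) for r in previous}
    curr = {r[0]: row_hash(r) for r in current}
    new = [k for k in curr if k not in prev]
    deleted = [k for k in prev if k not in curr]
    changed = [k for k in curr if k in prev and curr[k] != prev[k]]
    return new, changed, deleted

yesterday = [("C1", "Ann", "Gold"), ("C2", "Bob", "Silver")]
today     = [("C1", "Ann", "Platinum"), ("C3", "Cat", "Silver")]
print(diff_snapshots(yesterday, today))  # (['C3'], ['C1'], ['C2'])
```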
Example of high-level staging plan
24
ETL tool
• There are a multitude of ETL tools available in
the data warehouse marketplace.
– Incumbent batch ETL tools: on-premises and based on
batch processing e.g. IBM InfoSphere DataStage,
Microsoft SSIS and Oracle Data Integrator.
– Cloud native ETL tools: cloud-based ETL services
e.g. Alooma, Fivetran, etc.
– Open source ETL tools: a modern management layer
for scheduled workflows and batch processes e.g.
Apache Airflow, Apache Kafka, etc.
– Real-time ETL tools e.g. Alooma, StreamSets, etc.
• Selection of an ETL tool should be based on the specific
requirements: need for real-time updates, handling of
streaming sources, volume of data, whether to use on-
premises technology or open source technology,
etc.
25
(Source: https://www.alooma.com/blog/etl-tools-modern-list)
Example of detailed plan for the fact table
26
(Source: The data warehouse toolkit, 3rd edn)
Three types of analytics
27
1. Descriptive analytics refers to
knowing what is happening in
the organization and
understanding some
underlying trends and causes
of such occurrences.
2. Predictive analytics is the use
of statistical techniques and
data mining to determine what
is likely to happen in the
future.
3. Prescriptive analytics is a set of
techniques that use descriptive
data and forecasts to identify
the decisions most likely to
result in the best performance.
OLAP vs data mining
• OLAP and data mining are features of ‘The Front-Room’ of DW — that is, of the
business of extracting meaningful and useful information from a data warehouse.
28
“On-Line Analytical Processing (OLAP) is a category of software technology that
enables analysts, managers, and executives to gain insight into data through fast,
consistent, interactive access to a wide variety of possible views of information that
has been transformed from raw data to reflect the real dimensionality of the
enterprise as understood by the user.” (The OLAP Council)
“Data mining is the process of discovering patterns in large data sets involving
methods at the intersection of machine learning, statistics, and database systems.”
(ACM SIGKDD)
vs
OLAP vs data mining
• In summary, OLAP and data mining are used to solve different kinds of
analytic problems:
• OLAP summarizes data and makes forecasts. For example, OLAP answers
questions like "What are the average sales of mutual funds, by region and by
year?”
• Data mining discovers hidden patterns in data. Data mining operates at a detail
level instead of a summary level. Data mining answers questions like "Who is
likely to buy a mutual fund in the next six months, and what are the
characteristics of these likely buyers?"
29
(Oracle: https://docs.oracle.com/cd/B19306_01/server.102/b14223/bi.htm)
MOLAP, ROLAP and HOLAP
• In the OLAP world, there are mainly two different types:
Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP).
• Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and
ROLAP, i.e. HOLAP = MOLAP + ROLAP.
• HOLAP technologies attempt to combine the advantages of MOLAP and
ROLAP.
• For summary-type information, HOLAP leverages cube technology for
faster performance. When detail information is needed, HOLAP can “drill
through” from the cube into the underlying relational data.
30
MOLAP
• MOLAP data is stored in
multidimensional databases (MDDBs).
The storage is not in the relational
database, but in proprietary formats.
• These multidimensional databases are
formed from large multidimensional
arrays. The cells, or data cubes, of these
multidimensional databases carry pre-
calculated and pre-aggregated data.
• Proprietary software systems create this
pre-calculated data as the data is loaded
into the MDDBs from the main databases.
31
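The pre-calculation idea behind MOLAP can be pictured with a tiny in-memory "cube": totals for every combination of dimension values are computed up front, so queries become simple lookups rather than scans of detail data. This sketch uses plain Python dictionaries and hypothetical data; real MOLAP engines use proprietary multidimensional storage.

```python
from itertools import product
from collections import defaultdict

# Hypothetical detail-level sales: (store, product, month, dollars)
sales = [
    ("S1", "P1", "Jan", 100), ("S1", "P2", "Jan", 50),
    ("S2", "P1", "Feb", 80),  ("S1", "P1", "Feb", 40),
]

# Pre-calculate totals for every combination of dimension levels,
# using "ALL" as the rolled-up level of each dimension.
cube = defaultdict(float)
for store, prod, month, dollars in sales:
    for s, p, m in product((store, "ALL"), (prod, "ALL"), (month, "ALL")):
        cube[(s, p, m)] += dollars

# Queries against the cube are lookups into the pre-calculated cells.
print(cube[("S1", "ALL", "ALL")])   # total sales for store S1 -> 190.0
print(cube[("ALL", "P1", "Feb")])   # product P1 in Feb -> 120.0
```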
MOLAP
• Advantages
– Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing
and dicing operations.
– Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but they return quickly.
• Disadvantages
– Limited in the amount of data it can handle: Because all calculations are performed when the cube
is built, it is not possible to include a large amount of data in the cube; only summary-
level information will be included in the cube itself.
– Requires additional investment: Cube technology is often proprietary and does not already exist in
the organization. Therefore, adopting MOLAP technology is likely to require additional investments
in human and capital resources.
32
ROLAP
• ROLAP relies on manipulating the
data stored in the relational
database (in star schema) to give the
appearance of traditional OLAP’s
slicing and dicing functionality.
• Each action of slicing and dicing is
equivalent to adding a "WHERE"
clause in the SQL statement.
33
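To illustrate the point that each slice or dice becomes a WHERE clause, the sketch below runs a dimensional query directly against relational tables in SQLite; the star-schema table and column names are assumptions made for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE store_dim  (store_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales_fact (store_key INTEGER, month TEXT, dollars REAL);
    INSERT INTO store_dim  VALUES (1, 'East'), (2, 'West');
    INSERT INTO sales_fact VALUES (1, 'Jan', 100), (1, 'Feb', 40), (2, 'Jan', 80);
""")

# "Slice" on region = 'East': the OLAP operation becomes a WHERE clause in SQL.
rows = con.execute("""
    SELECT d.region, f.month, SUM(f.dollars)
    FROM sales_fact f JOIN store_dim d ON f.store_key = d.store_key
    WHERE d.region = 'East'
    GROUP BY d.region, f.month
""").fetchall()
print(rows)   # e.g. [('East', 'Feb', 40.0), ('East', 'Jan', 100.0)]
```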
ROLAP
• Advantages:
– Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation
on data size of the underlying relational database. In other words, ROLAP itself places no
limitation on data amount.
– Can leverage functionalities inherent in the relational database: Often, relational database already
comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational
database, can therefore leverage these functionalities.
• Disadvantages:
– Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL
queries) in the relational database, the query time can be long if the underlying data size is large.
– Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL
statements to query the relational database, and SQL statements do not fit all needs (for example,
it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore
traditionally limited by what SQL can do.
34
What is data mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially
useful) patterns or knowledge from huge amounts of data.
– Data mining: a misnomer?
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
35
Data mining is a blend of multiple disciplines
36
DATA MINING
(Knowledge
Discovery)
Statistics
Artificial
Intelligence
Machine
Learning & Pattern
Recognition
Information
Visualization
Database
Management
& Data
Warehousing
Management
Science
& Information
Systems
(Source: Business intelligence,
analytics, and data science : a
managerial perspective, 2018)
A taxonomy of
data mining
37
(Source: Business intelligence, analytics, and data science : a managerial perspective, 2018)
Data mining tasks, learning types, and representative algorithms:
• Prediction
– Classification (supervised): Decision Trees, Neural Networks, Support Vector Machines, kNN, Naïve Bayes, GA
– Regression (supervised): Linear/Nonlinear Regression, ANN, Regression Trees, SVM, kNN, GA
– Time series (supervised): Autoregressive Methods, Averaging Methods, Exponential Smoothing, ARIMA
• Association
– Market-basket (unsupervised): Apriori, OneR, ZeroR, Eclat, GA
– Link analysis (unsupervised): Expectation Maximization, Apriori Algorithm, Graph-based Matching
– Sequence analysis (unsupervised): Apriori Algorithm, FP-Growth, Graph-based Matching
• Segmentation
– Clustering (unsupervised): K-means, Expectation Maximization (EM)
– Outlier analysis (unsupervised): K-means, Expectation Maximization (EM)
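For a feel of how two of these methods are used in practice, the sketch below fits a decision tree (supervised classification) and k-means (unsupervised clustering) on tiny made-up data. It assumes scikit-learn is installed; the features and labels are purely illustrative.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Supervised classification: predict whether a customer buys (1) or not (0)
# from [age, income-in-thousands] features.
X = [[25, 30], [40, 80], [35, 60], [22, 25], [50, 90], [28, 40]]
y = [0, 1, 1, 0, 1, 0]
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[45, 85], [23, 28]]))   # e.g. [1 0]

# Unsupervised clustering: segment the same customers into 2 groups
# without using any labels at all.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```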
Common applications for data mining across industries
Business Question | Application | What Is Predicted?
• How to better target product/service offers? | Profiling and segmentation | Customer behaviors and needs by segment
• Which product/service to recommend? | Cross-sell and up-sell | Probable customer purchases
• How to grow and maintain valuable customers? | Acquisition and retention | Customer preferences and purchase patterns
• How to direct the right offer to the right person at the right time? | Campaign management | The success of customer communications
• Which customers to invest in and how to best appeal to them? | Profitability and lifetime value | Drivers of future value (margin and retention)
38
(Source: Oracle (2016) Data Mining From A to Z)
Industry-specific data mining applications
Business Question | Application | What Is Predicted?
• How to assess and control risk within existing (or new) consumer portfolios? | Credit scoring (banking) | Creditworthiness of new and existing sets of customers
• How to increase sales with cross-sell/up-sell, loyalty programs and promotions? | Recommendation systems (online retail) | Products that are likely to be purchased next
• How to minimize operational disruptions and maintenance costs? | Asset maintenance (utilities, manufacturing, oil and gas) | The real drivers of asset or equipment failure
• How to reduce health care costs and satisfy patients? | Health and condition management (health insurance) | Patients at risk of chronic, treatable/preventable illness
• How to decrease fraud losses and lower false positives? | Fraud management and cybersecurity (government, insurance, banks) | Unknown fraud cases and future risks
• How to bring drugs to the marketplace quickly and effectively? | Drug discovery (life sciences) | Compounds that have desirable effects
39
(Source: Oracle (2016) Data Mining From A to Z)
ETL vs ELT (Data warehouse vs Data lake)
40
(Source: https://www.xplenty.com/blog/etl-vs-elt/)
ELT vs ETL
• ETL is normally a continuous, ongoing process with a well-defined workflow. ETL first
extracts data from homogeneous or heterogeneous data sources. Then, data is
cleansed, enriched, transformed, and stored back in a data warehouse.
• ELT (Extract, Load, Transform) is a variant of ETL wherein the extracted data is first
loaded into the target system. Transformations are performed after the data is
loaded into the data lake.
• ELT typically works well when the target system is powerful enough to handle
transformations. Analytical databases like Amazon Redshift and Google BigQuery are
often used in ELT pipelines because they are highly efficient in performing
transformations.
41
(Source: https://www.xplenty.com/blog/etl-vs-elt/)
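A minimal illustration of the ELT pattern: raw rows are loaded as-is into the target first, and the transformation is then expressed in SQL inside the target system. SQLite stands in here for an analytical database such as Redshift or BigQuery; the table names are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Extract + Load: land the raw, untransformed rows straight into the target.
con.execute("CREATE TABLE raw_orders (order_id TEXT, store TEXT, amount TEXT)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                [("o1", " s1 ", "10.5"), ("o2", "S1", "4.5"), ("o3", "s2", "7")])

# Transform: performed inside the target engine with SQL, after loading.
con.execute("""
    CREATE TABLE orders_clean AS
    SELECT order_id,
           UPPER(TRIM(store))   AS store_id,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
""")
print(con.execute(
    "SELECT store_id, SUM(amount) FROM orders_clean GROUP BY store_id").fetchall())
# e.g. [('S1', 15.0), ('S2', 7.0)]
```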
Comparison of ETL and ELT
• ETL pros
– Less complexity and time involved in development, since the process develops only the defined output.
– Many tools on the market implement ETL, providing flexibility in choosing the required tool.
– The ETL process is mature and has well-defined best practices and processes.
– Because ETL has been in use for over a decade, ETL experts are available in abundance.
• ETL cons
– It loads only the data types defined at design time; adding a new data type later adds time and cost, so it is less flexible.
– The hardware, maintenance and licensing costs of ETL tools are high.
– ETL tools are mostly limited to processing relational data.
• ELT pros
– Because the ELT process loads and transforms data in smaller chunks, project management is easier.
– It uses the same hardware for processing and storage, minimizing additional hardware cost.
– It can process both semi-structured and unstructured data.
– It is flexible because it stores raw data, which can easily be merged into the warehouse structure later.
• ELT cons
– ELT has not been widely used, so it has less process maturity; however, many industries are implementing ELT and its popularity is increasing.
– Due to limited usage, the implementation tools are limited.
– Limited adoption has also affected the availability of ELT experts, although this is changing as more people work with ELT.
42
(Source: http://www.infosysblogs.com/oracle/2018/03/etl_vs_elt.html)