Data Warehousing and Mining

Data Warehousing
1. Benefits
  1. Increased corporate productivity.
  2. Competitive advantage.
  3. Potential for high ROI.
2. Problems
  1. Extremely high initial costs (£50k+).
  2. Long development time (roughly 3 years).
  3. High demand for memory.
  4. High maintenance costs.
  5. Problems with source data (extraction, cleaning, loading).
Building a Data Warehouse Database (Dimensionality Modelling)
1. Fact Tables
  1. Contain facts generated by events in the past.
  2. Data in tables should be regarded as read only.
  3. Tables are often very large.
2. Dimension Tables
  1. Contain descriptive textual data.
  2. Have simple primary keys.
  3. Joined to the fact table, they give the characteristic star schema (or star join).
Star Schema
1. De-normalising reference data can speed up query performance.
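2. By contrast, the main aim of conventional (OLTP) database design is to avoid data redundancy.
3. That is achieved in part via the process of normalisation, which a star schema deliberately relaxes for its reference data.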
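As a rough illustration of a star schema, the sketch below builds one fact table and two dimension tables using Python's built-in sqlite3 module. The property-rental domain, the table and column names, and the sample rows are all assumptions made for this example, not part of the notes.

```python
import sqlite3

# Hypothetical star schema: every table, column and row here is illustrative.
star = sqlite3.connect(":memory:")
star.executescript("""
    -- Dimension tables: descriptive text with simple primary keys.
    CREATE TABLE dim_branch (
        branch_id   INTEGER PRIMARY KEY,
        branch_name TEXT,
        city        TEXT,   -- de-normalised: repeated on every branch row
        country     TEXT
    );
    CREATE TABLE dim_property_type (
        type_id   INTEGER PRIMARY KEY,
        type_name TEXT
    );
    -- Fact table: one row per past event, treated as read only once loaded.
    CREATE TABLE fact_rental (
        branch_id INTEGER REFERENCES dim_branch(branch_id),
        type_id   INTEGER REFERENCES dim_property_type(type_id),
        year      INTEGER,
        revenue   REAL
    );
""")
star.executemany(
    "INSERT INTO dim_branch VALUES (?, ?, ?, ?)",
    [(1, "B001", "Glasgow", "UK"), (2, "B002", "Glasgow", "UK"), (3, "B003", "London", "UK")],
)
star.executemany(
    "INSERT INTO dim_property_type VALUES (?, ?)",
    [(1, "Flat"), (2, "House")],
)
star.executemany(
    "INSERT INTO fact_rental VALUES (?, ?, ?, ?)",
    [(1, 1, 2023, 1000.0), (2, 2, 2023, 1500.0), (3, 1, 2023, 2000.0), (1, 2, 2024, 500.0)],
)


def star_revenue_by_city(conn):
    """Star join: the fact table joined directly to one dimension table."""
    rows = conn.execute("""
        SELECT b.city, SUM(f.revenue)
        FROM fact_rental AS f
        JOIN dim_branch AS b ON b.branch_id = f.branch_id
        GROUP BY b.city
    """).fetchall()
    return dict(rows)


print(star_revenue_by_city(star))
```

Because city and country are stored directly on each branch row, the query needs only a single join, which is the performance benefit that de-normalising reference data is meant to give.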
OLTP System
1. Automating business saves money.
2. Data could be useful in the organisation's future operations.
3. Information too detailed.
4. May require information from more than one OLTP system.
5. Difficult to extract information.
Snowflake Schema
1. Variant of Star Schema where dimension tables do not contain de-normalised data.
2. Dimension tables have other dimension tables linked to them via foreign keys.
3. More than one dimension table can share these "dimension of a dimension" tables.
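A snowflake layout can be sketched in the same way. In the example below, city data is moved out of the dimension tables into its own table and shared by two dimensions through foreign keys. It uses Python's built-in sqlite3 module; the tables, columns, and rows are hypothetical and chosen only to illustrate the structure.

```python
import sqlite3

# Hypothetical snowflake schema: all names and rows are illustrative.
snow = sqlite3.connect(":memory:")
snow.executescript("""
    -- "Dimension of a dimension": city details are stored once.
    CREATE TABLE dim_city (
        city_id   INTEGER PRIMARY KEY,
        city_name TEXT,
        country   TEXT
    );
    -- Two dimension tables share dim_city through a foreign key.
    CREATE TABLE dim_branch (
        branch_id   INTEGER PRIMARY KEY,
        branch_name TEXT,
        city_id     INTEGER REFERENCES dim_city(city_id)
    );
    CREATE TABLE dim_client (
        client_id   INTEGER PRIMARY KEY,
        client_name TEXT,
        city_id     INTEGER REFERENCES dim_city(city_id)
    );
    CREATE TABLE fact_rental (
        branch_id INTEGER REFERENCES dim_branch(branch_id),
        client_id INTEGER REFERENCES dim_client(client_id),
        revenue   REAL
    );
""")
snow.executemany("INSERT INTO dim_city VALUES (?, ?, ?)",
                 [(1, "Glasgow", "UK"), (2, "London", "UK")])
snow.executemany("INSERT INTO dim_branch VALUES (?, ?, ?)",
                 [(1, "B001", 1), (2, "B002", 2)])
snow.executemany("INSERT INTO dim_client VALUES (?, ?, ?)",
                 [(1, "Client A", 2), (2, "Client B", 1)])
snow.executemany("INSERT INTO fact_rental VALUES (?, ?, ?)",
                 [(1, 1, 1000.0), (1, 2, 500.0), (2, 1, 2000.0)])


def snowflake_revenue_by_city(conn, dimension):
    """Total revenue per city, reached through the chosen dimension table."""
    if dimension not in ("branch", "client"):  # whitelist before formatting SQL
        raise ValueError("unknown dimension: " + dimension)
    query = f"""
        SELECT c.city_name, SUM(f.revenue)
        FROM fact_rental AS f
        JOIN dim_{dimension} AS d ON d.{dimension}_id = f.{dimension}_id
        JOIN dim_city AS c ON c.city_id = d.city_id
        GROUP BY c.city_name
    """
    return dict(conn.execute(query).fetchall())


print(snowflake_revenue_by_city(snow, "branch"))
print(snowflake_revenue_by_city(snow, "client"))
```

The trade-off shows up in the query: reaching the city name now takes two joins, in exchange for storing each city only once.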
Starflake Schema
1. Hybrid structure that contains a mixture of star and snowflake schemas.
2. Contains both normalised and de-normalised data.
3. Some dimension tables may be present in both normalised and de-normalised forms.
OLAP Analytical Operations
1. Consolidation
  1. Involves aggregating data, such as "roll-ups": branches can be rolled up to cities, cities to countries, etc.
2. Drill-down
  1. Reverse of consolidation.
  2. Involves displaying the detailed data that comprises the consolidated data.
3. Slicing and Dicing (aka pivoting)
  1. Ability to view data from different viewpoints.
  2. One slice may display revenue by type of property within cities.
  3. Another slice may display revenue by branch office within cities.
  4. Often performed along a time axis to find patterns and trends.
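The three operations above can be sketched over a small in-memory set of revenue facts. This is a minimal illustration in plain Python, not how an OLAP server is implemented; the field names, the sample rows, and the helper functions aggregate and take_slice are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical revenue facts; field order is given by FIELDS.
FIELDS = ("country", "city", "branch", "property_type", "year", "revenue")
FACTS = [
    ("UK", "Glasgow", "B001", "Flat", 2023, 1000.0),
    ("UK", "Glasgow", "B002", "House", 2023, 1500.0),
    ("UK", "London", "B003", "Flat", 2023, 2000.0),
    ("UK", "Glasgow", "B001", "House", 2024, 500.0),
]


def aggregate(facts, by):
    """Sum revenue grouped by the named fields (keys are tuples)."""
    positions = [FIELDS.index(name) for name in by]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[p] for p in positions)
        totals[key] += row[-1]  # revenue is the last field
    return dict(totals)


def take_slice(facts, **fixed):
    """Keep only the rows whose named fields equal the given values."""
    return [
        row for row in facts
        if all(row[FIELDS.index(name)] == value for name, value in fixed.items())
    ]


# Consolidation (roll-up): branches -> cities -> countries.
by_branch = aggregate(FACTS, ["branch"])
by_city = aggregate(FACTS, ["city"])
by_country = aggregate(FACTS, ["country"])

# Drill-down: show the branch-level detail behind one city's total.
glasgow_detail = aggregate(take_slice(FACTS, city="Glasgow"), ["branch"])

# Slicing and dicing: the same facts viewed from different viewpoints.
type_within_city = aggregate(FACTS, ["city", "property_type"])
branch_within_city = aggregate(FACTS, ["city", "branch"])
by_year = aggregate(FACTS, ["year"])  # along the time axis
```

Each view is just a different choice of grouping fields over the same facts, which is why roll-up and drill-down are inverses of one another.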
Data Mining Operations and Techniques
1. Predictive Modelling
  1. Mirrors the way humans learn from experience: observations are used to form a model of the important characteristics of some phenomenon.
  2. The model is developed using a two-phase supervised learning approach.
    1. The training phase uses a large sample of historical data called a training set to build a model of the important characteristics.
    2. The testing phase tests the accuracy and performance of the model on new data.
  3. Used in credit approval, customer retention management, direct marketing.
2. Database Segmentation
  1. Partitions the database into an unknown number of segments or clusters of similar records.
  2. Results can be displayed on a scatterplot.
  3. Used in customer profiling and direct marketing.
3. Link Analysis
  1. Aims to discover links (called associations) between individual records or groups of records in a database.
4. Anomaly Detection
  1. Identifies outliers (expressions of deviation from previously known expectations and norms).
  2. Used in detection of credit card and insurance fraud, quality control and defect tracing.
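Two of the techniques above can be sketched in a few lines of plain Python. The first part shows the two-phase approach from Predictive Modelling (train on historical data, then test on new data), using a nearest-centroid classifier chosen here purely for brevity; the second flags outliers by z-score as one simple form of Anomaly Detection. The credit-approval records, the transaction amounts, the threshold, and all function names are assumptions made for this illustration.

```python
from math import dist
from statistics import mean, pstdev


def train(training_set):
    """Training phase: summarise each class by the centroid of its records."""
    sums = {}
    for features, label in training_set:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + f for t, f in zip(total, features)], count + 1)
    return {label: tuple(t / n for t in total) for label, (total, n) in sums.items()}


def predict(model, features):
    """Assign the class whose centroid is closest to the record."""
    return min(model, key=lambda label: dist(model[label], features))


def accuracy(model, test_set):
    """Testing phase: fraction of unseen records classified correctly."""
    correct = sum(predict(model, features) == label for features, label in test_set)
    return correct / len(test_set)


def find_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical credit-approval history: (income, debt) in thousands.
training_set = [
    ((50, 5), "approve"), ((60, 10), "approve"),
    ((20, 30), "reject"), ((25, 40), "reject"),
]
# New records held back from training, used only to measure accuracy.
test_set = [((55, 8), "approve"), ((18, 35), "reject"), ((35, 30), "approve")]

model = train(training_set)
print("accuracy on test set:", accuracy(model, test_set))

# Hypothetical card transaction amounts with one suspicious value.
amounts = [20, 22, 19, 21, 20, 23, 18, 21, 500]
print("outliers:", find_outliers(amounts))
```

The test set is kept separate from the training set so that the accuracy figure reflects performance on data the model has not seen.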


	Created by i7752068 over 10 years ago