Biostat 823 - Data Warehousing
Hilmar Lapp
Duke University, Department of Biostatistics & Bioinformatics
2024-09-12
OLTP vs OLAP
Real-time data transactions |
Data analysis (BI) |
Fast response times, protected data integrity |
Dataset generation for question answering |
frequent INSERT, UPDATE, DELETE |
SELECT, frequently aggregated |
Highly normalized design |
Denormalized design |
Constantly updated |
Read-only, periodically refreshed |
Data Warehousing
- Data Warehouses support OLAP use-cases and applications.
- Typically built by ETL or ELT process(es)
- Often from multiple source databases
- Denormalized database schema design based on fact and dimension tables
Datamarts vs Data Warehouse
- Datamart is a type of Data Warehouse
- Datamarts are typically more narrow in data scope
- Typically focused on one subject area
- Easier and less time-consuming to build
- Less memory and storage-intensive than a full Data Warehouse
Star Schema
- Uses one (central) Fact table and multiple Dimension tables
- Fact table houses quantitative metrics or measurements
- Typically aggregated, based on chosen granularity (of dimension(s))
- Foreign key to each dimension
- Dimension tables house the dimensions along which to aggregate or slice facts
- Time, location, people, etc
- Typically denormalized
- Each dimension table has 1:n relationship to the fact table
- Fact table normally has orders of magnitude more rows than dimension tables
- In essence, this represents a data cube
- Fact table is the cells; dimension tables are the dimensions (“axes”) of the cube
Star Schema example (generic)
Clinical data warehouse example
Figure 1 from Hart and Kuo (2016), “Meeting Health Care Research Needs in a Kimball Integrated Data Warehouse,” 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA) doi: 10.1109/DSAA.2016.91
Reverse Star Schema
- Deployed for descriptive or observational facts
- Quantitative metrics often are not meaningful
- Typically multiple related facts (“dimension” table rows) for the central fact
- Queries often involve attributes of the central fact
- Fact table has 1:n (“reversed”) relationship to each “dimension”
- Hence, fact table is usually smaller than any of the dimension tables
Genomics Data Warehouse systems