Biostat 823 - Data Warehousing
Hilmar Lapp 
        
            Duke University, Department of Biostatistics & Bioinformatics
          
     
 
  2024-09-12
OLTP vs OLAP
| Real-time data transactions | Data analysis (BI) | 
| Fast response times, protected data integrity | Dataset generation for question answering | 
| frequent INSERT, UPDATE, DELETE | SELECT, frequently aggregated | 
| Highly normalized design | Denormalized design | 
| Constantly updated | Read-only, periodically refreshed | 
Data Warehousing
- Data Warehouses support OLAP use-cases and applications.
- Typically built by ETL or ELT process(es)
- Often from multiple source databases
- Denormalized database schema design based on fact and dimension tables
Datamarts vs Data Warehouse
- Datamart is a type of Data Warehouse
- Datamarts are typically more narrow in data scope
- Typically focused on one subject area
- Easier and less time-consuming to build
- Less memory and storage-intensive than a full Data Warehouse ![]() 
 
Star Schema
- Uses one (central) Fact table and multiple Dimension tables
- Fact table houses quantitative metrics or measurements
- Typically aggregated, based on chosen granularity (of dimension(s))
- Foreign key to each dimension ![]() 
 
- Dimension tables house the dimensions along which to aggregate or slice facts
- Time, location, people, etc
- Typically denormalized
- Each dimension table has 1:n relationship to the fact table
 
- Fact table normally has orders of magnitude more rows than dimension tables
- In essence, this represents a data cube
- Fact table is the cells; dimension tables are the dimensions (“axes”) of the cube
 
Star Schema example (generic)
Clinical data warehouse example
![]()
Figure 1 from Hart and Kuo (2016), “Meeting Health Care Research Needs in a Kimball Integrated Data Warehouse,” 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA) doi: 10.1109/DSAA.2016.91
Reverse Star Schema
- Deployed for descriptive or observational facts
- Quantitative metrics often are not meaningful
- Typically multiple related facts (“dimension” table rows) for the central fact ![Figure 1 of Zhang et al (2011), BioMart: a data federation framework for large collaborative projects. doi:10.1093/database/bar038 Figure 1 of Zhang et al (2011)]() 
- Queries often involve attributes of the central fact
 
- Fact table has 1:n (“reversed”) relationship to each “dimension”
- Hence, fact table is usually smaller than any of the dimension tables
 
Genomics Data Warehouse systems