Biostat 823 - Relational data modeling
Hilmar Lapp
Duke University, Department of Biostatistics & Bioinformatics
2024-09-03
Relational data model
- Relational models are based on first-order predicate logic
- Entities and predicates
- Relations (tables) and tuples (rows)
Entity-relationship (E-R) model
- Entities are identifiable things of interest in the domain of interest
- Entities have attributes (properties):
- The set of attributes that uniquely identify the entity is the natural key.
- Relationships between entities
Logical vs Physical E-R model I
Logical data model:
- All entities (including keys), their attributes, and relationships (including cardinalities)
- Independent of database implementation
- Abstract model
Physical data model:
- Usually derived from the logical model
- For instantiation in a relational database manage-ment system (RDBMS)
- Includes indexes and constraints
- Can derive multiple physical data models from same logical model
Relationship cardinalities
Resolving n:n relationships
RDBMSs do not directly support n:n relationships between two tables, so they must be resolved for a physical E-R model.
erDiagram
Instructor }|--|{ Course : teaches
Logical model
erDiagram
INSTRUCTOR ||--|{ INSTRUCTOR_TO_COURSE : teaches
COURSE ||--|{ INSTRUCTOR_TO_COURSE : "taught by"
Physical model, using associative table
erDiagram
Instructor ||--|{ Lesson : teaches
Course ||--|{ Lesson : "consists of"
Logical and/or physical model
Strong vs weak entities
erDiagram
Instructor ||--|{ Lesson : teaches
Course ||--|{ Lesson : "consists of"
Instructor {
string Name
string Email PK
}
Course {
string Title
string Code PK
}
Lesson {
string Name
}
Instructor and Course exist on their own, identifiable by attributes of their own: strong entities
Lesson exists only within the context of a Course, not identifiable on its own: weak entity
Foreign Keys
erDiagram
Instructor ||--|{ Lesson : teaches
Course ||--|{ Lesson : "consists of"
Instructor {
string Name
string Email PK
}
Course {
string Title
string Code PK
}
Lesson {
string Name
}
Foreign keys implicit
erDiagram
Instructor ||--|{ Lesson : teaches
Course ||--|{ Lesson : "consists of"
Instructor {
integer Instructor_ID PK
string Name
string Email PK
}
Course {
integer Course_ID PK
string Title
string Code PK
}
Lesson {
string Name
integer Instructor_ID FK
integer Course_ID FK
}
Explicit foreign keys and synthetic (a.k.a. surrogate) primary keys
Entity relationships can be self-referential
erDiagram
Instructor ||--|{ Lesson : teaches
Course ||--|{ Lesson : "consists of"
Course |o--o{ Course : "required by"
Instructor {
integer Instructor_ID PK
string Name
string Email PK
}
Course {
integer Course_ID PK
string Title
string Code PK
integer Req_Course_ID FK
}
Lesson {
string Name
integer Instructor_ID FK
integer Course_ID FK
}
E-R models can be very complex
- Consider for example the BioSQL ERD
- Represents entries in biological sequence databases (such as GenBank), their key-value annotations, cross-references, and feature annotations (which themselves have key-value annotations)
Database normalization
- Goal is to minimize data redundancy, improve data integrity, prevent anomalies
- Enable a database to enforce data integrity through uniqueness, not-null, and referential integrity constraints
- Formalized as normal forms
- 1st, 2nd, and 3rd Forms are most widely practiced
- Stated initially by Codd in 1970 (1NF) and 1971 (2NF, 3NF)
Logical vs Physical E-R model II
| Entities and their attributes |
Tables and columns |
| Natural keys (identifying attributes) |
Unique key constraints |
|
Surrogate primary keys, auto-increment mechanism |
| Relationships |
Foreign keys; associative tables for n:n relationships |
| Relationship cardinalities |
Foreign key and NOT NULL constraints |
|
Indexes |