Biostat 823 - Relational data modeling

Hilmar Lapp

Duke University, Department of Biostatistics & Bioinformatics

2024-09-03

Relational data model

  • Relational models are based on first-order predicate logic
  • Entities and predicates
  • Relations (tables) and tuples (rows)

Entity-relationship (E-R) model

  • Entities are identifiable things of interest in the domain of interest
    • Entities have attributes (properties):
    • The set of attributes that uniquely identify the entity is the natural key.
  • Relationships between entities

Logical vs Physical E-R model I

Logical data model:

  • All entities (including keys), their attributes, and relationships (including cardinalities)
  • Independent of database implementation
  • Abstract model

Physical data model:

  • Usually derived from the logical model
  • For instantiation in a relational database manage-ment system (RDBMS)
  • Includes indexes and constraints
  • Can derive multiple physical data models from same logical model

Relationship cardinalities

  • {0,1}:1
  • {0,1}:n
  • n:n

Crow’s foot notation

Description Symbol
Ring and dash: Minimum zero, maximum one (optional)
Dash and dash: Exactly one (mandatory)
Ring and crow’s foot: Minimum zero, maximum many (optional)
Dash and crow’s foot: Minimum one, maximum many (mandatory)

Examples for Crow’s foot notation

erDiagram
    Instructor ||--o| Course : teaches
erDiagram
    Instructor ||--|| Course : teaches
erDiagram
    Instructor ||--o{ Course : teaches
erDiagram
    Instructor }|--|{ Course : teaches

Resolving n:n relationships

RDBMSs do not directly support n:n relationships between two tables, so they must be resolved for a physical E-R model.

erDiagram
    Instructor }|--|{ Course : teaches

Logical model

erDiagram
    INSTRUCTOR ||--|{ INSTRUCTOR_TO_COURSE : teaches
    COURSE     ||--|{ INSTRUCTOR_TO_COURSE : "taught by"

Physical model, using associative table

erDiagram
    Instructor ||--|{ Lesson : teaches
    Course     ||--|{ Lesson : "consists of"

Logical and/or physical model

Strong vs weak entities

erDiagram
    Instructor ||--|{ Lesson : teaches
    Course     ||--|{ Lesson : "consists of"
    Instructor {
      string Name
      string Email PK
    }
    Course {
      string Title
      string Code PK
    }
    Lesson {
      string Name
    }
  • Instructor and Course exist on their own, identifiable by attributes of their own: strong entities
  • Lesson exists only within the context of a Course, not identifiable on its own: weak entity

Foreign Keys

erDiagram
    Instructor ||--|{ Lesson : teaches
    Course     ||--|{ Lesson : "consists of"
    Instructor {
      string  Name
      string  Email PK
    }
    Course {
      string Title
      string Code PK
    }
    Lesson {
      string Name
    }

Foreign keys implicit

erDiagram
    Instructor ||--|{ Lesson : teaches
    Course     ||--|{ Lesson : "consists of"
    Instructor {
      integer Instructor_ID PK
      string  Name
      string  Email PK
    }
    Course {
      integer Course_ID PK
      string Title
      string Code PK
    }
    Lesson {
      string Name
      integer Instructor_ID FK
      integer Course_ID FK
    }

Explicit foreign keys and synthetic (a.k.a. surrogate) primary keys

Entity relationships can be self-referential

erDiagram
    Instructor ||--|{ Lesson : teaches
    Course     ||--|{ Lesson : "consists of"
    Course     |o--o{ Course : "required by"
    Instructor {
      integer Instructor_ID PK
      string  Name
      string  Email PK
    }
    Course {
      integer Course_ID PK
      string Title
      string Code PK
      integer Req_Course_ID FK
    }
    Lesson {
      string Name
      integer Instructor_ID FK
      integer Course_ID FK
    }

E-R models can be very complex

  • Consider for example the BioSQL ERD
    • Represents entries in biological sequence databases (such as GenBank), their key-value annotations, cross-references, and feature annotations (which themselves have key-value annotations)

Database normalization

  • Goal is to minimize data redundancy, improve data integrity, prevent anomalies
    • Enable a database to enforce data integrity through uniqueness, not-null, and referential integrity constraints
  • Formalized as normal forms
    • 1st, 2nd, and 3rd Forms are most widely practiced
    • Stated initially by Codd in 1970 (1NF) and 1971 (2NF, 3NF)

First normal form

A relation is in first normal form iff no attribute domain has relations as elements.

Attribute values must be “atomic” (not tables, arrays, lists, etc).

erDiagram
    Instructor {
      string  Name
      string  Email
      array  Course_Lessons
    }

Unnormalized

erDiagram
    Instructor        ||--|{ Course_Lesson : teaches
    Instructor {
      string  Name
      string  Email
    }
    Course_Lesson {
      string Course_Name
      string Lesson_Name
    }

Normalized to 1NF

Second normal form

A relation is in 2NF iff it is in 1NF and it does not have any non-prime attribute functionally dependent on any proper subset of any candidate key of the relation.

If any table has a composite natural key, no column in that table depends on only a part of the composite key.

erDiagram
    Instructor        ||--|{ Course_Lesson : teaches
    Instructor {
      string  Name
      string  Email
    }
    Course_Lesson {
      string Course_Name
      string Lesson_Name
      string Room
    }

In 1NF but not 2NF

erDiagram
    Instructor        ||--|{ Lesson : teaches
    Course            ||--|{ Lesson : "consists of"
    Instructor {
      string  Name
      string  Email
    }
    Course {
      string Name
      string Room
    }
    Lesson {
      string Name
    }

Normalized to 2NF

Third normal form

A relation R is in 3NF iff it is in 2NF and every non-prime attribute of R is non-transitively dependent on every key of R.

For every table, any attribute that is not part of a natural key depends directly on every key for the table.

erDiagram
    Instructor        ||--|{ Lesson : teaches
    Course            ||--|{ Lesson : "consists of"
    Instructor {
      string  Name
      string  Email
    }
    Course {
      string Name
      string Semester
      string Room
      string Room_Address
    }
    Lesson {
      string Name
    }

In 2NF but not 3NF

erDiagram
    Instructor        ||--|{ Lesson : teaches
    Course            ||--|{ Lesson : "consists of"
    Room              |o--o{ Course : "used for"
    Instructor {
      string  Name
      string  Email
    }
    Course {
      string Name
      string Semester
    }
    Room {
      string Name
      string Address
    }
    Lesson {
      string Name
    }

Normalized to 3NF

Logical vs Physical E-R model II

Logical data model Physical data model
Entities and their attributes Tables and columns
Natural keys (identifying attributes) Unique key constraints
Surrogate primary keys, auto-increment mechanism
Relationships Foreign keys; associative tables for n:n relationships
Relationship cardinalities Foreign key and NOT NULL constraints
Indexes