Biostat 823 - SQL

Hilmar Lapp

Duke University, Department of Biostatistics & Bioinformatics

2024-09-05

Structured Query Language (SQL)

  • First published in 1974, uses Codd’s relational model
  • Several successive standardizations
    • Most universally supported: SQL-92
    • Most recent: SQL-2016 (includes JSON)
  • Divided into parts:
    • Data Definition Language (DDL): create table, constraint, etc
    • Data Manipulation Language (DML): insert, update, delete
    • Data Query Language (DQL): select
    • Data Control Language (DCL): grant and revoke privileges

Notation and syntax for code examples

  • SQL is case-insensitive. We will use case to signify SQL keywords and to distinguish them from table, column etc names from the E-R model.
  • We use SQLite as the RDBMS, but where possible try to stick to SQL-92, which should work under any modern RDBMS
  • We try to point out where constructs might need modification for other RDBMSs.

Literate Programming with SQL

Knitr (the workhorse behind rendering Rmarkdown) supports a variety of “engines”, including SQL:

names(knitr::knit_engines$get())
 [1] "awk"       "bash"      "coffee"    "gawk"      "groovy"    "haskell"  
 [7] "lein"      "mysql"     "node"      "octave"    "perl"      "php"      
[13] "psql"      "Rscript"   "ruby"      "sas"       "scala"     "sed"      
[19] "sh"        "stata"     "zsh"       "asis"      "asy"       "block"    
[25] "block2"    "bslib"     "c"         "cat"       "cc"        "comment"  
[31] "css"       "ditaa"     "dot"       "embed"     "eviews"    "exec"     
[37] "fortran"   "fortran95" "go"        "highlight" "js"        "julia"    
[43] "python"    "R"         "Rcpp"      "sass"      "scss"      "sql"      
[49] "stan"      "targets"   "tikz"      "verbatim"  "ojs"       "mermaid"  

In R, the RSQLite package implements a driver for SQLite to enable the API of the DBI package. Here we create an in-memory database, and make the connection object the default for every chunk:

library(DBI)

db <- dbConnect(RSQLite::SQLite(), ":memory:")

knitr::opts_chunk$set(connection = "db")

DDL

  • DDL consists of statements (SQL commands that return a status or a count, but not a result set).
    • {CREATE, DROP, ALTER} TABLE
    • {CREATE, DROP} VIEW
    • {CREATE, DROP} INDEX
    • DROP {TABLE, VIEW, INDEX}

E-R Model

We’ll be implementing (i.e., instantiating a physical model) of this E-R model:

erDiagram
    Instructor        ||--|{ Lesson : teaches
    Course            ||--|{ Lesson : "consists of"
    Room              |o--o{ Course : "used for"
    Instructor {
      integer Instructor_OID PK
      string  Name
      string  Email
    }
    Course {
      integer Course_OID PK
      string Name
      string Semester
      integer Room_OID FK
    }
    Room {
      integer Room_OID PK
      string Name
      string Address
    }
    Lesson {
      string Name
      integer Instructor_OID FK
      integer Course_OID FK
    }

Create and define entities

  • We use CREATE TABLE to define (create) relations (entities in an E-R model), along with their constraints:
-- Double minus is the standard syntax for comment (single-line).
CREATE TABLE Instructor (
  Instructor_OID INTEGER PRIMARY KEY, -- surrogate PK (implies NOT NULL)
  Name VARCHAR(128) NOT NULL,         -- by default, NULL is allowed
  Email VARCHAR(128) NOT NULL UNIQUE  -- natural primary key
);

Composite natural keys

CREATE TABLE Course (
  Course_OID INTEGER PRIMARY KEY,
  Name VARCHAR(64) NOT NULL,
  Semester VARCHAR(16) NOT NULL,
  UNIQUE (Name, Semester) 
);

Note that table constraints must be at the end of a table definition.

Foreign key constraints

CREATE TABLE Lesson (
  Name VARCHAR(64) NOT NULL,
  Instructor_OID INTEGER NOT NULL
      -- foreign key constraint can be part of column definition
      REFERENCES Instructor (Instructor_OID)
      ON DELETE RESTRICT,
  Course_OID INTEGER NOT NULL,
  -- primary key can be composite, and be defined separately
  PRIMARY KEY (Name, Course_OID),
  -- foreign key constraints can be defined separately
  FOREIGN KEY (Course_OID)
      REFERENCES Course (Course_OID)
      ON DELETE CASCADE
);
  • For “weak” entities, natural PK will include a foreign key.
  • Entities that are never referenced in the E-R model by a foreign key may not need a surrogate primary key.

Creating indexes

  • Indexes speed up queries.

    • But they also increase transaction cost
    CREATE INDEX Instructor_Name_Idx ON Instructor (Name);
  • Indexes can add a uniqueness constraint (CREATE UNIQUE INDEX), but better practice is to include those in the table definition

  • For transaction performance, columns in FOREIGN KEY constraints should typically be indexed

CREATE INDEX Lesson_Course_Idx ON Lesson (Course_OID);
CREATE INDEX Lesson_Instructor_Idx ON Lesson (Instructor_OID);

Altering table definitions

  • Add table for rooms:
CREATE TABLE Room (
  Room_OID INTEGER PRIMARY KEY,
  Name VARCHAR(32) NOT NULL UNIQUE,
  Address VARCHAR(128)
)
  • Now add a foreign key to the courses table:
ALTER TABLE Course
  ADD COLUMN Room_OID INTEGER
  REFERENCES Room (Room_OID) ON DELETE SET NULL;
  • Altering table definitions is usually done as part of a schema migration.

DML

  • DML also consists of statements (SQL commands that return a status or a count, but not a result set).
    • INSERT
    • UPDATE
    • DELETE

Inserting rows: Basics

  • INSERT statements add rows to tables.
    • Note that auto-generated surrogate primary keys should not be supplied.
INSERT INTO Instructor (Name, Email) VALUES (
  'Hilmar Lapp', 'h.lapp@duke.edu'
);
  • Multiple rows can be specified:
INSERT INTO Instructor (Name, Email) VALUES
    ('Kouros Owzar', 'k.owzar@duke.edu')
  , ('David Page', 'd.page@duke.edu')
  , ('Cliburn Chan', 'c.chan@duke.edu')
  , ('Chengxin Yang', 'chengxin@duke.edu');
  • NULLable columns can be omitted, or provided as NULL
INSERT INTO Room (Name) VALUES ('North 100');
INSERT INTO Room (Name, Address) VALUES ('Hock 10089', NULL);
  • This extends to optional foreign keys
INSERT INTO Course (Name, Semester) VALUES
    ('Biostat 823', 'Fall 2022'); 

Violating constraints results in error

  • Violating a uniqueness constraint (natural primary key)
INSERT INTO Instructor (Name, Email) VALUES
    ('H. Lapp', 'h.lapp@duke.edu');
Error: UNIQUE constraint failed: Instructor.Email
  • Violating a NOT NULL constraint (required value)
INSERT INTO Instructor (Email) VALUES ('first.last@duke.edu');
Error: NOT NULL constraint failed: Instructor.Name
  • Violating a foreign key constraint (non-existent row)
INSERT INTO Course (Name, Semester, Room_OID) VALUES
    ('Biostat 823', 'Fall 2024', 10);
Error: FOREIGN KEY constraint failed

Inserting foreign keys

  • To insert rows with foreign key values, the statement can include a query in place of a literal value (a.k.a. subquery):
INSERT INTO Course (Name, Semester, Room_OID) VALUES
    ('Biostat 823', 'Fall 2024',
        (SELECT Room_OID FROM Room WHERE Name = 'Hock 10089')
    );
  • In programming, foreign key values are typically first retrieved in a separate query and then included as a literal (\(\rightarrow\) more concise and potentially faster INSERT statement).

Updating rows: Basics

  • The common case is to update specific row(s) matching a condition:
UPDATE Room
SET    Address = '2424 Erwin Rd, Durham, NC 27705'
WHERE  Name = 'Hock 10089'
  • The WHERE clause is optional; if left off, updates all rows.
  • Updates that create a constraint violation result in error:
UPDATE Instructor
SET    Email = 'first.last@duke.edu'
Error: UNIQUE constraint failed: Instructor.Email

Updating foreign keys

  • Similar to INSERT statements, UPDATE statements can also use a query in place of a literal value to retrieve a value dynamically:
UPDATE Course
SET    Room_OID = (SELECT Room_OID FROM Room WHERE Name = 'Hock 10089')
WHERE  Name = 'Biostat 823' AND Semester = 'Fall 2022';

Deleting rows: Basics

  • The common case is to delete row(s) matching a specific condition:
DELETE FROM Room
WHERE Name = 'North 100'; 
  • The WHERE clause is optional; if left off, all rows are deleted.

  • Note that for both UPDATE and DELETE, no rows matching the condition is not an error.

DELETE FROM Instructor
WHERE Name = 'John Smith';

Foreign key enforcement on delete

  • Enforcement action depends on FK constraint definition:
CREATE TABLE Lesson (
  -- ... (non-foreign key columns etc)
  Course_OID INTEGER NOT NULL
      REFERENCES Course (Course_OID)
      -- deleting a course cascades to deleting all its lessons
      ON DELETE CASCADE,
  Instructor_OID INTEGER NOT NULL
      REFERENCES Instructor (Instructor_OID)
      -- prevent deletion of instructor if they are assigned to lessons
      ON DELETE RESTRICT
);
CREATE TABLE Course (
  -- ... (non-foreign key columns etc)
  Room_OID INTEGER
      REFERENCES Room (Room_OID)
      -- if FK is optional, setting to NULL could be appropriate
      ON DELETE SET NULL
);

DDL and DML have many more options

  • For example, see the column constraint grammar for SQLite, which is part of the CREATE TABLE syntax definition.
  • Not all clauses are supported by all major RDBMSs.
  • When persisting data to a database in a programming language environment, table creation, inserts, etc will often be handled under the hood by a ORM (object-relational mapping) library.

DQL

  • The SELECT statement consists of:
    • SELECT: which columns (or values) to report
    • FROM: which table(s) to query and how to join tables
    • WHERE: conditions to be met for rows to be reported
    • GROUP BY: how to aggregate rows by certain columns
    • HAVING: conditions to be met for aggregated rows
    • ORDER BY: how to order the rows in the report
  • A DQL statement returns a result set
    • Technically, it returns a cursor into a result set.

Simple querying

SELECT * FROM Instructor;
5 records
Instructor_OID Name Email
1 Hilmar Lapp h.lapp@duke.edu
2 Kouros Owzar k.owzar@duke.edu
3 David Page d.page@duke.edu
4 Cliburn Chan c.chan@duke.edu
5 Chengxin Yang chengxin@duke.edu

Controlling columns to be reported

  • We can enumerate and thus limit columns to be reported:
SELECT Name, Address FROM Room;
1 records
Name Address
Hock 10089 2424 Erwin Rd, Durham, NC 27705
  • We can also rename columns, transform them, and apply functions:
SELECT Course_OID AS ID, 
       Name || ' (' || Semester || ')' AS Course_Name,
       IFNULL(Room_OID, -1) AS Room
FROM Course;
4 records
ID Course_Name Room
1 Biostat 823 (Fall 2022) 2
2 Biostat 823 (Fall 2024) 2
3 Biostat 823 (Fall 2021) -1
4 Biostat 823 (Fall 2023) 2

Filter matching rows: WHERE

SELECT Email from Instructor
WHERE Name = 'Hilmar Lapp';
1 records
Email
h.lapp@duke.edu
  • We can use functions; wildcard matches use `LIKE’:
SELECT DISTINCT Name from Lesson   -- DISTINCT makes rows unique
WHERE UPPER(Name) LIKE '%DATA%';
2 records
Name
Databases and Graphs
Relational Data Modeling

Note that unlike shell glob patterns, regular expressions etc, SQL uses % for any number of characters (including zero), and _ for one character as wildcards.

  • Conditions can include subqueries
SELECT DISTINCT Name FROM Lesson
WHERE Instructor_OID IN (
   SELECT Instructor_OID
   FROM Instructor WHERE Name LIKE 'h%'
)
4 records
Name
Containerization
Relational Data Modeling
Python for AI/ML
Reproducible Worflows

NULL is special

  • The representation of NULL depends on environment; in R:
SELECT * FROM Course;
4 records
Course_OID Name Semester Room_OID
1 Biostat 823 Fall 2022 2
2 Biostat 823 Fall 2024 2
3 Biostat 823 Fall 2021 NA
4 Biostat 823 Fall 2023 2
  • Note that NULL is not equal or unequal to anything, including itself:
SELECT Name, Semester FROM Course
WHERE Room_OID = NULL OR NOT Room_OID = NULL;
0 records
Name Semester
  • Instead we must use a special boolean operator:
SELECT Name, Semester FROM Course WHERE Room_OID IS NULL;
1 records
Name Semester
Biostat 823 Fall 2021

Joining tables: Inner Join

  • Table joins allow creating denormalized reports from a normalized database
SELECT c.Name AS Course, c.Semester, l.Name AS Lesson
FROM Course AS c INNER JOIN Lesson AS l ON (c.Course_OID = l.Course_OID)
LIMIT 5;
5 records
Course Semester Lesson
Biostat 823 Fall 2023 Applications of vector calculus
Biostat 823 Fall 2024 Containerization
Biostat 823 Fall 2023 Containerization
Biostat 823 Fall 2022 Databases and Graphs
Biostat 823 Fall 2022 Deep Neural Networks
  • If column names for joining are the same, we can shorten with USING:
SELECT c.Name AS Course, c.Semester, l.Name AS Lesson
FROM Course AS c INNER JOIN Lesson AS l
     USING (Course_OID);

Joining multiple tables

  • Table joins can span more than 2 tables:
SELECT c.Name AS Course, substr(c.Semester,-4) AS Year,
       r.Name AS Room, l.Name AS Lesson, i.Name AS Instructor
FROM Course AS c INNER JOIN Lesson AS l USING (Course_OID)
     INNER JOIN Instructor AS i USING (Instructor_OID)
     INNER JOIN Room AS r USING (Room_OID);
15 records
Course Year Room Lesson Instructor
Biostat 823 2024 Hock 10089 Containerization Hilmar Lapp
Biostat 823 2024 Hock 10089 Relational Data Modeling Hilmar Lapp
Biostat 823 2024 Hock 10089 Python for AI/ML Hilmar Lapp
Biostat 823 2024 Hock 10089 Reproducible Worflows Hilmar Lapp
Biostat 823 2023 Hock 10089 Containerization Hilmar Lapp
Biostat 823 2023 Hock 10089 Relational Data Modeling Hilmar Lapp
Biostat 823 2022 Hock 10089 Databases and Graphs David Page
Biostat 823 2022 Hock 10089 NP-Completeness David Page
Biostat 823 2022 Hock 10089 Deep Neural Networks David Page
Biostat 823 2023 Hock 10089 General Linear Model Kouros Owzar
Biostat 823 2023 Hock 10089 Expectation-Maximization Kouros Owzar
Biostat 823 2023 Hock 10089 Applications of vector calculus Kouros Owzar
Biostat 823 2023 Hock 10089 Recurrent neural networks David Page
Biostat 823 2023 Hock 10089 Transformers David Page
Biostat 823 2023 Hock 10089 How ChatGPT works David Page

Joining tables: Outer join

  • Use a left or right outer join to include rows from the left or right side of a join that don’t match the join condition:
SELECT c.Name AS Course, c.Semester, r.Name AS Room
FROM Course AS c LEFT OUTER JOIN Room AS r USING (Room_OID);
4 records
Course Semester Room
Biostat 823 Fall 2022 Hock 10089
Biostat 823 Fall 2024 Hock 10089
Biostat 823 Fall 2021 NA
Biostat 823 Fall 2023 Hock 10089

Outer joins cont’d

  • Outer joins are not limited to NULLable columns.
    • For example, reporting all instructors, and courses for them
SELECT DISTINCT i.Name AS Instructor, c.Name AS Course, c.Semester, r.Name AS Room
FROM Instructor AS i LEFT OUTER JOIN Lesson AS l USING (Instructor_OID)
     LEFT OUTER JOIN Course AS c USING (Course_OID)
     LEFT OUTER JOIN Room AS r USING (Room_OID);
7 records
Instructor Course Semester Room
Chengxin Yang NA NA NA
Cliburn Chan Biostat 823 Fall 2021 NA
David Page Biostat 823 Fall 2022 Hock 10089
David Page Biostat 823 Fall 2023 Hock 10089
Hilmar Lapp Biostat 823 Fall 2024 Hock 10089
Hilmar Lapp Biostat 823 Fall 2023 Hock 10089
Kouros Owzar Biostat 823 Fall 2023 Hock 10089

Join order

  • Join order does not matter for inner joins, but does matter for whether a join must be outer and left or right.
SELECT DISTINCT i.Name AS Instructor, c.Name AS Course, c.Semester, r.Name AS Room
FROM Lesson AS l INNER JOIN Course AS c USING (Course_OID)
     LEFT OUTER JOIN Room AS r USING (Room_OID)
     RIGHT OUTER JOIN Instructor AS i USING (Instructor_OID)
7 records
Instructor Course Semester Room
Hilmar Lapp Biostat 823 Fall 2024 Hock 10089
Hilmar Lapp Biostat 823 Fall 2023 Hock 10089
David Page Biostat 823 Fall 2022 Hock 10089
Kouros Owzar Biostat 823 Fall 2023 Hock 10089
David Page Biostat 823 Fall 2023 Hock 10089
Cliburn Chan Biostat 823 Fall 2021 NA
Chengxin Yang NA NA NA
  • Note that multiple outer joins can seriously hamper performance.

Notes and other join constructs

  • INNER is optional (and the default), but it can help understanding a query
  • Additional, but rarely used join constructs include:
    • FULL OUTER JOIN: combines left and right outer join
    • NATURAL JOIN: joins by columns that have the same name, and reports only one of them
    • CROSS JOIN: creates a cartesian product

Existential subquery vs join

  • Some queries that can be answered joining related facts may be easier to define and faster to execute by querying for existence of a fact.

  • For example, which instructors are teaching in a given semester:

SELECT i.Name AS Instructor, i.Email
FROM Instructor AS i
WHERE EXISTS (
    SELECT 1 FROM Lesson AS l INNER JOIN Course c USING (Course_OID)
    WHERE c.Semester = 'Fall 2023'
    AND   l.Instructor_OID = i.Instructor_OID
)
3 records
Instructor Email
Hilmar Lapp h.lapp@duke.edu
Kouros Owzar k.owzar@duke.edu
David Page d.page@duke.edu

Existential subquery negation

  • For example, instructors who don’t teach in a given semester
SELECT i.Name AS Instructor, i.Email
FROM Instructor AS i
WHERE NOT EXISTS (
    SELECT 1 FROM Lesson AS l INNER JOIN Course c USING (Course_OID)
    WHERE c.Semester = 'Fall 2024'
    AND   l.Instructor_OID = i.Instructor_OID
)
4 records
Instructor Email
Kouros Owzar k.owzar@duke.edu
David Page d.page@duke.edu
Cliburn Chan c.chan@duke.edu
Chengxin Yang chengxin@duke.edu

Aggregating rows

  • Rows can be aggregated into groups by one or more columns of the same value, using aggregating functions:
    • Note that COUNT(<column>) and COUNT(*) have different semantics
SELECT i.Name AS Instructor, c.Name AS Course, c.Semester,
       COUNT(l.Name) AS '#Lessons',
       COUNT(*) AS '#RowsInGroup'
FROM Lesson AS l INNER JOIN Course AS c USING (Course_OID)
     RIGHT OUTER JOIN Instructor AS i USING (Instructor_OID)
GROUP BY i.Name, c.Name, c.Semester;
7 records
Instructor Course Semester #Lessons #RowsInGroup
Chengxin Yang NA NA 0 1
Cliburn Chan Biostat 823 Fall 2021 2 2
David Page Biostat 823 Fall 2022 3 3
David Page Biostat 823 Fall 2023 3 3
Hilmar Lapp Biostat 823 Fall 2023 2 2
Hilmar Lapp Biostat 823 Fall 2024 4 4
Kouros Owzar Biostat 823 Fall 2023 3 3

Grouping vs aggregating

  • SELECTed columns should be GROUPed or aggregated.
    • If neither, the value is from a randomly chosen member of the group.
SELECT i.Name AS Instructor, c.Name AS Course,
       MAX(c.Semester) AS LastSemester, -- choose one non-NULL value from group
       COUNT(l.Name) AS '#Lessons'
FROM Lesson AS l INNER JOIN Course AS c USING (Course_OID)
     RIGHT OUTER JOIN Instructor AS i USING (Instructor_OID)
GROUP BY i.Name, c.Name;
5 records
Instructor Course LastSemester #Lessons
Chengxin Yang NA NA 0
Cliburn Chan Biostat 823 Fall 2021 2
David Page Biostat 823 Fall 2023 6
Hilmar Lapp Biostat 823 Fall 2024 6
Kouros Owzar Biostat 823 Fall 2023 3

Restricting aggregate groups I

  • We can restrict which aggregate groups match based on conditions using aggregate function(s):
SELECT i.Name AS Instructor, c.Name AS Course, c.Semester,
       COUNT(l.Name) AS '#Lessons'
FROM Lesson AS l INNER JOIN Course AS c USING (Course_OID)
     RIGHT OUTER JOIN Instructor AS i USING (Instructor_OID)
GROUP BY i.Name, c.Name, c.Semester
HAVING COUNT(l.Name) > 0
6 records
Instructor Course Semester #Lessons
Cliburn Chan Biostat 823 Fall 2021 2
David Page Biostat 823 Fall 2022 3
David Page Biostat 823 Fall 2023 3
Hilmar Lapp Biostat 823 Fall 2023 2
Hilmar Lapp Biostat 823 Fall 2024 4
Kouros Owzar Biostat 823 Fall 2023 3

Restricting aggregate groups II

  • SELECT and HAVING clauses can use different aggregate functions:
SELECT i.Name AS Instructor, c.Name AS Course, c.Semester,
       GROUP_CONCAT(l.Name,', ') AS Lessons
FROM Lesson AS l INNER JOIN Course AS c USING (Course_OID)
     RIGHT OUTER JOIN Instructor AS i USING (Instructor_OID)
GROUP BY i.Name, c.Name, c.Semester
HAVING COUNT(l.Name) > 0
6 records
Instructor Course Semester Lessons
Cliburn Chan Biostat 823 Fall 2021 Python Programming, Python Visualization
David Page Biostat 823 Fall 2022 Databases and Graphs, NP-Completeness, Deep Neural Networks
David Page Biostat 823 Fall 2023 Recurrent neural networks, Transformers, How ChatGPT works
Hilmar Lapp Biostat 823 Fall 2023 Containerization, Relational Data Modeling
Hilmar Lapp Biostat 823 Fall 2024 Containerization, Relational Data Modeling, Python for AI/ML, Reproducible Worflows
Kouros Owzar Biostat 823 Fall 2023 General Linear Model, Expectation-Maximization, Applications of vector calculus

Restricting aggregate groups III

  • Queries can use both row conditions and aggregate group conditions
SELECT i.Name AS Instructor, c.Name AS Course, c.Semester,
       COUNT(l.Name) AS '#Lessons'
FROM Lesson AS l INNER JOIN Course AS c USING (Course_OID)
     RIGHT OUTER JOIN Instructor AS i USING (Instructor_OID)
WHERE c.Semester = 'Fall 2022'
GROUP BY i.Name, c.Name, c.Semester
HAVING COUNT(l.Name) > 0
1 records
Instructor Course Semester #Lessons
David Page Biostat 823 Fall 2022 3

Views

  • Database views are predefined queries
    • Their definition is part of DDL
  • Views are an API to the E-R model
    • Denormalized “business objects” as opposed to relational entities
    • Business objects rarely change, as opposed to how they are normalized in a relational model
  • Views can be treated like tables for querying (DQL)
    • But not for DML

Creating views

  • We create views simply based on a query, which can be simple or complex
CREATE VIEW Instructor_Courses AS
SELECT i.Instructor_OID,
       MAX(i.Name) AS Instructor,
       MAX(c.Name) AS Course, MAX(c.Semester) AS Semester,
       GROUP_CONCAT(l.Name, ', ') AS Lessons,
       COUNT(l.Name) AS 'NumLessons',
       MAX(c.Room_OID) AS Room_OID
FROM Lesson AS l INNER JOIN Course AS c USING (Course_OID)
     RIGHT OUTER JOIN Instructor AS i USING (Instructor_OID)
GROUP BY i.Instructor_OID, c.Course_OID;
  • By default, the column names of a view are determined by those of the query

Querying views

SELECT Instructor, Course, Semester, Lessons, NumLessons FROM Instructor_Courses;
7 records
Instructor Course Semester Lessons NumLessons
Hilmar Lapp Biostat 823 Fall 2024 Containerization, Relational Data Modeling, Python for AI/ML, Reproducible Worflows 4
Hilmar Lapp Biostat 823 Fall 2023 Containerization, Relational Data Modeling 2
Kouros Owzar Biostat 823 Fall 2023 General Linear Model, Expectation-Maximization, Applications of vector calculus 3
David Page Biostat 823 Fall 2022 Databases and Graphs, NP-Completeness, Deep Neural Networks 3
David Page Biostat 823 Fall 2023 Recurrent neural networks, Transformers, How ChatGPT works 3
Cliburn Chan Biostat 823 Fall 2021 Python Programming, Python Visualization 2
Chengxin Yang NA NA NA 0

Views can be filtered, aggregated etc

  • For example, report the average number of lessons per course semester
SELECT Course, Semester, ROUND(AVG(NumLessons), 2) AS 'mean(#Lessons)'
FROM Instructor_Courses
WHERE NumLessons > 0
GROUP BY Course, Semester
4 records
Course Semester mean(#Lessons)
Biostat 823 Fall 2021 2.00
Biostat 823 Fall 2022 3.00
Biostat 823 Fall 2023 2.67
Biostat 823 Fall 2024 4.00

Views can be joined

  • As APIs, it is useful to include surrogate primary keys of participating objects in a view, which allows joining them for auxiliary facts
SELECT Instructor, Course, Semester, r.Name AS Room
FROM Instructor_Courses 
     LEFT OUTER JOIN Room AS r USING (Room_OID)
WHERE Course IS NOT NULL
6 records
Instructor Course Semester Room
Hilmar Lapp Biostat 823 Fall 2024 Hock 10089
Hilmar Lapp Biostat 823 Fall 2023 Hock 10089
Kouros Owzar Biostat 823 Fall 2023 Hock 10089
David Page Biostat 823 Fall 2022 Hock 10089
David Page Biostat 823 Fall 2023 Hock 10089
Cliburn Chan Biostat 823 Fall 2021 NA
  • Views can also be joined with another view.

Further notes and resources

  • Use ORDER BY <col1>[, <col2>, ...] as the last clause of a query to control the order of rows
    • append DESC for descending order
    • note that ordering can have a performance cost
  • In most (OO) programming language environments, SQL database interaction is handled under the hood by ORM (Object-Relational Mapping) libraries
  • Some popular data science packages for R and Python have SQL database interfaces
  • SQLite Documentation
  • Joe Celko, Joe Celko’s SQL for Smarties: Advanced SQL Programming. Morgan & Kaufmann, 2015