Biostat 823 - Recursive data structures in SQL

Hilmar Lapp

Duke University, Department of Biostatistics & Bioinformatics

2024-09-17

Recursive data structure

  • Trees and more generally Directed Acyclic Graphs (DAGs) are commonly occurring recursive data structures
    • Subject classification systems, hierarchical codes, etc
    • Biological taxonomies (species classifications)
    • Ancestry trees, phylogenetic trees
    • Ontologies

Example: Tree (a special case of DAG)

Edges = Child to parent node
erDiagram
    Node |o--o{ Node : "parent node"
    Node {
      integer ID PK
      string  Label
      integer Parent_ID FK
    }

Relational model: Adjacency table using self-referential FK (Root node is identified by Parent_ID IS NULL.)

Recursion is through relationship

  • Define x ancestor of y as \(anc(x,y) := \{ parent(x, y)\ \cup\ parent(x, anc(z, y)) \}\)
    • Node \(x\) is an ancestor of node \(y\) iff \(x\) is a parent of \(y\), or if \(x\) is a parent of \(z\) and \(z\) is an ancestor of \(y\).
  • Examples of queries requiring recursion:
    • List all ancestors of node \(x\).
    • List all nodes for which node \(x\) is an ancestor.
    • List the common (or distinct) ancestors of nodes \(x\) and \(y\).
    • Obtain the most recent common ancestor (MRCA) of nodes \(x\) and \(y\) (ancestor with shortest distance).

Options for enabling recursive queries

  1. Create and then query transitive closure (“path”) table
  2. Implement and then query nested sets enumeration (only for trees, not graphs)
  3. Use recursive SQL SELECT query (if RDBMS supports it)

Transitive closure E-R model

erDiagram
    Node |o--o{ Node : "parent node"
    Node ||--o{ Node_Path : "ancestor"
    Node ||--o{ Node_Path : "node"
    Node {
      integer ID PK
      string  Label
      integer Parent_ID FK
    }
    Node_Path {
      integer Node_ID FK
      integer Ancestor_ID FK
      integer Path_Length
    }

E-R Diagram of adjacency table with path table

Transitive closure computation (I)

erDiagram
    Node |o--o{ Node : "parent node"
    Node ||--o{ Node_Path : "ancestor"
    Node ||--o{ Node_Path : "node"
    Node {
      integer ID PK
      string  Label
      integer Parent_ID FK
    }
    Node_Path {
      integer Node_ID FK
      integer Ancestor_ID FK
      integer Path_Length
    }

E-R Diagram of adjacency table with path table

Algorithm:

  1. Add \(anc(x,y,1)\ \\\forall\ (x,y) \in \{ parent(x,y) \}\)
  2. Let l = 1
    1. Add \(anc(x,y,l+1)\ \\\forall (x,y) \in \{anc(z,y,l) \land parent(x,z)\}\)
    2. \(l = l + 1\)
    3. Repeat (a) unless no tuples added.

Transitive closure computation (II)

erDiagram
    Node |o--o{ Node : "parent node"
    Node ||--o{ Node_Path : "ancestor"
    Node ||--o{ Node_Path : "node"
    Node {
      integer ID PK
      string  Label
      integer Parent_ID FK
    }
    Node_Path {
      integer Node_ID FK
      integer Ancestor_ID FK
      integer Path_Length
    }

E-R Diagram of adjacency table with path table

Transitive closure computation in SQL:

INSERT INTO Node_Path
      (Node_ID, Ancestor_ID, Path_Length)
SELECT ID, Parent_ID, 1 FROM Node
WHERE Parent_ID IS NOT NULL;
-- Repeat until no more rows inserted
INSERT INTO Node_Path
       (Node_ID, Ancestor_ID, Path_Length)
SELECT p.Node_ID, n.Parent_ID, p.Path_Length+1
FROM Node_Path p 
     JOIN Node n ON (p.Ancestor_ID = n.ID)
WHERE n.Parent_ID IS NOT NULL
AND NOT EXISTS (
  SELECT 1 FROM Node_Path pp
  WHERE pp.Node_ID = p.Node_ID
  AND pp.Ancestor_ID = n.Parent_ID
  -- Note that for graphs we would also have to
  -- test path length!
)

Transitive closure visualized

Node relationships added by transitive closure in blue

Nested Set Enumeration

We add two attributes to a Node, a left and a right number:

Same tree as before, but nodes with left and right (integer) values added

Nested Set Computation (I)

The left and right nested set values are computed using recursive depth-first traversal of the nodes in the tree structure:

def nestedSet(node, nestedNumber):
  node.left = nestedNumber
  for child in node.children():
    nestedNumber = nestedSet(child, nestedNumber + 1)
  node.right = nestedNumber + 1
  return node.right

nestedSet(rootNode, 1)

Nested Set Computation (II)

Tree with computed Nested Set values

Recursive SQL SELECT

  • Oracle introduced a non-standard CONNECT BY clause in the 1980s
  • Common Table Expressions (CTE) enabling recursive queries introduced in the SQL:1999 standard
    • In essence, CTEs can be seen as temporary named result sets
    • Supported by many popular RDBMSs (including MySQL, PostgreSQL, Sqlite) in their more modern versions

Recursive CTE example: factorial()

WITH RECURSIVE factorials (n, factorial) AS (
       SELECT 0, 1 -- Initial Subquery
       UNION ALL
       -- Recursive Subquery
       SELECT n+1, (n+1)*factorial
       FROM   factorials
       WHERE  n < 10
)
SELECT * FROM factorials;
Displaying records 1 - 10
n factorial
0 1
1 1
2 2
3 6
4 24
5 120
6 720
7 5040
8 40320
9 362880

Recursive CTE example: fibonacci()

WITH RECURSIVE fibonacci (n, fib_n, fib_n1) AS (
       SELECT 1, 1, 0 -- Initial Subquery
       UNION ALL
       -- Recursive Subquery
       SELECT n+1, fib_n+fib_n1, fib_n
       FROM   fibonacci
       WHERE  n < 10
)
SELECT n, fib_n as "fib(n)", fib_n1 as "fib(n-1)" FROM fibonacci;
Displaying records 1 - 10
n fib(n) fib(n-1)
1 1 0
2 1 1
3 2 1
4 3 2
5 5 3
6 8 5
7 13 8
8 21 13
9 34 21
10 55 34

Transitive closure using CTE (I)

WITH RECURSIVE Node_Path (Node_ID, Ancestor_ID, Path_Len) AS (
       SELECT ID, Parent_ID, 1 -- Initial Subquery
       FROM   Node
       WHERE  Name = 'Homo sapiens'
       UNION ALL
       -- Recursive Subquery
       SELECT Node_ID, p.Parent_ID, Path_Len + 1
       FROM   Node AS p JOIN Node_Path ON (p.ID = Node_Path.Ancestor_ID)
)
SELECT n.ID, n.Name, a.ID, a.Name, np.Path_Len
FROM Node_Path np JOIN Node AS n ON (np.Node_ID = n.ID)
     JOIN Node AS a ON (np.Ancestor_ID = a.ID)
ORDER BY np.Path_Len;
29 records
ID Name ID Name Path_Len
9606 Homo sapiens 9605 Homo 1
9606 Homo sapiens 207598 Homininae 2
9606 Homo sapiens 9604 Hominidae 3
9606 Homo sapiens 314295 Hominoidea 4
9606 Homo sapiens 9526 Catarrhini 5
9606 Homo sapiens 314293 Simiiformes 6
9606 Homo sapiens 376913 Haplorrhini 7
9606 Homo sapiens 9443 Primates 8
9606 Homo sapiens 314146 Euarchontoglires 9
9606 Homo sapiens 1437010 Boreoeutheria 10
9606 Homo sapiens 9347 Eutheria 11
9606 Homo sapiens 32525 Theria 12
9606 Homo sapiens 40674 Mammalia 13
9606 Homo sapiens 32524 Amniota 14
9606 Homo sapiens 32523 Tetrapoda 15
9606 Homo sapiens 1338369 Dipnotetrapodomorpha 16
9606 Homo sapiens 8287 Sarcopterygii 17
9606 Homo sapiens 117571 Euteleostomi 18
9606 Homo sapiens 117570 Teleostomi 19
9606 Homo sapiens 7776 Gnathostomata 20
9606 Homo sapiens 7742 Vertebrata 21
9606 Homo sapiens 89593 Craniata 22
9606 Homo sapiens 7711 Chordata 23
9606 Homo sapiens 33511 Deuterostomia 24
9606 Homo sapiens 33213 Bilateria 25
9606 Homo sapiens 6072 Eumetazoa 26
9606 Homo sapiens 33208 Metazoa 27
9606 Homo sapiens 33154 Opisthokonta 28
9606 Homo sapiens 2759 Eukaryota 29

Transitive closure using CTE (II)

WITH RECURSIVE Node_Path (Node_ID, Ancestor_ID, Path_Len) AS (
       SELECT ID, Parent_ID, 1 -- Initial Subquery
       FROM   Node
       WHERE  Name IN ('Homo sapiens', 'Gallus gallus')
       UNION ALL
       -- Recursive Subquery
       SELECT Node_ID, p.Parent_ID, Path_Len + 1
       FROM   Node AS p JOIN Node_Path ON (p.ID = Node_Path.Ancestor_ID)
)
SELECT a.ID, a.Name, np1.Path_Len, np2.Path_Len
FROM Node_Path AS np1 JOIN Node AS a ON (np1.Ancestor_ID = a.ID)
     JOIN Node_Path AS np2 ON (np1.Ancestor_ID = np2.Ancestor_ID)
WHERE np1.Node_ID < np2.Node_ID;
16 records
ID Name Path_Len Path_Len
32524 Amniota 16 14
32523 Tetrapoda 17 15
1338369 Dipnotetrapodomorpha 18 16
8287 Sarcopterygii 19 17
117571 Euteleostomi 20 18
117570 Teleostomi 21 19
7776 Gnathostomata 22 20
7742 Vertebrata 23 21
89593 Craniata 24 22
7711 Chordata 25 23
33511 Deuterostomia 26 24
33213 Bilateria 27 25
6072 Eumetazoa 28 26
33208 Metazoa 29 27
33154 Opisthokonta 30 28
2759 Eukaryota 31 29

Transitive closure using CTE (III)

WITH RECURSIVE Node_Path (Node_ID, Ancestor_ID, Path_Len) AS (
       SELECT ID, Parent_ID, 1 -- Initial Subquery
       FROM   Node
       WHERE  Name IN ('Homo sapiens', 'Gallus gallus')
       UNION ALL
       -- Recursive Subquery
       SELECT Node_ID, p.Parent_ID, Path_Len + 1
       FROM   Node AS p JOIN Node_Path ON (p.ID = Node_Path.Ancestor_ID)
       WHERE  p.Parent_ID IS NOT NULL
)
SELECT a.ID, a.Name, group_concat(n.Name,', ') AS Common_Ancestor_Of, 
       MIN(np.Path_Len) AS Min_Path_Len
FROM Node_Path AS np
     JOIN Node AS a ON (np.Ancestor_ID = a.ID)
     JOIN Node AS n ON (np.Node_ID = n.ID)
GROUP BY a.ID, a.Name
HAVING COUNT(n.Name) > 1
ORDER BY Min_Path_Len;
16 records
ID Name Common_Ancestor_Of Min_Path_Len
32524 Amniota Homo sapiens, Gallus gallus 14
32523 Tetrapoda Homo sapiens, Gallus gallus 15
1338369 Dipnotetrapodomorpha Homo sapiens, Gallus gallus 16
8287 Sarcopterygii Homo sapiens, Gallus gallus 17
117571 Euteleostomi Homo sapiens, Gallus gallus 18
117570 Teleostomi Homo sapiens, Gallus gallus 19
7776 Gnathostomata Homo sapiens, Gallus gallus 20
7742 Vertebrata Homo sapiens, Gallus gallus 21
89593 Craniata Homo sapiens, Gallus gallus 22
7711 Chordata Homo sapiens, Gallus gallus 23
33511 Deuterostomia Homo sapiens, Gallus gallus 24
33213 Bilateria Homo sapiens, Gallus gallus 25
6072 Eumetazoa Homo sapiens, Gallus gallus 26
33208 Metazoa Homo sapiens, Gallus gallus 27
33154 Opisthokonta Homo sapiens, Gallus gallus 28
2759 Eukaryota Homo sapiens, Gallus gallus 29

Directed Acyclic Graph (DAG)

Edges = directed from “start” to “end”

Edges can have a label or type

Directed Graph as relational model

erDiagram
    Node ||--o{ Edge : "edge start"
    Node ||--o{ Edge : "edge end"
    Node {
      integer ID PK
      string  Label
    }
    Edge {
      integer startNode FK
      string Label
      integer endNode FK
    }

Relational model of a Directed Graph: Adjacency table connecting nodes to each other. (Note that acyclic property of graph cannot be enforced by a relational database.)

erDiagram
    Node ||--o{ Edge : "edge start"
    Node ||--o{ Edge : "edge end"
    Node ||--o{ Edge : "edge type"
    Node {
      integer ID PK
      string  Label
    }
    Edge {
      integer startNode FK
      integer type FK
      integer endNode FK
    }

Relational model of a Directed Graph with edge type normalized as a node entity. (This could also be a separate table.)

Paths through directed graph

  • Define recursively: \(path(x,y,t,l) := \{ edge(x,y,t) \cup\ edge(path(x,z,t,l-1), y, t) \}\)
    • There is a path from \(x\) to \(y\) of type \(t\) and length \(l\) iff there is an edge from \(x\) to \(y\) of type \(t\), or if there is path from \(x\) to \(z\) of type \(t\) and length \(l-1\) and an egde of type \(t\) from \(z\) to \(y\).
  • The set of all paths is called the transitive closure.

Transitive closure for directed graph

erDiagram
%%| fig-cap: "Relational model of a Directed Graph with a path table for transitive closure."
    Node ||--o{ Edge : "edge start"
    Node ||--o{ Edge : "edge end"
    Node ||--o{ Edge : "edge type"
    Node ||--o{ Path : "path start"
    Node ||--o{ Path : "path end"
    Node ||--o{ Path : "path type"
    Node {
      integer ID PK
      string  Label
      boolean isTransitive
    }
    Edge {
      integer startNode FK
      integer type FK
      integer endNode FK
    }
    Path {
      integer startNode FK
      integer type FK
      integer endNode FK
      integer length
    }