Category-theoretic view of SQL

David Spivak has given a nice, category-theoretic way of looking at certain elements of data analysis, and here I’m just going to spell out a few of the basic ideas for myself, comparing a handful of concrete SQL examples to the abstract machinery Spivak lays out. I might delve into more bits and pieces at a later date.

As I was going through his paper, I found myself thinking that these formulations of schemas and tables as category-theoretic structures could actually be practically useful, say in making more robust and manageable/trackable entity relationship models than what might exist out there right now. As it turns out, Spivak (along with others like Ryan Wisnesky) does indeed appear to be working on turning these ideas into usable tools, as evidenced by their Algebraic Query Language and a related commercial endeavor. It really seems like an interesting theoretical direction math-wise, and a promising possibility for use in actual data-analytic work.

The schema for a table is a specification of its column names, along with data type assignments for each column:

-- schema definition for a table 'my_table'
create table my_table (
   col1  integer
  ,col2  char(1)

Describing the schema mathematically, then, it consists of an ordered set C := \langle col1, col2\rangle of column names, along with a function \sigma : C \rightarrow \mathbf{DT} that assigns each column in the list to a SQL data type. (So we’re using \mathbf{DT} to denote the set of all data types available in SQL, like integer, char(1), varchar(27), etc.) To fix ideas, for the rest of this post I’ll assume that \mathbf{DT} actually consists of just the two types from our schema above: that we have only an integer type, and a single-character string type.

Now let’s start to get a little more abstract with our schema. We’ll let U be the infinite set of all objects that are of either type in \mathbf{DT}; that is, U is the union of the set \mathbb{Z} of integers, and the set of all single-character strings from our alphabet. Basically, U is the universe of all values that a cell of a data table in our world (with its limited \mathbf{DT} set) might be populated by.

Let’s let \pi : U \rightarrow \mathbf{DT} be the function that maps each object to its type, i.e. that maps each integer to the integer type and each single-character string to the char(1) type. Between our schema function \sigma and this type assignment \pi, we have the following picture:

\begin{CD} @. U \\ @. @V{\pi}VV \\ C @>{\sigma}>> \mathbf{DT} \end{CD}

To complete this picture a little bit, we want to add another idealization; we have the universe U of all values that could possibly go into a cell in some table, and we’d now like to craft the universe U_\sigma of all cell value assignments that could appear in a table that has schema \sigma. The full Cartesian product C\times U is overly inclusive, as we would have elements like \langle col2, 35\rangle, where an integer value has been erroneously assigned to col2. The right specification of the \sigma universe is

U_\sigma := C \times_\mathbf{DT} U = \{\langle c,u\rangle \in C\times U : \sigma(c)=\pi(u)\}

That is, it’s the most inclusive subset of C\times U that still respects the type assignments of the schema in questions, i.e. such that the following diagram commutes (where the unlabeled maps are the respective projection maps from U_\sigma to each of C and U):

\begin{CD} U_\sigma @>>> U \\ @VVV @V{\pi}VV \\ C @>{\sigma}>> \mathbf{DT} \end{CD}

Category-theoretically speaking, this U_\sigma is the pullback of \pi along \sigma.

Fixing the universal type assignment \pi, Spivak notes that with an appropriate notion of morphism between schemas \sigma : C \rightarrow \mathbf{DT} and \sigma' : C' \rightarrow \mathbf{DT}, we have a category of schemas of type \pi. Next up will be to build on this to form a category of data tables …