David Spivak has given a nice, category-theoretic way of looking at certain elements of data analysis, and here I’m just going to spell out a few of the basic ideas for myself, comparing a handful of concrete SQL examples to the abstract machinery Spivak lays out. I might delve into more bits and pieces at a later date.
As I was going through his paper, I found myself thinking that these formulations of schemas and tables as category-theoretic structures could actually be practically useful, say in making more robust and manageable/trackable entity relationship models than what might exist out there right now. As it turns out, Spivak (along with others like Ryan Wisnesky) does indeed appear to be working on turning these ideas into usable tools, as evidenced by their Algebraic Query Language and a related commercial endeavor. It really seems like an interesting theoretical direction math-wise, and a promising possibility for use in actual data-analytic work.
The schema for a table is a specification of its column names, along with data type assignments for each column:
-- schema definition for a table 'my_table' create table my_table ( col1 integer ,col2 char(1) ) ;
Describing the schema mathematically, then, it consists of an ordered set of column names, along with a function
that assigns each column in the list to a SQL data type. (So we’re using
to denote the set of all data types available in SQL, like
integer
, char(1)
, varchar(27)
, etc.) To fix ideas, for the rest of this post I’ll assume that actually consists of just the two types from our schema above: that we have only an integer type, and a single-character string type.
Now let’s start to get a little more abstract with our schema. We’ll let be the infinite set of all objects that are of either type in
; that is,
is the union of the set
of integers, and the set of all single-character strings from our alphabet. Basically,
is the universe of all values that a cell of a data table in our world (with its limited
set) might be populated by.
Let’s let be the function that maps each object to its type, i.e. that maps each integer to the
integer
type and each single-character string to the char(1)
type. Between our schema function and this type assignment
, we have the following picture:
To complete this picture a little bit, we want to add another idealization; we have the universe of all values that could possibly go into a cell in some table, and we’d now like to craft the universe
of all cell value assignments that could appear in a table that has schema
. The full Cartesian product
is overly inclusive, as we would have elements like
, where an integer value has been erroneously assigned to
col2
. The right specification of the universe is
That is, it’s the most inclusive subset of that still respects the type assignments of the schema in questions, i.e. such that the following diagram commutes (where the unlabeled maps are the respective projection maps from
to each of
and
):
Category-theoretically speaking, this is the pullback of
along
.
Fixing the universal type assignment , Spivak notes that with an appropriate notion of morphism between schemas
and
, we have a category of schemas of type
. Next up will be to build on this to form a category of data tables …