Aggregate functions in SQL, ALTER clauses in SQLite

Today I decided to give SQLite a try. I’m working on a nice directed graph (with 80,000+ nodes and 500,000+ edges) given by its edge(origin, dest) adjacency relation, and I needed to compute a degrees(node, indeg, outdeg) relation aggregating each node’s in- and out-degree. This is the kind of things than can be done easily within the database: here is how.

How SELECT works

Here are general considerations on SELECT clauses (i.e. valid with any language of the SQL family). The basic structure of a SELECT statement is as follows:

SELECT [result-set] FROM [input data] WHERE [conditions]

The result-set is a list of expressions that will be evaluated on the rows of the input data. Expressions can be attribute names or aggregate functions such as MIN, MAX, SUM or COUNT. These aggregate functions work together with the GROUP BY clause. Here is how the output of a SELECT statement is computed:

If no aggregate function is used, then each expression in the result-set is evaluated on each row of the input data (FROM part of the query), filtered by the WHERE clause.

If aggregate functions are used but there is no GROUP BY clause, then each aggregate expression in the result-set is evaluated across the entire dataset, while non-aggregate expressions are evaluated on an arbitrary row. Such a SELECT statement always returns a single row (if there are no rows in the input data, non-aggregate expressions are set to NULL and aggregate expressions are set to their default value).

If aggregate functions are used together with a GROUP BY clause, then GROUP BY expressions are evaluated on each row of the dataset, and rows with the same values are grouped together. Each expression in the result-set is then evaluated once for each group of rows. (As before, aggregate expressions are evaluated on the whole group while non-aggregate expressions are evaluated on an arbitrary row.)

Computing the degrees table

Now let us focus on SQLite and our directed graph. Using the COUNT function, one can then compute in- and out- degrees from the edge relation as follows:

CREATE TABLE outrel AS SELECT origin AS id,
    COUNT(*) AS outdeg FROM edges
    GROUP BY origin;

CREATE TABLE inrel AS SELECT dest AS id,
    COUNT(*) AS indeg FROM edges
    GROUP BY dest;

Then, suffices to join them into a single table, e.g. as follows:

CREATE TABLE nodes AS SELECT origin AS id FROM edges
    UNION SELECT dest AS id FROM edges;

CREATE TABLE degrees AS SELECT nodes.id AS id,
    inrel.indeg AS indeg,
    outrel.outdeg AS outdeg FROM nodes,
    outrel,
    inrel WHERE nodes.id = outrel.id AND outrel.id = inrel.id;

Yet, if one tries this directly, computation time is likely to be excessively long. The reason for this is that our tables created from SELECT statements have no index (nor primary keys), so SQLite will use loop joins (i.e. joins with nested loops) and computation time is going to be O(N^3). One way to avoid this would be ALTER the tables so that the attribute id becomes a primary key. Yet, (what I think of as) a major limitation of SQLite arises here:

SQLite has limited ALTER TABLE support that you can use to add a column to the end of a table or to change the name of a table. If you want to make more complex changes in the structure of a table, you will have to recreate the table. You can save existing data to a temporary table, drop the old table, create the new table, then copy the data back in from the temporary table.

(From the SQLite FAQ.) So you can’t ALTER your table to set node identifiers as primary keys. An SQLite way to do this is to create indexes:

CREATE INDEX nodes_id ON nodes(id);
CREATE INDEX outrel_id ON outrel(id);
CREATE INDEX inrel_id ON inrel(id);

Another option is to create initially empty auxiliary tables (so you can specify where you want your primary key in the attributes list) and then fill them with an INSERT on the same SELECT statement, e.g.:

CREATE TABLE inrel(id INT PRIMARY KEY, indeg INT);
INSERT INTO inrel SELECT dest AS id, COUNT(*) AS indeg FROM edges
    GROUP BY dest;

Both ways reduce to a decent level the computation time for the CREATE instruction of the degrees table. As for me, though I instantly bumped into this limitation about the ALTER clause, I’ll spend some more time trying SQLite, to see how it can help me dissecting this nice little graph.

Pages of this website are under the CC-BY 4.0 license.