jmtd → log → data-types for representing stream-processing programs

This year I want to write much more about my PhD work on my blog, and here's my first effort. Most of this material has been languishing as a draft for over a year, so it's past time to get it out!

1 + 2

As part of my PhD work, I've been looking at data structures for representing stream-processing programs. The intention for our system is to take a user-supplied stream-processing program, rewrite it in order to alter its behaviour and partition it up into sub-programs which could be deployed and executed on different computers, connected together via TCP/IP.

1 * 2

To help familiarise myself with the existing system, when I started working on this I begun to explore different ways of representing both a stream-processing program and a set of interconnected, partitioned programs. Graph data structures seem like a natural fit for these, with stream-processing programs as a graph and interconnected programs as a graph-of-graphs¹.

1 * (2 + 3)

There are a number of different graph libraries for Haskell. The most common approach they use for representation is "tabular": lists of edges as pairs of vertices, or similar. This isn't the only approach. One of the older, more established libraries — fgl — uses inductive types. But the one I have initially settled on is Algebra.Graph, which defines an algebra of graphs with which you can construct your instances².

The USP for Algebra.Graph is that the four provided constructors are all total functions, so certain types of invalid graph are impossible to represent with the library (such as those where an edge does not point to a vertex).

The four basic constructors are³:

Vertex x, a single vertex, containing x
Overlay x y, which overlays one graph upon another
Connect x y, which connects all the vertices from Graph x to all of the vertices in Graph y.
Empty, for an empty graph

The Graph type implements the Num type-class, so Overlay can be abbreviated to + and connect to *. I've included some example graph definitions, encoded using + and * for brevity, and images of their corresponding renderings within this blog post.

I didn't perform an exhaustive search — nor evaluation — of all the available graph libraries. There's no definitive "right" answer to the question of which to choose: the graphs I will be dealing with are relatively small, so raw performance is not a major consideration.

So, what does a stream-processing program look like, encoded in this way? Here's a real example of a simple 5-node path graph (from here), simplified a little for clarity:

λ> foldg Empty (Vertex . vertexId) Overlay Connect graph
Overlay (Connect (Vertex 1) (Vertex 2)) (Overlay (Connect (Vertex 2)
(Vertex 3)) (Overlay (Connect (Vertex 3) (Vertex 4)) (Connect (Vertex 4)
(Vertex 5))))

Rendering it graphically is more clear:

simple 5-node stream graph

Graphs are not the only data-type that could be used, of course. I've started out using a graph representation in order to bootstrap the implementation and get further along with a proof-of-concept, but there are shortcomings that might be addressed by other approaches. I'll write more about those in another blog post.↩
By coincidence, Andrey Mokhov, the author of Algebra.Graph was a Senior Lecturer at Newcastle University, where I am a student, and was also co-author of a draft paper that was responsible for me getting interested in pursuing this work in the first place. Later, Andrey briefly became my second supervisor, but has now moved on to work for Jane Street. He remains a visiting fellow at Newcastle.↩
Different variants of the grammar can vary these constructors to achieve different results. For example, you can forbid empty graphs by removing the Empty constructor. An adjustment to the types is made to support edge-labelling.↩