jmtd → log → template haskell → Template Haskell and Stream-processing programs

I've written about what Template Haskell is, and given an example of what it can be used for, it's time to explain why I was looking at it in the context of my PhD work.

Encoding stream-processing programs

StrIoT is an experimental distributed stream-processing system that myself and others are building in order to explore our research questions. A user of StrIoT writes a stream-processing program, using a set of 8 functional operators provided for the purpose. A simple example is

streamFn :: Stream Int -> Stream Int
streamFn = streamFilter (<15)
         . streamFilter (>5)
         . streamMap (*2)

Our system is distributed: we take a stream-processing program and partition it into sub-programs, which are distributed to and run on separate nodes (perhaps cloud instances, or embedded devices like Raspberry Pis etc.). In order to do that, we need to be able to manipulate the stream-processing program as data. We've initially opted for a graph data-structure, with the vertices in the graph defined as

data StreamVertex = StreamVertex
    { vertexId   :: Int
    , operator   :: StreamOperator
    , parameters :: [String]
    , intype     :: String
    , outtype    :: String
    } deriving (Eq,Show)

A stream-processing program encoded this way, equivalent to the first example

path [ StreamVertex 0 Map    ["(*2)"]  "Int" "Int"
     , StreamVertex 1 Filter ["(>5)"]  "Int" "Int"
     , StreamVertex 2 Filter ["(<15)"] "Int" "Int"
     ]

We can easily manipulate instances of such types, rewrite them, partition them and generate code from them. Unfortunately, this is quite a departure from the first simple code example from the perspective of a user writing their program.

Template Haskell gives us the ability to manipulate code as a data structure, and also to inspect names to gather information about them (their type, etc.). I started looking at TH to see if we could build something where the user-supplied program was as close to that first case as possible.

TH limitations

There are two reasons that we can't easily manipulate a stream-processing definition written as in the first example. The following expressions are equivalent, in some sense, but are not equal, and so yield completely different expression trees when quasi-quoted:

[| streamFilter (<15) . streamFilter (>5) . streamMap (*2) |]
| \s -> streamFilter (<15) (streamFilter (>5) (streamMap (*2) s)) |]
[| streamMap (*2) >>> streamFilter (>5) >>> streamFilter (<15) |]
[| \s -> s & streamMap (*2) & streamFilter (>5) & streamFilter (<15) |]
[| streamFn |] -- a named expression, defined outside the quasi-quotes

In theory, reify can give you the definition of a function from its name, but in practice it doesn't, because this was never implemented. So at the very least we would need to insist that a user included the entirety of a stream-processing program within quasi-quotes, and not split it up into separate bits, with some bits defined outside the quotes and references within (as in the last case above). We would probably have to insist on a consistent approach for composing operators together, such as always use (.) and never >>>, &, etc. which is limiting.

Incremental approach

After a while ruminating on this, and before moving onto something else, I thought I'd try approaching it from the other side. Could I introduce some TH into the existing approach, and improve it? The first thing I've tried is to change the parameters field to TH's ExpQ, meaning the map instance example above would be

StreamVertex 0 Map [ [| (*2) |] ] "Int" "Int"

I worked this through. It's an incremental improvement ease and clarity for the user writing a stream-processing program. It catches a class of programming bugs that would otherwise slip through: the expressions in the brackets have to be syntactically valid (although they aren't type checked). Some of the StrIoT internals are also much improved, particularly the logical operator. Here's an excerpt from a rewrite rule that involves composing code embedded in strings, dealing with all the escaping rules and hoping we've accounted for all possible incoming expression encodings:

let f' = "(let f = ("++f++"); p = ("++p++"); g = ("++g++") in\
         \ \\ (a,b) v -> (f a v, if p v a then g b v else b))"
    a' = "("++a++","++b++")"
    q' = "(let p = ("++p++"); q = ("++q++") in \\v (y,z) -> p v y && q v z)"

And the same section after, manipulating ExpQ types:

let f' = [| \ (a,b) v -> ($(f) a v, if $(p) v a then $(g) b v else b) |]
    a' = [| ($(a), $(b)) |]
    q' = [| \v (y,z) -> $(p) v y && $(q) v z |]

I think the code-generation part of StrIoT could be radically refactored to take advantage of this change but I have not made huge inroads into that.

Next steps

This is, probably, where I am going to stop. This work is very interesting to me but not the main thrust of my research. But incrementally improving the representation gave me some ideas of what I could try next:

intype and outtype could be TH Types instead of Strings. This would catch some simple problems like typos, etc., but we could possibly go further, and
remove the explicit in-and-out-types and infer their values from the parameters field, as its an expression with some type that should match
parameters is a list, because the different stream operators have different arities. streamFilter has one parameter (the filter predicate), so the list should have one element in that case, but streamExpand has none, so it should be empty. We could collapse this to a single ExpQ, which encoded however many parameters are necessary, either in an internal list, or…
the operator field could be merged in too, so that the parameters expression was actually a call to the relevant operator with its parameters supplied.

The type would have collapsed down to

data StreamVertex = StreamVertex
    { vertexId   :: Int
    , opAndParams :: ExpQ
    } deriving (Eq,Show)

Example instances might be

StreamVertex 0 [| streamMap (*2) |]
StreamVertex 1 [| streamExpand |]
StreamVertex 2 [| streamScan (\c _ -> c+1) 0 |]

The vertexId field is a bit of wart, but we require that due to the graph data structure that we are using. A change there could eliminate it, too. By this point we are not that far away from where we started, and certainly much closer to the "pure" function application in the very first example.

Comments

✏ comment on this page (Help on commenting)