jmtd → log → type design → equivalence problems with StreamGraph
I've been tackling an equivalence problem with rewritten programs in StrIoT, our proof-of-concept stream-processing system.
The StrIoT Logical Optimiser applies a set of rewrite rules to a
stream-processing program, generating a set of variants that can be reasoned
about, ranked, and deployed. The problem I've been tackling is that a variant
may appear to be semantically equivalent to another, but compare (with ==
)
as distinct.
The issue relates to the design of our data-type representing programs, in
particular, a consequence of our choice to outsource the structural aspect to a
3rd-party library (Algebra.Graph
). The Graph library deems nodes that compare
as equivalent (again with ==
) to be the same. Since a stream-processing
program may contain many operators which are equivalent, but distinct, we
needed to add a field to our payload type to differentiate them: so we opted
for an Integer field, vertexId
(something I've described as a "wart"
elsewhere)
Here's a simplified example of our existing payload type, StreamVertex
:
data StreamVertex = StreamVertex
{ vertexId :: Int
, operator :: StreamOperator
, parameters :: [ExpQ]
, intype :: String
, outtype :: String
}
A rewrite rule might introduce or eliminate operators from a stream-processing program. For example, consider the rule which "hoists" a filter upstream from a merge operator. In pseudo-Haskell,
streamFilter p . streamMerge [a , b, ...]
=>
streamMerge [ streamFilter p a
, streamFilter p b
, ...]
The original streamFilter
is removed, and new streamFilter
s are introduced,
one per stream arriving at streamMerge
. In general, rules may need to
synthesise new operators, and thus new vertexId
s.
Another rewrite rule might perform the reverse operation. But the individual
rules operate in isolation: and so, the program variant that results after
applying a rule and then applying an inverse rule may not have the same
vertexId
s, or the same order of vertexId
s, as the original program.
I thought of the outline of two possible solutions to this.
"well-numbered" StreamGraph
s
The first was to encode (and enforce) some rules about how vertexId
s are used.
If they always began from (say) 1, and were strictly-ascending from the
source operator(s), and rewrite rules guaranteed that a "well numbered" input
would be "well numbered" after rewriting, this would be sufficient to rule out
a rewritten-but-semantically-equivalent program being considered distinct.
The trouble with this approach is using properties of a numerical system
built around vertexId
as a stand-in for the real structural problem. I
was not sure I could prove both that the stand-in system was sound and
that it was a proper analogue for the underlying structural issue.
It feels to me more that the choice to use an external library to encode the structure of a stream-processing program was the issue: the structure itself is a fundamental part of the semantics of the program. What if we had encoded the structure of programs within the same data-type?
alternative data-type
StrIoT programs are trees. The root is the sink node: there is always exactly one. There can be multiple source (leaf nodes), but they always converge. Operators can have multiple inputs (including zero). The root node has no output, but all other operators have exactly one.
I explored transforming StreamVertex
into a tree by adding a field
representing incoming streams, and dispensing with Graph
and vertexId
.
Something like this
data StreamProg = StreamProg StreamOperator [Exp] String String [StreamProg]
A uni-directional transformation from Graph StreamVertex
to StreamProg
is all that's needed to implement something like ==
, so we don't need
to keep track of vertexId
mappings. Unfortunately, we can't fix the
actual Eq (Graph StreamVertex)
implementation this way: it delegates
to Eq StreamVertex
, and we just don't have enough information to fix
the problem at that level. But, we can write a separate graphEq
and
use that instead where we need to.
could I go further?
Spoiler: I haven't. But I've been sorely tempted.
We still have a separate StreamOperator
type, which it would be nice to fold
in; and we still have to use
a list around the incoming nodes, since different operators accept different
numbers of incoming streams. It would be better to encode the correct valences
in the type.
In 2020 I explored iteratively reducing the StreamVertex
data-type to try and
get it as close as possible to the ideal end-user API: simple functions. I
wrote about one step along that path in Template Haskell and
Stream-processing programs, but concluded that,
since this was not my main PhD focus, I wouldn't go further. But it was
nagging at my subconcious ever since.
I allowed myself a couple of days exploring some advanced concepts including typed Template Haskell (that has had some developments since 2020), generalised abstract data types (GADTs) and more generic programming to see what could be achieved.
I'll summarise all that in the next blog post.
Comments
Have you considered viewing your transformations as an expression and use term rewriting to normalize it, e.g. with Hegg? There is a bunch of theory around term rewriting and normalization to lean on, e.g https://zubairabid.com/Semester7/subjects/PoPL/books/TRaAT.pdf
Good luck!