For my PhD, my colleagues/collaborators and I built a distributed stream-processing system using Haskell. There are several other Haskell stream-processing systems. How do they compare?

First, let's briefly discuss and define streaming in this context.

Structure and Interpretation of Computer Programs introduces Streams as an analogue of lists, to support delayed evaluation. In brief, the inductive list type (a list is either an empty list or a head element pre-pended to another list) is replaced with a structure with a head element and a promise which, when evaluated, will generate the tail (which in turn may have a head element and a promise to generate another tail, culminating in the equivalent of an empty list.) Later on SICP also covers lazy evaluation.

However, the streaming we're talking about originates in the relational community, rather than the functional one, and is subtly different. It's about building a pipeline of processing that receives and emits data but doesn't need to (indeed, cannot) reference the whole stream (which may be infinite) at once.

Haskell streaming systems

Now let's go over some Haskell streaming systems.

conduit (2011-)

Conduit is the oldest of the ones I am reviewing here, but I doubt it's the first in the Haskell ecosystem. If I've made any obvious omissions, please let me know!

Conduit provides a new set of types to model streaming data, and a completely new set of functions which are analogues of standard Prelude functions, e.g. sumC in place of sum. It provides its own combinator(s) such as .| ( aka fuse) which is like composition but reads left-to-right.

The motivation for this is to enable (near?) constant memory usage for processing large streams of data -- presumably versus using a list-based approach and to provide some determinism: the README gives the example of "promptly closing file handles". I think this is another way of saying that it uses strict evaluation, or at least avoids lazy evaluation for some things.

Conduit offers interleaved effects: which is to say, IO can be performed mid-stream.

Conduit supports distributed operation via Data.Conduit.Network in the conduit-extra package. Michael Snoyman, principal Conduit author, wrote up how to use it here: https://www.yesodweb.com/blog/2014/03/network-conduit-async To write a distributed Conduit application, the application programmer must manually determine the boundaries between the clients/servers and write specific code to connect them.

pipes (2012-)

The Pipes Tutorial contrasts itself with "Conventional Haskell stream programming": whether that means Conduit or something else, I don't know.

Paraphrasing their pitch: Effects, Streaming Composability: pick two. That's the situation they describe for stream programming prior to Pipes. They argue Pipes offers all three.

Pipes offers it's own combinators (which read left-to-right) and offers interleaved effects.

At this point I can't really see what fundamentally distinguishes Pipes from Conduit.

Pipes has some support for distributed operation via the sister library pipes-network. It looks like you must send and receive ByteStrings, which means rolling your own serialisation for other types. As with Conduit, to send or receive over a network, the application programmer must divide their program up into the sub-programs for each node, and add the necessary ingress/egress code.

io-streams (2013-)

io-streams emphasises simple primitives. Reading and writing is done under the IO Monad, thus, in an effectful (but non-pure) context. The presence or absence of further stream data are signalled by using the Maybe type (Just more data or Nothing: the producer has finished.)

It provides a library of functions that shadow the standard Prelude, such as S.fromList, S.mapM, etc.

It's not clear to me what the motivation for io-streams is, beyond providing a simple interface. There's no declaration of intent that I can find about (e.g.) constant-memory operation.

There's no mention of or support (that I can find) for distributed operation.

streaming (2015-)

Similar to io-streams, Streaming emphasises providing a simple interface that gels well with traditional Haskell methods. Streaming provides effectful streams (via a Monad -- any Monad?) and a collection of functions for manipulating streams which are designed to closely mimic standard Prelude (and Data.List) functions.

Streaming doesn't push its own combinators: the examples provided use $ and read right-to-left.

The motivation for Streaming seems to be to avoid memory leaks caused by extracting pure lists from IO with traditional functions like mapM, which require all the list constructors to be evaluated, the list to be completely deconstructed, and then a new list constructed.

Like io-streams, the focus of the library is providing a low-level streaming abstraction, and there is no support for distributed operation.

streamly (2017-)

Streamly appears to have the grand goal of providing a unified programming tool as suited for quick-and-dirty programming tasks (normally the domain of scripting languages) and high-performance work (C, Java, Rust, etc.). Their intended audience appears to be everyone, or at least, not just existing Haskell programmers. See their rationale

Streamly offers an interface to permit composing concurrent (note: not distributed) programs via combinators. It relies upon fusing a streaming pipeline to remove intermediate list structure allocations and de-allocations (i.e. de-forestation, similar to GHC rewrite rules)

The examples I've seen use standard combinators (e.g. Control.Function.&, which reads left-to-right, and Applicative).

Streamly provide benchmarks versus Haskell pure lists, Streaming, Pipes and Conduit: these generally show Streamly several orders of magnitude faster.

I'm finding it hard to evaluate Streamly. It's big, and it's focus is wide. It provides shadows of Prelude functions, as many of these libraries do.

wrap-up

It seems almost like it must be a rite-of-passage to write a streaming system in Haskell. Stones and glass houses, I'm guilty of that too.

The focus of the surveyed libraries is mostly on providing a streaming abstraction, normally with an analogous interface to standard Haskell lists. They differ on various philosophical points (whether to abstract away the mechanics behind type synonyms, how much to leverage existing Haskell idioms, etc). A few of the libraries have some rudimentary support for distributed operation, but this is limited to connecting separate nodes together: in some cases serialising data remains the application programmer's job, and in all cases the application programmer must manually carve up their processing according to a fixed idea of what nodes they are deploying to. They all define a fixed-function pipeline.


Comments