jmtd → log → haskell streaming libraries
For my PhD, my colleagues/collaborators and I built a distributed stream-processing system using Haskell. There are several other Haskell stream-processing systems. How do they compare?
First, let's briefly discuss and define streaming in this context.
Structure and Interpretation of Computer Programs introduces Streams as an analogue of lists, to support delayed evaluation. In brief, the inductive list type (a list is either an empty list or a head element pre-pended to another list) is replaced with a structure with a head element and a promise which, when evaluated, will generate the tail (which in turn may have a head element and a promise to generate another tail, culminating in the equivalent of an empty list.) Later on SICP also covers lazy evaluation.
However, the streaming we're talking about originates in the relational community, rather than the functional one, and is subtly different. It's about building a pipeline of processing that receives and emits data but doesn't need to (indeed, cannot) reference the whole stream (which may be infinite) at once.
Haskell streaming systems
Now let's go over some Haskell streaming systems.
conduit (2011-)
Conduit is the oldest of the ones I am reviewing here, but I doubt it's the first in the Haskell ecosystem. If I've made any obvious omissions, please let me know!
Conduit provides a new set of types to model streaming data, and a completely
new set of functions which are analogues of standard Prelude functions, e.g.
sumC
in place of sum
. It provides its own combinator(s) such as .|
(
aka fuse)
which is like composition but reads left-to-right.
The motivation for this is to enable (near?) constant memory usage for processing large streams of data -- presumably versus using a list-based approach and to provide some determinism: the README gives the example of "promptly closing file handles". I think this is another way of saying that it uses strict evaluation, or at least avoids lazy evaluation for some things.
Conduit offers interleaved effects: which is to say, IO can be performed mid-stream.
Conduit supports distributed operation via Data.Conduit.Network
in the
conduit-extra
package. Michael Snoyman, principal Conduit author, wrote
up how to use it here: https://www.yesodweb.com/blog/2014/03/network-conduit-async
To write a distributed Conduit application, the application programmer must
manually determine the boundaries between the clients/servers and write specific
code to connect them.
pipes (2012-)
The Pipes Tutorial contrasts itself with "Conventional Haskell stream programming": whether that means Conduit or something else, I don't know.
Paraphrasing their pitch: Effects, Streaming Composability: pick two. That's the situation they describe for stream programming prior to Pipes. They argue Pipes offers all three.
Pipes offers it's own combinators (which read left-to-right) and offers interleaved effects.
At this point I can't really see what fundamentally distinguishes Pipes from Conduit.
Pipes has some support for distributed operation via the sister library
pipes-network. It
looks like you must send and receive ByteString
s, which means rolling
your own serialisation for other types. As with Conduit, to send or receive
over a network, the application programmer must divide their program up
into the sub-programs for each node, and add the necessary ingress/egress
code.
io-streams (2013-)
io-streams emphasises simple primitives. Reading and writing is done
under the IO Monad, thus, in an effectful (but non-pure) context. The
presence or absence of further stream data are signalled by using the
Maybe
type (Just
more data or Nothing
: the producer has finished.)
It provides a library of functions that shadow the standard Prelude, such
as S.fromList
, S.mapM
, etc.
It's not clear to me what the motivation for io-streams is, beyond providing a simple interface. There's no declaration of intent that I can find about (e.g.) constant-memory operation.
There's no mention of or support (that I can find) for distributed operation.
streaming (2015-)
Similar to io-streams, Streaming emphasises providing a simple
interface that gels well with traditional Haskell methods. Streaming
provides effectful streams (via a Monad -- any Monad?) and a collection
of functions for manipulating streams which are designed to closely
mimic standard Prelude (and Data.List
) functions.
Streaming doesn't push its own combinators: the examples provided
use $
and read right-to-left.
The motivation for Streaming seems to be to avoid memory leaks caused by
extracting pure lists from IO with traditional functions like mapM
,
which require all the list constructors to be evaluated, the list to be
completely deconstructed, and then a new list constructed.
Like io-streams, the focus of the library is providing a low-level streaming abstraction, and there is no support for distributed operation.
streamly (2017-)
Streamly appears to have the grand goal of providing a unified programming tool as suited for quick-and-dirty programming tasks (normally the domain of scripting languages) and high-performance work (C, Java, Rust, etc.). Their intended audience appears to be everyone, or at least, not just existing Haskell programmers. See their rationale
Streamly offers an interface to permit composing concurrent (note: not distributed) programs via combinators. It relies upon fusing a streaming pipeline to remove intermediate list structure allocations and de-allocations (i.e. de-forestation, similar to GHC rewrite rules)
The examples I've seen use standard combinators (e.g. Control.Function.&
,
which reads left-to-right, and Applicative
).
Streamly provide benchmarks versus Haskell pure lists, Streaming, Pipes and Conduit: these generally show Streamly several orders of magnitude faster.
I'm finding it hard to evaluate Streamly. It's big, and it's focus is wide. It provides shadows of Prelude functions, as many of these libraries do.
wrap-up
It seems almost like it must be a rite-of-passage to write a streaming system in Haskell. Stones and glass houses, I'm guilty of that too.
The focus of the surveyed libraries is mostly on providing a streaming abstraction, normally with an analogous interface to standard Haskell lists. They differ on various philosophical points (whether to abstract away the mechanics behind type synonyms, how much to leverage existing Haskell idioms, etc). A few of the libraries have some rudimentary support for distributed operation, but this is limited to connecting separate nodes together: in some cases serialising data remains the application programmer's job, and in all cases the application programmer must manually carve up their processing according to a fixed idea of what nodes they are deploying to. They all define a fixed-function pipeline.
Comments