jmtd → Jonathan Dowland's Weblog
Below are the five most recent posts in my weblog.
- haskell streaming libraries
- printables.com feed
- 10 years at Red Hat
- FOSDEM 2025
- dsafilter 20th Anniversary
You can also see a reverse-chronological list of all posts, dating back to 1999.
For my PhD, my colleagues/collaborators and I built a distributed stream-processing system using Haskell. There are several other Haskell stream-processing systems. How do they compare?
First, let's briefly discuss and define streaming in this context.
Structure and Interpretation of Computer Programs introduces Streams as an analogue of lists, to support delayed evaluation. In brief, the inductive list type (a list is either an empty list or a head element pre-pended to another list) is replaced with a structure with a head element and a promise which, when evaluated, will generate the tail (which in turn may have a head element and a promise to generate another tail, culminating in the equivalent of an empty list.) Later on SICP also covers lazy evaluation.
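In Haskell terms, the SICP stream is a cons cell whose tail is a thunk. Since Haskell is lazy by default the built-in list type already behaves this way, but spelling it out makes the shape clear. This is just my own illustrative sketch; the type and names below aren't from SICP or from any of the libraries discussed later:

    -- a value paired with a (lazily evaluated) rest-of-stream
    data Stream a = Empty | Cons a (Stream a)

    -- an infinite stream of naturals: only the prefix you force is computed
    nats :: Stream Integer
    nats = from 0 where from n = Cons n (from (n + 1))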
However, the streaming we're talking about originates in the relational community, rather than the functional one, and is subtly different. It's about building a pipeline of processing that receives and emits data but doesn't need to (indeed, cannot) reference the whole stream (which may be infinite) at once.
Haskell streaming systems
Now let's go over some Haskell streaming systems.
conduit (2011-)
Conduit is the oldest of the ones I am reviewing here, but I doubt it's the first in the Haskell ecosystem. If I've made any obvious omissions, please let me know!
Conduit provides a new set of types to model streaming data, and a completely new set of functions which are analogues of standard Prelude functions, e.g. sumC in place of sum. It provides its own combinators such as .| (aka fuse), which is like composition but reads left-to-right.
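To give a flavour, here's a minimal Conduit pipeline (my own sketch, not an example from the Conduit documentation): yieldMany produces a source, mapC transforms each element, sumC consumes the stream, and .| glues the stages together, reading left-to-right.

    import Conduit

    -- sum the squares of 1..10, processing one element at a time
    main :: IO ()
    main = print $ runConduitPure
         $ yieldMany [1..10 :: Int] .| mapC (^ 2) .| sumC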
The motivation for this is to enable (near?) constant memory usage for processing large streams of data -- presumably versus using a list-based approach -- and to provide some determinism: the README gives the example of "promptly closing file handles". I think this is another way of saying that it uses strict evaluation, or at least avoids lazy evaluation for some things.
Conduit offers interleaved effects: which is to say, IO can be performed mid-stream.
Conduit supports distributed operation via Data.Conduit.Network in the conduit-extra package. Michael Snoyman, principal Conduit author, wrote up how to use it here: https://www.yesodweb.com/blog/2014/03/network-conduit-async
To write a distributed Conduit application, the application programmer must manually determine the boundaries between the clients/servers and write specific code to connect them.
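As a rough illustration of the shape of such a program, here's a sketch based on my reading of conduit-extra: each connected client is handled by an ordinary Conduit pipeline whose source and sink happen to be a socket (the port number is arbitrary).

    {-# LANGUAGE OverloadedStrings #-}
    import Conduit
    import Data.Conduit.Network

    -- a trivial TCP echo server: stream each client's bytes straight back
    main :: IO ()
    main = runTCPServer (serverSettings 4000 "*") $ \client ->
      runConduit $ appSource client .| appSink client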
pipes (2012-)
The Pipes Tutorial contrasts itself with "Conventional Haskell stream programming": whether that means Conduit or something else, I don't know.
Paraphrasing their pitch: Effects, Streaming, Composability: pick two. That's the situation they describe for stream programming prior to Pipes. They argue Pipes offers all three.
Pipes offers its own combinators (which read left-to-right) and offers interleaved effects.
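A minimal Pipes pipeline looks like this (again my own sketch, not lifted from the Pipes Tutorial): each produces values, >-> composes stages left-to-right, and runEffect runs the whole thing.

    import Pipes
    import qualified Pipes.Prelude as P

    -- double 1..10 and print each result as it flows through the pipe
    main :: IO ()
    main = runEffect $ each [1..10 :: Int] >-> P.map (* 2) >-> P.print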
At this point I can't really see what fundamentally distinguishes Pipes from Conduit.
Pipes has some support for distributed operation via the sister library pipes-network. It looks like you must send and receive ByteStrings, which means rolling your own serialisation for other types. As with Conduit, to send or receive over a network, the application programmer must divide their program up into the sub-programs for each node, and add the necessary ingress/egress code.
io-streams (2013-)
io-streams emphasises simple primitives. Reading and writing is done under the IO Monad, thus, in an effectful (but non-pure) context. The presence or absence of further stream data is signalled using the Maybe type (Just more data, or Nothing: the producer has finished).
It provides a library of functions that shadow the standard Prelude, such as S.fromList, S.mapM, etc.
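A small sketch of the style (my own example, not from the io-streams documentation): streams are built and consumed in IO, and transformers like Streams.map themselves return a new stream in IO.

    import qualified System.IO.Streams as Streams

    -- build an InputStream from a list, double each element, drain it back
    main :: IO ()
    main = do
      input   <- Streams.fromList [1..5 :: Int]
      doubled <- Streams.map (* 2) input
      Streams.toList doubled >>= print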
It's not clear to me what the motivation for io-streams is, beyond providing a simple interface. There's no declaration of intent that I can find about (e.g.) constant-memory operation.
There's no mention of or support (that I can find) for distributed operation.
streaming (2015-)
Similar to io-streams, Streaming emphasises providing a simple interface that gels well with traditional Haskell methods. Streaming provides effectful streams (via a Monad -- any Monad?) and a collection of functions for manipulating streams which are designed to closely mimic standard Prelude (and Data.List) functions.
Streaming doesn't push its own combinators: the examples provided use $ and read right-to-left.
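For example (my own sketch, not from the Streaming documentation):

    import qualified Streaming.Prelude as S

    -- square 1..10 and print each element as it is produced; the stages
    -- compose with ordinary ($) and read right-to-left
    main :: IO ()
    main = S.print $ S.map (^ 2) $ S.each [1..10 :: Int]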
The motivation for Streaming seems to be to avoid memory leaks caused by extracting pure lists from IO with traditional functions like mapM, which require all the list constructors to be evaluated, the list to be completely deconstructed, and then a new list constructed.
Like io-streams, the focus of the library is providing a low-level streaming abstraction, and there is no support for distributed operation.
streamly (2017-)
Streamly appears to have the grand goal of providing a unified programming tool, suited both for quick-and-dirty programming tasks (normally the domain of scripting languages) and for high-performance work (C, Java, Rust, etc.). Their intended audience appears to be everyone, or at least, not just existing Haskell programmers. See their rationale.
Streamly offers an interface to permit composing concurrent (note: not distributed) programs via combinators. It relies upon fusing a streaming pipeline to remove intermediate list structure allocations and de-allocations (i.e. de-forestation, similar to GHC rewrite rules).
The examples I've seen use standard combinators (e.g. Data.Function.&, which reads left-to-right, and Applicative).
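Something like the following, which is a sketch written against the streamly-0.8-era Streamly.Prelude module (the library has since reorganised its modules, so treat this as indicative rather than current):

    import Data.Function ((&))
    import qualified Streamly.Prelude as Stream

    -- double 1..10 and print each result, composing stages left-to-right
    -- with (&); drain runs the stream for its effects
    main :: IO ()
    main = Stream.fromList [1..10 :: Int]
         & Stream.map (* 2)
         & Stream.mapM print
         & Stream.drain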
Streamly provides benchmarks versus Haskell pure lists, Streaming, Pipes and Conduit: these generally show Streamly several orders of magnitude faster.
I'm finding it hard to evaluate Streamly. It's big, and its focus is wide. It provides shadows of Prelude functions, as many of these libraries do.
wrap-up
It seems almost like it must be a rite-of-passage to write a streaming system in Haskell. Stones and glass houses, I'm guilty of that too.
The focus of the surveyed libraries is mostly on providing a streaming abstraction, normally with an analogous interface to standard Haskell lists. They differ on various philosophical points (whether to abstract away the mechanics behind type synonyms, how much to leverage existing Haskell idioms, etc). A few of the libraries have some rudimentary support for distributed operation, but this is limited to connecting separate nodes together: in some cases serialising data remains the application programmer's job, and in all cases the application programmer must manually carve up their processing according to a fixed idea of what nodes they are deploying to. They all define a fixed-function pipeline.
I wanted to follow new content posted to Printables.com with a feed reader, but Printables.com doesn't provide one. Neither do the other obvious 3d model catalogues. So, I started building one.
I have something that spits out an Atom feed and a couple of beta testers gave me some valuable feedback. I had planned to make it public, with the ultimate goal being to convince Printables.com to implement feeds themselves.
Meanwhile, I stumbled across someone else who has done basically the same thing. Here are 3rd party feeds for
- trending - https://dyn.tedder.me/rss/printables/trending.json
- top rated - https://dyn.tedder.me/rss/printables/top_rated.json
- top downloads - https://dyn.tedder.me/rss/printables/top_downloads.json
The format of their feeds is JSON Feed, which is new to me. FreshRSS and NetNewsWire seem happy with it. (I went with Atom.) I may still release my take, if I find time to make one improvement that my beta-testers suggested.
I've just passed my 10th anniversary of starting at Red Hat! As a personal milestone, this is the longest I've stayed in a job: I managed 10 years at Newcastle University, although not in one continuous role.
I haven't exactly worked in one continuous role at Red Hat either, but it feels like what I do today is a logical evolution from what I started doing, whereas in Newcastle I jumped around a bit.
I've seen some changes: in my time here, we changed the logo from Shadow Man; we transitioned to using Google Workspace for lots of stuff, instead of in-house IT; we got bought by IBM; we changed President and CEO, twice. And millions of smaller things.
I won't reach an 11th: my Organisation in Red Hat is moving to IBM. I think this is sad news for Red Hat: they're losing some great people. But I'm optimistic for the future of my Organisation.
I'm going to FOSDEM 2025!
As usual, I'll be in the Java Devroom for most of that day, which this time around is Saturday.
Please recommend me any talks!
This is my shortlist so far:
- no more boot loader: boot using the Linux kernel
- aerc, an email client for the discerning hacker
- Supersonic retro development with Docker
- Raiders of the lost hard drive
- Rediscovering the fun of programming with the Game Boy
- Fixing CVEs on Debian: almost everything you should know about it
- Building the Future: Understanding and Contributing to Immutable Linux Distributions
- Generating immutable, A/B updatable, securely booting Debian images
- a tale of several distros joining forces for a common goal: reproducible builds
- Finding Anomalies in the Debian Packaging System to Detect Supply Chain Attacks
- The State of OpenJDK
- Project Lilliput - Looking Back and Ahead
- (Almost) everything I knew about Java performance was wrong
- Reduce the size of your Java run-time image
- Project Leyden - Past and the Future
- DMARCaroni: where do DMARC reports go after they are sent?
Happy 20th birthday, dsafilter!
dsafilter is a mail filter I wrote two decades ago to solve a problem I had: I was dutifully subscribed to debian-security-announce to learn of new security package updates, but most were not relevant to me.
The filter creates a new, summarising mail, reporting on whether the DSA is applicable to any package installed on the system running the filter, and attaches the original DSA mail for reference. Users can then choose to drop mails for packages that aren't relevant.
In 2005 I'd been a Debian user for about 6 years, I'd met a few Debian developers in person, and I was interested in getting involved. I started my journey to Developer later that same year. I published dsafilter, and I think I sent an announcement to debian-devel, but didn't do a great deal to socialise it, so I suspect nobody else is using it.
That said, I have been using it for those two decades, and I still am! What's notable to me is that I haven't had to modify the script at all to keep up with software changes, in particular, changes to the interpreter. I wrote it as a Ruby script. If I had chosen Perl, it would probably be the same story, but if I'd chosen Python, there's no chance at all that it would still be working today.
If it sounds interesting to you, please give it a try. I think it might be due some spring cleaning.