Can anyone recommend software for running a web service similar to archive.org?

We are looking for something similar to manage digital assets within the Computing History Special Interest Group.

One suggestion I've had is CKAN, which looks very interesting but is possibly more geared towards opening up an API to existing live data (such as a relational DB, distributed or otherwise). We are mostly concerned with relatively static data sets: source code archives, collections of various types of publications, collections of images, etc.

(Having said that, there are some interesting possibilities for projects that consume the data sets in some fashion, perhaps via a web service, e.g. for reviewing OCR results for old raster scans of papers.)

I envisage something similar to the software powering archive.org. We want something that lets people explore collections via the web, potentially including machine-friendly APIs in some cases, and that ideally also lets us manage uploading and categorising items via the web.

I've also had suggestions to look at media-manager software, but what I've seen so far is designed for personal media collections like movies, photos, etc., and focussed more on streaming them to LAN clients.

Can anyone recommend something worth looking at?


Comments

comment 1

Excellent question! I've wondered that myself many times...

Did you consider asking the archive.org people themselves? Or https://archive.is/? They seem to be running on free software...

Comment by anarcat,
comment 1
I believe archive.org make their software available to qualifying institutions, but I'm not aware of the details or pricing: https://archive-it.org/. It doesn't look like it's open source, though. But I think your organization would qualify.
Comment by Ben Lau,
comment 4
DSpace?
Comment by Anonymous,
comment 5

Turns out archive.org does publish their crawling software; it's called Heritrix.

archive.is uses a PhantomJS thing to fetch the contents, from what I understand, but I still haven't found their source code.

But considering that PhantomJS was archived in March, that hardly seems like a good solution. Some people seem to be using headless Chromium instead now, so I guess that would be the basis for a modern crawler.

Comment by anarcat,
comment 6

I ended up writing a blog post about it. It's mostly about archiving websites, but it might still be of interest to you:

https://anarc.at/blog/2018-10-04-archiving-web-sites/

From what I understand, the archive.org web interface isn't free software (as in: not published) yet, but it looks like a pretty old piece of junk, to be honest, so I would look somewhere else. My article deals mainly with WARC files: how to create and view them. If you have other artifacts, I am not sure the article will be very useful, but at least it's a start for that part of the problem.
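
To give a rough idea of what a WARC file contains, here is a minimal sketch in Python that serializes a single WARC/1.0 "resource" record by hand and reads its headers back. It is only an illustration of the record layout, not how the article or archive.org does it: the function names `build_warc_record` and `read_warc_headers` and the example URL are made up for this sketch, and in practice you would more likely use an existing tool (for instance wget's `--warc-file` option or the warcio library) than write records yourself.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC/1.0 'resource' record (hypothetical helper)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in headers)
    # Layout: header block, empty line, payload, then two CRLFs closing the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


def read_warc_headers(record: bytes) -> dict:
    """Parse the named header fields of a single record (hypothetical helper)."""
    head, _, _ = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    # lines[0] is the version line ("WARC/1.0"); the rest are "Name: value" pairs.
    return dict(line.split(": ", 1) for line in lines[1:])


if __name__ == "__main__":
    record = build_warc_record("https://example.org/paper.txt", b"hello", "text/plain")
    for name, value in read_warc_headers(record).items():
        print(f"{name}: {value}")
```

A real archive would concatenate many such records (usually gzip-compressed one by one) into a single `.warc.gz` file, which is what the viewers discussed in the article expect.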

Comment by anarcat,