Webrecorder | Introducing: Browsertrix

I’m excited to announce that Webrecorder is embarking on perhaps our most ambitious development effort to date: the collaborative development of Browsertrix!

You can read more on our official site browsertrix.com, including a list of key features.

Browsertrix is a fully integrated open source browser-based crawling platform that will allow users to create their own high-fidelity web archives in an automated way at scale.

Current Status

The full source code for Browsertrix is available on GitHub.

Development is currently in the early stages, with a focus on implementing core features and a user-friendly UI.

Our Senior Frontend Developer Sua Yoo has been tackling the creation of a brand new user interface to manage crawls and crawl configurations. To keep up with the development progress, please follow the project on GitHub.

Planned Service and Collaborative Development with IIPC Community

Webrecorder plans to eventually offer Browsertrix as a service, and we will be rolling out testing gradually over the next few months.

If you’re interested in participating in early testing, please sign up for the Browsertrix info list.

But Browsertrix won’t just be another siloed online service!

Browsertrix is still in early stages of development, but we believe it is important to share this work, and more importantly, develop it in collaboration with our partners in the web archiving community.

Towards this goal, we are also very excited to announce our collaboration with the IIPC community!

The International Internet Preservation Consortium (IIPC) has agreed to contribute funding towards the development of Browsertrix over the next one or two years.

In addition to this funding and as part of this collaboration, several IIPC members, including the Royal Danish Library, the British Library, the National Library of New Zealand, and the University of North Texas, will also be deploying Browsertrix within their institutions.

The goal of Browsertrix is to provide a kind of federated web archiving system, which can be deployed not just by Webrecorder, but by other institutions. The close collaboration with IIPC members from the start will ensure that this system can meet the broader goals of the web archiving community, from smaller institutions to large national libraries.

Open Source and Open Web Archive Data

Webrecorder believes strongly that web archiving tools should be fully open-source to ensure long-term viability of the digital record. Web archiving is too critical to be relegated to proprietary processes and the whims of individual vendors.

While the WARC format provides a standard way to store raw HTTP data, there are no standard formats for everything else: crawl specifications and crawl logs, page lists and indexes, full-text search data, and curatorial metadata. With the WACZ format, Webrecorder is beginning to standardize some of these remaining components of the web archiving workflow.
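To make the idea of packaging these components a little more concrete, here is a minimal sketch, using only Python's standard library, of a WACZ-style package: a ZIP file that holds the raw WARC data alongside a page list and a metadata file. The paths used (`archive/`, `pages/pages.jsonl`, `datapackage.json`) follow our reading of the WACZ layout; the file contents, the header record, and the helper names `build_demo_wacz` and `list_page_urls` are hypothetical placeholders for illustration, not real crawl output or any Browsertrix API.

```python
import io
import json
import zipfile


def build_demo_wacz() -> bytes:
    """Build a tiny WACZ-style ZIP in memory (placeholder contents, not a real crawl)."""
    pages = [
        # Header record describing the page list (assumed shape, for illustration).
        {"format": "json-pages-1.0", "id": "pages", "title": "All Pages"},
        # One hypothetical page entry.
        {"url": "https://example.com/", "title": "Example Domain", "ts": "2022-01-01T00:00:00Z"},
    ]
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Raw HTTP data would live here as WARC files; an empty placeholder is used.
        zf.writestr("archive/data.warc.gz", b"")
        # The page list is stored as JSON Lines: one JSON object per line.
        zf.writestr("pages/pages.jsonl", "\n".join(json.dumps(p) for p in pages) + "\n")
        # Package-level metadata (heavily simplified).
        zf.writestr("datapackage.json", json.dumps({"profile": "data-package", "resources": []}))
    return buf.getvalue()


def list_page_urls(wacz_bytes: bytes) -> list[str]:
    """Return the URLs recorded in the package's page list."""
    with zipfile.ZipFile(io.BytesIO(wacz_bytes)) as zf:
        lines = zf.read("pages/pages.jsonl").decode("utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    # The header record has no "url" key, so it is skipped here.
    return [record["url"] for record in records if "url" in record]


if __name__ == "__main__":
    print(list_page_urls(build_demo_wacz()))
```

Because the package is an ordinary ZIP file, any tool that can read ZIP and JSON can inspect it without depending on a particular service.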

One of our goals with Browsertrix is to allow crawl outputs (WACZ or WARC) to be stored in storage of the users’ choosing. Browsertrix will allow output to any S3 bucket, so that even if the service were to disappear or stop working, all of the data will still be accessible using existing tools like ReplayWeb.page. This federated approach to storage will allow crawled data to be stored almost anywhere, from custom institutional repositories like Archipelago, to existing WARC data centers at national libraries, to any cloud S3 provider (like Amazon or DigitalOcean), to the local file system, as well as decentralized storage systems like IPFS.
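One way to picture this storage-agnostic approach is as a small interface that every storage backend implements. The sketch below is purely illustrative and is not Browsertrix's actual code: the names `CrawlStorage`, `LocalStorage`, and `save_crawl`, and the `<crawl_id>/<crawl_id>.wacz` key layout, are all assumptions made for the example. Only a local file system backend is shown; an S3 or IPFS backend would provide the same `put` method using its own client.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class CrawlStorage(Protocol):
    """Hypothetical interface: anything that can store bytes under a key."""

    def put(self, key: str, data: bytes) -> str:
        """Store data under key and return a location string."""
        ...


class LocalStorage:
    """Local file system backend that writes each key as a file under root."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> str:
        target = self.root / key
        # Create intermediate directories so nested keys work.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        return str(target)


def save_crawl(storage: CrawlStorage, crawl_id: str, wacz: bytes) -> str:
    """Write a crawl's WACZ output to whichever backend the user chose."""
    return storage.put(f"{crawl_id}/{crawl_id}.wacz", wacz)


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    print(save_crawl(LocalStorage(root), "demo-crawl", b"placeholder WACZ bytes"))
```

Since `save_crawl` depends only on the interface, swapping the storage destination does not change the crawling code, and the stored file stays readable by other tools wherever it ends up.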

With Browsertrix, we hope to enable users to truly own all of their web archive data, and to be able to access and make use of it without relying on infrastructure from any single vendor (including Webrecorder!).

Collaborations Welcome

We know this will be an ambitious project, and we are just getting started! Web archiving is becoming more critical and more difficult.

If you would like to contribute to development or testing, or are interested in a custom deployment of Browsertrix, please feel free to reach out directly via e-mail, GitHub, or our forums.

If you would like to support Webrecorder financially, please consider doing so via our Open Collective or GitHub Sponsors accounts, and don’t hesitate to reach out to us with any questions.

EDIT 2024-05-22: “Browsertrix” was previously referred to here as “Browsertrix Cloud”. This post has been updated to reflect the new name.

Have thoughts or comments on this post? Feel free to start a discussion on our forum!