Serverless replay of web archives directly in the browser
Updated Jun 12, 2024 - TypeScript
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Run a high-fidelity browser-based crawler in a single Docker container
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
metawarc: a command-line tool for extracting metadata from files inside WARC (Web ARChive) archives
Streaming WARC/ARC library for fast web archive IO
Parser for WARC (aka WebArchive) files
Bitextor generates translation memories from multilingual websites
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
🗄️ A simple CLI for converting WARC to Parquet.
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
A tool for detecting viruses and NSFW material in WARC files