warc

Here are 100 public repositories matching this topic...

webrecorder / replayweb.page

Serverless replay of web archives directly in the browser

service-worker warc web-archiving wayback-machine web-archive replay-web-page web-replay wacz

Updated Jun 12, 2024
TypeScript

webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

kubernetes cloud archiving warc web-archiving webrecorder web-archive wacz

Updated Jun 12, 2024
TypeScript

oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

python docker service-worker ipfs memento warc web-archiving wayback memento-rfc

Updated Jun 12, 2024
Python

openzim / zimit-frontend

Zimit Public Web UI

spider warc zim

Updated Jun 12, 2024
Vue

openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format

scraper warc zim

Updated Jun 11, 2024
Python

webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

crawler web-crawler crawling warc web-archiving webrecorder wacz

Updated Jun 12, 2024
TypeScript

harvard-lil / warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

ai warc webarchiving rag

Updated Jun 10, 2024
Python

ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Updated Jun 10, 2024
Python

internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

java warc heritrix webcrawling

Updated Jun 8, 2024
Java

helgeho / ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

spark internet-archive warc web-archiving webarchive archivespark spark-framework