Example of using warcutils with Apache Spark
Updated Jul 25, 2017 - Scala
Transform stream to read a .warc or .warc.gz file member by member in Node.js
This library is a very lightweight client to Common Crawl's WARC files.
Intended as a search engine, but currently a filtering pipeline for WARC files. Legacy repo; look for the abracabra repo.
Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.
ES6 class to read a .warc or .warc.gz file member by member in Node.js
Hadoop streaming EMR job
This system evaluates a series of mementos (archived web pages) to determine which are off topic. The series can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.
Partition (W)ARC Files by MIME Type and Year
Filters for processing Web ARChive (WARC) files as part of the WofG Web Reporting Service
This project is intended to turn a WARC file into a sitemap, or into a graph description from which a sitemap could be built. The first release only creates a Graphviz file, which can then be rendered, for example to SVG.
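Several of the projects above read a .warc or .warc.gz file record by record. The following is a minimal Python sketch of that idea using only the standard library. It is not the API of any repository listed here, and the names `iter_warc_records` and `open_warc` are hypothetical. It assumes each record starts with a `WARC/` version line, carries a `Content-Length` header, and is followed by blank separator lines; header continuation lines and other edge cases are not handled.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Read "Name: value" header lines up to the blank line that ends the header block.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        length = int(headers["Content-Length"])
        yield headers, stream.read(length)


def open_warc(path: str) -> BinaryIO:
    """Open a .warc or .warc.gz file for binary reading (hypothetical helper)."""
    if path.endswith(".gz"):
        # Python's gzip module decompresses concatenated gzip members transparently.
        return gzip.open(path, "rb")
    return open(path, "rb")
```

Under these assumptions, iterating over `iter_warc_records(open_warc(path))` yields one header dictionary and one payload per record, so a large archive never has to be held in memory all at once.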