warc

ArchiveBox/ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Python
25055
5 months ago

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Java
3067
1 day ago

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Python
1524
4 months ago

Rhizome-Conifer/conifer

Collect and revisit web pages.

Python
1520
9 months ago

A High-Fidelity Web Archiving Extension for Chrome and Chromium-based browsers!

TypeScript
1071
1 day ago

Run a high-fidelity browser-based web archiving crawler in a single Docker container

TypeScript
879
1 day ago

Serverless replay of web archives directly in the browser

TypeScript
847
1 day ago

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

Python
645
17 days ago

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

JavaScript
447
5 years ago

Streaming WARC/ARC library for fast web archive IO

Python
431
10 months ago

WarcDB: Web crawl data as SQLite databases.

Python
406
1 year ago

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

Roff
379
7 months ago

News crawling with StormCrawler - stores content as WARC

Java
356
8 months ago

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

TypeScript
336
2 days ago

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

Python
259
8 months ago

Chrome extension to "Create WARC files from any webpage"

JavaScript
223
2 years ago

CoCrawler is a versatile web crawler built using modern tools and concurrency.

Python
189
3 years ago

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Python
184
2 days ago

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

Scala
153
2 months ago