warc

ArchiveBox/ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Python
24793
3 months ago

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Java
3029
6 days ago

Rhizome-Conifer/conifer

Collect and revisit web pages.

Python
1513
7 months ago

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Python
1508
3 months ago

A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!

TypeScript
1050
25 days ago

Run a high-fidelity browser-based web archiving crawler in a single Docker container

TypeScript
852
19 days ago

Serverless replay of web archives directly in the browser

TypeScript
826
1 month ago

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

Python
642
3 months ago

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

JavaScript
448
5 years ago

Streaming WARC/ARC library for fast web archive IO

Python
428
8 months ago

WarcDB: Web crawl data as SQLite databases.

Python
404
1 year ago

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

Roff
377
5 months ago

News crawling with StormCrawler - stores content as WARC

Java
352
6 months ago

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

TypeScript
314
1 day ago

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

Python
258
6 months ago

Chrome extension to "Create WARC files from any webpage"

JavaScript
222
2 years ago

CoCrawler is a versatile web crawler built using modern tools and concurrency.

Python
191
3 years ago

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Python
181
8 months ago

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

Scala
152
20 days ago