Search Results - RepositoryStats

1.2k

22.8k

mit

174

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Created 2017-05-05

4,606 commits to dev branch, last one 4 days ago

heritrix3 internetarchive

761

2.9k

other

187

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

java warc heritrix webcrawling

Created 2011-10-21

2,624 commits to master branch, last one 6 days ago

conifer Rhizome-Conifer

119

1.5k

apache-2.0

52

Collect and revisit web pages.

pywb warc docker python wayback archives webrecorder web-archiving

Created 2015-05-13

1,889 commits to master branch, last one 3 years ago

grab-site ArchiveTeam

137

1.4k

other

42

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

warc crawl spider crawler archiving

Created 2015-02-05

1,172 commits to master branch, last one 5 months ago

archiveweb.page webrecorder

62

912

agpl-3.0

19

A High-Fidelity Web Archiving Extension for Chrome and Chromium-based browsers!

wacz warc chromium archiving extension webrecorder web-archiving browser-extension

Created 2020-02-10

245 commits to main branch, last one about a month ago

replayweb.page webrecorder

59

728

agpl-3.0

16

Serverless replay of web archives directly in the browser

wacz warc web-replay web-archive web-archiving service-worker replay-web-page wayback-machine

Created 2019-12-09

486 commits to main branch, last one 7 days ago

browsertrix-crawler webrecorder

87

681

agpl-3.0

24

Run a high-fidelity browser-based web archiving crawler in a single Docker container

wacz warc crawler crawling web-crawler webrecorder web-archiving

Created 2020-11-02

458 commits to main branch, last one about a month ago

ipwb oduwsdl

39

617

mit

23

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

ipfs warc docker python memento wayback memento-rfc web-archiving service-worker

Created 2016-03-04

1,618 commits to master branch, last one 2 months ago

WarcDB Florents-Tselai

11

397

apache-2.0

10

WarcDB: Web crawl data as SQLite databases.

cli warc sqlite crawling database web-data web-archiving

Created 2022-05-29

73 commits to main branch, last one 5 months ago

warcio webrecorder

58

392

apache-2.0

22

Streaming WARC/ARC library for fast web archive IO

pywb warc python web-archives web-archiving

Created 2017-03-06

157 commits to master branch, last one 20 days ago

wail machawk1

36

355

mit

14

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

gui warc python wayback heritrix openwayback pyinstaller web-archiving

Created 2013-03-20

836 commits to main branch, last one 2 months ago

news-crawl commoncrawl

35

327

apache-2.0

34

News crawling with StormCrawler - stores content as WARC

news warc crawler commoncrawl web-crawler apache-storm common-crawl storm-crawler

Created 2016-07-18

159 commits to master branch, last one about a year ago

bitextor bitextor

43

292

gpl-3.0

30

Bitextor generates translation memories from multilingual websites

Created 2018-04-16

4,001 commits to master branch, last one about a month ago

warc-gpt harvard-lil

23

235

mit

12

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

ai rag warc webarchiving

Created 2023-10-23

211 commits to main branch, last one about a month ago

browsertrix webrecorder

40

216

agpl-3.0

12

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

wacz warc cloud archiving kubernetes web-archive webrecorder web-archiving

Created 2021-06-28

1,437 commits to main branch, last one 7 days ago

warcreate machawk1

13

215

mit

17

Chrome extension to "Create WARC files from any webpage"

warc web-archiving chrome-extension

Created 2013-03-20

181 commits to main branch, last one about a year ago

cocrawler cocrawler

24

188

apache-2.0

20

CoCrawler is a versatile web crawler built using modern tools and concurrency.

warc aiohttp crawler python3 screenshot concurrency async-python aiohttp-client pluggable-modules

Created 2016-07-15

1,142 commits to main branch, last one 2 years ago

cdx_toolkit cocrawler

31

161

apache-2.0

11

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

cdx warc python cdx-api commoncrawl web-archives web-archiving

Created 2018-03-03

259 commits to main branch, last one 3 months ago

ArchiveSpark helgeho

19

146

mit

15

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

warc spark webarchive archivespark web-archiving spark-framework internet-archive

Created 2015-08-06

144 commits to master branch, last one 3 months ago

troll-a crissyfield

11

137

apache-2.0

6

Drill into WARC web archives

warc security common-crawl security-tools internet-archive command-line-tool

Created 2023-12-07

78 commits to main branch, last one 2 months ago

wget-lua ArchiveTeam

15

107

gpl-3.0

20

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

ftp lua warc wget zstd crawl spider crawler scraper crawlers crawling scraping wget-lua archiving downloader archiveteam webarchiving

Created 2019-09-07

4,429 commits to v1.21.3-at branch, last one about a month ago

warc-parquet maxcountryman

0

106

mit

5

🗄️ A simple CLI for converting WARC to Parquet.

warc duckdb parquet crawling web-archiving

Created 2022-06-20

72 commits to main branch, last one about a month ago

node-warc N0taN3rd

21

95

mit

9

Parse And Create Web ARChive (WARC) files with node.js

warc pupeteer warc-files webarchive web-archives webarchiving web-archiving chrome-remote-interface

Created 2017-05-21

114 commits to master branch, last one 5 years ago

chatnoir-resiliparse chatnoir-eu

14

90

apache-2.0

9

A robust web archive analytics toolkit

cpp web warc cython python bigdata extraction htmlparser webarchive

Created 2021-06-22

930 commits to develop branch, last one 25 days ago

forum-dl mikwielgus

2

77

mit

4

Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC

warc forum phpbb python scraper discourse data-fetching simplemachines internet-archiving

Created 2023-02-05

420 commits to develop branch, last one about a year ago

CommonCrawlDocumentDownload centic9

18

63

bsd-2-clause

13

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks like Apache POI and Apache Tika.

java warc cdx-files mime-types commoncrawl

Created 2015-04-22

296 commits to master branch, last one 10 days ago

cdx-summary internetarchive

10

62

agpl-3.0

21

Summarize web archive capture index (CDX) files.