14 results found

412
2.0k
apache-2.0
52
news-please - an integrated web crawler and information extractor for news that just works
Created 2016-12-18
736 commits to master branch, last one 3 days ago
Process Common Crawl data with Python and Spark
Created 2017-04-12
104 commits to main branch, last one 2 months ago
34
309
apache-2.0
32
News crawling with StormCrawler - stores content as WARC
Created 2016-07-18
159 commits to master branch, last one 6 months ago
A Python utility for downloading Common Crawl data
Created 2020-01-09
122 commits to master branch, last one 3 years ago
29
155
apache-2.0
12
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Created 2018-03-03
243 commits to main branch, last one 2 years ago
14
154
apache-2.0
2
🕷️ The pipeline for the OSCAR corpus
Created 2021-02-15
419 commits to main branch, last one 7 months ago
64
142
mit
6
A very simple news crawler with a funny name
Created 2022-10-28
1,986 commits to master branch, last one a day ago
Extract web archive data using Wayback Machine and Common Crawl
Created 2019-06-14
28 commits to master branch, last one about a year ago
Statistics of Common Crawl monthly archives mined from URL index files
Created 2016-07-14
209 commits to master branch, last one 23 hours ago
Inspired by Google's C4, a series of data cleaning scripts focused on Common Crawl data processing, including the Chinese data processing and cleaning methods from MassiveText.
Created 2022-05-27
26 commits to master branch, last one about a year ago
Index Common Crawl archives in tabular format
Created 2017-11-09
89 commits to main branch, last one 9 months ago
4
74
apache-2.0
11
Tools to construct and process webgraphs from Common Crawl data
Created 2017-11-12
97 commits to main branch, last one 23 hours ago
A small tool that uses the Common Crawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks like Apache POI and Apache Tika.
Created 2015-04-22
286 commits to master branch, last one 12 days ago
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
Created 2017-03-27
139 commits to master branch, last one 5 months ago