Search Results - RepositoryStats

news-please fhamborg

432

2.2k

apache-2.0

53

news-please - an integrated web crawler and information extractor for news that just works

Created 2016-12-18

802 commits to master branch, last one 5 months ago

cc-pyspark commoncrawl

88

422

mit

19

Process Common Crawl data with Python and Spark

wet spark pyspark sparksql wat-files warc-files commoncrawl common-crawl

Created 2017-04-12

115 commits to main branch, last one about a month ago

fundus flairNLP

85

356

mit

7

A very simple news crawler with a funny name

nlp rss corpus python cc-news crawler scraper sitemap datasets web-corpus commoncrawl corpus-tools news-crawler web-scraping news-scraping text-extraction image-extraction image-classification

Created 2022-10-28

2,691 commits to master branch, last one 6 days ago

news-crawl commoncrawl

36

339

apache-2.0

32

News crawling with StormCrawler - stores content as WARC

news warc crawler commoncrawl web-crawler apache-storm common-crawl storm-crawler

Created 2016-07-18

159 commits to master branch, last one about a year ago

comcrawl michaelharms

42

234

mit

5

A Python utility for downloading Common Crawl data

data python scraping commoncrawl common-crawl deep-learning training-dataset

This repository has been archived

Created 2020-01-09

122 commits to master branch, last one 4 years ago

cc-crawl-statistics commoncrawl

11

175

apache-2.0

17

Statistics of Common Crawl monthly archives mined from URL index files

statistics commoncrawl common-crawl

Created 2016-07-14

234 commits to master branch, last one 2 days ago

cdx_toolkit cocrawler

31

168

apache-2.0

10

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

cdx warc python cdx-api commoncrawl web-archives web-archiving

Created 2018-03-03

259 commits to main branch, last one 6 months ago

ungoliant oscar-project

15

167

apache-2.0

2

🕷️ The pipeline for the OSCAR corpus

nlp oscar crawler fasttext commoncrawl common-crawl corpus-linguistics language-classification

Created 2021-02-15

419 commits to main branch, last one about a year ago

gogetcrawl karust

17

154

mit

3

Extract web archive data using Wayback Machine and Common Crawl

golang crawler webarchive commoncrawl concurrency wayback-machine

Created 2019-06-14

31 commits to master branch, last one 4 months ago

c4-dataset-script shjwudp

14

120

mit

4

Inspired by Google's C4, a series of colossal clean data cleaning scripts focused on Common Crawl data processing, including Chinese data processing and the cleaning methods from MassiveText.

nlp spark python dataset commoncrawl massivetext

Created 2022-05-27

26 commits to master branch, last one about a year ago

cc-index-table commoncrawl

9

113

apache-2.0

13

Index Common Crawl archives in tabular format

sql spark aws-athena commoncrawl apache-parquet columnar-storage

Created 2017-11-09

101 commits to main branch, last one 7 days ago

cc-webgraph commoncrawl

5

87

apache-2.0

11

Tools to construct and process webgraphs from Common Crawl data

pagerank webgraph commoncrawl common-crawl webgraph-framework centrality-measures

Created 2017-11-12

111 commits to main branch, last one 7 days ago

CommonCrawlDocumentDownload centic9

18

65

bsd-2-clause

12

A small tool which uses the Common Crawl URL Index to download documents with certain file types or MIME types. This is used for mass-testing of frameworks like Apache POI and Apache Tika.

java warc cdx-files mime-types commoncrawl

Created 2015-04-22

296 commits to master branch, last one 2 months ago

KeywordAnalysis CI-Research

11

56

unknown

5

Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends

wordcount commoncrawl cluster-analysis keyword-extraction

Created 2017-03-27

139 commits to master branch, last one about a year ago

cc-notebooks commoncrawl

9

51

apache-2.0

17

Various Jupyter notebooks about Common Crawl data

aws-athena commoncrawl common-crawl webarchiving jupyter-notebook webgraph-framework

Created 2019-07-19

23 commits to main branch, last one 2 years ago

uforall rix4uni

8

39

mit

2

uforall is a fast URL crawler that collects URLs from a number of different sources: AlienVault, Wayback Machine, urlscan, and Common Crawl

osint recon crawler urlscan wayback bugbounty alienvault commoncrawl reconnaissance

Created 2022-12-30

34 commits to main branch, last one 3 months ago

cc-downloader commoncrawl

1

34

apache-2.0

7

A polite and user-friendly downloader for Common Crawl data

rust downloader commoncrawl

Created 2024-06-10

63 commits to main branch, last one about a month ago