Search Results - RepositoryStats

87

2.5k

apache-2.0

25

Up to 10x faster strings for C, C++, Python, Rust, Swift & Go, leveraging NEON, AVX2, AVX-512, SVE, & SWAR to accelerate search, hashing, sort, edit distances, and memory ops 🦖

csv html json simd laion ndjson parser string dataset substring common-crawl beautifulsoup string-search string-parsing string-matching sorting-algorithms pattern-recognition string-manipulation information-retrieval

Created 2020-08-14

877 commits to main branch, last one 15 days ago

cc-pyspark commoncrawl

88

422

mit

19

Process Common Crawl data with Python and Spark

wet spark pyspark sparksql wat-files warc-files commoncrawl common-crawl

Created 2017-04-12

115 commits to main branch, last one about a month ago

news-crawl commoncrawl

36

339

apache-2.0

32

News crawling with StormCrawler - stores content as WARC

news warc crawler commoncrawl web-crawler apache-storm common-crawl storm-crawler

Created 2016-07-18

159 commits to master branch, last one about a year ago

comcrawl michaelharms

42

236

mit

5

A Python utility for downloading Common Crawl data

data python scraping commoncrawl common-crawl deep-learning training-dataset

This repository has been archived

Created 2020-01-09

122 commits to master branch, last one 4 years ago

cc-crawl-statistics commoncrawl

11

175

apache-2.0

17

Statistics of Common Crawl monthly archives mined from URL index files

statistics commoncrawl common-crawl

Created 2016-07-14

234 commits to master branch, last one 8 days ago

ungoliant oscar-project

15

167

apache-2.0

2

🕷️ The pipeline for the OSCAR corpus

nlp oscar crawler fasttext commoncrawl common-crawl corpus-linguistics language-classification

Created 2021-02-15

419 commits to main branch, last one about a year ago

troll-a crissyfield

11

134

apache-2.0

6

Drill into WARC web archives

warc security common-crawl security-tools internet-archive command-line-tool

Created 2023-12-07

78 commits to main branch, last one 5 months ago

goclassy oscar-project

6

86

apache-2.0

9

An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

nlp fasttext common-crawl corpus-linguistics language-classification

This repository has been archived

Created 2019-03-01

17 commits to master branch, last one 3 years ago

cc-webgraph commoncrawl

5

85

apache-2.0

11

Tools to construct and process webgraphs from Common Crawl data

pagerank webgraph commoncrawl common-crawl webgraph-framework centrality-measures

Created 2017-11-12

111 commits to main branch, last one 13 days ago

cc-notebooks commoncrawl

9

51

apache-2.0

17

Various Jupyter notebooks about Common Crawl data

aws-athena commoncrawl common-crawl webarchiving jupyter-notebook webgraph-framework

Created 2019-07-19

23 commits to main branch, last one 2 years ago