10 results found

81
2.3k
apache-2.0
27
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
Created 2020-08-14
859 commits to main branch, last one 2 days ago
Process Common Crawl data with Python and Spark
Created 2017-04-12
107 commits to main branch, last one 2 days ago
35
327
apache-2.0
34
News crawling with StormCrawler - stores content as WARC
Created 2016-07-18
159 commits to master branch, last one about a year ago
A Python utility for downloading Common Crawl data
This repository has been archived
Created 2020-01-09
122 commits to master branch, last one 4 years ago
Statistics of Common Crawl monthly archives mined from URL index files
Created 2016-07-14
225 commits to master branch, last one 6 days ago
14
163
apache-2.0
2
🕷️ The pipeline for the OSCAR corpus
Created 2021-02-15
419 commits to main branch, last one about a year ago
11
137
apache-2.0
6
Drill into WARC web archives
Created 2023-12-07
78 commits to main branch, last one 2 months ago
6
86
apache-2.0
9
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
This repository has been archived
Created 2019-03-01
17 commits to master branch, last one 3 years ago
5
84
apache-2.0
12
Tools to construct and process webgraphs from Common Crawl data
Created 2017-11-12
107 commits to main branch, last one 2 days ago
9
48
apache-2.0
18
Various Jupyter notebooks about Common Crawl data
Created 2019-07-19
23 commits to main branch, last one 2 years ago