shjwudp / c4-dataset-script

Inspired by Google's C4, this is a series of "colossal clean" data cleaning scripts focused on CommonCrawl data processing, including Chinese data processing and the cleaning methods from MassiveText.

Date Created 2022-05-27 (2 years ago)
Commits 26 (last one about a year ago)
Stargazers 120 (0 this week)
Watchers 4 (0 this week)
Forks 14
License MIT
Ranking

RepositoryStats indexes 628,868 repositories. Of these, shjwudp/c4-dataset-script is ranked #269,579 (57th percentile) for total stargazers and #364,615 for total watchers. GitHub reports the primary language for this repository as Python; among repositories using this language it is ranked #50,547/128,166.

shjwudp/c4-dataset-script is also tagged with popular topics, and is ranked within each as follows: python (#12,012/23,309), nlp (#1,351/2,512), dataset (#572/1,218), spark (#316/551).

Other Information

There has been 1 release, published on 2022-05-27 (2 years ago) with the name C4 Dataset Script v0.1.0.

Star History

GitHub stargazers over time

[Chart: stargazer count (axis 0 to 120) from 2023 to 2025]

Watcher History

GitHub watchers over time; collection started in '23

[Chart: watcher count (axis roughly 3 to 5) from Jun '23 to Mar '25]

Recent Commit History

26 commits on the default branch (master) since Jan '22

[Chart: commit count (axis 0 to 30) from Jul '22 to 2025]

Yearly Commits

Commits to the default branch (master) per year

[Chart: commits per year (axis 0 to 18), x-axis labeled 2022 and 2024]

Issue History

No issues have been posted

Languages

The only known language in this repository is Python


updated: 2025-01-30 @ 08:39am, id: 496950762 / R_kgDOHZ7d6g