shjwudp / c4-dataset-script

Inspired by Google's C4, this is a series of "colossal clean" data cleaning scripts focused on CommonCrawl data processing, including Chinese data processing and the cleaning methods from MassiveText.

Date Created 2022-05-27 (2 years ago)
Commits 26 (last one about a year ago)
Stargazers 119 (0 this week)
Watchers 5 (0 this week)
Forks 14
License MIT
Ranking

RepositoryStats indexes 589,134 repositories. Of these, shjwudp/c4-dataset-script is ranked #259,475 (56th percentile) for total stargazers and #333,644 for total watchers. GitHub reports the primary language for this repository as Python; among repositories using this language it is ranked #47,704/117,584.

shjwudp/c4-dataset-script is also tagged with popular topics; for these it is ranked: python (#11,624/22,145), nlp (#1,318/2,415), dataset (#545/1,150), spark (#307/537)

Other Information

There has been 1 release; it was published on 2022-05-27 (2 years ago) with the name C4 Dataset Script v0.1.0.

Star History

GitHub stargazers over time

Watcher History

GitHub watchers over time, collection started in '23

Recent Commit History

26 commits on the default branch (master) since Jan '22

Yearly Commits

Commits to the default branch (master) per year

Issue History

No issues have been posted

Languages

The only known language in this repository is Python

updated: 2024-09-23 @ 01:18am, id: 496950762 / R_kgDOHZ7d6g