esbatmop / MNBVC

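MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even "Martian script" (火星文) data. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.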

Date Created 2022-12-31 (about a year ago)
Commits 255 (last one 4 days ago)
Stargazers 3,592 (11 this week)
Watchers 66 (0 this week)
Forks 249
License MIT

Ranking

RepositoryStats indexes 595,890 repositories. Of these, esbatmop/MNBVC is ranked #13,846 (98th percentile) for total stargazers and #29,968 for total watchers.

esbatmop/MNBVC is also tagged with popular topics, where it is ranked: nlp (#108/2430), chinese (#51/407)

Other Information

esbatmop/MNBVC has 1 open pull request on GitHub; 1 pull request has been merged over the lifetime of the repository.

GitHub issues are enabled; there are 17 open issues and 38 closed issues.

Star History

GitHub stargazers over time

Watcher History

GitHub watchers over time; collection started in '23

Recent Commit History

255 commits on the default branch (main) since Jan '22

Yearly Commits

Commits to the default branch (main) per year

Issue History

Languages

We don't have any language data for this repository

updated: 2024-12-20 @ 09:55am, id: 583824526 / R_kgDOIsx0jg