esbatmop / MNBVC

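MNBVC (Massive Never-ending BT Vast Chinese corpus) is an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, including text written in "Martian language" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.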

Date Created 2022-12-31 (about a year ago)
Commits 222 (last one 18 hours ago)
Stargazers 3,085 (11 this week)
Watchers 61 (0 this week)
Forks 214
License MIT
Ranking

RepositoryStats indexes 523,840 repositories. Of these, esbatmop/MNBVC is ranked #15,125 (97th percentile) for total stargazers and #32,343 for total watchers.

esbatmop/MNBVC is also tagged with popular topics, and is ranked within each: nlp (#115/2227), chinese (#47/357)

Other Information

esbatmop/MNBVC has 1 open pull request on GitHub, and 1 pull request has been merged over the lifetime of the repository.

GitHub issues are enabled; there are 16 open issues and 36 closed issues.

Star History

GitHub stargazers over time

Watcher History

GitHub watchers over time, collection started in '23

Recent Commit History

222 commits on the default branch (main) since Jan '22

Yearly Commits

Commits to the default branch (main) per year

Issue History

Languages

We don't have any language data for this repository

updated: 2024-05-31 @ 08:47pm, id: 583824526 / R_kgDOIsx0jg