13 results found

220
3.2k
mit
62
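MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, intended to match the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) text. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.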
Created 2022-12-31
233 commits to main branch, last one 11 days ago
132
799
gpl-3.0
7
ChatGPT Chinese corpus: dialogue, fiction, and customer-service corpora for training large models
Created 2023-04-26
31 commits to main branch, last one about a month ago
82
659
mit
10
A curated corpus of modern Chinese poetry: 3,489 poets, 81.7K poems, 15.43M characters. Continuously expanding...
Created 2019-04-16
1,863 commits to master branch, last one 11 months ago
58
484
unknown
24
A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
Created 2018-11-22
383 commits to master branch, last one about a year ago
60
337
unknown
8
Chinese NLP corpus of Chinese science fiction: about 4,675 Chinese science fiction novels, usable as a natural language processing corpus, text corpus, or text database of Chinese science fiction
Created 2020-07-31
5 commits to master branch, last one about a year ago
21
255
cc-by-4.0
13
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Created 2021-01-19
114 commits to main branch, last one 4 months ago
21
85
unknown
3
Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan (方舟计划) of the Ula Science Fiction Website (乌拉科幻小说网), usable as a natural language processing corpus, text corpus, or text database of Chinese science fiction
Created 2020-05-04
8 commits to master branch, last one about a year ago
Utilities for Processing the Switchboard Dialogue Act Corpus
Created 2018-11-14
48 commits to master branch, last one 3 years ago
14
65
unknown
3
DANeS is an open-source e-newspaper dataset built in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Created 2021-11-09
40 commits to main branch, last one 2 years ago
Text deduplication
Created 2023-02-15
67 commits to main branch, last one 2 months ago
Biomedical NLP corpora and datasets.
Created 2020-10-07
28 commits to main branch, last one 2 years ago
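Crawls the comments under bilibili videos. Newly released! ⚠ This code is for learning purposes only; the author accepts no responsibility for any other use.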
Created 2022-12-13
15 commits to main branch, last one about a year ago
Cantonese text filter (粵文語料篩選器): a filter for Cantonese-language corpus text
Created 2022-04-03
33 commits to main branch, last one 3 days ago