7 results found

262
3.7k
apache-2.0
32
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Created 2019-04-08
1,576 commits to master branch, last one a day ago
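Text mining and preprocessing tools (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods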
Created 2018-11-19
98 commits to master branch, last one 6 months ago
79
957
other
14
🧹 Python package for text cleaning
Created 2018-12-06
83 commits to main branch, last one 2 years ago
16
300
apache-2.0
0
E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M o...
Created 2024-08-04
190 commits to main branch, last one 2 months ago
26
246
unknown
11
Tools for cleaning and normalizing text data
Created 2016-01-07
231 commits to master branch, last one 3 years ago
A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package
Created 2020-12-04
434 commits to master branch, last one about a year ago
Grammarify is an npm package that safely cleans up text that has misspellings, improper capitalization, and lexical illusions, among other things.
Created 2018-04-22
31 commits to master branch, last one 2 years ago