26 results found

258
3.6k
apache-2.0
31
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Created 2019-04-08
1,570 commits to master branch, last one 2 days ago
530
3.5k
apache-2.0
113
Module for automatic summarization of text documents and HTML pages.
Created 2013-02-20
456 commits to main branch, last one 5 months ago
255
2.6k
other
30
Golang PDF library for creating and processing PDF files (pure go)
Created 2019-05-16
1,835 commits to master branch, last one 15 days ago
234
1.5k
apache-2.0
39
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Created 2014-06-26
475 commits to master branch, last one about a year ago
A general list of resources for image text localization and recognition (a collection of papers, resources, and implementations for scene text localization and recognition)
Created 2017-02-09
163 commits to master branch, last one about a year ago
79
726
bsd-2-clause
21
Heuristic based boilerplate removal tool
Created 2013-02-10
187 commits to main branch, last one 6 months ago
53
596
agpl-3.0
29
A self-hosted search engine for documents.
Created 2016-04-20
4,215 commits to main branch, last one a day ago
71
523
other
29
Text Extraction, Rendering and Converting of PDF Documents
Created 2016-02-23
314 commits to master branch, last one about a month ago
44
473
mit
17
A simple library and set of tools for parsing, modifying, and composing SRT files.
Created 2014-12-28
710 commits to develop branch, last one about a year ago
40
298
agpl-3.0
7
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Created 2020-05-23
86 commits to master branch, last one 3 years ago
74
288
mit
7
A very simple news crawler with a funny name
Created 2022-10-28
2,367 commits to master branch, last one a day ago
11
222
bsd-3-clause
5
Benchmarking PDF libraries
Created 2022-05-08
57 commits to main branch, last one about a year ago
26
205
bsd-2-clause
22
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Created 2012-05-03
227 commits to master branch, last one 6 months ago
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
Created 2024-01-12
464 commits to main branch, last one 7 months ago
13
177
other
6
Entity Disambiguation as text extraction (ACL 2022)
Created 2022-03-22
9 commits to main branch, last one 2 years ago
78
154
unknown
16
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Created 2019-01-15
15 commits to master branch, last one 3 years ago
33
137
apache-2.0
15
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Created 2017-07-06
1,032 commits to main branch, last one 8 months ago
PDF Reader Library for Native Julia.
Created 2017-05-28
369 commits to master branch, last one about a year ago
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Created 2015-08-30
360 commits to master branch, last one 5 months ago
Simple app to extract text from pictures using Tesseract
Created 2019-12-27
9 commits to master branch, last one 3 years ago
16
93
unlicense
5
Extract text from plaintext, .docx, .odt and .rtf files. Pure go.
Created 2019-03-02
81 commits to master branch, last one about a year ago
PDF text data extraction web app with OCR for scanned documents
Created 2022-05-13
46 commits to main branch, last one 5 months ago
15
66
other
5
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boo...
Created 2023-05-29
1,188 commits to master branch, last one 22 days ago
9
61
gpl-3.0
2
Python-based software to unpack kindlegen-generated ebooks
Created 2020-03-02
29 commits to master branch, last one about a year ago
Get text content from any file
Created 2020-07-08
38 commits to master branch, last one about a year ago
Yet another library to extract text from MS Office and PDF files
Created 2021-03-04
86 commits to main branch, last one 3 months ago