30 results found

287
4.1k
apache-2.0
31
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Created 2019-04-08
1,594 commits to master branch, last one 4 days ago
529
3.6k
apache-2.0
113
Module for automatic summarization of text documents and HTML pages.
Created 2013-02-20
456 commits to main branch, last one 10 months ago
263
2.7k
other
29
Golang PDF library for creating and processing PDF files (pure Go)
Created 2019-05-16
1,843 commits to master branch, last one 27 days ago
A text extraction library supporting PDFs, images, office documents and more
Created 2025-01-31
127 commits to main branch, last one 16 days ago
240
1.6k
apache-2.0
38
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Created 2014-06-26
475 commits to master branch, last one about a year ago
A general list of resources, covering papers and implementations, for scene text localization and recognition in images
Created 2017-02-09
163 commits to master branch, last one about a year ago
83
764
bsd-2-clause
20
Heuristic based boilerplate removal tool
Created 2013-02-10
191 commits to main branch, last one 24 days ago
56
621
agpl-3.0
28
A self-hosted search engine for documents.
Created 2016-04-20
4,478 commits to main branch, last one 2 days ago
71
531
other
28
Text Extraction, Rendering and Converting of PDF Documents
Created 2016-02-23
316 commits to master branch, last one 18 days ago
45
495
mit
16
A simple library and set of tools for parsing, modifying, and composing SRT files.
Created 2014-12-28
710 commits to develop branch, last one about a year ago
85
359
mit
7
A very simple news crawler with a funny name
Created 2022-10-28
2,691 commits to master branch, last one 10 days ago
Parse PDFs into markdown using Vision LLMs
Created 2024-12-16
112 commits to main branch, last one about a month ago
39
314
agpl-3.0
7
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Created 2020-05-23
86 commits to master branch, last one 3 years ago
15
266
bsd-3-clause
5
Benchmarking PDF libraries
Created 2022-05-08
57 commits to main branch, last one about a year ago
25
204
bsd-2-clause
21
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Created 2012-05-03
227 commits to master branch, last one 10 months ago
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
Created 2024-01-12
467 commits to main branch, last one 3 months ago
13
181
other
5
Entity Disambiguation as text extraction (ACL 2022)
Created 2022-03-22
9 commits to main branch, last one 2 years ago
77
157
unknown
16
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Created 2019-01-15
15 commits to master branch, last one 4 years ago
32
142
apache-2.0
14
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Created 2017-07-06
1,032 commits to main branch, last one about a year ago
PDF Reader Library for Native Julia.
Created 2017-05-28
374 commits to master branch, last one about a month ago
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Created 2015-08-30
375 commits to master branch, last one 5 days ago
Simple app to extract text from pictures using Tesseract
Created 2019-12-27
9 commits to master branch, last one 3 years ago
16
97
unlicense
5
Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.
Created 2019-03-02
81 commits to master branch, last one about a year ago
PDF text data extraction web app with OCR for scanned documents
Created 2022-05-13
46 commits to main branch, last one 9 months ago
18
80
other
6
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boo...
Created 2023-05-29
1,327 commits to master branch, last one 19 days ago
Yet another library to extract text from MS Office and PDF files
Created 2021-03-04
86 commits to main branch, last one 8 months ago
Get text content from any file
Created 2020-07-08
38 commits to master branch, last one about a year ago
9
62
gpl-3.0
1
Python-based software to unpack kindlegen-generated ebooks
Created 2020-03-02
29 commits to master branch, last one 2 years ago
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced info...
Created 2024-09-10
26 commits to main branch, last one 3 months ago
The Doc Intelligence in-a-Box project leverages Azure AI Document Intelligence to extract data from PDF forms and store the data in an Azure Cosmos DB. This solution, part of the AI-in-a-Box framework ...
Created 2024-06-14
24 commits to main branch, last one about a month ago