32 results found

289
4.2k
apache-2.0
32
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Created 2019-04-08
1,594 commits to master branch, last one about a month ago
529
3.6k
apache-2.0
113
Module for automatic summarization of text documents and HTML pages.
Created 2013-02-20
456 commits to main branch, last one 11 months ago
264
2.8k
other
29
Golang PDF library for creating and processing PDF files (pure go)
Created 2019-05-16
1,845 commits to master branch, last one 29 days ago
A text extraction library supporting PDFs, images, office documents and more
Created 2025-01-31
179 commits to main branch, last one 12 days ago
241
1.6k
apache-2.0
38
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Created 2014-06-26
495 commits to master branch, last one 7 days ago
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
Created 2017-02-09
163 commits to master branch, last one about a year ago
84
766
bsd-2-clause
20
Heuristic-based boilerplate removal tool
Created 2013-02-10
191 commits to main branch, last one about a month ago
57
628
agpl-3.0
28
A self-hosted search engine for documents.
Created 2016-04-20
4,523 commits to main branch, last one 3 days ago
71
533
other
28
Text Extraction, Rendering and Converting of PDF Documents
Created 2016-02-23
316 commits to master branch, last one about a month ago
46
502
mit
16
A simple library and set of tools for parsing, modifying, and composing SRT files.
Created 2014-12-28
710 commits to develop branch, last one about a year ago
88
371
mit
7
A very simple news crawler with a funny name
Created 2022-10-28
2,777 commits to master branch, last one 4 days ago
Parse PDFs into markdown using Vision LLMs
Created 2024-12-16
112 commits to main branch, last one 2 months ago
39
314
agpl-3.0
7
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Created 2020-05-23
86 commits to master branch, last one 4 years ago
15
271
bsd-3-clause
5
Benchmarking PDF libraries
Created 2022-05-08
57 commits to main branch, last one about a year ago
25
204
bsd-2-clause
21
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Created 2012-05-03
227 commits to master branch, last one 11 months ago
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
Created 2024-01-12
467 commits to main branch, last one 4 months ago
13
181
other
5
Entity Disambiguation as text extraction (ACL 2022)
Created 2022-03-22
9 commits to main branch, last one 3 years ago
77
157
unknown
16
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Created 2019-01-15
15 commits to master branch, last one 4 years ago
32
143
apache-2.0
14
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Created 2017-07-06
1,032 commits to main branch, last one about a year ago
PDF Reader Library for Native Julia.
Created 2017-05-28
374 commits to master branch, last one 2 months ago
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Created 2015-08-30
375 commits to master branch, last one about a month ago
Simple app to extract text from pictures using Tesseract
Created 2019-12-27
9 commits to master branch, last one 3 years ago
16
100
unlicense
5
Extract text from plaintext, .docx, .odt and .rtf files. Pure go.
Created 2019-03-02
81 commits to master branch, last one about a year ago
PDF text data extraction web app with OCR for scanned documents
Created 2022-05-13
46 commits to main branch, last one 10 months ago
18
83
other
6
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boo...
Created 2023-05-29
1,375 commits to master branch, last one 8 days ago
Yet another library to extract text from MS Office and PDF files
Created 2021-03-04
86 commits to main branch, last one 9 months ago
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced info...
Created 2024-09-10
26 commits to main branch, last one 5 months ago
Get text content from any file
Created 2020-07-08
38 commits to master branch, last one 2 years ago
9
62
gpl-3.0
1
Python-based software to unpack kindlegen-generated ebooks
Created 2020-03-02
29 commits to master branch, last one 2 years ago
The Doc Intelligence in-a-Box project leverages Azure AI Document Intelligence to extract data from PDF forms and store the data in an Azure Cosmos DB. This solution, part of the AI-in-a-Box framework ...
Created 2024-06-14
24 commits to main branch, last one 2 months ago