30 results found

280
3.9k
apache-2.0
31
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Created 2019-04-08
1,592 commits to master branch, last one a day ago
530
3.5k
apache-2.0
113
Module for automatic summarization of text documents and HTML pages.
Created 2013-02-20
456 commits to main branch, last one 9 months ago
258
2.7k
other
30
Golang PDF library for creating and processing PDF files (pure Go)
Created 2019-05-16
1,841 commits to master branch, last one 23 days ago
239
1.5k
apache-2.0
39
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Created 2014-06-26
475 commits to master branch, last one about a year ago
A text extraction library supporting PDFs, images, office documents and more
Created 2025-01-31
92 commits to main branch, last one 2 days ago
A general list of resources for image (scene) text localization and recognition: a collection of papers and implementations
Created 2017-02-09
163 commits to master branch, last one about a year ago
82
746
bsd-2-clause
21
Heuristic-based boilerplate removal tool
Created 2013-02-10
187 commits to main branch, last one 9 months ago
55
611
agpl-3.0
29
A self-hosted search engine for documents.
Created 2016-04-20
4,395 commits to main branch, last one a day ago
71
528
other
29
Text Extraction, Rendering and Converting of PDF Documents
Created 2016-02-23
314 commits to master branch, last one 5 months ago
43
487
mit
17
A simple library and set of tools for parsing, modifying, and composing SRT files.
Created 2014-12-28
710 commits to develop branch, last one about a year ago
82
330
mit
7
A very simple news crawler with a funny name
Created 2022-10-28
2,667 commits to master branch, last one 3 days ago
40
311
agpl-3.0
8
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Created 2020-05-23
86 commits to master branch, last one 3 years ago
Parse PDFs into markdown using Vision LLMs
Created 2024-12-16
112 commits to main branch, last one 10 days ago
15
254
bsd-3-clause
5
Benchmarking PDF libraries
Created 2022-05-08
57 commits to main branch, last one about a year ago
25
204
bsd-2-clause
22
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Created 2012-05-03
227 commits to master branch, last one 9 months ago
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
Created 2024-01-12
467 commits to main branch, last one 2 months ago
13
179
other
6
Entity Disambiguation as text extraction (ACL 2022)
Created 2022-03-22
9 commits to main branch, last one 2 years ago
78
157
unknown
16
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Created 2019-01-15
15 commits to master branch, last one 4 years ago
32
141
apache-2.0
15
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Created 2017-07-06
1,032 commits to main branch, last one 11 months ago
PDF Reader Library for Native Julia.
Created 2017-05-28
374 commits to master branch, last one 13 days ago
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Created 2015-08-30
360 commits to master branch, last one 8 months ago
Simple app to extract text from pictures using Tesseract
Created 2019-12-27
9 commits to master branch, last one 3 years ago
16
96
unlicense
6
Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.
Created 2019-03-02
81 commits to master branch, last one about a year ago
PDF text data extraction web app with OCR for scanned documents
Created 2022-05-13
46 commits to main branch, last one 8 months ago
18
78
other
6
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boo...
Created 2023-05-29
1,295 commits to master branch, last one 27 days ago
Yet another library to extract text from MS Office and PDF files
Created 2021-03-04
86 commits to main branch, last one 7 months ago
Get text content from any file
Created 2020-07-08
38 commits to master branch, last one about a year ago
9
61
gpl-3.0
2
Python-based software to unpack kindlegen-generated ebooks
Created 2020-03-02
29 commits to master branch, last one 2 years ago
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced info...
Created 2024-09-10
26 commits to main branch, last one 2 months ago
The Doc Intelligence in-a-Box project leverages Azure AI Document Intelligence to extract data from PDF forms and store the data in an Azure Cosmos DB. This solution, part of the AI-in-a-Box framework ...
Created 2024-06-14
24 commits to main branch, last one 18 days ago