30 results found
- Filter by Primary Language:
- Python (15)
- HTML (3)
- C++ (2)
- Go (2)
- PHP (1)
- Scala (1)
- Bicep (1)
- TypeScript (1)
- Java (1)
- JavaScript (1)
- Julia (1)
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Created
2019-04-08
1,592 commits to master branch, last one a day ago
Module for automatic summarization of text documents and HTML pages.
Created
2013-02-20
456 commits to main branch, last one 9 months ago
Golang PDF library for creating and processing PDF files (pure Go)
Created
2019-05-16
1,841 commits to master branch, last one 23 days ago
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Created
2014-06-26
475 commits to master branch, last one about a year ago
A text extraction library supporting PDFs, images, office documents and more
Created
2025-01-31
92 commits to main branch, last one 2 days ago
A general list of resources for image text localization and recognition (a collection of paper resources and implementations for scene text)
Created
2017-02-09
163 commits to master branch, last one about a year ago
Heuristic-based boilerplate removal tool
Created
2013-02-10
187 commits to main branch, last one 9 months ago
A self-hosted search engine for documents.
Created
2016-04-20
4,395 commits to main branch, last one a day ago
Text Extraction, Rendering and Converting of PDF Documents
Created
2016-02-23
314 commits to master branch, last one 5 months ago
A simple library and set of tools for parsing, modifying, and composing SRT files.
Created
2014-12-28
710 commits to develop branch, last one about a year ago
A very simple news crawler with a funny name
Created
2022-10-28
2,667 commits to master branch, last one 3 days ago
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Created
2020-05-23
86 commits to master branch, last one 3 years ago
Parse PDFs into markdown using Vision LLMs
Created
2024-12-16
112 commits to main branch, last one 10 days ago
Benchmarking PDF libraries
Created
2022-05-08
57 commits to main branch, last one about a year ago
A reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the live alternative)
Created
2012-05-03
227 commits to master branch, last one 9 months ago
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
Created
2024-01-12
467 commits to main branch, last one 2 months ago
Entity Disambiguation as text extraction (ACL 2022)
Created
2022-03-22
9 commits to main branch, last one 2 years ago
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Created
2019-01-15
15 commits to master branch, last one 4 years ago
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Created
2017-07-06
1,032 commits to main branch, last one 11 months ago
PDF Reader Library for Native Julia.
Created
2017-05-28
374 commits to master branch, last one 13 days ago
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Created
2015-08-30
360 commits to master branch, last one 8 months ago
Simple app to extract text from pictures using Tesseract
Created
2019-12-27
9 commits to master branch, last one 3 years ago
Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.
Created
2019-03-02
81 commits to master branch, last one about a year ago
PDF text data extraction web app with OCR for scanned documents
Created
2022-05-13
46 commits to main branch, last one 8 months ago
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boo...
Created
2023-05-29
1,295 commits to master branch, last one 27 days ago
Yet another library to extract text from MS Office and PDF files
Created
2021-03-04
86 commits to main branch, last one 7 months ago
Get text content from any file
Created
2020-07-08
38 commits to master branch, last one about a year ago
Python-based software to unpack kindlegen-generated ebooks
Created
2020-03-02
29 commits to master branch, last one 2 years ago
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced info...
Created
2024-09-10
26 commits to main branch, last one 2 months ago
The Doc Intelligence in-a-Box project leverages Azure AI Document Intelligence to extract data from PDF forms and store the data in an Azure Cosmos DB. This solution, part of the AI-in-a-Box framework ...
Created
2024-06-14
24 commits to main branch, last one 18 days ago