26 results found
- Filter by Primary Language:
- Python (12)
- HTML (3)
- Go (2)
- C++ (2)
- PHP (1)
- Scala (1)
- TypeScript (1)
- Java (1)
- JavaScript (1)
- Julia (1)
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Created
2019-04-08
1,570 commits to master branch, last one 2 days ago
Module for automatic summarization of text documents and HTML pages.
Created
2013-02-20
456 commits to main branch, last one 5 months ago
Golang PDF library for creating and processing PDF files (pure Go)
Created
2019-05-16
1,835 commits to master branch, last one 15 days ago
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Created
2014-06-26
475 commits to master branch, last one about a year ago
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
Created
2017-02-09
163 commits to master branch, last one about a year ago
Heuristic-based boilerplate removal tool
Created
2013-02-10
187 commits to main branch, last one 6 months ago
A self-hosted search engine for documents.
Created
2016-04-20
4,215 commits to main branch, last one a day ago
Text Extraction, Rendering and Converting of PDF Documents
Created
2016-02-23
314 commits to master branch, last one about a month ago
A simple library and set of tools for parsing, modifying, and composing SRT files.
Created
2014-12-28
710 commits to develop branch, last one about a year ago
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Created
2020-05-23
86 commits to master branch, last one 3 years ago
A very simple news crawler with a funny name
Created
2022-10-28
2,367 commits to master branch, last one a day ago
Benchmarking PDF libraries
Created
2022-05-08
57 commits to main branch, last one about a year ago
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Created
2012-05-03
227 commits to master branch, last one 6 months ago
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
Created
2024-01-12
464 commits to main branch, last one 7 months ago
Entity Disambiguation as text extraction (ACL 2022)
Created
2022-03-22
9 commits to main branch, last one 2 years ago
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Created
2019-01-15
15 commits to master branch, last one 3 years ago
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Created
2017-07-06
1,032 commits to main branch, last one 8 months ago
PDF Reader Library for Native Julia.
Created
2017-05-28
369 commits to master branch, last one about a year ago
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Created
2015-08-30
360 commits to master branch, last one 5 months ago
Simple app to extract text from pictures using Tesseract
Created
2019-12-27
9 commits to master branch, last one 3 years ago
Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.
Created
2019-03-02
81 commits to master branch, last one about a year ago
PDF text data extraction web app with OCR for scanned documents
Created
2022-05-13
46 commits to main branch, last one 5 months ago
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boo...
Created
2023-05-29
1,188 commits to master branch, last one 22 days ago
Python-based software to unpack kindlegen-generated ebooks
Created
2020-03-02
29 commits to master branch, last one about a year ago
Get text content from any file
Created
2020-07-08
38 commits to master branch, last one about a year ago
Yet another library to extract text from MS Office and PDF files
Created
2021-03-04
86 commits to main branch, last one 3 months ago