Search Results - RepositoryStats

289

4.2k

apache-2.0

32

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Created 2019-04-08

1,594 commits to master branch, last one about a month ago

sumy miso-belica

529

3.6k

apache-2.0

113

Module for automatic summarization of text documents and HTML pages.

lsa nlp sumy python summary html-page reduction summarizer textteaser summarization html-extractor html-extraction text-extraction pagerank-algorithm

Created 2013-02-20

456 commits to main branch, last one 11 months ago

unipdf unidoc

264

2.8k

other

29

Golang PDF library for creating and processing PDF files (pure go)

pdf golang signing pdf-sign pdf-reader pdf-library pdf-reports pdf-generator pdf-generation pdf-compression text-extraction pdf-manipulation pdf-document-processor

Created 2019-05-16

1,845 commits to master branch, last one 29 days ago

kreuzberg Goldziher

63

1.8k

mit

10

A text extraction library supporting PDFs, images, office documents and more

ocr pdf docx asyncio text-extraction

Created 2025-01-31

179 commits to main branch, last one 12 days ago

tika-python chrismattmann

241

1.6k

apache-2.0

38

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Created 2014-06-26

495 commits to master branch, last one 7 days ago

image-text-localization-recognition whitelok

233

952

unknown

75

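A general list of resources for image text localization and recognition. A collection of paper resources and implementations for scene text localization and recognition. A summary of paper resources for scene text localization and recognition.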

ocr awesome scene-texts deep-learning text-detection text-extraction machine-learning text-recognition deep-learning-algorithms convolutional-neural-networks

Created 2017-02-09

163 commits to master branch, last one about a year ago

jusText miso-belica

84

766

bsd-2-clause

20

Heuristic based boilerplate removal tool

python html-parser html-parsing text-extraction

Created 2013-02-10

191 commits to main branch, last one about a month ago

datashare ICIJ

57

628

agpl-3.0

28

A self-hosted search engine for documents.

docker extract web-gui datashare elasticsearch text-extraction investigative-journalism named-entity-recognition

Created 2016-04-20

4,523 commits to main branch, last one 3 days ago

pdftools ropensci

71

533

other

28

Text Extraction, Rendering and Converting of PDF Documents

r rstats poppler pdftools pdf-files r-package pdf-format poppler-library text-extraction

Created 2016-02-23

316 commits to master branch, last one about a month ago

srt cdown

46

502

mit

16

A simple library and set of tools for parsing, modifying, and composing SRT files.

srt tools python library subtitle subtitles mit-license command-line subtitle-fixer subtitle-parser text-extraction command-line-tool subtitles-parsing

Created 2014-12-28

710 commits to develop branch, last one about a year ago

fundus flairNLP

88

371

mit

7

A very simple news crawler with a funny name

nlp rss corpus python cc-news crawler scraper sitemap datasets web-corpus commoncrawl corpus-tools news-crawler web-scraping news-scraping text-extraction image-extraction image-classification

Created 2022-10-28

2,777 commits to master branch, last one 4 days ago

vision-parse iamarunbrahma

45

345

mit

4

Parse PDFs into markdown using Vision LLMs

pdf-parser document-parser pdf-to-markdown text-extraction

Created 2024-12-16

112 commits to main branch, last one 2 months ago

pd3f pd3f

39

314

agpl-3.0

7

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

ocr pdf pd3f parsr python pipeline pdf-to-text extract-text language-model text-extraction machine-learning

Created 2020-05-23

86 commits to master branch, last one 4 years ago

benchmarks py-pdf

15

271

bsd-3-clause

5

Benchmarking PDF libraries

pdf mupdf pypdf2 benchmark poppler-utils data-extraction text-extraction

Created 2022-05-08

57 commits to main branch, last one about a year ago

breadability bookieio

25

204

bsd-2-clause

21

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)

python text-mining html-parsing html-extractor html-extraction text-extraction

Created 2012-05-03

227 commits to master branch, last one 11 months ago

hotpdf weareprestatech

9

186

mit

3

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

pdf python text-search text-extraction

Created 2024-01-12

467 commits to main branch, last one 4 months ago

extend SapienzaNLP

13

181

other

5

Entity Disambiguation as text extraction (ACL 2022)

acl nlp acl2022 pytorch entity-linking text-extraction entity-disambiguation natural-language-processing entity-disambiguation-models

Created 2022-03-22

9 commits to main branch, last one 3 years ago

CUTIE vsymbol

77

157

unknown

16

CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

deep-learning computer-vision text-extraction

Created 2019-01-15

15 commits to master branch, last one 4 years ago

aut archivesunleashed

32

143

apache-2.0

14

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

scala spark hadoop pyspark python3 analysis big-data dataframe webarchives apache-spark text-extraction network-graphing big-data-analytics digital-humanities

Created 2017-07-06

1,032 commits to main branch, last one about a year ago

PDFIO.jl sambitdash

14

131

other

4

PDF Reader Library for Native Julia.

pdf julia pdf-files pdf-library pdf-document pdf-development text-extraction pdf-specification

Created 2017-05-28

374 commits to master branch, last one 2 months ago

php-apache-tika vaites

23

116

mit

5

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

ocr tika apache php-library text-extraction text-recognition

Created 2015-08-30

375 commits to master branch, last one about a month ago

ocr victorqribeiro

9

106

mit

2

Simple app to extract text from pictures using Tesseract

ocr tesseract text-extraction text-recognition image-recognition

Created 2019-12-27

9 commits to master branch, last one 3 years ago

cat lu4p

16

100

unlicense

5

Extract text from plaintext, .docx, .odt and .rtf files. Pure go.

go cat golang odt2txt pdf2txt docx2txt pdftotext rtf-to-text extract-text cross-platform textextracting text-extraction

Created 2019-03-02

81 commits to master branch, last one about a year ago

pdf-text-data-extractor nainiayoub

49

87

unknown

4

PDF text data extraction web app with OCR for scanned documents

ocr pdf python streamlit ocr-python pdf-to-text ocr-text-reader text-extraction streamlit-webapp

Created 2022-05-13

46 commits to main branch, last one 10 months ago

docwire docwire

18

83

other

6

DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boo...

Created 2023-05-29

1,375 commits to master branch, last one 8 days ago

office-text-extractor gamemaker1

7

75

isc

2

Yet another library to extract text from MS Office and PDF files

pdf docx pptx xlsx parser ms-word get-text ms-excel ms-office ms-powerpoint text-extraction

Created 2021-03-04

86 commits to main branch, last one 9 months ago

pdf-to-markdown iamarunbrahma

7

73

mit

3

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced info...

rag python pdf-parsing pdf-converter pdf-extraction pdf-to-markdown text-extraction document-conversion document-processing information-retrieval retrieval-augmented-generation

Created 2024-09-10

26 commits to main branch, last one 5 months ago

any-text abhinaba-ghosh

11

65

mit

2

Get text content from any file

text reader file-reader text-extractor text-extraction

Created 2020-07-08

38 commits to master branch, last one 2 years ago

mobi iscc

9

62

gpl-3.0

1

Python-based software to unpack kindlegen-generated ebooks

mobi kindle text-extraction

Created 2020-03-02

29 commits to master branch, last one 2 years ago

doc-intelligence-in-a-box Azure-Samples

7

34

mit

11

The Doc Intelligence in-a-Box project leverages Azure AI Document Intelligence to extract data from PDF forms and store the data in an Azure Cosmos DB. This solution, part of the AI-in-a-Box framework ...

ai azd azure accelerator azd-templates form-analysis text-extraction cognitive-services document-intelligence

Created 2024-06-14

24 commits to main branch, last one 2 months ago