Search Results - RepositoryStats

ragflow infiniflow

4.4k

47.6k

apache-2.0

225

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

Created 2023-12-12

2,727 commits to main branch, last one a day ago

docling docling-project

1.5k

25.9k

mit

113

Get your documents ready for gen AI

ai pdf docx html pptx xlsx tables convert markdown documents pdf-to-json pdf-to-text pdf-converter document-parser document-parsing

Created 2024-07-09

431 commits to main branch, last one a day ago

unstructured Unstructured-IO

892

10.8k

apache-2.0

69

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Created 2022-09-26

1,711 commits to main branch, last one 2 days ago

llama_cloud_services run-llama

384

3.9k

mit

26

Knowledge Agents and Management in the Cloud

pdf pptx tables parsing document pdf-to-json pdf-to-text ppt-to-json pdf-to-excel document-parser pdf-to-markdown ppt-to-markdown structured-data document-parsing docx-to-markdown pdf-document-processor

Created 2024-01-31

263 commits to main branch, last one 16 hours ago

ExtractThinker enoch3712

114

1.2k

apache-2.0

20

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

ai llm nlp ocr pdf openai python langchain pdf-to-text document-parsing machine-learning document-processing document-intelligence document-image-analysis

Created 2024-02-01

391 commits to main branch, last one a day ago

pd3f pd3f

39

314

agpl-3.0

7

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

ocr pdf pd3f parsr python pipeline pdf-to-text extract-text language-model text-extraction machine-learning

Created 2020-05-23

86 commits to master branch, last one 4 years ago

pdf-text-data-extractor nainiayoub

49

87

unknown

4

PDF text data extraction web app with OCR for scanned documents

ocr pdf python streamlit ocr-python pdf-to-text ocr-text-reader text-extraction streamlit-webapp

Created 2022-05-13

46 commits to main branch, last one 10 months ago

markdrop shoryasethia

3

84

apache-2.0

1

A Python package for converting PDFs to markdown while extracting images and tables, and for generating text descriptions of the extracted tables/images using several LLM clients. And many more functio...

llm agents marker docling markdrop markitdown open-source pdf-to-text pypi-package image-to-text table-to-text pdf-to-markdown

Created 2024-12-24

57 commits to main branch, last one 7 days ago

adobe-pdf-library-samples datalogics

62

81

unknown

26

Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library

ocr pdf pdfa ocr-pdf pdf-lib pdf-split pdf-tools pdf-merger pdf-parser pdf-render pdf-to-text pdf-document pdf-to-image pdf-converter pdf-to-office pdf-conversion pdf-generation pdf-compression pdf-manipulation

Created 2017-03-28

247 commits to master branch, last one about a year ago

ocr-python NanoNets

14

79

mit

3

OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.

ocr pdf python textract tesseract pdf-to-csv pdf-to-json pdf-to-text extract-table image-to-text table-extract searchable-pdf pytesseract-ocr extract-text-from-pdf extract-text-from-image image-to-text-converter

Created 2022-08-04

27 commits to main branch, last one 2 years ago

pdf-text-extraction galkahana

20

78

apache-2.0

2

CLI for extracting text from PDF files (and possibly tables)

pdf pdf-to-text

Created 2020-09-28

89 commits to master branch, last one 13 days ago

Docotic.Pdf.Samples BitMiracle

39

76

unknown

10

C# and VB.NET samples for Docotic.Pdf library

Created 2017-12-13

559 commits to master branch, last one 29 days ago

papercast papercast-dev

1

49

mit

1

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines...

dag nlp tts arxiv grobid python podcast pipeline pdf-to-text pdf-converter document-parser document-parsing semantic-scholar pdf-document-processor

Created 2023-03-31

109 commits to main branch, last one 17 days ago

nocodefunctions-web-app seinecle

7

38

unknown

3

The code base of the front-end of nocodefunctions.com

nlp java nocode webapp pdf2text pdf-to-text text-mining data-science jakarta-faces topic-modeling data-processing network-analysis sentiment-analysis

Created 2021-11-22

15 commits to main branch, last one 11 days ago

KITAB-Bench mbzuai-oryx

0

32

mit

1

A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

ocr vqa vlms arabic benchmark pdf-to-text table-detection layout-detection

Created 2025-02-20

60 commits to main branch, last one a day ago