Search Results - RepositoryStats

2.3k

29.3k

agpl-3.0

146

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON formats.

ocr pdf parser python ai4science pdf-parser extract-data pdf-converter layout-analysis document-analysis pdf-extractor-llm pdf-extractor-rag pdf-extractor-pretrain

Created 2024-02-29

2,453 commits to master branch, last one 6 days ago

PdfPig UglyToad

254

1.9k

apache-2.0

48

Read and extract text and other content from PDFs in C# (port of PDFBox)

pdf hocr csharp pdfbox alto-xml page-xml pdf-files netstandard pdf-document pdf-extractor pdf-generation layout-analysis document-analysis pdf-document-processor

Created 2017-11-09

1,636 commits to master branch, last one 3 days ago

AdvancedLiterateMachinery AlibabaResearch

190

1.7k

apache-2.0

40

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

Created 2022-09-28

69 commits to main branch, last one 3 months ago

awesome-document-understanding tstanislawek

160

1.4k

unknown

37

A curated list of resources on the topic of Document Understanding (DU)

Created 2021-04-06

76 commits to main branch, last one about a year ago

documind DocumindHQ

44

1.3k

other

10

Open-source platform for extracting structured data from documents using AI.

ai ocr pdf llms parser open-source extract-data pdf-converter pdf-extractor developer-tools document-analysis pdf-extractor-llm document-extraction

Created 2024-11-17

61 commits to main branch, last one about a month ago

PICK-pytorch wenwenyu

192

564

mit

22

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

graph-learning document-analysis graph-neural-networks document-understanding key-information-extraction graph-convolutional-network

Created 2020-07-15

52 commits to master branch, last one 2 years ago

LiLT jpWang

41

347

mit

6

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

nlp document-ai document-analysis multilingual-models document-understanding information-extraction multimodal-pre-trained-model

Created 2022-03-01

13 commits to main branch, last one 2 years ago

assemblyline CybercentreCanada

18

295

mit

8

AssemblyLine 4: File triage and malware analysis

cert infosec malware python3 framework assemblyline cybersecurity file-analysis cyber-security security-tools malware-analysis malware-analyzer malware-research document-analysis incident-response malware-detection security-automation automation-framework security-automation-framework

Created 2020-12-03

226 commits to master branch, last one about a month ago

llmdocparser lazyFrogLOL

8

266

mit

3

A package for parsing PDFs and analyzing their content using LLMs.

llm nlp ocr rag chunking pdfparser pdf-parser text-chunking document-analysis

Created 2024-07-26

28 commits to main branch, last one 7 months ago

pandora pandora-analysis

41

259

agpl-3.0

9

Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results

infosec document-analysis malware-detection document-analyzing

Created 2021-09-21

1,037 commits to main branch, last one 3 days ago

dedoc ispras

25

225

apache-2.0

12

Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic...

doc ocr odt pdf txt docx html excel documents pdf-parser docx-parser html-parser document-analysis scanned-documents table-of-contents table-recognition document-content-extraction logical-structure-extraction

Created 2020-12-07

297 commits to master branch, last one 3 months ago

robin masyagin1998

38

180

mit

11

RObust document image BINarization

ocr keras u-net opencv python deep-learning computer-vision neural-networks document-analysis document-binarization

Created 2018-10-25

87 commits to master branch, last one 2 years ago

local_adaptive_binarization chriswolfvision

25

126

unknown

10

Local adaptive image binarization

computer-vision document-analysis document-binarization

Created 2017-07-24

6 commits to master branch, last one 6 years ago

Retrieval-Augmented-Generation-Engine-with-LangChain-and-Streamlit mirabdullahyaser

59

119

unknown

3

Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it i...

gpt-3 langchain streamlit generative-ai openai-chatgpt chat-application document-analysis question-answering large-language-models artificial-intelligence natural-language-processing retrieval-augmented-generation

Created 2023-08-21

23 commits to master branch, last one about a year ago

yolo-doclaynet ppaanngggg

16

97

agpl-3.0

3

YOLO models trained on DocLayNet: power your document intelligence with layout analysis

yolo yolov8 doclaynet ultralytics layout-analysis document-analysis

Created 2024-05-13

40 commits to main branch, last one 18 days ago

amazon-textract-transformer-pipeline aws-samples

25

96

mit-0

24

Post-process Amazon Textract results with Hugging Face transformer models for document understanding

ocr amazon-textract document-analysis huggingface-transformers

Created 2021-09-14

159 commits to main branch, last one 8 months ago

docExtractor monniert

10

88

mit

7

(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper

pytorch segmentation historical-data document-analysis

Created 2020-04-09

30 commits to master branch, last one about a year ago

pydoxtools Xyntopia

12

81

mit

5

Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.

llm nlp pdf python chatgpt extraction document-analysis document-extraction information-retrieval

Created 2023-04-11

667 commits to main branch, last one 9 months ago

ViBERTgrid-PyTorch ZeningLin

5

54

unknown

4

An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"

document-ai document-analysis information-extraction key-information-extraction visual-information-extraction

Created 2021-11-08

180 commits to main branch, last one about a year ago

UTRNet-High-Resolution-Urdu-Text-Recognition abdur75648

10

50

other

6

UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)

ocr unet urdu hrnet icdar utrnet pytorch urdu-nlp urdu-ocr icdar2023 urdu-synth deep-learning text-detection computer-vision high-resolution machine-learning text-recognition document-analysis scene-text-recognition

Created 2023-06-29

20 commits to main branch, last one 5 months ago

detectron2-publaynet JPLeoRX

7

48

other

3

Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset

python python3 pytorch publaynet detectron2 faster-rcnn deep-learning neural-network computer-vision document-layout layout-analysis neural-networks machine-learning object-detection document-analysis instance-segmentation artificial-intelligence document-classification document-layout-analysis

Created 2021-09-14

33 commits to master branch, last one about a year ago

enhanced-document-understanding-on-aws aws-solutions

14

37

apache-2.0

16

Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates...

document-analysis document-processing

Created 2023-08-16

96 commits to main branch, last one 17 days ago

GNN-TableExtraction AILab-UniFI

5

35

other

5

Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"

document-analysis graph-neural-networks

Created 2022-04-15

4 commits to main branch, last one 2 years ago

DocTer lin-tan

4

34

other

3

For our ISSTA22 paper "DocTer: Documentation-Guided Fuzzing for Testing Deep Learning API Functions" by Danning Xie, Yitong Li, Mijung Kim, Hung Viet Pham, Lin Tan, Xiangyu Zhang, Mike Godfrey

fuzzing testing deep-learning document-analysis software-reliability software-text-analytics natural-language-processing

Created 2022-05-25

10 commits to main branch, last one 2 years ago

synthetic-rag-index microsoft

5

29

apache-2.0

3

Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless archi...

llm rag azure serverless document-analysis few-shot-learning large-language-model retrieval-augmented-generation

Created 2024-06-07

95 commits to main branch, last one 8 months ago

publaynet-models CaseDrive

2

27

other

2

Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset

python python3 pytorch publaynet detectron2 faster-rcnn deep-learning neural-network computer-vision document-layout layout-analysis neural-networks machine-learning object-detection document-analysis instance-segmentation artificial-intelligence document-classification document-layout-analysis

Created 2023-04-16

32 commits to master branch, last one 2 years ago