22 results found
A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
Created: 2024-02-29
1,654 commits to master branch, last one 19 hours ago
Read and extract text and other content from PDFs in C# (port of PDFBox)
Created: 2017-11-09
1,593 commits to master branch, last one 4 days ago
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
ocr
document
documentai
multimodal
end-to-end-ocr
text-detection
computer-vision
vision-language
text-recognition
document-analysis
document-recognition
scene-text-detection
document-intelligence
vision-language-model
document-understanding
scene-text-recognition
artificial-intelligence
multimodal-deep-learning
vision-language-transformer
scene-text-detection-recognition
Created: 2022-09-28
62 commits to main branch, last one about a month ago
A curated list of resources on the topic of Document Understanding (DU)
nlp
ocr
pdf
rpa
awesome
document-ai
awesome-list
deep-learning
pdf-documents
machine-learning
document-analysis
unstructured-data
document-intelligence
document-understanding
information-extraction
intelligent-processing
document-layout-analysis
key-information-extraction
robotic-process-automation
natural-language-processing
Created: 2021-04-06
76 commits to main branch, last one about a year ago
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
Created: 2020-07-15
52 commits to master branch, last one 2 years ago
Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
Created: 2022-03-01
13 commits to main branch, last one 2 years ago
Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results
Created: 2021-09-21
965 commits to main branch, last one 2 days ago
AssemblyLine 4: File triage and malware analysis
Created: 2020-12-03
196 commits to master branch, last one 27 days ago
A package for parsing PDFs and analyzing their content using LLMs.
Created: 2024-07-26
28 commits to main branch, last one 3 months ago
RObust document image BINarization
Created: 2018-10-25
87 commits to master branch, last one 2 years ago
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic...
Created: 2020-12-07
295 commits to master branch, last one about a month ago
Local adaptive image binarization
Created: 2017-07-24
6 commits to master branch, last one 6 years ago
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it i...
Created: 2023-08-21
23 commits to master branch, last one 8 months ago
Post-process Amazon Textract results with Hugging Face transformer models for document understanding
Created: 2021-09-14
159 commits to main branch, last one 3 months ago
(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
Created: 2020-04-09
30 commits to master branch, last one about a year ago
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
Created: 2023-04-11
667 commits to main branch, last one 4 months ago
YOLO models trained on DocLayNet: power your document intelligence with layout analysis
Created: 2024-05-13
35 commits to main branch, last one about a month ago
An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"
Created: 2021-11-08
180 commits to main branch, last one 10 months ago
Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset
Created: 2021-09-14
33 commits to master branch, last one about a year ago
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)
Created: 2023-06-29
20 commits to main branch, last one about a month ago
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates...
Created: 2023-08-16
73 commits to main branch, last one 6 days ago
Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"
Created: 2022-04-15
4 commits to main branch, last one about a year ago