25 results found
A high-quality, one-stop open-source data extraction tool for converting PDF to Markdown and JSON.
Created
2024-02-29
2,008 commits to master branch, last one 2 days ago
Read and extract text and other content from PDFs in C# (port of PDFBox)
Created
2017-11-09
1,602 commits to master branch, last one 6 days ago
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
ocr, document, documentai, multimodal, end-to-end-ocr, text-detection, computer-vision, vision-language, text-recognition, document-analysis, document-recognition, scene-text-detection, document-intelligence, vision-language-model, document-understanding, scene-text-recognition, artificial-intelligence, multimodal-deep-learning, vision-language-transformer, scene-text-detection-recognition
Created
2022-09-28
64 commits to main branch, last one 5 days ago
A curated list of resources for Document Understanding (DU) topic
nlp, ocr, pdf, rpa, awesome, document-ai, awesome-list, deep-learning, pdf-documents, machine-learning, document-analysis, unstructured-data, document-intelligence, document-understanding, information-extraction, intelligent-processing, document-layout-analysis, key-information-extraction, robotic-process-automation, natural-language-processing
Created
2021-04-06
76 commits to main branch, last one about a year ago
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
Created
2020-07-15
52 commits to master branch, last one 2 years ago
Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
Created
2022-03-01
13 commits to main branch, last one 2 years ago
AssemblyLine 4: File triage and malware analysis
Created
2020-12-03
198 commits to master branch, last one 3 days ago
Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results
Created
2021-09-21
987 commits to main branch, last one 2 days ago
A package for parsing PDFs and analyzing their content using LLMs.
Created
2024-07-26
28 commits to main branch, last one 4 months ago
Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic...
Created
2020-12-07
296 commits to master branch, last one about a month ago
RObust document image BINarization
Created
2018-10-25
87 commits to master branch, last one 2 years ago
Local adaptive image binarization
Created
2017-07-24
6 commits to master branch, last one 6 years ago
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it i...
Created
2023-08-21
23 commits to master branch, last one 10 months ago
Post-process Amazon Textract results with Hugging Face transformer models for document understanding
Created
2021-09-14
159 commits to main branch, last one 4 months ago
(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
Created
2020-04-09
30 commits to master branch, last one about a year ago
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
Created
2023-04-11
667 commits to main branch, last one 6 months ago
YOLO models trained on DocLayNet - power your document intelligence with layout analysis
Created
2024-05-13
37 commits to main branch, last one 2 days ago
An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"
Created
2021-11-08
180 commits to main branch, last one 11 months ago
Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset
Created
2021-09-14
33 commits to master branch, last one about a year ago
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)
Created
2023-06-29
20 commits to main branch, last one 2 months ago
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates...
Created
2023-08-16
81 commits to main branch, last one 3 days ago
Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"
Created
2022-04-15
4 commits to main branch, last one 2 years ago
For our ISSTA22 paper "DocTer: Documentation-Guided Fuzzing for Testing Deep Learning API Functions" by Danning Xie, Yitong Li, Mijung Kim, Hung Viet Pham, Lin Tan, Xiangyu Zhang, Mike Godfrey
Created
2022-05-25
10 commits to main branch, last one 2 years ago
Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset
Created
2023-04-16
32 commits to master branch, last one 2 years ago
Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless archi...
Created
2024-06-07
95 commits to main branch, last one 5 months ago