23 results found

1.3k
17.8k
agpl-3.0
97
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF into Markdown and JSON formats.
Created 2024-02-29
1,745 commits to master branch, last one 2 days ago
241
1.7k
apache-2.0
50
Read and extract text and other content from PDFs in C# (port of PDFBox)
Created 2017-11-09
1,595 commits to master branch, last one a day ago
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
Created 2022-09-28
62 commits to main branch, last one about a month ago
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
Created 2020-07-15
52 commits to master branch, last one 2 years ago
41
345
mit
6
Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
Created 2022-03-01
13 commits to main branch, last one 2 years ago
Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results
Created 2021-09-21
968 commits to main branch, last one a day ago
A package for parsing PDFs and analyzing their content using LLMs.
Created 2024-07-26
28 commits to main branch, last one 3 months ago
21
183
apache-2.0
12
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic...
Created 2020-12-07
296 commits to master branch, last one a day ago
RObust document image BINarization
Created 2018-10-25
87 commits to master branch, last one 2 years ago
Local adaptive image binarization
Created 2017-07-24
6 commits to master branch, last one 6 years ago
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it i...
Created 2023-08-21
23 commits to master branch, last one 9 months ago
Post-process Amazon Textract results with Hugging Face transformer models for document understanding
Created 2021-09-14
159 commits to main branch, last one 3 months ago
(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
Created 2020-04-09
30 commits to master branch, last one about a year ago
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
Created 2023-04-11
667 commits to main branch, last one 5 months ago
YOLO models trained on DocLayNet - power your Document Intelligence with Layout Analysis
Created 2024-05-13
35 commits to main branch, last one about a month ago
An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"
Created 2021-11-08
180 commits to main branch, last one 10 months ago
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)
Created 2023-06-29
20 commits to main branch, last one about a month ago
Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"
Created 2022-04-15
4 commits to main branch, last one about a year ago
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates...
Created 2023-08-16
78 commits to main branch, last one 2 days ago
4
33
other
3
For our ISSTA22 paper "DocTer: Documentation-Guided Fuzzing for Testing Deep Learning API Functions" by Danning Xie, Yitong Li, Mijung Kim, Hung Viet Pham, Lin Tan, Xiangyu Zhang, Mike Godfrey
Created 2022-05-25
10 commits to main branch, last one 2 years ago