34 results found

599
5.6k
mit
142
Extract keywords from sentences or replace keywords in sentences.
Created 2017-08-15
108 commits to master branch, last one 4 years ago
Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (f...
Created 2015-10-11
49 commits to master branch, last one 3 years ago
232
1.5k
apache-2.0
37
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Created 2017-07-13
6,411 commits to develop branch, last one about a year ago
55
859
gpl-2.0
13
Lightweight library for scraping websites with LLMs
Created 2024-08-12
96 commits to main branch, last one 14 hours ago
📰 Let ChatGPT Summarize Hacker News for You
Created 2014-09-17
464 commits to master branch, last one 29 days ago
🚜 Parse text and tables from PDF files.
Created 2015-03-05
161 commits to master branch, last one 3 days ago
139
562
other
44
A powerful Python library for getting rich data from the Vietnam Stock Market using just a few lines of code
Created 2022-02-27
289 commits to main branch, last one 5 days ago
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
Created 2020-05-12
542 commits to master branch, last one about a year ago
11
222
bsd-3-clause
5
Benchmarking PDF libraries
Created 2022-05-08
57 commits to main branch, last one about a year ago
Wikipedia information extraction library
Created 2015-06-15
340 commits to master branch, last one about a year ago
A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.
Created 2023-07-06
16 commits to master branch, last one 7 months ago
This repository provides usage examples for the Python module Newspaper3k.
Created 2020-10-11
73 commits to main branch, last one 10 months ago
7
129
unknown
4
Accurate, private and configurable document retrieval LLM
Created 2024-03-14
204 commits to main branch, last one 3 days ago
23
122
gpl-3.0
9
A Python utility to digitize plots.
Created 2018-07-12
157 commits to main branch, last one 2 months ago
15
121
apache-2.0
7
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Created 2020-04-22
195 commits to main branch, last one 7 months ago
Structured HTML table data extraction from URLs in Go that has almost no external dependencies
Created 2022-09-17
19 commits to main branch, last one 8 months ago
Superpipe - optimized LLM pipelines for structured data
Created 2024-02-07
99 commits to main branch, last one 4 months ago
Line segmentation algorithm for Google Vision API.
Created 2018-01-14
36 commits to master branch, last one 2 years ago
High-performance trie and Aho-Corasick automaton (AC automaton) keyword match and replace tool for Python. Correct case-insensitive implementation!
Created 2019-02-21
38 commits to master branch, last one about a year ago
GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.
Created 2023-07-27
296 commits to main branch, last one about a month ago
Reduce HTML and XML to JSON from the command line, using an expressive query language inspired by CSS selectors.
Created 2020-07-20
45 commits to main branch, last one about a month ago
15
66
other
5
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boo...
Created 2023-05-29
1,188 commits to master branch, last one 22 days ago
file metadata parsing, done cheap
Created 2017-12-08
423 commits to master branch, last one about a month ago
⚡️ Next-generation data transformation framework for TypeScript that puts developer experience first
This repository has been archived
Created 2022-03-23
66 commits to main branch, last one 2 years ago
15
53
agpl-3.0
27
Information extraction and interactive visualization of textual datasets for investigative data-driven journalism and eDiscovery
Created 2016-06-24
818 commits to master branch, last one 2 years ago
3
52
apache-2.0
4
Domain-specific language for extracting structured data from HTML documents
Created 2016-03-03
1,598 commits to master branch, last one 3 days ago
Refinery is a tool to extract and transform semi-structured data from Excel spreadsheets of different layouts in a declarative way.
Created 2021-11-01
142 commits to master branch, last one about a year ago