38 results found

445
6.1k
agpl-3.0
38
🔥 Open-source no-code web data extraction platform. Turn websites to APIs and spreadsheets with no-code robots in minutes! [In Beta]
Created 2023-10-23
3,894 commits to develop branch, last one 12 hours ago
601
5.6k
mit
142
Extract keywords from a sentence or replace keywords in sentences.
Created 2017-08-15
108 commits to master branch, last one 4 years ago
91
1.7k
bsd-3-clause
15
Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
Created 2024-10-13
299 commits to main branch, last one 2 days ago
Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (f...
Created 2015-10-11
49 commits to master branch, last one 3 years ago
232
1.5k
apache-2.0
38
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Created 2017-07-13
6,411 commits to develop branch, last one about a year ago
59
930
gpl-2.0
13
Lightweight library for scraping websites with LLMs
Created 2024-08-12
110 commits to main branch, last one 12 days ago
📰 Let ChatGPT Summarize Hacker News for You
Created 2014-09-17
464 commits to master branch, last one 2 months ago
🚜 Parse text and tables from PDF files.
Created 2015-03-05
162 commits to master branch, last one 7 days ago
144
585
other
44
A powerful Python library for getting rich data from the Vietnam Stock Market using just a few lines of code
Created 2022-02-27
290 commits to main branch, last one about a month ago
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
Created 2020-05-12
542 commits to master branch, last one about a year ago
11
243
bsd-3-clause
5
Benchmarking PDF libraries
Created 2022-05-08
57 commits to main branch, last one about a year ago
Wikipedia information extraction library
Created 2015-06-15
340 commits to master branch, last one about a year ago
A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.
Created 2023-07-06
16 commits to master branch, last one 9 months ago
This repository provides usage examples for the Python module Newspaper3k.
Created 2020-10-11
73 commits to main branch, last one 11 months ago
8
129
unknown
3
Accurate, private and configurable document retrieval LLM
Created 2024-03-14
246 commits to main branch, last one 13 days ago
24
124
gpl-3.0
9
A Python utility to digitize plots.
Created 2018-07-12
157 commits to main branch, last one 4 months ago
15
121
apache-2.0
7
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Created 2020-04-22
195 commits to main branch, last one 9 months ago
Structured HTML table data extraction from URLs in Go that has almost no external dependencies
Created 2022-09-17
20 commits to main branch, last one about a month ago
Superpipe - optimized LLM pipelines for structured data
Created 2024-02-07
99 commits to main branch, last one 6 months ago
High performance Trie and Ahocorasick automata (AC automata) Keyword Match & Replace Tool for python. Correct case insensitive implementation!
Created 2019-02-21
38 commits to master branch, last one about a year ago
Line segmentation algorithm for Google Vision API.
Created 2018-01-14
36 commits to master branch, last one 2 years ago
GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.
Created 2023-07-27
303 commits to main branch, last one 8 days ago
17
73
other
6
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boo...
Created 2023-05-29
1,231 commits to master branch, last one 16 days ago
Reduce HTML and XML to JSON from the command line, using an expressive query language inspired by CSS selectors.
Created 2020-07-20
45 commits to main branch, last one 2 months ago
file metadata parsing, done cheap
Created 2017-12-08
424 commits to master branch, last one 10 days ago
15
53
agpl-3.0
27
Information extraction and interactive visualization of textual datasets for investigative data-driven journalism and eDiscovery
Created 2016-06-24
818 commits to master branch, last one 3 years ago
⚡️ Next-generation data transformation framework for TypeScript that puts developer experience first
This repository has been archived
Created 2022-03-23
66 commits to main branch, last one 2 years ago