Trending repositories for topic crawling
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and o...
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
Scrapy, a fast high-level web crawling & scraping framework for Python.
newspaper3k is a news, full-text, and article metadata extraction library for Python 3.
Collection of patches for puppeteer and playwright to avoid automation detection and leaks. Helps to avoid Cloudflare and DataDome CAPTCHA pages. Easy to patch/unpatch, can be enabled/disabled on dema...
Python 3 script to dump/scrape/extract company employees from LinkedIn API
A reliable high-level web crawling & scraping framework for Node.js.
🤖 Scrape data from HTML websites automatically by just providing examples
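蓝天采集器 is a free, open-source crawler system that collects data using rules you edit by pointing and clicking. It runs locally, on a virtual host, or on a cloud server, can collect almost any type of web page, integrates seamlessly with all kinds of CMS site-building programs, publishes data in real time without login, and runs fully automatically with no manual intervention. It is a fully cross-platform, cloud-based crawler system for large-scale web data collection.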
List of libraries, tools and APIs for web scraping and data processing.
Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
SiteOne Crawler is a website analyzer and exporter you'll ♥ as a Dev/DevOps, QA engineer, website owner or consultant. Works on all popular platforms - Windows, macOS and Linux (x64 and arm64 too).
Take a list of domains, crawl URLs and scan for endpoints, secrets, API keys, file extensions, tokens and more
SpideyX is a multipurpose web penetration testing tool with asynchronous, concurrent performance and multiple modes and configurations.
Run a high-fidelity browser-based web archiving crawler in a single Docker container
A Content Discovery and Development Platform. Empowering Cybersecurity, AI, Marketing, and Finance professionals and researchers to discover, analyze, and interact with the web in all its dimensions.
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Practical experience in fetching Facebook data effectively by querying the Graph API with an account-based token and operating undetectable scraping bots to extract client-side and server-side rendered content
Repository for the Mastering Web Scraping in Python: Scaling to Distributed Crawling blogpost with the final code.
Sasori is a dynamic web crawler powered by Puppeteer, designed for lightning-fast endpoint discovery.
NetExtract: Efficiently extract core content from any webpage and convert it to clean, LLM-optimized Markdown with a simple API.
Crawly, a high-level web crawling & scraping framework for Elixir.
🕷 Automatically detect changes made to the official Telegram sites, clients and servers.