Trending repositories for topic web-crawler
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.
🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.
Model Context Protocol (MCP) Server for Graphlit Platform
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and o...
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
Distributed web crawler admin platform for managing spiders regardless of language or framework.
CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
Run a high-fidelity browser-based web archiving crawler in a single Docker container
Undetected Web-Scraping & Seamless HTML Parsing in Python!
The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files.
Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"
A collection of awesome web crawlers and spiders in different languages
News crawling with StormCrawler - stores content as WARC
Internet search engine for text-oriented websites. Indexing the small, old and weird web.
A Python script that empowers people with no programming background to generate robust leads at scale. This repo will collect various versatile techniques used in lead generation.
GroqCrawl is a powerful and user-friendly web crawling and scraping application built with Streamlit and powered by PocketGroq. It provides an intuitive interface for extracting LLM friendly AI consum...
Multi-threaded web scraper to download all the tutorials from www.learncpp.com and convert them to PDF files concurrently.
This repository contains Python source code for more than 40 Python projects spanning many fields, including web crawling, algorithms, OpenGL, tkinter, and object-oriented programming.
GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.
"Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases" by Jiarui Li and Ye Yuan and Zehua Zhang