Search Results - RepositoryStats

heritrix3 internetarchive

761

2.9k

other

187

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

java warc heritrix webcrawling

Created 2011-10-21

2,655 commits to master branch, last one 3 days ago

FinnewsHunter DemonDamon

282

1.1k

mit

31

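Crawls historical news text for listed companies (individual stocks) from Sina Finance, NBD (每经网), JRJ (金融界), China Securities Network (中国证券网), and Securities Times (证券时报网), then performs text analysis and feature extraction, trains classifiers such as SVM and random forest, and finally runs classification predictions on newly crawled news data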

text-mining webcrawling machine-learning

Created 2018-02-25

167 commits to main branch, last one 5 months ago

scrapyrt scrapinghub

160

851

bsd-3-clause

44

HTTP API for Scrapy spiders

python scrapy crawler scraper twisted crawling webcrawler webcrawling hacktoberfest hacktoberfest2021

Created 2015-01-06

247 commits to master branch, last one about a year ago

opensearchserver jaeksoft

190

505

apache-2.0

76

Open-source Enterprise Grade Search Engine Software

ocr java lucene search crawler indexing synonyms enterprise webcrawler webcrawling custom-search search-engine opensearchserver

Created 2013-07-18

5,642 commits to master branch, last one 3 years ago

DotnetCrawler mehmetozkaya

66

175

unknown

11

DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. This library is designed like other strong crawler libraries like Web...

csharp scrapy crawler crawling scraping dotnetcore webcrawler webscraper webcrawling webscraping scrapy-crawler htmlagilitypack ddd-architecture entity-framework-core webcrawler-htmlagilitypack

Created 2019-02-19

55 commits to master branch, last one 5 years ago

gotor DedSecInside

44

166

gpl-3.0

6

This program provides efficient web scraping services for Tor and non-Tor sites. The program has both a CLI and REST API.

go cli tor osint docker golang torbot service rest-api webcrawler http-server osint-tools webcrawling webscraping command-line golang-server hacktoberfest command-line-tool information-extraction

Created 2018-06-02

223 commits to main branch, last one 11 months ago

ralger feddelegrand7

14

155

other

5

ralger makes it easy to scrape a website. Built on the shoulders of titans: rvest, xml2.

r rstats webcrawling webscraping dataextraction webscraper-website

Created 2020-02-18

274 commits to master branch, last one 8 months ago

data-api scrapyman

8

150

unknown

6

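Scrapyman data API service. Covers: Taobao, Xiaohongshu, Tongcheng Travel, JD.com, Douyin (e-commerce), Meituan, Douyin (video), Kuaishou, Pugongying, Xingtu, Pinduoduo, WeChat Official Accounts, Dianping, Bilibili, Zhihu, Weibo, Beike, Bigo, Temu, Lazada, Shopee, SHEIN, Baidu Index, Ctrip, Boss Zhipin, Zhaopin, Lagou, Toutiao, Facebook, YouTube, Instagram, Twitter. Crawler, data collection, scrapy, interface, AP...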

api data crawl douyin taobao jingdong kuaishou pinduoduo pugongying taobao-api webcrawling xiaohongshu xiaohongshu-api

Created 2023-08-03

49,776 commits to main branch

Raspagem-de-dados-para-iniciantes DwarfThief

21

133

gpl-3.0

9

Data scraping for beginners using Scrapy and other basic libraries

estudo python scrapy spyder opensource web-crawler webcrawling datascraping hacktoberfest jupyter-notebook raspagem-de-dados

Created 2018-10-28

53 commits to master branch, last one about a year ago

malheatmap andersonkrs

2

100

mit

1

An extension for tracking your activities on myanimelist.net

ruby rails myanimelist webcrawling

Created 2020-03-01

1,085 commits to main branch, last one 3 months ago

LLMWebCrawler Aavache

10

92

unknown

1

A Web Crawler based on LLMs implemented with Ray and Huggingface. The embeddings are saved into a vector database for fast clustering and retrieval. Use it for your RAG.

api llm nlp rag ray milvus python raylib fastapi pydantic webcrawler huggingface transformer webcrawling vector-database machine-learning distributed-computing large-language-models

Created 2023-09-28

9 commits to main branch, last one about a year ago

ARGUS datawizard1337

25

88

gpl-3.0

6

ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks...

python scrapy scrapyd crawling scraping webcrawling webscraping

Created 2018-05-18

186 commits to master branch, last one 3 years ago

newspaperjs flickz

20

75

mit

4

News extraction and scraping. Article Parsing

news nodejs crawler scraper webcrawling webscraping news-aggregator

Created 2017-04-13

66 commits to master branch, last one 2 years ago

Stock-Fundamental-data-scraping-and-analysis Skumarr53

28

71

unknown

3

Project on building a web crawler to collect a stock's fundamentals and review its performance in one go

python3 selenium automation webcrawling web-scraping datacollection stock-fundamentalplots

Created 2019-07-13

30 commits to master branch, last one 4 years ago

Ultimate-Guide-to-Sneaker-Bot-Creation spieredd

7

52

mit

7

The Ultimate Guide to Sneaker Bot 🤖 Creation using JavaScript and NodeJS ☣️. Learn how to get the most out of tools like the Chrome DevTools, and JS libraries like Puppeteer or Axios.

bot auto bots node axios nodejs bot-api requests sneakers puppeteer webdriver javascript playwright sneakerbot webcrawling webscraping bot-framework sneakermonitor

Created 2021-05-09

9 commits to main branch, last one 3 years ago