Trending repositories for topic big-data

Last 3 days (new repositories)

no newly created repositories trending in the last 3 days

Last 3 days (absolute gain)

lakehq/sail

LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.

418 (+43)

apache-2.0

apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

39,977 (+43)

apache-2.0

binhnguyennus/awesome-scalability

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

59,072 (+42)

mit

ClickHouse/ClickHouse

ClickHouse® is a real-time analytics DBMS

37,710 (+41)

apache-2.0

provectus/kafka-ui

Open-Source Web UI for Apache Kafka Management

9,843 (+29)

apache-2.0

questdb/questdb

QuestDB is an open source time-series database for fast ingest and SQL queries

14,626 (+28)

apache-2.0

apache/flink

Apache Flink

24,138 (+21)

apache-2.0

vesoft-inc/nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability

10,839 (+17)

apache-2.0

quickwit-oss/quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.

8,297 (+17)

paradedb/paradedb

Postgres for Search and Analytics

6,238 (+17)

agpl-3.0

vespa-engine/vespa

AI + Data, online. https://vespa.ai

5,832 (+16)

apache-2.0

StarRocks/starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for...

9,014 (+14)

apache-2.0

andkret/Cookbook

The Data Engineering Cookbook

13,773 (+13)

apache-2.0

rom1504/img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

3,725 (+13)

mit

risingwavelabs/risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...

7,058 (+12)

apache-2.0

cython/cython

The most widely used Python to C compiler

9,538 (+12)

apache-2.0

apache/datafusion

Apache DataFusion SQL Query Engine

6,313 (+11)

apache-2.0

delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

7,610 (+11)

apache-2.0

xl-xueling/xl-lighthouse

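XL-LightHouse is a general-purpose streaming big-data statistics system that supports very large data volumes and very high concurrency (a standalone edition is also available). Common use cases include PV and UV statistics; e-commerce sales and ordering-user counts; log volume statistics; API call counts, error counts, and latency statistics; and server operations metrics monitoring. The system supports multi-dimensional statistics and complex conditional filtering and logic, deploys with one click, integrates with one line of code, and makes real-time statistics over massive data easy, helping companies quickly build a data metrics system at lower cost. It is an enterprise...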

264 (+10)

apache-2.0

tonbo-io/tonbo

A portable embedded database using Arrow.

787 (+9)

apache-2.0

Last 3 days (relative gain)

lakehq/sail

LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.

418 (+11%)

apache-2.0

xl-xueling/xl-lighthouse

264 (+4%)

apache-2.0

chitralverma/scala-polars

Polars for Scala & Java projects!

70 (+3%)

apache-2.0

apache/cloudberry

One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.

414 (+2%)

apache-2.0

kafbat/kafka-ui

Open-Source Web UI for managing Apache Kafka clusters

622 (+1%)

apache-2.0

tonbo-io/tonbo

A portable embedded database using Arrow.

787 (+1%)

apache-2.0

gkiril/oie-resources

A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.

495 (+0.6%)

paradedb/pg_analytics

DuckDB-powered analytics for Postgres

382 (+0.5%)

postgresql

airscholar/e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...

204 (+0.5%)

adidas/lakehouse-engine

The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Prod...

224 (+0.4%)

apache-2.0

scikit-hep/uproot5

ROOT I/O in pure Python and NumPy.

241 (+0.4%)

bsd-3-clause

kwai/blaze

Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

1,296 (+0.4%)

apache-2.0

rom1504/img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

3,725 (+0.4%)

mit

Eventual-Inc/Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

2,339 (+0.3%)

apache-2.0

oneapi-src/oneDAL

oneAPI Data Analytics Library (oneDAL)

615 (+0.3%)

apache-2.0

apache/datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine

1,550 (+0.3%)

apache-2.0

provectus/kafka-ui

Open-Source Web UI for Apache Kafka Management

9,843 (+0.3%)

apache-2.0

vespa-engine/vespa

AI + Data, online. https://vespa.ai

5,832 (+0.3%)

apache-2.0

apache/datafusion

Apache DataFusion SQL Query Engine

6,313 (+0.2%)

apache-2.0

natayadev/dataengineering-roadmap

Yet another repository with basic concepts, technical challenges, and resources on data engineering in Spanish 🧙✨

617 (+0.2%)

mit
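
The relative-gain percentages in this listing appear to be the absolute gain divided by the star count at the start of the window. That is an inference from the figures on this page, not a documented formula, and the function name below is hypothetical. A minimal sketch, using three entries from the 3-day lists:

```python
from typing import Optional


def relative_gain(current_stars: int, absolute_gain: int) -> Optional[float]:
    """Percentage growth relative to the star count at the start of the window.

    Assumption: the page divides the gain by the previous star count
    (current minus gain). This is inferred from the listed numbers.
    """
    previous = current_stars - absolute_gain
    if previous <= 0:
        # No stars before the window (e.g. a brand-new repository):
        # relative growth is undefined.
        return None
    return 100.0 * absolute_gain / previous


# (repository, current stars, absolute 3-day gain) as shown in the listing
samples = [
    ("lakehq/sail", 418, 43),
    ("xl-xueling/xl-lighthouse", 264, 10),
    ("tonbo-io/tonbo", 787, 9),
]

for name, stars, gain in samples:
    print(f"{name}: +{relative_gain(stars, gain):.0f}%")
```

Under this assumption the three entries come out at +11%, +4%, and +1%, which matches the percentages shown for them in the 3-day relative-gain list.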

Last week (new repositories)

no newly created repositories trending in the last week

Last week (absolute gain)

ClickHouse/ClickHouse

ClickHouse® is a real-time analytics DBMS

37,710 (+107)

apache-2.0

binhnguyennus/awesome-scalability

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

59,072 (+92)

mit

apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

39,977 (+89)

apache-2.0

provectus/kafka-ui

Open-Source Web UI for Apache Kafka Management

9,843 (+50)

apache-2.0

apache/flink

Apache Flink

24,138 (+47)

apache-2.0

StarRocks/starrocks

9,014 (+46)

apache-2.0

lakehq/sail

LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.

418 (+43)

apache-2.0

quickwit-oss/quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.

8,297 (+37)

questdb/questdb

QuestDB is an open source time-series database for fast ingest and SQL queries

14,626 (+37)

apache-2.0

paradedb/paradedb

Postgres for Search and Analytics

6,238 (+34)

agpl-3.0

trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

10,483 (+33)

apache-2.0

apache/datafusion

Apache DataFusion SQL Query Engine

6,313 (+31)

apache-2.0

vesoft-inc/nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability

10,839 (+30)

apache-2.0

vespa-engine/vespa

AI + Data, online. https://vespa.ai

5,832 (+27)

apache-2.0

andkret/Cookbook

The Data Engineering Cookbook

13,773 (+24)

apache-2.0

donnemartin/data-science-ipython-notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...

27,492 (+23)

delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

7,610 (+21)

apache-2.0

rom1504/img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

3,725 (+20)

mit

Eventual-Inc/Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

2,339 (+18)

apache-2.0

kafbat/kafka-ui

Open-Source Web UI for managing Apache Kafka clusters

622 (+17)

apache-2.0

Last week (relative gain)

elastic/eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

23 (+15%)

apache-2.0

lakehq/sail

LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.

418 (+11%)

apache-2.0

haiderjabbar/laravelsolr

A Laravel package for seamless integration with Apache Solr, providing easy-to-use commands for core management and a fluent interface for Solr operations.

48 (+7%)

xl-xueling/xl-lighthouse

264 (+5%)

apache-2.0

caioricciuti/ch-ui

Use CH-UI to work with your self-hosted ClickHouse data through a user-friendly interface. CH-UI is a modern and feature-rich user interface for ClickHouse databases. It offers an intuitive platfor...

121 (+4%)

mit

apache/cloudberry

One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.

414 (+4%)

apache-2.0

block-mesh/block-mesh-monorepo

No description

30 (+3%)

paradedb/pg_analytics

DuckDB-powered analytics for Postgres

382 (+3%)

postgresql

rogerioxavier/X-Wines

A world wines dataset with user ratings for recommendation systems and general use.

33 (+3%)

cc0-1.0

chitralverma/scala-polars

Polars for Scala & Java projects!

70 (+3%)

apache-2.0

trieu/leo-cdp-free-edition

The binary build of LEO CDP Free Edition for training purposes

35 (+3%)

apache-2.0

kafbat/kafka-ui

Open-Source Web UI for managing Apache Kafka clusters

622 (+3%)

apache-2.0

The-Joker123/BigData_beauty_analysis

Large-screen data visualization and big data analysis (SpringBoot + hiveJDBC + echarts)

37 (+3%)

apache/bigtop-manager

Bigtop Manager provides a modern, low-threshold web application to simplify the deployment and management of components for Bigtop, similar to Apache Ambari and Cloudera Manager.

51 (+2%)

apache-2.0

tonbo-io/tonbo

A portable embedded database using Arrow.

787 (+2%)

apache-2.0

airscholar/e2e-data-engineering

204 (+1%)

NVIDIA/spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

822 (+1%)

apache-2.0

parashardhapola/scarf

Toolkit for highly memory efficient analysis of single-cell RNA-Seq, scATAC-Seq and CITE-Seq data. Analyze atlas scale datasets with millions of cells on laptop.

98 (+1%)

bsd-3-clause

scikit-hep/uproot5

ROOT I/O in pure Python and NumPy.

241 (+0.8%)

bsd-3-clause

Eventual-Inc/Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

2,339 (+0.8%)

apache-2.0

Last month (new repositories)

haiderjabbar/laravelsolr

A Laravel package for seamless integration with Apache Solr, providing easy-to-use commands for core management and a fluent interface for Solr operations.

Last month (absolute gain)

ClickHouse/ClickHouse

ClickHouse® is a real-time analytics DBMS

37,710 (+466)

apache-2.0

apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

39,977 (+456)

apache-2.0

binhnguyennus/awesome-scalability

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

59,072 (+388)

mit

paradedb/pg_analytics

DuckDB-powered analytics for Postgres

382 (+192)

postgresql

StarRocks/starrocks

9,014 (+178)

apache-2.0

apache/flink

Apache Flink

24,138 (+177)

apache-2.0

provectus/kafka-ui

Open-Source Web UI for Apache Kafka Management

9,843 (+164)

apache-2.0

apache/datafusion

Apache DataFusion SQL Query Engine

6,313 (+158)

apache-2.0

quickwit-oss/quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.

8,297 (+157)

questdb/questdb

QuestDB is an open source time-series database for fast ingest and SQL queries

14,626 (+152)

apache-2.0

paradedb/paradedb

Postgres for Search and Analytics

6,238 (+146)

agpl-3.0

trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

10,483 (+137)

apache-2.0

donnemartin/data-science-ipython-notebooks

27,492 (+121)

vesoft-inc/nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability

10,839 (+108)

apache-2.0

Eventual-Inc/Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

2,339 (+104)

apache-2.0

risingwavelabs/risingwave

7,058 (+92)

apache-2.0

kwai/blaze

Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

1,296 (+91)

apache-2.0

vespa-engine/vespa

AI + Data, online. https://vespa.ai

5,832 (+91)

apache-2.0

andkret/Cookbook

The Data Engineering Cookbook

13,773 (+90)

apache-2.0

heibaiying/BigData-Notes

A beginner's guide to big data ⭐

15,949 (+89)

Last month (relative gain)

paradedb/pg_analytics

DuckDB-powered analytics for Postgres

382 (+101%)

postgresql

block-mesh/block-mesh-monorepo

No description

30 (+88%)

caioricciuti/ch-ui

121 (+26%)

mit

confluentinc/kafka-connect-hdfs

Kafka Connect HDFS connector

12 (+20%)

tuanx18/data-engineer-portfolio

This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.

32 (+19%)

lakehq/sail

LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.

418 (+17%)

apache-2.0

xl-xueling/xl-lighthouse

264 (+14%)

apache-2.0

apache/bigtop-manager

Bigtop Manager provides a modern, low-threshold web application to simplify the deployment and management of components for Bigtop, similar to Apache Ambari and Cloudera Manager.

51 (+13%)

apache-2.0

tonbo-io/tonbo

A portable embedded database using Arrow.

787 (+12%)

apache-2.0

dataflint/spark

Performance Observability for Apache Spark

198 (+11%)

apache-2.0

apache/cloudberry

One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.

414 (+10%)

apache-2.0

kafbat/kafka-ui

Open-Source Web UI for managing Apache Kafka clusters

622 (+8%)

apache-2.0

chitralverma/scala-polars

Polars for Scala & Java projects!

70 (+8%)

apache-2.0

kwai/blaze

Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

1,296 (+8%)

apache-2.0

wzqwtt/BigData

A beginner's big data study notes ⭐

31 (+7%)

samber/awesome-olap

A curated list of awesome Online Analytical Processing databases, frameworks, resources and other awesomeness.

48 (+7%)

mit

airscholar/e2e-data-engineering

204 (+6%)

trieu/leo-cdp-free-edition

The binary build of LEO CDP Free Edition for training purposes

35 (+6%)

apache-2.0

apache/spark-docker

Official Dockerfile for Apache Spark

107 (+6%)

apache-2.0

EuclidOLAP/EuclidOLAP

Multidimensional Database

54 (+6%)

apache-2.0

Last 12 months (new repositories)

quarylabs/quary

Open-source BI for engineers

2,186

apache-2.0

myscale/MyScaleDB

A @ClickHouse fork that supports high-performance vector search and full-text search.

868

apache-2.0

tonbo-io/tonbo

A portable embedded database using Arrow.

787

apache-2.0

kafbat/kafka-ui

Open-Source Web UI for managing Apache Kafka clusters

622

apache-2.0

natayadev/dataengineering-roadmap

Yet another repository with basic concepts, technical challenges, and resources on data engineering in Spanish 🧙✨

617

mit

lakehq/sail

LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.

418

apache-2.0

paradedb/pg_analytics

DuckDB-powered analytics for Postgres

382

postgresql

linkedin/openhouse

Open Control Plane for Tables in Data Lakehouse

312

bsd-2-clause

caioricciuti/ch-ui

121

mit

apache/paimon-rust

Apache Paimon Rust: the Rust implementation of Apache Paimon.

100

apache-2.0

chuongmep/aps-toolkit

A library to unlock BIM data with Autodesk Platform Services

gpl-3.0

apache/bigtop-manager

Bigtop Manager provides a modern, low-threshold web application to simplify the deployment and management of components for Bigtop, similar to Apache Ambari and Cloudera Manager.

apache-2.0

haiderjabbar/laravelsolr

A Laravel package for seamless integration with Apache Solr, providing easy-to-use commands for core management and a fluent interface for Solr operations.

airscholar/FlinkCommerce

This repository contains an Apache Flink application for real-time sales analytics built using Docker Compose to orchestrate the necessary infrastructure components, including Apache Flink, Elasticsea...

block-mesh/block-mesh-monorepo

No description

Last 12 months (absolute gain)

binhnguyennus/awesome-scalability

The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

59,072 (+9,828)

mit

ClickHouse/ClickHouse

ClickHouse® is a real-time analytics DBMS

37,710 (+5,931)

apache-2.0

paradedb/paradedb

Postgres for Search and Analytics

6,238 (+4,533)

agpl-3.0

quickwit-oss/quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.

8,297 (+4,004)

StarRocks/starrocks

9,014 (+3,213)

apache-2.0

apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

39,977 (+2,746)

apache-2.0

provectus/kafka-ui

Open-Source Web UI for Apache Kafka Management

9,843 (+2,522)

apache-2.0

quarylabs/quary

Open-source BI for engineers

2,186 (+2,053)

apache-2.0

apache/datafusion

Apache DataFusion SQL Query Engine

6,313 (+2,048)

apache-2.0

apache/flink

Apache Flink

24,138 (+1,857)

apache-2.0

donnemartin/data-science-ipython-notebooks

27,492 (+1,688)

risingwavelabs/risingwave

7,058 (+1,634)

apache-2.0

trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

10,483 (+1,634)

apache-2.0

questdb/questdb

QuestDB is an open source time-series database for fast ingest and SQL queries

14,626 (+1,634)

apache-2.0

apache/iotdb

Apache IoTDB

5,620 (+1,549)

apache-2.0

Eventual-Inc/Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

2,339 (+1,411)

apache-2.0

andkret/Cookbook

The Data Engineering Cookbook

13,773 (+1,339)

apache-2.0

heibaiying/BigData-Notes

A beginner's guide to big data ⭐

15,949 (+1,255)

vesoft-inc/nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability

10,839 (+1,171)

apache-2.0

delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

7,610 (+1,097)

apache-2.0

Last 12 months (relative gain)

tonbo-io/tonbo

A portable embedded database using Arrow.

787 (+19,575%)

apache-2.0

myscale/MyScaleDB

A @ClickHouse fork that supports high-performance vector search and full-text search.

868 (+17,260%)

apache-2.0

dataflint/spark

Performance Observability for Apache Spark

198 (+3,860%)

apache-2.0

quarylabs/quary

Open-source BI for engineers

2,186 (+1,544%)

apache-2.0

airscholar/e2e-data-engineering

204 (+1,175%)

apache/bigtop-manager

Bigtop Manager provides a modern, low-threshold web application to simplify the deployment and management of components for Bigtop, similar to Apache Ambari and Cloudera Manager.

51 (+1,175%)

apache-2.0

godaai/scale-py-zh

Accelerating Python data science: Dask, Ray, Xorbits, mpi4py

41 (+925%)

xl-xueling/xl-lighthouse

264 (+878%)

apache-2.0

apache/paimon-rust

Apache Paimon Rust: the Rust implementation of Apache Paimon.

100 (+809%)

apache-2.0

tuanx18/data-engineer-portfolio

This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.

32 (+540%)

apache/cloudberry

One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.

414 (+399%)

apache-2.0

samber/awesome-olap

A curated list of awesome Online Analytical Processing databases, frameworks, resources and other awesomeness.

48 (+380%)

mit

adidas/lakehouse-engine

224 (+357%)

apache-2.0

paradedb/paradedb

Postgres for Search and Analytics

6,238 (+266%)

agpl-3.0

chuongmep/aps-toolkit

A library to unlock BIM data with Autodesk Platform Services

55 (+244%)

gpl-3.0

mfuu/vue3-virtual-drag-list

A virtual scrolling list component that can be sorted by dragging, for vue3

38 (+217%)

mit

chitralverma/scala-polars

Polars for Scala & Java projects!

70 (+204%)

apache-2.0

kshitijzutshi/DAMG7245-Final-Project

Content based Music Recommendation System

48 (+153%)

mit

Eventual-Inc/Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

2,339 (+152%)

apache-2.0

The-Joker123/BigData_beauty_analysis

Large-screen data visualization and big data analysis (SpringBoot + hiveJDBC + echarts)

37 (+118%)