Trending repositories for topic big-data
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for...
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
Apache Spark - A unified analytics engine for large-scale data processing
One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.
Collaborative Datacenter Simulation and Exploration for Everybody
🦖 A SQL-on-everything Query Engine you can execute over multiple databases and file formats. Query your data, where it lives.
Apache Wayang(incubating) is the first cross-platform data processing system.
Apache Paimon Rust The rust implementation of Apache Paimon.
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for...
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
A @ClickHouse fork that supports high-performance vector search and full-text search.
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for...
Apache Spark - A unified analytics engine for large-scale data processing
YTsaurus is a scalable and fault-tolerant open-source big data platform.
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
Apache Wayang(incubating) is the first cross-platform data processing system.
Arkime is an open source, large scale, full packet capturing, indexing, and database system.
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Apache Wayang(incubating) is the first cross-platform data processing system.
One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.
🦖 A SQL-on-everything Query Engine you can execute over multiple databases and file formats. Query your data, where it lives.
Use CH-UI to work with your data from Click House self-hosted with a user-friendly interface. CH-UI is a modern and feature-rich user interface for ClickHouse databases. It offers an intuitive platfor...
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
Collaborative Datacenter Simulation and Exploration for Everybody
YTsaurus is a scalable and fault-tolerant open-source big data platform.
ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
Apache Paimon Rust The rust implementation of Apache Paimon.
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
A @ClickHouse fork that supports high-performance vector search and full-text search.
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for...
Apache Spark - A unified analytics engine for large-scale data processing
One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
A distributed, fast open-source graph database featuring horizontal scalability and high availability
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
Arkime is an open source, large scale, full packet capturing, indexing, and database system.
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
A Laravel package for seamless integration with Apache Solr, providing easy-to-use commands for core management and a fluent interface for Solr operations.
🦖 A SQL-on-everything Query Engine you can execute over multiple databases and file formats. Query your data, where it lives.
Use CH-UI to work with your data from Click House self-hosted with a user-friendly interface. CH-UI is a modern and feature-rich user interface for ClickHouse databases. It offers an intuitive platfor...
Apache Wayang(incubating) is the first cross-platform data processing system.
Apache Paimon Rust The rust implementation of Apache Paimon.
XL-LightHouse是一套支持超大数据量、支持超高并发的通用型流式大数据统计系统【同时支持单机版】。常见的应用场景包括:PV、UV统计;电商销售额、下单用户数统计;日志量统计;接口调用量、异常量、耗时情况统计;服务器运维指标监控等功能。系统支持多维度统计,支持各种复杂的条件筛选和逻辑判断,一键部署,一行代码接入,轻松实现各种海量数据实时统计,帮助企业以更低的成本快速搭建起数据指标体系,是企业...
A world wines dataset with user ratings for recommendation systems and general use.
Bigtop Manager provides a modern, low-threshold web application to simplify the deployment and management of components for Bigtop, similar to Apache Ambari and Cloudera Manager.
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
A @ClickHouse fork that supports high-performance vector search and full-text search.
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
Use CH-UI to work with your data from Click House self-hosted with a user-friendly interface. CH-UI is a modern and feature-rich user interface for ClickHouse databases. It offers an intuitive platfor...
Bigtop Manager provides a modern, low-threshold web application to simplify the deployment and management of components for Bigtop, similar to Apache Ambari and Cloudera Manager.
A Laravel package for seamless integration with Apache Solr, providing easy-to-use commands for core management and a fluent interface for Solr operations.
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
Apache Spark - A unified analytics engine for large-scale data processing
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for...
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
A distributed, fast open-source graph database featuring horizontal scalability and high availability
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
A @ClickHouse fork that supports high-performance vector search and full-text search.
Bigtop Manager provides a modern, low-threshold web application to simplify the deployment and management of components for Bigtop, similar to Apache Ambari and Cloudera Manager.
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
Apache Paimon Rust The rust implementation of Apache Paimon.
One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.
XL-LightHouse是一套支持超大数据量、支持超高并发的通用型流式大数据统计系统【同时支持单机版】。常见的应用场景包括:PV、UV统计;电商销售额、下单用户数统计;日志量统计;接口调用量、异常量、耗时情况统计;服务器运维指标监控等功能。系统支持多维度统计,支持各种复杂的条件筛选和逻辑判断,一键部署,一行代码接入,轻松实现各种海量数据实时统计,帮助企业以更低的成本快速搭建起数据指标体系,是企业...
A curated list of awesome Online Analytical Processing databases, frameworks, ressources and other awesomeness.
This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.
A virtual scrolling list component that can be sorted by dragging, for vue3
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Un repositorio más con conceptos básicos, desafíos técnicos y recursos sobre ingeniería de datos en español 🧙✨
🦖 A SQL-on-everything Query Engine you can execute over multiple databases and file formats. Query your data, where it lives.