Trending repositories for topic big-data
ClickHouse® is a real-time analytics database management system
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for...
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache Spark - A unified analytics engine for large-scale data processing
ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
Seamless multi-primary syncing database with an intuitive HTTP/JSON API, designed for reliability
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
Arkime is an open source, large scale, full packet capturing, indexing, and database system.
Simple Windows desktop application for viewing & querying Apache Parquet files
CH-UI is a modern, feature-rich user interface for working with data in self-hosted ClickHouse databases. It offers an intuitive platfor...
Apache Paimon Rust: the Rust implementation of Apache Paimon.
LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive (AI) workloads.
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.
A @ClickHouse fork that supports high-performance vector search and full-text search.
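A data visualization showcase project built with Vue, three.js, and echarts, including features such as 3D model import and interaction and 3D model annotation.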
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
A distributed, fast open-source graph database featuring horizontal scalability and high availability
ParadeDB is a modern Elasticsearch alternative built on Postgres. Built for real-time, update-heavy workloads.
Bigtop Manager is a modern, AI-driven web application designed to simplify the complexity of big data cluster management.
ParquetSharp is a .NET library for reading and writing Apache Parquet files.
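A next-generation real-time computing foundation with claimed compute performance 100 times that of Flink/Spark. XL-LightHouse is a general-purpose streaming big-data statistics system that supports very large data volumes and very high concurrency (a standalone single-machine edition is also supported). Common use cases include PV and UV statistics; e-commerce sales and ordering-user counts; log volume statistics; API call counts, error counts, and latency statistics; and server operations monitoring. The system supports multi-dimensional statistics and complex conditional filtering and logical evaluation, with one-click deployment and one-line integration, making it easy to achieve full-chain business...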
This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.
An open source, standard data file format for graph data storage and retrieval.
📙 Awesome Data Catalogs and Observability Platforms.
A serverless architecture for orchestrating ETL jobs in arbitrarily-complex workflows using AWS Step Functions and AWS Lambda.
Wrangler Transform: A DMD system for transforming Big Data
PDF DataSource for Apache Spark: reads PDF files directly into a DataFrame and runs OCR on them.
Incremental view maintenance & query rewriting for materialized views in DataFusion
CortexBrain is an ambitious open source project aimed at creating an intelligent, lightweight, and efficient service mesh architecture to seamlessly connect cloud and edge devices
📡 Real-time data pipeline with Kafka, Flink, Iceberg, Trino, MinIO, and Superset. Ideal for learning data systems.
A Laravel package for seamless integration with Apache Solr, providing easy-to-use commands for core management and a fluent interface for Solr operations.
An advanced and mature open-source MPP (Massively Parallel Processing) database. Open-source alternative to Greenplum Database.
A curated list of awesome Online Analytical Processing databases, frameworks, resources and other awesomeness.
This repository contains an Apache Flink application for real-time sales analytics built using Docker Compose to orchestrate the necessary infrastructure components, including Apache Flink, Elasticsea...
🦖 A SQL-on-everything Query Engine you can execute over multiple databases and file formats. Query your data, where it lives.