Trending repositories for topic data-pipeline
Fastest open-source tool for replicating databases to Apache Iceberg or a data lakehouse. ⚡ Efficient, quick, and scalable data ingestion for real-time analytics, starting with MongoDB.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Available both self-hosted and cloud-hosted.
Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.
A list of useful resources to learn Data Engineering from scratch
ingestr is a CLI tool to seamlessly copy data between any databases with a single command.
Code for "Efficient Data Processing in Spark" Course
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.
A curated list of open source tools used in analytics platforms and the data engineering ecosystem
Code for "Efficient Data Processing in Spark" Course
A data engineering project: backend infrastructure and Streamlit app files for a Premier League dashboard.
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
📡 Real-time data pipeline with Kafka, Flink, Iceberg, Trino, MinIO, and Superset. Ideal for learning data systems (see the Kafka producer sketch after this list).
Ecommerce Realtime Data Pipeline (Data Modeling, Workflow Orchestration, Change Data Capture, Analytical Database and Dashboarding)
Found a data engineering challenge or participated in a selection process? Share with us!
A curated list of awesome public DBT projects
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components... (a minimal Airflow DAG sketch of this ingest-process-store pattern appears after this list).
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
Pythonic tool for orchestrating machine-learning/high-performance/quantum-computing workflows in heterogeneous compute environments.
🔥🔥🔥 Open-source composable CDP, an alternative to Hightouch and Census.
Practical Data Engineering: A Hands-On Real-Estate Project Guide
BitSail is a distributed, high-performance data integration engine which supports batch, streaming, and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of records every day.
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
End-to-end data platform: a proof-of-concept project built on a modern data stack (Spark, Airflow, DBT, Trino, Lightdash, Hive metastore, Minio, Postgres)
A curated list of awesome open source tools and commercial products to catalog, version, and manage data 🚀
This project provides a comprehensive data pipeline solution to extract, transform, and load (ETL) Reddit data into a Redshift data warehouse. The pipeline leverages a combination of tools and services...
Learn the basics of Apache Kafka® from leaders in the Kafka community with these video courses covering the Kafka ecosystem and hands-on exercises.
Declarative, text-based tool for data analysts and engineers to extract, load, transform, and orchestrate their data pipelines.
Never sift through endless dbt™ logs again. dbt Command Center is a free, open-source, local web application that provides a user-friendly interface to monitor and manage dbt runs.
Code for "Efficient Data Processing in Spark" Course
OpenSnowcat Collector, an open source fork of Snowplow (Apache 2.0 License)
Privacy- and security-focused Segment alternative, written in Golang and React
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection.
The leader in Next-Generation Customer Data Infrastructure
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
Flexible development framework for building streaming data applications in SQL with Kafka, Flink, Postgres, GraphQL, and more.
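Two of the entries above, the Kafka/Flink/Iceberg/Trino/MinIO/Superset learning pipeline and the Kafka course material, start from the same first hop: an application producing events onto a Kafka topic. Below is a minimal sketch of that hop using the confluent-kafka Python client; the broker address, the "orders" topic, and the event fields are illustrative assumptions, not code from either repository.

```python
# Sketch of the ingest hop of a real-time pipeline: producing JSON events to Kafka.
# Assumes a broker at localhost:9092 and a hypothetical "orders" topic.
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})


def on_delivery(err, msg):
    # Called once per message after the broker acknowledges (or rejects) it.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}")


for order_id in range(5):
    event = {"order_id": order_id, "amount_usd": 10.0 * order_id, "ts": time.time()}
    producer.produce(
        "orders",
        key=str(order_id),
        value=json.dumps(event),
        callback=on_delivery,
    )

producer.flush()  # block until all queued messages are delivered or fail
```

Downstream, a Flink job or a connector would read the same topic and land the events in Iceberg or another store; that part is out of scope here.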
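The Airflow-based projects above (the Airflow/Kafka/Spark/Cassandra end-to-end pipeline and the Reddit-to-Redshift ETL) all orchestrate the same ingest-process-store pattern as a DAG. The sketch below shows that pattern in outline only; the DAG id, schedule, and the three task callables are hypothetical placeholders rather than tasks taken from those repositories.

```python
# Minimal Airflow DAG sketch: ingest -> process -> store.
# All task callables are hypothetical placeholders, not code from the listed projects.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_events():
    print("pull raw records from an API or Kafka topic into staging storage")


def process_events():
    print("clean and transform the staged records, e.g. with Spark")


def store_events():
    print("write the curated records to Cassandra, Redshift, or a lakehouse table")


with DAG(
    dag_id="example_ingest_process_store",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_events)
    process = PythonOperator(task_id="process", python_callable=process_events)
    store = PythonOperator(task_id="store", python_callable=store_events)

    # Run the three stages strictly in order, once per day.
    ingest >> process >> store
```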
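For the Spark-centric entries (the "Efficient Data Processing in Spark" course code and the Spark/Airflow/DBT/Trino proof-of-concept platform), the processing step is typically a batch job that reads raw files, aggregates, and writes a curated table. A minimal PySpark sketch of such a job follows; the S3 paths and column names are invented for illustration.

```python
# Minimal PySpark batch job sketch: read raw events, aggregate per day, write a curated table.
# The input/output paths and column names are illustrative, not taken from the course code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_user_totals").getOrCreate()

# Read the raw event files from object storage.
events = spark.read.parquet("s3a://raw-bucket/events/")

# Aggregate spend and event counts per user per day.
daily_totals = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "user_id")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count("*").alias("event_count"),
    )
)

# Write the curated table, partitioned by day, overwriting previous runs.
(
    daily_totals.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3a://curated-bucket/daily_user_totals/")
)

spark.stop()
```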