Trending repositories for topic data-pipeline
ingestr is a CLI tool to copy data between any databases with a single command seamlessly.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
A compute framework for building Search, RAG, Recommendations and Analytics over complex structured & unstructured data.
Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collect...
This project provides a comprehensive data pipeline solution to extract, transform, and load (ETL) Reddit data into a Redshift data warehouse. The pipeline leverages a combination of tools and service...
A list of useful resources to learn Data Engineering from scratch
Privacy and Security focused Segment-alternative, in Golang and React
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
A curated list of open source tools used in analytics platforms and data engineering ecosystem
Code for "Efficient Data Processing in Spark" Course
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack - Reverse ETL & Data Activation
Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Starting with MongoDB
A Data Engineering project. Repository for backend infrastructure and Streamlit app files for a Premier League Dashboard.
Practical Data Engineering: A Hands-On Real-Estate Project Guide
Declarative text-based tool for data analysts and engineers to extract, load, transform, and orchestrate their data pipelines.
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
🐳 Tool to automate data quality checks on data pipelines
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
Learn the basics of Apache Kafka® from leaders in the Kafka community with these video courses covering the Kafka ecosystem and hands-on exercises.
OpenSnowcat Collector, an open source fork of Snowplow (Apache 2.0 License)
A curated list of awesome public DBT projects
SQLpipe makes it easy to move the result of one query from one database to another.
Ecommerce Realtime Data Pipeline (Data Modeling, Workflow Orchestration, Change Data Capture, Analytical Database and Dashboarding)
Memphis.dev is a highly scalable and effortless data streaming platform
The leader in Next-Generation Customer Data Infrastructure
Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments.
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
Flexible development framework for building streaming data applications in SQL with Kafka, Flink, Postgres, GraphQL, and more.
📺 Instill Console for 🔮 Instill Core: https://github.com/instill-ai/instill-core
Data Engineering - Metropolitan Transportation Authority (MTA) Subway Data Analysis
Ordered-concurrently is a library for concurrent processing with ordered output in Go. It processes work concurrently and returns output in a channel in the order of input. It is useful in concurrently proces...