Trending repositories for topic data-pipeline
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Available both self-hosted and cloud-hosted.
Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
A compute framework for building Search, RAG, Recommendations and Analytics over complex (structured+unstructured) data, with ultra-modal vector embeddings.
ingestr is a CLI tool that seamlessly copies data between databases with a single command.
Memphis.dev is a highly scalable and effortless data streaming platform
Practical Data Engineering: A Hands-On Real-Estate Project Guide
A list of useful resources to learn Data Engineering from scratch
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
A curated list of open source tools used in analytical stacks and the data engineering ecosystem
🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack - Reverse ETL & Data Activation
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every d...
Privacy- and security-focused Segment alternative, written in Golang and React
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as a self-hosted or cloud service with premium features.
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collect...
Pythonic tool for orchestrating machine-learning, high-performance, and quantum-computing workflows in heterogeneous compute environments.
A fully incremental model that transforms raw web event data generated by the Snowplow JavaScript tracker into a series of derived tables of varying levels of aggregation.
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
A Data Engineering project. Repository for backend infrastructure and Streamlit app files for a Premier League Dashboard.
Code for "Efficient Data Processing in Spark" Course
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
Learn the basics of Apache Kafka® from leaders in the Kafka community with these video courses covering the Kafka ecosystem and hands-on exercises.
Declarative, text-based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
📺 Instill Console for 🔮 Instill Core: https://github.com/instill-ai/instill-core
This project provides a comprehensive data pipeline solution to extract, transform, and load (ETL) Reddit data into a Redshift data warehouse. The pipeline leverages a combination of tools and service...
Flexible development framework for building streaming data applications in SQL with Kafka, Flink, Postgres, GraphQL, and more.
Found a data engineering challenge or participated in a selection process? Share with us!
Ecommerce Realtime Data Pipeline (Data Modeling, Workflow Orchestration, Change Data Capture, Analytical Database and Dashboarding)
The leader in Next-Generation Customer Data Infrastructure
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
A curated list of awesome public dbt projects
Data Engineering - Metropolitan Transportation Authority (MTA) Subway Data Analysis
SQLpipe makes it easy to move the result of one query from one database to another.