Trending repositories for topic data-pipeline

Last 3 days (new repositories)

No newly created repositories trended in the last 3 days.

Last 3 days (absolute gain)

datazip-inc/olake

Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Starting with MongoDB

286 (+53)

apache-2.0

airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

17,393 (+30)

adilkhash/Data-Engineering-HowTo

A list of useful resources to learn Data Engineering from scratch

3,686 (+8)

superlinked/superlinked

Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.

940 (+7)

apache-2.0

apache/flink-cdc

Flink CDC is a streaming data integration tool

5,979 (+6)

apache-2.0

bruin-data/ingestr

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

2,889 (+3)

mit

reugn/go-streams

A lightweight stream processing library for Go

1,993 (+3)

mit

remyxai/VQASynth

Compose multimodal datasets 🎹

294 (+2)

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

282 (+1)

apache/seatunnel-web

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

633 (+1)

apache-2.0

apache/shardingsphere

Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.

20,094 (+1)

apache-2.0

Last 3 days (relative gain)

datazip-inc/olake

Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Starting with MongoDB

286 (+23%)

apache-2.0

superlinked/superlinked

Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.

940 (+0.8%)

apache-2.0

remyxai/VQASynth

Compose multimodal datasets 🎹

294 (+0.7%)

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

282 (+0.4%)

adilkhash/Data-Engineering-HowTo

A list of useful resources to learn Data Engineering from scratch

3,686 (+0.2%)

airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

17,393 (+0.2%)

apache/seatunnel-web

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

633 (+0.2%)

apache-2.0

reugn/go-streams

A lightweight stream processing library for Go

1,993 (+0.2%)

mit

bruin-data/ingestr

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

2,889 (+0.1%)

mit

apache/flink-cdc

Flink CDC is a streaming data integration tool

5,979 (+0.1%)

apache-2.0

apache/shardingsphere

Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.

20,094 (+0.0%)

apache-2.0
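The relative-gain percentages in this listing appear to follow from the absolute gains: the gain divided by the star count at the start of the period (the current count minus the gain). This is an inference from the figures shown, not a documented formula. For example, datazip-inc/olake is listed at 286 (+53) and 286 (+23%). A minimal sketch of that arithmetic, where the function name `relative_gain_percent` is hypothetical:

```python
def relative_gain_percent(current_stars: int, absolute_gain: int) -> float:
    """Percentage growth relative to the star count at the start of the period.

    Assumption: the baseline is current_stars - absolute_gain, which matches
    the figures shown in the listing but is not confirmed by the source.
    """
    previous_stars = current_stars - absolute_gain
    if previous_stars <= 0:
        # A repository with no stars before the period has no baseline.
        raise ValueError("no positive baseline to compare against")
    return 100.0 * absolute_gain / previous_stars


# datazip-inc/olake, last 3 days: 286 stars, +53
print(round(relative_gain_percent(286, 53)))  # prints 23
```

This also explains why small repositories dominate the relative-gain rankings: the same number of new stars is a much larger share of a small baseline.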

Last week (new repositories)

No newly created repositories trended in the last week.

Last week (absolute gain)

datazip-inc/olake

Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Starting with MongoDB

286 (+114)

apache-2.0

airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

17,393 (+52)

adilkhash/Data-Engineering-HowTo

A list of useful resources to learn Data Engineering from scratch

3,686 (+23)

apache/flink-cdc

Flink CDC is a streaming data integration tool

5,979 (+16)

apache-2.0

superlinked/superlinked

Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.

940 (+13)

apache-2.0

apache/shardingsphere

Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.

20,094 (+13)

apache-2.0

apache/seatunnel-web

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

633 (+7)

apache-2.0

bruin-data/ingestr

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

2,889 (+7)

mit

reugn/go-streams

A lightweight stream processing library for Go

1,993 (+7)

mit

pracdata/awesome-open-source-data-engineering

A curated list of open source tools used in analytics platforms and data engineering ecosystem

268 (+6)

remyxai/VQASynth

Compose multimodal datasets 🎹

294 (+5)

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

282 (+4)

scribe-org/Scribe-Data

Wikidata and Wikipedia language data extraction

37 (+2)

gpl-3.0

digitalghost-dev/premier-league

A Data Engineering project. Repository for backend infrastructure and Streamlit app files for a Premier League Dashboard.

229 (+2)

elementary-data/elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

2,005 (+2)

apache-2.0

abeltavares/real-time-data-pipeline

📡 Real-time data pipeline with Kafka, Flink, Iceberg, Trino, MinIO, and Superset. Ideal for learning data systems.

39 (+1)

mit

behnamyazdan/ecommerce_realtime_data_pipeline

Ecommerce Realtime Data Pipeline (Data Modeling, Workflow Orchestration, Change Data Capture, Analytical Database and Dashboarding)

54 (+1)

minhadona/data_engineer_interview_challenges

Found a data engineering challenge or participated in a selection process? Share with us!

65 (+1)

InfuseAI/awesome-public-dbt-projects

A curated list of awesome public DBT projects

113 (+1)

cc0-1.0

dataflint/spark

Performance Observability for Apache Spark

229 (+1)

apache-2.0

Last week (relative gain)

datazip-inc/olake

Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Starting with MongoDB

286 (+66%)

apache-2.0

scribe-org/Scribe-Data

Wikidata and Wikipedia language data extraction

37 (+6%)

gpl-3.0

abeltavares/real-time-data-pipeline

📡 Real-time data pipeline with Kafka, Flink, Iceberg, Trino, MinIO, and Superset. Ideal for learning data systems.

39 (+3%)

mit

pracdata/awesome-open-source-data-engineering

A curated list of open source tools used in analytics platforms and data engineering ecosystem

268 (+2%)

behnamyazdan/ecommerce_realtime_data_pipeline

Ecommerce Realtime Data Pipeline (Data Modeling, Workflow Orchestration, Change Data Capture, Analytical Database and Dashboarding)

54 (+2%)

remyxai/VQASynth

Compose multimodal datasets 🎹

294 (+2%)

minhadona/data_engineer_interview_challenges

Found a data engineering challenge or participated in a selection process? Share with us!

65 (+2%)

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

282 (+1%)

superlinked/superlinked

Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.

940 (+1%)

apache-2.0

apache/seatunnel-web

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

633 (+1%)

apache-2.0

InfuseAI/awesome-public-dbt-projects

A curated list of awesome public DBT projects

113 (+0.9%)

cc0-1.0

digitalghost-dev/premier-league

A Data Engineering project. Repository for backend infrastructure and Streamlit app files for a Premier League Dashboard.

229 (+0.9%)

adilkhash/Data-Engineering-HowTo

A list of useful resources to learn Data Engineering from scratch

3,686 (+0.6%)

dataflint/spark

Performance Observability for Apache Spark

229 (+0.4%)

apache-2.0

airscholar/e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...

230 (+0.4%)

reugn/go-streams

A lightweight stream processing library for Go

1,993 (+0.4%)

mit

airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

17,393 (+0.3%)

apache/flink-cdc

Flink CDC is a streaming data integration tool

5,979 (+0.3%)

apache-2.0

bruin-data/ingestr

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

2,889 (+0.2%)

mit

ConduitIO/conduit

Conduit streams data between data stores. Kafka Connect replacement. No JVM required.

431 (+0.2%)

apache-2.0

Last month (new repositories)

No newly created repositories trended in the last month.

Last month (absolute gain)

airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

17,393 (+309)

datazip-inc/olake

Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Starting with MongoDB

286 (+162)

apache-2.0

pracdata/awesome-open-source-data-engineering

A curated list of open source tools used in analytics platforms and data engineering ecosystem

268 (+73)

apache/shardingsphere

Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.

20,094 (+70)

apache-2.0

adilkhash/Data-Engineering-HowTo

A list of useful resources to learn Data Engineering from scratch

3,686 (+58)

superlinked/superlinked

Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.

940 (+55)

apache-2.0

apache/flink-cdc

Flink CDC is a streaming data integration tool

5,979 (+50)

apache-2.0

bruin-data/ingestr

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

2,889 (+25)

mit

remyxai/VQASynth

Compose multimodal datasets 🎹

294 (+23)

elementary-data/elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

2,005 (+20)

apache-2.0

reugn/go-streams

A lightweight stream processing library for Go

1,993 (+19)

mit

AgnostiqHQ/covalent

Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments.

814 (+17)

apache-2.0

Multiwoven/multiwoven

🔥🔥🔥 Open source composable CDP - alternative to hightouch and census.

1,585 (+17)

agpl-3.0

ssp-data/practical-data-engineering

Practical Data Engineering: A Hands-On Real-Estate Project Guide

616 (+17)

apache/seatunnel-web

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

633 (+16)

apache-2.0

snowplow/snowplow

The leader in Next-Generation Customer Data Infrastructure

6,892 (+16)

apache-2.0

damklis/DataEngineeringProject

Example end to end data engineering project.

1,233 (+16)

mit

bytedance/bitsail

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every d...

1,653 (+12)

apache-2.0

sparkfish/augraphy

Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes

389 (+12)

mit

dataflint/spark

Performance Observability for Apache Spark

229 (+11)

apache-2.0

Last month (relative gain)

datazip-inc/olake

Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Starting with MongoDB

286 (+131%)

apache-2.0

pracdata/awesome-open-source-data-engineering

A curated list of open source tools used in analytics platforms and data engineering ecosystem

268 (+37%)

thanhENC/e2e-data-platform

End-to-end data platform: A PoC Data Platform project utilizing modern data stack (Spark, Airflow, DBT, Trino, Lightdash, Hive metastore, Minio, Postgres)

30 (+15%)

mit

scribe-org/Scribe-Data

Wikidata and Wikipedia language data extraction

37 (+12%)

gpl-3.0

awesome-mlops/awesome-data-management

A curated list of awesome open source tools and commercial products to catalog, version, and manage data 🚀

32 (+10%)

apache-2.0

airscholar/RedditDataEngineering

This project provides a comprehensive data pipeline solution to extract, transform, and load (ETL) Reddit data into a Redshift data warehouse. The pipeline leverages a combination of tools and service...

120 (+9%)

confluentinc/learn-kafka-courses

Learn the basics of Apache Kafka® from leaders in the Kafka community with these video courses covering the Kafka ecosystem and hands-on exercises.

25 (+9%)

remyxai/VQASynth

Compose multimodal datasets 🎹

294 (+8%)

montara-io/dbt-command-center

Never sift through endless dbt™ logs again. dbt Command Center is a free, open-source, local web application that provides a user-friendly interface to monitor and manage dbt runs.

28 (+8%)

mit

superlinked/superlinked

Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.

940 (+6%)

apache-2.0

abeltavares/real-time-data-pipeline

📡 Real-time data pipeline with Kafka, Flink, Iceberg, Trino, MinIO, and Superset. Ideal for learning data systems.

39 (+5%)

mit

dataflint/spark

Performance Observability for Apache Spark

229 (+5%)

apache-2.0

InfuseAI/awesome-public-dbt-projects

A curated list of awesome public DBT projects

113 (+5%)

cc0-1.0

starlake-ai/starlake

Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.

80 (+4%)

apache-2.0

behnamyazdan/ecommerce_realtime_data_pipeline

Ecommerce Realtime Data Pipeline (Data Modeling, Workflow Orchestration, Change Data Capture, Analytical Database and Dashboarding)

54 (+4%)

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

282 (+4%)

opensnowcat/opensnowcat-collector

OpenSnowcat Collector, an open source fork of Snowplow (Apache 2.0 License)

31 (+3%)

apache-2.0

sparkfish/augraphy

Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes

389 (+3%)

mit

ssp-data/practical-data-engineering

Practical Data Engineering: A Hands-On Real-Estate Project Guide

616 (+3%)

apache/seatunnel-web

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

633 (+3%)

apache-2.0

Last 12 months (new repositories)

datazip-inc/olake

Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Starting with MongoDB

286

apache-2.0

behnamyazdan/ecommerce_realtime_data_pipeline

Ecommerce Realtime Data Pipeline (Data Modeling, Workflow Orchestration, Change Data Capture, Analytical Database and Dashboarding)

abeltavares/real-time-data-pipeline

📡 Real-time data pipeline with Kafka, Flink, Iceberg, Trino, MinIO, and Superset. Ideal for learning data systems.

mit

thanhENC/e2e-data-platform

End-to-end data platform: A PoC Data Platform project utilizing modern data stack (Spark, Airflow, DBT, Trino, Lightdash, Hive metastore, Minio, Postgres)

mit

montara-io/dbt-command-center

Never sift through endless dbt™ logs again. dbt Command Center is a free, open-source, local web application that provides a user-friendly interface to monitor and manage dbt runs.

mit

Last 12 months (absolute gain)

airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

17,393 (+3,961)

bruin-data/ingestr

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

2,889 (+1,206)

mit

Multiwoven/multiwoven

🔥🔥🔥 Open source composable CDP - alternative to hightouch and census.

1,585 (+1,154)

agpl-3.0

apache/flink-cdc

Flink CDC is a streaming data integration tool

5,979 (+976)

apache-2.0

superlinked/superlinked

Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.

940 (+934)

apache-2.0

apache/shardingsphere

Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.

20,094 (+812)

apache-2.0

adilkhash/Data-Engineering-HowTo

A list of useful resources to learn Data Engineering from scratch

3,686 (+634)

ssp-data/practical-data-engineering

Practical Data Engineering: A Hands-On Real-Estate Project Guide

616 (+402)

elementary-data/elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

2,005 (+318)

apache-2.0

reugn/go-streams

A lightweight stream processing library for Go

1,993 (+308)

mit

datazip-inc/olake

Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Starting with MongoDB

286 (+284)

apache-2.0

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

282 (+278)

rudderlabs/rudder-server

Privacy and Security focused Segment-alternative, in Golang and React

4,139 (+268)

remyxai/VQASynth

Compose multimodal datasets 🎹

294 (+268)

damklis/DataEngineeringProject

Example end to end data engineering project.

1,233 (+252)

mit

apache/seatunnel-web

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

633 (+246)

apache-2.0

pracdata/awesome-open-source-data-engineering

A curated list of open source tools used in analytics platforms and data engineering ecosystem

268 (+242)

whylabs/whylogs

An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collect...

2,691 (+188)

apache-2.0

AgnostiqHQ/covalent

Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments.

814 (+181)

apache-2.0

snowplow/snowplow

The leader in Next-Generation Customer Data Infrastructure

6,892 (+180)

apache-2.0

Last 12 months (relative gain)

superlinked/superlinked

Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.

940 (+15,567%)

apache-2.0

josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

282 (+6,950%)

remyxai/VQASynth

Compose multimodal datasets 🎹

294 (+1,031%)

pracdata/awesome-open-source-data-engineering

A curated list of open source tools used in analytics platforms and data engineering ecosystem

268 (+931%)

montara-io/dbt-command-center

Never sift through endless dbt™ logs again. dbt Command Center is a free, open-source, local web application that provides a user-friendly interface to monitor and manage dbt runs.

28 (+600%)

mit

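airscholar/RedditDataEngineering

This project provides a comprehensive data pipeline solution to extract, transform, and load (ETL) Reddit data into a Redshift data warehouse. The pipeline leverages a combination of tools and service...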

120 (+532%)

Multiwoven/multiwoven

🔥🔥🔥 Open source composable CDP - alternative to hightouch and census.

1,585 (+268%)

agpl-3.0

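airscholar/e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...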

230 (+259%)

scribe-org/Scribe-Data

Wikidata and Wikipedia language data extraction

37 (+208%)

gpl-3.0

ssp-data/practical-data-engineering

Practical Data Engineering: A Hands-On Real-Estate Project Guide

616 (+188%)

InfuseAI/awesome-public-dbt-projects

A curated list of awesome public DBT projects

113 (+176%)

cc0-1.0

opensnowcat/opensnowcat-collector

OpenSnowcat Collector, an open source fork of Snowplow (Apache 2.0 License)

31 (+158%)

apache-2.0

starlake-ai/starlake

Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.

80 (+158%)

apache-2.0

jvalue/jayvee

Jayvee is a domain-specific language and runtime for automated processing of data pipelines

177 (+149%)

dataflint/spark

Performance Observability for Apache Spark

229 (+120%)

apache-2.0

digitalghost-dev/premier-league

A Data Engineering project. Repository for backend infrastructure and Streamlit app files for a Premier League Dashboard.

229 (+96%)

DataSQRL/sqrl

Flexible development framework for building streaming data applications in SQL with Kafka, Flink, Postgres, GraphQL, and more.

102 (+85%)

bruin-data/ingestr

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

2,889 (+72%)

mit

awesome-mlops/awesome-data-management

A curated list of awesome open source tools and commercial products to catalog, version, and manage data 🚀

32 (+68%)

apache-2.0

apache/seatunnel-web

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

633 (+64%)

apache-2.0