Trending repositories for topic data-engineering
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
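The core idea behind orchestrators like Airflow is a DAG of tasks executed in dependency order. A minimal stdlib sketch of that idea (this is not Airflow's API; the three task names are made up for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical three-task pipeline: extract -> transform -> load.
# Only a sketch of DAG scheduling, not Airflow's actual API.
dag = {
    "transform": {"extract"},   # transform depends on extract
    "load": {"transform"},      # load depends on transform
}

# static_order() yields tasks so every dependency runs first.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load']
```

A real orchestrator adds scheduling, retries, and state tracking on top of this ordering.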
Apache Superset is a Data Visualization and Data Exploration Platform
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
Turns Data and AI algorithms into production-ready web applications in no time.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Business intelligence as code: build fast, interactive data visualizations in pure SQL and markdown
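"BI as code" means the query behind a chart is versioned plain SQL. A self-contained sketch with sqlite3 (table and column names are invented for illustration, not tied to any particular BI tool):

```python
import sqlite3

# The SQL that would drive a chart lives as code, not in a GUI.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 50.0)],
)

rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM orders "
    "GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)  # [('EMEA', 200.0), ('APAC', 50.0)]
```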
An orchestration platform for the development, production, and observation of data assets.
🧙 Build, run, and manage data pipelines for integrating and transforming data.
DataFlow Engine is a low-level data-driven engine for data integration, synchronization, exchange, and sharing, plus task configuration and scheduling. It uses separation of management and execution, multiple flow layers, and a plugin library to handle large-scale data tasks, high-frequency data reporting and collection, and heterogeneous-data compatibility.
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
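Continuous analytics of this kind typically means windowed aggregation over an event stream. A stdlib sketch of a tumbling-window count (event shape and window size are illustrative assumptions, not any engine's API):

```python
from collections import defaultdict

def tumbling_counts(events, window=10):
    """Count events per (window_start, key) over 10-second tumbling windows.

    events: iterable of (timestamp_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Bucket each event into the window that contains its timestamp.
        counts[(ts // window * window, key)] += 1
    return dict(counts)

events = [(1, "click"), (4, "click"), (12, "view"), (13, "click")]
print(tumbling_counts(events))
# {(0, 'click'): 2, (10, 'view'): 1, (10, 'click'): 1}
```

A streaming engine maintains these counts incrementally as events arrive instead of batching them.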
Always know what to expect from your data.
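"Knowing what to expect from your data" means declaring checks that each record must satisfy. A toy expectations check in plain Python, inspired by the concept (this is not the Great Expectations API; columns and rules are made up):

```python
# Declare what each column should look like, then validate rows.
expectations = {
    "id": lambda v: isinstance(v, int) and v > 0,
    "email": lambda v: isinstance(v, str) and "@" in v,
}

def validate(rows):
    """Return (row_index, column) for every failed expectation."""
    failures = []
    for i, row in enumerate(rows):
        for col, check in expectations.items():
            if not check(row.get(col)):
                failures.append((i, col))
    return failures

rows = [{"id": 1, "email": "a@b.com"}, {"id": -2, "email": "nope"}]
print(validate(rows))  # [(1, 'id'), (1, 'email')]
```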
Distributed data engine for Python/SQL designed for the cloud, powered by Rust

🥪🦘 An open source sandbox project exploring dbt workflows via a fictional sandwich shop's data.
A curated list of awesome blogs, videos, tools and resources about Data Contracts
A collection of Airflow operators, hooks, and utilities to elevate dbt to a first-class citizen of Airflow.
Coursera Specialization: Machine Learning and Data Analysis (Yandex & MIPT)
Pipeline that extracts data from Crinacle's Headphone and InEarMonitor databases and finalizes data for a Metabase Dashboard.
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Prod...
The best place to learn data engineering. Built and maintained by the data engineering community.
Code for "Efficient Data Processing in Spark" Course
Practical Data Engineering: A Hands-On Real-Estate Project Guide
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
lakeFS - Data version control for your data lake | Git for data
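The "Git for data" idea rests on content addressing: a commit is identified by the hash of its contents, so identical data yields an identical id. A stdlib sketch of that core mechanism (illustrative only, not lakeFS's model or API):

```python
import hashlib
import json

def commit_id(files: dict) -> str:
    """Derive a short content-addressed id from a {path: contents} snapshot."""
    blob = json.dumps(files, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

v1 = commit_id({"users.csv": "id,name\n1,ada"})
v2 = commit_id({"users.csv": "id,name\n1,ada"})          # same data, same id
v3 = commit_id({"users.csv": "id,name\n1,ada\n2,bob"})   # changed data, new id
print(v1 == v2, v1 == v3)  # True False
```

Branching and diffing then become pointer operations over these ids rather than copies of the data.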
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
Collection of Snowflake Notebook demos, tutorials, and examples
Declarative, text-based tool for data analysts and engineers to extract, load, transform, and orchestrate their data pipelines.
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
Toolbox for building Generative AI applications on top of Apache Spark.
A curated list of open source tools used in analytics platforms and the data engineering ecosystem
Construct a modern data stack and orchestrate the workflows to create high-quality data for analytics and ML applications.
Dataplane is an Airflow-inspired unified data platform with additional data mesh and RPA capability to automate, schedule, and design data pipelines and workflows. Dataplane is written in Golang with a...
🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack - Reverse ETL & Data Activation
Materials for the Deploy and Monitor ML Pipelines with Python, Docker and GitHub Actions workshop at the PyData NYC 2024 conference
Data Engineering Project with Hadoop HDFS and Kafka
This repo contains "Databricks Certified Data Engineer Professional" Questions and related docs.
This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.
A robust (🐢) and fast (🐇) MLOps tool for managing data and pipelines in Rust (🦀)
This repo contains "Databricks Certified Data Engineer Associate" Questions and related docs.
A Python package that creates fine-grained dbt tasks on Apache Airflow
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
Yet another repository with basic concepts, technical challenges, and resources on data engineering, in Spanish 🧙✨
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
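Building "complex pipelines from simple, reusable components" usually means composing small functions in sequence. A stdlib sketch of the pattern (the step names are invented for illustration, not any framework's API):

```python
from functools import reduce

# Two small, reusable steps...
def strip_whitespace(rows):
    return [r.strip() for r in rows]

def drop_empty(rows):
    return [r for r in rows if r]

# ...and a pipeline that is just an ordered list of steps.
def pipeline(steps, data):
    return reduce(lambda acc, step: step(acc), steps, data)

result = pipeline([strip_whitespace, drop_empty], ["  a ", "", " b"])
print(result)  # ['a', 'b']
```

Because each step is an ordinary function, steps can be unit-tested and recombined independently.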
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility acr...
End to end data engineering project with kafka, airflow, spark, postgres and docker.
Code for blog at https://www.startdataengineering.com/post/python-for-de/
UnitGen is a code fine-tuning data framework that generates fine-tuning data directly from your existing codebase: code completion, test generation, documentation generation, and more.
A portable Datamart and Business Intelligence suite built with Docker, Mage, dbt, DuckDB and Superset
DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution by data profiling, new dataset...
DataOps Observability is part of DataKitchen's Open Source Data Observability. DataOps Observability monitors every data journey from data source to customer value, from any team development environm...
A DuckDB-powered command line interface for Snowflake security, governance, operations, and cost optimization.
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also allows running data-cleaning scenarios using these algorithms.
breadroll 🥟 is a simple, lightweight library for data processing operations, written in TypeScript and powered by Bun.
Titan Core - Snowflake infrastructure-as-code. Provision environments, automate deploys, CI/CD. Manage RBAC, users, roles, and data access. Declarative Python Resource API. Change Management tool for ...
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
A portable Datamart and Business Intelligence suite built with Docker, Dagster, dbt, DuckDB, PostgreSQL and Superset