Trending repositories for topic data-engineering
Apache Superset is a Data Visualization and Data Exploration Platform
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
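The data flow engine is a low-level, data-driven engine for data integration, data synchronization, data exchange, data sharing, task configuration, and task scheduling. It uses an architecture built on separated management and execution, multiple flow layers, and a plugin library to handle practical data problems: large-scale data tasks, high-frequency data reporting, high-frequency data collection, and heterogeneous data compatibility.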
An orchestration platform for the development, production, and observation of data assets.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
An Awesome List of Open-Source Data Engineering Projects
Always know what to expect from your data.
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
🧙 Build, run, and manage data pipelines for integrating and transforming data.
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
📡 Real-time data pipeline with Kafka, Flink, Iceberg, Trino, MinIO, and Superset. Ideal for learning data systems.
A list of useful resources to learn Data Engineering from scratch
The best place to learn data engineering. Built and maintained by the data engineering community.
Declarative, text-based tool for data analysts and engineers to extract, load, transform, and orchestrate their data pipelines.
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
🥪🦘 An open source sandbox project exploring dbt workflows via a fictional sandwich shop's data.
The data-validation toolkit for enhanced dbt (data build tool) PR review
More than 2,000 data engineer interview questions.
A curated list of open source tools used in analytics platforms and data engineering ecosystem
Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.
The Lakehouse Engine is a configuration-driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Prod...
Implementing best practices for PySpark ETL jobs and applications.
🎨 UI for the Free Data Engineering Zoomcamp Course provided by DataTalksClub
Turns Data and AI algorithms into production-ready web applications in no time.
Data Engineering Project with Hadoop HDFS and Kafka
A robust (🐢) and fast (🐇) MLOps tool for managing data and pipelines in Rust (🦀)
My Digital Palace - A Personal Journal for Reflection - A place to store all my thoughts
This repo contains "Databricks Certified Data Engineer Professional" Questions and related docs.
This repo contains "Databricks Certified Data Engineer Associate" Questions and related docs.
Home of the Open Data Contract Standard (ODCS).
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
Dataform tools - a VS Code extension to run and visualise Dataform data pipelines
This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.
A curated collection of AI, data engineering, and DevOps projects featuring real-world applications, advanced techniques, and tutorials—ideal for learners and practitioners exploring data science and ...
Unified MySQL, Postgres & FlightSQL Server, Powered by DuckDB.
DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution by data profiling, new dataset...
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
Collection of Snowflake Notebook demos, tutorials, and examples
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility acr...
Materials for the Deploy and Monitor ML Pipelines with Python, Docker and GitHub Actions workshop at the PyData NYC 2024 conference
Code for blog at https://www.startdataengineering.com/post/python-for-de/
A portable Datamart and Business Intelligence suite built with Docker, Mage, dbt, DuckDB and Superset
DataOps Observability is part of DataKitchen's Open Source Data Observability. DataOps Observability monitors every data journey from data source to customer value, from any team development environm...
A DuckDB-powered command line interface for Snowflake security, governance, operations, and cost optimization.
Arcane Insight is a data analytics project designed to harness the power of SQLMesh & DuckDB to collect, transform, and analyze data from Blizzard’s Hearthstone API. Focused on card statistics and att...
🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack - Reverse ETL & Data Activation
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Code for "Efficient Data Processing in Spark" Course
This repository is to show my Data Analytics & Engineering skills, share projects, and track my progress.
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also allows running data-cleaning scenarios using these algor...
breadroll 🥟 is a simple, lightweight library for data processing operations, written in TypeScript and powered by Bun.
A Python package that creates fine-grained dbt tasks on Apache Airflow